feat(lifecycle): kill path, set_timeout/refresh, and stable endAt#624
feat(lifecycle): kill path, set_timeout/refresh, and stable endAt#624chengjoey wants to merge 1 commit into
Conversation
| destroyReq.Annotations = make(map[string]string) | ||
| destroyReq.Annotations[constants.CubeAnnotationsInsType] = req.InstanceType | ||
| } | ||
| if destroyReq.Annotations == nil { |
There was a problem hiding this comment.
Dead nil-check: this second destroyReq.Annotations == nil guard can never be reached.
If req.Annotations was nil, the first block (L43-46) already initializes the map. If it was non-nil, the struct literal at L40 already copied the non-nil map. Harmless but should be cleaned up.
| return endAt | ||
| } | ||
| } | ||
| return time.Now().UnixMilli() + int64(timeoutSeconds)*1000 |
There was a problem hiding this comment.
Critical: refreshTimeoutMeta can report success without persisting state.
When getTimeoutProvider() returns nil, or RefreshTimeout fails with an error, this falls through to computing and returning a synthetic endAt (now + timeout*1000). The caller returns HTTP 200 with this plausible end_at in the response — but the lifecycle meta was not updated in Redis. The sweeper never sees the new timeout, and the sandbox is killed/paused at the wrong time.
Either propagate the error to the caller (return an error response code) or write the fallback value to Redis.
| } | ||
| endAt, err := p.LookupEndAt(ctx, sandboxID) | ||
| if err != nil { | ||
| return 0 |
There was a problem hiding this comment.
LookupSandboxEndAt silently swallows provider errors.
When TimeoutProvider.LookupEndAt returns an error (e.g. transient Redis timeout), this returns 0 with no logging. Since this is called for every sandbox in info/list responses, a transient Redis blip silently makes all sandboxes show end_at=0 with no observable signal. Please log at warn level before returning 0.
| default: | ||
| _ = s.o.Redis.ClearState(ctx, sid) | ||
| _ = s.o.ProxyPush.SetState(ctx, sid, "running") | ||
| return errors.New("cubemaster kill: " + killErr.Error()) |
There was a problem hiding this comment.
Rollback failure can permanently deny traffic. Both ClearState and SetState("running") errors are silently discarded with _ =. If either fails (transient Redis hiccup, proxy channel timeout), the sandbox's state remains as "killing" in the nginx dict — and the Lua gate returns 410 Gone permanently for every request until operator intervention.
Consider retry-with-backoff for rollback operations, or set a TTL on proxy shared-dict state entries so stale "killing" self-heals.
chenhengqi
left a comment
There was a problem hiding this comment.
Please update your commit message and PR summary. I don't have a whole picture what this PR trying to achieve.
I presume we don't have to touch CubeProxy for killing sandboxes.
0970025 to
cb7473d
Compare
done |
| return 0 | ||
| } | ||
| endAt, err := p.LookupEndAt(ctx, sandboxID) | ||
| if err != nil { |
There was a problem hiding this comment.
Error silently swallowed. LookupEndAt already surfaces the error but LookupSandboxEndAt discards it without logging. If Redis is down, sandbox list responses return end_at=0 with zero operator visibility. Compare with refreshTimeoutMeta (sandbox_timeout.go:126–128) which correctly logs the error.
Recommendation: Add log.G(ctx).Warnf(...) before return 0 on this line.
| rsp.Ret.RetMsg = "should provide sandboxID" | ||
| return | ||
| } | ||
| if req.Timeout <= 0 { |
There was a problem hiding this comment.
No upper bound on timeout accepts indefinite sandbox lifetime. An int32 can hold ~2.1 billion seconds (~68 years). A caller that sends set_timeout(2147483647) can permanently anchor a sandbox, defeating the resource-reclamation purpose of the timeout feature. The creation path defaults to 60s, but this endpoint accepts any positive value.
Recommendation: Define a configurable MaxTimeoutSeconds and validate against it:
const maxTimeoutSeconds = 86400 // 24h, or from config
if req.Timeout <= 0 || req.Timeout > maxTimeoutSeconds {
rsp.Ret.RetCode = int(errorcode.ErrorCode_MasterParamsError)
rsp.Ret.RetMsg = fmt.Sprintf("timeout must be between 1 and %d seconds", maxTimeoutSeconds)
return
}| // doesn't carry: AutoPause / AutoResume / TemplateID / HostID / HostIP / | ||
| // InstanceType), rewrites CreatedAt + TimeoutSeconds + EndAt, and publishes | ||
| // an OpUpdate event so every sidecar replica converges on the new view. | ||
| func (p *storeTimeoutProvider) RefreshTimeout(ctx context.Context, sandboxID string, timeoutSeconds int) (int64, error) { |
There was a problem hiding this comment.
Read-modify-write race. Two concurrent set_timeout calls for the same sandbox can interleave: both read the meta, both modify it in memory, then both HSET — the second write silently overwrites the first. The first caller receives a success response with their expected end_at, but the second caller's value is what's persisted. The reverse interleaving (shorter timeout winning) can cause premature sandbox death.
Recommendation: Use a Redis Lua script or WATCH/MULTI/EXEC to atomically read, modify, and write the meta. A Lua EVALSHA would also collapse the three round-trips (HGET → modify → HSET + XADD) into one, which would also be a nice performance win.
| req.InstanceType = cubebox.InstanceType_cubebox.String() | ||
| } | ||
|
|
||
| if !sandboxExists(ctx, req.SandboxID) { |
There was a problem hiding this comment.
TOCTOU race between existence check and meta write. sandboxExists checks local caches; then refreshTimeoutMeta calls LoadMeta on Redis. If the sandbox is concurrently deleted between these two calls: (1) sandboxExists returns true (cache not yet invalidated), (2) LoadMeta returns nil (meta was HDEL'd), (3) RefreshTimeout returns 0, nil, then (4) refreshTimeoutMeta falls through to the local fallback on line 132, producing a fake end_at. The client gets HTTP 200 with end_at that was never persisted. The caller thinks they set a timeout, but the sandbox is already gone.
Recommendation: Eliminate the separate sandboxExists check and rely on LoadMeta returning nil as the authoritative "not found" signal. Propagate that to the HTTP layer as a 404 rather than returning a fake end_at.
| one.TemplateID = templateID | ||
| one.Annotations = buildAnnotationsFromLabels(sandboxLabels) | ||
| one.Labels = sandboxLabels | ||
| one.EndAt = LookupSandboxEndAt(ctx, sandbox.GetId()) |
There was a problem hiding this comment.
N+1 Redis round trips per sandbox list page. Each sandbox in the response issues a separate HGET + json.Unmarshal to Redis. With 50 sandboxes per page and callers like monitoring polling making repeated list calls, this adds up fast.
Recommendation: Batch with a single HMGET for all sandbox IDs in the page, then build a map[sandboxID]endAt locally in one pass.
| return 0 | ||
| } | ||
| endAt, err := p.LookupEndAt(ctx, sandboxID) | ||
| if err != nil { |
There was a problem hiding this comment.
Error silently swallowed. LookupEndAt already surfaces the error but LookupSandboxEndAt discards it without logging. If Redis is down, sandbox list responses return end_at=0 with zero operator visibility. Compare with refreshTimeoutMeta (sandbox_timeout.go:126-128) which correctly logs the error.
Recommendation: Add log.G(ctx).Warnf(...) before return 0 on this line.
| rsp.Ret.RetMsg = "should provide sandboxID" | ||
| return | ||
| } | ||
| if req.Timeout <= 0 { |
There was a problem hiding this comment.
No upper bound on timeout. An int32 can hold ~2.1 billion seconds (~68 years). A caller that sends set_timeout(2147483647) can indefinitely anchor a sandbox, defeating the resource-reclamation purpose of the timeout feature. The creation path has a default (60s), but this endpoint accepts any positive value.
Recommendation: Define a configurable MaxTimeoutSeconds and validate against it, e.g.:
const maxTimeoutSeconds = 86400 // 24h
if req.Timeout <= 0 || req.Timeout > maxTimeoutSeconds {
// return params error
}| // doesn't carry: AutoPause / AutoResume / TemplateID / HostID / HostIP / | ||
| // InstanceType), rewrites CreatedAt + TimeoutSeconds + EndAt, and publishes | ||
| // an OpUpdate event so every sidecar replica converges on the new view. | ||
| func (p *storeTimeoutProvider) RefreshTimeout(ctx context.Context, sandboxID string, timeoutSeconds int) (int64, error) { |
There was a problem hiding this comment.
Classic read-modify-write race. Two concurrent set_timeout calls for the same sandbox will interleave: both read the meta, both modify it in memory, then both HSET — the second write silently overwrites the first. The first caller receives a success response with their expected end_at, but the second caller's value is what's persisted. The reverse interleaving (shorter timeout winning) can cause premature sandbox death.
Recommendation: Use a Redis Lua script or WATCH/MULTI/EXEC to atomically read, modify, and write the meta. A Lua EVALSHA would also collapse the three round-trips (HGET → modify → HSET + XADD) into one.
| req.InstanceType = cubebox.InstanceType_cubebox.String() | ||
| } | ||
|
|
||
| if !sandboxExists(ctx, req.SandboxID) { |
There was a problem hiding this comment.
TOCTOU race between existence check and meta write. sandboxExists checks local caches, then refreshTimeoutMeta calls LoadMeta on Redis. If the sandbox is concurrently deleted between these two calls: (1) sandboxExists returns true, (2) LoadMeta returns nil (meta was just HDEL'd), (3) RefreshTimeout returns 0, nil, (4) refreshTimeoutMeta falls through to the local fallback (line 132), (5) the client gets HTTP 200 with a locally-computed end_at that was never persisted. The caller believes the timeout was set, but the sandbox is already gone.
Recommendation: Eliminate the separate existence check and rely on LoadMeta returning nil as the authoritative "not found" signal. Propagate that to the HTTP layer as a proper 404 response rather than returning a fake end_at.
| one.TemplateID = templateID | ||
| one.Annotations = buildAnnotationsFromLabels(sandboxLabels) | ||
| one.Labels = sandboxLabels | ||
| one.EndAt = LookupSandboxEndAt(ctx, sandbox.GetId()) |
There was a problem hiding this comment.
N+1 Redis round trips per sandbox list page. Each sandbox in the list response issues a separate HGET + json.Unmarshal to Redis. With 50 sandboxes per page (and paginated callers like monitoring polling), this is 50 sequential Redis round-trips per list API call.
Recommendation: Batch with HMGET (or HGETALL) in a single round-trip for all sandbox IDs in the page, then build a map[sandboxID]endAt locally.
cb7473d to
87b0ad1
Compare
|
|
||
| now := time.Now().UnixMilli() | ||
| meta.TimeoutSeconds = timeoutSeconds | ||
| meta.CreatedAt = now |
There was a problem hiding this comment.
Semantic overload: meta.CreatedAt = now repurposes the creation timestamp as a refresh timestamp. After a set_timeout/refresh call, any code reading CreatedAt to answer "when was this sandbox created?" (billing, audit, diagnostics) gets the wrong answer.
Consider adding a dedicated LastRefreshAt field to SandboxLifecycleMeta instead, and having the sweeper's baseline prefer max(LastActiveMs, LastRefreshAt, CreatedAt).
Ref: code-quality-review, performance-review
|
|
||
| // SetTimeout implements POST /cube/sandbox/timeout. It is the master-side | ||
| // counterpart of CubeAPI's SetTimeoutRequest -> SetTimeoutResponse. | ||
| func SetTimeout(ctx context.Context, req *types.SetTimeoutRequest) (rsp *types.SetTimeoutRes) { |
There was a problem hiding this comment.
~40 lines of cloned code between SetTimeout and Refresh (compare L21-64 with L68-111). The only differences are the request type field name (Timeout vs Duration) and the log prefix. Extract a shared handler to reduce duplication and ensure future validation/logging changes apply to both uniformly.
| rsp.Ret.RetMsg = "should provide sandboxID" | ||
| return | ||
| } | ||
| if req.Timeout <= 0 { |
There was a problem hiding this comment.
No upper bound on Timeout/Duration: a caller could set timeout: 2147483647 (max int32), extending a sandbox's life ~68 years. Add a configurable maximum (e.g. 86400s = 24h) and reject values above it. Ref: security-review.
| return | ||
| } | ||
|
|
||
| func sandboxExists(ctx context.Context, sandboxID string) bool { |
There was a problem hiding this comment.
TOCTOU race: sandboxExists checks a stale local cache (GetSandboxCache is periodically wiped) and a Redis proxy-map key — neither of which are atomic with the subsequent refreshTimeoutMeta. A sandbox could be deleted between the check and the meta update.
Recommendation: let refreshTimeoutMeta / LoadMeta be the authoritative existence check and report a "not found" error when the meta is nil, eliminating this separate existence check. Ref: security-review, test-coverage-review.
| case string: | ||
| raw = []byte(x) | ||
| default: | ||
| log.G(ctx).Warnf("lifecycle: HGET %s %s unexpected type %T", MetaKey, sandboxID, v) |
There was a problem hiding this comment.
Silent nil on type mismatch: The default branch logs a warning and returns (nil, nil) for unexpected HGET return types. The caller cannot distinguish "entry not found" from "type mismatch error." If a future redis library upgrade returns int64, the entry is silently treated as missing. Return an error here instead. Ref: code-quality-review.
| } | ||
| reason := req.KillReason | ||
| if reason == "" { | ||
| reason = "request" |
There was a problem hiding this comment.
No kill_reason validation: The KillReason is accepted as an arbitrary string from the request body. An attacker who reaches the DELETE endpoint can inject "timeout" or "orphaned" or any custom value into cube.master.instance.kill_reason, corrupting audit/billing trails. Add an allowlist validation (request|timeout|orphaned). Ref: security-review.
| @@ -107,9 +109,6 @@ func (s *Sweeper) sweepOnce(ctx context.Context) { | |||
| withinWarmup := now.Sub(s.o.StartedAt) < s.o.BootstrapWarmup | |||
|
|
|||
| for _, e := range s.o.Registry.Snapshot() { | |||
There was a problem hiding this comment.
Sweeper now iterates ALL sandboxes every tick, not just auto-pause ones (the if !e.Meta.AutoPause { continue } guard was removed at the old L109). This means the sweeper does a Redis GET per non-auto-pause sandbox per sweep interval. At 10k sandboxes with a 1s interval, that's ~10k GET/s sustained. Consider whether the sweep interval should be adaptive, or whether a separate kill-sweeper could run at a lower frequency. Ref: performance-review.
| // KillReasonOrphaned is reserved for future use: a sandbox observed on a | ||
| // node but missing from the registry / Redis source-of-truth. Mirrors | ||
| // e2b's orphan reaper. | ||
| KillReasonOrphaned = "orphaned" |
There was a problem hiding this comment.
Dead code: KillReasonOrphaned is declared but never referenced. Remove it and add back when the orphan reaper is implemented.
| } | ||
| } | ||
|
|
||
| if err := s.o.Redis.SetState(ctx, sid, "killed", s.o.StateLockTTL); err != nil { |
There was a problem hiding this comment.
Terminal state TTL causes re-kill cycle: SetState for "killed" uses StateLockTTL (60s). After TTL expiry, the sweeper re-detects the sandbox as idle, finds no state, re-enters tryKill, and issues a redundant Kill RPC. The same pattern exists for "paused" in tryPause (L262). Terminal states should use a persistent key (no TTL) or a much longer TTL (hours). Transient states (pausing, killing) keep a short TTL. Ref: performance-review.
| zap.String("sandbox_id", sid), | ||
| zap.Int("ret_code", apiErr.RetCode), | ||
| zap.String("ret_msg", apiErr.RetMsg)) | ||
| case errors.As(killErr, &apiErr) && apiErr.IsAlreadyInState(): |
There was a problem hiding this comment.
Already-in-state path untested: tryKill's IsAlreadyInState() case falls through to success bookkeeping (writes "killed", deletes meta, evicts registry) but has no test. The pause equivalent IS tested (TestSweeper_AlreadyPausedReconcilesAsSuccess). Add the kill-side counterpart. Ref: test-coverage-review.
| zap.String("sandbox_id", sid), | ||
| zap.Int("ret_code", apiErr.RetCode)) | ||
| default: | ||
| _ = s.o.Redis.ClearState(ctx, sid) |
There was a problem hiding this comment.
Kill rollback resets to "running" — enables oscillation: When CubeMaster.Kill() returns a non-terminal error (neither IsNotFound nor IsAlreadyInState), ClearState + SetState("running") re-opens the sandbox to traffic. If the Kill RPC partially succeeded before erroring (timeout after submitting the destroy task), the cycle repeats every sweep. Consider a backoff mechanism instead of immediately clearing state. Ref: security-review.
|
Our semantics differ slightly from E2B. E2B timeout starts from sandbox startup, while ours starts from no request. In the automatic pause scenario, I think idle timeout is more reasonable, but from the perspective of how long an E2B sandbox can survive, should we follow our auto_puase semantics? |
I think keep idle-based semantics is better:
What I think we should adopt from e2b is something orthogonal: a |
| - 通过 HTTP 直连沙箱内的服务(例如 `getHost()` 返回的 URL)。 | ||
|
|
||
| 未配置 `auto_pause` / 不传 `lifecycle` 的沙箱完全保留旧行为:空闲超时直接销毁。 | ||
| 未配置 `auto_pause` / 不传 `lifecycle` 的沙箱默认行为是 `on_timeout="kill"`:空闲超过 `timeout` 秒后,平台(CubeProxy sidecar 的 sweeper)会主动销毁该沙箱并广播 `killing` → `killed` 状态,CubeProxy 会对后续请求立即返回 410 Gone。这与 e2b `lifecycle.on_timeout="kill"` 语义一致。如果完全不希望被自动回收,请在创建时把 `timeout` 设得足够大、或主动在客户端发心跳调用刷新 idle 计时。 |
There was a problem hiding this comment.
These user-facing documents only need to describe the behavior and do not need to expose implementation details(Cubeproxy etc).
| rsp = handleSandboxLogsAction(w, r, rt) | ||
| case r.URL.Path == actionURI(SandboxUpdateAction): | ||
| rsp = handleUpdateAction(w, r, rt) | ||
| case r.URL.Path == actionURI(SandboxTimeoutAction): |
There was a problem hiding this comment.
We should employ some web framework here. cc @fslongjin
There was a problem hiding this comment.
I'll refactor these codes later.
| zap.String("sandbox_id", e.Meta.SandboxID), | ||
| zap.Duration("idle_for", idleFor), | ||
| zap.Int("timeout_seconds", e.Meta.TimeoutSeconds)) | ||
| if e.Meta.AutoPause { |
There was a problem hiding this comment.
Should we use a switch here?
| } | ||
| }() | ||
|
|
||
| if req.SandboxID == "" { |
There was a problem hiding this comment.
I think this kind of check should reside in CubeMaster/pkg/service/httpservice/cube/sandbox_timeout.go.
There was a problem hiding this comment.
Here, I did not choose to move it to cube/sandbox_timeout.go because I observed that in sandbox/sandbox_*.go, the logic is used as a fallback, while httpservice/cube only performs basic validations at the HTTP layer, just like sandbox_update.
CubeSandbox/CubeMaster/pkg/service/sandbox/sandbox_update.go
Lines 39 to 48 in 08e5309
| return | ||
| } | ||
| if req.InstanceType == "" { | ||
| req.InstanceType = cubebox.InstanceType_cubebox.String() |
There was a problem hiding this comment.
Same here. The caller already set default value?
chenhengqi
left a comment
There was a problem hiding this comment.
I left some comments, but this PR looks good overall.
|
One more request, please add an example for this feature. Thanks. |
87b0ad1 to
014361a
Compare
done, |
…end_at
Implement the timeout-then-kill side of e2b's lifecycle.on_timeout semantics,
expose the missing set_timeout/refresh endpoints on CubeMaster, and surface
a deterministic end_at across the whole stack (CubeMaster → sidecar →
CubeProxy → CubeAPI/SDK).
CubeMaster:
- Add POST /cube/sandbox/timeout (SetTimeout) and POST /cube/sandbox/refresh
(Refresh) handlers; both rebase CreatedAt and overwrite TimeoutSeconds /
EndAt on the lifecycle meta, then publish an OpUpdate event.
- Persist EndAt = CreatedAt + TimeoutSeconds*1000 in SandboxLifecycleMeta
at create time and expose it on /sandbox/info and /sandbox/list responses
(SandboxData.EndAt / SandboxBriefData.EndAt).
- Introduce a sandbox.TimeoutProvider seam wired by lifecycle.Init so the
HTTP service can mutate Redis meta without taking a hard dependency on
pkg/lifecycle.
- Add Store.PublishUpdate / Store.LoadMeta and the OpUpdate stream op so
every sidecar replica converges on the new view.
- Propagate kill_reason (request | timeout | orphaned) through
DeleteCubeSandboxReq into the cubelet destroy annotation.
CubeProxy sidecar:
- Sweeper no longer short-circuits on !AutoPause; it now picks Pause
(auto_pause=true) or Kill (auto_pause=false, default) when the idle
threshold is exceeded, mirroring e2b's on_timeout=pause|kill.
- Add cubemasterclient.Kill (DELETE /cube/sandbox with kill_reason) and the
KillReason{Request,Timeout,Orphaned} constants.
- Extend the terminal-state fast path to cover "killing"/"killed" so we
don't spam RPCs after a kill is in flight.
- Handle the new OpUpdate event: upsert registry, reset LastActive to 0
(so the idle clock restarts), and push the refreshed meta to CubeProxy.
- Add registry.ResetLastActive helper.
CubeProxy Lua gate:
- sandbox_state.gate now returns 410 Gone for killing/killed (cube_retcode
310410) so SDK clients fail fast instead of retrying a doomed sandbox.
CubeAPI (Rust):
- Mark SandboxTimeoutRequest / SandboxRefreshRequest as implemented; decode
end_at via a new deserialize_optional_datetime that accepts null,
RFC 3339 strings, or unix-millis integers/strings.
- Surface end_at on GetSandboxResponse from the new master field.
Signed-off-by: joeyczheng <joeyczheng@tencent.com>
014361a to
2cbfac1
Compare
| | `read.py` | `sandbox.files.read()` — read a file from the sandbox filesystem | | ||
| | `pause.py` | `sandbox.pause()` / `sandbox.connect()` — snapshot and restore | | ||
| | `auto_resume.py` | `lifecycle={"on_timeout": "pause", "auto_resume": True}` — let the platform pause idle sandboxes and resume them on the next request | | ||
| | `auto_kill.py` | `lifecycle={"on_timeout": "kill"}` — let the platform tear down idle sandboxes (the default — destruction is final, no snapshot kept) | |
feat: #233 #232
implement lifecycle:kill base on #613
Summary
Bring sandbox lifecycle management to parity with e2b: on top of the existing
auto_pauseflow, add a timeout-then-kill default path, implement the previously-stubbedset_timeout/refreshAPIs, and surface a deterministicend_attimestamp across the entire stack.The change spans four layers — CubeMaster (Go), CubeProxy sidecar (Go), CubeProxy data plane (Lua), CubeAPI (Rust) — plus documentation.
Motivation
The previous implementation only handled
lifecycle.on_timeout="pause":auto_pausewere never reaped on idle timeout and held resources indefinitely.SandboxTimeoutRequest/SandboxRefreshRequestin CubeAPI were marked❌ not implemented, sosandbox.set_timeout()/sandbox.refresh()from the SDK were silent no-ops.GetSandboxResponse.end_atwas alwaysNone, forcing SDK clients to guess the real expiry.This PR closes those three gaps in one go, aligning with e2b's
on_timeout="kill"default and theset_timeout(duration)API surface.What's changed
1. New timeout → kill path (the new default)
CubeProxy/sidecar/sweeper!AutoPause; dispatch totryPauseortryKillbased onauto_pausewhen idle threshold is exceededcubemasterclient.KillDELETE /cube/sandboxcall withkill_reason=timeout, reusing the SETNX → push → RPC → finalise pattern fromPausesandbox_state.luakilling/killedstates and returns410 Gone(retcode310410) so SDK clients stop retrying immediatelyCubeMaster/sandbox_removeDestroyReqcarriesKillReason, written into thecube.master.instance.kill_reasonannotation for audit/billingkilling/killedto avoid RPC churn after a kill is in flight2. New
set_timeout/refreshendpoints (CubeMaster)POST /cube/sandbox/timeoutsandboxID,timeout(seconds)TimeoutSeconds = timeout,EndAt = now + timeout*1000POST /cube/sandbox/refreshsandboxID,duration(seconds)set_timeout(matches e2b convergence)