feat(lifecycle): kill path, set_timeout/refresh, and stable endAt by chengjoey · Pull Request #624 · TencentCloud/CubeSandbox

chengjoey · 2026-06-23T02:30:22Z

implement lifecycle:kill base on #613

Summary

Bring sandbox lifecycle management to parity with e2b: on top of the existing auto_pause flow, add a timeout-then-kill default path, implement the previously-stubbed set_timeout / refresh APIs, and surface a deterministic end_at timestamp across the entire stack.

The change spans four layers — CubeMaster (Go), CubeProxy sidecar (Go), CubeProxy data plane (Lua), CubeAPI (Rust) — plus documentation.

Motivation

The previous implementation only handled lifecycle.on_timeout="pause":

Sandboxes without auto_pause were never reaped on idle timeout and held resources indefinitely.
SandboxTimeoutRequest / SandboxRefreshRequest in CubeAPI were marked ❌ not implemented, so sandbox.set_timeout() / sandbox.refresh() from the SDK were silent no-ops.
GetSandboxResponse.end_at was always None, forcing SDK clients to guess the real expiry.

This PR closes those three gaps in one go, aligning with e2b's on_timeout="kill" default and the set_timeout(duration) API surface.

What's changed

1. New timeout → kill path (the new default)

Component	Change
`CubeProxy/sidecar/sweeper`	Stop short-circuiting on `!AutoPause`; dispatch to `tryPause` or `tryKill` based on `auto_pause` when idle threshold is exceeded
`cubemasterclient.Kill`	New `DELETE /cube/sandbox` call with `kill_reason=timeout`, reusing the SETNX → push → RPC → finalise pattern from `Pause`
`sandbox_state.lua`	Gate recognises `killing` / `killed` states and returns `410 Gone` (retcode `310410`) so SDK clients stop retrying immediately
`CubeMaster/sandbox_remove`	`DestroyReq` carries `KillReason`, written into the `cube.master.instance.kill_reason` annotation for audit/billing
Terminal-state fast path	Existing fast path extended to `killing` / `killed` to avoid RPC churn after a kill is in flight

2. New `set_timeout` / `refresh` endpoints (CubeMaster)

Route	Input	Behaviour
`POST /cube/sandbox/timeout`	`sandboxID`, `timeout` (seconds)	set `TimeoutSeconds = timeout`, `EndAt = now + timeout*1000`
`POST /cube/sandbox/refresh`	`sandboxID`, `duration` (seconds)	Equivalent to `set_timeout` (matches e2b convergence)

cubesandboxbot · 2026-06-23T02:38:35Z

 		destroyReq.Annotations = make(map[string]string)
 		destroyReq.Annotations[constants.CubeAnnotationsInsType] = req.InstanceType
 	}
+	if destroyReq.Annotations == nil {


Dead nil-check: this second destroyReq.Annotations == nil guard can never be reached.

If req.Annotations was nil, the first block (L43-46) already initializes the map. If it was non-nil, the struct literal at L40 already copied the non-nil map. Harmless but should be cleaned up.

cubesandboxbot · 2026-06-23T02:38:42Z

+			return endAt
+		}
+	}
+	return time.Now().UnixMilli() + int64(timeoutSeconds)*1000


Critical: refreshTimeoutMeta can report success without persisting state.

When getTimeoutProvider() returns nil, or RefreshTimeout fails with an error, this falls through to computing and returning a synthetic endAt (now + timeout*1000). The caller returns HTTP 200 with this plausible end_at in the response — but the lifecycle meta was not updated in Redis. The sweeper never sees the new timeout, and the sandbox is killed/paused at the wrong time.

Either propagate the error to the caller (return an error response code) or write the fallback value to Redis.

cubesandboxbot · 2026-06-23T02:38:44Z

+	}
+	endAt, err := p.LookupEndAt(ctx, sandboxID)
+	if err != nil {
+		return 0


LookupSandboxEndAt silently swallows provider errors.

When TimeoutProvider.LookupEndAt returns an error (e.g. transient Redis timeout), this returns 0 with no logging. Since this is called for every sandbox in info/list responses, a transient Redis blip silently makes all sandboxes show end_at=0 with no observable signal. Please log at warn level before returning 0.

cubesandboxbot · 2026-06-23T02:38:46Z

+		default:
+			_ = s.o.Redis.ClearState(ctx, sid)
+			_ = s.o.ProxyPush.SetState(ctx, sid, "running")
+			return errors.New("cubemaster kill: " + killErr.Error())


Rollback failure can permanently deny traffic. Both ClearState and SetState("running") errors are silently discarded with _ =. If either fails (transient Redis hiccup, proxy channel timeout), the sandbox's state remains as "killing" in the nginx dict — and the Lua gate returns 410 Gone permanently for every request until operator intervention.

Consider retry-with-backoff for rollback operations, or set a TTL on proxy shared-dict state entries so stale "killing" self-heals.

chenhengqi

Please update your commit message and PR summary. I don't have a whole picture what this PR trying to achieve.

I presume we don't have to touch CubeProxy for killing sandboxes.

chengjoey · 2026-06-23T07:38:33Z

Please update your commit message and PR summary. I don't have a whole picture what this PR trying to achieve.

I presume we don't have to touch CubeProxy for killing sandboxes.

done

cubesandboxbot · 2026-06-23T07:50:40Z

+		return 0
+	}
+	endAt, err := p.LookupEndAt(ctx, sandboxID)
+	if err != nil {


Error silently swallowed. LookupEndAt already surfaces the error but LookupSandboxEndAt discards it without logging. If Redis is down, sandbox list responses return end_at=0 with zero operator visibility. Compare with refreshTimeoutMeta (sandbox_timeout.go:126–128) which correctly logs the error.

Recommendation: Add log.G(ctx).Warnf(...) before return 0 on this line.

cubesandboxbot · 2026-06-23T07:50:41Z

+		rsp.Ret.RetMsg = "should provide sandboxID"
+		return
+	}
+	if req.Timeout <= 0 {


No upper bound on timeout accepts indefinite sandbox lifetime. An int32 can hold ~2.1 billion seconds (~68 years). A caller that sends set_timeout(2147483647) can permanently anchor a sandbox, defeating the resource-reclamation purpose of the timeout feature. The creation path defaults to 60s, but this endpoint accepts any positive value.

Recommendation: Define a configurable MaxTimeoutSeconds and validate against it:

const maxTimeoutSeconds = 86400 // 24h, or from config if req.Timeout <= 0 || req.Timeout > maxTimeoutSeconds { rsp.Ret.RetCode = int(errorcode.ErrorCode_MasterParamsError) rsp.Ret.RetMsg = fmt.Sprintf("timeout must be between 1 and %d seconds", maxTimeoutSeconds) return }

cubesandboxbot · 2026-06-23T07:50:43Z

+// doesn't carry: AutoPause / AutoResume / TemplateID / HostID / HostIP /
+// InstanceType), rewrites CreatedAt + TimeoutSeconds + EndAt, and publishes
+// an OpUpdate event so every sidecar replica converges on the new view.
+func (p *storeTimeoutProvider) RefreshTimeout(ctx context.Context, sandboxID string, timeoutSeconds int) (int64, error) {


Read-modify-write race. Two concurrent set_timeout calls for the same sandbox can interleave: both read the meta, both modify it in memory, then both HSET — the second write silently overwrites the first. The first caller receives a success response with their expected end_at, but the second caller's value is what's persisted. The reverse interleaving (shorter timeout winning) can cause premature sandbox death.

Recommendation: Use a Redis Lua script or WATCH/MULTI/EXEC to atomically read, modify, and write the meta. A Lua EVALSHA would also collapse the three round-trips (HGET → modify → HSET + XADD) into one, which would also be a nice performance win.

cubesandboxbot · 2026-06-23T07:50:45Z

+		req.InstanceType = cubebox.InstanceType_cubebox.String()
+	}
+
+	if !sandboxExists(ctx, req.SandboxID) {


TOCTOU race between existence check and meta write. sandboxExists checks local caches; then refreshTimeoutMeta calls LoadMeta on Redis. If the sandbox is concurrently deleted between these two calls: (1) sandboxExists returns true (cache not yet invalidated), (2) LoadMeta returns nil (meta was HDEL'd), (3) RefreshTimeout returns 0, nil, then (4) refreshTimeoutMeta falls through to the local fallback on line 132, producing a fake end_at. The client gets HTTP 200 with end_at that was never persisted. The caller thinks they set a timeout, but the sandbox is already gone.

Recommendation: Eliminate the separate sandboxExists check and rely on LoadMeta returning nil as the authoritative "not found" signal. Propagate that to the HTTP layer as a 404 rather than returning a fake end_at.

cubesandboxbot · 2026-06-23T07:50:46Z

 		one.TemplateID = templateID
 		one.Annotations = buildAnnotationsFromLabels(sandboxLabels)
 		one.Labels = sandboxLabels
+		one.EndAt = LookupSandboxEndAt(ctx, sandbox.GetId())


N+1 Redis round trips per sandbox list page. Each sandbox in the response issues a separate HGET + json.Unmarshal to Redis. With 50 sandboxes per page and callers like monitoring polling making repeated list calls, this adds up fast.

Recommendation: Batch with a single HMGET for all sandbox IDs in the page, then build a map[sandboxID]endAt locally in one pass.

cubesandboxbot · 2026-06-23T07:52:59Z

+		return 0
+	}
+	endAt, err := p.LookupEndAt(ctx, sandboxID)
+	if err != nil {


Error silently swallowed. LookupEndAt already surfaces the error but LookupSandboxEndAt discards it without logging. If Redis is down, sandbox list responses return end_at=0 with zero operator visibility. Compare with refreshTimeoutMeta (sandbox_timeout.go:126-128) which correctly logs the error.

Recommendation: Add log.G(ctx).Warnf(...) before return 0 on this line.

cubesandboxbot · 2026-06-23T07:53:00Z

+		rsp.Ret.RetMsg = "should provide sandboxID"
+		return
+	}
+	if req.Timeout <= 0 {


No upper bound on timeout. An int32 can hold ~2.1 billion seconds (~68 years). A caller that sends set_timeout(2147483647) can indefinitely anchor a sandbox, defeating the resource-reclamation purpose of the timeout feature. The creation path has a default (60s), but this endpoint accepts any positive value.

Recommendation: Define a configurable MaxTimeoutSeconds and validate against it, e.g.:

const maxTimeoutSeconds = 86400 // 24h if req.Timeout <= 0 || req.Timeout > maxTimeoutSeconds { // return params error }

cubesandboxbot · 2026-06-23T07:53:00Z

+// doesn't carry: AutoPause / AutoResume / TemplateID / HostID / HostIP /
+// InstanceType), rewrites CreatedAt + TimeoutSeconds + EndAt, and publishes
+// an OpUpdate event so every sidecar replica converges on the new view.
+func (p *storeTimeoutProvider) RefreshTimeout(ctx context.Context, sandboxID string, timeoutSeconds int) (int64, error) {


Classic read-modify-write race. Two concurrent set_timeout calls for the same sandbox will interleave: both read the meta, both modify it in memory, then both HSET — the second write silently overwrites the first. The first caller receives a success response with their expected end_at, but the second caller's value is what's persisted. The reverse interleaving (shorter timeout winning) can cause premature sandbox death.

Recommendation: Use a Redis Lua script or WATCH/MULTI/EXEC to atomically read, modify, and write the meta. A Lua EVALSHA would also collapse the three round-trips (HGET → modify → HSET + XADD) into one.

cubesandboxbot · 2026-06-23T07:53:01Z

+		req.InstanceType = cubebox.InstanceType_cubebox.String()
+	}
+
+	if !sandboxExists(ctx, req.SandboxID) {


TOCTOU race between existence check and meta write. sandboxExists checks local caches, then refreshTimeoutMeta calls LoadMeta on Redis. If the sandbox is concurrently deleted between these two calls: (1) sandboxExists returns true, (2) LoadMeta returns nil (meta was just HDEL'd), (3) RefreshTimeout returns 0, nil, (4) refreshTimeoutMeta falls through to the local fallback (line 132), (5) the client gets HTTP 200 with a locally-computed end_at that was never persisted. The caller believes the timeout was set, but the sandbox is already gone.

Recommendation: Eliminate the separate existence check and rely on LoadMeta returning nil as the authoritative "not found" signal. Propagate that to the HTTP layer as a proper 404 response rather than returning a fake end_at.

cubesandboxbot · 2026-06-23T07:53:02Z

 		one.TemplateID = templateID
 		one.Annotations = buildAnnotationsFromLabels(sandboxLabels)
 		one.Labels = sandboxLabels
+		one.EndAt = LookupSandboxEndAt(ctx, sandbox.GetId())


N+1 Redis round trips per sandbox list page. Each sandbox in the list response issues a separate HGET + json.Unmarshal to Redis. With 50 sandboxes per page (and paginated callers like monitoring polling), this is 50 sequential Redis round-trips per list API call.

Recommendation: Batch with HMGET (or HGETALL) in a single round-trip for all sandbox IDs in the page, then build a map[sandboxID]endAt locally.

cubesandboxbot · 2026-06-23T10:15:05Z

+
+	now := time.Now().UnixMilli()
+	meta.TimeoutSeconds = timeoutSeconds
+	meta.CreatedAt = now


Semantic overload: meta.CreatedAt = now repurposes the creation timestamp as a refresh timestamp. After a set_timeout/refresh call, any code reading CreatedAt to answer "when was this sandbox created?" (billing, audit, diagnostics) gets the wrong answer.

Consider adding a dedicated LastRefreshAt field to SandboxLifecycleMeta instead, and having the sweeper's baseline prefer max(LastActiveMs, LastRefreshAt, CreatedAt).

Ref: code-quality-review, performance-review

cubesandboxbot · 2026-06-23T10:15:07Z

+
+// SetTimeout implements POST /cube/sandbox/timeout. It is the master-side
+// counterpart of CubeAPI's SetTimeoutRequest -> SetTimeoutResponse.
+func SetTimeout(ctx context.Context, req *types.SetTimeoutRequest) (rsp *types.SetTimeoutRes) {


~40 lines of cloned code between SetTimeout and Refresh (compare L21-64 with L68-111). The only differences are the request type field name (Timeout vs Duration) and the log prefix. Extract a shared handler to reduce duplication and ensure future validation/logging changes apply to both uniformly.

cubesandboxbot · 2026-06-23T10:15:08Z

+		rsp.Ret.RetMsg = "should provide sandboxID"
+		return
+	}
+	if req.Timeout <= 0 {


No upper bound on Timeout/Duration: a caller could set timeout: 2147483647 (max int32), extending a sandbox's life ~68 years. Add a configurable maximum (e.g. 86400s = 24h) and reject values above it. Ref: security-review.

cubesandboxbot · 2026-06-23T10:15:10Z

+	return
+}
+
+func sandboxExists(ctx context.Context, sandboxID string) bool {


TOCTOU race: sandboxExists checks a stale local cache (GetSandboxCache is periodically wiped) and a Redis proxy-map key — neither of which are atomic with the subsequent refreshTimeoutMeta. A sandbox could be deleted between the check and the meta update.

Recommendation: let refreshTimeoutMeta / LoadMeta be the authoritative existence check and report a "not found" error when the meta is nil, eliminating this separate existence check. Ref: security-review, test-coverage-review.

cubesandboxbot · 2026-06-23T10:15:11Z

+	case string:
+		raw = []byte(x)
+	default:
+		log.G(ctx).Warnf("lifecycle: HGET %s %s unexpected type %T", MetaKey, sandboxID, v)


Silent nil on type mismatch: The default branch logs a warning and returns (nil, nil) for unexpected HGET return types. The caller cannot distinguish "entry not found" from "type mismatch error." If a future redis library upgrade returns int64, the entry is silently treated as missing. Return an error here instead. Ref: code-quality-review.

cubesandboxbot · 2026-06-23T10:15:13Z

+	}
+	reason := req.KillReason
+	if reason == "" {
+		reason = "request"


No kill_reason validation: The KillReason is accepted as an arbitrary string from the request body. An attacker who reaches the DELETE endpoint can inject "timeout" or "orphaned" or any custom value into cube.master.instance.kill_reason, corrupting audit/billing trails. Add an allowlist validation (request|timeout|orphaned). Ref: security-review.

cubesandboxbot · 2026-06-23T10:15:18Z

@@ -107,9 +109,6 @@ func (s *Sweeper) sweepOnce(ctx context.Context) {
 	withinWarmup := now.Sub(s.o.StartedAt) < s.o.BootstrapWarmup

 	for _, e := range s.o.Registry.Snapshot() {


Sweeper now iterates ALL sandboxes every tick, not just auto-pause ones (the if !e.Meta.AutoPause { continue } guard was removed at the old L109). This means the sweeper does a Redis GET per non-auto-pause sandbox per sweep interval. At 10k sandboxes with a 1s interval, that's ~10k GET/s sustained. Consider whether the sweep interval should be adaptive, or whether a separate kill-sweeper could run at a lower frequency. Ref: performance-review.

cubesandboxbot · 2026-06-23T10:15:20Z

+	// KillReasonOrphaned is reserved for future use: a sandbox observed on a
+	// node but missing from the registry / Redis source-of-truth. Mirrors
+	// e2b's orphan reaper.
+	KillReasonOrphaned = "orphaned"


Dead code: KillReasonOrphaned is declared but never referenced. Remove it and add back when the orphan reaper is implemented.

cubesandboxbot · 2026-06-23T10:15:56Z

+		}
+	}
+
+	if err := s.o.Redis.SetState(ctx, sid, "killed", s.o.StateLockTTL); err != nil {


Terminal state TTL causes re-kill cycle: SetState for "killed" uses StateLockTTL (60s). After TTL expiry, the sweeper re-detects the sandbox as idle, finds no state, re-enters tryKill, and issues a redundant Kill RPC. The same pattern exists for "paused" in tryPause (L262). Terminal states should use a persistent key (no TTL) or a much longer TTL (hours). Transient states (pausing, killing) keep a short TTL. Ref: performance-review.

cubesandboxbot · 2026-06-23T10:15:57Z

+				zap.String("sandbox_id", sid),
+				zap.Int("ret_code", apiErr.RetCode),
+				zap.String("ret_msg", apiErr.RetMsg))
+		case errors.As(killErr, &apiErr) && apiErr.IsAlreadyInState():


Already-in-state path untested: tryKill's IsAlreadyInState() case falls through to success bookkeeping (writes "killed", deletes meta, evicts registry) but has no test. The pause equivalent IS tested (TestSweeper_AlreadyPausedReconcilesAsSuccess). Add the kill-side counterpart. Ref: test-coverage-review.

cubesandboxbot · 2026-06-23T10:15:59Z

+				zap.String("sandbox_id", sid),
+				zap.Int("ret_code", apiErr.RetCode))
+		default:
+			_ = s.o.Redis.ClearState(ctx, sid)


Kill rollback resets to "running" — enables oscillation: When CubeMaster.Kill() returns a non-terminal error (neither IsNotFound nor IsAlreadyInState), ClearState + SetState("running") re-opens the sandbox to traffic. If the Kill RPC partially succeeded before erroring (timeout after submitting the destroy task), the cycle repeats every sweep. Consider a backoff mechanism instead of immediately clearing state. Ref: security-review.

kinwin-ustc · 2026-06-23T10:35:40Z

Our semantics differ slightly from E2B. E2B timeout starts from sandbox startup, while ours starts from no request. In the automatic pause scenario, I think idle timeout is more reasonable, but from the perspective of how long an E2B sandbox can survive, should we follow our auto_puase semantics?

chengjoey · 2026-06-23T12:01:28Z

Our semantics differ slightly from E2B. E2B timeout starts from sandbox startup, while ours starts from no request. In the automatic pause scenario, I think idle timeout is more reasonable, but from the perspective of how long an E2B sandbox can survive, should we follow our auto_puase semantics?

e2b's model: EndTime is an absolute deadline = StartTime + timeout. The evictor compares now to EndTime; activity does not reset it. The e2b SDK clients (not the server) call keep_alive on every run_code / commands.run to push EndTime forward — so "activity extends life" is implemented client-side via explicit RPCs. They also have a hard wall-clock ceiling MaxInstanceLength (24h Pro / 1h Base) on top of that.
Our model: server-side idle reaper. baseline = max(LastActiveMs, CreatedAt), idleFor = now - baseline. CubeProxy sidecar reports last_active on every HTTP request, so the server itself observes activity and rebases the idle clock without any client-side cooperation.

I think keep idle-based semantics is better:

It's the right primitive for on_timeout=pause — auto-pause means "free resources when idle"; if we switched to absolute-deadline, a sandbox actively serving requests would get paused mid-flight unless the SDK keeps pushing keep_alive. That regression isn't worth it.
We already invested in server-side activity tracking (sidecar last_active stream). E2b had to use absolute-deadline because their server doesn't see request traffic; we do, so we can give a strictly better experience.

What I think we should adopt from e2b is something orthogonal: a MaxInstanceLength hard ceiling that bounds the worst case independent of idle activity (so an always-busy sandbox doesn't live forever).

https://github.com/e2b-dev/infra/blob/ecd13b04d8aead46e7afe386538d5cb10ef22b40/packages/api/internal/sandbox/sandboxtypes/sandbox.go#L91

chenhengqi · 2026-06-24T11:03:38Z

 - 通过 HTTP 直连沙箱内的服务（例如 `getHost()` 返回的 URL）。

-未配置 `auto_pause` / 不传 `lifecycle` 的沙箱完全保留旧行为：空闲超时直接销毁。
+未配置 `auto_pause` / 不传 `lifecycle` 的沙箱默认行为是 `on_timeout="kill"`：空闲超过 `timeout` 秒后，平台（CubeProxy sidecar 的 sweeper）会主动销毁该沙箱并广播 `killing` → `killed` 状态，CubeProxy 会对后续请求立即返回 410 Gone。这与 e2b `lifecycle.on_timeout="kill"` 语义一致。如果完全不希望被自动回收，请在创建时把 `timeout` 设得足够大、或主动在客户端发心跳调用刷新 idle 计时。


These user-facing documents only need to describe the behavior and do not need to expose implementation details(Cubeproxy etc).

chenhengqi · 2026-06-24T11:13:10Z

 		rsp = handleSandboxLogsAction(w, r, rt)
 	case r.URL.Path == actionURI(SandboxUpdateAction):
 		rsp = handleUpdateAction(w, r, rt)
+	case r.URL.Path == actionURI(SandboxTimeoutAction):


We should employ some web framework here. cc @fslongjin

I'll refactor these codes later.

chenhengqi · 2026-06-24T11:31:56Z

-			zap.String("sandbox_id", e.Meta.SandboxID),
-			zap.Duration("idle_for", idleFor),
-			zap.Int("timeout_seconds", e.Meta.TimeoutSeconds))
+		if e.Meta.AutoPause {


Should we use a switch here?

chenhengqi · 2026-06-24T11:35:51Z

+		}
+	}()
+
+	if req.SandboxID == "" {


I think this kind of check should reside in CubeMaster/pkg/service/httpservice/cube/sandbox_timeout.go‎.

Here, I did not choose to move it to cube/sandbox_timeout.go‎ because I observed that in sandbox/sandbox_*.go, the logic is used as a fallback, while httpservice/cube only performs basic validations at the HTTP layer, just like sandbox_update.

CubeSandbox/CubeMaster/pkg/service/sandbox/sandbox_update.go

Lines 39 to 48 in 08e5309

if req.SandboxID == "" || req.InstanceType == "" || req.Action == "" {

rsp.Ret.RetCode = int(errorcode.ErrorCode_MasterParamsError)

rsp.Ret.RetMsg = "should provide InstanceType,SandboxID,Action"

return

}

if req.Action != "pause" && req.Action != "resume" {

rsp.Ret.RetCode = int(errorcode.ErrorCode_MasterParamsError)

rsp.Ret.RetMsg = "action should be pause or resume"

return

}

chenhengqi · 2026-06-24T11:36:39Z

+		return
+	}
+	if req.InstanceType == "" {
+		req.InstanceType = cubebox.InstanceType_cubebox.String()


Same here. The caller already set default value?

chenhengqi

I left some comments, but this PR looks good overall.

chenhengqi · 2026-06-25T01:35:10Z

One more request, please add an example for this feature. Thanks.

chengjoey · 2026-06-25T03:35:09Z

One more request, please add an example for this feature. Thanks.

done, auto-kill.py added

…end_at Implement the timeout-then-kill side of e2b's lifecycle.on_timeout semantics, expose the missing set_timeout/refresh endpoints on CubeMaster, and surface a deterministic end_at across the whole stack (CubeMaster → sidecar → CubeProxy → CubeAPI/SDK). CubeMaster: - Add POST /cube/sandbox/timeout (SetTimeout) and POST /cube/sandbox/refresh (Refresh) handlers; both rebase CreatedAt and overwrite TimeoutSeconds / EndAt on the lifecycle meta, then publish an OpUpdate event. - Persist EndAt = CreatedAt + TimeoutSeconds*1000 in SandboxLifecycleMeta at create time and expose it on /sandbox/info and /sandbox/list responses (SandboxData.EndAt / SandboxBriefData.EndAt). - Introduce a sandbox.TimeoutProvider seam wired by lifecycle.Init so the HTTP service can mutate Redis meta without taking a hard dependency on pkg/lifecycle. - Add Store.PublishUpdate / Store.LoadMeta and the OpUpdate stream op so every sidecar replica converges on the new view. - Propagate kill_reason (request | timeout | orphaned) through DeleteCubeSandboxReq into the cubelet destroy annotation. CubeProxy sidecar: - Sweeper no longer short-circuits on !AutoPause; it now picks Pause (auto_pause=true) or Kill (auto_pause=false, default) when the idle threshold is exceeded, mirroring e2b's on_timeout=pause|kill. - Add cubemasterclient.Kill (DELETE /cube/sandbox with kill_reason) and the KillReason{Request,Timeout,Orphaned} constants. - Extend the terminal-state fast path to cover "killing"/"killed" so we don't spam RPCs after a kill is in flight. - Handle the new OpUpdate event: upsert registry, reset LastActive to 0 (so the idle clock restarts), and push the refreshed meta to CubeProxy. - Add registry.ResetLastActive helper. CubeProxy Lua gate: - sandbox_state.gate now returns 410 Gone for killing/killed (cube_retcode 310410) so SDK clients fail fast instead of retrying a doomed sandbox. CubeAPI (Rust): - Mark SandboxTimeoutRequest / SandboxRefreshRequest as implemented; decode end_at via a new deserialize_optional_datetime that accepts null, RFC 3339 strings, or unix-millis integers/strings. - Surface end_at on GetSandboxResponse from the new master field. Signed-off-by: joeyczheng <joeyczheng@tencent.com>

chenhengqi · 2026-06-25T14:29:52Z

 | `read.py` | `sandbox.files.read()` — read a file from the sandbox filesystem |
 | `pause.py` | `sandbox.pause()` / `sandbox.connect()` — snapshot and restore |
 | `auto_resume.py` | `lifecycle={"on_timeout": "pause", "auto_resume": True}` — let the platform pause idle sandboxes and resume them on the next request |
+| `auto_kill.py` | `lifecycle={"on_timeout": "kill"}` — let the platform tear down idle sandboxes (the default — destruction is final, no snapshot kept) |


no snapshot kept?

chengjoey requested review from chenhengqi, fslongjin, ls-ggg, tinklone and up2wing as code owners June 23, 2026 02:30

cubesandboxbot Bot reviewed Jun 23, 2026

View reviewed changes

chenhengqi reviewed Jun 23, 2026

View reviewed changes

chenhengqi self-assigned this Jun 23, 2026

chengjoey force-pushed the feat/lifecycle-kill branch 2 times, most recently from 0970025 to cb7473d Compare June 23, 2026 07:35

cubesandboxbot Bot reviewed Jun 23, 2026

View reviewed changes

chengjoey force-pushed the feat/lifecycle-kill branch from cb7473d to 87b0ad1 Compare June 23, 2026 10:09

cubesandboxbot Bot reviewed Jun 23, 2026

View reviewed changes

chenhengqi reviewed Jun 24, 2026

View reviewed changes

chengjoey force-pushed the feat/lifecycle-kill branch from 87b0ad1 to 014361a Compare June 25, 2026 03:30

chengjoey force-pushed the feat/lifecycle-kill branch from 014361a to 2cbfac1 Compare June 25, 2026 03:37

chenhengqi reviewed Jun 25, 2026

View reviewed changes

		@@ -107,9 +109,6 @@ func (s *Sweeper) sweepOnce(ctx context.Context) {
		withinWarmup := now.Sub(s.o.StartedAt) < s.o.BootstrapWarmup

		for _, e := range s.o.Registry.Snapshot() {

	if req.SandboxID == "" \|\| req.InstanceType == "" \|\| req.Action == "" {
	rsp.Ret.RetCode = int(errorcode.ErrorCode_MasterParamsError)
	rsp.Ret.RetMsg = "should provide InstanceType,SandboxID,Action"
	return
	}
	if req.Action != "pause" && req.Action != "resume" {
	rsp.Ret.RetCode = int(errorcode.ErrorCode_MasterParamsError)
	rsp.Ret.RetMsg = "action should be pause or resume"
	return
	}

Uh oh!

Conversation

chengjoey commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation

What's changed

1. New timeout → kill path (the new default)

2. New set_timeout / refresh endpoints (CubeMaster)

Uh oh!

cubesandboxbot Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

cubesandboxbot Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

cubesandboxbot Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

cubesandboxbot Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

chenhengqi left a comment

Choose a reason for hiding this comment

Uh oh!

chengjoey commented Jun 23, 2026

Uh oh!

cubesandboxbot Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

cubesandboxbot Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

cubesandboxbot Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

cubesandboxbot Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

cubesandboxbot Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

cubesandboxbot Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

cubesandboxbot Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

cubesandboxbot Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

cubesandboxbot Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

cubesandboxbot Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

cubesandboxbot Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

cubesandboxbot Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

cubesandboxbot Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

cubesandboxbot Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

cubesandboxbot Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

cubesandboxbot Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

cubesandboxbot Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

cubesandboxbot Bot Jun 23, 2026

Choose a reason for hiding this comment

chengjoey commented Jun 23, 2026 •

edited

Loading

2. New `set_timeout` / `refresh` endpoints (CubeMaster)