Skip to content

feat(lifecycle): kill path, set_timeout/refresh, and stable endAt#624

Open
chengjoey wants to merge 1 commit into
TencentCloud:masterfrom
chengjoey:feat/lifecycle-kill
Open

feat(lifecycle): kill path, set_timeout/refresh, and stable endAt#624
chengjoey wants to merge 1 commit into
TencentCloud:masterfrom
chengjoey:feat/lifecycle-kill

Conversation

@chengjoey

@chengjoey chengjoey commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

feat: #233 #232

implement lifecycle:kill base on #613

Summary

Bring sandbox lifecycle management to parity with e2b: on top of the existing auto_pause flow, add a timeout-then-kill default path, implement the previously-stubbed set_timeout / refresh APIs, and surface a deterministic end_at timestamp across the entire stack.

The change spans four layers — CubeMaster (Go), CubeProxy sidecar (Go), CubeProxy data plane (Lua), CubeAPI (Rust) — plus documentation.

Motivation

The previous implementation only handled lifecycle.on_timeout="pause":

  • Sandboxes without auto_pause were never reaped on idle timeout and held resources indefinitely.
  • SandboxTimeoutRequest / SandboxRefreshRequest in CubeAPI were marked ❌ not implemented, so sandbox.set_timeout() / sandbox.refresh() from the SDK were silent no-ops.
  • GetSandboxResponse.end_at was always None, forcing SDK clients to guess the real expiry.

This PR closes those three gaps in one go, aligning with e2b's on_timeout="kill" default and the set_timeout(duration) API surface.

What's changed

1. New timeout → kill path (the new default)

Component Change
CubeProxy/sidecar/sweeper Stop short-circuiting on !AutoPause; dispatch to tryPause or tryKill based on auto_pause when idle threshold is exceeded
cubemasterclient.Kill New DELETE /cube/sandbox call with kill_reason=timeout, reusing the SETNX → push → RPC → finalise pattern from Pause
sandbox_state.lua Gate recognises killing / killed states and returns 410 Gone (retcode 310410) so SDK clients stop retrying immediately
CubeMaster/sandbox_remove DestroyReq carries KillReason, written into the cube.master.instance.kill_reason annotation for audit/billing
Terminal-state fast path Existing fast path extended to killing / killed to avoid RPC churn after a kill is in flight

2. New set_timeout / refresh endpoints (CubeMaster)

Route Input Behaviour
POST /cube/sandbox/timeout sandboxID, timeout (seconds) set TimeoutSeconds = timeout, EndAt = now + timeout*1000
POST /cube/sandbox/refresh sandboxID, duration (seconds) Equivalent to set_timeout (matches e2b convergence)

destroyReq.Annotations = make(map[string]string)
destroyReq.Annotations[constants.CubeAnnotationsInsType] = req.InstanceType
}
if destroyReq.Annotations == nil {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dead nil-check: this second destroyReq.Annotations == nil guard can never be reached.

If req.Annotations was nil, the first block (L43-46) already initializes the map. If it was non-nil, the struct literal at L40 already copied the non-nil map. Harmless but should be cleaned up.

return endAt
}
}
return time.Now().UnixMilli() + int64(timeoutSeconds)*1000

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Critical: refreshTimeoutMeta can report success without persisting state.

When getTimeoutProvider() returns nil, or RefreshTimeout fails with an error, this falls through to computing and returning a synthetic endAt (now + timeout*1000). The caller returns HTTP 200 with this plausible end_at in the response — but the lifecycle meta was not updated in Redis. The sweeper never sees the new timeout, and the sandbox is killed/paused at the wrong time.

Either propagate the error to the caller (return an error response code) or write the fallback value to Redis.

}
endAt, err := p.LookupEndAt(ctx, sandboxID)
if err != nil {
return 0

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LookupSandboxEndAt silently swallows provider errors.

When TimeoutProvider.LookupEndAt returns an error (e.g. transient Redis timeout), this returns 0 with no logging. Since this is called for every sandbox in info/list responses, a transient Redis blip silently makes all sandboxes show end_at=0 with no observable signal. Please log at warn level before returning 0.

default:
_ = s.o.Redis.ClearState(ctx, sid)
_ = s.o.ProxyPush.SetState(ctx, sid, "running")
return errors.New("cubemaster kill: " + killErr.Error())

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rollback failure can permanently deny traffic. Both ClearState and SetState("running") errors are silently discarded with _ =. If either fails (transient Redis hiccup, proxy channel timeout), the sandbox's state remains as "killing" in the nginx dict — and the Lua gate returns 410 Gone permanently for every request until operator intervention.

Consider retry-with-backoff for rollback operations, or set a TTL on proxy shared-dict state entries so stale "killing" self-heals.

@chenhengqi chenhengqi left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update your commit message and PR summary. I don't have a whole picture what this PR trying to achieve.

I presume we don't have to touch CubeProxy for killing sandboxes.

@chenhengqi chenhengqi self-assigned this Jun 23, 2026
@chengjoey chengjoey force-pushed the feat/lifecycle-kill branch 2 times, most recently from 0970025 to cb7473d Compare June 23, 2026 07:35
@chengjoey

Copy link
Copy Markdown
Contributor Author

Please update your commit message and PR summary. I don't have a whole picture what this PR trying to achieve.

I presume we don't have to touch CubeProxy for killing sandboxes.

done

return 0
}
endAt, err := p.LookupEndAt(ctx, sandboxID)
if err != nil {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error silently swallowed. LookupEndAt already surfaces the error but LookupSandboxEndAt discards it without logging. If Redis is down, sandbox list responses return end_at=0 with zero operator visibility. Compare with refreshTimeoutMeta (sandbox_timeout.go:126–128) which correctly logs the error.

Recommendation: Add log.G(ctx).Warnf(...) before return 0 on this line.

rsp.Ret.RetMsg = "should provide sandboxID"
return
}
if req.Timeout <= 0 {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No upper bound on timeout accepts indefinite sandbox lifetime. An int32 can hold ~2.1 billion seconds (~68 years). A caller that sends set_timeout(2147483647) can permanently anchor a sandbox, defeating the resource-reclamation purpose of the timeout feature. The creation path defaults to 60s, but this endpoint accepts any positive value.

Recommendation: Define a configurable MaxTimeoutSeconds and validate against it:

const maxTimeoutSeconds = 86400 // 24h, or from config
if req.Timeout <= 0 || req.Timeout > maxTimeoutSeconds {
    rsp.Ret.RetCode = int(errorcode.ErrorCode_MasterParamsError)
    rsp.Ret.RetMsg = fmt.Sprintf("timeout must be between 1 and %d seconds", maxTimeoutSeconds)
    return
}

// doesn't carry: AutoPause / AutoResume / TemplateID / HostID / HostIP /
// InstanceType), rewrites CreatedAt + TimeoutSeconds + EndAt, and publishes
// an OpUpdate event so every sidecar replica converges on the new view.
func (p *storeTimeoutProvider) RefreshTimeout(ctx context.Context, sandboxID string, timeoutSeconds int) (int64, error) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Read-modify-write race. Two concurrent set_timeout calls for the same sandbox can interleave: both read the meta, both modify it in memory, then both HSET — the second write silently overwrites the first. The first caller receives a success response with their expected end_at, but the second caller's value is what's persisted. The reverse interleaving (shorter timeout winning) can cause premature sandbox death.

Recommendation: Use a Redis Lua script or WATCH/MULTI/EXEC to atomically read, modify, and write the meta. A Lua EVALSHA would also collapse the three round-trips (HGET → modify → HSET + XADD) into one, which would also be a nice performance win.

req.InstanceType = cubebox.InstanceType_cubebox.String()
}

if !sandboxExists(ctx, req.SandboxID) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TOCTOU race between existence check and meta write. sandboxExists checks local caches; then refreshTimeoutMeta calls LoadMeta on Redis. If the sandbox is concurrently deleted between these two calls: (1) sandboxExists returns true (cache not yet invalidated), (2) LoadMeta returns nil (meta was HDEL'd), (3) RefreshTimeout returns 0, nil, then (4) refreshTimeoutMeta falls through to the local fallback on line 132, producing a fake end_at. The client gets HTTP 200 with end_at that was never persisted. The caller thinks they set a timeout, but the sandbox is already gone.

Recommendation: Eliminate the separate sandboxExists check and rely on LoadMeta returning nil as the authoritative "not found" signal. Propagate that to the HTTP layer as a 404 rather than returning a fake end_at.

one.TemplateID = templateID
one.Annotations = buildAnnotationsFromLabels(sandboxLabels)
one.Labels = sandboxLabels
one.EndAt = LookupSandboxEndAt(ctx, sandbox.GetId())

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

N+1 Redis round trips per sandbox list page. Each sandbox in the response issues a separate HGET + json.Unmarshal to Redis. With 50 sandboxes per page and callers like monitoring polling making repeated list calls, this adds up fast.

Recommendation: Batch with a single HMGET for all sandbox IDs in the page, then build a map[sandboxID]endAt locally in one pass.

return 0
}
endAt, err := p.LookupEndAt(ctx, sandboxID)
if err != nil {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error silently swallowed. LookupEndAt already surfaces the error but LookupSandboxEndAt discards it without logging. If Redis is down, sandbox list responses return end_at=0 with zero operator visibility. Compare with refreshTimeoutMeta (sandbox_timeout.go:126-128) which correctly logs the error.

Recommendation: Add log.G(ctx).Warnf(...) before return 0 on this line.

rsp.Ret.RetMsg = "should provide sandboxID"
return
}
if req.Timeout <= 0 {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No upper bound on timeout. An int32 can hold ~2.1 billion seconds (~68 years). A caller that sends set_timeout(2147483647) can indefinitely anchor a sandbox, defeating the resource-reclamation purpose of the timeout feature. The creation path has a default (60s), but this endpoint accepts any positive value.

Recommendation: Define a configurable MaxTimeoutSeconds and validate against it, e.g.:

const maxTimeoutSeconds = 86400 // 24h
if req.Timeout <= 0 || req.Timeout > maxTimeoutSeconds {
    // return params error
}

// doesn't carry: AutoPause / AutoResume / TemplateID / HostID / HostIP /
// InstanceType), rewrites CreatedAt + TimeoutSeconds + EndAt, and publishes
// an OpUpdate event so every sidecar replica converges on the new view.
func (p *storeTimeoutProvider) RefreshTimeout(ctx context.Context, sandboxID string, timeoutSeconds int) (int64, error) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Classic read-modify-write race. Two concurrent set_timeout calls for the same sandbox will interleave: both read the meta, both modify it in memory, then both HSET — the second write silently overwrites the first. The first caller receives a success response with their expected end_at, but the second caller's value is what's persisted. The reverse interleaving (shorter timeout winning) can cause premature sandbox death.

Recommendation: Use a Redis Lua script or WATCH/MULTI/EXEC to atomically read, modify, and write the meta. A Lua EVALSHA would also collapse the three round-trips (HGET → modify → HSET + XADD) into one.

req.InstanceType = cubebox.InstanceType_cubebox.String()
}

if !sandboxExists(ctx, req.SandboxID) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TOCTOU race between existence check and meta write. sandboxExists checks local caches, then refreshTimeoutMeta calls LoadMeta on Redis. If the sandbox is concurrently deleted between these two calls: (1) sandboxExists returns true, (2) LoadMeta returns nil (meta was just HDEL'd), (3) RefreshTimeout returns 0, nil, (4) refreshTimeoutMeta falls through to the local fallback (line 132), (5) the client gets HTTP 200 with a locally-computed end_at that was never persisted. The caller believes the timeout was set, but the sandbox is already gone.

Recommendation: Eliminate the separate existence check and rely on LoadMeta returning nil as the authoritative "not found" signal. Propagate that to the HTTP layer as a proper 404 response rather than returning a fake end_at.

one.TemplateID = templateID
one.Annotations = buildAnnotationsFromLabels(sandboxLabels)
one.Labels = sandboxLabels
one.EndAt = LookupSandboxEndAt(ctx, sandbox.GetId())

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

N+1 Redis round trips per sandbox list page. Each sandbox in the list response issues a separate HGET + json.Unmarshal to Redis. With 50 sandboxes per page (and paginated callers like monitoring polling), this is 50 sequential Redis round-trips per list API call.

Recommendation: Batch with HMGET (or HGETALL) in a single round-trip for all sandbox IDs in the page, then build a map[sandboxID]endAt locally.

@chengjoey chengjoey force-pushed the feat/lifecycle-kill branch from cb7473d to 87b0ad1 Compare June 23, 2026 10:09

now := time.Now().UnixMilli()
meta.TimeoutSeconds = timeoutSeconds
meta.CreatedAt = now

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Semantic overload: meta.CreatedAt = now repurposes the creation timestamp as a refresh timestamp. After a set_timeout/refresh call, any code reading CreatedAt to answer "when was this sandbox created?" (billing, audit, diagnostics) gets the wrong answer.

Consider adding a dedicated LastRefreshAt field to SandboxLifecycleMeta instead, and having the sweeper's baseline prefer max(LastActiveMs, LastRefreshAt, CreatedAt).

Ref: code-quality-review, performance-review


// SetTimeout implements POST /cube/sandbox/timeout. It is the master-side
// counterpart of CubeAPI's SetTimeoutRequest -> SetTimeoutResponse.
func SetTimeout(ctx context.Context, req *types.SetTimeoutRequest) (rsp *types.SetTimeoutRes) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

~40 lines of cloned code between SetTimeout and Refresh (compare L21-64 with L68-111). The only differences are the request type field name (Timeout vs Duration) and the log prefix. Extract a shared handler to reduce duplication and ensure future validation/logging changes apply to both uniformly.

rsp.Ret.RetMsg = "should provide sandboxID"
return
}
if req.Timeout <= 0 {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No upper bound on Timeout/Duration: a caller could set timeout: 2147483647 (max int32), extending a sandbox's life ~68 years. Add a configurable maximum (e.g. 86400s = 24h) and reject values above it. Ref: security-review.

return
}

func sandboxExists(ctx context.Context, sandboxID string) bool {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TOCTOU race: sandboxExists checks a stale local cache (GetSandboxCache is periodically wiped) and a Redis proxy-map key — neither of which are atomic with the subsequent refreshTimeoutMeta. A sandbox could be deleted between the check and the meta update.

Recommendation: let refreshTimeoutMeta / LoadMeta be the authoritative existence check and report a "not found" error when the meta is nil, eliminating this separate existence check. Ref: security-review, test-coverage-review.

case string:
raw = []byte(x)
default:
log.G(ctx).Warnf("lifecycle: HGET %s %s unexpected type %T", MetaKey, sandboxID, v)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Silent nil on type mismatch: The default branch logs a warning and returns (nil, nil) for unexpected HGET return types. The caller cannot distinguish "entry not found" from "type mismatch error." If a future redis library upgrade returns int64, the entry is silently treated as missing. Return an error here instead. Ref: code-quality-review.

}
reason := req.KillReason
if reason == "" {
reason = "request"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No kill_reason validation: The KillReason is accepted as an arbitrary string from the request body. An attacker who reaches the DELETE endpoint can inject "timeout" or "orphaned" or any custom value into cube.master.instance.kill_reason, corrupting audit/billing trails. Add an allowlist validation (request|timeout|orphaned). Ref: security-review.

@@ -107,9 +109,6 @@ func (s *Sweeper) sweepOnce(ctx context.Context) {
withinWarmup := now.Sub(s.o.StartedAt) < s.o.BootstrapWarmup

for _, e := range s.o.Registry.Snapshot() {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sweeper now iterates ALL sandboxes every tick, not just auto-pause ones (the if !e.Meta.AutoPause { continue } guard was removed at the old L109). This means the sweeper does a Redis GET per non-auto-pause sandbox per sweep interval. At 10k sandboxes with a 1s interval, that's ~10k GET/s sustained. Consider whether the sweep interval should be adaptive, or whether a separate kill-sweeper could run at a lower frequency. Ref: performance-review.

// KillReasonOrphaned is reserved for future use: a sandbox observed on a
// node but missing from the registry / Redis source-of-truth. Mirrors
// e2b's orphan reaper.
KillReasonOrphaned = "orphaned"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dead code: KillReasonOrphaned is declared but never referenced. Remove it and add back when the orphan reaper is implemented.

}
}

if err := s.o.Redis.SetState(ctx, sid, "killed", s.o.StateLockTTL); err != nil {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Terminal state TTL causes re-kill cycle: SetState for "killed" uses StateLockTTL (60s). After TTL expiry, the sweeper re-detects the sandbox as idle, finds no state, re-enters tryKill, and issues a redundant Kill RPC. The same pattern exists for "paused" in tryPause (L262). Terminal states should use a persistent key (no TTL) or a much longer TTL (hours). Transient states (pausing, killing) keep a short TTL. Ref: performance-review.

zap.String("sandbox_id", sid),
zap.Int("ret_code", apiErr.RetCode),
zap.String("ret_msg", apiErr.RetMsg))
case errors.As(killErr, &apiErr) && apiErr.IsAlreadyInState():

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already-in-state path untested: tryKill's IsAlreadyInState() case falls through to success bookkeeping (writes "killed", deletes meta, evicts registry) but has no test. The pause equivalent IS tested (TestSweeper_AlreadyPausedReconcilesAsSuccess). Add the kill-side counterpart. Ref: test-coverage-review.

zap.String("sandbox_id", sid),
zap.Int("ret_code", apiErr.RetCode))
default:
_ = s.o.Redis.ClearState(ctx, sid)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kill rollback resets to "running" — enables oscillation: When CubeMaster.Kill() returns a non-terminal error (neither IsNotFound nor IsAlreadyInState), ClearState + SetState("running") re-opens the sandbox to traffic. If the Kill RPC partially succeeded before erroring (timeout after submitting the destroy task), the cycle repeats every sweep. Consider a backoff mechanism instead of immediately clearing state. Ref: security-review.

@kinwin-ustc

Copy link
Copy Markdown
Collaborator

Our semantics differ slightly from E2B. E2B timeout starts from sandbox startup, while ours starts from no request. In the automatic pause scenario, I think idle timeout is more reasonable, but from the perspective of how long an E2B sandbox can survive, should we follow our auto_puase semantics?

@chengjoey

Copy link
Copy Markdown
Contributor Author

Our semantics differ slightly from E2B. E2B timeout starts from sandbox startup, while ours starts from no request. In the automatic pause scenario, I think idle timeout is more reasonable, but from the perspective of how long an E2B sandbox can survive, should we follow our auto_puase semantics?

  • e2b's model: EndTime is an absolute deadline = StartTime + timeout. The evictor compares now to EndTime; activity does not reset it. The e2b SDK clients (not the server) call keep_alive on every run_code / commands.run to push EndTime forward — so "activity extends life" is implemented client-side via explicit RPCs. They also have a hard wall-clock ceiling MaxInstanceLength (24h Pro / 1h Base) on top of that.

  • Our model: server-side idle reaper. baseline = max(LastActiveMs, CreatedAt), idleFor = now - baseline. CubeProxy sidecar reports last_active on every HTTP request, so the server itself observes activity and rebases the idle clock without any client-side cooperation.

I think keep idle-based semantics is better:

  1. It's the right primitive for on_timeout=pause — auto-pause means "free resources when idle"; if we switched to absolute-deadline, a sandbox actively serving requests would get paused mid-flight unless the SDK keeps pushing keep_alive. That regression isn't worth it.

  2. We already invested in server-side activity tracking (sidecar last_active stream). E2b had to use absolute-deadline because their server doesn't see request traffic; we do, so we can give a strictly better experience.

What I think we should adopt from e2b is something orthogonal: a MaxInstanceLength hard ceiling that bounds the worst case independent of idle activity (so an always-busy sandbox doesn't live forever).

https://github.com/e2b-dev/infra/blob/ecd13b04d8aead46e7afe386538d5cb10ef22b40/packages/api/internal/sandbox/sandboxtypes/sandbox.go#L91

Comment thread docs/zh/guide/lifecycle.md Outdated
- 通过 HTTP 直连沙箱内的服务(例如 `getHost()` 返回的 URL)。

未配置 `auto_pause` / 不传 `lifecycle` 的沙箱完全保留旧行为:空闲超时直接销毁
未配置 `auto_pause` / 不传 `lifecycle` 的沙箱默认行为是 `on_timeout="kill"`:空闲超过 `timeout` 秒后,平台(CubeProxy sidecar 的 sweeper)会主动销毁该沙箱并广播 `killing` → `killed` 状态,CubeProxy 会对后续请求立即返回 410 Gone。这与 e2b `lifecycle.on_timeout="kill"` 语义一致。如果完全不希望被自动回收,请在创建时把 `timeout` 设得足够大、或主动在客户端发心跳调用刷新 idle 计时

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These user-facing documents only need to describe the behavior and do not need to expose implementation details(Cubeproxy etc).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

rsp = handleSandboxLogsAction(w, r, rt)
case r.URL.Path == actionURI(SandboxUpdateAction):
rsp = handleUpdateAction(w, r, rt)
case r.URL.Path == actionURI(SandboxTimeoutAction):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should employ some web framework here. cc @fslongjin

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll refactor these codes later.

zap.String("sandbox_id", e.Meta.SandboxID),
zap.Duration("idle_for", idleFor),
zap.Int("timeout_seconds", e.Meta.TimeoutSeconds))
if e.Meta.AutoPause {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use a switch here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}
}()

if req.SandboxID == "" {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this kind of check should reside in CubeMaster/pkg/service/httpservice/cube/sandbox_timeout.go‎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, I did not choose to move it to cube/sandbox_timeout.go‎ because I observed that in sandbox/sandbox_*.go, the logic is used as a fallback, while httpservice/cube only performs basic validations at the HTTP layer, just like sandbox_update.

if req.SandboxID == "" || req.InstanceType == "" || req.Action == "" {
rsp.Ret.RetCode = int(errorcode.ErrorCode_MasterParamsError)
rsp.Ret.RetMsg = "should provide InstanceType,SandboxID,Action"
return
}
if req.Action != "pause" && req.Action != "resume" {
rsp.Ret.RetCode = int(errorcode.ErrorCode_MasterParamsError)
rsp.Ret.RetMsg = "action should be pause or resume"
return
}

return
}
if req.InstanceType == "" {
req.InstanceType = cubebox.InstanceType_cubebox.String()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here. The caller already set default value?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@chenhengqi chenhengqi left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments, but this PR looks good overall.

@chenhengqi

Copy link
Copy Markdown
Collaborator

One more request, please add an example for this feature. Thanks.

@chengjoey chengjoey force-pushed the feat/lifecycle-kill branch from 87b0ad1 to 014361a Compare June 25, 2026 03:30
@chengjoey

Copy link
Copy Markdown
Contributor Author

One more request, please add an example for this feature. Thanks.

done, auto-kill.py added

…end_at

Implement the timeout-then-kill side of e2b's lifecycle.on_timeout semantics,
expose the missing set_timeout/refresh endpoints on CubeMaster, and surface
a deterministic end_at across the whole stack (CubeMaster → sidecar →
CubeProxy → CubeAPI/SDK).

CubeMaster:
- Add POST /cube/sandbox/timeout (SetTimeout) and POST /cube/sandbox/refresh
  (Refresh) handlers; both rebase CreatedAt and overwrite TimeoutSeconds /
  EndAt on the lifecycle meta, then publish an OpUpdate event.
- Persist EndAt = CreatedAt + TimeoutSeconds*1000 in SandboxLifecycleMeta
  at create time and expose it on /sandbox/info and /sandbox/list responses
  (SandboxData.EndAt / SandboxBriefData.EndAt).
- Introduce a sandbox.TimeoutProvider seam wired by lifecycle.Init so the
  HTTP service can mutate Redis meta without taking a hard dependency on
  pkg/lifecycle.
- Add Store.PublishUpdate / Store.LoadMeta and the OpUpdate stream op so
  every sidecar replica converges on the new view.
- Propagate kill_reason (request | timeout | orphaned) through
  DeleteCubeSandboxReq into the cubelet destroy annotation.

CubeProxy sidecar:
- Sweeper no longer short-circuits on !AutoPause; it now picks Pause
  (auto_pause=true) or Kill (auto_pause=false, default) when the idle
  threshold is exceeded, mirroring e2b's on_timeout=pause|kill.
- Add cubemasterclient.Kill (DELETE /cube/sandbox with kill_reason) and the
  KillReason{Request,Timeout,Orphaned} constants.
- Extend the terminal-state fast path to cover "killing"/"killed" so we
  don't spam RPCs after a kill is in flight.
- Handle the new OpUpdate event: upsert registry, reset LastActive to 0
  (so the idle clock restarts), and push the refreshed meta to CubeProxy.
- Add registry.ResetLastActive helper.

CubeProxy Lua gate:
- sandbox_state.gate now returns 410 Gone for killing/killed (cube_retcode
  310410) so SDK clients fail fast instead of retrying a doomed sandbox.

CubeAPI (Rust):
- Mark SandboxTimeoutRequest / SandboxRefreshRequest as implemented; decode
  end_at via a new deserialize_optional_datetime that accepts null,
  RFC 3339 strings, or unix-millis integers/strings.
- Surface end_at on GetSandboxResponse from the new master field.

Signed-off-by: joeyczheng <joeyczheng@tencent.com>
@chengjoey chengjoey force-pushed the feat/lifecycle-kill branch from 014361a to 2cbfac1 Compare June 25, 2026 03:37
| `read.py` | `sandbox.files.read()` — read a file from the sandbox filesystem |
| `pause.py` | `sandbox.pause()` / `sandbox.connect()` — snapshot and restore |
| `auto_resume.py` | `lifecycle={"on_timeout": "pause", "auto_resume": True}` — let the platform pause idle sandboxes and resume them on the next request |
| `auto_kill.py` | `lifecycle={"on_timeout": "kill"}` — let the platform tear down idle sandboxes (the default — destruction is final, no snapshot kept) |

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no snapshot kept?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants