Add auto-pause/auto-resume capability across the stack#613
Conversation
| end | ||
| local got = ngx.var.http_x_cube_admin_token | ||
| if got ~= expected then | ||
| reply_error(ngx.HTTP_FORBIDDEN, "admin token mismatch") |
There was a problem hiding this comment.
check_token() calls reply_error() which calls ngx.exit(), but does not return it. While ngx.exit() throws a Lua error and normally aborts execution, this is inconsistent with every other error guard in the same file (lines 79, 116-118, 121 all use return reply_error(...)). If this handler is ever wrapped in pcall(), execution would fall through to the route dispatch and potentially return a 404 instead of the intended 403.
| reply_error(ngx.HTTP_FORBIDDEN, "admin token mismatch") | |
| if got ~= expected then | |
| return reply_error(ngx.HTTP_FORBIDDEN, "admin token mismatch") | |
| end |
| -- For the dict size we set (8m → ~10w entries) the sidecar should pull | ||
| -- often enough that any single response stays bounded; if not, the | ||
| -- sidecar must use ?since= incrementally. | ||
| local keys = LAST:get_keys(0) |
There was a problem hiding this comment.
get_keys(0) returns at most 1024 keys — OpenResty's lua-resty-core caps this regardless of value. The comment acknowledges this ("we iterate until exhausted") but the code does NOT implement the iteration loop. If the dict holds more than 1024 entries (the 8m dict supports ~100k entries at ~80B each), entries beyond the first 1024 are silently dropped from the response. The sidecar's last_active poll (every 5s) will miss timestamps, potentially triggering premature auto-pause of active sandboxes. Either paginate with a while loop over incremental get_keys(1024) calls, or document the 1024-entry cap.
| zap.Bool("auto_resume", ev.Meta.AutoResume), | ||
| zap.Int("timeout_seconds", ev.Meta.TimeoutSeconds), | ||
| zap.Int("registry_size", reg.Len())) | ||
| if err := push.UpsertMeta(ctx, *ev.Meta); err != nil { |
There was a problem hiding this comment.
When push.UpsertMeta() fails (or push.DeleteMeta() at line 231), the error is logged but handleEvent returns no error to the caller. consumeStream then unconditionally ACKs the stream entry (line 196). This means a transient CubeProxy push failure causes permanent event loss — the stream entry is acknowledged and never replayed.
The HGETALL bootstrap at startup provides eventual recovery, but only on sidecar restart. During normal operation, a brief network hiccup between the sidecar and CubeProxy silently drops metadata updates, leaving the proxy unaware of new sandboxes (or still tracking deleted ones).
Consider: either (a) don't ACK until the push succeeds (with retry), or (b) push to a local retry queue.
| r.calls[sandboxID] = c | ||
| r.mu.Unlock() | ||
|
|
||
| defer func() { |
There was a problem hiding this comment.
The deferred cleanup deletes from calls before closing c.done. This creates a brief window where a concurrent caller to Resume() can find no entry in the map (just deleted), create a new call object, and start a second doResume() RPC — while existing waiters on the original c.done haven't yet been woken up.
CubeMaster is idempotent, so correctness is preserved, but this wastes one RPC per concurrent burst per sandbox. Fix: close(c.done) first, then delete from the map:
| defer func() { | |
| defer func() { | |
| close(c.done) | |
| r.mu.Lock() | |
| delete(r.calls, sandboxID) | |
| r.mu.Unlock() | |
| }() |
| // Run blocks until ctx is cancelled or ListenAndServe returns. | ||
| func (s *Server) Run(ctx context.Context) error { | ||
| mux := http.NewServeMux() | ||
| mux.HandleFunc("/internal/resume", s.handleResume) |
There was a problem hiding this comment.
The sidecar's /internal/resume, /healthz, and /readyz endpoints have zero authentication. While the sidecar defaults to listening on 127.0.0.1:8083 (loopback), the config allows operators to bind to any address via CUBE_SIDECAR_LISTEN_ADDR. If bound to 0.0.0.0 or a routable IP, any network peer can trigger resume RPCs to CubeMaster or read registry metadata.
Consider adding an optional token check (similar to admin_phase.lua's X-Cube-Admin-Token), or document this security boundary prominently in the sidecar config.
| end | ||
| local got = ngx.var.http_x_cube_admin_token | ||
| if got ~= expected then | ||
| reply_error(ngx.HTTP_FORBIDDEN, "admin token mismatch") |
There was a problem hiding this comment.
check_token() calls reply_error() which calls ngx.exit(), but does not return it. While ngx.exit() throws a Lua error and normally aborts execution, this is inconsistent with every other error guard in the same file (lines 79, 116-118, 121 all use return reply_error(...)). If this handler is ever wrapped in pcall(), execution would fall through to the route dispatch and potentially return a 404 instead of the intended 403.
| reply_error(ngx.HTTP_FORBIDDEN, "admin token mismatch") | |
| if got ~= expected then | |
| return reply_error(ngx.HTTP_FORBIDDEN, "admin token mismatch") | |
| end |
| -- For the dict size we set (8m → ~10w entries) the sidecar should pull | ||
| -- often enough that any single response stays bounded; if not, the | ||
| -- sidecar must use ?since= incrementally. | ||
| local keys = LAST:get_keys(0) |
There was a problem hiding this comment.
get_keys(0) returns at most 1024 keys — OpenResty's lua-resty-core caps this regardless of value. The comment acknowledges this ("we iterate until exhausted") but the code does NOT implement the iteration loop. If the dict holds more than 1024 entries (the 8m dict supports ~100k entries at ~80B each), entries beyond the first 1024 are silently dropped from the response. The sidecar's last_active poll (every 5s) will miss timestamps, potentially triggering premature auto-pause of active sandboxes. Either paginate with a while loop over incremental get_keys(1024) calls, or document the 1024-entry cap.
| zap.Bool("auto_resume", ev.Meta.AutoResume), | ||
| zap.Int("timeout_seconds", ev.Meta.TimeoutSeconds), | ||
| zap.Int("registry_size", reg.Len())) | ||
| if err := push.UpsertMeta(ctx, *ev.Meta); err != nil { |
There was a problem hiding this comment.
When push.UpsertMeta() fails (or push.DeleteMeta() at line 231), the error is logged but handleEvent returns no error to the caller. consumeStream (line 196-197) then unconditionally ACKs the stream entry. This means a transient CubeProxy push failure causes permanent event loss — the stream entry is acknowledged and never replayed.
The HGETALL bootstrap path at startup provides eventual recovery, but only on sidecar restart. During normal operation, a brief network hiccup between the sidecar and CubeProxy silently drops metadata updates, leaving the proxy unaware of new sandboxes (or still tracking deleted ones).
Consider: either (a) don't ACK until the push succeeds, retrying on failure, or (b) push to a local retry queue for reconciliation.
| r.calls[sandboxID] = c | ||
| r.mu.Unlock() | ||
|
|
||
| defer func() { |
There was a problem hiding this comment.
The deferred cleanup deletes from calls before closing c.done. This creates a brief window where a concurrent caller to Resume() can find no entry in the map (just deleted), create a new call object, and start a second doResume() RPC — while existing waiters on the original c.done haven't yet been woken up.
The comment in the PR body notes CubeMaster is idempotent, so correctness is preserved, but this wastes one RPC per concurrent burst per sandbox. Fix: close(c.done) first, then delete from the map:
| defer func() { | |
| defer func() { | |
| close(c.done) | |
| r.mu.Lock() | |
| delete(r.calls, sandboxID) | |
| r.mu.Unlock() | |
| }() |
| // Run blocks until ctx is cancelled or ListenAndServe returns. | ||
| func (s *Server) Run(ctx context.Context) error { | ||
| mux := http.NewServeMux() | ||
| mux.HandleFunc("/internal/resume", s.handleResume) |
There was a problem hiding this comment.
The sidecar's /internal/resume, /healthz, and /readyz endpoints have zero authentication. While the sidecar defaults to listening on 127.0.0.1:8083 (loopback), the config allows operators to bind to any address via CUBE_SIDECAR_LISTEN_ADDR. If bound to 0.0.0.0 or a routable IP, any network peer can:
- Trigger resume RPCs to CubeMaster (
/internal/resume?sandbox_id=...) - Read registry metadata (
/readyzreturnsregistry_len)
Consider adding an optional token check similar to how admin_phase.lua uses X-Cube-Admin-Token, or document this security boundary prominently.
Implement an e2b-compatible auto-pause/auto-resume feature end-to-end:
sandboxes opted in via the SDK's `lifecycle` config are paused after
their idle timeout and transparently resumed on the next request, with
no caller visible to the unwinding/winding work. The feature is
always-on at deploy time — opt-in is per-sandbox, not per-cluster.
Architecture:
CubeMaster create/destroy hooks publish a Redis HSet snapshot
(cube:sandbox:meta) plus an append-only event stream
(cube:sandbox:events) into the RedisWrite pool — the
same pool localcache already uses for the bypass-host
proxy map, so no extra Redis configuration is required.
Writes are warn-only: a Redis hiccup never fails a
sandbox lifecycle op.
CubeProxy log_phase stamps per-sandbox last-active into a worker
shared dict; rewrite_phase consults a state dict and
either passes traffic, returns 503+Retry-After, or
fires an internal sub-request to the sidecar's
/internal/resume before forwarding. A loopback admin
server (127.0.0.1:8082) lets the sidecar push state
and pull last-active on a poll. The /_sidecar_resume
location appends ?$args to proxy_pass and forces
Content-Length: 0 — both load-bearing fixes for the
sub-request hand-off.
cube-proxy- New Go component at CubeProxy/sidecar/, bundled inside
sidecar the cube-proxy container image. The binary is built
static (CGO=0, -tags 'netgo osusergo') for the musl-based Alpine
runtime. At startup it bootstraps from the meta HSet,
consumes the event stream via XREADGROUP, sweeps the
registry on a fixed interval, and drives CubeMaster's
/cube/sandbox/update for pause/resume. Concurrent
resumes are coalesced in-process; cross-replica
coordination uses Redis SETNX state locks. Failure
paths are explicit:
* not-found (ret_code 130483) → evict registry +
proxy meta, no retry.
* already-in-state (130490) → reconcile + treat as
success.
* generic RPC failure → roll back, sweep retries.
The sweeper has a bootstrap-warmup window so a fresh
sidecar doesn't pause everything before its first
last_active poll lands.
CubeAPI & SDK gains a single `lifecycle` kwarg on Sandbox.create
Python SDK matching e2b verbatim:
lifecycle={"on_timeout": "pause"|"kill",
"auto_resume": bool}
Wire shape is the camelCase nested form e2b uses
(`lifecycle.onTimeout`, `lifecycle.autoResume`) — drop-
in compatibility for existing e2b clients. CubeAPI
translates the nested object into CubeMaster's two
internal bools so the master-side protocol is
unchanged. When `lifecycle` is unset, payloads stay
byte-identical to pre-feature behaviour; regression
tests lock the absent-field shape down.
Deployment deploy/one-click renders 8 CUBE_SIDECAR_* env vars
(Redis, CubeMaster URL, listen addr, idle timeout,
admin token, …) and 2 nginx-side knobs
($cube_sidecar_addr, $cube_admin_token) into the
cube-proxy compose file and global.conf. CubeMaster
URL defaults to :8089 (the canonical http_port). No
feature-flag env var: lifecycle is always live in the
control plane; opt-in happens per-sandbox via the
SDK's `lifecycle` argument.
Documentation:
* docs/guide/lifecycle.md + zh translation under "Core Concepts" /
"核心概念" — covers the state machine, every lifecycle method,
auto-pause/auto-resume semantics, and operational notes.
* examples/code-sandbox-quickstart/auto-resume.py — end-to-end TUI
demo (paired with the existing pause.py) verifying state survives
the auto-pause / auto-resume cycle across kernel memory and the
filesystem.
Tests:
* CubeMaster: lifecycle store + multi-hook chain (Go unit, race-
clean). Pool resolution falls back through RedisWrite cleanly.
* CubeProxy: nginx -t against openresty:1.21.4.1 with the rendered
global.conf; luajit -b on every touched .lua file.
* Sidecar: 34 unit tests across cubemasterclient/registry/
sweeper/resumer/proxypush/httpapi/lifecycle. Failure-
code reconciliation, bootstrap-warmup gating,
terminal-paused ownership, in-flight de-dup, peer-
lock waits, and HTTP path semantics are all covered;
deps injected via interfaces so no live Redis is
required, -race green.
* CubeAPI: inbound lifecycle deserialization → CubeMaster bool
translation across all five meaningful shape
combinations; outbound CreateSandboxRequest wire-shape
snapshot.
* Python SDK: 6 lifecycle wire-format snapshots (default omitted,
pause-only, pause+resume, kill explicit,
auto_resume-only, invalid on_timeout raises); full
suite (172 tests) still passes.
Signed-off-by: Hengqi Chen <hengqi.chen@gmail.com>
| end | ||
| local got = ngx.var.http_x_cube_admin_token | ||
| if got ~= expected then | ||
| reply_error(ngx.HTTP_FORBIDDEN, "admin token mismatch") |
There was a problem hiding this comment.
Critical: check_token may not abort when lua_safe_exit is enabled
reply_error → reply → ngx.exit() is relied upon to abort the request, but if lua_safe_exit on is set (which prevents ngx.exit() from raising an error), check_token() silently returns to dispatch() which continues processing the request as if auth succeeded.
Add return after the call for defense-in-depth:
if got ~= expected then
reply_error(ngx.HTTP_FORBIDDEN, "admin token mismatch")
return
end| -- get_keys(0) returns up to 1024 keys; we iterate until exhausted. | ||
| -- For the dict size we set (8m → ~10w entries) the sidecar should pull | ||
| -- often enough that any single response stays bounded; if not, the | ||
| -- sidecar must use ?since= incrementally. |
There was a problem hiding this comment.
Critical: get_keys(0) caps at 1024 entries — wide-scale last_active tracking silently broken
get_keys(0) returns at most 1024 keys from pseudo-random hash-table positions. With 90k entries in cube_sandbox_last_active, the sidecar's PullLastActive never learns about ~89k sandboxes. The sweeper then falls back to CreatedAt for those invisible sandboxes and incorrectly pauses actively-trafficked ones.
Call get_keys() with a count large enough for the full dict (e.g. get_keys(LAST:capacity())), or scan the dict in chunks.
| # Sizing rationale: ~80B per entry on the active dict supports >>10w live | ||
| # sandboxes per CubeProxy worker pool, with healthy headroom for keys and | ||
| # bookkeeping. Meta entries are JSON, ~512B nominal. | ||
| lua_shared_dict cube_sandbox_meta 16m; |
There was a problem hiding this comment.
Sizing comment claims >>10w but 16m supports ~28k entries at 512B/entry
cube_sandbox_meta 16m with ~512B JSON payloads + key + overhead ≈ 600B/entry → ~28k entries max, not 100k+. The sizing rationale comment is misleading. LRU evictions will kick in far sooner than expected.
Consider bumping to 64m, or at minimum correct the comment to reflect actual capacity.
| -- launch their own resume sub-requests. The sidecar will push the | ||
| -- authoritative value via /admin/state shortly after, which simply | ||
| -- overwrites this. | ||
| states:set(ins_id, "running") |
There was a problem hiding this comment.
Cross-worker optimization comment should document limitation
ngx.shared dicts are per-worker, not shared across nginx workers. This optimistic set only deduplicates requests on the same worker. A request hitting a different worker will still fire a duplicate resume sub-request. Correctness is preserved (the sidecar deduplicates via Redis lock), but the comment should clarify this is a best-effort optimization scoped to a single worker.
| // Run blocks until ctx is cancelled or ListenAndServe returns. | ||
| func (s *Server) Run(ctx context.Context) error { | ||
| mux := http.NewServeMux() | ||
| mux.HandleFunc("/internal/resume", s.handleResume) |
There was a problem hiding this comment.
High: /internal/resume endpoint has no authentication
The sidecar's HTTP endpoint accepts any POST with a valid sandbox_id. The only protection is the nginx internal directive upstream. If the listen address is ever changed to 0.0.0.0 (or if the admin port is exposed), this becomes remotely exploitable — any caller can trigger resume RPCs to CubeMaster for any sandbox.
At minimum, verify the client is talking from loopback in the handler, or add optional shared-secret auth matching the admin token pattern.
| } | ||
| return &Sweeper{o: o} | ||
| } | ||
|
|
There was a problem hiding this comment.
Both sweeper and pollLastActive default to 5s intervals without jitter
In multi-replica deployments, all sidecars start their sweeps simultaneously, creating Redis GET bursts. Consider adding startup jitter (random initial delay up to Interval) so sidecars desynchronize naturally.
| } | ||
| defer resp.Body.Close() | ||
|
|
||
| respBody, err := io.ReadAll(io.LimitReader(resp.Body, 1<<20)) |
There was a problem hiding this comment.
1MB response limit will truncate full last_active response
If PullLastActive returns all entries from cube_sandbox_last_active (e.g. ~90k), the JSON response would be ~5MB. The io.LimitReader(resp.Body, 1<<20) cap will truncate this, causing JSON parse failures.
Either increase the limit to 8MB for the PullLastActive path, or switch to streaming JSON parsing.
| } | ||
|
|
||
| // Get returns a copy of the entry for inspection. Returns nil when absent. | ||
| func (r *Registry) Get(sandboxID string) *Entry { |
There was a problem hiding this comment.
Snapshot() sort is unnecessary overhead for sweeper iteration
The sort adds O(n log n) work and holds the RWMutex read lock for its duration (~20–30ms for 100k entries). The sweeper iterates entries independently — iteration order has no functional impact.
Consider removing the sort and iterating the map directly. The copy semantics (returning value types) are preserved.
| c.RedisDB = n | ||
| } | ||
| } | ||
| if v := os.Getenv("CUBE_SIDECAR_PROXY_ADMIN_URLS"); v != "" { |
There was a problem hiding this comment.
Consider validating outbound URLs to prevent SSRF
CUBE_SIDECAR_PROXY_ADMIN_URLS and CUBE_SIDECAR_CUBEMASTER_URL are accepted without host/scheme validation. An attacker who controls the sidecar's env vars could redirect traffic to attacker-controlled servers, leaking the admin token and sandbox metadata.
Consider adding URL validation in Validate(): reject non-loopback CubeProxyAdminURLs, log a warning for non-HTTPS CubeMaster URLs pointing at non-loopback hosts.
| // caller can retry. Without this guard we | ||
| // would silently let through a request to a | ||
| // still-paused sandbox. | ||
| func (r *Resumer) waitForRunning(ctx context.Context, sandboxID string) error { |
There was a problem hiding this comment.
Untested edge cases in waitForRunning
The polling loop has four terminal outcomes, but only one (state == "running") is tested. The other three — key expired (!ok), peer left paused (state == "paused"), and context cancellation — are reachable production paths that should have test coverage.
Implement an e2b-compatible auto-pause/auto-resume feature end-to-end: sandboxes opted in via the SDK's
lifecycleconfig are paused after their idle timeout and transparently resumed on the next request, with no caller visible to the unwinding/winding work. The feature is always-on at deploy time — opt-in is per-sandbox, not per-cluster.Architecture:
CubeMaster create/destroy hooks publish a Redis HSet snapshot
(cube:sandbox:meta) plus an append-only event stream
(cube:sandbox:events) into the RedisWrite pool — the
same pool localcache already uses for the bypass-host
proxy map, so no extra Redis configuration is required.
Writes are warn-only: a Redis hiccup never fails a
sandbox lifecycle op.
CubeProxy log_phase stamps per-sandbox last-active into a worker
shared dict; rewrite_phase consults a state dict and
either passes traffic, returns 503+Retry-After, or
fires an internal sub-request to the sidecar's
/internal/resume before forwarding. A loopback admin
server (127.0.0.1:8082) lets the sidecar push state
and pull last-active on a poll. The /_sidecar_resume
location appends ?$args to proxy_pass and forces
Content-Length: 0 — both load-bearing fixes for the
sub-request hand-off.
cube-proxy- New Go component at CubeProxy/sidecar/, bundled inside
sidecar the cube-proxy container image. The binary is built
static (CGO=0, -tags 'netgo osusergo') for the musl-based Alpine
runtime. At startup it bootstraps from the meta HSet,
consumes the event stream via XREADGROUP, sweeps the
registry on a fixed interval, and drives CubeMaster's
/cube/sandbox/update for pause/resume. Concurrent
resumes are coalesced in-process; cross-replica
coordination uses Redis SETNX state locks. Failure
paths are explicit:
* not-found (ret_code 130483) → evict registry +
proxy meta, no retry.
* already-in-state (130490) → reconcile + treat as
success.
* generic RPC failure → roll back, sweep retries.
The sweeper has a bootstrap-warmup window so a fresh
sidecar doesn't pause everything before its first
last_active poll lands.
CubeAPI & SDK gains a single
lifecyclekwarg on Sandbox.createPython SDK matching e2b verbatim:
lifecycle={"on_timeout": "pause"|"kill",
"auto_resume": bool}
Wire shape is the camelCase nested form e2b uses
(
lifecycle.onTimeout,lifecycle.autoResume) — drop-in compatibility for existing e2b clients. CubeAPI
translates the nested object into CubeMaster's two
internal bools so the master-side protocol is
unchanged. When
lifecycleis unset, payloads staybyte-identical to pre-feature behaviour; regression
tests lock the absent-field shape down.
Deployment deploy/one-click renders 8 CUBE_SIDECAR_* env vars
(Redis, CubeMaster URL, listen addr, idle timeout,
admin token, …) and 2 nginx-side knobs
($cube_sidecar_addr, $cube_admin_token) into the
cube-proxy compose file and global.conf. CubeMaster
URL defaults to :8089 (the canonical http_port). No
feature-flag env var: lifecycle is always live in the
control plane; opt-in happens per-sandbox via the
SDK's
lifecycleargument.Documentation:
Tests:
sweeper/resumer/proxypush/httpapi/lifecycle. Failure-
code reconciliation, bootstrap-warmup gating,
terminal-paused ownership, in-flight de-dup, peer-
lock waits, and HTTP path semantics are all covered;
deps injected via interfaces so no live Redis is
required, -race green.
translation across all five meaningful shape
combinations; outbound CreateSandboxRequest wire-shape
snapshot.
pause-only, pause+resume, kill explicit,
auto_resume-only, invalid on_timeout raises); full
suite (172 tests) still passes.