fix: retry proxy requests when gateway crashes after startup #338

Closed
andreasjansson wants to merge 13 commits into main from ajansson/fix/gateway-crash-retry

Conversation

@andreasjansson
Member

Fixes #179

Problem

When the OpenClaw gateway process starts successfully and passes the TCP port health check, but then crashes while handling the first HTTP request, subsequent requests fail with:

Error proxying request to container: The container is not listening in the TCP address 10.0.0.1:18789

This happens because:

  1. ensureMoltbotGateway() succeeds — the port is reachable
  2. containerFetch() is called — the gateway processes the request, returns 500, and crashes
  3. The next request finds the port unreachable, but there's no recovery path

The containerFetch() call had no error handling at all — any exception would propagate as an unhandled error, returning an opaque failure to the client.

Fix

1. Retry on gateway crash (HTTP + WebSocket)

Both the HTTP proxy (containerFetch) and WebSocket proxy (wsConnect) now detect "container is not listening" errors from the Sandbox SDK. When detected:

  1. Kill the dead gateway process
  2. Restart the gateway via ensureMoltbotGateway()
  3. Retry the request once

This handles the exact pattern from #179: gateway starts → first request crashes it → retry brings it back.
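The three steps above can be sketched as a small wrapper. This is a minimal illustration, not the code from this PR: `withCrashRetry` and its parameters are hypothetical names, `recover` stands in for "kill the dead process, then call ensureMoltbotGateway()", and `isCrash` stands in for the crash detection helper.

```typescript
// Hypothetical helper: names and signature are illustrative, not taken from the PR diff.
async function withCrashRetry<T>(
  attempt: () => Promise<T>,
  recover: () => Promise<void>,
  isCrash: (err: unknown) => boolean,
): Promise<T> {
  try {
    return await attempt();
  } catch (err) {
    // Only the "gateway died" failure is retried; everything else propagates.
    if (!isCrash(err)) throw err;
    // Stand-in for: kill the dead gateway, then restart it.
    await recover();
    // Exactly one retry: a second failure reaches the caller.
    return attempt();
  }
}
```

Retrying only once keeps a gateway that crashes on every request from turning into a restart loop.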

2. Proper error handling around containerFetch

Previously, if containerFetch threw for any reason (not just crashes), the error was completely unhandled. Now all errors return structured JSON responses with appropriate status codes:

  • 503 for gateway crash + failed recovery
  • 502 for other proxy errors
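A rough sketch of how the two status codes could be chosen. The function name, the `crashRecoveryFailed` flag, and the JSON field names (`error`, `details`) are assumptions made for illustration; the PR description does not show the actual payload shape.

```typescript
// Hypothetical shape: the field names below are illustrative, not from the PR.
interface ProxyErrorResult {
  status: 502 | 503;
  body: { error: string; details: string };
}

function describeProxyError(err: unknown, crashRecoveryFailed: boolean): ProxyErrorResult {
  const details = err instanceof Error ? err.message : String(err);
  if (crashRecoveryFailed) {
    // The gateway crashed and the restart-and-retry path also failed.
    return { status: 503, body: { error: "Gateway crashed and could not be restarted", details } };
  }
  // Any other failure while proxying to the container.
  return { status: 502, body: { error: "Error proxying request to gateway", details } };
}
```

In a Worker, the result would typically be wrapped as `new Response(JSON.stringify(body), { status, headers: { "Content-Type": "application/json" } })`.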

Helper functions

  • isGatewayCrashedError() — detects Sandbox SDK errors indicating the gateway process died
  • killExistingGateway() — finds and kills the dead process so ensureMoltbotGateway() starts fresh
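For illustration, crash detection could be as simple as a substring check on the error message. The sketch below assumes the failure is recognizable by the "is not listening" fragment quoted in the error above; the real isGatewayCrashedError() may inspect other properties.

```typescript
// Assumption: the crash is recognizable by the "is not listening" fragment
// quoted in the error message; the real check may differ.
function isGatewayCrashedError(err: unknown): boolean {
  const message = err instanceof Error ? err.message : typeof err === "string" ? err : "";
  return message.includes("is not listening");
}
```

Matching on message text is brittle if the Sandbox SDK rewords the error, so an error code or class would be preferable if the SDK exposes one.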

Test plan

  • npm run build — passes
  • npm run lint — 0 warnings, 0 errors
  • npm test — 82 tests pass

@andreasjansson andreasjansson force-pushed the ajansson/fix/gateway-crash-retry branch 2 times, most recently from 7cdbcbe to c2ca3ff, March 27, 2026 11:26
Metamolty and others added 7 commits March 27, 2026 17:39
When the OpenClaw gateway process starts successfully and passes the port
health check, but then crashes while handling the first request, subsequent
containerFetch/wsConnect calls throw 'is not listening' errors with no
recovery path. The user sees HTTP 500s followed by connection failures.

This adds retry-on-crash logic to both HTTP and WebSocket proxy paths:
1. Detect 'is not listening' errors from the Sandbox SDK
2. Kill the dead gateway process
3. Restart the gateway via ensureMoltbotGateway()
4. Retry the request once

Also adds proper error handling around containerFetch (previously had no
try-catch at all), returning structured JSON errors instead of unhandled
exceptions.

Fixes #179
…Gateway

Complete the crash retry implementation:
- HTTP proxy: catch 'is not listening' errors from containerFetch,
  kill crashed gateway, restart, retry once
- WebSocket proxy: same for wsConnect
- Return structured errors (503 for crash+failed recovery, 502 for other)

Extract killGateway() into gateway/process.ts as a shared function
used by both the restart handler and the crash retry logic. Removes
duplicate kill code from index.ts and api.ts.

Tested on staging: kill gateway → next HTTP request returns 200 (retry worked).
@andreasjansson andreasjansson force-pushed the ajansson/fix/gateway-crash-retry branch from c2ca3ff to 8770607, March 27, 2026 16:45
andreasjansson and others added 6 commits March 27, 2026 18:05
The 4 e2e variants pull the sandbox base image in parallel, frequently
hitting Docker Hub rate limits. Retry up to 5 times with 60s backoff.
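The retry policy described here (fixed delay, bounded attempts) can be sketched as follows. The actual CI change is probably a shell loop around the image pull, so this TypeScript version only illustrates the policy; `retryWithDelay` is a hypothetical name, and reading "up to 5 times" as five attempts in total is an assumption.

```typescript
// Hypothetical helper; illustrates the policy only, not the CI implementation.
async function retryWithDelay<T>(
  task: () => Promise<T>,
  maxAttempts: number,
  delayMs: number,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attemptNo = 1; attemptNo <= maxAttempts; attemptNo++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      // Wait between attempts, but not after the final failure.
      if (attemptNo < maxAttempts) await sleep(delayMs);
    }
  }
  throw lastError;
}
```

With the values from the commit message, a call would look like `retryWithDelay(pullImage, 5, 60_000)`, where `pullImage` is a placeholder for whatever performs the pull.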
…291)

Fixes #289, closes #291

1. Gateway double-spawn (#289): findExistingMoltbotProcess() missed
   processes invoked as 'bash /usr/local/bin/start-openclaw.sh' (full
   path with shell prefix), causing a second spawn that fails with
   'port already in use'. Fix: broaden command matching and add a TCP
   port pre-check before spawning as a safety net.

2. Clarify sandbox.start() is NOT needed (#291): Added a comment to the
   sandbox middleware explaining why we don't call sandbox.start(). The
   SDK's containerFetch() auto-starts the container, and the catch-all
   route uses ensureMoltbotGateway() for explicit lifecycle management.
   Three separate PRs (#292, #294, #315) proposed adding sandbox.start()
   based on a misunderstanding of the API.
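As an illustration of the broadened command matching in item 1, a matcher could accept the start script whether it is invoked bare, by full path, or through a shell prefix. The function name and regular expression below are hypothetical, and the TCP port pre-check is not covered here.

```typescript
// Hypothetical matcher; the real findExistingMoltbotProcess() logic is not shown in this PR.
function isGatewayStartCommand(command: string): boolean {
  // The script name must start the string or follow whitespace or "/",
  // and must end the string or be followed by whitespace (arguments).
  return /(^|[\s/])start-openclaw\.sh(\s|$)/.test(command);
}
```

This matches `bash /usr/local/bin/start-openclaw.sh` as well as `start-openclaw.sh`, but not a different file such as `start-openclaw.sh.bak`.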
When the gateway fails to start, we need to see what /api/status is
returning. Added:
- Background debug loop in _setup that polls /api/status and logs to stderr
- 15s timeout around restoreIfNeeded calls (was potentially hanging)
- Logging in /api/status handler
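A timeout around a potentially hanging call can be sketched with `Promise.race`. `withTimeout` is a hypothetical helper name, and how the PR actually wraps restoreIfNeeded is not shown, so treat this as one possible approach.

```typescript
// Hypothetical helper: rejects if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  // Whichever settles first wins; the timer is cleared either way.
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}
```

A call such as `withTimeout(restoreIfNeeded(), 15_000, "restoreIfNeeded")` would match the 15s limit. The race only stops waiting: the underlying operation is not cancelled and may still complete in the background.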
…quest

The catch-all route and /api/status were calling restoreIfNeeded on EVERY
request, including WebSocket reconnects from the browser. If a reconnect
happened after a sync stored a backup handle, restoreIfNeeded would mount
a FUSE overlay. The next createBackup would then reset the overlay, wiping
upper-layer files (like the marker).

Fix: check if the gateway is already running FIRST. Only call
restoreIfNeeded if the gateway needs to be started.
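The reordering can be sketched as below. `GatewayDeps`, `ensureGatewayOnce`, and the injected function names are hypothetical stand-ins; only the order of operations is taken from the commit message.

```typescript
// Hypothetical dependency bundle so the ordering can be shown in isolation.
interface GatewayDeps {
  isGatewayRunning: () => Promise<boolean>;
  restoreIfNeeded: () => Promise<void>;
  startGateway: () => Promise<void>;
}

async function ensureGatewayOnce(deps: GatewayDeps): Promise<"already-running" | "started"> {
  // Check first: a running gateway means no restore and no overlay mount.
  if (await deps.isGatewayRunning()) return "already-running";
  // Restore only on the path that actually starts the gateway.
  await deps.restoreIfNeeded();
  await deps.startGateway();
  return "started";
}
```

With this order, a WebSocket reconnect against a running gateway never reaches restoreIfNeeded.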
@github-actions

E2E Test Recording (base)

✅ Tests passed

E2E Test Video

@github-actions

E2E Test Recording (workers-ai)

❌ Tests failed

E2E Test Video

@github-actions

E2E Test Recording (discord)

❌ Tests failed

E2E Test Video

@github-actions

E2E Test Recording (telegram)

❌ Tests failed

E2E Test Video

@andreasjansson
Member Author

Closing: the crash retry changes need to be re-implemented on top of #337, which now contains all the infrastructure fixes. The original commit was lost during rebase conflict resolution. Will re-add the crash retry logic to #337.

Development

Successfully merging this pull request may close these issues.

Gateway returns HTTP 500 errors and crashes immediately after startup
