fix(serverless): use executionTimeoutMs as runsync client timeout #272
Conversation
Pull request overview
This PR fixes ServerlessResource.runsync() always using a hardcoded 60s client timeout by deriving the HTTP client timeout from executionTimeoutMs (when configured), preventing long-running GPU inference jobs from failing client-side.
Changes:
- Add a module-level default runsync timeout constant (60s).
- Compute the runsync HTTP client timeout from
- Compute the runsync HTTP client timeout from `executionTimeoutMs / 1000` when non-zero; otherwise fall back to the default.
- Add unit tests covering the configured timeout and the default fallbacks.
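The derivation described above can be sketched as a standalone helper. This is a sketch only: `DEFAULT_RUNSYNC_TIMEOUT_S` and the `executionTimeoutMs` semantics mirror the PR, but the helper function itself is hypothetical, not the actual method body.

```python
from typing import Optional

DEFAULT_RUNSYNC_TIMEOUT_S: float = 60.0  # module-level fallback, as in the PR


def derive_runsync_timeout(execution_timeout_ms: Optional[int]) -> float:
    """Return the HTTP client timeout (seconds) for runsync().

    None, 0, and negative values all fall back to the 60s default;
    only a positive executionTimeoutMs is converted to seconds.
    """
    if execution_timeout_ms is not None and execution_timeout_ms > 0:
        return execution_timeout_ms / 1000
    return DEFAULT_RUNSYNC_TIMEOUT_S
```

For example, `derive_runsync_timeout(300_000)` yields `300.0`, while `0`, `None`, and `-1` all yield `60.0`.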
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `src/runpod_flash/core/resources/serverless.py` | Derives the runsync client timeout from `executionTimeoutMs` with a 60s fallback constant. |
| `tests/unit/resources/test_serverless.py` | Adds async unit tests asserting the timeout passed to `rp_client.post()` for set/zero/None `executionTimeoutMs`. |
runpod-Henrik
left a comment
1. The fix — correct
Clean implementation. The guard `is not None and > 0` correctly rejects `None`, `0`, and negative values, all falling back to the 60s default. No issues with the core logic.
2. Question: no buffer between server limit and client timeout
`executionTimeoutMs=300_000` → client timeout of exactly 300.0s. The RunPod `/runsync` endpoint will kill the job at 300,000ms server-side, then return a response. If there's any network overhead between the server killing the job and the response arriving at the client, the client HTTP timeout fires first — same failure mode as before, just pushed out by however long the job runs.
A small buffer is conventional here:
```python
timeout_s: float = (
    self.executionTimeoutMs / 1000 + 10  # +10s for network round-trip
    if self.executionTimeoutMs is not None and self.executionTimeoutMs > 0
    else DEFAULT_RUNSYNC_TIMEOUT_S
)
```

Not blocking — the fix is still a massive improvement over the hardcoded 60s — but worth a decision either way. If the lack of a buffer is intentional, a comment explaining why would close the question.
3. Test gap: negative executionTimeoutMs not covered
The guard rejects `executionTimeoutMs < 0` (falls back to 60s), and commit edc425c explicitly called this out, but there's no test for it. A test would lock in that behavior:
```python
async def test_runsync_default_timeout_when_negative(self):
    resource = ServerlessResource(name="test-neg", executionTimeoutMs=-1)
    resource.id = "ep-neg"
    # ... assert timeout=60
```

4. Scope note: run() polling path unaffected
This fix is specific to runsync. The async `run()` → polling path uses a different timeout model (the RunPod SDK's `await job.result()` with its own polling interval). Not a regression here, just worth noting if AE-2375 was originally reported against both paths.
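A complete version of the negative-value test from point 3 might look like the following. This is a hedged sketch: `ServerlessResource`'s internals are not reproduced here, so an `AsyncMock` stands in for `rp_client.post` and the timeout guard is reimplemented inline rather than imported from the PR.

```python
import asyncio
from unittest.mock import AsyncMock

DEFAULT_RUNSYNC_TIMEOUT_S = 60.0  # mirrors the PR's module-level constant


def derive_timeout(execution_timeout_ms):
    # Same guard as the PR: None, 0, and negatives fall back to 60s.
    if execution_timeout_ms is not None and execution_timeout_ms > 0:
        return execution_timeout_ms / 1000
    return DEFAULT_RUNSYNC_TIMEOUT_S


async def test_runsync_default_timeout_when_negative():
    # Stand-in for rp_client.post; real tests would patch the client.
    post = AsyncMock(return_value={"status": "COMPLETED"})
    execution_timeout_ms = -1  # as in ServerlessResource(name="test-neg", ...)
    await post("/ep-neg/runsync", timeout=derive_timeout(execution_timeout_ms))
    # A negative configured timeout must fall back to the 60s default.
    assert post.call_args.kwargs["timeout"] == 60.0


asyncio.run(test_runsync_default_timeout_when_negative())
```

The same `AsyncMock.call_args.kwargs` pattern is presumably what the existing set/zero/None tests use to assert the timeout passed to `rp_client.post()`.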
5. Known gap (pre-existing, not introduced here)
From the known-bugs register: execution_timeout_ms is not enforced server-side — workers ignore the configured timeout. This PR fixes the client-side mismatch only. If a job runs past executionTimeoutMs, the server doesn't kill it, so the client will now correctly wait the full configured duration. The server-side enforcement gap remains open separately.
Verdict
PASS. The fix resolves the stated bug correctly. One ask: decide on the buffer question (#2) and either add +10 or add a comment explaining the no-buffer choice.
🤖 Reviewed by Henrik's AI-Powered Bug Finder
Force-pushed from edc425c to a3d5e7d.
runpod-Henrik
left a comment
Follow-up on prior review
The explicit `is not None and > 0` guard and the `DEFAULT_RUNSYNC_TIMEOUT_S = 60` constant are clean. The three test cases cover the main branches correctly.
The buffer question from the prior review is still open — `executionTimeoutMs / 1000` gives exactly the server-side limit with no room for network round-trip. Still non-blocking, but worth noting.
Two minor gaps the tests don't cover:
- Negative value (`executionTimeoutMs=-100`) — the `> 0` guard correctly falls back to 60s, but there's no test for it
- Very small value (`executionTimeoutMs=500`) → `timeout=0.5`s — passes the guard and would cause an immediate client timeout on any real request; no validation prevents it
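One way to close both of those gaps at once (an assumption, not something this PR does) is to clamp the derived timeout to a minimum floor. The `MIN_RUNSYNC_TIMEOUT_S` value below is hypothetical:

```python
DEFAULT_RUNSYNC_TIMEOUT_S = 60.0
MIN_RUNSYNC_TIMEOUT_S = 5.0  # hypothetical floor, not part of the PR


def derive_runsync_timeout(execution_timeout_ms):
    if execution_timeout_ms is not None and execution_timeout_ms > 0:
        # Clamp tiny values (e.g. 500ms) so a real request can complete
        # instead of timing out immediately on the client side.
        return max(execution_timeout_ms / 1000, MIN_RUNSYNC_TIMEOUT_S)
    # None, 0, and negative values all fall back to the default.
    return DEFAULT_RUNSYNC_TIMEOUT_S
```

With this, `executionTimeoutMs=500` would yield a 5.0s client timeout instead of 0.5s, while negative values still fall back to 60s.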
Verdict: PASS WITH NITS
Core fix is correct. The gaps above are low-risk in practice.
🤖 Reviewed by Henrik's AI-Powered Bug Finder
runsync() hardcoded a 60s HTTP client timeout regardless of executionTimeoutMs, causing GPU inference jobs exceeding 1 minute to fail with a client-side timeout even when the server allowed more time. Now uses executionTimeoutMs / 1000 when set, falling back to 60s default. Closes AE-2375
- Move DEFAULT_RUNSYNC_TIMEOUT_S to a module-level constant
- Add test for executionTimeoutMs=None fallback
Replace truthiness check with `is not None and > 0` to reject negative values. Annotate timeout_s as float for clarity.
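The distinction in that commit matters because a bare truthiness check treats `0` (and `None`) as falsy but a negative int as truthy, so a negative value would leak through as a negative client timeout. A minimal illustration, with illustrative function names:

```python
DEFAULT = 60.0


def truthy_guard(ms):
    # Old style: `if ms` — None and 0 fall back, but -1000 slips through.
    return ms / 1000 if ms else DEFAULT


def explicit_guard(ms):
    # New style: rejects None, 0, and negatives alike.
    return ms / 1000 if ms is not None and ms > 0 else DEFAULT
```

Here `truthy_guard(-1000)` returns `-1.0` (a nonsensical client timeout), while `explicit_guard(-1000)` returns `60.0`.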
Force-pushed from a3d5e7d to 6a22440.
Summary
- `ServerlessResource.runsync()` hardcoded a 60s HTTP client timeout regardless of `executionTimeoutMs`, causing GPU inference jobs exceeding 1 minute to always fail with a client-side timeout
- Now uses `executionTimeoutMs / 1000` as the client timeout when set (non-zero, positive), falling back to the 60s default

Changes
- `core/resources/serverless.py`: derive the client timeout from `self.executionTimeoutMs` with an explicit `is not None and > 0` guard
- `tests/unit/resources/test_serverless.py`: add unit tests covering the timeout behavior

Test plan
- `test_runsync_uses_execution_timeout` — verifies `executionTimeoutMs=300000` results in `timeout=300`
- `test_runsync_default_timeout_when_zero` — verifies `executionTimeoutMs=0` falls back to `timeout=60`
- `test_runsync_default_timeout_when_none` — verifies `executionTimeoutMs=None` falls back to `timeout=60`
- `make quality-check` passes (85.40% coverage)

Closes AE-2375