fix(serverless): use executionTimeoutMs as runsync client timeout by deanq · Pull Request #272 · runpod/flash

deanq · 2026-03-15T04:16:30Z

Summary

ServerlessResource.runsync() hardcoded a 60s HTTP client timeout regardless of executionTimeoutMs, causing GPU inference jobs exceeding 1 minute to always fail with a client-side timeout
Now uses executionTimeoutMs / 1000 as the client timeout when set (non-zero, positive), falling back to 60s default

Changes

File	Change
`core/resources/serverless.py`	Derive client timeout from `self.executionTimeoutMs` with explicit `is not None and > 0` guard
`tests/unit/resources/test_serverless.py`	3 tests: custom timeout, zero fallback, and None fallback

Test plan

test_runsync_uses_execution_timeout — verifies executionTimeoutMs=300000 results in timeout=300
test_runsync_default_timeout_when_zero — verifies executionTimeoutMs=0 falls back to timeout=60
test_runsync_default_timeout_when_none — verifies executionTimeoutMs=None falls back to timeout=60
Full make quality-check passes (85.40% coverage)

Closes AE-2375

Copilot

Pull request overview

This PR fixes ServerlessResource.runsync() always using a hardcoded 60s client timeout by deriving the HTTP client timeout from executionTimeoutMs (when configured), preventing long-running GPU inference jobs from failing client-side.

Changes:

Add a module-level default runsync timeout constant (60s).
Compute the runsync HTTP client timeout from executionTimeoutMs / 1000 when non-zero; otherwise fall back to the default.
Add unit tests covering configured timeout and default fallbacks.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
`src/runpod_flash/core/resources/serverless.py`	Derives runsync client timeout from `executionTimeoutMs` with a 60s fallback constant.
`tests/unit/resources/test_serverless.py`	Adds async unit tests asserting the timeout passed to `rp_client.post()` for set/zero/None `executionTimeoutMs`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

src/runpod_flash/core/resources/serverless.py

tests/unit/resources/test_serverless.py

runpod-Henrik

1. The fix — correct

Clean implementation. The guard is not None and > 0 correctly rejects None, 0, and negative values, all falling back to the 60s default. No issues with the core logic.

2. Question: no buffer between server limit and client timeout

executionTimeoutMs=300_000 → client timeout of exactly 300.0s. The RunPod /runsync endpoint will kill the job at 300_000ms server-side, then return a response. If there's any network overhead between the server killing the job and the response arriving at the client, the client HTTP timeout fires first — same failure mode as before, just pushed out by however long the job runs.

A small buffer is conventional here:

timeout_s: float = (
    self.executionTimeoutMs / 1000 + 10   # +10s for network round-trip
    if self.executionTimeoutMs is not None and self.executionTimeoutMs > 0
    else DEFAULT_RUNSYNC_TIMEOUT_S
)

Not blocking — the fix is still a massive improvement over the hardcoded 60s — but worth a decision either way. If intentionally no buffer, a comment explaining why would close the question.

3. Test gap: negative `executionTimeoutMs` not covered

The guard rejects executionTimeoutMs < 0 (falls back to 60s), and commit edc425c explicitly called this out, but there's no test for it. A test would lock in that behavior:

async def test_runsync_default_timeout_when_negative(self):
    resource = ServerlessResource(name="test-neg", executionTimeoutMs=-1)
    resource.id = "ep-neg"
    # ... assert timeout=60

4. Scope note: `run()` polling path unaffected

This fix is specific to runsync. The async run() → polling path uses a different timeout model (RunPod SDK's await job.result() with its own polling interval). Not a regression here, just worth noting if AE-2375 was originally reported against both paths.

5. Known gap (pre-existing, not introduced here)

From the known-bugs register: execution_timeout_ms is not enforced server-side — workers ignore the configured timeout. This PR fixes the client-side mismatch only. If a job runs past executionTimeoutMs, the server doesn't kill it, so the client will now correctly wait the full configured duration. The server-side enforcement gap remains open separately.

Verdict

PASS. The fix resolves the stated bug correctly. One ask: decide on the buffer question (#2) and either add +10 or add a comment explaining the no-buffer choice.

🤖 Reviewed by Henrik's AI-Powered Bug Finder

runpod-Henrik

Follow-up on prior review

The explicit is not None and > 0 guard and DEFAULT_RUNSYNC_TIMEOUT_S = 60 constant are clean. The three test cases cover the main branches correctly.

The buffer question from the prior review is still open — executionTimeoutMs / 1000 gives exactly the server-side limit with no room for network round-trip. Still non-blocking, but worth noting.

Two minor gaps the tests don't cover:

Negative value (executionTimeoutMs=-100) — the > 0 guard correctly falls back to 60s but there's no test for it
Very small value (executionTimeoutMs=500) → timeout=0.5s — passes the guard and would cause an immediate client timeout on any real request; no validation prevents it

Verdict: PASS WITH NITS

Core fix is correct. The gaps above are low-risk in practice.

🤖 Reviewed by Henrik's AI-Powered Bug Finder

runsync() hardcoded a 60s HTTP client timeout regardless of executionTimeoutMs, causing GPU inference jobs exceeding 1 minute to fail with a client-side timeout even when the server allowed more time. Now uses executionTimeoutMs / 1000 when set, falling back to 60s default. Closes AE-2375

- Move DEFAULT_RUNSYNC_TIMEOUT_S to module-level constant - Add test for executionTimeoutMs=None fallback

Replace truthiness check with `is not None and > 0` to reject negative values. Annotate timeout_s as float for clarity.

deanq requested a review from Copilot March 15, 2026 04:25

Copilot started reviewing on behalf of deanq March 15, 2026 04:25 View session

Copilot AI reviewed Mar 15, 2026

View reviewed changes

src/runpod_flash/core/resources/serverless.py Outdated Show resolved Hide resolved

tests/unit/resources/test_serverless.py Show resolved Hide resolved

deanq requested a review from runpod-Henrik March 15, 2026 04:32

runpod-Henrik reviewed Mar 16, 2026

View reviewed changes

deanq force-pushed the fix/AE-2375-runsync-client-timeout branch from edc425c to a3d5e7d Compare March 17, 2026 20:00

runpod-Henrik reviewed Mar 18, 2026

View reviewed changes

deanq added 3 commits March 19, 2026 11:02

fix(serverless): address review feedback on runsync timeout

ffe5ecd

- Move DEFAULT_RUNSYNC_TIMEOUT_S to module-level constant - Add test for executionTimeoutMs=None fallback

fix(serverless): use explicit guard for executionTimeoutMs

6a22440

Replace truthiness check with `is not None and > 0` to reject negative values. Annotate timeout_s as float for clarity.

deanq force-pushed the fix/AE-2375-runsync-client-timeout branch from a3d5e7d to 6a22440 Compare March 19, 2026 18:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(serverless): use executionTimeoutMs as runsync client timeout#272

fix(serverless): use executionTimeoutMs as runsync client timeout#272
deanq wants to merge 3 commits intomainfrom
fix/AE-2375-runsync-client-timeout

deanq commented Mar 15, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

runpod-Henrik left a comment

Uh oh!

runpod-Henrik left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

deanq commented Mar 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

runpod-Henrik left a comment

Choose a reason for hiding this comment

1. The fix — correct

2. Question: no buffer between server limit and client timeout

3. Test gap: negative executionTimeoutMs not covered

4. Scope note: run() polling path unaffected

5. Known gap (pre-existing, not introduced here)

Verdict

Uh oh!

runpod-Henrik left a comment

Choose a reason for hiding this comment

Follow-up on prior review

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

deanq commented Mar 15, 2026 •

edited

Loading

3. Test gap: negative `executionTimeoutMs` not covered

4. Scope note: `run()` polling path unaffected