Skip to content

fix(serverless): use executionTimeoutMs as runsync client timeout#272

Open
deanq wants to merge 3 commits intomainfrom
fix/AE-2375-runsync-client-timeout
Open

fix(serverless): use executionTimeoutMs as runsync client timeout#272
deanq wants to merge 3 commits intomainfrom
fix/AE-2375-runsync-client-timeout

Conversation

@deanq
Copy link
Member

@deanq deanq commented Mar 15, 2026

Summary

  • ServerlessResource.runsync() hardcoded a 60s HTTP client timeout regardless of executionTimeoutMs, causing GPU inference jobs exceeding 1 minute to always fail with a client-side timeout
  • Now uses executionTimeoutMs / 1000 as the client timeout when set (non-zero, positive), falling back to 60s default

Changes

File Change
core/resources/serverless.py Derive client timeout from self.executionTimeoutMs with explicit is not None and > 0 guard
tests/unit/resources/test_serverless.py 3 tests: custom timeout, zero fallback, and None fallback

Test plan

  • test_runsync_uses_execution_timeout — verifies executionTimeoutMs=300000 results in timeout=300
  • test_runsync_default_timeout_when_zero — verifies executionTimeoutMs=0 falls back to timeout=60
  • test_runsync_default_timeout_when_none — verifies executionTimeoutMs=None falls back to timeout=60
  • Full make quality-check passes (85.40% coverage)

Closes AE-2375

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes ServerlessResource.runsync() always using a hardcoded 60s client timeout by deriving the HTTP client timeout from executionTimeoutMs (when configured), preventing long-running GPU inference jobs from failing client-side.

Changes:

  • Add a module-level default runsync timeout constant (60s).
  • Compute the runsync HTTP client timeout from executionTimeoutMs / 1000 when non-zero; otherwise fall back to the default.
  • Add unit tests covering configured timeout and default fallbacks.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
src/runpod_flash/core/resources/serverless.py Derives runsync client timeout from executionTimeoutMs with a 60s fallback constant.
tests/unit/resources/test_serverless.py Adds async unit tests asserting the timeout passed to rp_client.post() for set/zero/None executionTimeoutMs.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

@deanq deanq requested a review from runpod-Henrik March 15, 2026 04:32
Copy link
Contributor

@runpod-Henrik runpod-Henrik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1. The fix — correct

Clean implementation. The guard is not None and > 0 correctly rejects None, 0, and negative values, all falling back to the 60s default. No issues with the core logic.


2. Question: no buffer between server limit and client timeout

executionTimeoutMs=300_000 → client timeout of exactly 300.0s. The RunPod /runsync endpoint will kill the job at 300_000ms server-side, then return a response. If there's any network overhead between the server killing the job and the response arriving at the client, the client HTTP timeout fires first — same failure mode as before, just pushed out by however long the job runs.

A small buffer is conventional here:

timeout_s: float = (
    self.executionTimeoutMs / 1000 + 10   # +10s for network round-trip
    if self.executionTimeoutMs is not None and self.executionTimeoutMs > 0
    else DEFAULT_RUNSYNC_TIMEOUT_S
)

Not blocking — the fix is still a massive improvement over the hardcoded 60s — but worth a decision either way. If intentionally no buffer, a comment explaining why would close the question.


3. Test gap: negative executionTimeoutMs not covered

The guard rejects executionTimeoutMs < 0 (falls back to 60s), and commit edc425c explicitly called this out, but there's no test for it. A test would lock in that behavior:

async def test_runsync_default_timeout_when_negative(self):
    resource = ServerlessResource(name="test-neg", executionTimeoutMs=-1)
    resource.id = "ep-neg"
    # ... assert timeout=60

4. Scope note: run() polling path unaffected

This fix is specific to runsync. The async run() → polling path uses a different timeout model (RunPod SDK's await job.result() with its own polling interval). Not a regression here, just worth noting if AE-2375 was originally reported against both paths.


5. Known gap (pre-existing, not introduced here)

From the known-bugs register: execution_timeout_ms is not enforced server-side — workers ignore the configured timeout. This PR fixes the client-side mismatch only. If a job runs past executionTimeoutMs, the server doesn't kill it, so the client will now correctly wait the full configured duration. The server-side enforcement gap remains open separately.


Verdict

PASS. The fix resolves the stated bug correctly. One ask: decide on the buffer question (#2) and either add +10 or add a comment explaining the no-buffer choice.

🤖 Reviewed by Henrik's AI-Powered Bug Finder

@deanq deanq force-pushed the fix/AE-2375-runsync-client-timeout branch from edc425c to a3d5e7d Compare March 17, 2026 20:00
Copy link
Contributor

@runpod-Henrik runpod-Henrik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up on prior review

The explicit is not None and > 0 guard and DEFAULT_RUNSYNC_TIMEOUT_S = 60 constant are clean. The three test cases cover the main branches correctly.

The buffer question from the prior review is still open — executionTimeoutMs / 1000 gives exactly the server-side limit with no room for network round-trip. Still non-blocking, but worth noting.

Two minor gaps the tests don't cover:

  • Negative value (executionTimeoutMs=-100) — the > 0 guard correctly falls back to 60s but there's no test for it
  • Very small value (executionTimeoutMs=500) → timeout=0.5s — passes the guard and would cause an immediate client timeout on any real request; no validation prevents it

Verdict: PASS WITH NITS

Core fix is correct. The gaps above are low-risk in practice.

🤖 Reviewed by Henrik's AI-Powered Bug Finder

deanq added 3 commits March 19, 2026 11:02
runsync() hardcoded a 60s HTTP client timeout regardless of
executionTimeoutMs, causing GPU inference jobs exceeding 1 minute to
fail with a client-side timeout even when the server allowed more time.

Now uses executionTimeoutMs / 1000 when set, falling back to 60s default.

Closes AE-2375
- Move DEFAULT_RUNSYNC_TIMEOUT_S to module-level constant
- Add test for executionTimeoutMs=None fallback
Replace truthiness check with `is not None and > 0` to reject negative
values. Annotate timeout_s as float for clarity.
@deanq deanq force-pushed the fix/AE-2375-runsync-client-timeout branch from a3d5e7d to 6a22440 Compare March 19, 2026 18:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants