Skip to content

fix: stop uploads hanging forever when the connection drops mid-transfer#235

Open
pkeuter wants to merge 5 commits into
minio:mainfrom
pkeuter:fix/upload-hang-on-connection-drop
Open

fix: stop uploads hanging forever when the connection drops mid-transfer#235
pkeuter wants to merge 5 commits into
minio:mainfrom
pkeuter:fix/upload-hang-on-connection-drop

Conversation

@pkeuter

@pkeuter pkeuter commented Jun 23, 2026

Copy link
Copy Markdown

The libcurl multi-pump loop called select() with a nullptr timeout (block forever). A connection that drops mid-upload without a clean RST/FIN never makes the socket ready, so select() blocked indefinitely and the synchronous Execute() wedged the calling thread permanently -- no further uploads could run. Normal uploads also had no timeout configured (only the RDMA control plane set one), so libcurl had nothing to enforce even when woken.

  • Bound the select() wait to 1s so the loop keeps pumping libcurl.
  • Handle maxfd < 0 with a short poll-sleep instead of an invalid select() call (which also errors out / terminates on Windows).
  • Set a default low-speed timeout (1 byte/s for 60s) so a stalled transfer aborts with CURLE_OPERATION_TIMEDOUT. Gated on timeout_secs <= 0 to leave the RDMA control plane's explicit timeouts unchanged.

Summary by CodeRabbit

  • Bug Fixes
    • Fixed scenarios where HTTP requests could hang indefinitely during stalled network transfers. The request now applies a low-speed/no-progress timeout so stalled connections fail gracefully.
    • Updated the post-transfer socket waiting to use bounded polling rather than potentially unbounded waiting.
    • Improved failure reporting when a request completes without receiving any response data by returning a clearer diagnostic error message.

The libcurl multi-pump loop called select() with a nullptr timeout (block
forever). A connection that drops mid-upload without a clean RST/FIN never
makes the socket ready, so select() blocked indefinitely and the synchronous
Execute() wedged the calling thread permanently -- no further uploads could
run. Normal uploads also had no timeout configured (only the RDMA control
plane set one), so libcurl had nothing to enforce even when woken.

- Bound the select() wait to 1s so the loop keeps pumping libcurl.
- Handle maxfd < 0 with a short poll-sleep instead of an invalid select()
  call (which also errors out / terminates on Windows).
- Set a default low-speed timeout (1 byte/s for 60s) so a stalled transfer
  aborts with CURLE_OPERATION_TIMEDOUT. Gated on timeout_secs <= 0 to leave
  the RDMA control plane's explicit timeouts unchanged.
@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@pkeuter, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 24 minutes and 51 seconds. Learn how PR review limits work.

To continue reviewing without waiting, enable usage-based billing in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 741a5297-ddac-40e5-a7dd-c043974dc23d

📥 Commits

Reviewing files that changed from the base of the PR and between c34e26c and dd58c10.

📒 Files selected for processing (1)
  • include/miniocpp/http.h
📝 Walkthrough

Walkthrough

src/http.cc adds <chrono> and <thread>, defines a 60-second stall timeout, enables libcurl low-speed aborts when no explicit timeout is set, and changes the request polling loop to use a bounded select() wait with a brief sleep when no file descriptors are available.

Changes

Stall timeout and bounded select loop

Layer / File(s) Summary
Includes and stall timeout constant
src/http.cc
Adds <chrono> and <thread> headers and defines kStallTimeoutSecs = 60 as the internal low-speed stall threshold constant.
Low-speed abort config
src/http.cc
When timeout_secs <= 0, sets CURLOPT_LOW_SPEED_LIMIT and CURLOPT_LOW_SPEED_TIME to abort stalled transfers after 60 seconds.
Bounded poll loop and empty-response error
src/http.cc
Reworks the synchronous poll loop to rebuild fd_sets each iteration, use a 1-second select() timeout, sleep 100ms if no fds are available, retry EINTR on non-Windows, and set a diagnostic error when the transfer ends without response bytes.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐇 Hoppity-hop through the network stream,
No more forever-waiting dreams.
Sixty seconds still? Then off we go,
A bounded select keeps progress in flow.
The rabbit twitches a happy nose 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly matches the main fix: preventing uploads from hanging indefinitely after a mid-transfer connection drop.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@harshavardhana

Copy link
Copy Markdown
Member

@coderabbitai review

@harshavardhana

Copy link
Copy Markdown
Member

How is this different than? #207

@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@jiuker

jiuker commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Please fix lint. Code is good

@pkeuter

pkeuter commented Jun 26, 2026

Copy link
Copy Markdown
Author

@harshavardhana #207 only stops the loop from crashing/spinning when there are no active sockets (maxfd == -1); it doesn't actually abort a transfer whose socket stays open but goes silent after the connection drops. My fix handles that case too by adding curl's LOW_SPEED_LIMIT/LOW_SPEED_TIME so the dead transfer is actively timed out, bounding select() alone (as in #207) just wakes the loop without giving it any reason to terminate the stalled request.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/http.cc (1)

542-545: 🩺 Stability & Availability | 🟠 Major

select() crash on EINTR (signal interruption) crashes the entire process

The current code terminates on any select() failure, including errno == EINTR (interrupted by signal). On POSIX systems, EINTR is a recoverable condition - the wait was simply interrupted and should be retried. Calling std::terminate() here crashes the entire library whenever a signal arrives during Execute(), causing serious stability regressions for embedded applications.

Retry the loop on EINTR and only treat genuine errors as fatal:

Proposed fix
       timeval timeout{};
       timeout.tv_sec = 1;
       timeout.tv_usec = 0;
+#include <cerrno>
       if (select(maxfd + 1, &fdread, &fdwrite, &fdexcep, &timeout) < 0) {
+#ifndef _WIN32
+        if (errno == EINTR) continue;  // signal interrupted; retry select()
+#endif
         std::cerr << "select() failed; this should not happen" << std::endl;
         std::terminate();
       }

On Windows, select() does not set errno; use WSAGetLastError() if EINTR-like handling is needed.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/http.cc` around lines 542 - 545, The select() error handling in Execute()
is treating every failure as fatal, which causes the process to terminate on
signal interruption. Update the select(maxfd + 1, &fdread, &fdwrite, &fdexcep,
&timeout) failure path to detect EINTR and simply retry the loop instead of
calling std::terminate(), while keeping genuine non-recoverable errors fatal.
Use the existing select() wait loop in src/http.cc to apply the retry logic, and
account for platform-specific error retrieval if the code path is shared with
Windows.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/http.cc`:
- Around line 542-545: The select() error handling in Execute() is treating
every failure as fatal, which causes the process to terminate on signal
interruption. Update the select(maxfd + 1, &fdread, &fdwrite, &fdexcep,
&timeout) failure path to detect EINTR and simply retry the loop instead of
calling std::terminate(), while keeping genuine non-recoverable errors fatal.
Use the existing select() wait loop in src/http.cc to apply the retry logic, and
account for platform-specific error retrieval if the code path is shared with
Windows.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 0cd41932-4478-4786-838a-183b3c661821

📥 Commits

Reviewing files that changed from the base of the PR and between f07563a and 9644c68.

📒 Files selected for processing (1)
  • src/http.cc

@jiuker

jiuker commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

@pkeuter Could you fix #235 (review) as well?

       timeval timeout{};
       timeout.tv_sec = 1;
       timeout.tv_usec = 0;
+#include <cerrno>
       if (select(maxfd + 1, &fdread, &fdwrite, &fdexcep, &timeout) < 0) {
+#ifndef _WIN32
+        if (errno == EINTR) continue;  // signal interrupted; retry select()
+#endif
         std::cerr << "select() failed; this should not happen" << std::endl;
         std::terminate();
       }

@pkeuter

pkeuter commented Jun 26, 2026

Copy link
Copy Markdown
Author

Of course, done.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/http.cc (1)

518-553: 🩺 Stability & Availability | 🟠 Major

Confirm that a stalled/aborted transfer populates Response::error instead of returning an empty failure.

The bounded select() loop correctly prevents indefinite blocking, but the code exits the loop and returns response without checking per-handle result codes (e.g., CURLE_OPERATION_TIMEDOUT). status_code is set only by the header callback, and response.error is populated only by the write callback or curlpp exceptions. If the transfer times out before any headers or data arrive, both fields remain unset (status_code == 0, error == ""), resulting in a generic failure with no diagnostic.

Add curl_multi_info_read handling after the loop to drain CURLMSG_DONE messages and extract CURLcode from each handle. Map error codes like CURLE_OPERATION_TIMEDOUT to a descriptive message in response.error. This ensures callers receive a meaningful error instead of an empty Response::operator bool failure.

Diff context
  while (left) {
    // ... select() loop ...
    while (!requests.perform(&left)) {
    }
  }

  // TODO: Add curl_multi_info_read loop here to extract per-handle result codes.
  // Example:
  // int msgs_in_queue;
  // while ((msg = curl_multi_info_read(multi_handle, &msgs_in_queue))) {
  //   if (msg->msg == CURLMSG_DONE) {
  //     CURLcode result = msg->data.result;
  //     if (result != CURLE_OK) {
  //       // Set response.error with human-readable message
  //     }
  //   }
  // }

  return response;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/http.cc` around lines 518 - 553, After the main transfer loop in Response
handling, drain curl_multi_info_read for each CURLMSG_DONE and inspect the
CURLcode from the completed easy handle. If the result is not CURLE_OK, populate
Response::error with a descriptive message (for example, mapping
CURLE_OPERATION_TIMEDOUT and other failures) so a stalled or aborted transfer
doesn’t return an empty failure. Use the existing transfer flow around
requests.perform, Response::error, and the response return path to locate the
fix.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/http.cc`:
- Around line 518-553: After the main transfer loop in Response handling, drain
curl_multi_info_read for each CURLMSG_DONE and inspect the CURLcode from the
completed easy handle. If the result is not CURLE_OK, populate Response::error
with a descriptive message (for example, mapping CURLE_OPERATION_TIMEDOUT and
other failures) so a stalled or aborted transfer doesn’t return an empty
failure. Use the existing transfer flow around requests.perform,
Response::error, and the response return path to locate the fix.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 2e2a5a72-ed12-442d-a43e-2b5aa429beb3

📥 Commits

Reviewing files that changed from the base of the PR and between 9644c68 and ba09555.

📒 Files selected for processing (1)
  • src/http.cc

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/http.cc (1)

417-420: 🎯 Functional Correctness | 🟠 Major | 🏗️ Heavy lift

timeout_secs == 0 no longer matches the public contract.

include/miniocpp/http.h:129-136 documents timeout_secs = 0 as “use libcurl defaults”, but this branch now always installs a 60s low-speed timeout for that case. That is a behavioral contract change for every caller that leaves the field unset, so please either update the public API/docs and audit those callers in the same PR, or gate the stall timeout behind a separate knob instead of overloading the existing 0 semantics.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/http.cc` around lines 417 - 420, The timeout handling in the HTTP setup
path now treats timeout_secs == 0 as a stall timeout override, which conflicts
with the public contract documented in http.h. Update the behavior in the HTTP
configuration code (where CURLOPT_LOW_SPEED_LIMIT and CURLOPT_LOW_SPEED_TIME are
set) so that zero still means “use libcurl defaults,” or introduce a separate
setting for the stall timeout and keep timeout_secs semantics unchanged. If you
keep the new behavior, also update the public API/docs and review all callers
that rely on the default zero value.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/http.cc`:
- Around line 417-420: The timeout handling in the HTTP setup path now treats
timeout_secs == 0 as a stall timeout override, which conflicts with the public
contract documented in http.h. Update the behavior in the HTTP configuration
code (where CURLOPT_LOW_SPEED_LIMIT and CURLOPT_LOW_SPEED_TIME are set) so that
zero still means “use libcurl defaults,” or introduce a separate setting for the
stall timeout and keep timeout_secs semantics unchanged. If you keep the new
behavior, also update the public API/docs and review all callers that rely on
the default zero value.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 74c433da-caf8-41d7-a925-7a3c6fe37f35

📥 Commits

Reviewing files that changed from the base of the PR and between ba09555 and c34e26c.

📒 Files selected for processing (1)
  • src/http.cc

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants