fix: stop uploads hanging forever when the connection drops mid-transfer#235
fix: stop uploads hanging forever when the connection drops mid-transfer#235pkeuter wants to merge 5 commits into
Conversation
The libcurl multi-pump loop called select() with a nullptr timeout (block forever). A connection that drops mid-upload without a clean RST/FIN never makes the socket ready, so select() blocked indefinitely and the synchronous Execute() wedged the calling thread permanently -- no further uploads could run. Normal uploads also had no timeout configured (only the RDMA control plane set one), so libcurl had nothing to enforce even when woken. - Bound the select() wait to 1s so the loop keeps pumping libcurl. - Handle maxfd < 0 with a short poll-sleep instead of an invalid select() call (which also errors out / terminates on Windows). - Set a default low-speed timeout (1 byte/s for 60s) so a stalled transfer aborts with CURLE_OPERATION_TIMEDOUT. Gated on timeout_secs <= 0 to leave the RDMA control plane's explicit timeouts unchanged.
|
Warning Review limit reached
More reviews will be available in 24 minutes and 51 seconds. Learn how PR review limits work. To continue reviewing without waiting, enable usage-based billing in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits. 🚦 How do rate limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthrough
ChangesStall timeout and bounded select loop
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@coderabbitai review |
|
How is this different than? #207 |
✅ Action performedReview finished.
|
|
Please fix lint. Code is good |
|
@harshavardhana #207 only stops the loop from crashing/spinning when there are no active sockets (maxfd == -1); it doesn't actually abort a transfer whose socket stays open but goes silent after the connection drops. My fix handles that case too by adding curl's LOW_SPEED_LIMIT/LOW_SPEED_TIME so the dead transfer is actively timed out, bounding select() alone (as in #207) just wakes the loop without giving it any reason to terminate the stalled request. |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/http.cc (1)
542-545: 🩺 Stability & Availability | 🟠 Major
select()crash onEINTR(signal interruption) crashes the entire processThe current code terminates on any
select()failure, includingerrno == EINTR(interrupted by signal). On POSIX systems,EINTRis a recoverable condition - the wait was simply interrupted and should be retried. Callingstd::terminate()here crashes the entire library whenever a signal arrives duringExecute(), causing serious stability regressions for embedded applications.Retry the loop on
EINTRand only treat genuine errors as fatal:Proposed fix
timeval timeout{}; timeout.tv_sec = 1; timeout.tv_usec = 0; +#include <cerrno> if (select(maxfd + 1, &fdread, &fdwrite, &fdexcep, &timeout) < 0) { +#ifndef _WIN32 + if (errno == EINTR) continue; // signal interrupted; retry select() +#endif std::cerr << "select() failed; this should not happen" << std::endl; std::terminate(); }On Windows,
select()does not seterrno; useWSAGetLastError()if EINTR-like handling is needed.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/http.cc` around lines 542 - 545, The select() error handling in Execute() is treating every failure as fatal, which causes the process to terminate on signal interruption. Update the select(maxfd + 1, &fdread, &fdwrite, &fdexcep, &timeout) failure path to detect EINTR and simply retry the loop instead of calling std::terminate(), while keeping genuine non-recoverable errors fatal. Use the existing select() wait loop in src/http.cc to apply the retry logic, and account for platform-specific error retrieval if the code path is shared with Windows.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@src/http.cc`:
- Around line 542-545: The select() error handling in Execute() is treating
every failure as fatal, which causes the process to terminate on signal
interruption. Update the select(maxfd + 1, &fdread, &fdwrite, &fdexcep,
&timeout) failure path to detect EINTR and simply retry the loop instead of
calling std::terminate(), while keeping genuine non-recoverable errors fatal.
Use the existing select() wait loop in src/http.cc to apply the retry logic, and
account for platform-specific error retrieval if the code path is shared with
Windows.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 0cd41932-4478-4786-838a-183b3c661821
📒 Files selected for processing (1)
src/http.cc
|
@pkeuter Could you fix #235 (review) as well? timeval timeout{};
timeout.tv_sec = 1;
timeout.tv_usec = 0;
+#include <cerrno>
if (select(maxfd + 1, &fdread, &fdwrite, &fdexcep, &timeout) < 0) {
+#ifndef _WIN32
+ if (errno == EINTR) continue; // signal interrupted; retry select()
+#endif
std::cerr << "select() failed; this should not happen" << std::endl;
std::terminate();
} |
|
Of course, done. |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/http.cc (1)
518-553: 🩺 Stability & Availability | 🟠 MajorConfirm that a stalled/aborted transfer populates
Response::errorinstead of returning an empty failure.The bounded
select()loop correctly prevents indefinite blocking, but the code exits the loop and returnsresponsewithout checking per-handle result codes (e.g.,CURLE_OPERATION_TIMEDOUT).status_codeis set only by the header callback, andresponse.erroris populated only by the write callback orcurlppexceptions. If the transfer times out before any headers or data arrive, both fields remain unset (status_code == 0,error == ""), resulting in a generic failure with no diagnostic.Add
curl_multi_info_readhandling after the loop to drainCURLMSG_DONEmessages and extractCURLcodefrom each handle. Map error codes likeCURLE_OPERATION_TIMEDOUTto a descriptive message inresponse.error. This ensures callers receive a meaningful error instead of an emptyResponse::operator boolfailure.Diff context
while (left) { // ... select() loop ... while (!requests.perform(&left)) { } } // TODO: Add curl_multi_info_read loop here to extract per-handle result codes. // Example: // int msgs_in_queue; // while ((msg = curl_multi_info_read(multi_handle, &msgs_in_queue))) { // if (msg->msg == CURLMSG_DONE) { // CURLcode result = msg->data.result; // if (result != CURLE_OK) { // // Set response.error with human-readable message // } // } // } return response;🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/http.cc` around lines 518 - 553, After the main transfer loop in Response handling, drain curl_multi_info_read for each CURLMSG_DONE and inspect the CURLcode from the completed easy handle. If the result is not CURLE_OK, populate Response::error with a descriptive message (for example, mapping CURLE_OPERATION_TIMEDOUT and other failures) so a stalled or aborted transfer doesn’t return an empty failure. Use the existing transfer flow around requests.perform, Response::error, and the response return path to locate the fix.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@src/http.cc`:
- Around line 518-553: After the main transfer loop in Response handling, drain
curl_multi_info_read for each CURLMSG_DONE and inspect the CURLcode from the
completed easy handle. If the result is not CURLE_OK, populate Response::error
with a descriptive message (for example, mapping CURLE_OPERATION_TIMEDOUT and
other failures) so a stalled or aborted transfer doesn’t return an empty
failure. Use the existing transfer flow around requests.perform,
Response::error, and the response return path to locate the fix.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 2e2a5a72-ed12-442d-a43e-2b5aa429beb3
📒 Files selected for processing (1)
src/http.cc
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/http.cc (1)
417-420: 🎯 Functional Correctness | 🟠 Major | 🏗️ Heavy lift
timeout_secs == 0no longer matches the public contract.
include/miniocpp/http.h:129-136documentstimeout_secs = 0as “use libcurl defaults”, but this branch now always installs a 60s low-speed timeout for that case. That is a behavioral contract change for every caller that leaves the field unset, so please either update the public API/docs and audit those callers in the same PR, or gate the stall timeout behind a separate knob instead of overloading the existing0semantics.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/http.cc` around lines 417 - 420, The timeout handling in the HTTP setup path now treats timeout_secs == 0 as a stall timeout override, which conflicts with the public contract documented in http.h. Update the behavior in the HTTP configuration code (where CURLOPT_LOW_SPEED_LIMIT and CURLOPT_LOW_SPEED_TIME are set) so that zero still means “use libcurl defaults,” or introduce a separate setting for the stall timeout and keep timeout_secs semantics unchanged. If you keep the new behavior, also update the public API/docs and review all callers that rely on the default zero value.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@src/http.cc`:
- Around line 417-420: The timeout handling in the HTTP setup path now treats
timeout_secs == 0 as a stall timeout override, which conflicts with the public
contract documented in http.h. Update the behavior in the HTTP configuration
code (where CURLOPT_LOW_SPEED_LIMIT and CURLOPT_LOW_SPEED_TIME are set) so that
zero still means “use libcurl defaults,” or introduce a separate setting for the
stall timeout and keep timeout_secs semantics unchanged. If you keep the new
behavior, also update the public API/docs and review all callers that rely on
the default zero value.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 74c433da-caf8-41d7-a925-7a3c6fe37f35
📒 Files selected for processing (1)
src/http.cc
The libcurl multi-pump loop called select() with a nullptr timeout (block forever). A connection that drops mid-upload without a clean RST/FIN never makes the socket ready, so select() blocked indefinitely and the synchronous Execute() wedged the calling thread permanently -- no further uploads could run. Normal uploads also had no timeout configured (only the RDMA control plane set one), so libcurl had nothing to enforce even when woken.
Summary by CodeRabbit