Fix NCCL Error 7 by removing eager_connect_single_device #320
The QuantizedAllReduceTest and QuantizedReduceScatterTest tests were failing with "NCCL Error 7: NCCL operation in progress" because non-blocking NCCL work.wait() returns before NCCL is ready for the next operation. Adding cuda.synchronize() between iterations ensures operations are fully complete.
In non-blocking NCCL mode (blocking=False), eager_connect_single_device starts an async communicator init that cannot be waited on through Python APIs. When the first collective is subsequently called, NCCL returns "Error 7: operation in progress" because the init hasn't completed. Remove the eager_connect call and let the communicator be lazily initialized by the first collective, which properly handles the async init inside its ncclGroupStart/ncclGroupEnd context. Also reverts the cuda.synchronize() workaround in collectives_test.py which was insufficient.
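The failure mode described in the commit message above can be modeled with a small toy state machine. This is only an illustrative sketch, not NCCL or PyTorch code: `ToyCommunicator`, its states, and `background_init_finishes` are hypothetical names made up for the example, and the real communicator's internals are assumed rather than known.

```python
class ToyCommunicator:
    """Toy stand-in for a non-blocking communicator (hypothetical, not NCCL)."""

    def __init__(self):
        # Lifecycle: "uninitialized" -> "in_progress" -> "ready"
        self.state = "uninitialized"

    def eager_connect(self):
        # Kicks off async init and returns immediately. The caller gets no
        # handle to wait on, which is the situation the commit describes.
        self.state = "in_progress"

    def background_init_finishes(self):
        # Models the async init completing at some later time that the
        # caller cannot observe or wait for.
        if self.state == "in_progress":
            self.state = "ready"

    def collective(self):
        if self.state == "in_progress":
            # Init was started eagerly and has not finished yet.
            raise RuntimeError("Error 7: operation in progress")
        if self.state == "uninitialized":
            # Lazy path: the collective performs init itself and only
            # proceeds once init has completed.
            self.state = "ready"
        return "ok"


def run_eager():
    comm = ToyCommunicator()
    comm.eager_connect()
    return comm.collective()  # raises: init is still in progress


def run_lazy():
    comm = ToyCommunicator()
    return comm.collective()  # init happens inside the first collective
```

In this model `run_lazy()` succeeds, while `run_eager()` raises because nothing between `eager_connect()` and `collective()` can wait for the init to finish.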
JiwaniZakir
left a comment
The fix removes eager_connect_single_device unconditionally, but the comment specifically calls out that the problem occurs in non-blocking mode (blocking=False). If opts can also represent a blocking configuration, removing eager connect in that case is an unnecessary regression: blocking mode could safely call eager_connect_single_device and would benefit from the earlier communicator initialization and faster first-collective latency. A more precise fix would be to check opts (or whatever flag controls blocking) and conditionally skip the call only when non-blocking is set, e.g.:

    if not opts.is_high_priority_stream:  # or whatever the blocking flag is
        backend_class.eager_connect_single_device(...)

Additionally, there's no test covering the scenario described. A test that creates a ProcessGroupNCCL in non-blocking mode and runs a collective immediately after init would both prevent regression and document the expected behavior. Without it, a future refactor could silently reintroduce the bug.
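One way to read the reviewer's suggestion is as a small gate around the eager-connect call. The sketch below is only an illustration under stated assumptions: `HypotheticalOptions`, its `blocking` field, and `maybe_eager_connect` are invented for the example, and the real options object may expose the blocking setting under a different name or not at all.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class HypotheticalOptions:
    # Assumed field name; the real options object may differ.
    blocking: bool = True


def maybe_eager_connect(
    opts: HypotheticalOptions, eager_connect: Callable[[], None]
) -> bool:
    """Call eager_connect only in blocking mode; return whether it was called."""
    if opts.blocking:
        eager_connect()
        return True
    # Non-blocking: leave communicator init to the first collective.
    return False
```

A regression test in the same spirit could pass a recording stub as `eager_connect` and assert that it is never invoked when `blocking=False`.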
Summary
- Remove eager_connect_single_device from ProcessGroupNCCL._create_pg to fix "NCCL Error 7: operation in progress" failures
- In non-blocking NCCL mode (blocking=False), eager_connect_single_device starts an async communicator init that cannot be waited on through Python APIs. The first collective then fails because the init hasn't completed.
- The communicator is now lazily initialized by the first collective, which properly handles the async init inside its ncclGroupStart/ncclGroupEnd context.

Test plan
- The collectives tests (QuantizedAllReduceTest) should pass; they have been failing since ~Dec 2025 due to a PyTorch nightly regression in the non-blocking NCCL + eager_connect interaction