Fix NCCL Error 7 by removing eager_connect_single_device by d4l3k · Pull Request #320 · meta-pytorch/torchft

d4l3k · 2026-03-24T17:31:41Z

Summary

Remove eager_connect_single_device from ProcessGroupNCCL._create_pg to fix "NCCL Error 7: operation in progress" failures
In non-blocking NCCL mode (blocking=False), eager_connect_single_device starts an async communicator init that cannot be waited on through Python APIs. The first collective then fails because the init hasn't completed.
Without eager_connect, the communicator is lazily initialized by the first collective, which properly handles the async init inside its ncclGroupStart/ncclGroupEnd context.
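
The difference between the two init paths can be illustrated with a toy model. This is a minimal sketch in plain Python, not the NCCL or PyTorch API: `ToyComm`, `Result`, `eager_then_collective`, and `lazy_collective` are hypothetical names invented here, and the "init takes N polls" behaviour is an assumption chosen only to mimic an async init that the caller cannot wait on.

```python
from enum import Enum


class Result(Enum):
    SUCCESS = 0
    IN_PROGRESS = 7  # mirrors the "Error 7" code quoted in this PR


class ToyComm:
    """Toy stand-in for a non-blocking communicator (hypothetical, not NCCL)."""

    def __init__(self, init_steps: int = 3) -> None:
        self._remaining = init_steps  # polls left until the fake init finishes
        self._started = False

    def start_init(self) -> None:
        # Kicks off the fake async init and returns immediately.
        self._started = True

    def poll(self) -> Result:
        # Starts init if needed, advances it one step, and reports its state.
        if not self._started:
            self.start_init()
        if self._remaining > 0:
            self._remaining -= 1
        return Result.IN_PROGRESS if self._remaining > 0 else Result.SUCCESS

    def raw_collective(self) -> Result:
        # A collective issued without waiting: fails while init is pending.
        ready = self._started and self._remaining == 0
        return Result.SUCCESS if ready else Result.IN_PROGRESS


def eager_then_collective(comm: ToyComm) -> Result:
    # Eager path: init is started, but the caller has nothing to wait on.
    comm.start_init()
    return comm.raw_collective()


def lazy_collective(comm: ToyComm) -> Result:
    # Lazy path: the collective owns the init and polls until it completes.
    while comm.poll() is Result.IN_PROGRESS:
        pass
    return comm.raw_collective()


if __name__ == "__main__":
    print(eager_then_collective(ToyComm()))
    print(lazy_collective(ToyComm()))
```

In this model the eager path reports the in-progress result because nothing drives the init to completion before the collective runs, while the lazy path succeeds because the same call that needs the communicator also waits for it.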

Test plan

CI GPU tests (QuantizedAllReduceTest) should pass. They have been failing since ~Dec 2025 due to a PyTorch nightly regression in the interaction between non-blocking NCCL and eager_connect.

The QuantizedAllReduceTest and QuantizedReduceScatterTest tests were failing with "NCCL Error 7: NCCL operation in progress" because non-blocking NCCL work.wait() returns before NCCL is ready for the next operation. Adding cuda.synchronize() between iterations ensures operations are fully complete.

In non-blocking NCCL mode (blocking=False), eager_connect_single_device starts an async communicator init that cannot be waited on through Python APIs. When the first collective is subsequently called, NCCL returns "Error 7: operation in progress" because the init hasn't completed. Remove the eager_connect call and let the communicator be lazily initialized by the first collective, which properly handles the async init inside its ncclGroupStart/ncclGroupEnd context. Also reverts the cuda.synchronize() workaround in collectives_test.py which was insufficient.

JiwaniZakir

The fix removes eager_connect_single_device unconditionally, but the comment specifically calls out that the problem occurs in non-blocking mode (blocking=False). If opts can also represent a blocking configuration, removing eager connect in that case is an unnecessary regression — blocking mode could safely call eager_connect_single_device and would benefit from the earlier communicator initialization and faster first-collective latency. A more precise fix would be to check opts (or whatever flag controls blocking) and conditionally skip the call only when non-blocking is set, e.g.:

if not opts.is_high_priority_stream:  # or whatever the blocking flag is
    backend_class.eager_connect_single_device(...)

Additionally, there's no test covering the scenario described — a test that creates a ProcessGroupNCCL in non-blocking mode and runs a collective immediately after init would both prevent regression and document the expected behavior. Without it, a future refactor could silently reintroduce the bug.
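
The conditional guard and the kind of check suggested above could look roughly like the following. This is a sketch under stated assumptions, not torchft code: `FakeBackend` and `maybe_eager_connect` are hypothetical helpers, the `blocking` argument is a plain boolean supplied by the caller (how it would be derived from `opts` is not shown in this thread), and the device string is a placeholder.

```python
class FakeBackend:
    """Hypothetical stand-in that only records eager-connect calls."""

    def __init__(self) -> None:
        self.eager_calls = []

    def eager_connect_single_device(self, device) -> None:
        # The real method would start communicator init; this one just logs.
        self.eager_calls.append(device)


def maybe_eager_connect(backend, device, blocking: bool) -> bool:
    """Eagerly connect only in blocking mode; return whether it was called."""
    if not blocking:
        # Non-blocking mode: leave init to the first collective.
        return False
    backend.eager_connect_single_device(device)
    return True


if __name__ == "__main__":
    blocking_backend = FakeBackend()
    print(maybe_eager_connect(blocking_backend, "cuda:0", blocking=True))

    nonblocking_backend = FakeBackend()
    print(maybe_eager_connect(nonblocking_backend, "cuda:0", blocking=False))
```

A real regression test would additionally need a GPU, a non-blocking ProcessGroupNCCL, and a collective issued right after init; the stub above only pins down the branching logic.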

meta-cla Bot added the CLA Signed label Mar 24, 2026

d4l3k changed the title ~~Fix NCCL non-blocking test failures with cuda.synchronize()~~ Fix NCCL Error 7 by removing eager_connect_single_device Mar 24, 2026

JiwaniZakir reviewed Apr 18, 2026

d4l3k wants to merge 2 commits into main from fix-nccl-test-sync
