Skip to content

Conversation

@lidel
Copy link
Member

@lidel lidel commented Jan 7, 2026

Summary

stream.Close() can block forever when the remote peer is slow or unresponsive, causing goroutine leaks. This particularly affects WebRTC SCTP state machines and bitswap servers, eventually preventing nodes from serving blocks.

Note: this PR aims to be a surgical fix, perhaps there is a better way?

Affected Versions

  • go-libp2p: all versions using lazy multistream-select it seems?
  • Observed in: kubo v0.39.0 (boxo v0.35.2, go-libp2p v0.45.0)

Symptoms

  • Bitswap stops serving blocks despite having them pinned
  • vole bitswap check returns Responded: false
  • libp2p identify/ping work fine (connection is "alive")
  • Goroutine profile shows hundreds of goroutines stuck for days/weeks

Likely the Root Cause

When protocol is known from identify, host.NewStream() returns a lazy multistream wrapper that defers handshake completion until Close():

streamWrapper.Close()
  → lazyClientConn.Close()           // go-multistream
    → Flush()                        // sends write handshake
    → rhandshakeOnce.Do(doReadHandshake)
      → ReadNextToken(l.con)         // BLOCKS HERE - no timeout!
    → l.con.Close()

ReadNextToken() reads from the stream with no deadline. If the peer doesn't respond, the goroutine blocks forever.

Why This Happens?

  1. Bitswap server opens stream, sends blocks, calls Close()
  2. Close() must complete multistream handshake before closing
  3. Handshake read blocks waiting for peer's protocol confirmation
  4. Peer is unresponsive (network issue, overloaded, disconnected uncleanly)
  5. Read never returns → goroutine leaked

This affects all transports but WebRTC is more prone due to NAT traversal issues and complex SCTP state machines.

Evidence

Goroutine profile from production node (35 days uptime):

goroutine 1421 [chan receive, 20381 minutes]:
github.com/multiformats/go-multistream.(*once).Do(...)
    lazyClient.go:53
github.com/multiformats/go-multistream.(*lazyClientConn[...]).Close(...)
    lazyClient.go:196
github.com/ipfs/boxo/bitswap/server.(*Server).sendBlocks(...)
    server.go:335
  • 400+ goroutines stuck in sendBlocksClose() path
  • Blocked for 20,000+ minutes (14+ days)
  • All waiting on WebRTC/SCTP reads

Proposed Fix (this PR)

Set a read deadline before calling multistream Close():

// go-libp2p/p2p/host/basic/basic_host.go

func (s *streamWrapper) Close() error {
    // Prevent indefinite blocking on multistream handshake completion.
    _ = s.Stream.SetReadDeadline(time.Now().Add(DefaultNegotiationTimeout))
    return s.rw.Close()
}

This uses the existing 10-second DefaultNegotiationTimeout.

Why this location?

  • Minimal change (5 lines)
  • Uses existing timeout configuration
  • Fixes all callers automatically
  • No API changes required

Why not fix in go-multistream?

Fix Verification

  • s.Stream and lazyClientConn.con are the same object
  • SetReadDeadline affects in-progress reads (per Go spec)
  • pion/sctp correctly signals blocked readers on deadline
  • Stream is closed at end of lazyClientConn.Close() regardless
  • All existing tests pass

Related

Workaround (without this PR)

Restart affected nodes to clear stuck goroutines.

@lidel lidel force-pushed the fix/multistream-close-deadline branch from de88179 to c2adc7a Compare January 7, 2026 01:51
@lidel lidel requested review from MarcoPolo and sukunrt January 7, 2026 01:55
@lidel lidel marked this pull request as ready for review January 7, 2026 02:11
@MarcoPolo MarcoPolo force-pushed the fix/multistream-close-deadline branch from c2adc7a to aeab4ba Compare January 7, 2026 19:33
@MarcoPolo
Copy link
Collaborator

I made some small changes to move this to a simnet+synctest env so we don't wait 10s for the test.

…t blocking

streamWrapper.Close() can block indefinitely when the remote peer is slow
or unresponsive during the multistream-select handshake completion.

The lazy multistream protocol negotiation defers reading the handshake
response until Close() is called. If the remote peer doesn't respond,
the read blocks forever, causing goroutine leaks.

This is particularly problematic for bitswap servers where taskWorkers
can get stuck trying to close streams after sending blocks.

The fix sets a read deadline (using DefaultNegotiationTimeout) before
calling the multistream Close(), ensuring the operation will time out
rather than block indefinitely.

Related: multiformats/go-multistream#47
Related: multiformats/go-multistream#48
@MarcoPolo MarcoPolo force-pushed the fix/multistream-close-deadline branch from aeab4ba to bcc2bf1 Compare January 8, 2026 01:11
@gammazero
Copy link
Contributor

I made some small changes to move this to a simnet+synctest env so we don't wait 10s for the test.

Is the use of synctest going to require changing go.mod to use go 1.25?

@MarcoPolo
Copy link
Collaborator

I made some small changes to move this to a simnet+synctest env so we don't wait 10s for the test.

Is the use of synctest going to require changing go.mod to use go 1.25?

no. The test is in a new file with a build tag that only runs when testing with go 1.25.

@MarcoPolo MarcoPolo merged commit bcc2bf1 into master Jan 8, 2026
10 of 11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants