fix(ssh): tear down tunnel relay on socket hangup instead of busy-spinning (#1769)#1772
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 71806a44e2
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if transportState == .drainThenStop { return .transportHangup } | ||
| if localState == .drainThenStop { return .localClosed } |
There was a problem hiding this comment.
Drain all queued bytes before stopping on hangup
When a peer sends more than one relay buffer and then closes, poll can report POLLIN | POLLHUP; this loop pumps only a single 32 KiB chunk and then these lines return, so any remaining bytes already queued on the socket/channel are discarded. This can truncate large requests or result chunks that arrive with a close; keep draining until recv/channel read reaches EOF or would-block before terminating on the hangup.
Useful? React with 👍 / 👎.
Fixes #1769.
Problem
TablePro pins ~100% of a CPU core while idle when an SSH tunnel's socket drops (laptop sleep, network change, server-side idle timeout). The tunnel's relay loop keeps running at full tilt forever instead of tearing down.
Root cause
Both SSH relay loops polled their fds for
POLLINonly and never inspectedPOLLHUP/POLLERR/POLLNVAL:LibSSH2Tunnel.runRelay(client socket <-> SSH channel)LibSSH2TunnelFactory.startChannelRelay(jump-host socketpair <-> SSH channel)When the transport (or, for jump hosts, the inter-hop UNIX
socketpair) peer closes, that fd sits permanently readable-at-EOF.poll()returns immediately every iteration andlibssh2_channel_readreturnsEAGAINforever (libssh2 never surfaces a clean channel EOF when only the transport died), so none of the existing branches fire and the loop spins. For multi-hop tunnels the keepalive also can't detect the death, because it writes into a still-locally-open socketpair, so nothing ever tore the dead tunnel down.Fix
The two relay loops were near-duplicate code carrying the same bug, so they are replaced with one tested engine:
RelayPollState.swift— purerelayFDState(_:)classifier:POLLERR/POLLNVALare fatal;POLLHUPdrains buffered data then stops (POLLHUP can coexist with readable data, so the last bytes are not dropped);POLLINis readable.SSHChannelRelay.swift— one bidirectional relay engine (pure Swift/Darwin, no libssh2). It terminates on hangup/error instead of spinning, returns why it stopped, and also closes a secondary spin in the write path (a transport hangup during anEAGAINwrite-wait now bails). Channel I/O sits behind anSSHChannelIOprotocol so the whole loop is unit-testable.LibSSH2ChannelIO.swift— the libssh2-backedSSHChannelIO, preserving the existing EAGAIN/EOF semantics and serializing throughsessionQueue.Death propagation reuses the existing
markDead -> onDeath -> recoverDeadTunnelmachinery, no new callback types:runRelaycallsmarkDead()on transport hangup (gated onisRunning, so a deliberate close doesn't trigger recovery). A client-side hangup just ends that one relay.startChannelRelaynowshutdowns its socketpair end before closing it on exit, so a dead hop's EOF cascades down the chain torunRelay, which marks the tunnel dead. This also covers the multi-hop keepalive gap.Tests
RelayPollStateTests— 8 cases, including the exactPOLLIN | POLLHUP -> drainThenStopmisclassification that caused the spin, and fatal-wins cases.SSHChannelRelayTests— runs the engine over real socketpairs with a scripted channel and asserts a closed peer terminates as.transportHangup/.localClosed(the spin would fail these by timing out), plus data-forwarding both directions.Limitation
Recovery flows through
runRelay, so a transport that dies before any client has connected relies on the existing 30s health-monitor backstop to reconnect. The CPU spin itself is fixed unconditionally.Verification
swiftlint lint --strictclean on changed files. The user builds and runs the test suite in Xcode.Credit to @drekinov for the detailed
ps/sample/lsof/fs_usageinvestigation in #1769 that pinned the spinning fd and the missingPOLLHUP/POLLERRhandling.