Skip to content

fix(ssh): tear down tunnel relay on socket hangup instead of busy-spinning (#1769)#1772

Merged
datlechin merged 2 commits into
mainfrom
fix/ssh-relay-busy-spin-1769
Jun 27, 2026
Merged

fix(ssh): tear down tunnel relay on socket hangup instead of busy-spinning (#1769)#1772
datlechin merged 2 commits into
mainfrom
fix/ssh-relay-busy-spin-1769

Conversation

@datlechin

Copy link
Copy Markdown
Member

Fixes #1769.

Problem

TablePro pins ~100% of a CPU core while idle when an SSH tunnel's socket drops (laptop sleep, network change, server-side idle timeout). The tunnel's relay loop keeps running at full tilt forever instead of tearing down.

Root cause

Both SSH relay loops polled their fds for POLLIN only and never inspected POLLHUP/POLLERR/POLLNVAL:

  • LibSSH2Tunnel.runRelay (client socket <-> SSH channel)
  • LibSSH2TunnelFactory.startChannelRelay (jump-host socketpair <-> SSH channel)

When the transport (or, for jump hosts, the inter-hop UNIX socketpair) peer closes, that fd sits permanently readable-at-EOF. poll() returns immediately every iteration and libssh2_channel_read returns EAGAIN forever (libssh2 never surfaces a clean channel EOF when only the transport died), so none of the existing branches fire and the loop spins. For multi-hop tunnels the keepalive also can't detect the death, because it writes into a still-locally-open socketpair, so nothing ever tore the dead tunnel down.

Fix

The two relay loops were near-duplicate code carrying the same bug, so they are replaced with one tested engine:

  • RelayPollState.swift — pure relayFDState(_:) classifier: POLLERR/POLLNVAL are fatal; POLLHUP drains buffered data then stops (POLLHUP can coexist with readable data, so the last bytes are not dropped); POLLIN is readable.
  • SSHChannelRelay.swift — one bidirectional relay engine (pure Swift/Darwin, no libssh2). It terminates on hangup/error instead of spinning, returns why it stopped, and also closes a secondary spin in the write path (a transport hangup during an EAGAIN write-wait now bails). Channel I/O sits behind an SSHChannelIO protocol so the whole loop is unit-testable.
  • LibSSH2ChannelIO.swift — the libssh2-backed SSHChannelIO, preserving the existing EAGAIN/EOF semantics and serializing through sessionQueue.

Death propagation reuses the existing markDead -> onDeath -> recoverDeadTunnel machinery, no new callback types:

  • runRelay calls markDead() on transport hangup (gated on isRunning, so a deliberate close doesn't trigger recovery). A client-side hangup just ends that one relay.
  • startChannelRelay now shutdowns its socketpair end before closing it on exit, so a dead hop's EOF cascades down the chain to runRelay, which marks the tunnel dead. This also covers the multi-hop keepalive gap.

Tests

  • RelayPollStateTests — 8 cases, including the exact POLLIN | POLLHUP -> drainThenStop misclassification that caused the spin, and fatal-wins cases.
  • SSHChannelRelayTests — runs the engine over real socketpairs with a scripted channel and asserts a closed peer terminates as .transportHangup / .localClosed (the spin would fail these by timing out), plus data-forwarding both directions.

Limitation

Recovery flows through runRelay, so a transport that dies before any client has connected relies on the existing 30s health-monitor backstop to reconnect. The CPU spin itself is fixed unconditionally.

Verification

swiftlint lint --strict clean on changed files. The user builds and runs the test suite in Xcode.

Credit to @drekinov for the detailed ps/sample/lsof/fs_usage investigation in #1769 that pinned the spinning fd and the missing POLLHUP/POLLERR handling.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 71806a44e2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +93 to +94
if transportState == .drainThenStop { return .transportHangup }
if localState == .drainThenStop { return .localClosed }

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Drain all queued bytes before stopping on hangup

When a peer sends more than one relay buffer and then closes, poll can report POLLIN | POLLHUP; this loop pumps only a single 32 KiB chunk and then these lines return, so any remaining bytes already queued on the socket/channel are discarded. This can truncate large requests or result chunks that arrive with a close; keep draining until recv/channel read reaches EOF or would-block before terminating on the hangup.

Useful? React with 👍 / 👎.

@datlechin datlechin merged commit d23e45e into main Jun 27, 2026
3 checks passed
@datlechin datlechin deleted the fix/ssh-relay-busy-spin-1769 branch June 27, 2026 04:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

High CPU (100% of a core) — SSH tunnel relay busy-spins after the tunnel socket drops (no POLLHUP/POLLERR handling)

1 participant