fix(pool): guard Pool.renew against out-of-range addrIndex after pool reset#154
Closed
SAY-5 wants to merge 2 commits intoqdm12:v2.0.0-betafrom
Closed
fix(pool): guard Pool.renew against out-of-range addrIndex after pool reset#154SAY-5 wants to merge 2 commits intoqdm12:v2.0.0-betafrom
SAY-5 wants to merge 2 commits intoqdm12:v2.0.0-betafrom
Conversation
… reset Pool.renew dialed outside the mutex and then reacquired the lock and did p.addrConns[conn.addrIndex] without checking that the pool hadn't been reset while the dial was in flight. When the upstream TLS DNS target (Cloudflare) starts resetting connections repeatedly, the pool can be emptied and recreated between when the caller grabs the addrIndex and when renew retakes the lock; the deref then panicked the DNS server goroutine with 'index out of range [0] with length 0', taking gluetun and every container sharing its network namespace down with it. Bounds-check conn.addrIndex against len(p.addrConns) while the lock is held. On failure, close the freshly-dialed netConn (to avoid leaking a live socket) and return an error so the caller falls back to its retry path instead of crashing the process. Refs qdm12/gluetun issue 3292. Signed-off-by: SAY-5 <say.apm35@gmail.com>
Owner
|
Did this still happen using the gluetun latest image post qdm12/gluetun@792a5ff ? If so can you share the panic message? |
Author
|
Re-tested against |
Author
|
Closing per my last comment — the underlying panic is no longer reproducible against gluetun ; thanks @qdm12 for the pointer. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Downstream crash reported at qdm12/gluetun#3292.
Pool.renewdials outside the mutex and then reacquires the lock and doesp.addrConns[conn.addrIndex]without checking that the pool hadn't been reset while the dial was in flight. When the upstream TLS DNS target (Cloudflare1.0.0.1:853/1.1.1.1:853) starts resetting connections repeatedly, the pool can be emptied and recreated between when the caller grabsaddrIndexand when renew retakes the lock; the deref then panics:This kills the DNS server goroutine and takes gluetun + every container sharing its network namespace down with it (crash loop).
Fix
Bounds-check
conn.addrIndexagainstlen(p.addrConns)while the lock is held. On failure, close the freshly-dialednetConn(so we don't leak a live socket) and return an error so the caller falls back to its retry path instead of crashing the process.Test
go test ./internal/pool/...behaves identically to main on this machine — a pre-existing darwin-only "can't assign requested address" vs "connection refused" message mismatch fails without my change too; every other subtest continues to pass.