fix: Make Vault sleeps cancelable to prevent leaked goroutines #2148

Open
skoppe wants to merge 1 commit into hashicorp:main from skoppe:vault_goroutine_leak_fix

skoppe commented Apr 15, 2026

We had our Vault Agents DDoS our Vault cluster due to accumulated leaked goroutines that all woke up at the same time (thousands of them).

The agents in question were missing access to a particular secret they were requesting. This meant they restarted the consul template runner after 12 tries (approximately every 5 min).

Over the period of 27 days we saw a slow buildup of goroutines which then all suddenly woke up. Initially we suspected authentication renewal (our agent token is 30 days), but that was ruled out.

Eventually we found that the `vault_pki` file has a goroutine which sleeps until 90% of a PKI cert's lifetime has elapsed. In our case the lifetime is 30 days, so the goroutine slept until 27 days after cert issuance. Critically, on a consul-template runner restart these goroutines don't respond to the stop channel, but instead continue sleeping until 90% of the cert's lifetime has passed.

This PR makes the sleep paths in Vault dependencies cancelable so goroutines exit promptly when a runner is stopped.

@skoppe skoppe requested review from a team as code owners April 15, 2026 18:45