fix(topology): Fix for issue causing stalling on shutdown for sinks configured w/ disk buffers #24949
Conversation
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: eb5630a0fd
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Force-pushed from eb5630a to 28a64b0
lib/vector-buffers/src/variants/disk_v2/tests/acknowledgements.rs (outdated review thread)
Force-pushed from de6b4ec to a62a56a
@codex review this PR
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a62a56a08b
if self.ledger.is_writer_done() {
    return Ok(None);
}
Do not return EOF before draining late acknowledgements
This early writer_done return can terminate the reader while there are still in-flight records awaiting ack. Those acks are only applied in handle_pending_acknowledgements() at the start of a later next() call, but most callers stop polling after the first None, so the pending acks never get drained. In that case the ledger/file deletion state is left stale and already-delivered records can be replayed on the next startup/reload. Before this change, this path waited and could be woken by the ack notification.
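The failure mode described in this review comment can be sketched with a toy model. All fields and method bodies below are illustrative stand-ins, not Vector's actual disk_v2 reader:

```rust
// Toy model of the review comment above; the fields and methods here are
// made up for illustration, not Vector's actual disk_v2 reader.
#[derive(Debug)]
struct Reader {
    writer_done: bool,
    unacked: usize, // in-flight records still awaiting acknowledgement
    ledger_flushed: bool,
}

impl Reader {
    // Pending acks are only applied at the start of a later next() call,
    // so a caller that stops polling after the first None never drains them.
    fn next(&mut self) -> Option<u64> {
        self.handle_pending_acknowledgements();
        if self.writer_done {
            // Modeled bug: EOF is returned even though unacked > 0.
            // A safe variant would wait until unacked == 0 before EOF.
            return None;
        }
        Some(0) // a record would be read here
    }

    fn handle_pending_acknowledgements(&mut self) {
        if self.unacked == 0 {
            self.ledger_flushed = true; // ledger/file-deletion state updated
        }
    }
}

fn main() {
    let mut r = Reader { writer_done: true, unacked: 1, ledger_flushed: false };
    assert_eq!(r.next(), None); // caller sees EOF and stops polling
    assert!(!r.ledger_flushed); // stale state -> records replayed on restart
    println!("pending ack never drained");
}
```

The stale `ledger_flushed` flag stands in for the ledger/file-deletion state the comment says is left stale when EOF is returned early.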
@graphcareful any movement on this fix?
Force-pushed from a62a56a to 522821e
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 522821e093
The disk buffer reader stalls when a sink's buffering options are edited in the config while Vector is running. This was tracked down to a deadlock on the call to self.ledger.wait_for_writer during a read. The solution is to not Pause a sink configured with disk buffers but rather Remove it. This is safe because during graceful shutdown the sink processes all acks, and whatever data is unread remains on disk to be processed later.
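The Pause-vs-Remove decision described above can be sketched as follows. This is a toy sketch with made-up names, not Vector's actual topology API (which this thread later revises toward an explicit cancel signal):

```rust
// Made-up types for illustration; Vector's real reload logic lives in
// the topology code and is more involved than this.
#[derive(Debug, PartialEq)]
enum ReloadAction { Pause, Remove }

#[derive(Clone, Copy)]
enum BufferKind { Memory, Disk }

fn action_for_changed_sink(buffer: BufferKind) -> ReloadAction {
    match buffer {
        // Removing is safe: graceful shutdown drains all acks, and any
        // unread data stays on disk for the rebuilt sink to pick up.
        // Pausing instead would leave the reader deadlocked in
        // wait_for_writer.
        BufferKind::Disk => ReloadAction::Remove,
        // In-memory buffers keep the previous pause-and-reconnect path.
        BufferKind::Memory => ReloadAction::Pause,
    }
}

fn main() {
    assert_eq!(action_for_changed_sink(BufferKind::Disk), ReloadAction::Remove);
    assert_eq!(action_for_changed_sink(BufferKind::Memory), ReloadAction::Pause);
    println!("disk-buffered sinks are removed, not paused");
}
```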
Force-pushed from 522821e to 9b96386
The latest round of changes sends the cancel signal to the sinks to force them to gracefully shut down, instead of pausing and then reconnecting buffers. A bit of extra code was added to be able to wait for a sink to fully shut down, since it wasn't otherwise possible to know when the retain count of a buffer would reach 0: fanout.rs performed delayed shutdown of channels in the background, with no way to wait on that work's completion. I added that after observing that manual testing was fine but my new unit tests failed, so something must have been racing and making things flaky.
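The "wait for the buffer to be fully released" mechanism described above can be approximated with a plain channel. This is a sketch with illustrative names, not the actual fanout.rs code:

```rust
use std::sync::mpsc;
use std::thread;

// Hedged sketch of the release-barrier idea: each background task that
// holds a buffer sender also holds a release_tx; dropping it on
// completion wakes whoever is blocked waiting for the buffer release.
fn wait_for_release(rx: mpsc::Receiver<()>) -> bool {
    // recv() returns Err(Disconnected) once every sender is dropped,
    // i.e. once the delayed background shutdown has truly finished.
    rx.recv().is_err()
}

fn main() {
    let (release_tx, release_rx) = mpsc::channel::<()>();
    let worker = thread::spawn(move || {
        // ... delayed channel-shutdown work would happen here ...
        drop(release_tx); // signals "buffer fully released"
    });
    assert!(wait_for_release(release_rx)); // blocks until the drop above
    worker.join().unwrap();
    println!("shutdown observed");
}
```

This is why the unit tests needed the extra wait: without a barrier like this, the test races the background shutdown task and observes a nonzero retain count intermittently.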
tobz
left a comment
Seems straightforward enough to me if it fixes the glitch. 👍🏻
@@ -0,0 +1,4 @@
Fixed issue during shutdown or restart of a sink with disk buffer configured where
Update this; the fix no longer involves the shutdown path.
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9f6ea320f0
self.detach_triggers
    .remove(key)
    .unwrap()
    .into_inner()
    .cancel();
Keep graceful drain for non-reused sink reloads
This unconditional cancel() now runs for every changed sink, not just buffer-reuse cases. Because sink input is wrapped with take_until_if(tripwire) in builder.rs, canceling here terminates the stream immediately and bypasses natural draining of queued events. For changed sinks whose buffers are not reused (especially in-memory buffers), this introduces event loss during reloads that previously allowed graceful drain.
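A toy model of the concern above, assuming (as the comment says) the sink input is wrapped in a take-until-cancelled combinator. The function name is made up; only the drop-on-cancel behavior is the point:

```rust
// Made-up stand-in for a stream wrapped with take_until_if(tripwire):
// once the tripwire fires, iteration stops immediately, and anything
// still queued behind that point is dropped rather than drained.
fn deliver_until_cancelled(queue: &[&str], cancel_after: usize) -> Vec<String> {
    queue
        .iter()
        .enumerate()
        .take_while(|(i, _)| *i < cancel_after) // tripwire fires here
        .map(|(_, event)| event.to_string())
        .collect()
}

fn main() {
    let queued = ["a", "b", "c", "d", "e"];
    let delivered = deliver_until_cancelled(&queued, 2);
    assert_eq!(delivered, vec!["a".to_string(), "b".to_string()]);
    // Three queued events were lost to the unconditional cancel.
    assert_eq!(queued.len() - delivered.len(), 3);
    println!("lost {} events", queued.len() - delivered.len());
}
```

With a durable disk buffer the "lost" events survive on disk, which is why the concern is specific to changed sinks whose buffers are not reused, especially in-memory ones.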
if let Some(rxs) = buffer_release_barriers.remove(key) {
    for rx in rxs {
        let _ = rx.await;
    }
}
Prevent deadlock when awaiting disk writer release barrier
Waiting on the barrier here can hang reload indefinitely when a changed sink is paused (not removed) and upstream stops emitting. remove_inputs may send Pause, and fanout processes control messages during send activity; if no more sends occur, the paused sender clone holding the disk writer is never dropped, so this rx.await never resolves. That turns certain quiet pipelines into permanent reload stalls.
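The stall described above can be modeled with a plain channel. Names are illustrative, not Vector's actual types, and the sketch uses try_recv so it terminates instead of hanging:

```rust
use std::sync::mpsc;

// Toy model of the reload stall: as long as a paused sender clone is
// never dropped, a blocking wait on the release channel never resolves.
fn is_still_held(rx: &mpsc::Receiver<()>) -> bool {
    // Empty = senders still alive; a blocking recv() would stall here.
    // Disconnected = every sender dropped, i.e. the buffer was released.
    matches!(rx.try_recv(), Err(mpsc::TryRecvError::Empty))
}

fn main() {
    let (release_tx, release_rx) = mpsc::channel::<()>();
    let paused_clone = release_tx; // held by the paused sink, never dropped

    // Quiet pipeline: no sends, no drops -> the wait would never resolve.
    assert!(is_still_held(&release_rx));

    drop(paused_clone); // only an explicit release (sink removal) unblocks it
    assert!(!is_still_held(&release_rx));
    println!("release observed only after the paused sender was dropped");
}
```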
Summary
This PR includes two fixes for when sinks configured with buffers are shut down. The first issue is observed when reloading a config that contains a sink with a disk buffer: Vector will wait until batch.timeout_secs has elapsed, which causes buffers to flush and unblocks the reload process. The fix is to send the cancel() signal to the sink so it doesn't block on its buffer not being flushed downstream.
The second fix is for an issue with the same root cause: I noticed the same hang on Ctrl-C with the aws_s3 sink. I employed the same solution there, calling the cancel() signal early in the stop() method.
Vector configuration
How did you test this PR?
By using the config above and a simple HTTP traffic generator. Modifying the sink buffer and re-saving the file, looking for errors from the source.
Change Type
Is this a breaking change?
Does this PR include user facing changes?
no-changelog label to this PR.
References
Datadog and Prometheus and S3 sink hangs forever #17666
Notes
- @vectordotdev/vector to reach out to us regarding this PR.
- pre-push hook, please see this template.
- make fmt
- make check-clippy (if there are failures it's possible some of them can be fixed with make clippy-fix)
- make test
- git merge origin master and git push.
- (Cargo.lock), please run make build-licenses to regenerate the license inventory and commit the changes (if any). More details here.