Skip to content

fix(topology): Fix for issue causing stalling on shutdown for sinks configured w/ disk buffers#24949

Open
graphcareful wants to merge 6 commits intovectordotdev:masterfrom
graphcareful:rob/fix-disk-buffer-reload-bug
Open

fix(topology): Fix for issue causing stalling on shutdown for sinks configured w/ disk buffers#24949
graphcareful wants to merge 6 commits intovectordotdev:masterfrom
graphcareful:rob/fix-disk-buffer-reload-bug

Conversation

@graphcareful
Copy link
Copy Markdown
Contributor

@graphcareful graphcareful commented Mar 17, 2026

Summary

This PR includes two fixes when sinks configured with buffers are shutdown. First issue is observed when reloading a config that contains a sink with a disk buffer. Vector will wait until batch.timeout_secs has completed which causes buffers to flush and unblocks the reload process. The fix for this is to send the cancel() signal to the sink, so it doesn't block on its buffer not being flushed downstream.

The second fix is for an issue with the same root cause. I noticed the same hang on issue of control-c with the aws_s3 sink. I employed the same solution there in the stop() method - to call the cancel() signal early in the stop() method.

Vector configuration

sources:
  http-server-source:
    type: "http_server"
    address: "127.0.0.1:8001"

sinks:
  amazon-s3-sink:
    inputs:
      - "http-server-source"
    type: "aws_s3"
    bucket: e2e-tests
    region: us-east-1
    encoding:
      codec: json
    buffer:
      type: disk
      max_size: 30000000000
      when_full: block
    batch:
      timeout_secs: 500
      max_bytes: 100000000

How did you test this PR?

By using the config above and a simple HTTP traffic generator. Modifying the sink buffer and re-saving the file, looking for errors from the source.

Change Type

  • Bug fix
  • New feature
  • Dependencies
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the no-changelog label to this PR.

References

Notes

  • Please read our Vector contributor resources.
  • Do not hesitate to use @vectordotdev/vector to reach out to us regarding this PR.
  • Some CI checks run only after we manually approve them.
    • We recommend adding a pre-push hook, please see this template.
    • Alternatively, we recommend running the following locally before pushing to the remote branch:
      • make fmt
      • make check-clippy (if there are failures it's possible some of them can be fixed with make clippy-fix)
      • make test
  • After a review is requested, please avoid force pushes to help us review incrementally.
    • Feel free to push as many commits as you want. They will be squashed into one before merging.
    • For example, you can run git merge origin master and git push.
  • If this PR introduces changes Vector dependencies (modifies Cargo.lock), please
    run make build-licenses to regenerate the license inventory and commit the changes (if any). More details here.

@graphcareful graphcareful requested a review from a team as a code owner March 17, 2026 19:05
@graphcareful graphcareful added domain: topology Anything related to Vector's topology code domain: config Anything related to configuring Vector domain: buffers Anything related to Vector's memory/disk buffers labels Mar 17, 2026
@pront
Copy link
Copy Markdown
Member

pront commented Mar 17, 2026

@codex review

@bruceg bruceg added the type: bug A code related bug. label Mar 17, 2026
@pront

This comment was marked as outdated.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: eb5630a0fd

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@graphcareful graphcareful force-pushed the rob/fix-disk-buffer-reload-bug branch from eb5630a to 28a64b0 Compare March 23, 2026 17:26
@github-actions github-actions bot removed the domain: topology Anything related to Vector's topology code label Mar 23, 2026
@graphcareful graphcareful requested review from bruceg and pront March 23, 2026 21:06
@graphcareful graphcareful force-pushed the rob/fix-disk-buffer-reload-bug branch from de6b4ec to a62a56a Compare March 24, 2026 15:57
@pront
Copy link
Copy Markdown
Member

pront commented Mar 24, 2026

@codex review this PR

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a62a56a08b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1083 to +1085
if self.ledger.is_writer_done() {
return Ok(None);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Do not return EOF before draining late acknowledgements

This early writer_done return can terminate the reader while there are still in-flight records awaiting ack. Those acks are only applied in handle_pending_acknowledgements() at the start of a later next() call, but most callers stop polling after the first None, so the pending acks never get drained. In that case the ledger/file deletion state is left stale and already-delivered records can be replayed on the next startup/reload. Before this change, this path waited and could be woken by the ack notification.

Useful? React with 👍 / 👎.

@garryliu-stripe
Copy link
Copy Markdown

@graphcareful any movement in this fix?

@graphcareful graphcareful force-pushed the rob/fix-disk-buffer-reload-bug branch from a62a56a to 522821e Compare April 2, 2026 19:35
@github-actions github-actions bot added the domain: topology Anything related to Vector's topology code label Apr 2, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 522821e093

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

The disk buffer reader stalls when editing a buffering option of a sinks config
while running. It was tracked down to a deadlock on calling self.ledger.wait_for_writer
during a read.

The solution is to not Pause a sink configured with disk buffers but rather Remove
it. This is safe because during graceful shutdown the sink will process all acks and
whatever data is unread remains on disk to be later processed.
@graphcareful graphcareful force-pushed the rob/fix-disk-buffer-reload-bug branch from 522821e to 9b96386 Compare April 2, 2026 20:56
@graphcareful
Copy link
Copy Markdown
Contributor Author

Latest round of changes send the cancel signal to the sinks to force them to gracefully shutdown instead of pause then reconnect buffers. A bit of extra code was added to be able to wait for a sink to fully shut down, since it wasn't really possible to understand when the retain count of a buffer would go to 0 since fanout.rs performed delayed shutdown of channels in the background -- with no way to wait on that works completion. I had to include that after observing that during manual testing things were fine but my new unit tests failed, so something must have been racing to make things flaky.

Copy link
Copy Markdown
Contributor

@tobz tobz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems straightforward enough to me if it fixes the glitch. 👍🏻

@@ -0,0 +1,4 @@
Fixed issue during shutdown or restart of a sink with disk buffer configured where
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update this, fix no longer involves shutdown path

@pront
Copy link
Copy Markdown
Member

pront commented Apr 3, 2026

@codex review

  • focus on unconditional cancel() and graceful drain
  • RunningTopology::stop() and cancels sink detach triggers

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9f6ea320f0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +654 to +658
self.detach_triggers
.remove(key)
.unwrap()
.into_inner()
.cancel();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep graceful drain for non-reused sink reloads

This unconditional cancel() now runs for every changed sink, not just buffer-reuse cases. Because sink input is wrapped with take_until_if(tripwire) in builder.rs, canceling here terminates the stream immediately and bypasses natural draining of queued events. For changed sinks whose buffers are not reused (especially in-memory buffers), this introduces event loss during reloads that previously allowed graceful drain.

Useful? React with 👍 / 👎.

Comment on lines +730 to +733
if let Some(rxs) = buffer_release_barriers.remove(key) {
for rx in rxs {
let _ = rx.await;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Prevent deadlock when awaiting disk writer release barrier

Waiting on the barrier here can hang reload indefinitely when a changed sink is paused (not removed) and upstream stops emitting. remove_inputs may send Pause, and fanout processes control messages during send activity; if no more sends occur, the paused sender clone holding the disk writer is never dropped, so this rx.await never resolves. That turns certain quiet pipelines into permanent reload stalls.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

domain: buffers Anything related to Vector's memory/disk buffers domain: config Anything related to configuring Vector domain: topology Anything related to Vector's topology code type: bug A code related bug.

Projects

None yet

5 participants