fix(topology): Fix for issue causing stalling on shutdown for sinks configured w/ disk buffers #24949
Conversation
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: eb5630a0fd
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Force-pushed from eb5630a to 28a64b0
lib/vector-buffers/src/variants/disk_v2/tests/acknowledgements.rs (outdated review thread)
Force-pushed from de6b4ec to a62a56a
@codex review this PR
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a62a56a08b
if self.ledger.is_writer_done() {
    return Ok(None);
}
Do not return EOF before draining late acknowledgements
This early writer_done return can terminate the reader while there are still in-flight records awaiting ack. Those acks are only applied in handle_pending_acknowledgements() at the start of a later next() call, but most callers stop polling after the first None, so the pending acks never get drained. In that case the ledger/file deletion state is left stale and already-delivered records can be replayed on the next startup/reload. Before this change, this path waited and could be woken by the ack notification.
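The failure mode described in this review comment can be sketched with a toy model. All fields and method bodies below are illustrative stand-ins, not Vector's actual disk_v2 reader:

```rust
// Toy model of the review comment above; the fields and methods here are
// made up for illustration, not Vector's actual disk_v2 reader.
#[derive(Debug)]
struct Reader {
    writer_done: bool,
    unacked: usize, // in-flight records still awaiting acknowledgement
    ledger_flushed: bool,
}

impl Reader {
    // Pending acks are only applied at the start of a later next() call,
    // so a caller that stops polling after the first None never drains them.
    fn next(&mut self) -> Option<u64> {
        self.handle_pending_acknowledgements();
        if self.writer_done {
            // Modeled bug: EOF is returned even though unacked > 0.
            // A safe variant would wait until unacked == 0 before EOF.
            return None;
        }
        Some(0) // a record would be read here
    }

    fn handle_pending_acknowledgements(&mut self) {
        if self.unacked == 0 {
            self.ledger_flushed = true; // ledger/file-deletion state updated
        }
    }
}

fn main() {
    let mut r = Reader { writer_done: true, unacked: 1, ledger_flushed: false };
    assert_eq!(r.next(), None); // caller sees EOF and stops polling
    assert!(!r.ledger_flushed); // stale state -> records replayed on restart
    println!("pending ack never drained");
}
```

The stale `ledger_flushed` flag stands in for the ledger/file-deletion state the comment says is left stale when EOF is returned early.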
@graphcareful any movement on this fix?
Force-pushed from a62a56a to 522821e
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 522821e093
The disk buffer reader stalls when a sink's buffering options are edited in the config while Vector is running. This was tracked down to a deadlock on the call to self.ledger.wait_for_writer during a read. The solution is to not Pause a sink configured with disk buffers but rather Remove it. This is safe because during graceful shutdown the sink processes all acks, and whatever data is unread remains on disk to be processed later.
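The Pause-vs-Remove decision described above can be sketched as follows. This is a toy sketch with made-up names, not Vector's actual topology API (which this thread later revises toward an explicit cancel signal):

```rust
// Made-up types for illustration; Vector's real reload logic lives in
// the topology code and is more involved than this.
#[derive(Debug, PartialEq)]
enum ReloadAction { Pause, Remove }

#[derive(Clone, Copy)]
enum BufferKind { Memory, Disk }

fn action_for_changed_sink(buffer: BufferKind) -> ReloadAction {
    match buffer {
        // Removing is safe: graceful shutdown drains all acks, and any
        // unread data stays on disk for the rebuilt sink to pick up.
        // Pausing instead would leave the reader deadlocked in
        // wait_for_writer.
        BufferKind::Disk => ReloadAction::Remove,
        // In-memory buffers keep the previous pause-and-reconnect path.
        BufferKind::Memory => ReloadAction::Pause,
    }
}

fn main() {
    assert_eq!(action_for_changed_sink(BufferKind::Disk), ReloadAction::Remove);
    assert_eq!(action_for_changed_sink(BufferKind::Memory), ReloadAction::Pause);
    println!("disk-buffered sinks are removed, not paused");
}
```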
Force-pushed from 522821e to 9b96386
The latest round of changes sends the cancel signal to the sinks to force them to gracefully shut down, instead of pausing and then reconnecting buffers. A bit of extra code was added to be able to wait for a sink to fully shut down, since it wasn't otherwise possible to know when the retain count of a buffer would reach 0: fanout.rs performed delayed shutdown of channels in the background, with no way to wait on that work's completion. I added that after observing that manual testing was fine but my new unit tests failed, so something must have been racing and making things flaky.
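The "wait for the buffer to be fully released" mechanism described above can be approximated with a plain channel. This is a sketch with illustrative names, not the actual fanout.rs code:

```rust
use std::sync::mpsc;
use std::thread;

// Hedged sketch of the release-barrier idea: each background task that
// holds a buffer sender also holds a release_tx; dropping it on
// completion wakes whoever is blocked waiting for the buffer release.
fn wait_for_release(rx: mpsc::Receiver<()>) -> bool {
    // recv() returns Err(Disconnected) once every sender is dropped,
    // i.e. once the delayed background shutdown has truly finished.
    rx.recv().is_err()
}

fn main() {
    let (release_tx, release_rx) = mpsc::channel::<()>();
    let worker = thread::spawn(move || {
        // ... delayed channel-shutdown work would happen here ...
        drop(release_tx); // signals "buffer fully released"
    });
    assert!(wait_for_release(release_rx)); // blocks until the drop above
    worker.join().unwrap();
    println!("shutdown observed");
}
```

This is why the unit tests needed the extra wait: without a barrier like this, the test races the background shutdown task and observes a nonzero retain count intermittently.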
tobz
left a comment
Seems straightforward enough to me if it fixes the glitch. 👍🏻
@@ -0,0 +1,4 @@
Fixed issue during shutdown or restart of a sink with disk buffer configured where
Update this; the fix no longer involves the shutdown path.
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9f6ea320f0
self.detach_triggers
    .remove(key)
    .unwrap()
    .into_inner()
    .cancel();
Keep graceful drain for non-reused sink reloads
This unconditional cancel() now runs for every changed sink, not just buffer-reuse cases. Because sink input is wrapped with take_until_if(tripwire) in builder.rs, canceling here terminates the stream immediately and bypasses natural draining of queued events. For changed sinks whose buffers are not reused (especially in-memory buffers), this introduces event loss during reloads that previously allowed graceful drain.
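A toy model of the concern above, assuming (as the comment says) the sink input is wrapped in a take-until-cancelled combinator. The function name is made up; only the drop-on-cancel behavior is the point:

```rust
// Made-up stand-in for a stream wrapped with take_until_if(tripwire):
// once the tripwire fires, iteration stops immediately, and anything
// still queued behind that point is dropped rather than drained.
fn deliver_until_cancelled(queue: &[&str], cancel_after: usize) -> Vec<String> {
    queue
        .iter()
        .enumerate()
        .take_while(|(i, _)| *i < cancel_after) // tripwire fires here
        .map(|(_, event)| event.to_string())
        .collect()
}

fn main() {
    let queued = ["a", "b", "c", "d", "e"];
    let delivered = deliver_until_cancelled(&queued, 2);
    assert_eq!(delivered, vec!["a".to_string(), "b".to_string()]);
    // Three queued events were lost to the unconditional cancel.
    assert_eq!(queued.len() - delivered.len(), 3);
    println!("lost {} events", queued.len() - delivered.len());
}
```

With a durable disk buffer the "lost" events survive on disk, which is why the concern is specific to changed sinks whose buffers are not reused, especially in-memory ones.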
if let Some(rxs) = buffer_release_barriers.remove(key) {
    for rx in rxs {
        let _ = rx.await;
    }
}
Prevent deadlock when awaiting disk writer release barrier
Waiting on the barrier here can hang reload indefinitely when a changed sink is paused (not removed) and upstream stops emitting. remove_inputs may send Pause, and fanout processes control messages during send activity; if no more sends occur, the paused sender clone holding the disk writer is never dropped, so this rx.await never resolves. That turns certain quiet pipelines into permanent reload stalls.
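The stall described above can be modeled with a plain channel. Names are illustrative, not Vector's actual types, and the sketch uses try_recv so it terminates instead of hanging:

```rust
use std::sync::mpsc;

// Toy model of the reload stall: as long as a paused sender clone is
// never dropped, a blocking wait on the release channel never resolves.
fn is_still_held(rx: &mpsc::Receiver<()>) -> bool {
    // Empty = senders still alive; a blocking recv() would stall here.
    // Disconnected = every sender dropped, i.e. the buffer was released.
    matches!(rx.try_recv(), Err(mpsc::TryRecvError::Empty))
}

fn main() {
    let (release_tx, release_rx) = mpsc::channel::<()>();
    let paused_clone = release_tx; // held by the paused sink, never dropped

    // Quiet pipeline: no sends, no drops -> the wait would never resolve.
    assert!(is_still_held(&release_rx));

    drop(paused_clone); // only an explicit release (sink removal) unblocks it
    assert!(!is_still_held(&release_rx));
    println!("release observed only after the paused sender was dropped");
}
```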
Summary
This PR includes two fixes for when sinks configured with buffers are shut down. The first issue is observed when reloading a config that contains a sink with a disk buffer: Vector will wait until batch.timeout_secs has elapsed, which causes buffers to flush and unblocks the reload process. The fix is to send the cancel() signal to the sink so it doesn't block on its buffer not being flushed downstream.
The second fix is for an issue with the same root cause: I noticed the same hang on Ctrl-C with the aws_s3 sink. I employed the same solution there, calling the cancel() signal early in the stop() method.
Vector configuration
How did you test this PR?
By using the config above and a simple HTTP traffic generator. Modifying the sink buffer and re-saving the file, looking for errors from the source.
Change Type
Is this a breaking change?
Does this PR include user facing changes?
no-changelog label to this PR.
References
Datadog and Prometheus and S3 sink hangs forever #17666
Notes
- @vectordotdev/vector to reach out to us regarding this PR.
- pre-push hook, please see this template.
- make fmt
- make check-clippy (if there are failures it's possible some of them can be fixed with make clippy-fix)
- make test
- git merge origin master and git push.
- (Cargo.lock), please run make build-licenses to regenerate the license inventory and commit the changes (if any). More details here.