Skip to content

fix: prevent infinite drain loop in forked BGSave child process#894

Open
eliastam wants to merge 2 commits intovalkey-io:mainfrom
eliastam:fix-drain-mutation-bug
Open

fix: prevent infinite drain loop in forked BGSave child process#894
eliastam wants to merge 2 commits intovalkey-io:mainfrom
eliastam:fix-drain-mutation-bug

Conversation

@eliastam
Copy link
Collaborator

DrainMutationQueue spins forever in the BGSave forked child process because the mutation queue is a frozen snapshot that will never be drained by writer threads. Check VALKEYMODULE_CTX_FLAGS_IS_CHILD and return an error instead of attempting to drain.

Also adds a configurable timeout (drain-mutation-queue-timeout-ms, default 5s) to DrainMutationQueue as a safety net, and an integration test covering the BGSAVE with non-empty mutation queue scenario.

@eliastam eliastam force-pushed the fix-drain-mutation-bug branch from bd3d3b6 to 8bce6fe Compare March 11, 2026 23:16
DrainMutationQueue spins forever in the BGSave forked child process
because the mutation queue is a frozen snapshot that will never be
drained by writer threads. Check VALKEYMODULE_CTX_FLAGS_IS_CHILD and
return an error instead of attempting to drain.

Also adds a configurable timeout (drain-mutation-queue-timeout-ms,
default 5s) to DrainMutationQueue as a safety net, and an integration
test covering the BGSAVE with non-empty mutation queue scenario.

Signed-off-by: Elias Tamraz <tamrazelias@gmail.com>
@eliastam eliastam force-pushed the fix-drain-mutation-bug branch from 8bce6fe to 4eb7371 Compare March 11, 2026 23:22
static const auto max_sleep = std::chrono::milliseconds(100);
auto sleep_duration = std::chrono::milliseconds(1);
auto timeout = std::chrono::milliseconds(
options::GetDrainMutationQueueTimeoutMs().GetValue());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetDrainMutationQueueOnSaveTimeoutMs?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

if (options::GetDrainMutationQueueOnSave().GetValue()) {
int flags = ValkeyModule_GetContextFlags(nullptr);
bool is_bgsave = (flags & VALKEYMODULE_CTX_FLAGS_IS_CHILD) != 0;
if (is_bgsave) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should still continue if the queue is empty at this point.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

Comment on lines +1690 to +1695
auto status = DrainMutationQueue(ctx);
if (!status.ok()) {
VMSDK_LOG(WARNING, ctx)
<< "Failed to drain mutation queue on load for index "
<< vmsdk::config::RedactIfNeeded(name_) << ": " << status;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Timeout should only apply for save. We should keep the load behaviour unchanged.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reverted this change

Copy link
Collaborator

@otherscase otherscase left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few comments

@eliastam eliastam force-pushed the fix-drain-mutation-bug branch 5 times, most recently from cf417ac to 51f5cbc Compare March 13, 2026 19:59
- Rename config to drain-mutation-queue-on-save-timeout-ms
- Only error in forked child if mutation queue is non-empty
- Keep load behavior unchanged, timeout only applies to save

Signed-off-by: Elias Tamraz <tamrazelias@gmail.com>
@eliastam eliastam force-pushed the fix-drain-mutation-bug branch from 51f5cbc to 1d1acc6 Compare March 13, 2026 20:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants