Skip to content

Fix assertion crash in processIOThreadsReadDone when DONT_PARSE client has stale argc#3245

Open
aradz44 wants to merge 4 commits intovalkey-io:unstablefrom
aradz44:dont-parse-bug
Open

Fix assertion crash in processIOThreadsReadDone when DONT_PARSE client has stale argc#3245
aradz44 wants to merge 4 commits intovalkey-io:unstablefrom
aradz44:dont-parse-bug

Conversation

@aradz44
Copy link
Contributor

@aradz44 aradz44 commented Feb 22, 2026

Found while reading the code — I have not yet been able to confirm whether this path can materialize in existing real-world scenarios.

Bug

When an IO thread reads from a client with READ_FLAGS_DONT_PARSE set (because canParseCommand returns 0), processIOThreadsReadDone skips handleParseResults but still checks c->argc > 0 and sets pending_command = 1. However, argc may be stale from a previous incomplete pipelined command parse — it was set by consumeCommandQueue when handleParseResults returned PARSE_NEEDMORE, and was never cleared.

This leads to processCommand being called with parsed_cmd == NULL and without READ_FLAGS_COMMAND_NOT_FOUND, triggering the assertion:

==> server.c:4177 'c->read_flags & READ_FLAGS_COMMAND_NOT_FOUND' is not true

Crash sequence

  1. IO thread parses pipelined commands; the last command is incomplete → cmd_queue entry has argc > 0, parsed_cmd = NULL
  2. Main thread consumeCommandQueue pops the incomplete entry → c->argc > 0, c->parsed_cmd = NULL
  3. handleParseResults returns PARSE_NEEDMORE → loop breaks, stale argc remains on the client
  4. An external event makes canParseCommand return 0 for this client (e.g. close_after_reply, close_asap, or application-specific blocking flags)
  5. New data arrives → trySendReadToIOThreads sets READ_FLAGS_DONT_PARSE
  6. IO thread reads data, skips parsing
  7. processIOThreadsReadDone: stale argc > 0 → pending_command = 1
  8. processCommand: parsed_cmd == NULL, COMMAND_NOT_FOUND not set → assertion crash

Fix

In processIOThreadsReadDone, when DONT_PARSE is set and argc > 0, skip the client. The argc is stale from a previous incomplete parse — the IO thread didn't parse anything new, so there is no valid command to execute.

Test

Added DEBUG SET-CLOSE-AFTER-REPLY subcommand and a test that reproduces the crash deterministically:

  1. Client sends a complete PING + incomplete SET (truncated value) in one pipeline
  2. After PING reply confirms the incomplete parse was consumed (stale argc), DEBUG SET-CLOSE-AFTER-REPLY is called from another connection to simulate the external event
  3. Client sends remaining data → triggers the DONT_PARSE path → crash on unfixed server

Signed-off-by: Arad Zilberstein <aradz@amazon.com>
@aradz44 aradz44 changed the title Fix *theoretical* assertion crash in processIOThreadsReadDone when DONT_PARSE client has stale argc Fix assertion crash in processIOThreadsReadDone when DONT_PARSE client has stale argc Feb 22, 2026
} else if (!strcasecmp(objectGetVal(c->argv[1]), "client-enforce-reply-list") && c->argc == 3) {
server.debug_client_enforce_reply_list = atoi(objectGetVal(c->argv[2]));
addReply(c, shared.ok);
} else if (!strcasecmp(objectGetVal(c->argv[1]), "set-close-after-reply") && c->argc == 3) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what we wanted is a debug extended CLIENT KILL abilities right?
maybe it would be helpful to have a debug command: DEBUG CLOSE-ASAP (close async) and DEBUG CLOSE-AFTER-REPLY ?

src/debug.c Outdated
Comment on lines +1068 to +1069
if (!target) { addReplyError(c, "No such client"); return; }
target->flag.close_after_reply = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presumably this would be reproducible after doing ACL LOAD with the current user getting invalidated. In that case we call close_after_reply. That seems like a real potential crash? If we can trigger this with a real known workload, I would feel a bit better about it than something synthetic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. better to fix and test REAL issue rather than a synthetic one. Good point about the ACL LOAD!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

close_after_reply is only set by a client on itself by what I saw , but a client can't have both stale argc and close_after_reply because
setting close_after_reply requires executing a command, which calls resetClient and zeros argc, and once it's set. For external paths like ACL LOAD, the code uses freeClientOrCloseLater which sets close_asap (not close_after_reply),
and close_asap is already guarded by trySendReadToIOThreads.
If you think of another way to trigger it I can check it.

aradz44 added 2 commits March 3, 2026 10:41
Signed-off-by: Arad Zilberstein <aradz@amazon.com>
Signed-off-by: Arad Zilberstein <aradz@amazon.com>
Signed-off-by: Arad Zilberstein <aradz@amazon.com>
@aradz44
Copy link
Contributor Author

aradz44 commented Mar 3, 2026

Sorry for the push -f I forgot to add -s to the commit

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants