Replies: 1 comment 3 replies
Hi @bseto
This shouldn't be the issue, because it only writes the cluster node information into the local nodes.conf file.
Could you please provide the logs of the cluster node? They might contain clues about what was going on at that time, for example whether the node's role was changed.
You could also check the command_stats via the `INFO commandstats` section.
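For example, something like this minimal sketch can pull that section (go-redis v9 assumed, since that's the client in use; the address is a placeholder):

```go
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	// Placeholder address for the affected kvrocks node.
	rdb := redis.NewClient(&redis.Options{Addr: "127.0.0.1:6666"})

	// INFO commandstats reports per-command call counts and cumulative time,
	// which should show whether the cluster-related commands are unusually slow.
	stats, err := rdb.Info(ctx, "commandstats").Result()
	if err != nil {
		panic(err)
	}
	fmt.Println(stats)
}
```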



Hi, I'm wondering if anyone else is also having this problem, and if there's anything I can do to help find a resolution to this.
Problem
Note: This only happens on our production clusters under heavy load.
When adding an empty shard (so no migrations are involved) using the kvrocks-controller `create shard ...` command, we observe that our readers and writers start mass disconnecting and reconnecting, and they continue to do so until we move reads off of the cluster. This also happens when we migrate slots, but I think I've narrowed it down to `SetClusterNodes`, and maybe something to do with `kCmdExclusive` being held too long so that commands time out?

We have 4 regions we deploy to, and during November and December testing, the 2nd and 3rd largest clusters would display the above behaviour whenever I added an empty shard or did any migrations. As a hail-mary, I had Claude Opus read the kvrocks codebase to see if it could find anything, and it suggested I set `persist-cluster-nodes-enabled no`, since persisting nodes.conf could add to how long the exclusive lock needs to be held.

It's January now, and the load is similar (but slightly lower than it was in November/December). When I tried migrations on the 2nd and 3rd largest clusters after setting `persist-cluster-nodes-enabled no`, those clusters no longer have issues. However, when I try it on the largest cluster we have, I'm still seeing the issue.
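For anyone who wants to double-check that the setting is actually live on each node, here is a minimal sketch using go-redis v9 and `CONFIG GET` (the node addresses are placeholders; in practice they'd come from the cluster topology):

```go
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	// Placeholder node addresses.
	nodes := []string{"10.0.0.1:6666", "10.0.0.2:6666"}

	for _, addr := range nodes {
		rdb := redis.NewClient(&redis.Options{Addr: addr})
		// CONFIG GET returns a parameter -> value map in go-redis v9.
		val, err := rdb.ConfigGet(ctx, "persist-cluster-nodes-enabled").Result()
		fmt.Println(addr, val, err)
		rdb.Close()
	}
}
```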
Testing
Attempt at reproducing outside of production
I spent a week in November trying to reproduce this issue on a non-production cluster and I was unable to do so.
I had 8 r6idn.large nodes in the cluster, with 16 c6gn.2xlarge nodes load-testing it (a mix of readers and writers), and was unable to reproduce the issue.
In Production
I'll just give the details of our largest cluster:
kvrocks version: 2.12.1
Instance Type: r6idn.8xlarge
Number of nodes: 44
Operations: 3.4M ops/s (3M read, 130k hsetexpire, 130k hmget)
Each Node averages 80k op/s at peak times.
We were originally using the rueidis client to connect to the cluster, and we tried changing the timeout settings and pipelines. Originally I had thought it might have been the way the client was handling `MOVED` errors, but getting the timeouts even when we just add an empty shard ruled that out (I had also modified the client to track whether we were getting `MOVED` errors, and we weren't).

We then moved to go-redis, tried to play around with the timeouts there, and tried ignoring the context timeout.
We followed this document about Go context timeouts being potentially harmful: we set go-redis to a 10-second timeout on the client, and wrapped calls in our own function that times out the original request while letting the underlying Redis connection live instead of being torn down. The idea was that future requests to the go-redis client would try to grab a connection from the pool and get errors from go-redis once the pool was exhausted, instead of spamming the kvrocks cluster. However, this still didn't work; somehow requests would still time out.
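A minimal sketch of that kind of wrapper (go-redis v9 assumed; the function name, key, address, and timeouts are illustrative, not our exact code):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// hgetAllWithSoftTimeout returns to the caller after callerTimeout, but lets
// the underlying go-redis call keep running under the client's own (longer)
// ReadTimeout, so the connection is not closed mid-request and handed back
// to the pool in a bad state.
func hgetAllWithSoftTimeout(rdb *redis.ClusterClient, key string, callerTimeout time.Duration) (map[string]string, error) {
	type result struct {
		val map[string]string
		err error
	}
	ch := make(chan result, 1) // buffered so the goroutine never blocks after we give up

	go func() {
		// context.Background(): only the client's ReadTimeout/WriteTimeout apply here.
		val, err := rdb.HGetAll(context.Background(), key).Result()
		ch <- result{val, err}
	}()

	select {
	case r := <-ch:
		return r.val, r.err
	case <-time.After(callerTimeout):
		return nil, fmt.Errorf("soft timeout after %s (request left running)", callerTimeout)
	}
}

func main() {
	// Placeholder address and a long client-side timeout; the caller-facing
	// timeout is enforced by the wrapper instead.
	rdb := redis.NewClusterClient(&redis.ClusterOptions{
		Addrs:       []string{"10.0.0.1:6666"},
		ReadTimeout: 10 * time.Second,
	})
	val, err := hgetAllWithSoftTimeout(rdb, "some-key", 500*time.Millisecond)
	fmt.Println(val, err)
}
```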
Currently we're using go-redis with its `Limiter` interface to create a circuit breaker. Whenever we get a large number of errors, we open the circuit breaker for a bit before connecting again. But recovery still takes 5-10 minutes, and it isn't ideal, since we'd still have 40+ migrations to do, with each migration ending in a disconnect/reconnect storm.

Note: I'm also noticing it's not every node. It's usually maybe 10% of the nodes in the cluster that get hammered hard and have clients continuously connect/disconnect.
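For reference, a breaker in the shape go-redis's `Limiter` interface expects (`Allow` / `ReportResult`) looks roughly like the sketch below. The threshold, cool-down, and address are placeholders, and attaching it per-node through `ClusterOptions.NewClient` is just one way to wire it into a ClusterClient:

```go
package main

import (
	"errors"
	"sync"
	"time"

	"github.com/redis/go-redis/v9"
)

// breaker is a deliberately small circuit breaker satisfying go-redis's
// Limiter interface. Threshold and cool-down values are placeholders.
type breaker struct {
	mu        sync.Mutex
	failures  int
	openUntil time.Time
}

var errBreakerOpen = errors.New("circuit breaker open")

// Allow is called before each command; returning an error fails the command
// locally instead of sending it to the server.
func (b *breaker) Allow() error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if time.Now().Before(b.openUntil) {
		return errBreakerOpen
	}
	return nil
}

// ReportResult is called with the command's result; consecutive failures
// open the breaker for a short cool-down.
func (b *breaker) ReportResult(err error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if err == nil || errors.Is(err, redis.Nil) {
		b.failures = 0
		return
	}
	b.failures++
	if b.failures >= 50 { // placeholder threshold
		b.openUntil = time.Now().Add(30 * time.Second) // placeholder cool-down
		b.failures = 0
	}
}

func main() {
	// Attach one breaker per node client; the node address is a placeholder.
	rdb := redis.NewClusterClient(&redis.ClusterOptions{
		Addrs: []string{"10.0.0.1:6666"},
		NewClient: func(opt *redis.Options) *redis.Client {
			opt.Limiter = &breaker{}
			return redis.NewClient(opt)
		},
	})
	_ = rdb
}
```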
Thoughts
Is there a way I can confirm whether `SetClusterNodes` is indeed taking too long and is the culprit? Or are there any ideas on what settings we could change?