Replies: 1 comment 3 replies
Hi @bseto
This shouldn't be the issue, because it only writes the cluster node information into the local nodes.conf file.
Could you please provide the logs of the cluster node? They might contain clues about what was going on at that time, for example whether the node's role was changed.
You could also check the command_stats via the `INFO commandstats` section.
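For example, something like this minimal sketch can pull that section (go-redis v9 assumed, since that's the client in use; the address is a placeholder):

```go
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	// Placeholder address for the affected kvrocks node.
	rdb := redis.NewClient(&redis.Options{Addr: "127.0.0.1:6666"})

	// INFO commandstats reports per-command call counts and cumulative time,
	// which should show whether the cluster-related commands are unusually slow.
	stats, err := rdb.Info(ctx, "commandstats").Result()
	if err != nil {
		panic(err)
	}
	fmt.Println(stats)
}
```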



Hi, I'm wondering if anyone else is also having this problem, and if there's anything I can do to help find a resolution to this.
Problem
Note: This only happens on our production clusters under heavy load.
When adding an empty shard (so no migrations are involved) using the kvrocks-controller `create shard ...` command, we observe that our readers and writers start mass disconnecting and reconnecting, and they continue to do so until we move reads off of the cluster. This also happens when we migrate slots, but I think I've narrowed it down to `SetClusterNodes`, and maybe something to do with `kCmdExclusive` being held too long so that commands time out?

We have 4 regions we deploy to, and during November and December testing, the 2nd and 3rd largest clusters would display the above behaviour whenever I added an empty shard or did any migrations. As a hail-mary, I had Claude Opus read the kvrocks codebase to see if it could find anything, and it suggested I set `persist-cluster-nodes-enabled no`, since persisting nodes.conf could add to how long the exclusive lock needs to be held.

It's January now, and the load is similar (but slightly lower than it was in November/December). When I tried migrations on the 2nd and 3rd largest clusters after setting `persist-cluster-nodes-enabled no`, those clusters no longer have issues. However, when I try it on the largest cluster we have, I'm still seeing the issue.
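For anyone who wants to double-check that the setting is actually live on each node, here is a minimal sketch using go-redis v9 and `CONFIG GET` (the node addresses are placeholders; in practice they'd come from the cluster topology):

```go
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	// Placeholder node addresses.
	nodes := []string{"10.0.0.1:6666", "10.0.0.2:6666"}

	for _, addr := range nodes {
		rdb := redis.NewClient(&redis.Options{Addr: addr})
		// CONFIG GET returns a parameter -> value map in go-redis v9.
		val, err := rdb.ConfigGet(ctx, "persist-cluster-nodes-enabled").Result()
		fmt.Println(addr, val, err)
		rdb.Close()
	}
}
```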
Testing
Attempt at reproducing outside of production
I spent a week in November trying to reproduce this issue on a non-production cluster and I was unable to do so.
I had 8 r6idn.large nodes in the cluster, with 16 c6gn.2xlarge nodes load-testing it (a mix of readers and writers), and was unable to reproduce the issue.
In Production
I'll just give the details of our largest cluster:
kvrocks version: 2.12.1
Instance Type: r6idn.8xlarge
Number of nodes: 44
Operations: 3.4M ops/s (3M read, 130k hsetexpire, 130k hmget)
Each Node averages 80k op/s at peak times.
We were originally using the rueidis client to connect to the cluster, and we tried changing the timeout settings and pipelines. Originally I had thought it might have been the way the client was handling `MOVED` errors, but getting the timeouts even when we just add an empty shard ruled that out (I had also modified the client to track whether we were getting `MOVED` errors, and we weren't).

We then moved to go-redis, tried to play around with the timeouts there, and tried ignoring the context timeout.
We followed this document about Go context timeouts being potentially harmful: we set go-redis to a 10-second timeout on the client, and wrapped calls in our own function that times out the original request while letting the underlying Redis connection live instead of being torn down. The idea was that future requests to the go-redis client would try to grab a connection from the pool and get errors from go-redis once the pool was exhausted, instead of spamming the kvrocks cluster. However, this still didn't work; somehow requests would still time out.
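A minimal sketch of that kind of wrapper (go-redis v9 assumed; the function name, key, address, and timeouts are illustrative, not our exact code):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// hgetAllWithSoftTimeout returns to the caller after callerTimeout, but lets
// the underlying go-redis call keep running under the client's own (longer)
// ReadTimeout, so the connection is not closed mid-request and handed back
// to the pool in a bad state.
func hgetAllWithSoftTimeout(rdb *redis.ClusterClient, key string, callerTimeout time.Duration) (map[string]string, error) {
	type result struct {
		val map[string]string
		err error
	}
	ch := make(chan result, 1) // buffered so the goroutine never blocks after we give up

	go func() {
		// context.Background(): only the client's ReadTimeout/WriteTimeout apply here.
		val, err := rdb.HGetAll(context.Background(), key).Result()
		ch <- result{val, err}
	}()

	select {
	case r := <-ch:
		return r.val, r.err
	case <-time.After(callerTimeout):
		return nil, fmt.Errorf("soft timeout after %s (request left running)", callerTimeout)
	}
}

func main() {
	// Placeholder address and a long client-side timeout; the caller-facing
	// timeout is enforced by the wrapper instead.
	rdb := redis.NewClusterClient(&redis.ClusterOptions{
		Addrs:       []string{"10.0.0.1:6666"},
		ReadTimeout: 10 * time.Second,
	})
	val, err := hgetAllWithSoftTimeout(rdb, "some-key", 500*time.Millisecond)
	fmt.Println(val, err)
}
```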
Currently we're using go-redis with its `Limiter` interface to create a circuit breaker. Whenever we get a large number of errors, we open the circuit breaker for a bit before connecting again. But recovery still takes 5-10 minutes, and it isn't ideal, since we'd still have 40+ migrations to do, with each migration ending in a disconnect/reconnect storm.

Note: I'm also noticing it's not every node. It's usually maybe 10% of the nodes in the cluster that get hammered hard and have clients continuously connect/disconnect.
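For reference, a breaker in the shape go-redis's `Limiter` interface expects (`Allow` / `ReportResult`) looks roughly like the sketch below. The threshold, cool-down, and address are placeholders, and attaching it per-node through `ClusterOptions.NewClient` is just one way to wire it into a ClusterClient:

```go
package main

import (
	"errors"
	"sync"
	"time"

	"github.com/redis/go-redis/v9"
)

// breaker is a deliberately small circuit breaker satisfying go-redis's
// Limiter interface. Threshold and cool-down values are placeholders.
type breaker struct {
	mu        sync.Mutex
	failures  int
	openUntil time.Time
}

var errBreakerOpen = errors.New("circuit breaker open")

// Allow is called before each command; returning an error fails the command
// locally instead of sending it to the server.
func (b *breaker) Allow() error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if time.Now().Before(b.openUntil) {
		return errBreakerOpen
	}
	return nil
}

// ReportResult is called with the command's result; consecutive failures
// open the breaker for a short cool-down.
func (b *breaker) ReportResult(err error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if err == nil || errors.Is(err, redis.Nil) {
		b.failures = 0
		return
	}
	b.failures++
	if b.failures >= 50 { // placeholder threshold
		b.openUntil = time.Now().Add(30 * time.Second) // placeholder cool-down
		b.failures = 0
	}
}

func main() {
	// Attach one breaker per node client; the node address is a placeholder.
	rdb := redis.NewClusterClient(&redis.ClusterOptions{
		Addrs: []string{"10.0.0.1:6666"},
		NewClient: func(opt *redis.Options) *redis.Client {
			opt.Limiter = &breaker{}
			return redis.NewClient(opt)
		},
	})
	_ = rdb
}
```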
Thoughts
Is there a way I can confirm whether `SetClusterNodes` is indeed taking too long and is the culprit? Or are there any ideas on what settings we could change?