Modular Clustering #35

Open
murphyjacob4 wants to merge 1 commit into valkey-io:main from murphyjacob4:main

Conversation

@murphyjacob4

This RFC is part 1 of 2 for "Cluster V2" valkey-io/valkey#384

  1. [This Proposal] Introduce a modular Core <-> Clustering interface
  2. Introduce a first-party cluster implementation following from 1

Signed-off-by: Jacob Murphy <jkmurphy@google.com>
/* The expected current revision. If not the actual latest, the Module
 * must fail the operation. If the key does not currently exist, use
 * NULL. This provides compare-and-swap (CAS) semantics. */
void *expected_revision;
Member

Any reason we can't be more specific about the data type of "version"? If it's intended to be an opaque type (defined by the cluster provider), then it should have its own type, with a vtable, not void.

Author

Yeah, it is arbitrary; e.g. in Raft I believe you need to specify term and index. I'm also fine defining it as a type. I figured some of the API might change a little when it comes to implementation (based on what is most ergonomic). Hard to anticipate now
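The compare-and-swap contract described in the quoted snippet could be sketched roughly like this. All names here (metadata_entry, metadata_cas_set, and the uint64_t payload behind the opaque revision pointer) are illustrative, not the proposed API:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative only: a revision is an opaque pointer; this sketch happens
 * to store a uint64_t behind it. NULL means "key does not exist". */
typedef struct {
    void *revision;
    const char *value;
} metadata_entry;

static int rev_matches(const void *current, const void *expected) {
    /* NULL only matches NULL (the "must not exist yet" case). */
    if (current == NULL || expected == NULL) return current == expected;
    return *(const uint64_t *)current == *(const uint64_t *)expected;
}

/* Returns 1 on success; 0 if expected_revision is stale (CAS failure). */
static int metadata_cas_set(metadata_entry *e, void *expected_revision,
                            void *new_revision, const char *new_value) {
    if (!rev_matches(e->revision, expected_revision)) return 0;
    e->revision = new_revision;
    e->value = new_value;
    return 1;
}
```

The key property is that a write carrying a stale (or NULL-when-the-key-exists) expected revision is rejected rather than silently overwriting.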

Comment on lines +323 to +325
`void*` pointers are used to allow flexibility in how revisions are defined. The
details of this will be left to the Module (e.g. in Raft, it may include both
term and index).
Member

Nit: rather than void, it should be an opaque type that has a vtable exported from the module.
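One possible shape for that nit, sketched under the assumption that the Module exports an opaque handle plus a vtable. All names here, including ValkeyClusterRevision and the Raft (term, index) example implementation, are hypothetical:

```c
#include <stdint.h>
#include <stdlib.h>

/* Opaque handle; only the Module knows the layout. */
typedef struct ValkeyClusterRevision ValkeyClusterRevision;

/* Vtable the Module would export alongside the opaque type. */
typedef struct {
    int  (*compare)(const ValkeyClusterRevision *a,
                    const ValkeyClusterRevision *b); /* <0, 0, >0 */
    void (*release)(ValkeyClusterRevision *r);
} ValkeyClusterRevisionVTable;

/* Example Module-side implementation: a Raft (term, index) pair. */
struct ValkeyClusterRevision { uint64_t term, index; };

static int raft_rev_compare(const ValkeyClusterRevision *a,
                            const ValkeyClusterRevision *b) {
    if (a->term != b->term) return a->term < b->term ? -1 : 1;
    if (a->index != b->index) return a->index < b->index ? -1 : 1;
    return 0;
}

static void raft_rev_release(ValkeyClusterRevision *r) { free(r); }

static const ValkeyClusterRevisionVTable raft_rev_vtable = {
    raft_rev_compare, raft_rev_release
};

static ValkeyClusterRevision *raft_rev_new(uint64_t term, uint64_t index) {
    ValkeyClusterRevision *r = malloc(sizeof *r);
    r->term = term;
    r->index = index;
    return r;
}
```

With this shape the Core never inspects the revision; it only compares and releases through the vtable, which keeps the type opaque while avoiding raw `void*`.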

Comment on lines +374 to +377
Note: propagation is expressed as minimum guarantees. Modules are free to
propagate beyond the specified scope (e.g. propagate SHARD to all nodes). For
that reason, the Core should ensure shard-gossiped metadata is globally unique
(e.g. by prefixing with node ID or shard ID).
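The uniqueness guidance above amounts to namespacing shard-scoped keys. A minimal sketch (make_shard_key is a hypothetical helper, not part of the RFC):

```c
#include <stdio.h>
#include <stddef.h>
#include <string.h>

/* Prefix a shard-scoped metadata key with its shard ID so that entries
 * a Module gossips beyond their shard cannot collide across shards. */
static void make_shard_key(char *out, size_t n,
                           const char *shard_id, const char *key) {
    snprintf(out, n, "%s:%s", shard_id, key);
}
```

Two shards writing the same logical key ("primary", say) then produce distinct gossip entries even if the Module propagates them cluster-wide.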
Member

This feels like you're really defining a third type of data. Why not simply define a shard-local, eventually consistent namespace?

Author

I'm open to it. I iterated a lot on this. E.g. another idea was one Metadata channel with a configurable write/propagation concern, e.g. UpdateMetadata(propagation=SHARD/GLOBAL, write_concern=LOCAL/SHARD_QUORUM, ...), but you end up with some nonsense options like propagation=GLOBAL with write_concern=SHARD_QUORUM, which I didn't like

So I guess we could have something like three channels:

  1. Gossip-only, shard-local
  2. Gossip-only, global
  3. Quorum-commit, shard-local
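The three channels above can be encoded as the two dimensions mentioned earlier, with the nonsense combination rejected. A sketch with hypothetical names:

```c
/* Hypothetical encoding of the three metadata channels as a
 * (propagation, write concern) pair. */
typedef enum { PROP_SHARD, PROP_GLOBAL } Propagation;
typedef enum { WRITE_LOCAL, WRITE_SHARD_QUORUM } WriteConcern;

static int channel_is_valid(Propagation p, WriteConcern w) {
    if (p == PROP_SHARD  && w == WRITE_LOCAL)        return 1; /* 1. gossip, shard-local */
    if (p == PROP_GLOBAL && w == WRITE_LOCAL)        return 1; /* 2. gossip, global */
    if (p == PROP_SHARD  && w == WRITE_SHARD_QUORUM) return 1; /* 3. quorum-commit, shard-local */
    return 0; /* e.g. GLOBAL propagation with SHARD_QUORUM write concern */
}
```

Keeping the channels as three named concepts, rather than a free cross-product, is what keeps the invalid combination out of the interface.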


### 4.2 Remote Failure

When the Module detects a remote failure, the Module is expected to callback via
Member

What is the responsibility within the module for detecting remote failures? Is this time-bounded? Is the intra-shard timeframe the same as the cross-shard timeframe?

Author

I didn't define this since it will depend on how the clustering implementation aims to guarantee correctness. Modules interested in correctness/write consistency will need to implement the primary health via leasing/heartbeats. So like for Raft, primary needs to append an entry every X seconds, and when replicas don't see it for X*2 seconds, they need to mark it down. In theory a module could also have a less-correct implementation (a la cluster v1). If we go with this design and we implement our own module, it will come with a separate design that hashes these details out.
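The leasing scheme described (the primary renews every X seconds; replicas mark it down after 2X without a renewal) in a minimal sketch. The names and millisecond units are assumptions, not part of the RFC:

```c
#include <stdint.h>

/* Per-replica view of the primary's lease. */
typedef struct {
    uint64_t last_heartbeat_ms; /* last lease renewal this replica saw */
    uint64_t interval_ms;       /* X: how often the primary renews */
} Lease;

/* A replica considers the primary failed once it has seen no renewal
 * for 2 * X (the grace period described above). */
static int lease_primary_failed(const Lease *l, uint64_t now_ms) {
    return now_ms - l->last_heartbeat_ms > 2 * l->interval_ms;
}
```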

choice is left up to the Module and is not strictly required, but may make
implementation easier.

## 5 Cluster Formation and Operation Details
Member

It's not entirely clear how much of the code in this section is in the module and how much is built on top of the module's API. I think the intention is mostly the latter, right?

Author

Yeah I tried to split it like so:

  • Membership: code would mostly be in the Module. The Core just tells it what modifications to make, and the Module applies those changes in a way that is opaque to the Core (e.g. following the Raft membership change spec)
  • Metadata: code would be mostly in the Core. The Core will tell the module to change metadata in some way, but the module does not know why. The Module is just responsible for the consistency/durability/propagation of that opaque metadata

I can split the examples out to better show this

Member

@madolson left a comment

So, the main comment is just that I don't know if this is really better than the direction of the previous cluster V2. It was intended to require more implementation, but also meant that the backend could do literally whatever it wanted to do to materialize the clustering state. This new implementation seems to force a specific architecture on the backends, which seems less flexible overall to me.

Comment on lines +20 to +22
- Abstracted Membership: A push-based model where the Module informs the Core of
topology changes, enabling diverse backends (e.g., Raft, Multi-Raft, or
external stores like etcd).
Member

The current legacy split was to have the module maintain its own copy of the topology that the engine could fetch values from. The same is roughly true of the next two points: the engine is just getting the data that it wants. It requires a lot of boilerplate code in the theoretical future module, but it is not dictating the implementation. This new proposal is also tightly coupled to a specific backend.

Author

The current legacy split was to have the module maintain its own copy of the topology that the engine could fetch values from

Yeah I don't like how we refer to "topology" since it is really a mix of membership (who is in the cluster, which shard are they in) and metadata (which shards own what slots, which node leads a shard). To me that feels like a leaky abstraction.

The goal here was to really tease out those details so we don't end up with a vague concept of "topology". Now you have membership (static, transparent to module) and metadata (dynamic, opaque to the module). If, for a dumb/crazy example, tomorrow we wanted to change the number of slots to 32k, the clustering layer does not need to be aware of that. I think this is the right abstraction to make, but it breaks from cluster v1's "slots are transparent" and that's why I don't think we should bother with coming up with one uber-design that encompasses both.

The above goes for primaryship as well, although this one has more concrete examples. We will need to support more complex concepts there, e.g. active-active, synchronous replication, and there is no reason to force the module to handle those cases IMO.

All that said, we could say slots are a firm contract and make it visible to the clustering layer. Same for shard ownership/primaryship. There isn't necessarily anything really wrong with that, but it is just a more strict contract and I don't really know why we would need that if we can make the abstraction higher level.

Author

This new proposal is also tightly coupled to a specific backend.

You are going to need to explain more. Is it just because we split Gossip and Shard Log? There are other ways we can model that (e.g. propagation concern [shard/cluster/etc.] and write concern [shard quorum/not]), but I didn't want to explode the interface into a bunch of permutations

Comment on lines +450 to +452
- `PFAIL` - node is not yet canonically unhealthy (by the Module’s definition)
but is locally unhealthy. It is not strictly needed, but will be client
visible as PFAIL flag in the cluster topology
Member

We should kill off pfail. It's not practically useful to end clients.

Author

The only place I remember using PFAIL was in coordinated searches in ValkeySearch: we try to avoid sending to nodes in the shard that are in PFAIL. For that reason, I felt it could be useful
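As a sketch of that use case, a coordinated search could filter its fanout targets by local health state. NodeHealth and should_route_to are hypothetical names, not the ValkeySearch API:

```c
/* PFAIL: locally suspected but not yet canonically failed (per the
 * Module's definition); FAIL: canonically failed. A latency-sensitive
 * fanout can skip both rather than wait on a suspect node. */
typedef enum { NODE_HEALTHY, NODE_PFAIL, NODE_FAIL } NodeHealth;

static int should_route_to(NodeHealth h) {
    return h == NODE_HEALTHY;
}
```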

Comment on lines +433 to +435
Many modules will use the heartbeat to implement a leasing system in order to
detect network partitions. But it also serves to inform the Core that the Module
is still alive.
Member

It seems odd that the module would be down? What exactly does the module being alive mean? The core can be unresponsive because of long-running work, but we should be able to better fence off the module.

Author

If the module is on the main thread, then yeah, this doesn't really make sense. But I figured it may make sense to do consensus work on a background thread, in which case you can be exposed to things like deadlock or busy loops that prevent the heartbeat from arriving. From the Core's point of view, it wouldn't be any different from a partition, and it shouldn't violate correctness

@murphyjacob4
Author
Author

So, the main comment is just that I don't know if this is really better than the direction of the previous cluster V2.

Is there somewhere the proposal for the previous cluster v2 is documented? It's hard to discuss its merits since I don't have a very concrete idea of what it proposed. It iterated a lot, so I am also not sure which version you are referring to

That said - from what I can tell - no iteration of cluster V2 made an attempt to define a clear core <-> clustering contract. With the proposed split in this RFC, I think it would be really easy to do things like: Active-Active (store in shard-log metadata), Synchronous Replication (store in shard-log metadata), Cluster-level configs (store in gossip), Cross-Cluster replication (basically just a cluster-level config on the target cluster, then each shard follows the source shard). The primitives feel strong enough that the core would be less burdened with reasoning about the distributed state.

This new implementation seems to force a specific architecture on the backends, which seems less flexible overall to me.

So my whole intention was to decouple from architecture specifics. Maybe you can provide an example of an implementation that would be hard to implement on the new model? I provided examples throughout of a few different backends and how they would implement this but maybe there is some other way that you are thinking?



3 participants