Modular Clustering #35

Open
murphyjacob4 wants to merge 1 commit into valkey-io:main from murphyjacob4:main

Conversation

@murphyjacob4

This RFC is part 1 of 2 for "Cluster V2" valkey-io/valkey#384

  1. [This Proposal] Introduce a modular Core <-> Clustering interface
  2. Introduce a first-party cluster implementation following from 1

Signed-off-by: Jacob Murphy <jkmurphy@google.com>
/* The expected current revision. If not the actual latest, the Module
 * must fail the operation. If the key does not currently exist, use
 * NULL. This provides compare-and-swap (CAS) semantics. */
void *expected_revision;
Member

Any reason we can't be more specific about the data type of "version"? If it's intended to be an opaque type (defined by the cluster provider), then it should have its own type, with a vtable, not void.

Author

Yeah, it is arbitrary; e.g. in Raft I believe you need to specify term and index. I'm also fine defining it as a type. I figured some of the API might change a little when it comes to implementation (based on what is most ergonomic). Hard to anticipate now
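The compare-and-swap contract described in the quoted snippet could be sketched roughly like this. All names here (metadata_entry, metadata_cas_set, and the uint64_t payload behind the opaque revision pointer) are illustrative, not the proposed API:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative only: a revision is an opaque pointer; this sketch happens
 * to store a uint64_t behind it. NULL means "key does not exist". */
typedef struct {
    void *revision;
    const char *value;
} metadata_entry;

static int rev_matches(const void *current, const void *expected) {
    /* NULL only matches NULL (the "must not exist yet" case). */
    if (current == NULL || expected == NULL) return current == expected;
    return *(const uint64_t *)current == *(const uint64_t *)expected;
}

/* Returns 1 on success; 0 if expected_revision is stale (CAS failure). */
static int metadata_cas_set(metadata_entry *e, void *expected_revision,
                            void *new_revision, const char *new_value) {
    if (!rev_matches(e->revision, expected_revision)) return 0;
    e->revision = new_revision;
    e->value = new_value;
    return 1;
}
```

The key property is that a write carrying a stale (or NULL-when-the-key-exists) expected revision is rejected rather than silently overwriting.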

Comment on lines +323 to +325
`void*` pointers are used to allow flexibility in how revisions are defined. The
details of this will be left to the Module (e.g. in Raft, it may include both
term and index).
Member

Nit: rather than void, it should be an opaque type that has a vtable exported from the module.
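One possible shape for that nit, sketched under the assumption that the Module exports an opaque handle plus a vtable. All names here, including ValkeyClusterRevision and the Raft (term, index) example implementation, are hypothetical:

```c
#include <stdint.h>
#include <stdlib.h>

/* Opaque handle; only the Module knows the layout. */
typedef struct ValkeyClusterRevision ValkeyClusterRevision;

/* Vtable the Module would export alongside the opaque type. */
typedef struct {
    int  (*compare)(const ValkeyClusterRevision *a,
                    const ValkeyClusterRevision *b); /* <0, 0, >0 */
    void (*release)(ValkeyClusterRevision *r);
} ValkeyClusterRevisionVTable;

/* Example Module-side implementation: a Raft (term, index) pair. */
struct ValkeyClusterRevision { uint64_t term, index; };

static int raft_rev_compare(const ValkeyClusterRevision *a,
                            const ValkeyClusterRevision *b) {
    if (a->term != b->term) return a->term < b->term ? -1 : 1;
    if (a->index != b->index) return a->index < b->index ? -1 : 1;
    return 0;
}

static void raft_rev_release(ValkeyClusterRevision *r) { free(r); }

static const ValkeyClusterRevisionVTable raft_rev_vtable = {
    raft_rev_compare, raft_rev_release
};

static ValkeyClusterRevision *raft_rev_new(uint64_t term, uint64_t index) {
    ValkeyClusterRevision *r = malloc(sizeof *r);
    r->term = term;
    r->index = index;
    return r;
}
```

With this shape the Core never inspects the revision; it only compares and releases through the vtable, which keeps the type opaque while avoiding raw `void*`.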

Comment on lines +374 to +377
Note: propagation is expressed as minimum guarantees. Modules are free to
propagate beyond the specified scope (e.g. propagate SHARD to all nodes). For
that reason, the Core should ensure shard-gossiped metadata is globally unique
(e.g. by prefixing with node ID or shard ID).
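The uniqueness guidance above amounts to namespacing shard-scoped keys. A minimal sketch (make_shard_key is a hypothetical helper, not part of the RFC):

```c
#include <stdio.h>
#include <stddef.h>
#include <string.h>

/* Prefix a shard-scoped metadata key with its shard ID so that entries
 * a Module gossips beyond their shard cannot collide across shards. */
static void make_shard_key(char *out, size_t n,
                           const char *shard_id, const char *key) {
    snprintf(out, n, "%s:%s", shard_id, key);
}
```

Two shards writing the same logical key ("primary", say) then produce distinct gossip entries even if the Module propagates them cluster-wide.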
Member

This feels like you're really defining a third type of data. Why not simply define a shard-local, eventually consistent namespace?

Author

I'm open to it. I iterated a lot on this. E.g. another idea was one Metadata channel with a configurable write/propagation concern, e.g. UpdateMetadata(propagation=SHARD/GLOBAL, write_concern=LOCAL/SHARD_QUORUM, ...), but you end up with some nonsense options like propagation=GLOBAL with write_concern=SHARD_QUORUM, which I didn't like

So I guess we could have something like three channels:

  1. Gossip-only, shard-local
  2. Gossip-only, global
  3. Quorum-commit, shard-local
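The three channels above can be encoded as the two dimensions mentioned earlier, with the nonsense combination rejected. A sketch with hypothetical names:

```c
/* Hypothetical encoding of the three metadata channels as a
 * (propagation, write concern) pair. */
typedef enum { PROP_SHARD, PROP_GLOBAL } Propagation;
typedef enum { WRITE_LOCAL, WRITE_SHARD_QUORUM } WriteConcern;

static int channel_is_valid(Propagation p, WriteConcern w) {
    if (p == PROP_SHARD  && w == WRITE_LOCAL)        return 1; /* 1. gossip, shard-local */
    if (p == PROP_GLOBAL && w == WRITE_LOCAL)        return 1; /* 2. gossip, global */
    if (p == PROP_SHARD  && w == WRITE_SHARD_QUORUM) return 1; /* 3. quorum-commit, shard-local */
    return 0; /* e.g. GLOBAL propagation with SHARD_QUORUM write concern */
}
```

Keeping the channels as three named concepts, rather than a free cross-product, is what keeps the invalid combination out of the interface.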


### 4.2 Remote Failure

When the Module detects a remote failure, the Module is expected to callback via
Member

What is the responsibility within the module for detecting remote failures? Is this time-bounded? Is the intra-shard timeframe the same as the cross-shard timeframe?

Author

I didn't define this since it will depend on how the clustering implementation aims to guarantee correctness. Modules interested in correctness/write consistency will need to implement the primary health via leasing/heartbeats. So like for Raft, primary needs to append an entry every X seconds, and when replicas don't see it for X*2 seconds, they need to mark it down. In theory a module could also have a less-correct implementation (a la cluster v1). If we go with this design and we implement our own module, it will come with a separate design that hashes these details out.
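The leasing scheme described (the primary renews every X seconds; replicas mark it down after 2X without a renewal) in a minimal sketch. The names and millisecond units are assumptions, not part of the RFC:

```c
#include <stdint.h>

/* Per-replica view of the primary's lease. */
typedef struct {
    uint64_t last_heartbeat_ms; /* last lease renewal this replica saw */
    uint64_t interval_ms;       /* X: how often the primary renews */
} Lease;

/* A replica considers the primary failed once it has seen no renewal
 * for 2 * X (the grace period described above). */
static int lease_primary_failed(const Lease *l, uint64_t now_ms) {
    return now_ms - l->last_heartbeat_ms > 2 * l->interval_ms;
}
```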

choice is left up to the Module and is not strictly required, but may make
implementation easier.

## 5 Cluster Formation and Operation Details
Member

It's not entirely clear how much of the code in this section is in the module and how much is built on top of the module's API. I think the intention is mostly the latter, right?

Author

Yeah I tried to split it like so:

  • Membership: code would mostly be in the Module. The Core just tells it what modifications to make, and the Module applies those changes in a way that is opaque to the Core (e.g. following the Raft membership change spec)
  • Metadata: code would be mostly in the Core. The Core will tell the module to change metadata in some way, but the module does not know why. The Module is just responsible for the consistency/durability/propagation of that opaque metadata

I can split the examples out to better show this

Member

@madolson left a comment

So, the main comment is just that I don't know if this is really better than the direction of the previous cluster V2. It was intended to require more implementation, but also meant that the backend could do literally whatever it wanted to do to materialize the clustering state. This new implementation seems to force a specific architecture on the backends, which seems less flexible overall to me.

Comment on lines +20 to +22
- Abstracted Membership: A push-based model where the Module informs the Core of
topology changes, enabling diverse backends (e.g., Raft, Multi-Raft, or
external stores like etcd).
Member

The current legacy split was to have the module maintain its own copy of the topology that the engine could fetch values from. The same is roughly true of the next two points: the engine is just getting the data that it wants. It requires a lot of boilerplate code in the theoretical future module, but it is not dictating the implementation. This new proposal is also tightly coupled to a specific backend.

Author

The current legacy split was to have the module maintain its own copy of the topology that the engine could fetch values from

Yeah I don't like how we refer to "topology" since it is really a mix of membership (who is in the cluster, which shard are they in) and metadata (which shards own what slots, which node leads a shard). To me that feels like a leaky abstraction.

The goal here was to really tease out those details so we don't end up with a vague concept of "topology". Now you have membership (static, transparent to module) and metadata (dynamic, opaque to the module). If, for a dumb/crazy example, tomorrow we wanted to change the number of slots to 32k, the clustering layer does not need to be aware of that. I think this is the right abstraction to make, but it breaks from cluster v1's "slots are transparent" and that's why I don't think we should bother with coming up with one uber-design that encompasses both.

The above goes for primaryship as well, although this one has more concrete examples. We will need to support more complex concepts there, e.g. active-active, synchronous replication, and there is no reason to force the module to handle those cases IMO.

All that said, we could say slots are a firm contract and make it visible to the clustering layer. Same for shard ownership/primaryship. There isn't necessarily anything really wrong with that, but it is just a more strict contract and I don't really know why we would need that if we can make the abstraction higher level.

Author

This new proposal is also tightly coupled to a specific backend.

You are going to need to explain more. Is it just because we split Gossip and Shard Log? There are other ways we can model that (e.g. propagation concern [shard/cluster/etc.] and write concern [shard quorum/not]), but I didn't want to explode the interface into a bunch of permutations

Comment on lines +450 to +452
- `PFAIL` - node is not yet canonically unhealthy (by the Module’s definition)
but is locally unhealthy. It is not strictly needed, but will be client
visible as PFAIL flag in the cluster topology
Member

We should kill off pfail. It's not practically useful to end clients.

Author

The only place I remember using PFAIL was in coordinated searches in ValkeySearch: we try to avoid sending to nodes in the shard that are in PFAIL. For that reason, I felt it could be useful
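As a sketch of that use case, a coordinated search could filter its fanout targets by local health state. NodeHealth and should_route_to are hypothetical names, not the ValkeySearch API:

```c
/* PFAIL: locally suspected but not yet canonically failed (per the
 * Module's definition); FAIL: canonically failed. A latency-sensitive
 * fanout can skip both rather than wait on a suspect node. */
typedef enum { NODE_HEALTHY, NODE_PFAIL, NODE_FAIL } NodeHealth;

static int should_route_to(NodeHealth h) {
    return h == NODE_HEALTHY;
}
```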

Comment on lines +433 to +435
Many modules will use the heartbeat to implement a leasing system in order to
detect network partitions. But it also serves to inform the Core that the Module
is still alive.
Member

It seems odd that the module would be down? What exactly does the module being alive mean? The core can be unresponsive because of long-running work, but we should be able to better fence off the module.

Author

If the module is on the main thread, then yeah, this doesn't really make sense. But I figured it may make sense to do consensus work on a background thread, in which case you can be exposed to things like deadlock or busy loops that prevent the heartbeat from arriving. From the Core's point of view, it wouldn't be any different from a partition, and it shouldn't violate correctness

@murphyjacob4
Author
Author

So, the main comment is just that I don't know if this is really better than the direction of the previous cluster V2.

Is there somewhere the proposal for the previous cluster v2 is documented? It's hard to discuss its merits since I don't have a very concrete idea of what it proposed. It iterated a lot, so I am also not sure which version you are referring to

That said - from what I can tell - no iteration of cluster V2 made an attempt to define a clear core <-> clustering contract. With the proposed split in this RFC, I think it would be really easy to do things like: Active-Active (store in shard-log metadata), Synchronous Replication (store in shard-log metadata), Cluster-level configs (store in gossip), Cross-Cluster replication (basically just a cluster-level config on the target cluster, then each shard follows the source shard). The primitives feel strong enough that the core would be less burdened with reasoning about the distributed state.

This new implementation seems to force a specific architecture on the backends, which seems less flexible overall to me.

So my whole intention was to decouple from architecture specifics. Maybe you can provide an example of an implementation that would be hard to implement on the new model? I provided examples throughout of a few different backends and how they would implement this but maybe there is some other way that you are thinking?



3 participants