Conversation
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
> /* The expected current revision. If not the actual latest, the Module
>  * must fail the operation. If the key does not currently exist, use
>  * NULL. This provides compare-and-swap (CAS) semantics. */
> void *expected_revision;
Any reason we can't be more specific about the data type of "revision"? If it's intended to be an opaque type (defined by the cluster provider), then it should have its own type, with a vtable, not `void*`.
Yeah, it is arbitrary; e.g. in Raft I believe you need to specify term and index. I'm also fine with defining it as a type. I figured some of the API might change a little during implementation (based on what is most ergonomic); it's hard to anticipate now.
> `void*` pointers are used to allow flexibility in how revisions are defined. The
> details of this will be left to the Module (e.g. in Raft, it may include both
> term and index).
Nit: rather than `void*`, it should be an opaque type that has a vtable exported from the module.
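To make the suggestion concrete, here is a minimal sketch of what an opaque revision type with a module-exported vtable could look like. All names (`ModuleRevision`, `ModuleRevisionVtable`, `cas_allowed`) are hypothetical, and the Raft-style (term, index) layout is just one example of what a Module might hide behind the handle:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical opaque revision handle; the concrete layout lives in the
 * Module (e.g. a Raft module would store term + index behind it). */
typedef struct ModuleRevision ModuleRevision;

/* vtable exported by the Module so the Core can handle revisions
 * without knowing their contents. All names are illustrative. */
typedef struct {
    int (*compare)(const ModuleRevision *a, const ModuleRevision *b);
    ModuleRevision *(*clone)(const ModuleRevision *rev);
    void (*free)(ModuleRevision *rev);
} ModuleRevisionVtable;

/* Example Module-side implementation: a Raft-style (term, index) pair. */
struct ModuleRevision {
    unsigned long term;
    unsigned long index;
};

static int raft_compare(const ModuleRevision *a, const ModuleRevision *b) {
    if (a->term != b->term) return a->term < b->term ? -1 : 1;
    if (a->index != b->index) return a->index < b->index ? -1 : 1;
    return 0;
}

static ModuleRevision *raft_clone(const ModuleRevision *rev) {
    ModuleRevision *copy = malloc(sizeof(*copy));
    if (copy) *copy = *rev;
    return copy;
}

static void raft_free(ModuleRevision *rev) { free(rev); }

static const ModuleRevisionVtable raft_revision_vtable = {
    raft_compare, raft_clone, raft_free,
};

/* Core-side CAS check: allow the write only if the expected revision
 * matches the latest one, comparing through the vtable only. NULL
 * expected means "the key must not currently exist". */
int cas_allowed(const ModuleRevisionVtable *vt,
                const ModuleRevision *latest,
                const ModuleRevision *expected) {
    if (expected == NULL) return latest == NULL;
    if (latest == NULL) return 0;
    return vt->compare(latest, expected) == 0;
}
```

The Core never dereferences the revision itself; it only passes it back through the vtable, which keeps the CAS semantics from the quoted comment while removing the bare `void*`.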
> Note: propagation is expressed as minimum guarantees. Modules are free to
> propagate beyond the specified scope (e.g. propagate SHARD to all nodes). For
> that reason, the Core should ensure shard-gossiped metadata is globally unique
> (e.g. by prefixing with node ID or shard ID).
This feels like you're really defining a third type of data. Why not simply define a shard-local, eventually consistent namespace?
I'm open to it; I iterated a lot on this. E.g. another idea was one Metadata channel with configurable write/propagation concern, e.g. `UpdateMetadata(propagation=SHARD/GLOBAL, write_concern=LOCAL/SHARD_QUORUM, ...)`, but you end up with some nonsense options like `propagation=GLOBAL, write_concern=SHARD_QUORUM`, which I didn't like.
So I guess we could have something like three channels:
- Gossip-only, shard-local
- Gossip-only, global
- Quorum-commit, shard-local
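The three-channel idea above could be sketched as a single enum, which encodes only the valid propagation/write-concern combinations and rules out the nonsense ones by construction. The names here are invented for illustration, not part of the RFC:

```c
#include <assert.h>

/* Hypothetical metadata channels, one per valid combination from the
 * discussion (gossip shard-local, gossip global, quorum-commit
 * shard-local). Invalid pairings like global propagation with
 * shard-quorum write concern simply cannot be expressed. */
typedef enum {
    METADATA_GOSSIP_SHARD,  /* gossip-only, shard-local */
    METADATA_GOSSIP_GLOBAL, /* gossip-only, global */
    METADATA_QUORUM_SHARD,  /* quorum-commit, shard-local */
} MetadataChannel;

/* Whether a write on this channel must reach shard quorum before ack. */
int channel_needs_quorum(MetadataChannel ch) {
    return ch == METADATA_QUORUM_SHARD;
}

/* Whether entries on this channel must propagate beyond the local
 * shard. Propagation is a minimum guarantee, so shard-scoped keys
 * still need to be globally unique. */
int channel_is_global(MetadataChannel ch) {
    return ch == METADATA_GOSSIP_GLOBAL;
}
```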
> ### 4.2 Remote Failure
>
> When the Module detects a remote failure, the Module is expected to call back via
What is the responsibility within the module for detecting remote failures? Is this time-bounded? Is the intra-shard timeframe the same as the cross-shard timeframe?
I didn't define this since it will depend on how the clustering implementation aims to guarantee correctness. Modules interested in correctness/write consistency will need to implement primary health via leasing/heartbeats. For Raft, for example, the primary needs to append an entry every X seconds, and when replicas don't see one for X*2 seconds, they mark it down. In theory a module could also have a less-correct implementation (a la cluster v1). If we go with this design and implement our own module, it will come with a separate design that hashes out these details.
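The lease rule described in the reply (primary appends every X seconds, replicas mark it down after X*2 without an entry) reduces to a one-line predicate. The 2x ratio and the use of abstract time units are taken from the example above, not mandated by the RFC:

```c
#include <assert.h>

/* Illustrative replica-side lease check: the primary is expected to
 * refresh its lease every `interval` time units; a replica considers
 * it down once nothing has been seen for twice that long. The 2x
 * ratio is the example from the discussion, not a fixed contract. */
int primary_considered_down(long now, long last_heartbeat, long interval) {
    return (now - last_heartbeat) >= 2 * interval;
}
```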
> choice is left up to the Module and is not strictly required, but may make
> implementation easier.
>
> ## 5 Cluster Formation and Operation Details
It's not entirely clear how much of the code in this section is in the module and how much is built on top of the module's API. I think the intention is mostly the latter, right?
Yeah, I tried to split it like so:
- Membership: code would mostly be in the Module. The Core just tells it what modifications to make, and the Module makes those changes opaquely to the Core (e.g. following the Raft membership-changes spec).
- Metadata: code would mostly be in the Core. The Core will tell the Module to change metadata in some way, but the Module does not know why. The Module is just responsible for the consistency/durability/propagation of that opaque metadata.

I can split the examples out to better show this.
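The two halves of that split can be sketched as two tiny API surfaces. Everything here is invented for illustration (a toy module standing in for a real backend); the point is only that membership ops are interpreted by the Module while metadata is an uninterpreted blob to it:

```c
#include <assert.h>
#include <string.h>

/* Core -> Module: membership changes. The Core states what it wants;
 * how the change is applied (e.g. Raft joint consensus) is the
 * Module's business. */
typedef enum { MEMBER_ADD, MEMBER_REMOVE } MembershipOp;

/* Core -> Module: metadata updates. The value is opaque to the Module;
 * it only guarantees consistency/durability/propagation. */
typedef struct {
    const char *key;
    const void *value;
    size_t value_len;
} MetadataUpdate;

/* A toy Module that just records what it was asked to do. */
typedef struct {
    int members;
    char last_key[64];
} ToyModule;

void module_apply_membership(ToyModule *m, MembershipOp op) {
    m->members += (op == MEMBER_ADD) ? 1 : -1;
}

void module_store_metadata(ToyModule *m, const MetadataUpdate *u) {
    /* The Module never interprets u->value; it just persists it. */
    strncpy(m->last_key, u->key, sizeof(m->last_key) - 1);
    m->last_key[sizeof(m->last_key) - 1] = '\0';
}
```

In this framing, slot ownership or primaryship would be encoded by the Core inside the opaque value, which is exactly why the Module does not need to change if the slot count ever did.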
madolson left a comment:
So, the main comment is just that I don't know if this is really better than the direction of the previous cluster v2. It was intended to require more implementation, but it also meant that the backend could do literally whatever it wanted to materialize the clustering state. This new implementation seems to force a specific architecture on the backends, which seems less flexible overall to me.
> - Abstracted Membership: A push-based model where the Module informs the Core of
>   topology changes, enabling diverse backends (e.g., Raft, Multi-Raft, or
>   external stores like etcd).
The current legacy split was to have the module maintain its own copy of the topology that the engine could fetch values from. The same is roughly true of the next two points: the engine is just getting the data that it wants. It requires a lot of boilerplate code in the theoretical future module, but it is not dictating the implementation. This new proposal is also tightly coupled to a specific backend.
> The current legacy split was to have the module maintain its own copy of the topology that the engine could fetch values from
Yeah, I don't like how we refer to "topology" since it is really a mix of membership (who is in the cluster, which shard they are in) and metadata (which shards own which slots, which node leads a shard). To me that feels like a leaky abstraction.

The goal here was to really tease out those details so we don't end up with a vague concept of "topology". Now you have membership (static, transparent to the module) and metadata (dynamic, opaque to the module). If, for a dumb/crazy example, we wanted to change the number of slots to 32k tomorrow, the clustering layer would not need to be aware of that. I think this is the right abstraction to make, but it breaks from cluster v1's "slots are transparent", and that's why I don't think we should bother with coming up with one uber-design that encompasses both.

The above goes for primaryship as well, although this one has more concrete examples. We will need to support more complex concepts there, e.g. active-active and synchronous replication, and there is no reason to force the module to handle those cases IMO.

All that said, we could say slots are a firm contract and make them visible to the clustering layer, and the same for shard ownership/primaryship. There isn't necessarily anything really wrong with that, but it is a stricter contract, and I don't really know why we would need that if we can make the abstraction higher level.
> This new proposal is also tightly coupled to a specific backend.
You are going to need to explain more: is it just because we split Gossip and Shard Log? There are other ways we can model that (e.g. propagation concern [shard/cluster/etc.] and write concern [shard quorum/not]), but I didn't want to explode the interface into a bunch of permutations.
> - `PFAIL` - node is not yet canonically unhealthy (by the Module's definition)
>   but is locally unhealthy. It is not strictly needed, but will be
>   client-visible as a PFAIL flag in the cluster topology
We should kill off pfail. It's not practically useful to end clients.
The only place I remember using PFAIL was in coordinated searches in ValkeySearch: we try to avoid sending to nodes in the shard that are in PFAIL. For that reason, I felt it could be useful.
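The coordinated-search use case above amounts to filtering fanout targets by local health. This is a sketch under assumed names (the health enum and filter function are not from ValkeySearch's actual API), showing why a PFAIL state distinct from FAIL is useful to a coordinator:

```c
#include <assert.h>

/* Hypothetical node-health states: PFAIL is a local suspicion that has
 * not yet been canonically confirmed by the Module. */
typedef enum { NODE_OK, NODE_PFAIL, NODE_FAIL } NodeHealth;

/* Collect indices of nodes worth sending a coordinated query to,
 * skipping both failed and merely-suspected (PFAIL) nodes. Returns
 * the number of targets written to `out`. */
int pick_search_targets(const NodeHealth *nodes, int n, int *out) {
    int count = 0;
    for (int i = 0; i < n; i++)
        if (nodes[i] == NODE_OK) out[count++] = i;
    return count;
}
```

Without the PFAIL distinction, a coordinator could only react after the Module canonically fails the node, losing the window where avoiding it is cheap.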
> Many modules will use the heartbeat to implement a leasing system in order to
> detect network partitions. But it also serves to inform the Core that the Module
> is still alive.
It seems odd that the module could be down. What exactly does the module being alive mean? The core can be down because of long-running work, but we should be able to better fence off the module.
If the module is on the main thread, then yeah, this doesn't really make sense. But I figured it may make sense to do consensus work on a background thread, in which case you can be exposed to things like deadlocks or busy loops which prevent the heartbeat from arriving. From the core's point of view, that wouldn't be any different from a partition, and it shouldn't violate correctness.
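The Core-side view described above is a simple watchdog over the module's heartbeat. Names and the timeout value are assumptions for the sketch; the point is that a stuck background thread and a partitioned peer look identical to the Core:

```c
#include <assert.h>

/* Hypothetical Core-side watchdog over the Module's heartbeat. If the
 * Module's background thread deadlocks or busy-loops, the heartbeat
 * stops arriving, which the Core treats the same as a partition. */
typedef struct {
    long last_heartbeat;
    long timeout; /* how long silence is tolerated; units are abstract */
} ModuleWatchdog;

/* Called whenever the Module heartbeats the Core. */
void watchdog_beat(ModuleWatchdog *w, long now) {
    w->last_heartbeat = now;
}

/* The Core presumes the Module alive only while heartbeats are fresh. */
int module_presumed_alive(const ModuleWatchdog *w, long now) {
    return (now - w->last_heartbeat) < w->timeout;
}
```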
Is there somewhere the proposal for the previous cluster v2 is documented/proposed? It's hard to discuss its merits since I don't have a very concrete idea of what it proposed. It iterated a lot, so I am also not sure which version you are referring to.

That said, from what I can tell, no iteration of cluster v2 made an attempt to define a clear core <-> clustering contract. With the proposed split in this RFC, I think it would be really easy to do things like:
- Active-Active (store in shard-log metadata)
- Synchronous Replication (store in shard-log metadata)
- Cluster-level configs (store in gossip)
- Cross-cluster replication (basically just a cluster-level config on the target cluster, with each shard following the source shard)

The primitives feel strong enough that the core would be less burdened with reasoning about the distributed state.
So my whole intention was to decouple from architecture specifics. Maybe you can provide an example of an implementation that would be hard to build on the new model? I provided examples throughout of a few different backends and how they would implement this, but maybe there is some other approach you are thinking of?
This RFC is part 1 of 2 for "Cluster V2" valkey-io/valkey#384