Conversation
This document defines the problem of insufficient write durability in Valkey Cluster and outlines the core requirements for a solution. Signed-off-by: Ping Xie <pingxie@google.com>
Pull Request Overview
This PR introduces a Product Requirements Document (PRD) that defines the problem of insufficient write durability in Valkey Cluster and establishes requirements for a quorum-based durability solution. The document addresses the gap between write acknowledgment and actual data persistence across topology changes.
- Identifies durability challenges in current asynchronous replication model
- Defines requirements for explicit per-command and connection-level durability controls
- Establishes architectural constraints to preserve performance for non-durable workloads
Signed-off-by: Ping Xie <pingxie@google.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Ping Xie <pingxie@outlook.com>
Signed-off-by: Ping Xie <pingxie@google.com>
ranshid left a comment:

I like this doc as it mainly focuses on the problem and less on the technical solution.
> * **Leader Election**: A write may be acknowledged by the primary and propagated to a subset of replicas, but a different replica—one that never received the write—can be legitimately promoted to primary. After promotion, the shard's authoritative history will exclude that write.
This assumes that we had more than a single replica and we only waited for a single approval, right? Maybe we should emphasize this?
When you say "we waited", are you referring to the use of the WAIT command?
> This assumes that we had more than a single replica and we only waited for a single approval, right? Maybe we should emphasize this?
Yes. I am talking about the generic case of N replicas. A one-replica setup doesn't have an election concern.
> * **Slot Migration Rollback**: A data loss scenario occurs when the source shard broadcasts a higher config epoch after the migration has completed, forcing the target shard to roll back the ownership transfer and discard any writes it accepted in the interim. This can be triggered by a failover in the source shard or by a concurrent migration conflict where the source shard bumps its own epoch.
You refer to the legacy SM, IIUC. I would state why using Atomic Slot Migration would not help in this case.
ASM has the same durability gap today, but that is not the main point. I am using "Slot Migration" in a generic sense and as the context for the later "Solution Requirements" section ("Membership-awareness").
Yeah, we need a two-phase commit for the slot migration?
Yes. Slot ownership needs to be durably committed along with its data. So we will need to 2PC the data between shards with ASM. That will promote slot migration to a real 2PC as well.
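The two-phase idea above can be sketched minimally as follows. This is an illustrative model, not Valkey code: `Shard`, `durably_record`, and `migrate_slot` are hypothetical stand-ins, and `log` stands in for a quorum-committed, durable record.

```python
# Illustrative sketch of 2PC for slot ownership transfer: ownership is
# committed only after both shards durably record the handoff, so a later
# epoch bump on the source cannot silently revert the transfer.

class Shard:
    def __init__(self, name):
        self.name = name
        self.log = []  # stand-in for quorum-committed, durable decisions

    def durably_record(self, entry):
        self.log.append(entry)

def migrate_slot(slot, source, target):
    # Phase 1 (prepare): both sides durably record the intent to transfer.
    source.durably_record(("PREPARE", slot, target.name))
    target.durably_record(("PREPARE", slot, source.name))
    # Phase 2 (commit): only once both prepares are durable is ownership moved.
    source.durably_record(("COMMIT", slot, target.name))
    target.durably_record(("COMMIT", slot, source.name))
    return target.name

a, b = Shard("A"), Shard("B")
new_owner = migrate_slot(5, a, b)
```

The key property is that the source's durable log contains the COMMIT record, so a post-migration failover in the source shard replays the handoff instead of reverting it.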
Just to clarify - is the example below the scenario being called out?

- Imagine slot 5 is migrated from Shard A to Shard B. A client writes "SET k v" to Shard B after the migration. Later, Shard A bumps its config epoch (say due to failover), and the cluster decides Shard A still owns slot 5. Shard B must roll back and discard "SET k v", so the client's acknowledged write is lost.
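That scenario follows directly from the cluster's epoch-based convergence rule. A hedged model (names are illustrative, not Valkey internals):

```python
# Hypothetical model of the "highest config epoch wins" rule that forces
# the rollback described in the example above.

def resolve_slot_owner(claims):
    # Cluster convergence rule: the claimant with the highest epoch wins.
    return max(claims, key=lambda c: c["epoch"])["shard"]

# Slot 5 finished migrating to Shard B at epoch 10, and B accepted SET k v.
# Shard A then bumps its epoch to 11 (e.g. due to a failover in A).
claims = [
    {"shard": "A", "epoch": 11},
    {"shard": "B", "epoch": 10},
]

owner = resolve_slot_owner(claims)
# A reclaims slot 5, so B must roll back and discard the acknowledged write.
```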
> * **Two-Node Shard Tradeoffs**: In a 1-primary/1-replica topology, any durability guarantee inherently trades availability for durability. Under a single-node failure, requests expecting a durable acknowledgement must fail clearly and predictably.
I am not sure I understand this paragraph. Is the intention to say that in order to gain full durability we have to use at least 1 replica?
It is the other way around. The idea is to explicitly call out the need to support strong durability for the very common HA setup of 1p1r. The non-HA 1p0r setup is a special use case IMO, which might or might not require a different solution. I can add a clause to spec out the requirement/experience for 1p0r without, again, hinting at any solutions.
With only two nodes (1p1r) it's not possible to tell the difference between primary failure and netsplit. The replica can never get a majority.
If we move the quorum to the shard level, I'd say we need at least two replicas, but I think it's a totally acceptable limitation. Now we can instead allow single-shard and 2-shard clusters.
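The arithmetic behind this point is just strict-majority voting. A minimal sketch:

```python
# A quorum is a strict majority of the shard's voting nodes. In 1p1r a lone
# surviving replica can never reach it, which is why it cannot distinguish
# a primary failure from a network split and safely win an election.

def majority(voters):
    return voters // 2 + 1

# 1p1r: 2 voters, majority is 2 -- a lone replica has only its own vote.
# 1p2r: 3 voters, majority is 2 -- two surviving nodes can still elect.
p1r1_quorum = majority(2)
p1r2_quorum = majority(3)
```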
I would also prefer 1p2r per shard to be the minimum for strong durability. We can add support for observer nodes later that process the stream and can handle the network split case for multiple shards.
> I would also prefer 1p2r per shard to be the minimum for strong durability.
+1, but I think we can leave this decision to the execution. BTW, the framework that I am going with is:
- consensus on requirements, which is this RFC
- an HLD that doesn't paint us into an architectural corner for all the things we would like to achieve down the road, including data tiering, 1p1r, etc.
- an execution plan for 9.1 and 10.0, maybe with 9.1 laying down the foundation, and 10.0 with the MVP (1p2r for instance)
I think this is a good high-level plan.
The HLD is hard to know long term though. 1p1r is obviously always possible without availability, but we should probably decide on the availability story.
Regarding the 1p1r vs. 1p2r per shard debate: if the durable system is completely decoupled from the Valkey engine, wouldn't we be able to support 1p1r? I think the design would be dictated by this key requirement.
@hpatro - if the durable system is completely decoupled from the Valkey engine, are you thinking about a durability and consistency system with Raft/Paxos through a module?
> simpler code and stronger durability guarantees.
>
> ## **Key Durability Challenges**
I think we also need to drop some words about AOF durability guarantees on the replica side. Like, if the AOF is always fsynced on the replica, it might help prevent any type of data loss when combined with full replica write acknowledgments, right?
Should we leave this to the HLD, as one of the potential solutions? As you have rightly noticed, my plan is to spec out just the requirements/experience in this PRD doc.
Durability comes from the Latin "durus" = hard, as in hard disk.

But I think it's a misconception. The point of durability is that the data survives various levels of failures, such as a RAM power outage or a software crash.

In the times before distributed databases, this meant fsync to disk. RAID adds disk redundancy. Even weekly backups to a disk that someone brings home, for the event that the server room explodes...

But with distributed databases, we get the redundancy by keeping multiple replicas in different availability zones. I'd argue that a disk isn't necessary.

The `appendfsync always` setting is orthogonal to this discussion, the way I see it.
Yeah, FWIW I always say we're adding durability and consistency. Valkey written to multiple disks with AOF may be durable, but it has long failure times. If you fail over to a replica, you lose consistency.
I agree with everything. But in order to fully pass the message of what we would like to solve, we need to make sure we emphasize the problems with the alternatives.
Yes. Durability + consistency + high availability is all in my goal.

There is durability with a 1p0r aof-always-fsync setup, and I am sure it works for some users, but the lack of high availability is a big gap IMO.
> * **Clear Failure Modes**: If a client requests a durable write and the server cannot confirm it within the timeout, the server must return an explicit error, not an ambiguous success. This prevents a "silent downgrade" in durability where a client might incorrectly assume a write was successful. This clear error signal informs the client that the write's state is unknown, enabling the application to safely retry the operation, which should be designed to be idempotent.
This will not serve as an indication of whether the data exists or not, right? Like in shared-lock algorithms, it will not help understand whether to assume the lock was indeed taken or not, correct?
That is the whole point of this paragraph, i.e., the durable signal needs to be clear and durable. In the key challenges section, I gave the example of today's slot migration losing a WAITed write in some corner-case but non-zero-probability scenarios. The ask here is that when the system can't ensure data durability, it needs to clearly say so. Don't accept the request but silently drop the guarantee.
So you basically call out that a durable solution should provide a way to be sure whether the data exists or not? What about network disconnects? Do we require a durable solution to ensure the data was not written down (like a 3-way handshake for writes)?
> What about network disconnects?

Are you talking about a networking issue between the client and server, or network partitioning among the nodes? Regardless, I think this is a good reminder that I need to define the fault domain in this RFC as well. We are apparently not signing up for the day when the Earth disappears...
I think we should focus on not returning OK to writes that are not yet made durable.

> the server must return an explicit error, not an ambiguous success

Whether there is an explicit error reply or no reply at all (as in disconnects) is not important IMHO. Let's reword the part about "explicit error"?
Right, we should make clear what are the durability semantics of write operations. We should state something like:
- If a client submits a write operation and receives a success response from the server, the data is guaranteed to be durable.
- If a client submits a write operation and receives a failure response from the server, no data has been written.
- If a client submits a write operation but does not receive any response from the server (due to a timeout, connection failure, etc.), there are no guarantees regarding whether the data was written or not.
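A client honoring the three outcomes above might look like the following sketch. `send_write` and `WriteTimeout` are hypothetical stand-ins, not an existing Valkey client API, and the retry is safe only because SET is idempotent.

```python
# Sketch of client-side handling of the three durability outcomes above.
# `send_write` / WriteTimeout are illustrative, not a real client API.

class WriteTimeout(Exception):
    """No response received: the write's durability is unknown."""

def durable_set(conn, key, value, attempts=3):
    for _ in range(attempts):
        try:
            reply = conn.send_write("SET", key, value)
        except WriteTimeout:
            continue  # outcome 3: state unknown; SET is idempotent, so retry
        if reply == "OK":
            return True   # outcome 1: success implies the write is durable
        return False      # outcome 2: explicit failure implies nothing written
    raise WriteTimeout("write durability still unknown after retries")
```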
Agreed. Will incorporate @rjd15372's proposal.
Thanks for getting this started @PingXie. I've a slightly different take on durability.
Sharing the high level thought here:
- I would like to maintain the same semantics of Valkey and avoid introducing any client behavior change. The client wouldn't receive a success/failure back until the quorum has committed the operation. During that period the client would be in a blocked state on the primary.
- I would like to introduce the durability configuration at the server level and keep it immutable. By that I intend to make the user opt in to the performance degradation of the system that comes with the better durability guarantee, and not see it as a surprise. Introducing the concept at the connection level would make the debuggability of the system very hard.
> * **Authoritative Acknowledgement**: A durable write's acknowledgement must be tied to a single, authoritative decision at the shard level, ensuring it remains valid through any topology change like a leader election or slot migration.
Why does it have to be authoritative?

This sounds like slot-level consistency.
I use "authoritative" to qualify the replication history. There needs to be one deterministic replication history within a given shard that is consistent with the client expectation. We don't have a slot-level replication history today, but if/when we get there, this would apply too.
To be pedantic, I think that is the same as saying we have sequential consistency for each slot. There is some total ordering of requests against a slot, and that is preserved (even though nodes may be at different parts along the ordering).

I think the requirement is sensible, though; just suggesting a more formal way of saying it.
I agree with Madelyn; it looks like what we want is "sequential consistency for each slot", and that should be stated clearly as a requirement.
Not sure I understand the definition of "sequential consistency for each slot". The linearizability requirement is at the shard level: there should be a total order for all writes for a given shard, and this total order is what I called "authoritative". But I will use "total order" going forward, which seems like the more well-established term.
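The "one total order per shard" property being discussed can be stated compactly: every node's applied history must be a prefix of the shard's total order (a lagging node is fine; a diverged one is not). A minimal sketch:

```python
# Minimal statement of the per-shard total-order property discussed above:
# each node's applied history must be a prefix of the shard's total order.

def is_prefix(history, total_order):
    return history == total_order[: len(history)]

total_order = ["SET a 1", "SET b 2", "SET c 3"]

# A lagging replica is fine -- it is simply earlier in the same order.
lagging_ok = is_prefix(["SET a 1", "SET b 2"], total_order)
# A replica that skipped a write has diverged from the authoritative history.
diverged = is_prefix(["SET a 1", "SET c 3"], total_order)
```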
> * **Clear Failure Modes**: If a client requests a durable write and the server cannot confirm it within the timeout, the server must return an explicit error, not an ambiguous success. This prevents a "silent downgrade" in durability where a client might incorrectly assume a write was successful. This clear error signal informs the client that the write's state is unknown, enabling the application to safely retry the operation, which should be designed to be idempotent.
Why do we want to introduce a new mechanism? No response would imply the request hasn't succeeded yet.
The point here is that we never return success (+OK) before we know for sure a write is durably committed in the cluster.

Beyond that, whether we return an error reply or no reply at all shouldn't be that important, but there will always be some cases, like if the connection is lost while the client is waiting for a write to be committed. Then the client can't know whether it was committed or not.

I think we should drop the formulation "must return an explicit error" and instead focus on not returning success unless we know for sure.
> The point here is that we never return success (+OK) before we know for sure a write is durably committed in the cluster.

That is exactly what I don't quite like about the way WAIT is implemented. However, I also didn't want to completely throw away WAIT and introduce read/write concerns, which I consider over-engineered (I can expand on this if there is interest). Today, WAIT returns the number of replicas that have applied the changes. It requires the application developer to understand the quorum, which IMO is a cognitive overload. I would like to return an explicit message on whether or not the client's write was committed in quorum, but this alone breaks the compatibility that I was hoping for in the requirements section - maybe we do need a new WAIT variant, WAIT_QUORUM?
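The "cognitive overload" here is the quorum math that WAIT currently leaves to the application: WAIT returns a bare replica count, and the developer must turn it into a majority decision. A hedged sketch of that math (the helper name is illustrative):

```python
# WAIT returns how many replicas acknowledged a write; deciding whether
# that constitutes a quorum is left to the application today.

def met_quorum(wait_reply, num_replicas):
    voters = num_replicas + 1   # the primary plus its replicas
    acked = wait_reply + 1      # WAIT counts replicas only; add the primary
    return acked > voters // 2  # strict majority of the shard

# 1p2r shard: one replica acked -> primary + replica = 2 of 3 voters, quorum.
# Zero replicas acked -> only the primary has the write, no quorum.
one_acked = met_quorum(1, 2)
none_acked = met_quorum(0, 2)
```

A hypothetical WAIT_QUORUM variant would fold this decision into the server's reply instead of pushing it onto every application.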
IMO the WAIT command should not be part of the durability feature. WAIT only makes sense when using asynchronous replication.
If we enable durability of write commands, then when the server returns the response to a client write command, it must be guaranteed that the data is persisted and durable.
I am fine with dropping WAIT for now for simplification, but as I mentioned in the other comment, WAIT has its value and I think it can be incorporated into Raft (by waiting for a quorum ACK for a given offset only).
> * **Preserve `WAIT` Command Interface**: The syntax (`WAIT <replicas> <timeout>`) and integer return value of the existing command must not be changed.
I don't see any benefit of coupling this with the durability system, which we can build independently.

Agree. It's related as background information though, when explaining why WAIT is not enough in the current design.
Not sure I understand the "coupling" and "background information" callouts?

This clause is about interface compatibility. My take is that the WAIT command was introduced to solve the exact same core problem; its failure is not with the interface but with the design/implementation. I would like to see if we can retain the syntax while fixing things under the hood. This helps with the learning curve on the developer side.
OK... By "background information" I meant earlier attempts. I guess I find it hard to think about the requirements without seeing a solution. Let's keep this then.
I would drop this requirement. IMO the WAIT command only makes sense when a system uses asynchronous replication.

If a client wants to enable the durability feature, it should not be required to use WAIT, and should rewrite its existing application code accordingly.
WAIT is an optimization that allows users to balance durability and performance explicitly. I am fine with every write being "durable" in V1, but I do think there is value down the road.
> * **Flexible Durability Controls**: The solution must also provide a connection-level opt-in for durable-by-default behavior.
Shouldn't we be doing this at the server level for ease of understanding by developers? All write operations would be durable.
I think that is a very reasonable first version. I do want to provide some configuration agility in the future, but I agree that it shouldn't be a blocker. I will remove this.
zuiderkwast left a comment:
Thank you for this work!
> Production users increasingly run workloads where **loss of an acknowledged write is unacceptable**, not just in steady state but also during failover and cluster changes. In practice, users have two distinct needs for durability controls:
We could add a 3rd option: a global config.
It's hard to ensure these quorum-based writes without moving leader election to the shard, making the majority of replicas in the shard the quorum. Then we're talking about a new consensus model.
If we go for something like Raft-based replication, can we still allow both asynchronous and synchronous (ack-by-majority) replication? Maybe we can...
I do like the per-client wait behavior, but I'm OK with all writes being durable.
madolson left a comment:
I would also like one more requirement: we shouldn't invent our own consistency model or algorithm. When data integrity is on the line, we don't want to accidentally lose data because we tried to tune it better.

This set of requirements more or less defines a slot-level durability and consistency guarantee. That can either be done through synchronous replication (if a replica is down, writes fail) or eventual consistency (we accept consistency gaps between replicas). Both seem doable.
> latency for many workloads, but it also creates a window where an **acknowledged** write can be lost if the primary fails before replicas make the write durable. The existing `WAIT <replicas> <timeout>` command provides a client-side synchronization mechanism for clients to block until a specified number of replicas have recorded the write. However, this propagation-based
WAIT also exposes the dirty data to other clients, who might then take action on it but have it later get reverted. This is why just replicating to a majority is not sufficient.
Yeah, this document should mention that we can't allow other clients to read uncommitted writes. That would be inconsistent. (This is why we also need a WAL, but let's not mention WAL in this document.)
Agreed. "Read committed writes only on the primary" is a better way to phrase it. I will add it to the requirements section.
Ideally we should fulfill this extra requirement, but I believe it adds a lot of pain in terms of slowness and complexity. It may require MVCC or something equivalently complex.

I believe it's possible to provide durability guarantees for write commands and/or WAIT while still allowing dirty writes (writes that are not yet made durable). That's the read-uncommitted isolation level, which is not perfect, but I think it can be fast and not very complex, and it can be done without locks and MVCC.

Let's keep this requirement but give it a number, so we can refer to the requirements and implement incrementally?
Just to clarify: why not make both read isolation configurable and enforce 1p2r as the default when DURABILITY=QUORUM is declared?
– Configurable isolation:
• Cluster default = READ_COMMITTED + DURABILITY=QUORUM.
• Connections may explicitly opt into READ_UNCOMMITTED for latency, but not stronger guarantees than the cluster supports.
• On failover/migration: READ_COMMITTED readers stall or error until durable state is restored; READ_UNCOMMITTED readers must re-negotiate.
– Topology enforcement:
• Quorum durability requires ≥3 voters, so 1p2r should be the safe default.
• If users want 1p1r, they must explicitly choose a weaker mode (e.g., DURABILITY=BEST_EFFORT).
• No silent downgrades — the durability/isolation contract should be fixed at cluster creation.
This gives safe defaults (1p2r + READ_COMMITTED), flexibility for latency-sensitive use cases, and avoids the pitfalls of hidden behavior changes.
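The "no stronger guarantees than the cluster supports" rule above could be enforced by a simple negotiation check. A hedged sketch; the mode names (READ_COMMITTED, READ_UNCOMMITTED) and the function are illustrative, not an agreed-on interface:

```python
# Hypothetical check for the "no silent downgrades / no opt-up" rule: a
# connection may opt down for latency but never exceed the cluster's
# configured isolation, and refusal is an explicit error, never silent.

ISOLATION_STRENGTH = {"READ_UNCOMMITTED": 0, "READ_COMMITTED": 1}

def negotiate_isolation(cluster_mode, requested):
    if ISOLATION_STRENGTH[requested] > ISOLATION_STRENGTH[cluster_mode]:
        raise ValueError("connection cannot exceed cluster isolation level")
    return requested
```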
The focus of this RFC is on durability and linearizability. Read-uncommitted/dirty read is conceptually a property of isolation, which is very important, but I would suggest keeping it out of this RFC to keep the scope contained.

For the 1p1r setup to reach consensus at the shard level, we would need a witness node, which adds complexity. This is why I propose supporting (quorum) durability for 1p2+r only, leaving all other setups to the existing cluster architecture for now.
I am aligned in principle but I feel that we might have to introduce a new

I would make sure the requirements RFC leaves enough room for this option. That being said, I also feel that this is very likely the way to go, at least for our very first strong-durability release.

YES, fully agreed. Will add it to the requirements.
Thanks for all the feedback and discussions. I will update the doc next and also collate open questions. Reusing a comment I left earlier:
Thanks for the draft! A few additional areas to consider in the PRD:
> * **Preserve `WAIT` Command Interface**: The syntax (`WAIT <replicas> <timeout>`) and integer return value of the existing command must not be changed.
>
> * **Flexible Durability Controls**: The solution must also provide a connection-level opt-in for
Why is the connection level important? Why is the shard or cluster level not sufficient? Are there specific use cases where customers would benefit from this? Hoping not to over-engineer if there is no real need.
I think there is, but I do agree that this is going into the design area, which was not my intent for this doc. I have rephrased the configurability to something more vague and will remove the per-command/per-connection and WAIT callouts from this RFC. We can continue this discussion in the HLD.
> that avoids the I/O and network latency costs of a durable, quorum-based commit.
>
> * **Foundation for Future Features**: The chosen durability mechanism should provide a clear and extensible foundation for future high-value features, most notably **data tiering**. This would
Why are durability and data tiering interconnected? Are you suggesting we cannot support data tiering for non-durable workloads that use asynchronous replication?

Does the same argument apply to Active-Active replication as well? Do we need durability as a foundation to support Active-Active replication?
> Are you suggesting we cannot support data tiering for non-durable workloads that use asynchronous replication?

It is actually the opposite: we need to make sure that whatever design we pick in the end for durability doesn't paint us into an architectural corner. Data tiering and active-active are both big bets for Valkey too. This callout should be a no-brainer IMO, but I consider it too important a point to miss.
This document defines the problem of insufficient write durability in Valkey Cluster and outlines the core requirements for a solution. It details the failure modes of the current asynchronous replication model and establishes the principles for a new, quorum-based durability mechanism that will provide strong durability guarantees without regressing on performance for non-durable workloads.
@valkey-io/core-team FYI