# Add draft PRD for Valkey Durability #29
---
RFC:
Status: Draft
---

## **Abstract**
Valkey clusters optimize for low latency: a primary executes a write and acknowledges it immediately; replicas receive the update asynchronously. This approach delivers excellent throughput and tail latency for many workloads, but it also creates a window in which an **acknowledged** write can be lost if the primary fails before replicas make the write durable. The existing `WAIT <replicas> <timeout>` command provides a client-side synchronization mechanism that blocks until a specified number of replicas have recorded the write. However, this propagation-based acknowledgement does not provide a design-level guarantee of durability across all topology changes, including failover, replica additions/removals, and horizontal scaling via slot migration.
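The client-visible behavior of `WAIT` described above can be modeled with a small Python sketch (illustrative only; `wait_for_replicas` and the offset list are hypothetical stand-ins for the server's replication-offset bookkeeping, not Valkey's implementation):

```python
import time

def wait_for_replicas(replica_offsets, write_offset, num_replicas, timeout_s):
    """Model of WAIT's semantics: block until `num_replicas` replicas report a
    replication offset at or past our write, or until the timeout expires.
    Returns the replica count, like WAIT's integer reply."""
    deadline = time.monotonic() + timeout_s
    while True:
        acked = sum(1 for off in replica_offsets() if off >= write_offset)
        if acked >= num_replicas or time.monotonic() >= deadline:
            return acked

# Toy run: of three replicas, two have already replicated past offset 100.
print(wait_for_replicas(lambda: [120, 105, 80], 100, 2, 0.1))  # → 2
```

Note that even a successful return here only confirms propagation at a point in time; it says nothing about whether those replicas survive the next failover.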
## **Motivation**

Production users increasingly run workloads where **loss of an acknowledged write is unacceptable**, not just in steady state but also during failover and cluster changes. In practice, users have two distinct needs for durability controls:
> **Contributor:** We could add a 3rd option: a global config. It's hard to ensure these quorum-based writes without moving leader election to the shard, making the majority of replicas in the shard the quorum. Then we're talking about a new consensus model. If we go for something like Raft-based replication, can we still allow both asynchronous and synchronous (ack-by-majority) replication? Maybe we can... I do like the per-client wait behavior, but I'm OK with all writes being durable.
1. **Explicit, Per-Command Control**: Users require that the existing `WAIT` command provides a meaningful durability guarantee. When a client explicitly invokes `WAIT` after a write **with a replica count sufficient to form a quorum**, its successful return must signify that the write will be preserved across topology changes. This allows applications to make deliberate, fine-grained tradeoffs between latency and durability.
2. **Connection-Level Opt-In**: For ease of use and safety, users need a way to configure a connection to be durable by default. This allows an application to establish its durability requirement once, ensuring all subsequent writes on that connection are treated as durable without needing to add `WAIT` after every command. This approach reduces application complexity by removing the burden on developers to correctly identify every critical synchronization point in their logic. It allows them to make a deliberate tradeoff, accepting higher write latency in exchange for simpler code and stronger durability guarantees.
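Need (1) corresponds to a client-side pattern along these lines (a hedged Python sketch; `durable_set` is a hypothetical helper, while `SET` and `WAIT` are existing commands):

```python
def durable_set(client, key, value, num_replicas, timeout_ms=1000):
    """Write, then block until enough replica acks exist to form a shard majority.

    For a shard of 1 primary + num_replicas replicas, a majority is
    (num_replicas + 1) // 2 + 1 nodes. The primary counts as one of them,
    so (num_replicas + 1) // 2 replica acknowledgements are needed.
    """
    client.execute_command("SET", key, value)      # primary ack = acceptance only
    needed = (num_replicas + 1) // 2               # replica acks for a quorum
    acked = client.execute_command("WAIT", needed, timeout_ms)
    if acked < needed:
        # Durability was not confirmed in time: fail loudly rather than
        # silently downgrading to an asynchronous write.
        raise TimeoutError(f"only {acked}/{needed} replicas acknowledged")
    return acked
```

For a 1p2r shard this waits for one replica ack (primary plus one replica is 2 of 3 nodes). As the abstract notes, today even a successful `WAIT` does not survive all topology changes; closing that gap is the point of this PRD.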
## **Key Durability Challenges**
> **Member:** I think we also need to drop some words about AOF durability guarantees on the replica side. Like, in case the AOF is always fsync on the replica, it might help prevent any type of data loss when combined with full replica write acknowledgments, right?
>
> **Member (Author):** Should we leave this to the HLD, as one of the potential solutions? As you have rightly noticed, my plan is to spec out just the requirements/experience in this PRD doc.
>
> **Contributor:** Durability comes from Latin "durus" = hard, as in hard disk. But I think it's a misconception. The point of durability is that the data survives various levels of failures, such as RAM power outage or software crash. In the times before distributed databases, this meant fsync to disk. RAID adds disk redundancy. Even weekly backups to a disk that someone brings home, for the event that the server room explodes... But with distributed databases, we get the redundancy by keeping multiple replicas in different availability zones. I'd argue that a disk isn't necessary. The appendfsync always is orthogonal to this discussion, the way I see it.
>
> **Member:** Yeah, FWIW I always say we're adding durability and consistency. Valkey written to multiple disks with AOF may be durable, but it has long failure times. If you fail over to a replica, you lose consistency.
>
> **Member:** I agree with everything. But in order to fully pass the message of what we would like to solve, we need to make sure we emphasize the problems with the alternatives.
>
> **Member (Author):** Yes. Durability + consistency + high availability is all in my goal. There is durability with a 1p0r aof-always-fsync setup and I am sure it works for some users, but the lack of high availability is a big gap IMO.
While Valkey allows a client to receive an acknowledgement that a write has been accepted by the primary, this acknowledgement is not bound to the future authoritative history of the shard. Several legitimate cluster behaviors can cause an acknowledged write to be lost, even without bugs or operator errors:
* **Leader Election**: A write may be acknowledged by the primary and propagated to a subset of replicas, but a different replica—one that never received the write—can be legitimately promoted to primary. After promotion, the shard's authoritative history will exclude that write.
> **Member** (on lines +50 to +52): This assumes that we had more than a single replica and we only waited for a single approval, right? Maybe we should emphasize this?
>
> **Member:** When you say "we waited" are you referring to the use of the
>
> **Member (Author):** Yes. I am talking about a generic case of N replicas. A one-replica setup doesn't have an election concern.
* **Slot Migration Rollback**: A data loss scenario occurs when the source shard broadcasts a higher config epoch after the migration has completed, forcing the target shard to roll back the ownership transfer and discard any writes it accepted in the interim. This can be triggered by a failover in the source shard or by a concurrent migration conflict where the source shard bumps its own epoch.
> **Member** (on lines +54 to +58): You refer to the legacy SM, IIUC. I would state why using the Atomic Slot Migration would not help in this case.
>
> **Member (Author):** ASM has the same durability gap today, but that is not the main point. I am using "Slot Migration" in a generic sense and as the context for the later "Solution Requirements" section ("Membership-awareness").
>
> **Contributor:** Yeah, we need a two-phase commit for the slot migration?
>
> **Member:** Yes. Slot ownership needs to be durably committed along with its data. So, we will need to 2PC the data between shards with ASM. That will also promote slot migration to a real 2PC as well.
>
> **Member:** Just to clarify: is this example below the scenario being called out?
* **Two-Node Shard Tradeoffs**: In a 1-primary/1-replica topology, any durability guarantee inherently trades availability for durability. Under a single-node failure, requests expecting a durable acknowledgement must fail clearly and predictably.
> **Member** (on lines +60 to +62): I am not sure I understand this paragraph. Is the intention to say that in order to gain full durability we have to use at least 1 replica?
>
> **Member (Author):** It is the other way around. The idea is to explicitly call out the need to support strong durability for a very common HA setup of 1p1r. The non-HA 1p0r setup is a special use case IMO, which might or might not require a different solution. I can add a clause to spec out the requirement/experience for 1p0r, without, again, hinting at any solutions.
>
> **Contributor:** With only two nodes (1p1r) it's not possible to tell the difference between primary failure and netsplit. The replica can never get a majority. If we move the quorum to the shard level, I'd say we need at least two replicas, but I think it's a totally acceptable limitation. Now we can instead allow single-shard and 2-shard clusters.
>
> **Member:** I would also prefer 1p2r per shard to be the minimum for strong durability. We can add support for observer nodes later that process the stream and can handle the network-split case for multiple shards.
>
> **Member (Author):** +1, but I think we can leave this decision to the execution. BTW, the framework that I am going with is
>
> **Member:** I think this is a good high-level plan. The HLD is hard to know long term, though. 1p1r is obviously always possible without availability, but we should probably decide on the availability story.
>
> **Contributor:** Regarding the 1p1r or 1p2r per shard debate: if the durable system is completely decoupled from the Valkey engine, wouldn't we be able to support 1p1r? I think the design would be dictated by this key requirement.
>
> **Member:** @hpatro - if the durable system is completely decoupled from the Valkey engine, are you thinking about a durability and consistency system with raft/paxos through a module?
The fundamental problem is that a success from the primary today confirms **acceptance**, not a durable **commitment** that will survive a topology change. Solving this requires a mechanism for **quorum-based writes**, where an acknowledgement is only sent after a majority of nodes have durably recorded the operation. This concept of a quorum-based write is embodied by both the `WAIT` command (when used with a replica count sufficient to form a quorum) and the connection-level durability opt-in.
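The majority condition behind a quorum-based write can be stated precisely (a minimal sketch, assuming the quorum is a strict majority of the shard's nodes; that is one plausible design, not a decided one):

```python
def is_committed(nodes_durable: int, shard_size: int) -> bool:
    """A write is committed once a strict majority of the shard's nodes
    (primary included) have durably recorded it."""
    return nodes_durable >= shard_size // 2 + 1

# 1p2r shard (3 nodes): primary plus one replica is already a majority.
assert is_committed(2, 3)
assert not is_committed(1, 3)

# 1p1r shard (2 nodes): a majority requires both nodes, so losing either
# one blocks durable acknowledgements -- the two-node tradeoff noted above.
assert not is_committed(1, 2)
assert is_committed(2, 2)
```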
## **Solution Requirements**

Based on the durability gaps and user needs identified, any proposed solution must adhere to the following principles and constraints:

### **Core Durability Principles**

* **Authoritative Acknowledgement**: A durable write's acknowledgement must be tied to a single, authoritative decision at the shard level, ensuring it remains valid through any topology change like a leader election or slot migration.
> **Contributor** (on lines +79 to +81): Why does it have to be authoritative?
>
> **Member:** This sounds like slot-level consistency.
>
> **Member (Author):** I use
>
> **Member:** To be pedantic, I think that is the same as saying we have sequential consistency for each slot. There is some total ordering of requests against a slot, and that is preserved (even though nodes may be at different parts along the ordering). I think the requirement is sensible, though; just suggesting a more formal way of saying it.
>
> **Member:** I agree with Madelyn. Looks like what we want is "sequential consistency for each slot", and that should be stated clearly as a requirement.
>
> **Member (Author):** Not sure I understand the definition of "sequential consistency for each slot". The linearizability requirement is at the shard level. There should be a total order for all writes for a given shard, and this total order is what I called "authoritative", but I will use "total order" going forward, which seems like the more well-established term.
* **Membership-Awareness**: During a cluster change like a slot migration, a durability confirmation must come from the nodes that will be the **final, authoritative owners** of the data. This prevents a "lame duck" quorum, a group of nodes that is about to lose authority, from confirming a write that the new owner has never seen, which would otherwise lead to data loss.
* **Clear Failure Modes**: If a client requests a durable write and the server cannot confirm it within the timeout, the server must return an explicit error, not an ambiguous success. This prevents a "silent downgrade" in durability where a client might incorrectly assume a write was successful. This clear error signal informs the client that the write's state is unknown, enabling the application to safely retry the operation, which should be designed to be idempotent.
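The retry behavior this principle enables can be sketched as follows (hypothetical helper names; assumes the write is idempotent, as the bullet requires):

```python
def retry_durable_write(write_once, attempts=3):
    """Retry a durable write whose state is unknown after a timeout.

    `write_once` performs the write and raises TimeoutError when the server
    cannot confirm durability. Because the write is idempotent, re-applying
    it is safe whether or not an earlier attempt actually committed.
    """
    last_err = None
    for _ in range(attempts):
        try:
            return write_once()
        except TimeoutError as err:   # explicit error, never a silent downgrade
            last_err = err
    raise last_err
```

The explicit error is what makes this loop sound: an ambiguous success would leave the application unable to decide whether retrying is necessary at all.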
> **Member** (on lines +86 to +90): This will not serve as an indication of whether the data exists or not, right? Like in shared lock algorithms, it will not help understand whether to assume the lock was indeed taken or not, correct?
>
> **Member (Author):** That is the whole point of this paragraph, i.e., the durable signal needs to be clear and durable. In the key challenges section, I gave the example of today's slot migration losing a
>
> **Member:** So you basically call out that a durable solution should provide a way to be sure if the data exists or not? What about network disconnects? Do we require a durable solution to ensure the data was not written down (like a 3-way handshake for writes)?
>
> **Member (Author):** Are you talking about the networking issue between the client and server, or network partitioning among the nodes? Regardless, I think this is a good reminder that I need to define the fault domain in this RFC as well. We are apparently not signing up for the day when the Earth disappears...
>
> **Contributor:** I think we should focus on not returning OK to writes that are not yet made durable. Whether there is an explicit error reply or no reply at all (as in disconnects) is not important IMHO. Let's reword the part about "explicit error"?
>
> **Member:** Right, we should make clear what the durability semantics of write operations are. We should state something like:
>
> **Member (Author):** Agreed. Will incorporate @rjd15372's proposal.
>
> **Contributor** (on lines +86 to +90): Why do we want to introduce a new mechanism? No response would imply the request hasn't succeeded yet.
>
> **Contributor:** The point here is that we never return success (+OK) before we know for sure a write is durably committed in the cluster. If not, whether we return an error reply or no reply at all shouldn't be that important, but there will always be some cases, like if the connection is lost while the client is waiting for a write to be committed. Then the client can't know if it was committed or not. I think we should drop the formulation "must return an explicit error" and instead focus on not returning success unless we know for sure.
>
> **Member (Author):** That is exactly what I don't quite like about the way
>
> **Member:** IMO the
>
> **Member (Author):** I am fine with dropping WAIT for now for simplification, but as I mentioned in the other comment, WAIT has its value and I think it can be incorporated into Raft (by waiting for a quorum ACK for a given offset only).
### **Interface and Usability Constraints**

* **Preserve `WAIT` Command Interface**: The syntax (`WAIT <replicas> <timeout>`) and integer return value of the existing command must not be changed.
> **Contributor** (on lines +94 to +95): I don't see any benefit of coupling this with the durability system we can build independently.
>
> **Contributor:** Agree. It's related as background information though, when explaining why WAIT is not enough in the current design.
>
> **Member (Author):** Not sure if I understand the "coupling" and the "background information" callouts? This clause is about the interface compatibility. My take is that the
>
> **Contributor:** OK... With background information I meant earlier attempts. I guess I find it hard to think about the requirements without seeing a solution. Let's keep this then.
>
> **Member:** I would drop this requirement. IMO the
>
> If a client wants to enable the durability feature, it should not be required to use
>
> **Member (Author):** WAIT is an optimization that allows the users to balance durability and performance explicitly. I am fine with every write being "durable" in V1, but I do think there is value down the road.
* **Flexible Durability Controls**: The solution must also provide a connection-level opt-in for durable-by-default behavior.

> **Comment:** Why is connection level important? Why is shard or cluster level not sufficient? Are there specific use cases where customers would benefit from this? Hoping not to over-engineer if there is no real need.
>
> **Member (Author):** I think there is, but I do agree that this is going into the design area, which was not my intent for this doc. I have rephrased the configurability to something more vague and will remove the per-command/per-connection and WAIT callouts from this RFC. We can continue this discussion in the HLD.
> **Contributor** (on lines +97 to +98): Shouldn't we be doing this at the server level for ease of understanding by developers? All write operations would be durable.
>
> **Member (Author):** I think that is a very reasonable first version. I do want to provide some configuration agility in the future, but I agree that it shouldn't be a blocker. I will remove this
### **Architectural Constraints**

* **No Performance Regression for Non-Durable Workloads**: The chosen durability mechanism must be implemented in a way that does not prevent workloads that do not require strong durability from retaining their existing high-performance characteristics. The system must continue to offer a path that avoids the I/O and network latency costs of a durable, quorum-based commit.
* **Foundation for Future Features**: The chosen durability mechanism should provide a clear and extensible foundation for future high-value features, most notably **data tiering**. This would enable Valkey to cost-effectively manage datasets that are significantly larger than available memory.

> **Comment:** Why are durability and data tiering interconnected? Are you suggesting we cannot support data tiering for non-durable workloads that use asynchronous replication? Does the same argument apply to Active-Active replication as well? Do we need durability as a foundation to support Active-Active replication?
>
> **Member (Author):** It is actually the opposite; we need to make sure whatever design we pick in the end for durability doesn't paint us into an architectural corner. Data tiering and active-active are both big bets for Valkey too. This callout should be a no-brainer IMO, but I consider it too important a point to miss.

* **Self-Sufficient High Availability**: The solution must provide robust, application-level high availability. The system must be self-sufficient and manage its own state, consensus, and replication without reliance on specialized, external, or infrastructure-level mechanisms for durability or availability.
> **Comment:** WAIT also exposes the dirty data to other clients, who might then take action on it, but have it later get reverted. This is why just replicating to a majority is not sufficient.
>
> **Comment:** Yeah, this document should mention that we can't allow other clients to read uncommitted writes. That would be inconsistent. (This is why we also need a WAL, but let's not mention WAL in this document.)
>
> **Comment:** Agreed. "Read committed writes only on the primary" is a better way to phrase it. I will add it to the requirements section.
>
> **Comment:** Ideally we should fulfil this extra requirement, but I believe it adds a lot of pain, as in slowness and complexity. It may require MVCC or something equivalently complex. I believe it's possible to provide durability guarantees for write commands and/or WAIT, while still allowing dirty writes (writes that are not yet made durable). It's the read-uncommitted isolation level, which is not perfect, but I think it can be fast, not very complex, and can be done without locks and MVCC. Let's keep this requirement but give it a number, so we can refer to the requirements and implement incrementally?
>
> **Comment:** Just to clarify: why not make both read isolation configurable and enforce 1p2r as the default when DURABILITY=QUORUM is declared?
>
> - Configurable isolation:
>   - Cluster default = READ_COMMITTED + DURABILITY=QUORUM.
>   - Connections may explicitly opt into READ_UNCOMMITTED for latency, but not stronger guarantees than the cluster supports.
>   - On failover/migration: READ_COMMITTED readers stall or error until durable state is restored; READ_UNCOMMITTED readers must re-negotiate.
> - Topology enforcement:
>   - Quorum durability requires ≥3 voters, so 1p2r should be the safe default.
>   - If users want 1p1r, they must explicitly choose a weaker mode (e.g., DURABILITY=BEST_EFFORT).
>   - No silent downgrades: the durability/isolation contract should be fixed at cluster creation.
>
> This gives safe defaults (1p2r + READ_COMMITTED), flexibility for latency-sensitive use cases, and avoids the pitfalls of hidden behavior changes.
>
> **Comment:** The focus of this RFC is on durability and linearizability. Read-uncommitted or dirty read is conceptually a property of isolation, which is very important, but I would suggest keeping it out of this RFC to keep the scope contained. For the 1p1r setup to reach consensus at the shard level, we would need a witness node, which adds complexity. This is why I propose supporting (quorum) durability for 1p2+r only, but leaving all other setups (including 1p1r) to the existing cluster architecture, for now.