
Add draft PRD for Valkey Durability #29

Open
PingXie wants to merge 5 commits into valkey-io:main from PingXie:main

Conversation


@PingXie (Member) commented Sep 12, 2025

This document defines the problem of insufficient write durability in Valkey Cluster and outlines the core requirements for a solution. It details the failure modes of the current asynchronous replication model and establishes the principles for a new, quorum-based durability mechanism that will provide strong durability guarantees without regressing on performance for non-durable workloads.

@valkey-io/core-team FYI


Copilot AI left a comment


Pull Request Overview

This PR introduces a Product Requirements Document (PRD) that defines the problem of insufficient write durability in Valkey Cluster and establishes requirements for a quorum-based durability solution. The document addresses the gap between write acknowledgment and actual data persistence across topology changes.

  • Identifies durability challenges in current asynchronous replication model
  • Defines requirements for explicit per-command and connection-level durability controls
  • Establishes architectural constraints to preserve performance for non-durable workloads


Ping Xie and others added 4 commits September 11, 2025 21:12
This document defines the problem of insufficient write durability in Valkey Cluster and outlines the core requirements for a solution.

Signed-off-by: Ping Xie <pingxie@google.com>
Signed-off-by: Ping Xie <pingxie@google.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Ping Xie <pingxie@outlook.com>
Signed-off-by: Ping Xie <pingxie@google.com>
@ranshid (Member) left a comment


I like this doc, as it mainly focuses on the problem and less on the technical solution.

Comment on lines +50 to +52
* **Leader Election**: A write may be acknowledged by the primary and propagated to a subset of
replicas, but a different replica—one that never received the write—can be legitimately promoted to
primary. After promotion, the shard's authoritative history will exclude that write.
Member:

This assumes that we had more than a single replica and we only waited for a single approval, right? Maybe we should emphasize this?

Member:

When you say "we waited" are you referring to the use of the WAIT command?

Member Author:

> This assumes that we had more than a single replica and we only waited for a single approval, right? Maybe we should emphasize this?

Yes. I am talking about the generic case of N replicas. A one-replica setup doesn't have an election concern.

Comment on lines +54 to +58
* **Slot Migration Rollback**: A data loss scenario occurs when the source shard broadcasts a
higher config epoch after the migration has completed, forcing the target shard to roll back the
ownership transfer and discard any writes it accepted in the interim. This can be triggered by a
failover in the source shard or by a concurrent migration conflict where the source shard bumps its
own epoch.
Member:

You're referring to the legacy slot migration, IIUC. I would state why using Atomic Slot Migration would not help in this case.

Member Author:

ASM has the same durability gap today, but that is not the main point. I am using "Slot Migration" in a generic sense and as context for the later "Solution Requirements" section ("Membership-awareness").

Contributor:

Yeah, we need a two-phase commit for the slot migration?

Member:

Yes. Slot ownership needs to be durably committed along with its data, so we will need to 2PC the data between shards with ASM. That will also promote slot migration to a real 2PC.
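To make the 2PC shape concrete, here is a minimal sketch of committing slot ownership in two quorum-acknowledged phases. Everything here is hypothetical illustration (prepare_release, commit_acquire, etc. are not a proposed Valkey API):

```python
# Hypothetical sketch only; none of these functions exist in Valkey today.
def migrate_slot(slot, source, target, quorum):
    # Phase 1 (prepare): both shards durably log the intent to transfer,
    # each acknowledged by a quorum of its own nodes.
    if not source.prepare_release(slot, to=target.shard_id, quorum=quorum):
        return False                    # nothing changed; source still owns the slot
    if not target.prepare_acquire(slot, frm=source.shard_id, quorum=quorum):
        source.abort_release(slot)      # back out cleanly
        return False

    # Phase 2 (commit): only now does ownership flip. Because the transfer
    # itself is quorum-committed, a later config-epoch bump on the source can
    # no longer force the target to roll back writes it accepted for the slot.
    source.commit_release(slot)
    target.commit_acquire(slot)
    return True
```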

@cherukum-Amazon commented Sep 23, 2025:

Just to clarify: is the example below the scenario being called out?

  • imagine slot 5 is migrated from Shard A to Shard B. A client writes "SET k v" to Shard B after the migration. Later, Shard A bumps its config epoch (say due to failover), and the cluster decides Shard A still owns slot 5. Shard B must roll back and discard "SET k v," so the client’s acknowledged write is lost.

Comment on lines +60 to +62
* **Two-Node Shard Tradeoffs**: In a 1-primary/1-replica topology, any durability guarantee
inherently trades availability for durability. Under a single-node failure, requests expecting a
durable acknowledgement must fail clearly and predictably.
Member:

I am not sure I understand this paragraph. Is the intention to say that in order to gain full durability we have to use at least 1 replica?

Member Author:

It is the other way around. The idea is to explicitly call out the need to support strong durability for the very common HA setup of 1p1r. The non-HA 1p0r setup is a special use case IMO, which may or may not require a different solution. I can add a clause to spec out the requirement/experience for 1p0r without, again, hinting at any solutions.

Contributor:

With only two nodes (1p1r) it's not possible to tell the difference between primary failure and netsplit. The replica can never get a majority.

If we move the quorum to the shard level, I'd say we need at least two replicas, but I think it's a totally acceptable limitation. Now we can instead allow single-shard and 2-shard clusters.
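Spelling out the arithmetic behind this point (a trivial sketch, independent of any particular consensus implementation):

```python
def majority(voters: int) -> int:
    # Smallest group guaranteed to overlap with any other majority.
    return voters // 2 + 1

# 1p1r: two voters. A lone surviving node can never reach majority(2) == 2,
# so a primary crash is indistinguishable from a netsplit.
assert majority(2) == 2

# 1p2r: three voters. Any single-node failure leaves majority(3) == 2 nodes.
assert majority(3) == 2
```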

Member:

I would also prefer 1p2r per shard to be the minimum for strong durability. We can add support for observer nodes later that process the stream and can handle the network split case for multiple shards.

Member Author:

> I would also prefer 1p2r per shard to be the minimum for strong durability.

+1, but I think we can leave this decision to the execution. BTW, the framework that I am going with is:

  1. consensus on requirements, which is this RFC
  2. an HLD that doesn't paint us into an architectural corner for all the things we would like to achieve down the road, including data tiering, 1p1r, etc.
  3. an execution plan for 9.1 and 10.0, maybe with 9.1 laying down the foundation, and 10.0 with the MVP (1p2r for instance)

Member:

I think this is a good high level plan.

The HLD is hard to know long term though. 1p1r is obviously always possible without availability, but we should probably decide on the availability story.

Contributor:

Regarding the 1p1r vs. 1p2r per-shard debate: if the durable system is completely decoupled from the Valkey engine, wouldn't we be able to support 1p1r? I think the design would be dictated by this key requirement.


@hpatro, if the durable system is completely decoupled from the Valkey engine, are you thinking about a durability and consistency system with Raft/Paxos through a module?

simpler code and stronger durability guarantees.


## **Key Durability Challenges**
Member:

I think we also need to drop some words about AOF durability guarantees on the replica side. Like, if the AOF is always fsynced on the replica, it might help prevent any type of data loss when combined with full-replica write acknowledgments, right?

Member Author:

Should we leave this to the HLD, as one of the potential solutions? As you have rightly noticed, my plan is to spec out just the requirements/experience in this PRD doc.

Contributor:

Durability comes from Latin "durus" = hard, as in hard disk.

But I think it's a misconception. The point of durability is that the data survives various levels of failure, such as a RAM power outage or a software crash.

In the times before distributed databases, this meant fsync to disk. RAID adds disk redundancy. Even weekly backups to a disk that someone brings home, for the event that the server room explodes...

But with distributed databases, we get the redundancy by keeping multiple replicas in different availability zones. I'd argue that a disk isn't necessary.

The appendfsync always is orthogonal to this discussion, the way I see it.

Member:

Yeah, FWIW I always say we're adding durability and consistency. Valkey written to multiple disks with AOF may be durable, but it has long failure times. If you fail over to a replica, you lose consistency.

Member:

I agree with everything. But in order to fully pass the message of what we would like to solve, we need to make sure we emphasize the problems with the alternatives.

Member Author:

Yes. Durability + consistency + high availability is all in my goal.

There is durability with a 1p0r AOF-always-fsync setup, and I am sure it works for some users, but the lack of high availability is a big gap IMO.
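For reference, the 1p0r always-fsync setup mentioned here is the existing AOF configuration, which gives single-node durability but no high availability:

```
# valkey.conf: fsync the append-only file on every write
appendonly yes
appendfsync always
```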

Comment on lines +86 to +90
* **Clear Failure Modes**: If a client requests a durable write and the server cannot confirm it
within the timeout, the server must return an explicit error, not an ambiguous success. This
prevents a "silent downgrade" in durability where a client might incorrectly assume a write was
successful. This clear error signal informs the client that the write's state is unknown, enabling
the application to safely retry the operation, which should be designed to be idempotent.
Member:

This will not serve as an indication of whether the data exists or not, right? Like in shared-lock algorithms, it will not help us understand whether to assume the lock was indeed taken, correct?

Member Author:

That is the whole point of this paragraph, i.e., the durable signal needs to be clear and durable. In the key challenges section, I gave the example of today's slot migration losing a WAITed write in some corner-case but non-zero-probability scenarios. The ask here is that when the system can't ensure data durability, it needs to clearly say so; don't accept the request and then silently drop the guarantee.

Member:

So you basically call out that a durable solution should provide a way to be sure whether the data exists or not? What about network disconnects? Do we require a durable solution to ensure the data was not written down (like a 3-way handshake for writes)?

Member Author:

> What about network disconnects?

Are you talking about a networking issue between the client and the server, or network partitioning among the nodes? Regardless, I think this is a good reminder that I need to define the fault domain in this RFC as well. We are apparently not signing up for the day the Earth disappears ...

Contributor:

I think we should focus on not returning OK to writes that are not yet made durable.

> the server must return an explicit error, not an ambiguous success

Whether there is an explicit error reply or no reply at all (as in disconnects) is not important IMHO. Let's reword the part about "explicit error"?

Member:

Right, we should make clear what the durability semantics of write operations are. We should state something like:

  • If a client submits a write operation and receives a success response from the server, the data is guaranteed to be durable.
  • If a client submits a write operation and receives a failure response from the server, no data has been written.
  • If a client submits a write operation but does not receive any response from the server (due to a timeout, connection failure, etc.), there are no guarantees regarding whether the data was written or not.
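A sketch of what these three cases could look like from the application side, assuming a hypothetical client whose write replies follow the semantics above (the WriteRejected exception and the retry scheme are illustrative, not an existing API):

```python
class WriteRejected(Exception):
    """Explicit error reply: the server guarantees the write was NOT applied."""

def durable_set(client, key, value, attempts=3):
    # SET is idempotent, so retrying an ambiguous outcome is safe; commands
    # that are not idempotent would need an application-level request id.
    for _ in range(attempts):
        try:
            if client.set(key, value):        # success reply => durable
                return True
        except WriteRejected:
            return False                      # failure reply => nothing written
        except (ConnectionError, TimeoutError):
            continue                          # no reply => outcome unknown; retry
    raise RuntimeError("write outcome still unknown after retries")
```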

Member Author:

Agreed. Will incorporate @rjd15372's proposal.

@hpatro (Contributor) left a comment

Thanks for getting this started @PingXie. I've a slightly different take on durability.

Sharing the high level thought here:

  • I would like to maintain the same semantics of Valkey and avoid introducing any client behavior change.
    The client wouldn't receive a success/failure back until the quorum has committed the operation. During that period the client would be in a blocked state on the primary.

  • I would like to introduce the durability configuration at the server level and keep it immutable.
    By that I intend to make the user opt in to the performance degradation that comes with the better durability guarantee, and not see it as a surprise. Introducing the concept at the connection level would make the debuggability of the system very hard.

Comment on lines +79 to +81
* **Authoritative Acknowledgement**: A durable write's acknowledgement must be tied to a single,
authoritative decision at the shard level, ensuring it remains valid through any topology change
like a leader election or slot migration.
Contributor:

Why does it have to be authoritative?

Member:

This sounds like slot level consistency.

Member Author:

I use "authoritative" to qualify the replication history. There needs to be one deterministic replication history within a given shard that is consistent with the client expectation. We don't have a slot-level replication history today, but if/when we get there, this would apply too.

Member:

To be pedantic, I think that is the same as saying we have sequential consistency for each slot. There is some total ordering of requests against a slot, and that is preserved (even though nodes may be at different parts along the ordering).

I think the requirement is sensible, though; I'm just suggesting a more formal way of saying it.

Member:

I agree with Madelyn, looks like what we want is "sequential consistency for each slot", and that should be stated clearly as a requirement.

Member Author:

Not sure I understand the definition of "sequential consistency for each slot". The linearizability requirement is at the shard level: there should be a total order for all writes in a given shard, and this total order is what I called "authoritative". I will use "total order" going forward, since it seems like the more well-established term.
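A toy illustration of the shard-level total order being described (purely illustrative, not a design proposal):

```python
# Toy model: one authoritative log per shard. Every write receives exactly one
# position, so all nodes replay the same history in the same order.
class ShardLog:
    def __init__(self):
        self.entries = []

    def append(self, write) -> int:
        self.entries.append(write)
        return len(self.entries) - 1  # the write's position in the total order

# A durable acknowledgement would then mean: this position is replicated on a
# quorum, and any future leader's log contains every acknowledged position.
```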

Comment on lines +86 to +90
* **Clear Failure Modes**: If a client requests a durable write and the server cannot confirm it
within the timeout, the server must return an explicit error, not an ambiguous success. This
prevents a "silent downgrade" in durability where a client might incorrectly assume a write was
successful. This clear error signal informs the client that the write's state is unknown, enabling
the application to safely retry the operation, which should be designed to be idempotent.
Contributor:

Why do we want to introduce a new mechanism? No response would imply the request hasn't succeeded yet.

Contributor:

The point here is that we never return success (+OK) before we know for sure a write is durably committed in the cluster.

Beyond that, whether we return an error reply or no reply at all shouldn't be that important, but there will always be some cases, like the connection being lost while the client is waiting for a write to be committed. Then the client can't know whether it was committed or not.

I think we should drop the formulation "must return an explicit error" and instead focus on not returning success unless we know for sure.

Member Author:

> The point here is that we never return success (+OK) before we know for sure a write is durably committed in the cluster.

That is exactly what I don't quite like about the way WAIT is implemented. However, I also didn't want to completely throw away WAIT and introduce read/write concerns, which I consider over-engineered (I can expand on this if there is interest). Today, WAIT returns the number of replicas that have applied the changes; it requires the application developer to understand the quorum, which IMO is a cognitive overload. I would like to return an explicit message on whether or not the client write was committed by a quorum, but this alone breaks the compat that I was hoping for in the requirements section. Maybe we do need a new WAIT variant, WAIT_QUORUM?
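To make the contrast concrete (WAIT_QUORUM is the hypothetical variant floated above, not an existing command; the calls assume a redis-py-style execute_command):

```python
# Today: WAIT returns a replica count and leaves the quorum math to the app.
acked = client.execute_command("WAIT", 2, 1000)
write_is_safe = acked >= 2   # the app must know that 2 constitutes a quorum

# Hypothetical WAIT_QUORUM: the server answers the actual question
# ("is my write quorum-committed?"), keeping topology a server-side concern.
committed = client.execute_command("WAIT_QUORUM", 1000)
```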

Member:

IMO the WAIT command should not be part of the durability feature; WAIT only makes sense when using asynchronous replication. If we enable durability of write commands, then when the server returns the response to a client write command, it must be guaranteed that the data is persisted and durable.

Member Author:

I am fine with dropping WAIT for now for simplification, but as I mentioned in the other comment, WAIT has its value, and I think it can be incorporated into Raft (by waiting for a quorum ACK for a given offset only).

Comment on lines +94 to +95
* **Preserve <code>WAIT</code> Command Interface**: The syntax (`WAIT <replicas> <timeout>`) and
integer return value of the existing command must not be changed.
Contributor:

I don't see any benefit of coupling this with the durability system; we can build it independently.

Contributor:

Agree. It's related as background information though, when explaining why WAIT is not enough in the current design.

Member Author:

Not sure I understand the "coupling" and "background information" callouts?

This clause is about interface compatibility. My take is that the WAIT command was introduced to solve the exact same core problem; its failure is not the interface but the design/implementation. I would like to see if we can retain the syntax while fixing things under the hood. This helps with the learning curve on the developer side.
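In other words, the compatibility goal is roughly this (sketch with a redis-py-style client; the stronger guarantee is the proposal, not current behavior):

```python
client.set("k", "v")
# Same syntax and integer reply as today: block until 1 replica acknowledges
# or 100 ms elapse, returning how many replicas acknowledged...
acked = client.execute_command("WAIT", 1, 100)
# ...but under the proposed semantics, a successful return would reflect a
# quorum-committed offset rather than best-effort replication progress.
```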

Contributor:

OK... By "background information" I meant earlier attempts. I guess I find it hard to think about the requirements without seeing a solution. Let's keep this then.

Member:

I would drop this requirement. IMO the WAIT command only makes sense when a system uses asynchronous replication.

If a client wants to enable the durability feature, it should not be required to use WAIT, and should rewrite its existing application code accordingly.

Member Author:

WAIT is an optimization that allows users to balance durability and performance explicitly. I am fine with every write being "durable" in V1, but I do think there is value down the road.

Comment on lines +97 to +98
* **Flexible Durability Controls**: The solution must also provide a connection-level opt-in for
durable-by-default behavior.
Contributor:

Shouldn't we be doing this at the server level for ease of understanding by the developers?

All write operations would be durable.

Member Author:

I think that is a very reasonable first version. I do want to provide some configuration agility in the future, but I agree that it shouldn't be a blocker. I will remove this.

@zuiderkwast (Contributor) left a comment

Thank you for this work!


Production users increasingly run workloads where **loss of an acknowledged write is
unacceptable**, not just in steady state but also during failover and cluster changes. In
practice, users have two distinct needs for durability controls:
@zuiderkwast (Contributor) commented Sep 12, 2025:

We could add a 3rd option: a global config.

It's hard to ensure these quorum-based writes without moving leader election to the shard, making the majority of replicas in the shard the quorum. Then we're talking about a new consensus model.

If we go for something like Raft-based replication, can we still allow both asynchronous and synchronous (ack-by-majority) replication? Maybe we can...

I do like the per-client wait behavior, but I'm OK with all writes being durable.




@madolson (Member) left a comment

I would also like one more requirement, which is that we shouldn't invent our own consistency model or algorithm. When data integrity is on the line, we don't want to accidentally lose data because we tried to tune it better.

This set of requirements more or less defines a slot-level durability and consistency guarantee. That can be done either through synchronous replication (if a replica is down, writes fail) or eventual consistency (we accept consistency gaps between replicas). Both seem doable.

latency for many workloads, but it also creates a window where an **acknowledged** write can be
lost if the primary fails before replicas make the write durable. The existing `WAIT <replicas>
<timeout>` command provides a client-side synchronization mechanism for clients to block until a
specified number of replicas have recorded the write. However, this propagation-based
Member:

WAIT also exposes the dirty data to other clients, who might then take action on it, only to have it later get reverted. This is why just replicating to a majority is not sufficient.
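An illustrative timeline of that anomaly (assuming a 3-node shard where 2 acknowledgements form a quorum):

```
t0  client A: SET x 1         -> primary applies locally; replication lags
t1  client A: WAIT 2 1000     -> blocks; quorum not yet reached
t2  client B: GET x           -> reads 1 (dirty: not yet quorum-committed)
t3  primary crashes; a replica without x=1 is elected
t4  client A's WAIT times out -> correctly reports the write as non-durable
t5  client B has already acted on a value that no longer exists
```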

Contributor:

Yeah, this document should mention that we can't allow other clients to read uncommitted writes. That would be inconsistent. (This is why we also need a WAL, but let's not mention WAL in this document.)

Member Author:

agreed. "read committed writes only on primary"is a better way to phrase it. I will add it to the requirements section.

Contributor:

Ideally we should fulfil this extra requirement, but I believe it adds a lot of pain as in slowness and complexity. It may require MVCC or something equivalently complex.

I believe it's possible to provide durability guarantees for write commands and/or WAIT while still allowing dirty writes (writes that are not yet made durable). It's the read-uncommitted isolation level, which is not perfect, but I think it can be fast, not very complex, and can be done without locks and MVCC.

Let's keep this requirement but give it a number, so we can refer to the requirements and implement incrementally?

@cherukum-Amazon commented Sep 18, 2025:

Just to clarify: why not both make read isolation configurable and enforce 1p2r as the default when DURABILITY=QUORUM is declared?
– Configurable isolation:
• Cluster default = READ_COMMITTED + DURABILITY=QUORUM.
• Connections may explicitly opt into READ_UNCOMMITTED for latency, but not stronger guarantees than the cluster supports.
• On failover/migration: READ_COMMITTED readers stall or error until durable state is restored; READ_UNCOMMITTED readers must re-negotiate.
– Topology enforcement:
• Quorum durability requires ≥3 voters, so 1p2r should be the safe default.
• If users want 1p1r, they must explicitly choose a weaker mode (e.g., DURABILITY=BEST_EFFORT).
• No silent downgrades — the durability/isolation contract should be fixed at cluster creation.

This gives safe defaults (1p2r + READ_COMMITTED), flexibility for latency-sensitive use cases, and avoids the pitfalls of hidden behavior changes.

Member Author:

The focus of this RFC is on durability and linearizability. Read-uncommitted, or dirty read, is conceptually a property of isolation, which is very important, but I would suggest keeping it out of this RFC to keep the scope contained.

For the 1p1r setup to reach consensus at the shard level, we would need a witness node, which adds complexity. This is why I propose supporting (quorum) durability for 1p2+r only, but leaving all other setups (including 1p1r) to the existing cluster architecture for now.





@PingXie (Member Author) commented Sep 13, 2025

@hpatro

> I would like to maintain the same semantics of Valkey and avoid introducing any client behavior change. The client wouldn't receive a success/failure back until the quorum has committed the operation. During that period the client would be in a blocked state on the primary.

I am aligned in principle but I feel that we might have to introduce a new WAIT variant, which shouldn't diverge from the existing WAIT's mental model too much.

> I would like to introduce the durability configuration at the server level and keep it immutable. By that I intend to make the user opt in to the performance degradation that comes with the better durability guarantee, and not see it as a surprise. Introducing the concept at the connection level would make the debuggability of the system very hard.

I would make sure the requirements RFC leaves enough room for this option. That being said, I also feel that this is very likely the way to go, at least for our very first strong-durability release.

@PingXie (Member Author) commented Sep 13, 2025

> I would also like one more requirement, which is we shouldn't invent our own consistency model or algorithm. When data integrity is on the line, we don't want to accidentally lose data because we tried to tune it better.

YES, fully agreed. Will add it to the requirements.

@PingXie (Member Author) commented Sep 13, 2025

Thanks for all the feedback and discussions. I will update the doc next and also collate open questions.

Reusing a comment I left earlier. The framework that I am going with is:

  1. consensus on requirements, which is this RFC
  2. an HLD that doesn't paint us into an architectural corner for all the things we would like to achieve down the road, including data tiering, 1p1r, etc.
  3. an execution plan for 9.1 and 10.0, maybe with 9.1 laying down the foundation, and 10.0 with the MVP (1p2r for instance)

@cherukum-Amazon

Thanks for the draft! A few additional areas to consider in the PRD:
– Clarify client retry semantics when a durable write times out, to avoid duplicate/ambiguous commits (idempotency requirement).
– Define how durability applies to MULTI/EXEC and Lua scripts (per command vs. atomic batch).
– Clarify survivability after crash/restart (e.g., durability should imply persistence on disk, not just in memory).
– Add observability requirements (metrics for durable vs. non-durable acks, quorum failures, replica lag).
– Provide baseline performance targets/expectations so trade-offs are clear.
– Add testing/failure-injection requirements to validate correctness under partitions, crashes, and migrations.

* **Preserve <code>WAIT</code> Command Interface**: The syntax (`WAIT <replicas> <timeout>`) and
integer return value of the existing command must not be changed.

* **Flexible Durability Controls**: The solution must also provide a connection-level opt-in for


Why is connection level important? Why is shard or cluster level not sufficient? Are there specific use cases where customers would benefit from this? Hoping not to overengineer if there is no real need.

Member Author:

I think there is, but I do agree that this is going into the design area, which was not my intent for this doc. I have rephrased the configurability to something more vague and will remove the per-command/per-connection and WAIT callouts from this RFC. We can continue this discussion in the HLD.

that avoids the I/O and network latency costs of a durable, quorum-based commit.

* **Foundation for Future Features**: The chosen durability mechanism should provide a clear and
extensible foundation for future high-value features, most notably **data tiering**. This would


Why are durability and data tiering interconnected? Are you suggesting we cannot support data tiering for non-durable workloads that use asynchronous replication?

Does the same argument apply to Active-Active replication as well? Do we need durability as a foundation to support Active-Active replication?

Member Author:

> Are you suggesting we cannot support data tiering for non-durable workloads that use asynchronous replication?

It is actually the opposite: we need to make sure that whatever design we pick in the end for durability doesn't paint us into an architectural corner. Data tiering and active-active are both big bets for Valkey too. This callout should be a no-brainer IMO, but I consider it too important a point to miss.

@hpatro mentioned this pull request Oct 27, 2025
