Conversation
This document defines the problem of insufficient write durability in Valkey Cluster and outlines the core requirements for a solution. Signed-off-by: Ping Xie <pingxie@google.com>
Pull Request Overview
This PR introduces a Product Requirements Document (PRD) that defines the problem of insufficient write durability in Valkey Cluster and establishes requirements for a quorum-based durability solution. The document addresses the gap between write acknowledgment and actual data persistence across topology changes.
- Identifies durability challenges in current asynchronous replication model
- Defines requirements for explicit per-command and connection-level durability controls
- Establishes architectural constraints to preserve performance for non-durable workloads
Signed-off-by: Ping Xie <pingxie@google.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Ping Xie <pingxie@outlook.com>
Signed-off-by: Ping Xie <pingxie@google.com>
ranshid left a comment:

I like this doc as it mainly focuses on the problem and less on the technical solution.
> * **Leader Election**: A write may be acknowledged by the primary and propagated to a subset of replicas, but a different replica—one that never received the write—can be legitimately promoted to primary. After promotion, the shard's authoritative history will exclude that write.
This assumes that we had more than a single replica and we only waited for a single approval, right? Maybe we should emphasize this?
When you say "we waited", are you referring to the use of the WAIT command?
> This assumes that we had more than a single replica and we only waited for a single approval, right? Maybe we should emphasize this?
Yes. I am talking about the generic case of N replicas. A one-replica setup doesn't have an election concern.
> * **Slot Migration Rollback**: A data loss scenario occurs when the source shard broadcasts a higher config epoch after the migration has completed, forcing the target shard to roll back the ownership transfer and discard any writes it accepted in the interim. This can be triggered by a failover in the source shard or by a concurrent migration conflict where the source shard bumps its own epoch.
You refer to the legacy SM, IIUC. I would state why using Atomic Slot Migration would not help in this case.
ASM has the same durability gap today, but that is not the main point. I am using "Slot Migration" in a generic sense and as the context for the later "Solution Requirements" section ("Membership-awareness").
Yeah, we need a two-phase commit for the slot migration?
Yes. Slot ownership needs to be durably committed along with its data. So we will need to 2PC the data between shards with ASM. That will promote slot migration to a real 2PC as well.
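The two-phase idea above can be sketched minimally as follows. This is an illustrative model, not Valkey code: `Shard`, `durably_record`, and `migrate_slot` are hypothetical stand-ins, and `log` stands in for a quorum-committed, durable record.

```python
# Illustrative sketch of 2PC for slot ownership transfer: ownership is
# committed only after both shards durably record the handoff, so a later
# epoch bump on the source cannot silently revert the transfer.

class Shard:
    def __init__(self, name):
        self.name = name
        self.log = []  # stand-in for quorum-committed, durable decisions

    def durably_record(self, entry):
        self.log.append(entry)

def migrate_slot(slot, source, target):
    # Phase 1 (prepare): both sides durably record the intent to transfer.
    source.durably_record(("PREPARE", slot, target.name))
    target.durably_record(("PREPARE", slot, source.name))
    # Phase 2 (commit): only once both prepares are durable is ownership moved.
    source.durably_record(("COMMIT", slot, target.name))
    target.durably_record(("COMMIT", slot, source.name))
    return target.name

a, b = Shard("A"), Shard("B")
new_owner = migrate_slot(5, a, b)
```

The key property is that the source's durable log contains the COMMIT record, so a post-migration failover in the source shard replays the handoff instead of reverting it.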
Just to clarify - is the example below the scenario being called out?

- Imagine slot 5 is migrated from Shard A to Shard B. A client writes "SET k v" to Shard B after the migration. Later, Shard A bumps its config epoch (say due to failover), and the cluster decides Shard A still owns slot 5. Shard B must roll back and discard "SET k v", so the client's acknowledged write is lost.
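That scenario follows directly from the cluster's epoch-based convergence rule. A hedged model (names are illustrative, not Valkey internals):

```python
# Hypothetical model of the "highest config epoch wins" rule that forces
# the rollback described in the example above.

def resolve_slot_owner(claims):
    # Cluster convergence rule: the claimant with the highest epoch wins.
    return max(claims, key=lambda c: c["epoch"])["shard"]

# Slot 5 finished migrating to Shard B at epoch 10, and B accepted SET k v.
# Shard A then bumps its epoch to 11 (e.g. due to a failover in A).
claims = [
    {"shard": "A", "epoch": 11},
    {"shard": "B", "epoch": 10},
]

owner = resolve_slot_owner(claims)
# A reclaims slot 5, so B must roll back and discard the acknowledged write.
```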
> * **Two-Node Shard Tradeoffs**: In a 1-primary/1-replica topology, any durability guarantee inherently trades availability for durability. Under a single-node failure, requests expecting a durable acknowledgement must fail clearly and predictably.
I am not sure I understand this paragraph. Is the intention to say that in order to gain full durability we have to use at least 1 replica?
It is the other way around. The idea is to explicitly call out the need to support strong durability for the very common HA setup of 1p1r. The non-HA 1p0r setup is a special use case IMO, which might or might not require a different solution. I can add a clause to spec out the requirement/experience for 1p0r without, again, hinting at any solutions.
With only two nodes (1p1r) it's not possible to tell the difference between primary failure and netsplit. The replica can never get a majority.
If we move the quorum to the shard level, I'd say we need at least two replicas, but I think it's a totally acceptable limitation. Now we can instead allow single-shard and 2-shard clusters.
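The arithmetic behind this point is just strict-majority voting. A minimal sketch:

```python
# A quorum is a strict majority of the shard's voting nodes. In 1p1r a lone
# surviving replica can never reach it, which is why it cannot distinguish
# a primary failure from a network split and safely win an election.

def majority(voters):
    return voters // 2 + 1

# 1p1r: 2 voters, majority is 2 -- a lone replica has only its own vote.
# 1p2r: 3 voters, majority is 2 -- two surviving nodes can still elect.
p1r1_quorum = majority(2)
p1r2_quorum = majority(3)
```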
I would also prefer 1p2r per shard to be the minimum for strong durability. We can add support for observer nodes later that process the stream and can handle the network split case for multiple shards.
> I would also prefer 1p2r per shard to be the minimum for strong durability.
+1, but I think we can leave this decision to the execution. BTW, the framework that I am going with is:
- consensus on requirements, which is this RFC
- an HLD that doesn't paint us into an architectural corner for all the things we would like to achieve down the road, including data tiering, 1p1r, etc.
- an execution plan for 9.1 and 10.0, maybe with 9.1 laying down the foundation, and 10.0 with the MVP (1p2r for instance)
I think this is a good high-level plan.
The HLD is hard to know long term though. 1p1r is obviously always possible without availability, but we should probably decide on the availability story.
Regarding the 1p1r vs. 1p2r per shard debate: if the durable system is completely decoupled from the Valkey engine, wouldn't we be able to support 1p1r? I think the design would be dictated by this key requirement.
@hpatro - if the durable system is completely decoupled from the Valkey engine, are you thinking about a durability and consistency system with Raft/Paxos through a module?
> simpler code and stronger durability guarantees.
>
> ## **Key Durability Challenges**
I think we also need to drop some words about AOF durability guarantees on the replica side. Like, if the AOF is always fsynced on the replica, it might help prevent any type of data loss when combined with full replica write acknowledgments, right?
Should we leave this to the HLD, as one of the potential solutions? As you have rightly noticed, my plan is to spec out just the requirements/experience in this PRD doc.
Durability comes from the Latin "durus" = hard, as in hard disk.

But I think it's a misconception. The point of durability is that the data survives various levels of failures, such as a RAM power outage or a software crash.

In the times before distributed databases, this meant fsync to disk. RAID adds disk redundancy. Even weekly backups to a disk that someone brings home, for the event that the server room explodes...

But with distributed databases, we get the redundancy by keeping multiple replicas in different availability zones. I'd argue that a disk isn't necessary.

The `appendfsync always` setting is orthogonal to this discussion, the way I see it.
Yeah, FWIW I always say we're adding durability and consistency. Valkey written to multiple disks with AOF may be durable, but it has long failure times. If you fail over to a replica, you lose consistency.
I agree with everything. But in order to fully pass the message of what we would like to solve, we need to make sure we emphasize the problems with the alternatives.
Yes. Durability + consistency + high availability is all in my goal.

There is durability with a 1p0r aof-always-fsync setup, and I am sure it works for some users, but the lack of high availability is a big gap IMO.
> * **Clear Failure Modes**: If a client requests a durable write and the server cannot confirm it within the timeout, the server must return an explicit error, not an ambiguous success. This prevents a "silent downgrade" in durability where a client might incorrectly assume a write was successful. This clear error signal informs the client that the write's state is unknown, enabling the application to safely retry the operation, which should be designed to be idempotent.
This will not serve as an indication of whether the data exists or not, right? Like in shared-lock algorithms, it will not help understand whether to assume the lock was indeed taken or not, correct?
That is the whole point of this paragraph, i.e., the durable signal needs to be clear and durable. In the key challenges section, I gave the example of today's slot migration losing a WAITed write in some corner-case but non-zero-probability scenarios. The ask here is that when the system can't ensure data durability, it needs to clearly say so. Don't accept the request but silently drop the guarantee.
So you basically call out that a durable solution should provide a way to be sure whether the data exists or not? What about network disconnects? Do we require a durable solution to ensure the data was not written down (like a 3-way handshake for writes)?
> What about network disconnects?

Are you talking about a networking issue between the client and server, or network partitioning among the nodes? Regardless, I think this is a good reminder that I need to define the fault domain in this RFC as well. We are apparently not signing up for the day when the Earth disappears...
I think we should focus on not returning OK to writes that are not yet made durable.

> the server must return an explicit error, not an ambiguous success

Whether there is an explicit error reply or no reply at all (as in disconnects) is not important IMHO. Let's reword the part about "explicit error"?
Right, we should make clear what are the durability semantics of write operations. We should state something like:
- If a client submits a write operation and receives a success response from the server, the data is guaranteed to be durable.
- If a client submits a write operation and receives a failure response from the server, no data has been written.
- If a client submits a write operation but does not receive any response from the server (due to a timeout, connection failure, etc.), there are no guarantees regarding whether the data was written or not.
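A client honoring the three outcomes above might look like the following sketch. `send_write` and `WriteTimeout` are hypothetical stand-ins, not an existing Valkey client API, and the retry is safe only because SET is idempotent.

```python
# Sketch of client-side handling of the three durability outcomes above.
# `send_write` / WriteTimeout are illustrative, not a real client API.

class WriteTimeout(Exception):
    """No response received: the write's durability is unknown."""

def durable_set(conn, key, value, attempts=3):
    for _ in range(attempts):
        try:
            reply = conn.send_write("SET", key, value)
        except WriteTimeout:
            continue  # outcome 3: state unknown; SET is idempotent, so retry
        if reply == "OK":
            return True   # outcome 1: success implies the write is durable
        return False      # outcome 2: explicit failure implies nothing written
    raise WriteTimeout("write durability still unknown after retries")
```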
Agreed. Will incorporate @rjd15372's proposal.
Thanks for getting this started @PingXie. I've a slightly different take on durability.
Sharing the high level thought here:
- I would like to maintain the same semantics of Valkey and avoid introducing any client behavior change. The client wouldn't receive a success/failure back until the quorum has committed the operation. During that period the client would be in a blocked state on the primary.
- I would like to introduce the durability configuration at the server level and keep it immutable. By that I intend to make the user opt in to the performance degradation of the system that comes with the better durability guarantee, and not see it as a surprise. Introducing the concept at the connection level would make the debuggability of the system very hard.
> * **Authoritative Acknowledgement**: A durable write's acknowledgement must be tied to a single, authoritative decision at the shard level, ensuring it remains valid through any topology change like a leader election or slot migration.
Why does it have to be authoritative?

This sounds like slot-level consistency.
I use "authoritative" to qualify the replication history. There needs to be one deterministic replication history within a given shard that is consistent with the client expectation. We don't have a slot-level replication history today, but if/when we get there, this would apply too.
To be pedantic, I think that is the same as saying we have sequential consistency for each slot. There is some total ordering of requests against a slot, and that is preserved (even though nodes may be at different parts along the ordering).

I think the requirement is sensible, though; just suggesting a more formal way of saying it.
I agree with Madelyn; it looks like what we want is "sequential consistency for each slot", and that should be stated clearly as a requirement.
Not sure I understand the definition of "sequential consistency for each slot". The linearizability requirement is at the shard level: there should be a total order for all writes for a given shard, and this total order is what I called "authoritative". But I will use "total order" going forward, which seems like the more well-established term.
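The "one total order per shard" property being discussed can be stated compactly: every node's applied history must be a prefix of the shard's total order (a lagging node is fine; a diverged one is not). A minimal sketch:

```python
# Minimal statement of the per-shard total-order property discussed above:
# each node's applied history must be a prefix of the shard's total order.

def is_prefix(history, total_order):
    return history == total_order[: len(history)]

total_order = ["SET a 1", "SET b 2", "SET c 3"]

# A lagging replica is fine -- it is simply earlier in the same order.
lagging_ok = is_prefix(["SET a 1", "SET b 2"], total_order)
# A replica that skipped a write has diverged from the authoritative history.
diverged = is_prefix(["SET a 1", "SET c 3"], total_order)
```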
> * **Clear Failure Modes**: If a client requests a durable write and the server cannot confirm it within the timeout, the server must return an explicit error, not an ambiguous success. This prevents a "silent downgrade" in durability where a client might incorrectly assume a write was successful. This clear error signal informs the client that the write's state is unknown, enabling the application to safely retry the operation, which should be designed to be idempotent.
Why do we want to introduce a new mechanism? No response would imply the request hasn't succeeded yet.
The point here is that we never return success (+OK) before we know for sure a write is durably committed in the cluster.

Beyond that, whether we return an error reply or no reply at all shouldn't be that important, but there will always be some cases, like if the connection is lost while the client is waiting for a write to be committed. Then the client can't know whether it was committed or not.

I think we should drop the formulation "must return an explicit error" and instead focus on not returning success unless we know for sure.
> The point here is that we never return success (+OK) before we know for sure a write is durably committed in the cluster.

That is exactly what I don't quite like about the way WAIT is implemented. However, I also didn't want to completely throw away WAIT and introduce read/write concerns, which I consider over-engineered (I can expand on this if there is interest). Today, WAIT returns the number of replicas that have applied the changes. It requires the application developer to understand the quorum, which IMO is a cognitive overload. I would like to return an explicit message on whether or not the client's write was committed in quorum, but this alone breaks the compatibility that I was hoping for in the requirements section - maybe we do need a new WAIT variant, WAIT_QUORUM?
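The "cognitive overload" here is the quorum math that WAIT currently leaves to the application: WAIT returns a bare replica count, and the developer must turn it into a majority decision. A hedged sketch of that math (the helper name is illustrative):

```python
# WAIT returns how many replicas acknowledged a write; deciding whether
# that constitutes a quorum is left to the application today.

def met_quorum(wait_reply, num_replicas):
    voters = num_replicas + 1   # the primary plus its replicas
    acked = wait_reply + 1      # WAIT counts replicas only; add the primary
    return acked > voters // 2  # strict majority of the shard

# 1p2r shard: one replica acked -> primary + replica = 2 of 3 voters, quorum.
# Zero replicas acked -> only the primary has the write, no quorum.
one_acked = met_quorum(1, 2)
none_acked = met_quorum(0, 2)
```

A hypothetical WAIT_QUORUM variant would fold this decision into the server's reply instead of pushing it onto every application.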
IMO the WAIT command should not be part of the durability feature. WAIT only makes sense when using asynchronous replication.
If we enable durability of write commands, then when the server returns the response to a client write command, it must be guaranteed that the data is persisted and durable.
I am fine with dropping WAIT for now for simplification, but as I mentioned in the other comment, WAIT has its value and I think it can be incorporated into Raft (by waiting for a quorum ACK for a given offset only).
> * **Preserve `WAIT` Command Interface**: The syntax (`WAIT <replicas> <timeout>`) and integer return value of the existing command must not be changed.
I don't see any benefit of coupling this with the durability system, which we can build independently.

Agree. It's related as background information though, when explaining why WAIT is not enough in the current design.
Not sure I understand the "coupling" and "background information" callouts?

This clause is about interface compatibility. My take is that the WAIT command was introduced to solve the exact same core problem; its failure is not with the interface but with the design/implementation. I would like to see if we can retain the syntax while fixing things under the hood. This helps with the learning curve on the developer side.
OK... By "background information" I meant earlier attempts. I guess I find it hard to think about the requirements without seeing a solution. Let's keep this then.
I would drop this requirement. IMO the WAIT command only makes sense when a system uses asynchronous replication.

If a client wants to enable the durability feature, it should not be required to use WAIT, and should rewrite its existing application code accordingly.
WAIT is an optimization that allows users to balance durability and performance explicitly. I am fine with every write being "durable" in V1, but I do think there is value down the road.
> * **Flexible Durability Controls**: The solution must also provide a connection-level opt-in for durable-by-default behavior.
Shouldn't we be doing this at the server level for ease of understanding by developers? All write operations would be durable.
I think that is a very reasonable first version. I do want to provide some configuration agility in the future, but I agree that it shouldn't be a blocker. I will remove this.
zuiderkwast left a comment:
Thank you for this work!
> Production users increasingly run workloads where **loss of an acknowledged write is unacceptable**, not just in steady state but also during failover and cluster changes. In practice, users have two distinct needs for durability controls:
We could add a 3rd option: a global config.
It's hard to ensure these quorum-based writes without moving leader election to the shard, making the majority of replicas in the shard the quorum. Then we're talking about a new consensus model.
If we go for something like Raft-based replication, can we still allow both asynchronous and synchronous (ack-by-majority) replication? Maybe we can...
I do like the per-client wait behavior, but I'm OK with all writes being durable.
madolson left a comment:
I would also like one more requirement: we shouldn't invent our own consistency model or algorithm. When data integrity is on the line, we don't want to accidentally lose data because we tried to tune it better.

This set of requirements more or less defines a slot-level durability and consistency guarantee. That can either be done through synchronous replication (if a replica is down, writes fail) or eventual consistency (we accept consistency gaps between replicas). Both seem doable.
> latency for many workloads, but it also creates a window where an **acknowledged** write can be lost if the primary fails before replicas make the write durable. The existing `WAIT <replicas> <timeout>` command provides a client-side synchronization mechanism for clients to block until a specified number of replicas have recorded the write. However, this propagation-based
WAIT also exposes the dirty data to other clients, who might then take action on it but have it later get reverted. This is why just replicating to a majority is not sufficient.
Yeah, this document should mention that we can't allow other clients to read uncommitted writes. That would be inconsistent. (This is why we also need a WAL, but let's not mention WAL in this document.)
Agreed. "Read committed writes only on the primary" is a better way to phrase it. I will add it to the requirements section.
Ideally we should fulfill this extra requirement, but I believe it adds a lot of pain in terms of slowness and complexity. It may require MVCC or something equivalently complex.

I believe it's possible to provide durability guarantees for write commands and/or WAIT while still allowing dirty writes (writes that are not yet made durable). That's the read-uncommitted isolation level, which is not perfect, but I think it can be fast and not very complex, and it can be done without locks and MVCC.

Let's keep this requirement but give it a number, so we can refer to the requirements and implement incrementally?
Just to clarify: why not make both read isolation configurable and enforce 1p2r as the default when DURABILITY=QUORUM is declared?
– Configurable isolation:
• Cluster default = READ_COMMITTED + DURABILITY=QUORUM.
• Connections may explicitly opt into READ_UNCOMMITTED for latency, but not stronger guarantees than the cluster supports.
• On failover/migration: READ_COMMITTED readers stall or error until durable state is restored; READ_UNCOMMITTED readers must re-negotiate.
– Topology enforcement:
• Quorum durability requires ≥3 voters, so 1p2r should be the safe default.
• If users want 1p1r, they must explicitly choose a weaker mode (e.g., DURABILITY=BEST_EFFORT).
• No silent downgrades — the durability/isolation contract should be fixed at cluster creation.
This gives safe defaults (1p2r + READ_COMMITTED), flexibility for latency-sensitive use cases, and avoids the pitfalls of hidden behavior changes.
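The "no stronger guarantees than the cluster supports" rule above could be enforced by a simple negotiation check. A hedged sketch; the mode names (READ_COMMITTED, READ_UNCOMMITTED) and the function are illustrative, not an agreed-on interface:

```python
# Hypothetical check for the "no silent downgrades / no opt-up" rule: a
# connection may opt down for latency but never exceed the cluster's
# configured isolation, and refusal is an explicit error, never silent.

ISOLATION_STRENGTH = {"READ_UNCOMMITTED": 0, "READ_COMMITTED": 1}

def negotiate_isolation(cluster_mode, requested):
    if ISOLATION_STRENGTH[requested] > ISOLATION_STRENGTH[cluster_mode]:
        raise ValueError("connection cannot exceed cluster isolation level")
    return requested
```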
The focus of this RFC is on durability and linearizability. Read-uncommitted/dirty read is conceptually a property of isolation, which is very important, but I would suggest keeping it out of this RFC to keep the scope contained.

For the 1p1r setup to reach consensus at the shard level, we would need a witness node, which adds complexity. This is why I propose supporting (quorum) durability for 1p2+r only, leaving all other setups to the existing cluster architecture for now.
I am aligned in principle but I feel that we might have to introduce a new

I would make sure the requirements RFC leaves enough room for this option. That being said, I also feel that this is very likely the way to go, at least for our very first strong-durability release.

YES, fully agreed. Will add it to the requirements.
Thanks for all the feedback and discussions. I will update the doc next and also collate open questions. Reusing a comment I left earlier:
Thanks for the draft! A few additional areas to consider in the PRD:
> * **Preserve `WAIT` Command Interface**: The syntax (`WAIT <replicas> <timeout>`) and integer return value of the existing command must not be changed.
>
> * **Flexible Durability Controls**: The solution must also provide a connection-level opt-in for
Why is the connection level important? Why is the shard or cluster level not sufficient? Are there specific use cases where customers would benefit from this? Hoping not to over-engineer if there is no real need.
I think there is, but I do agree that this is going into the design area, which was not my intent for this doc. I have rephrased the configurability to something more vague and will remove the per-command/per-connection and WAIT callouts from this RFC. We can continue this discussion in the HLD.
> that avoids the I/O and network latency costs of a durable, quorum-based commit.
>
> * **Foundation for Future Features**: The chosen durability mechanism should provide a clear and extensible foundation for future high-value features, most notably **data tiering**. This would
Why are durability and data tiering interconnected? Are you suggesting we cannot support data tiering for non-durable workloads that use asynchronous replication?

Does the same argument apply to Active-Active replication as well? Do we need durability as a foundation to support Active-Active replication?
> Are you suggesting we cannot support data tiering for non-durable workloads that use asynchronous replication?

It is actually the opposite: we need to make sure that whatever design we pick in the end for durability doesn't paint us into an architectural corner. Data tiering and active-active are both big bets for Valkey too. This callout should be a no-brainer IMO, but I consider it too important a point to miss.
This document defines the problem of insufficient write durability in Valkey Cluster and outlines the core requirements for a solution. It details the failure modes of the current asynchronous replication model and establishes the principles for a new, quorum-based durability mechanism that will provide strong durability guarantees without regressing on performance for non-durable workloads.
@valkey-io/core-team FYI