Problem Statement
Valkey offers several options for users seeking to ensure data durability, but each has notable limitations.
A commonly used approach is the WAIT command, which ensures that changes are replicated to a specified number of replicas before confirming success to the client. However, this places the burden of managing durability on the application, introducing complexity and increasing the risk of errors.
Another option is to use Append-Only File (AOF) persistence with appendfsync always, while avoiding replicas altogether. This ensures that writes are persisted to disk before being acknowledged to the client. However, AOF is not a true write-ahead log (WAL). For instance, if a write or fsync operation fails, the change might still be appended to the AOF file even though the client was informed of a failure. This creates a mismatch between the client's perception and the actual system state, leading to what can be described as durability inconsistency. Moreover, this approach provides durability guarantees only on a single node.
Similarly, the WAITAOF command, which ensures data is written to the AOF file before acknowledgment, shares the same challenges of client-side complexity and durability inconsistency. While it provides durability under specific conditions, it does not guarantee correctness in all failure scenarios.
As a result of these limitations, Valkey currently lacks a mechanism to provide a robust durability contract. Specifically, there is no reliable way to guarantee that at least N copies of data are consistently maintained—where N is the user-configured replication factor—in the face of single-point failures such as a node outage or an availability zone (AZ) failure.
Proposal
This proposal introduces first-class support for synchronous replication and deterministic elections in Valkey, enabling users to maintain at least N consistent copies of their data even in the event of single-point failures. Here, N refers to the user-configured replication factor. Additionally, this proposal ensures that only replicas fully synchronized with the primary are eligible to contend for primaryship, guaranteeing predictable and reliable behavior across all failure scenarios.
Synchronous Replication
When operating in "durability" mode—a new server configuration—the primary node acknowledges a write command only after successfully replicating the change to the expected number of replicas and ensuring that those changes are committed.
To prevent unbounded delays, the primary enforces a configurable timeout for replica acknowledgments. If the number of responding replicas falls below the durability threshold after the timeout expires, the write will fail with a "not enough replicas" error. This ensures a balance between durability guarantees and write latency.
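The write path described above can be sketched as follows. This is a minimal illustration of the proposed semantics, not Valkey code; the class and error names (`DurabilityPrimary`, `NotEnoughReplicasError`) are hypothetical.

```python
import threading

class NotEnoughReplicasError(Exception):
    """Raised when the durability quorum is not met before the timeout."""

class DurabilityPrimary:
    """Toy model of a primary running in durability mode."""

    def __init__(self, required_replicas, ack_timeout_sec):
        self.required = required_replicas   # user-configured replication factor
        self.timeout = ack_timeout_sec      # configurable ack timeout
        self.cond = threading.Condition()
        self.acks = {}                      # write offset -> set of replica ids

    def write(self, offset, replicate):
        """Acknowledge the client only after `required` replicas confirm."""
        with self.cond:
            self.acks[offset] = set()
        replicate(offset)  # ship the change to all eligible replicas
        with self.cond:
            ok = self.cond.wait_for(
                lambda: len(self.acks[offset]) >= self.required,
                timeout=self.timeout)
            if not ok:
                # Quorum not met in time: fail the write rather than
                # acknowledge a change that is not sufficiently replicated.
                raise NotEnoughReplicasError(
                    f"only {len(self.acks[offset])}/{self.required} replicas acked")

    def on_replica_ack(self, replica_id, offset):
        """Called when a replica confirms it has committed the change."""
        with self.cond:
            if offset in self.acks:
                self.acks[offset].add(replica_id)
                self.cond.notify_all()
```

The timeout bounds write latency: a slow or lost replica turns into an explicit "not enough replicas" error instead of an indefinitely blocked client.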
Deterministic Elections
A replica’s eligibility for election is explicitly controlled by the primary. All replicas start in an ineligible state, which prevents them from contending for the primary role.
A replica's first step in catching up is full synchronization with the primary. Once this completes, the replica begins receiving the incremental updates accumulated during the sync. When the replica's replication offset matches the primary's replication offset, the primary explicitly grants it election eligibility by sending a specific command. Until this command is issued, the replica remains ineligible to contend for primaryship.
To ensure deterministic behavior, the primary temporarily pauses writes during this process, similar to a manual failover.
If a replica disconnects from the primary, it immediately loses its eligibility. Upon reconnection, the replica reverts to an ineligible state and must complete the same synchronization process before regaining eligibility. This ensures that only fully synchronized replicas can participate in elections, eliminating the risk of promoting an inconsistent primary.
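The eligibility rules above amount to a small state machine per replica. The sketch below is illustrative only; the state names and transition hooks are assumptions, not Valkey internals.

```python
from enum import Enum, auto

class ReplicaState(Enum):
    DISCONNECTED = auto()
    SYNCING = auto()    # full sync, then draining the incremental backlog
    ELIGIBLE = auto()   # primary has explicitly granted election eligibility

class Replica:
    def __init__(self):
        self.state = ReplicaState.DISCONNECTED
        self.offset = 0

    def on_connect(self):
        # Every (re)connection starts ineligible and must resynchronize.
        self.state = ReplicaState.SYNCING

    def on_grant_eligibility(self, primary_offset):
        # The primary sends the grant only once offsets match.
        if self.state is ReplicaState.SYNCING and self.offset == primary_offset:
            self.state = ReplicaState.ELIGIBLE

    def on_disconnect(self):
        # Eligibility is lost immediately on disconnect.
        self.state = ReplicaState.DISCONNECTED

    def can_contend_for_primary(self):
        return self.state is ReplicaState.ELIGIBLE
```

The key invariant is that `ELIGIBLE` is only ever reached via an explicit grant from the primary at matching offsets, and any disconnection drops the replica back to the start.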
Faster Full Synchronization
In the event of a primary failure, one of its eligible replicas will assume the primary role. When the original primary restarts, it will rejoin as a replica and must quickly synchronize with the new primary before writes can resume. Minimizing downtime during this process is critical to maintaining availability.
Currently, when persistence is enabled, the restarting node reads from disk to reconstruct its data before synchronizing with the primary. This approach is slow and can introduce significant delays to write operations. With Valkey 8.0's support for dual-channel replication, it is now possible to always perform full synchronization directly from the new primary's memory. This significantly accelerates data transfer without adding too much load on the primary node.
Further performance improvements can be achieved by leveraging additional CPU resources during the full sync process. By parallelizing data transfer and processing, the system can dramatically reduce the time required for a restarting node to catch up with the new primary, thereby shortening the unavailability window.
Analysis
Steady State
In the steady state, where all replicas are healthy and eligible, a Valkey server running in durability mode ensures that every committed change is replicated to the user-configured number of replicas, in addition to the primary. This guarantees that the durability configuration set by the user is fully honored under normal operating conditions, assuming no more than single-point failures.
When a new replica is added, it does not participate in the write quorum until it has fully synchronized with the primary. During this synchronization period, the replica remains ineligible for elections. Once its replication offset matches the primary’s replication offset, the primary explicitly grants it eligibility and updates the expected number of replicas for the write quorum.
If a replica is removed from the cluster, it must first go through the "cluster forget" process. This ensures that the primary stops considering the replica as part of the quorum, preventing unnecessary write timeouts caused by waiting for acknowledgments from a node that is no longer present.
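Membership changes and the write quorum interact as sketched below. This is a hypothetical model of the bookkeeping (`QuorumTracker` and its methods are made up for illustration); it mirrors the add-replica and "cluster forget" flows described above.

```python
class QuorumTracker:
    """Toy bookkeeping for the expected write quorum in durability mode."""

    def __init__(self, replication_factor):
        self.replication_factor = replication_factor  # user-configured N
        self.eligible = set()                         # eligible replica ids

    def on_replica_caught_up(self, replica_id):
        # A new replica counts toward the quorum only after full sync and
        # an offset match, when the primary grants it eligibility.
        self.eligible.add(replica_id)

    def on_cluster_forget(self, replica_id):
        # A forgotten replica stops counting toward the quorum, so writes
        # do not time out waiting for acks from a node that is gone.
        self.eligible.discard(replica_id)

    def write_allowed(self):
        # Writes proceed only while enough eligible replicas remain to
        # honor the durability threshold.
        return len(self.eligible) >= self.replication_factor
```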
Failure Scenarios
An eligible replica loses connection to the primary.
A write operation racing against this event may experience delays or timeouts, which is an intentional trade-off to ensure durability. The primary closes the client representing the disconnected replica, which then loses its eligibility. Upon reconnection, the replica must catch up with the primary before it can regain eligibility.
An eligible replica requires full synchronization.
This scenario is treated the same as a disconnection. The replica remains ineligible during the synchronization process and only regains eligibility after completing the synchronization and being explicitly granted eligibility.
Multiple eligible replicas require resynchronization.
If more than one eligible replica becomes disconnected or requires resynchronization, the behavior remains identical to the previous cases. Writes will only proceed if the required number of eligible replicas are available. Otherwise, writes are delayed or rejected until the system can restore the desired durability guarantees.
The primary fails.
In the event of a primary failure, only replicas explicitly granted eligibility are allowed to participate in the election process. This ensures that no stale replica becomes the new primary. If some eligible replicas are also unavailable during this event, the election will proceed as long as at least one eligible replica remains. However, if the remaining number of eligible replicas is insufficient to meet the user-defined durability threshold, future writes will be rejected until the durability guarantee is restored.
Availability Trade-offs
This proposal prioritizes durability over availability by requiring all eligible replicas to acknowledge writes in durability mode. Once a replica is deemed eligible, it is expected to remain in that state and participate in all future replication synchronously. This strict acknowledgment process ensures that all replicas participating in elections are fully in sync, eliminating the risk of promoting an inconsistent or stale primary.
Allowing configurable durability thresholds, where only a subset of eligible replicas need to acknowledge writes, might appear to offer greater flexibility. However, it introduces significant challenges: if a replica fails to acknowledge a write, the system would need to dynamically revoke its eligibility, and this revocation process complicates the protocol in several ways:
- Performance Overhead: Revoking eligibility requires explicit and synchronous communication to replicas. This adds latency and increases the volume of internal messages during write operations.
- Error-Handling Complexity: The revocation process introduces new failure scenarios. If the message revoking a replica's eligibility is lost or delayed, replicas may hold inconsistent views of eligibility status, leading to election ambiguity or delays.
By requiring strict acknowledgment from all eligible replicas, the proposed protocol avoids this complexity. The system guarantees durability and consistency at the cost of temporary unavailability during failure scenarios. This trade-off simplifies the protocol, reduces error-handling complexity, and ensures predictable behavior under both normal and failure conditions.
While this strict approach may lead to write rejections if the number of available eligible replicas drops below the user-configured replication factor, it is a necessary trade-off that prioritizes data durability over availability.
Incremental Perfection
While the described in-memory solution guarantees durability in the face of single-point failures, it cannot prevent data loss if all nodes in a shard experience simultaneous downtime. In such extreme cases, a proper write-ahead log (WAL) implementation would be necessary to ensure data durability. The WAL acts as an extra layer of defense, complementing synchronous replication and deterministic elections.
A critical aspect of this proposal is the requirement that replicas must be explicitly granted permission by the primary to participate in elections. This design ensures that if all nodes in a shard go down and then recover, no replica can independently initiate an election unless it has previously received explicit eligibility from the old primary. Consequently, in the absence of the old primary granting voting permissions, no election can take place, and the old primary is guaranteed to resume its role once it comes back online.
When all nodes reload their WAL after a full shard downtime event:
- The old primary's WAL will contain the most complete and up-to-date state, making it the natural candidate to resume the primaryship.
- Replicas remain ineligible to vote or become primary until explicitly granted eligibility by the old primary during normal operation.
This mechanism eliminates the risk of replicas contending for primaryship prematurely, ensuring data consistency.
This deterministic behavior avoids ambiguity and split-brain scenarios during recovery. The trade-off inherent in this approach is reduced availability after a full shard down event. The shard remains unavailable until the old primary is operational, as no other node is authorized to assume the primary role. However, this unavailability is intentional and aligns with the proposal’s core goal of prioritizing durability and consistency over availability in failure scenarios.
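The recovery rule after a full shard-down event can be condensed into a few lines. This is a toy model under stated assumptions: each node reloads a `was_primary` flag from its persisted state, and the field names are illustrative, not Valkey's.

```python
def who_resumes_primaryship(nodes):
    """nodes: list of dicts with 'id', 'was_primary', and 'eligible'
    flags reloaded from each node's WAL/persisted state after a full
    shard-down event. Returns the id of the node that may resume as
    primary, or None if the shard must remain unavailable."""
    for n in nodes:
        # Eligibility does not survive a full shard restart: a fresh
        # grant must come from the primary during normal operation, so
        # no replica can independently initiate an election.
        if not n["was_primary"]:
            n["eligible"] = False
    primaries = [n for n in nodes if n["was_primary"]]
    # Only the old primary may resume. If it is absent, the shard stays
    # unavailable -- the intentional durability-over-availability trade-off.
    return primaries[0]["id"] if primaries else None
```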
Decentralized vs. Centralized WAL
Introducing a WAL to enhance resilience in Valkey presents two potential architectural approaches, each with its own trade-offs:
Decentralized WAL
In this approach, each node maintains its own independent WAL. This design aligns closely with Valkey’s self-sufficient and distributed nature, allowing nodes to manage durability independently. However, as described above, a decentralized WAL achieves durability at the potentially higher cost of availability. For instance, after a full shard down event, only the old primary can resume its primary role. This ensures data consistency but delays recovery until the old primary becomes operational.
Distributed Centralized WAL
A centralized WAL provides a unified log that all nodes in a shard share, but it is implemented in a distributed and highly available manner. This design simplifies recovery by offering a single source of truth, making it easier to reconcile state across nodes. Essentially, centralized WALs delegate deterministic election and synchronous replication to the log service, reducing the complexity required on the engine side. However, relying on a specialized log service introduces an external dependency, which makes Valkey less self-sufficient and potentially more complex to deploy and manage.
Why AOF Is Not a True WAL
AOF is often used to persist data in Valkey, but it is not a true WAL. A true WAL guarantees that every change is recorded before the corresponding data operation is performed, ensuring an exact and replayable sequence of events for recovery. AOF, however, has several limitations that make it unsuitable as a WAL:
- Client-Perceived Inconsistency: AOF may persist a write operation to disk even if the client perceives the operation as failed. For example, consider a SET x y command. If the primary writes the command to the AOF file but crashes before acknowledging the client, the client assumes the operation failed. Upon recovery, the primary replays the AOF file, resulting in x being updated to y—a state inconsistent with the client's expectation. This creates a durability inconsistency because the client's view of the operation's success does not match the system's actual state.
- Lack of Separation Between Commit and Application: In a proper WAL, there is a clear distinction between logging a change (commit record) and applying the change to the in-memory data structure. AOF lacks this separation, meaning that partial updates can lead to ambiguous recovery behavior, especially in failure scenarios.
- Error Handling Limitations: AOF lacks mechanisms to address certain edge cases, such as partially written commands during disk failures. While checksum validation can mitigate some issues, it does not fully replicate the robust recovery guarantees provided by a true WAL.
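The commit/apply separation that AOF lacks can be demonstrated with a toy log. This sketch is purely illustrative (in-memory lists standing in for durable storage); recovery replays only changes whose commit record made it into the log, so a crash between logging a change and committing it cannot resurrect an unacknowledged write.

```python
class ToyWAL:
    """Minimal WAL with separate change and commit records."""

    def __init__(self):
        self.log = []   # durable log records: ("change", lsn, k, v) / ("commit", lsn)
        self.data = {}  # in-memory state

    def set(self, key, value):
        lsn = len(self.log)
        self.log.append(("change", lsn, key, value))  # 1. log the intent
        self.log.append(("commit", lsn))              # 2. durable commit record
        self.data[key] = value                        # 3. apply in memory
        return lsn

    def recover(self):
        # Replay only committed changes. A record without a matching
        # commit (e.g. a crash between steps 1 and 2) is discarded,
        # keeping recovered state consistent with what clients were told.
        committed = {r[1] for r in self.log if r[0] == "commit"}
        self.data = {}
        for r in self.log:
            if r[0] == "change" and r[1] in committed:
                self.data[r[2]] = r[3]
```

Under an AOF-style scheme there is only step 1: whatever was appended gets replayed on recovery, whether or not the client was ever told the write succeeded.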