HDDS-15066. Read-Write Lock race leave stale references to container creating orphan replicas #10109
Gargi-jais11 wants to merge 9 commits into apache:master
Conversation
@ChenSammi @sadanand48 Please review the patch at your convenience.
adoroszlai left a comment: Thanks @Gargi-jais11 for the patch.
adoroszlai left a comment: Thanks @Gargi-jais11 for updating the patch.
@sodonnel Please review this PR whenever you are free.
This change looks good and I think it will solve the problem. However, I think it reveals a problem with the code structure within the datanode. These services should not have to perform such complex locking to get a consistent view of a container, and it would be very easy in the future for some other service to come along and not do things correctly; indeed, some other existing part of the DN code may already be doing things incorrectly. I think all the container "logic" should be hidden behind an interface, and all the services, like disk balancer or block deletion, should call it (e.g. something like the sketch below). Then all the locking etc. happens behind the scenes in the same place, where it can be controlled more closely. Fixing this is a large exercise and not something we would want to take on in this PR. There are other places in KeyValueHandler that take the container lock - are we sure they are OK, or is any change needed in them too?
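For illustration, a minimal sketch of what such an interface could look like. The `ContainerOperations` name and both method signatures are hypothetical, not existing Ozone APIs; the import paths follow my reading of the Ozone tree and may need adjusting:

```java
import java.util.function.Consumer;
import java.util.function.Function;
import org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException;
import org.apache.hadoop.ozone.container.common.interfaces.Container;

// Hypothetical facade: every service (disk balancer, block deletion,
// replication) goes through this, so locking and stale-reference
// re-validation live in exactly one place.
public interface ContainerOperations {

  /** Run a read-only action against the current replica of the container. */
  <T> T withReadLock(long containerId, Function<Container<?>, T> action)
      throws StorageContainerException;

  /**
   * Run a mutating action. The implementation acquires the write lock and
   * re-resolves the container from the ContainerSet before invoking the
   * action, so callers can never operate on a stale reference.
   */
  void withWriteLock(long containerId, Consumer<Container<?>> action)
      throws StorageContainerException;
}
```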
I agree with you @sodonnel that all the locking should be hidden behind an interface so that it happens in one place. For the EC reconstruction case, the net result: reconstruction fails just because the shard at DN1 has been marked DELETED, even though it correctly pulled the other indexes from the other DNs. Container C still has shard idx=2 missing, and RM re-queues it on the next monitor cycle. Here we have two options:
Option 1: rely on the RM retry. It is simpler and already works; it just costs one extra cycle and wasted network I/O every time the race window is hit.
Option 2: re-fetch the container from ContainerSet after getting the lock, like the other fixes (see the sketch below).
Please let me know which option you both prefer, @sodonnel @ChenSammi. I just wanted to bring this to your attention and discuss. With Ratis under-replication and DiskBalancer working on the same container there is no issue: replication pulls from any one replica, so either the attempt against the DELETED replica fails and another replica is used, or under-replication acquires the lock before the state changes and succeeds. But for EC, just because of one shard the entire reconstruction has to be repeated, as described above.
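A minimal sketch of Option 2, assuming `ContainerSet.getContainer(id)` returns the currently mapped instance. The helper name `lockAndValidate` is mine, and the import paths follow my reading of the Ozone tree:

```java
import org.apache.hadoop.hdds.protocol.datanode.proto.ContainerProtos.Result;
import org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException;
import org.apache.hadoop.ozone.container.common.impl.ContainerSet;
import org.apache.hadoop.ozone.container.common.interfaces.Container;

final class StaleRefGuard {
  // Option 2: after blocking on the write lock, re-resolve the container
  // from the set and verify the caller still holds the live instance.
  static Container<?> lockAndValidate(ContainerSet containerSet,
      Container<?> container) throws StorageContainerException {
    container.writeLock();
    long id = container.getContainerData().getContainerID();
    if (containerSet.getContainer(id) != container) {
      container.writeUnlock();
      // The mapped instance changed while we waited, e.g. DiskBalancer
      // swapped in a new Container pointing at the destination volume.
      throw new StorageContainerException(
          "Container " + id + " was moved concurrently; retry",
          Result.CONTAINER_NOT_FOUND);
    }
    return container;
  }
}
```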
Second finding: for a QUASI_CLOSED container, if SCM sends a force-close command while DiskBalancer is working on the same container, the same stale-reference problem appears (in the exact timeline, after T8). This is a kind of regression: the force-close command has to be re-sent. I think here as well we need to re-fetch the container after taking the lock, as in the sketch above.
Third finding: markContainerForUnhealthy with DiskBalancer working in parallel. I believe this also needs to re-fetch the container after writeLock(). Given what SCM/RM does in response to the unhealthy report, the outcome is: a healthy replica on Disk2 gets deleted, and container C becomes genuinely under-replicated.
@sodonnel @ChenSammi the above are additional areas that need the same fix. The rest do not have this kind of issue; I have checked the code thoroughly.
For EC reconstruction, I don't think it exports the entire container like this. Rather, it reads the blocks out of the container block by block, using the normal read path through the datanode. So the question is: can the normal read path be impacted by the balancer moving a container from disk 1 to disk 2? Ideally, the DN should be safe to do this using its locking, as it is not ideal that reads would fail randomly. If it's Ratis, it can just try another replica with a small delay. With EC it would fall back to reconstruction reads, which are slower and use more resources.
With this problem it is getting difficult to see whether we have solved all the cases or not. The root cause is poor interface design and, beyond that, inadequate locking. Fixing this is going to be difficult. The locking is only inadequate because the lock is inside the container object, and the container object can be replaced by a new object against the same ID. What are all the scenarios that replace that container object? What if we instead updated the existing object in place under a lock? Would that solve all these edge cases more simply? It sounds like these problems all come from the disk balancer moving and replacing the container object, so if we fix just that, it may be a cleaner solution to this whole problem.
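To make that idea concrete, a rough sketch: instead of replacing the map entry with a new Container, mutate the existing one under its write lock, so threads blocked on that lock wake up holding the still-live object. The setter names here are illustrative assumptions, not necessarily the exact KeyValueContainerData API:

```java
import org.apache.hadoop.ozone.container.common.interfaces.Container;
import org.apache.hadoop.ozone.container.common.volume.HddsVolume;
import org.apache.hadoop.ozone.container.keyvalue.KeyValueContainerData;

final class InPlaceMoveSketch {
  // Hypothetical in-place update: the Container in the map never changes
  // identity, so references held across a lock wait remain valid.
  static void moveInPlace(Container<?> container, HddsVolume destVolume,
      String destMetadataPath, String destChunksPath) {
    container.writeLock();
    try {
      KeyValueContainerData data =
          (KeyValueContainerData) container.getContainerData();
      data.setVolume(destVolume);             // repoint at destination volume
      data.setMetadataPath(destMetadataPath); // and at the new on-disk paths
      data.setChunksPath(destChunksPath);
    } finally {
      container.writeUnlock();
    }
  }
}
```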
Looking at the ContainerSet class, the …
What changes were proposed in this pull request?
On a datanode, some work runs under a container read lock (or otherwise changes which replica directory / in-memory Container is authoritative) while other threads look up a container once and later take a write lock. If the in-memory container mapping or on-disk location changes in between, the second thread can still use a stale Container or KeyValueContainerData reference. That is a classic TOCTOU problem: the map and the caller disagree about where the replica lives. Worst cases include wrong ContainerSet updates, deleting or updating the wrong paths, ghost/orphan replica data on disk that SCM no longer tracks, and block deletion targeting the wrong RocksDB/chunks tree, so pending deletes on the live replica are never applied (space is not reclaimed).
This applies to CLOSED and QUASI_CLOSED containers (and to any path where balancing/replication overlaps lifecycle commands), not to a single state.
This is easiest to explain using DiskBalancer as an example, although it applies to other read/write-lock races as well:
While DiskBalancer moves a container between volumes on the same datanode, it holds the container read lock, copies the data, then calls ContainerSet.updateContainer(...) to point the in-memory map at a new Container instance (new volume/paths). Meanwhile, other threads often look up the container once, then block on writeLock() until the move finishes. After they unblock, they still hold a reference to the old container (source volume). They then run their logic and/or removeContainer(containerId) using stale paths and a stale object identity, as sketched below.
Impacts: as summarized above (wrong ContainerSet updates, deletes against stale paths, orphan replicas SCM no longer tracks, misdirected block deletion).
Suggested Fix:
For the RM side: after acquiring writeLock(), re-fetch the container from the map and compare by identity. If they differ, the container was moved; abort and let the caller retry.
For BlockDeletingService:
For DiskBalancer:
Move the container state check to after readLock() is acquired, so the check cannot act on a stale view of the container (see the sketch below).
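A minimal sketch of the DiskBalancer-side reordering; the eligibility condition and the copy step are simplified stand-ins for whatever the real balancing loop does:

```java
import org.apache.hadoop.hdds.protocol.datanode.proto.ContainerProtos;
import org.apache.hadoop.ozone.container.common.interfaces.Container;

final class BalancerOrderSketch {
  // After the fix: take the read lock first, THEN check the state, so the
  // eligibility decision and the copy see one consistent view. Previously
  // the state was checked before locking, leaving a window for it to change.
  static void balanceIfEligible(Container<?> container) {
    container.readLock();
    try {
      if (container.getContainerData().getState()
          != ContainerProtos.ContainerDataProto.State.CLOSED) {
        return; // no longer eligible for balancing; skip this container
      }
      copyContainerToDestinationVolume(container); // hypothetical copy step
    } finally {
      container.readUnlock();
    }
  }

  static void copyContainerToDestinationVolume(Container<?> c) {
    // placeholder for the actual volume-to-volume copy
  }
}
```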
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-15066
How was this patch tested?
Added unit tests on the DiskBalancer side as an example of the race condition, which can occur with any service's read/write-lock usage.
Test file:
TestDiskBalancerWithConcurrentBackgroundTasks
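For readers who want to see the race without an Ozone cluster, here is a self-contained sketch in plain Java of the scenario the test exercises; `FakeContainer` and the sleep-based scheduling are illustrative simplifications, not the actual test code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Minimal stand-in for the datanode race: one thread swaps the map entry
// while holding the "container" read lock; another thread that looked the
// container up earlier blocks on the write lock and wakes up stale.
public final class StaleReferenceDemo {

  static final class FakeContainer {
    final String volume;
    final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    FakeContainer(String volume) { this.volume = volume; }
  }

  public static void main(String[] args) throws Exception {
    Map<Long, FakeContainer> containerSet = new ConcurrentHashMap<>();
    FakeContainer original = new FakeContainer("disk1");
    containerSet.put(1L, original);

    // "DiskBalancer" takes the read lock first, as it does during a move.
    original.lock.readLock().lock();

    Thread service = new Thread(() -> {
      FakeContainer ref = containerSet.get(1L); // look up once (TOCTOU)
      ref.lock.writeLock().lock();              // blocks until move finishes
      try {
        // The map now points at a new instance; 'ref' is stale.
        System.out.println("service sees volume=" + ref.volume
            + ", stale=" + (containerSet.get(1L) != ref));
      } finally {
        ref.lock.writeLock().unlock();
      }
    });
    service.start();

    Thread.sleep(200); // crude: let the service block on the write lock
    containerSet.put(1L, new FakeContainer("disk2")); // remap the ID
    original.lock.readLock().unlock();                // "move" finishes
    service.join(); // prints: service sees volume=disk1, stale=true
  }
}
```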