HDDS-15066. Read-Write Lock race leave stale references to container creating orphan replicas #10109
Gargi-jais11 wants to merge 9 commits into apache:master
Conversation
@ChenSammi @sadanand48 Please review the patch at your convenience.
adoroszlai left a comment: Thanks @Gargi-jais11 for the patch.
adoroszlai left a comment: Thanks @Gargi-jais11 for updating the patch.
@sodonnel Please review this PR whenever you are free.
This change looks good and I think it will solve the problem. However, I think it reveals a problem with the code structure within the datanode. These services should not have to perform such complex locking to get a consistent view of a container, and it would be very easy in the future for some other service to come along and not do things correctly; indeed, some other existing part of the DN code may already be doing things incorrectly. I think all the container "logic" should be hidden behind an interface, and all the services, like disk balancer or block deletion, should call it (e.g. something like the sketch below). Then all the locking etc. happens behind the scenes in the same place, where it can be controlled more closely. Fixing this is a large exercise and not something we would want to take on in this PR. There are other places in KeyValueHandler that take the container lock - are we sure they are OK, or is any change needed in them too?
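For illustration, a minimal sketch of what such an interface could look like. The `ContainerOperations` name and both method signatures are hypothetical, not existing Ozone APIs; the import paths follow my reading of the Ozone tree and may need adjusting:

```java
import java.util.function.Consumer;
import java.util.function.Function;
import org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException;
import org.apache.hadoop.ozone.container.common.interfaces.Container;

// Hypothetical facade: every service (disk balancer, block deletion,
// replication) goes through this, so locking and stale-reference
// re-validation live in exactly one place.
public interface ContainerOperations {

  /** Run a read-only action against the current replica of the container. */
  <T> T withReadLock(long containerId, Function<Container<?>, T> action)
      throws StorageContainerException;

  /**
   * Run a mutating action. The implementation acquires the write lock and
   * re-resolves the container from the ContainerSet before invoking the
   * action, so callers can never operate on a stale reference.
   */
  void withWriteLock(long containerId, Consumer<Container<?>> action)
      throws StorageContainerException;
}
```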
I agree with you @sodonnel that all the locking should be hidden behind an interface so that it happens in one place. For the EC reconstruction case, the net result: reconstruction fails just because the shard at DN1 has been marked DELETED, even though it correctly pulled the other indexes from the other DNs. Container C still has shard idx=2 missing, and RM re-queues it on the next monitor cycle. Here we have two options:
Option 1: rely on the RM retry. It is simpler and already works; it just costs one extra cycle and wasted network I/O every time the race window is hit.
Option 2: re-fetch the container from ContainerSet after getting the lock, like the other fixes (see the sketch below).
Please let me know which option you both prefer, @sodonnel @ChenSammi. I just wanted to bring this to your attention and discuss. With Ratis under-replication and DiskBalancer working on the same container there is no issue: replication pulls from any one replica, so either the attempt against the DELETED replica fails and another replica is used, or under-replication acquires the lock before the state changes and succeeds. But for EC, just because of one shard the entire reconstruction has to be repeated, as described above.
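A minimal sketch of Option 2, assuming `ContainerSet.getContainer(id)` returns the currently mapped instance. The helper name `lockAndValidate` is mine, and the import paths follow my reading of the Ozone tree:

```java
import org.apache.hadoop.hdds.protocol.datanode.proto.ContainerProtos.Result;
import org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException;
import org.apache.hadoop.ozone.container.common.impl.ContainerSet;
import org.apache.hadoop.ozone.container.common.interfaces.Container;

final class StaleRefGuard {
  // Option 2: after blocking on the write lock, re-resolve the container
  // from the set and verify the caller still holds the live instance.
  static Container<?> lockAndValidate(ContainerSet containerSet,
      Container<?> container) throws StorageContainerException {
    container.writeLock();
    long id = container.getContainerData().getContainerID();
    if (containerSet.getContainer(id) != container) {
      container.writeUnlock();
      // The mapped instance changed while we waited, e.g. DiskBalancer
      // swapped in a new Container pointing at the destination volume.
      throw new StorageContainerException(
          "Container " + id + " was moved concurrently; retry",
          Result.CONTAINER_NOT_FOUND);
    }
    return container;
  }
}
```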
Second finding: for a QUASI_CLOSED container, if SCM sends a force-close command while DiskBalancer is working on the same container, the same stale-reference problem appears (in the exact timeline, after T8). This is a kind of regression: the force-close command has to be re-sent. I think here as well we need to re-fetch the container after taking the lock, as in the sketch above.
Third finding: markContainerForUnhealthy with DiskBalancer working in parallel. I believe this also needs to re-fetch the container after writeLock(). Given what SCM/RM does in response to the unhealthy report, the outcome is: a healthy replica on Disk2 gets deleted, and container C becomes genuinely under-replicated.
@sodonnel @ChenSammi the above are additional areas that need the same fix. The rest do not have this kind of issue; I have checked the code thoroughly.
For EC reconstruction, I don't think it exports the entire container like this. Rather, it reads the blocks out of the container block by block, using the normal read path through the datanode. So the question is: can the normal read path be impacted by the balancer moving a container from disk 1 to disk 2? Ideally, the DN should be safe to do this using its locking, as it is not ideal that reads would fail randomly. If it's Ratis, it can just try another replica with a small delay. With EC it would fall back to reconstruction reads, which are slower and use more resources.
With this problem it is getting difficult to see whether we have solved all the cases or not. The root cause is poor interface design and, beyond that, inadequate locking. Fixing this is going to be difficult. The locking is only inadequate because the lock is inside the container object, and the container object can be replaced by a new object against the same ID. What are all the scenarios that replace that container object? What if we instead updated the existing object in place under a lock? Would that solve all these edge cases more simply? It sounds like these problems all come from the disk balancer moving and replacing the container object, so if we fix just that, it may be a cleaner solution to this whole problem.
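To make that idea concrete, a rough sketch: instead of replacing the map entry with a new Container, mutate the existing one under its write lock, so threads blocked on that lock wake up holding the still-live object. The setter names here are illustrative assumptions, not necessarily the exact KeyValueContainerData API:

```java
import org.apache.hadoop.ozone.container.common.interfaces.Container;
import org.apache.hadoop.ozone.container.common.volume.HddsVolume;
import org.apache.hadoop.ozone.container.keyvalue.KeyValueContainerData;

final class InPlaceMoveSketch {
  // Hypothetical in-place update: the Container in the map never changes
  // identity, so references held across a lock wait remain valid.
  static void moveInPlace(Container<?> container, HddsVolume destVolume,
      String destMetadataPath, String destChunksPath) {
    container.writeLock();
    try {
      KeyValueContainerData data =
          (KeyValueContainerData) container.getContainerData();
      data.setVolume(destVolume);             // repoint at destination volume
      data.setMetadataPath(destMetadataPath); // and at the new on-disk paths
      data.setChunksPath(destChunksPath);
    } finally {
      container.writeUnlock();
    }
  }
}
```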
Looking at the ContainerSet class, the …
What changes were proposed in this pull request?
On a datanode, some work runs under a container read lock (or otherwise changes which replica directory / in-memory Container is authoritative) while other threads look up a container once and later take a write lock. If the in-memory container mapping or on-disk location changes in between, the second thread can still use a stale Container or KeyValueContainerData reference. That is a classic TOCTOU problem: the map and the caller disagree about where the replica lives. Worst cases include wrong ContainerSet updates, deleting or updating the wrong paths, ghost/orphan replica data on disk that SCM no longer tracks, and block deletion targeting the wrong RocksDB/chunks tree, so pending deletes on the live replica are never applied (space is not reclaimed).
This applies to CLOSED and QUASI_CLOSED containers (and to any path where balancing/replication overlaps lifecycle commands), not to a single state.
This is easiest to explain using DiskBalancer as an example, although it applies to other read/write-lock races as well:
While DiskBalancer moves a container between volumes on the same datanode, it holds the container read lock, copies the data, then calls ContainerSet.updateContainer(...) to point the in-memory map at a new Container instance (new volume/paths). Meanwhile, other threads often look up the container once, then block on writeLock() until the move finishes. After they unblock, they still hold a reference to the old container (source volume). They then run their logic and/or removeContainer(containerId) using stale paths and a stale object identity, as sketched below.
Impacts: as summarized above (wrong ContainerSet updates, deletes against stale paths, orphan replicas SCM no longer tracks, misdirected block deletion).
Suggested Fix:
For the RM side: after acquiring writeLock(), re-fetch the container from the map and compare by identity. If they differ, the container was moved; abort and let the caller retry.
For BlockDeletingService:
For DiskBalancer:
Move the container state check to after readLock() is acquired, so the check cannot act on a stale view of the container (see the sketch below).
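A minimal sketch of the DiskBalancer-side reordering; the eligibility condition and the copy step are simplified stand-ins for whatever the real balancing loop does:

```java
import org.apache.hadoop.hdds.protocol.datanode.proto.ContainerProtos;
import org.apache.hadoop.ozone.container.common.interfaces.Container;

final class BalancerOrderSketch {
  // After the fix: take the read lock first, THEN check the state, so the
  // eligibility decision and the copy see one consistent view. Previously
  // the state was checked before locking, leaving a window for it to change.
  static void balanceIfEligible(Container<?> container) {
    container.readLock();
    try {
      if (container.getContainerData().getState()
          != ContainerProtos.ContainerDataProto.State.CLOSED) {
        return; // no longer eligible for balancing; skip this container
      }
      copyContainerToDestinationVolume(container); // hypothetical copy step
    } finally {
      container.readUnlock();
    }
  }

  static void copyContainerToDestinationVolume(Container<?> c) {
    // placeholder for the actual volume-to-volume copy
  }
}
```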
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-15066
How was this patch tested?
Added unit tests on the DiskBalancer side as an example of the race condition, which can occur with any service's read/write-lock usage.
Test file:
TestDiskBalancerWithConcurrentBackgroundTasks
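For readers who want to see the race without an Ozone cluster, here is a self-contained sketch in plain Java of the scenario the test exercises; `FakeContainer` and the sleep-based scheduling are illustrative simplifications, not the actual test code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Minimal stand-in for the datanode race: one thread swaps the map entry
// while holding the "container" read lock; another thread that looked the
// container up earlier blocks on the write lock and wakes up stale.
public final class StaleReferenceDemo {

  static final class FakeContainer {
    final String volume;
    final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    FakeContainer(String volume) { this.volume = volume; }
  }

  public static void main(String[] args) throws Exception {
    Map<Long, FakeContainer> containerSet = new ConcurrentHashMap<>();
    FakeContainer original = new FakeContainer("disk1");
    containerSet.put(1L, original);

    // "DiskBalancer" takes the read lock first, as it does during a move.
    original.lock.readLock().lock();

    Thread service = new Thread(() -> {
      FakeContainer ref = containerSet.get(1L); // look up once (TOCTOU)
      ref.lock.writeLock().lock();              // blocks until move finishes
      try {
        // The map now points at a new instance; 'ref' is stale.
        System.out.println("service sees volume=" + ref.volume
            + ", stale=" + (containerSet.get(1L) != ref));
      } finally {
        ref.lock.writeLock().unlock();
      }
    });
    service.start();

    Thread.sleep(200); // crude: let the service block on the write lock
    containerSet.put(1L, new FakeContainer("disk2")); // remap the ID
    original.lock.readLock().unlock();                // "move" finishes
    service.join(); // prints: service sees volume=disk1, stale=true
  }
}
```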