Clear Stale Persistent Tasks in Stop/Pause API #1629

mohit10011999 · 2026-01-25T16:57:24Z

Description

When CCR is stopped or paused, all index and shard replication tasks should be stopped. But if the stop/pause does not complete successfully, some replication tasks might keep running. This can cause conflicts when replication is restarted or resumed.

This PR makes the following changes to address this:

The Stop/Pause API is now idempotent: if replication is already in the stopped/paused state, it still tries to remove all retention leases and any stale persistent tasks instead of bailing out.
When CCR is stopped/paused, the state is updated only after all the tasks have been stopped.
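
The two changes above can be illustrated with a minimal in-memory sketch. It is not the plugin's implementation: `ReplicationState`, `FollowerIndex`, `stopReplication` and the `cancelTask` callback are all hypothetical names invented for illustration. It only shows the intended behaviour, namely that cleanup runs regardless of the recorded state and that the state changes only once no tasks remain.

```kotlin
// Hypothetical model for illustration only; none of these names come from the plugin.
enum class ReplicationState { RUNNING, PAUSED, STOPPED }

class FollowerIndex(
    var state: ReplicationState,
    val persistentTasks: MutableSet<String>,
    val retentionLeases: MutableSet<String>
)

// Idempotent stop: cleanup is attempted even when the state is already STOPPED,
// and the state is updated only after every task has been cancelled.
fun stopReplication(index: FollowerIndex, cancelTask: (String) -> Boolean): Boolean {
    // Drop each task whose cancellation succeeds; tasks that fail to cancel stay in the set.
    index.persistentTasks.removeAll { cancelTask(it) }

    // Best-effort lease removal; repeating it on an empty set is a no-op.
    index.retentionLeases.clear()

    // Only report success (and update the state) once no tasks are left.
    if (index.persistentTasks.isEmpty()) {
        index.state = ReplicationState.STOPPED
        return true
    }
    return false
}

fun main() {
    // The state already says STOPPED, but a stale task and lease were left behind.
    val index = FollowerIndex(
        ReplicationState.STOPPED,
        mutableSetOf("shard-task-0"),
        mutableSetOf("lease-0")
    )
    println(stopReplication(index) { true }) // first call cleans up the leftovers
    println(stopReplication(index) { true }) // second call finds nothing to do
}
```

Calling `stopReplication` a second time is harmless because every step tolerates the artifact already being gone, which is the property the description refers to as idempotency.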

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

New functionality includes testing.
New functionality has been documented.
API changes companion pull request created.
Commits are signed per the DCO using --signoff.
Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

ankitkala

Two major pieces of feedback:

The PR has a lot of additional code changes that don't seem to be related to the actual change. Can you remove all the unnecessary changes so it's easier to review?
Stale replication tasks are a problem when you're trying to create the task again (start or resume). I think we should be able to simplify by just handling this during task creation here (need to verify though, with a stack trace from the last few occurrences of the issue).
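
The alternative raised in the second point, clearing a leftover task at creation time, might look roughly like the sketch below. `TaskRegistry` and `startReplicationTask` are hypothetical stand-ins invented here, as is the task id string. This is not the plugin's persistent task API, only an illustration of "remove if present, then create".

```kotlin
// Hypothetical stand-in for a persistent task registry; not the plugin's API.
class TaskRegistry {
    private val tasks = mutableMapOf<String, String>() // task id -> owner tag

    fun create(id: String, owner: String) {
        // Creating a task whose id is already registered is the conflict described above.
        require(id !in tasks) { "task $id already exists" }
        tasks[id] = owner
    }

    fun remove(id: String): Boolean = tasks.remove(id) != null

    fun owner(id: String): String? = tasks[id]
}

// Start/resume path: clear any leftover task with the same id, then create a fresh one.
// Returns true when a stale task had to be removed first.
fun startReplicationTask(registry: TaskRegistry, id: String, owner: String): Boolean {
    val removedStale = registry.remove(id)
    registry.create(id, owner)
    return removedStale
}

fun main() {
    val registry = TaskRegistry()
    val id = "replication:index:follower-01" // made-up task id
    registry.create(id, "previous-run")      // simulate a task left behind by a failed stop
    val removed = startReplicationTask(registry, id, "current-run")
    println("removed stale task: $removed, owner: ${registry.owner(id)}")
}
```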

src/main/kotlin/org/opensearch/replication/action/index/TransportReplicateIndexAction.kt

ankitkala · 2026-01-27T03:55:38Z

src/main/kotlin/org/opensearch/replication/action/stop/TransportStopIndexReplicationAction.kt

+    /**
+     * Handles case where no replication state exists but stale artifacts might remain.
+     */
+    private suspend fun handleMissingReplicationState(indexName: String) {


Why do you want to handle the stale tasks during the stop flow? Shouldn't we only do this during start and resume?

The aim is to prevent stale tasks on a best-effort basis. The Stop/Pause API can then be called multiple times (it is idempotent), and each call runs the full cleanup workflow:

Goal: Remove/suspend replication artifacts
Stale artifacts: Leftovers from previous incomplete operations
User intent: "Make sure replication is stopped/paused"
Handling stale artifacts helps achieve the goal

src/main/kotlin/org/opensearch/replication/action/pause/TransportPauseIndexReplicationAction.kt

mohit10011999 requested review from ankitkala, gbbafna, krishna-ggk, monusingh-1, saikaranam-amazon and soosinha as code owners January 25, 2026 16:57

ankitkala reviewed Jan 27, 2026

mohit10011999 force-pushed the stalePersistentTasks branch 17 times, most recently from 96341a5 to 2351fa1 Compare January 28, 2026 16:43

mohitamg added 6 commits January 28, 2026 23:00

Clear Stale Persistent Tasks in Stop/Pause API

ea1d6a0

Signed-off-by: Mohit Kumar <[email protected]>

Clear Stale Persistent Tasks in Stop/Pause API

9b6d0ae

Signed-off-by: Mohit Kumar <[email protected]>

Get metadata before cleanup so retention lease removal

88a6911

Signed-off-by: Mohit Kumar <[email protected]>

When metadata is null, the retention lease removal is simply skipped

57ed386

Signed-off-by: Mohit Kumar <[email protected]>

Wrapped deleteIndexReplicationMetadata in try catch block

d5b2364

Signed-off-by: Mohit Kumar <[email protected]>

Changed error message to match the integ test

8280875

Signed-off-by: Mohit Kumar <[email protected]>

mohitamg added 6 commits January 28, 2026 23:00

Fixed exceptions for integ test

8cb071d

Signed-off-by: Mohit Kumar <[email protected]>

Fixed exceptions for integ test1

e07603c

Signed-off-by: Mohit Kumar <[email protected]>

Fixing integ test for follower stats to wait for tasks to cancel

6c42d77

Signed-off-by: Mohit Kumar <[email protected]>

Revert "Fixing integ test for follower stats to wait for tasks to can…

08bcb6a

…cel" This reverts commit b85dc35. Signed-off-by: Mohit Kumar <[email protected]>

Fixing integ test for follower stats response

07806b3

Signed-off-by: Mohit Kumar <[email protected]>

Fixing sequence of operations

4557a81

Signed-off-by: Mohit Kumar <[email protected]>

mohit10011999 force-pushed the stalePersistentTasks branch 7 times, most recently from 43629d5 to e904ee1 Compare January 29, 2026 04:44

Revert "Fixing integ test for follower stats response"

0c23246

This reverts commit 2351fa1. Signed-off-by: Mohit Kumar <[email protected]>

mohit10011999 force-pushed the stalePersistentTasks branch 8 times, most recently from f407086 to 759e8dc Compare January 29, 2026 09:47

Fixing IT which are not compatible with idempotency

d6ba04f

Signed-off-by: Mohit Kumar <[email protected]>

mohit10011999 force-pushed the stalePersistentTasks branch 5 times, most recently from 9a48b36 to 1cd2550 Compare January 29, 2026 16:18

Fixing retention lease cleanup retry, stale tasks and auto-delete sce…

2b60eb0

…narios Signed-off-by: Mohit Kumar <[email protected]>

mohit10011999 force-pushed the stalePersistentTasks branch from 1cd2550 to 2b60eb0 Compare January 30, 2026 05:06
