@mohit10011999
Contributor

Description

When CCR is stopped or paused, all index and shard replication tasks should be stopped. But if the stop/pause does not complete successfully, some replication tasks might keep running. This can cause conflicts when we restart/resume replication.

We have taken the following actions to rectify this:

  1. The Stop/Pause API should be idempotent: it should try to remove all retention leases and any stale persistent tasks rather than bailing out when replication is already in the stopped/paused state.
  2. When CCR is stopped/paused, the state should be updated only after all the tasks have been stopped.
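
As a rough illustration of the two points above, here is a minimal sketch over an in-memory model. `FollowerIndex`, `ReplicationState`, `stopReplication`, and the `cancelTask` callback are hypothetical names used only for illustration; they are not the plugin's actual classes or APIs.

```kotlin
// Hypothetical in-memory model; none of these names come from the plugin.
enum class ReplicationState { RUNNING, PAUSED, STOPPED }

class FollowerIndex(
    var state: ReplicationState,
    val retentionLeases: MutableSet<String> = mutableSetOf(),
    val persistentTasks: MutableSet<String> = mutableSetOf()
)

// Idempotent stop: always attempt the full cleanup, and record the new
// state only after every task has actually been stopped.
// `cancelTask` is an assumed callback that returns true when the task
// with the given id was cancelled successfully.
fun stopReplication(index: FollowerIndex, cancelTask: (String) -> Boolean): Boolean {
    // No early return when the state is already STOPPED: a previous,
    // partially failed stop may have left leases or tasks behind.
    index.retentionLeases.clear()

    // Drop every task whose cancellation succeeded; failures stay in the set.
    index.persistentTasks.removeAll { taskId -> cancelTask(taskId) }

    // If anything is still running, leave the state untouched so that a
    // retry runs the same cleanup again.
    if (index.persistentTasks.isNotEmpty()) return false

    index.state = ReplicationState.STOPPED
    return true
}
```

In this model a call that fails to cancel one task returns `false` and leaves the state unchanged, and a later call finishes the cleanup and only then flips the state. Calling it again on an already stopped index is a no-op that still succeeds.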

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Member

@ankitkala ankitkala left a comment

Two major pieces of feedback:

  • The PR has a lot of additional code changes which don't seem to be related to the actual change. Can you remove all the unnecessary changes so it's easier to review?
  • Stale replication tasks are a problem when you're trying to create the task again (start or resume). I think we should be able to simplify by just handling this during task creation here (need to verify though, with a stack trace from the last few occurrences of the issue).
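
A rough sketch of the second point, assuming a simplified in-memory registry. `TaskRegistry` and `startIndexTask` are hypothetical stand-ins for illustration, not the persistent task service or any existing plugin code, and whether this is sufficient still depends on the stack traces mentioned above.

```kotlin
// Hypothetical stand-in for a persistent-task registry keyed by task id.
class TaskRegistry {
    private val tasks = mutableMapOf<String, Int>() // task id -> generation
    private var nextGeneration = 1

    fun exists(taskId: String): Boolean = taskId in tasks

    fun remove(taskId: String) {
        tasks.remove(taskId)
    }

    // Fails if a task with the same id is already registered, which is
    // the conflict a stale task causes on start/resume in this model.
    fun create(taskId: String): Int {
        check(taskId !in tasks) { "task [$taskId] already exists" }
        val generation = nextGeneration
        nextGeneration += 1
        tasks[taskId] = generation
        return generation
    }
}

// Start/resume path: clear any stale task with the same id first,
// then create the fresh one.
fun startIndexTask(registry: TaskRegistry, taskId: String): Int {
    if (registry.exists(taskId)) {
        registry.remove(taskId)
    }
    return registry.create(taskId)
}
```

With this shape the stop flow would not need to know about stale tasks at all; the cleanup happens at the one place where the conflict would otherwise surface.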

/**
* Handles case where no replication state exists but stale artifacts might remain.
*/
private suspend fun handleMissingReplicationState(indexName: String) {
Member

Why do you want to handle the stale tasks during the stop flow? Shouldn't we only do this during start and resume?

Contributor Author

The aim is to prevent stale tasks from occurring, on a best-effort basis, so that the STOP/PAUSE API can be called multiple times (making it idempotent) and each call executes the full cleanup workflow:

Goal: Remove/suspend replication artifacts
Stale artifacts: Leftovers from previous incomplete operations
User intent: "Make sure replication is stopped/paused"
Handling stale artifacts helps achieve the goal

@mohit10011999 mohit10011999 force-pushed the stalePersistentTasks branch 17 times, most recently from 96341a5 to 2351fa1 Compare January 28, 2026 16:43
@mohit10011999 mohit10011999 force-pushed the stalePersistentTasks branch 7 times, most recently from 43629d5 to e904ee1 Compare January 29, 2026 04:44
This reverts commit 2351fa1.

Signed-off-by: Mohit Kumar <[email protected]>
@mohit10011999 mohit10011999 force-pushed the stalePersistentTasks branch 8 times, most recently from f407086 to 759e8dc Compare January 29, 2026 09:47
@mohit10011999 mohit10011999 force-pushed the stalePersistentTasks branch 5 times, most recently from 9a48b36 to 1cd2550 Compare January 29, 2026 16:18

3 participants