
Repair schedules not updated when cluster topology changes (scale-out/scale-in) #1460

@cezarpaulo16

Description


  1. The problem persists until the ecChronos pod is restarted.
  2. Steps to reproduce:
    • Configure ecChronos agent as datacenterAware with 3 Cassandra replicas
    • Configure schedule.yml with an incremental repair schedule for a keyspace/table
    • Create the keyspace and table in Cassandra via CQL
    • Verify that ecctool schedules shows schedules for all 3 nodes
    • Scale-in to 2 replicas — observe that ecctool schedules still shows 3 entries, including a stale entry for the decommissioned node showing ON_TIME 0.00%
    • Scale-out to 4 replicas — observe that ecctool schedules still shows the same 3 entries (1 stale + 2 original), no entries for the 2 new nodes
    • Restart the ecChronos pod — after restart, schedules correctly show 4 entries (the decommissioned node is gone, the 2 new nodes are present)

Detailed description:

What is happening?

ecChronos does not update repair schedules when the cluster topology changes at runtime. This manifests in two ways:

Scale-out (new node joins): ecChronos detects the new node and creates a JobRunTask, but never creates a ScheduledRepairJob:

[s0-admin-0] [DefaultRepairConfigurationProvider] Node added ec92a2b1-38c2-40f0-ae07-4ea1496b9a55
[s0-admin-0] [NodeWorkerManager] Node ec92a2b1-38c2-40f0-ae07-4ea1496b9a55 being added to the threadpool
[s0-admin-0] [NodeWorkerManager] New worker created for Node ec92a2b1-38c2-40f0-ae07-4ea1496b9a55
[s0-admin-0] [ScheduleManagerImpl] JobRunTask created for new node ec92a2b1-38c2-40f0-ae07-4ea1496b9a55

Then repeatedly:

[TaskExecutor-3] [ScheduleManagerImpl] There is no ScheduledJob for this node ec92a2b1-38c2-40f0-ae07-4ea1496b9a55 to run

This was confirmed across two consecutive scale-outs (replica count 3→2→3→4): neither newly added node ever received a schedule.

Scale-in (node decommissioned): ecChronos detects the node removal but the schedule entry persists:

[s0-admin-0] [DefaultRepairConfigurationProvider] Node removed 9d583870-8228-4ee5-b71e-9f8a0835076a
[pool-3-thread-1] [NodeRemovedAction] Node Removed 9d583870-8228-4ee5-b71e-9f8a0835076a

The ecctool schedules output continues to show the decommissioned node with ON_TIME 0.00% indefinitely.

After restarting the ecChronos pod, schedules are correct: the decommissioned node is gone and all active nodes have schedules.

What did you expect to happen?

When a node joins the cluster, ecChronos should create repair schedules for it for all existing configured tables. When a node is decommissioned, ecChronos should remove its schedule entries.

What have you tried?

  • Waited over 20 minutes after scale-out — no schedule created for new nodes
  • Waited over 20 minutes after scale-in — stale schedule not removed
  • Restarted the ecChronos pod — schedules immediately correct (4 entries for 4 active nodes, decommissioned node gone)

What version of ecChronos are you using?

Latest agent/master branch.

Was the problem detected during an upgrade or downgrade procedure?

No.

What do I think is the issue?

Looking at the code, I think the issue might be related to how new nodes are handled in NodeWorkerManager.addNewNodeToThreadPool(). It creates a new NodeWorker with an empty event queue. The NodeWorker only creates schedules when it receives KeyspaceCreatedEvent or TableCreatedEvent, but since the tables already exist, no events are sent to the new worker.

During startup, DefaultRepairConfigurationProvider.setupConfiguration() iterates over all existing keyspaces/tables and calls putConfigurations for each node. But when a new node is added at runtime via onAdd(), it seems like this initial configuration step is not triggered for the new node. I could be wrong about the exact cause, but the behavior is consistent with the new node's NodeWorker never receiving events for existing tables.
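If that's the cause, a possible fix would be to replay the existing keyspaces/tables into the newly created worker at add time. Here is a minimal, self-contained model of the idea; all class and method names below are hypothetical stand-ins, not the real ecChronos API:

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;
import java.util.UUID;

// Minimal model of the suspected gap: a worker created at runtime starts
// with an empty event queue, so tables that already exist never reach it
// unless the manager replays them. Names here are illustrative only.
public class ReplaySketch {

    record TableCreatedEvent(String keyspace, String table) {}

    static class NodeWorker {
        final Queue<TableCreatedEvent> events = new ArrayDeque<>();

        void onEvent(TableCreatedEvent event) {
            // In ecChronos this is where a ScheduledRepairJob would be created.
            events.add(event);
        }

        int scheduledTables() {
            return events.size();
        }
    }

    // Suspected current behavior: the worker is created with an empty queue.
    static NodeWorker addNodeCurrent(UUID nodeId) {
        return new NodeWorker();
    }

    // Proposed behavior: replay events for tables that already exist,
    // mirroring what setupConfiguration() does for all nodes at startup.
    static NodeWorker addNodeWithReplay(UUID nodeId, List<TableCreatedEvent> existingTables) {
        NodeWorker worker = new NodeWorker();
        existingTables.forEach(worker::onEvent);
        return worker;
    }

    public static void main(String[] args) {
        List<TableCreatedEvent> existing = List.of(new TableCreatedEvent("ks1", "tb1"));
        System.out.println("without replay: " + addNodeCurrent(UUID.randomUUID()).scheduledTables() + " schedules");
        System.out.println("with replay:    " + addNodeWithReplay(UUID.randomUUID(), existing).scheduledTables() + " schedules");
    }
}
```

With the replay step, the new worker ends up with one schedule per existing table, which matches what a restart produces today.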

For scale-in, it seems like removeNode() in NodeWorkerManager stops the worker thread but the schedule entries in RepairSchedulerImpl might not be cleaned up.
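The scale-in symptom can be modeled the same way: stopping a node's worker without also dropping its schedule entries leaves the stale row that `ecctool schedules` keeps showing. Again, a minimal sketch with hypothetical names, not the real ecChronos classes:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Minimal model of the scale-in symptom: removeNode() stops the worker
// but the per-node schedule entries survive. Names are illustrative only.
public class CleanupSketch {

    final Map<UUID, List<String>> schedulesByNode = new HashMap<>();
    final Map<UUID, Boolean> workerRunning = new HashMap<>();

    void addNode(UUID nodeId, List<String> tables) {
        workerRunning.put(nodeId, true);
        schedulesByNode.put(nodeId, tables);
    }

    // Suspected current behavior: only the worker thread is stopped.
    void removeNodeCurrent(UUID nodeId) {
        workerRunning.remove(nodeId);
    }

    // Proposed behavior: also drop the node's schedule entries,
    // mirroring the cleanup RepairSchedulerImpl would need to do.
    void removeNodeWithCleanup(UUID nodeId) {
        workerRunning.remove(nodeId);
        schedulesByNode.remove(nodeId);
    }

    public static void main(String[] args) {
        CleanupSketch scheduler = new CleanupSketch();
        UUID node = UUID.randomUUID();
        scheduler.addNode(node, List.of("ks1.tb1"));
        scheduler.removeNodeCurrent(node);
        System.out.println("stale entry after removeNodeCurrent: " + scheduler.schedulesByNode.containsKey(node));
    }
}
```

In the "current" path the decommissioned node's entry stays in the map indefinitely, which matches the ON_TIME 0.00% ghost row observed above.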
