
ecChronos Does Not Recover When All Cassandra Nodes in a DC Restart Simultaneously #1443

@cezarpaulo16

Description

Logs shared with Victor

  1. Does the problem persist? Yes. ecChronos never recovers without a manual pod restart.

  2. How to reproduce:

    • Deploy ecChronos with 1 DC, 3 replicas, TLS enabled
    • Verify ecChronos is working (ecctool repairs returns results)
    • Delete all 3 Cassandra pods simultaneously (restart Cassandra instances)
    • Wait for all Cassandra pods to come back and show UN (Up/Normal)
    • ecChronos never recovers — ecctool repairs returns HTTP 500, ecctool run-repair gets stuck

Note: deleting pods one at a time or 2 at a time works fine — ecChronos recovers. The issue only occurs when all nodes in the DC go down simultaneously.

Detailed description

What is happening?
When all Cassandra pods restart simultaneously, the CQL driver loses every connection and starts failing with NoNodeAvailableException. The pods come back with new IPs, but the driver keeps retrying the old cached IPs. Since there is no working connection, the driver can't receive topology change events to learn the new IPs, so it never recovers.

Logs show:
NoNodeAvailableException: No node was available to execute the query
Connection refused: 192-168-8-10.wcdbcd.epsdaua.svc.cluster.local/192.168.8.10:9042
Node switched state to DOWN
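The stale-IP failure mode described above can be illustrated with plain JDK classes (this is an illustration, not ecChronos or driver code): a resolved InetSocketAddress pins the IP at creation time, exactly like the driver's cached contact points, whereas an unresolved address keeps the hostname so callers can re-resolve it on each attempt.

```java
import java.net.InetSocketAddress;

// Illustration only: why caching a *resolved* address is fragile.
public final class StaleIpDemo {
    public static void main(String[] args) {
        // Resolved eagerly: the IP is frozen at construction time,
        // even if DNS later points the hostname at a new pod IP.
        InetSocketAddress resolved = new InetSocketAddress("localhost", 9042);
        System.out.println("resolved eagerly: " + !resolved.isUnresolved());

        // Unresolved: only the hostname is kept, so a reconnect loop
        // can perform a fresh DNS lookup on every attempt.
        InetSocketAddress deferred =
                InetSocketAddress.createUnresolved("localhost", 9042);
        System.out.println("hostname kept: " + deferred.isUnresolved());
    }
}
```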

What did you expect to happen?
ecChronos should recover automatically after all Cassandra nodes come back.

What have you tried?

  • Verified all Cassandra nodes are UN and healthy
  • Verified ecctool status returns "ecChronos is running" (the process is alive, just the CQL session is broken)
  • Only a manual ecChronos pod restart resolves the issue
  • Deleting 1 or 2 pods at a time works fine — ecChronos recovers because at least one connection stays alive to receive topology events
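Since ecctool status only confirms the process is alive, a deeper health check would have to probe the CQL session itself. A minimal sketch of such a watchdog, assuming invented names (`SessionWatchdog`, `rebuildSession`) and abstracting the real probe (e.g. a `SELECT release_version FROM system.local` against the session) behind a `BooleanSupplier`:

```java
import java.util.function.BooleanSupplier;

// Sketch (not ecChronos code): treat N consecutive probe failures
// as a dead CQL session and trigger a session rebuild.
public final class SessionWatchdog {
    private final BooleanSupplier probe;     // true = CQL session healthy
    private final Runnable rebuildSession;   // e.g. close + recreate the session
    private final int failureThreshold;
    private int consecutiveFailures = 0;

    public SessionWatchdog(BooleanSupplier probe, Runnable rebuildSession,
                           int failureThreshold) {
        this.probe = probe;
        this.rebuildSession = rebuildSession;
        this.failureThreshold = failureThreshold;
    }

    /** Call periodically, e.g. from a scheduled executor. */
    public void check() {
        if (probe.getAsBoolean()) {
            consecutiveFailures = 0;         // healthy again, reset the counter
            return;
        }
        if (++consecutiveFailures >= failureThreshold) {
            rebuildSession.run();            // session is considered dead
            consecutiveFailures = 0;
        }
    }
}
```

Deleting 1 or 2 pods never trips such a check because the surviving connection keeps the probe succeeding; only the all-nodes-down case would reach the threshold.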

What version of ecChronos are you using?
1.0.0-beta3

Was the problem detected during an upgrade or downgrade procedure?
No.

Have you checked the ecChronos documents?
Yes.

What do YOU think is the issue?
The Cassandra Java driver caches resolved IPs for the contact points. When all nodes go down and come back with new IPs, the driver has no working connection to receive gossip/topology events, and it doesn't re-resolve DNS for the contact points. This creates a deadlock: it needs a connection to learn the new IPs, but it can't connect because it only knows the old ones.

A possible fix would be a reconnection strategy that re-resolves the contact points' DNS names when this state is detected.
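One way such a strategy could look, as a sketch only (the names `SessionFactory` and `reconnect` are invented, not ecChronos or driver API): a backoff loop that performs a fresh DNS lookup of the contact-point hostnames before every attempt, so new pod IPs are picked up as soon as they appear. If the driver version in use supports it, the DataStax Java driver 4.x also offers an `advanced.resolve-contact-points = false` setting intended to keep contact points unresolved for the same reason.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

// Sketch (not ecChronos code): reconnect with exponential backoff,
// re-resolving contact-point hostnames from DNS on every attempt.
public final class ReResolvingReconnector {

    /** Builds a session from freshly resolved contact points; invented name. */
    @FunctionalInterface
    public interface SessionFactory<T> {
        T connect(List<InetSocketAddress> contactPoints) throws Exception;
    }

    public static <T> T reconnect(List<String> hostnames, int port,
                                  int maxAttempts, Duration baseDelay,
                                  SessionFactory<T> factory) throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            // Re-resolve DNS on every attempt instead of reusing cached IPs.
            List<InetSocketAddress> points = new ArrayList<>();
            for (String host : hostnames) {
                try {
                    for (InetAddress addr : InetAddress.getAllByName(host)) {
                        points.add(new InetSocketAddress(addr, port));
                    }
                } catch (UnknownHostException e) {
                    // Hostname not resolvable yet (pod still starting); skip it.
                }
            }
            if (!points.isEmpty()) {
                try {
                    return factory.connect(points);
                } catch (Exception e) {
                    last = e;                 // connection refused etc.; retry
                }
            }
            // Exponential backoff, capped at one minute.
            long delayMs = Math.min(baseDelay.toMillis() << attempt, 60_000L);
            Thread.sleep(delayMs);
        }
        throw new IllegalStateException(
                "Could not reconnect after " + maxAttempts + " attempts", last);
    }
}
```

This breaks the deadlock because the loop no longer depends on an existing connection to discover the new IPs; DNS is the recovery channel instead of gossip.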

Anything else worth noting?

Metadata

Labels: bug (Something isn't working)