
ecChronos Does Not Recover When All Cassandra Nodes in a DC Restart Simultaneously #1443

@cezarpaulo16

Description

Logs shared with Victor

  1. Does the problem persist? Yes. ecChronos never recovers without a manual pod restart.

  2. How to reproduce:

    • Deploy ecChronos with 1 DC, 3 replicas, TLS enabled
    • Verify ecChronos is working (ecctool repairs returns results)
    • Delete all 3 Cassandra pods simultaneously (restart Cassandra instances)
    • Wait for all Cassandra pods to come back and show UN (Up/Normal)
    • ecChronos never recovers — ecctool repairs returns HTTP 500, ecctool run-repair gets stuck

Note: deleting pods one at a time or 2 at a time works fine — ecChronos recovers. The issue only occurs when all nodes in the DC go down simultaneously.

Detailed description

What is happening?
When all Cassandra pods restart simultaneously, the CQL driver loses every connection and starts failing with NoNodeAvailableException. The pods come back with new IPs, but the driver keeps retrying the old cached IPs. Since there is no working connection, the driver can't receive topology change events to learn the new IPs, so it never recovers.

Logs show:
NoNodeAvailableException: No node was available to execute the query
Connection refused: 192-168-8-10.wcdbcd.epsdaua.svc.cluster.local/192.168.8.10:9042
Node switched state to DOWN
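The stale-IP failure mode described above can be illustrated with plain JDK classes (this is an illustration, not ecChronos or driver code): a resolved InetSocketAddress pins the IP at creation time, exactly like the driver's cached contact points, whereas an unresolved address keeps the hostname so callers can re-resolve it on each attempt.

```java
import java.net.InetSocketAddress;

// Illustration only: why caching a *resolved* address is fragile.
public final class StaleIpDemo {
    public static void main(String[] args) {
        // Resolved eagerly: the IP is frozen at construction time,
        // even if DNS later points the hostname at a new pod IP.
        InetSocketAddress resolved = new InetSocketAddress("localhost", 9042);
        System.out.println("resolved eagerly: " + !resolved.isUnresolved());

        // Unresolved: only the hostname is kept, so a reconnect loop
        // can perform a fresh DNS lookup on every attempt.
        InetSocketAddress deferred =
                InetSocketAddress.createUnresolved("localhost", 9042);
        System.out.println("hostname kept: " + deferred.isUnresolved());
    }
}
```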

What did you expect to happen?
ecChronos should recover automatically after all Cassandra nodes come back.

What have you tried?

  • Verified all Cassandra nodes are UN and healthy
  • Verified ecctool status returns "ecChronos is running" (the process is alive, just the CQL session is broken)
  • Only a manual ecChronos pod restart resolves the issue
  • Deleting 1 or 2 pods at a time works fine — ecChronos recovers because at least one connection stays alive to receive topology events
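Since ecctool status only confirms the process is alive, a deeper health check would have to probe the CQL session itself. A minimal sketch of such a watchdog, assuming invented names (`SessionWatchdog`, `rebuildSession`) and abstracting the real probe (e.g. a `SELECT release_version FROM system.local` against the session) behind a `BooleanSupplier`:

```java
import java.util.function.BooleanSupplier;

// Sketch (not ecChronos code): treat N consecutive probe failures
// as a dead CQL session and trigger a session rebuild.
public final class SessionWatchdog {
    private final BooleanSupplier probe;     // true = CQL session healthy
    private final Runnable rebuildSession;   // e.g. close + recreate the session
    private final int failureThreshold;
    private int consecutiveFailures = 0;

    public SessionWatchdog(BooleanSupplier probe, Runnable rebuildSession,
                           int failureThreshold) {
        this.probe = probe;
        this.rebuildSession = rebuildSession;
        this.failureThreshold = failureThreshold;
    }

    /** Call periodically, e.g. from a scheduled executor. */
    public void check() {
        if (probe.getAsBoolean()) {
            consecutiveFailures = 0;         // healthy again, reset the counter
            return;
        }
        if (++consecutiveFailures >= failureThreshold) {
            rebuildSession.run();            // session is considered dead
            consecutiveFailures = 0;
        }
    }
}
```

Deleting 1 or 2 pods never trips such a check because the surviving connection keeps the probe succeeding; only the all-nodes-down case would reach the threshold.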

What version of ecChronos are you using?
1.0.0-beta3

Was the problem detected during an upgrade or downgrade procedure?
No.

Have you checked the ecChronos documents?
Yes.

What do YOU think is the issue?
The Cassandra Java driver caches resolved IPs for the contact points. When all nodes go down and come back with new IPs, the driver has no working connection to receive gossip/topology events, and it doesn't re-resolve DNS for the contact points. This creates a deadlock: it needs a connection to learn the new IPs, but it can't connect because it only knows the old ones.

A possible fix would be a reconnection strategy that re-resolves the contact points' DNS names when this state is detected.
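One way such a strategy could look, as a sketch only (the names `SessionFactory` and `reconnect` are invented, not ecChronos or driver API): a backoff loop that performs a fresh DNS lookup of the contact-point hostnames before every attempt, so new pod IPs are picked up as soon as they appear. If the driver version in use supports it, the DataStax Java driver 4.x also offers an `advanced.resolve-contact-points = false` setting intended to keep contact points unresolved for the same reason.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

// Sketch (not ecChronos code): reconnect with exponential backoff,
// re-resolving contact-point hostnames from DNS on every attempt.
public final class ReResolvingReconnector {

    /** Builds a session from freshly resolved contact points; invented name. */
    @FunctionalInterface
    public interface SessionFactory<T> {
        T connect(List<InetSocketAddress> contactPoints) throws Exception;
    }

    public static <T> T reconnect(List<String> hostnames, int port,
                                  int maxAttempts, Duration baseDelay,
                                  SessionFactory<T> factory) throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            // Re-resolve DNS on every attempt instead of reusing cached IPs.
            List<InetSocketAddress> points = new ArrayList<>();
            for (String host : hostnames) {
                try {
                    for (InetAddress addr : InetAddress.getAllByName(host)) {
                        points.add(new InetSocketAddress(addr, port));
                    }
                } catch (UnknownHostException e) {
                    // Hostname not resolvable yet (pod still starting); skip it.
                }
            }
            if (!points.isEmpty()) {
                try {
                    return factory.connect(points);
                } catch (Exception e) {
                    last = e;                 // connection refused etc.; retry
                }
            }
            // Exponential backoff, capped at one minute.
            long delayMs = Math.min(baseDelay.toMillis() << attempt, 60_000L);
            Thread.sleep(delayMs);
        }
        throw new IllegalStateException(
                "Could not reconnect after " + maxAttempts + " attempts", last);
    }
}
```

This breaks the deadlock because the loop no longer depends on an existing connection to discover the new IPs; DNS is the recovery channel instead of gossip.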

Anything else worth noting?

Metadata

Labels: bug (Something isn't working)