Description
Overview of the Issue
Problem
Related to #18528, this issue covers what happens after a successful EmergencyReparentShard when the old primary's vttablet is still running but its local mysqld is down (or slow to start).
When a PRIMARY fails, VTOrc triggers an ERS and a new primary is elected. The old primary's vttablet — either still running or restarted automatically (e.g. Kubernetes pod restart) — receives a SetReplicationSource RPC to self-demote. The RPC fails immediately at convertBoolToSemiSyncAction because it calls SemiSyncExtensionLoaded which queries MySQL — and MySQL is down (errno 2002).
Because the error occurs before setReplicationSourceLocked is called, ChangeTabletType is never invoked and the topo is never updated. The tablet remains PRIMARY in topo despite another tablet being promoted.
Reproduced locally by:
- Starting a cluster via `examples/local`
- Stopping VTOrc
- Killing `mysqld_safe` + `mysqld` on the primary (keeping `vttablet` running)
- Calling `SetReplicationSource` via `grpcurl` on the old primary
- Observing the tablet stays `PRIMARY` in topo
Root Cause
SetReplicationSource calls convertBoolToSemiSyncAction before setReplicationSourceLocked. convertBoolToSemiSyncAction calls SemiSyncExtensionLoaded which queries MySQL. When MySQL is down, this fails with errno 2002 and the error is returned immediately — ChangeTabletType and the topo update are never reached.
Even if we got past that, ChangeTabletType → updateTypeAndPublish → updateLocked calls SetServingType which tries to connect to MySQL via serveNonPrimary → connect() → se.EnsureConnectionAndDB(). This could drain the context timeout, leaving no time for publishStateLocked to update topo.
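To make the failing order of operations concrete, here is a minimal Go sketch (names and signatures are illustrative stand-ins, not the actual Vitess code): the semi-sync lookup runs before any topo work, so a dead mysqld aborts the RPC before `ChangeTabletType` is ever reached.

```go
package main

import (
	"errors"
	"fmt"
)

// errMySQLDown stands in for the errno 2002 failure returned when the
// DBA connection cannot reach mysqld over its unix socket.
var errMySQLDown = errors.New("errno 2002: can't connect to local MySQL server through socket")

// semiSyncAction mimics convertBoolToSemiSyncAction: it must query MySQL
// (SemiSyncExtensionLoaded) to decide on an action, so it fails when
// MySQL is down.
func semiSyncAction(mysqlUp bool) (string, error) {
	if !mysqlUp {
		return "", errMySQLDown
	}
	return "ACTION_SET", nil
}

// setReplicationSource mirrors the failing order of operations: the
// semi-sync lookup runs first, so the topo update is never reached.
func setReplicationSource(mysqlUp bool, topoType *string) error {
	if _, err := semiSyncAction(mysqlUp); err != nil {
		return err // early return: topo still says PRIMARY
	}
	*topoType = "REPLICA" // stands in for ChangeTabletType + topo publish
	return nil
}

func main() {
	topoType := "PRIMARY"
	err := setReplicationSource(false, &topoType)
	fmt.Println(err != nil, topoType) // true PRIMARY
}
```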
Solution
- Add `IsMySQLLocal()` and `IsLocalMySQLDown()` to the `MysqlDaemon` interface
  - `IsMySQLLocal()` returns `true` when the DBA connection uses a unix socket
  - `IsLocalMySQLDown()` probes MySQL via a DBA connection and uses heuristics to determine whether MySQL is actually down vs. hitting some other transient error:
    - `CRConnectionError` (errno 2002, unix socket) is the signal
    - "Too many connections" proves MySQL is alive
    - File-descriptor exhaustion is detected and excluded (a client-side problem, not MySQL's)
    - Socket file existence is validated
- Skip `convertBoolToSemiSyncAction` in `SetReplicationSource` when MySQL is down
  - Use `SemiSyncActionNone` instead of querying MySQL for semi-sync extension status
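A minimal sketch of that guard, with illustrative names (the real `convertBoolToSemiSyncAction` has a different signature; the point is only the early fallback to `SemiSyncActionNone` so the RPC can proceed to the topo update):

```go
package main

import "fmt"

type semiSyncAction int

const (
	semiSyncActionNone semiSyncAction = iota
	semiSyncActionSet
	semiSyncActionUnset
)

// chooseSemiSyncAction sketches the proposed behavior: when MySQL is
// known to be down, skip the extension probe entirely and return
// SemiSyncActionNone instead of an error.
func chooseSemiSyncAction(semiSync, mysqlDown bool, extensionLoaded func() (bool, error)) (semiSyncAction, error) {
	if mysqlDown {
		return semiSyncActionNone, nil // don't query a dead server
	}
	loaded, err := extensionLoaded() // stands in for SemiSyncExtensionLoaded
	if err != nil {
		return semiSyncActionNone, err
	}
	if !loaded {
		return semiSyncActionNone, nil
	}
	if semiSync {
		return semiSyncActionSet, nil
	}
	return semiSyncActionUnset, nil
}

func main() {
	// MySQL down: the probe is never called (nil is safe here).
	a, err := chooseSemiSyncAction(true, true, nil)
	fmt.Println(a == semiSyncActionNone, err) // true <nil>
}
```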
- Skip `updateLocked` in `updateTypeAndPublish` when MySQL is down
  - `updateLocked` transitions serving state (query service, VREngine, etc.), all of which require MySQL
  - `publishStateLocked` still runs to update topo
  - `retryTransition` handles reconnecting when MySQL comes back
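The split above can be sketched as follows, with hypothetical stand-in types (the real state manager is far richer; this only shows the ordering: skip the serving transition, still publish to topo, and leave a retry pending):

```go
package main

import "fmt"

// stateManager is an illustrative stand-in for vttablet's state manager.
type stateManager struct {
	serving      bool
	published    string
	retryPending bool
}

// updateLocked stands in for the serving-state transition, which needs a
// live MySQL (query service, VREngine, and friends all connect to it).
func (sm *stateManager) updateLocked(tabletType string) {
	sm.serving = tabletType == "PRIMARY"
}

// publishStateLocked stands in for the topo update, which needs no MySQL.
func (sm *stateManager) publishStateLocked(tabletType string) {
	sm.published = tabletType
}

func (sm *stateManager) updateTypeAndPublish(tabletType string, mysqlDown bool) {
	if mysqlDown {
		// retryTransition redoes the serving transition once MySQL is back.
		sm.retryPending = true
	} else {
		sm.updateLocked(tabletType)
	}
	sm.publishStateLocked(tabletType) // topo is updated either way
}

func main() {
	sm := &stateManager{serving: true}
	sm.updateTypeAndPublish("REPLICA", true)
	fmt.Println(sm.published, sm.retryPending) // REPLICA true
}
```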
- Skip replication configuration in `setReplicationSourceLocked` when MySQL is down
  - Everything after `ChangeTabletType` requires MySQL; return nil and let VTOrc or a VTTablet restart repair replication later
- Best-effort `STOP REPLICA` before `Mysqld.Shutdown()`
  - Addresses a separate race in MySQL's `close_connections()` where `close_listener()` removes the unix socket before `end_slave()` stops replication threads
  - See bug report: `Mysqld.Shutdown()` can return while replication threads are still running (#19625)
Reproduction Steps
- Start a cluster via `examples/local` (or similar)
- Stop VTOrc
- Kill `mysqld_safe` and `mysqld` on the primary tablet, keeping `vttablet` running
- Call `SetReplicationSource` on the old primary: `grpcurl -plaintext -d '{"parent": {"cell": "zone1", "uid": 100}, "semiSync": true}' localhost:16101 tabletmanagerservice.TabletManager/SetReplicationSource`
- Observe the tablet remains `PRIMARY` in topo: `vtctldclient GetTablets`
Binary Version
v19+

Operating System and Environment details
Linux, Kubernetes, macOS (reproduced locally)