Bug Report: vttablet PRIMARY self-demotion fails to update topo when local mysqld is down #19623

@timvaillancourt

Description

Overview of the Issue

Problem

Related to #18528, this issue covers what happens after a successful EmergencyReparentShard when the old primary's vttablet is still running but its local mysqld is down (or slow to start).

When a PRIMARY fails, VTOrc triggers an ERS and a new primary is elected. The old primary's vttablet — either still running or restarted automatically (e.g. a Kubernetes pod restart) — receives a SetReplicationSource RPC telling it to self-demote. The RPC fails immediately in convertBoolToSemiSyncAction, which calls SemiSyncExtensionLoaded, which queries MySQL — and MySQL is down (errno 2002).

Because the error occurs before setReplicationSourceLocked is called, ChangeTabletType is never invoked and the topo is never updated. The tablet remains PRIMARY in topo despite another tablet being promoted.

Reproduced locally by:

  1. Starting a cluster via examples/local
  2. Stopping VTOrc
  3. Killing mysqld_safe + mysqld on the primary (keeping vttablet running)
  4. Calling SetReplicationSource via grpcurl on the old primary
  5. Observing the tablet stays PRIMARY in topo

Root Cause

SetReplicationSource calls convertBoolToSemiSyncAction before setReplicationSourceLocked. convertBoolToSemiSyncAction calls SemiSyncExtensionLoaded which queries MySQL. When MySQL is down, this fails with errno 2002 and the error is returned immediately — ChangeTabletType and the topo update are never reached.

Even if we got past that, the ChangeTabletType → updateTypeAndPublish → updateLocked path calls SetServingType, which tries to connect to MySQL via serveNonPrimary → connect() → se.EnsureConnectionAndDB(). This could drain the context timeout, leaving no time for publishStateLocked to update the topo.
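The short-circuit above can be modeled in a few lines. This is an illustrative sketch, not the real Vitess code: the function names mirror the RPC flow, but the signatures and the errConnRefused value are stand-ins.

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-in for the MySQL client error when mysqld is unreachable.
var errConnRefused = errors.New("errno 2002: can't connect through unix socket")

// semiSyncExtensionLoaded models the MySQL query that fails when mysqld is down.
func semiSyncExtensionLoaded(mysqlUp bool) (bool, error) {
	if !mysqlUp {
		return false, errConnRefused // the query never reaches mysqld
	}
	return true, nil
}

// setReplicationSource mirrors the current order of operations: the semi-sync
// probe runs first, so with mysqld down we return before the topo update.
func setReplicationSource(mysqlUp bool) (topoUpdated bool, err error) {
	if _, err := semiSyncExtensionLoaded(mysqlUp); err != nil {
		return false, err // ChangeTabletType is never reached
	}
	// ChangeTabletType(REPLICA) would run here and publish to topo.
	return true, nil
}

func main() {
	updated, err := setReplicationSource(false)
	fmt.Println(updated, err != nil) // topo untouched, error surfaced
}
```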

Solution

  1. Add IsMySQLLocal() and IsLocalMySQLDown() to the MysqlDaemon interface

    • IsMySQLLocal() returns true when the DBA connection uses a unix socket
    • IsLocalMySQLDown() probes MySQL via a DBA connection and uses heuristics to determine if MySQL is actually down vs. other transient errors:
      • CRConnectionError (errno 2002, unix socket) is the signal
      • "Too many connections" proves MySQL is alive
      • File-descriptor exhaustion is detected and excluded (client-side problem, not MySQL)
      • Socket file existence is validated
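The heuristics in step 1 could be classified roughly as follows. This is a minimal sketch under stated assumptions: isLocalMySQLDown, mysqlErr, and the errno constants are illustrative stand-ins, not the proposed MysqlDaemon methods or the real Vitess error types.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

const (
	crConnectionError = 2002 // client could not connect via unix socket
	erTooManyConns    = 1040 // server replied "Too many connections": it is alive
)

// mysqlErr is a stand-in for a MySQL error carrying an errno.
type mysqlErr struct{ num int }

func (e *mysqlErr) Error() string { return fmt.Sprintf("mysql errno %d", e.num) }

// isLocalMySQLDown returns true only when the evidence points at a dead local
// mysqld, not at a client-side or transient problem.
func isLocalMySQLDown(err error, socketPath string) bool {
	var me *mysqlErr
	if !errors.As(err, &me) {
		return false
	}
	switch me.num {
	case erTooManyConns:
		return false // the server answered, so it is up
	case crConnectionError:
		// File-descriptor exhaustion is our problem, not MySQL's.
		if errors.Is(err, syscall.EMFILE) {
			return false
		}
		// A missing socket file corroborates the diagnosis; either way,
		// errno 2002 over a unix socket means nothing is accepting
		// connections locally right now.
		if _, statErr := os.Stat(socketPath); os.IsNotExist(statErr) {
			return true
		}
		return true
	}
	return false
}

func main() {
	fmt.Println(isLocalMySQLDown(&mysqlErr{crConnectionError}, "/nonexistent/mysql.sock"))
	fmt.Println(isLocalMySQLDown(&mysqlErr{erTooManyConns}, "/nonexistent/mysql.sock"))
}
```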
  2. Skip convertBoolToSemiSyncAction in SetReplicationSource when MySQL is down

    • Use SemiSyncActionNone instead of querying MySQL for semi-sync extension status
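Step 2 amounts to a guard in front of the semi-sync probe. A sketch, assuming a mysqlDown flag from the probe in step 1; chooseSemiSyncAction and the action constants are illustrative names, not the exact Vitess identifiers.

```go
package main

import "fmt"

type semiSyncAction int

const (
	semiSyncActionNone semiSyncAction = iota
	semiSyncActionSet
	semiSyncActionUnset
)

// chooseSemiSyncAction skips the MySQL round-trip entirely when the local
// mysqld is known to be down, so SetReplicationSource can still reach
// ChangeTabletType and correct the topo record.
func chooseSemiSyncAction(mysqlDown, semiSync bool) semiSyncAction {
	if mysqlDown {
		return semiSyncActionNone // don't query; extension status is unknowable anyway
	}
	if semiSync {
		return semiSyncActionSet // normally gated on SemiSyncExtensionLoaded
	}
	return semiSyncActionUnset
}

func main() {
	fmt.Println(chooseSemiSyncAction(true, true) == semiSyncActionNone)
}
```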
  3. Skip updateLocked in updateTypeAndPublish when MySQL is down

    • updateLocked transitions serving state (query service, VREngine, etc.) which all require MySQL
    • publishStateLocked still runs to update topo
    • retryTransition handles reconnecting when MySQL comes back
  4. Skip replication configuration in setReplicationSourceLocked when MySQL is down

    • Everything after ChangeTabletType requires MySQL — return nil and let VTOrc or VTTablet restart repair replication later
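Step 4's early return can be sketched as follows. The function name mirrors the one in the issue, but the signature and the mysqlDown parameter are illustrative assumptions, not the real Vitess code.

```go
package main

import "fmt"

// setReplicationSourceLocked, after ChangeTabletType has updated the topo,
// skips all replication reconfiguration when mysqld is down: VTOrc (or the
// next vttablet restart) repairs replication once MySQL is back.
func setReplicationSourceLocked(mysqlDown bool) (replicationConfigured bool, err error) {
	// At this point ChangeTabletType has already run: topo now says REPLICA.
	if mysqlDown {
		return false, nil // topo is fixed; replication repair is deferred
	}
	// Normally: STOP REPLICA, CHANGE REPLICATION SOURCE TO, START REPLICA.
	return true, nil
}

func main() {
	configured, err := setReplicationSourceLocked(true)
	fmt.Println(configured, err)
}
```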
  5. Best-effort STOP REPLICA before Mysqld.Shutdown()

Reproduction Steps

  1. Start a cluster via examples/local (or similar)
  2. Stop VTOrc
  3. Kill mysqld_safe and mysqld on the primary tablet, keeping vttablet running
  4. Call SetReplicationSource on the old primary:
    grpcurl -plaintext -d '{"parent": {"cell": "zone1", "uid": 100}, "semiSync": true}' \
      localhost:16101 tabletmanagerservice.TabletManager/SetReplicationSource
  5. Observe the tablet remains PRIMARY in topo:
    vtctldclient GetTablets

Binary Version

v19+

Operating System and Environment details

Linux, Kubernetes, macOS (reproduced locally)
