Secondary control planes' etcd service takes hours/days until it becomes running #12899
Unanswered
What-is-water93 asked this question in Q&A
Replies: 1 comment

> There is no communication across CP nodes over etcd API, so the joining CP can't add itself. This is not a bug, so I'm going to convert it to a discussion.
Bug Report
Description
When bootstrapping a cluster with 3 control plane nodes, one or both of the secondary control plane nodes' etcd services stay stuck in a loop: the service sits in the Preparing state (for ~30 min), then enters the Failed state (~25 min), then the node reboots, and the loop starts anew (Preparing → Failed → Reboot).
After hours or days the node's etcd somehow joins after a reboot.
The nodes are deployed in a VMware environment, on the same physical host and in the same subnet. The Kubernetes cluster comes up via the VIP endpoint on the primary CP where etcd is running.
Network connectivity between the nodes was checked, it is possible to reach both etcd ports on the primary CP node from e.g. a pod running on the impaired secondary CP nodes.
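The connectivity check described above can be sketched as a small script. This is a minimal sketch, not the exact commands used in the report: `PRIMARY_CP_IP` is a placeholder, and ports 2379/2380 are etcd's standard client and peer ports.

```shell
#!/usr/bin/env bash
# Verify TCP reachability of the etcd client (2379) and peer (2380) ports
# on the primary CP from a secondary node. PRIMARY_CP_IP is a placeholder.
PRIMARY_CP_IP="${PRIMARY_CP_IP:-10.0.0.10}"

check_port() {
  # bash's /dev/tcp pseudo-device opens a TCP connection; timeout bounds the wait
  if timeout 3 bash -c "cat < /dev/null > /dev/tcp/$1/$2" 2>/dev/null; then
    echo "open:   $1:$2"
  else
    echo "closed: $1:$2"
  fi
}

for port in 2379 2380; do
  check_port "$PRIMARY_CP_IP" "$port"
done
```

Running this from a debug pod (or via `nc -zv` where available) distinguishes a routing/firewall problem from an etcd-level join problem.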
The discovery service is used; running `talosctl get affiliates` on all of the CP nodes shows all other cluster members and their IPs, and all the nodes are shown when looking up the cluster ID in the discovery service web UI. There is no etcd log on the impaired CP nodes, with the service being stuck in preparing/failed. `talosctl etcd members` returns just the primary CP when run against the primary, and times out as expected against the secondary CPs where etcd is not running. The etcd image was also pulled and shows up when using `talosctl -n $NODE_IP images list --namespace system`.
Custom registries are used, as well as proxy settings, which include the nodes' subnet under no_proxy (we verified connections between the CP nodes via debug pods, and also deployed goldpinger to see if there were any timeouts or long requests, but everything was green).
network config
cluster
machine
Logs
dmesg
On the impacted CP it only prints
etcd log
doesn't exist on the impacted CPs, and on the primary CP with working etcd it does not contain cluster-join-related activity or failed attempts to join.
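The logs above can be pulled with the following sketch (again hedged: `NODE` is a placeholder, and the script skips itself where `talosctl` is unavailable):

```shell
#!/usr/bin/env bash
# Fetch the kernel ring buffer and the etcd service log from one node.
fetch_logs() {
  local node="$1"
  if ! command -v talosctl >/dev/null 2>&1; then
    echo "talosctl not installed; skipping"
    return 0
  fi
  talosctl -n "$node" dmesg | tail -n 50   # recent kernel messages
  talosctl -n "$node" logs etcd            # absent while the service is stuck in preparing
}

fetch_logs "${NODE:-10.0.0.10}"
```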
Environment
Client:
Tag: v1.12.4
SHA: fc8e600
Built:
Go version: go1.25.7
OS/Arch: darwin/arm64
Server:
NODE: node_ip
Tag: v1.11.6
SHA: 6dd1430
Built:
Go version: go1.24.11
OS/Arch: linux/amd64
Enabled: RBAC