Secondary control planes' etcd service takes hours/days until it becomes running #12899
Unanswered
What-is-water93 asked this question in Q&A
Replies: 1 comment

> There is no communication across CP nodes over etcd API, so the joining CP can't add itself. This is not a bug, so I'm going to convert it to a discussion.
Bug Report
Description
When bootstrapping a cluster with 3 control plane nodes, one or both of the secondary control plane nodes' etcd services stay stuck in a loop: the service sits in the Preparing state (for ~30 min), then enters the Failed state (~25 min), then the node reboots, and the loop starts anew (Preparing → Failed → Reboot).
After hours or days the node's etcd somehow joins after a reboot.
The nodes are deployed in a VMware environment, on the same physical host and in the same subnet. The Kubernetes cluster comes up via the VIP endpoint on the primary CP where etcd is running.
Network connectivity between the nodes was checked, it is possible to reach both etcd ports on the primary CP node from e.g. a pod running on the impaired secondary CP nodes.
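The connectivity check described above can be sketched as a small script. This is a minimal sketch, not the exact commands used in the report: `PRIMARY_CP_IP` is a placeholder, and ports 2379/2380 are etcd's standard client and peer ports.

```shell
#!/usr/bin/env bash
# Verify TCP reachability of the etcd client (2379) and peer (2380) ports
# on the primary CP from a secondary node. PRIMARY_CP_IP is a placeholder.
PRIMARY_CP_IP="${PRIMARY_CP_IP:-10.0.0.10}"

check_port() {
  # bash's /dev/tcp pseudo-device opens a TCP connection; timeout bounds the wait
  if timeout 3 bash -c "cat < /dev/null > /dev/tcp/$1/$2" 2>/dev/null; then
    echo "open:   $1:$2"
  else
    echo "closed: $1:$2"
  fi
}

for port in 2379 2380; do
  check_port "$PRIMARY_CP_IP" "$port"
done
```

Running this from a debug pod (or via `nc -zv` where available) distinguishes a routing/firewall problem from an etcd-level join problem.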
The discovery service is used; running `talosctl get affiliates` on all of the CP nodes shows all other cluster members and their IPs, and all the nodes are shown when looking up the cluster ID in the discovery service web UI. There is no etcd log on the impaired CP nodes, with the service being stuck in preparing/failed. `talosctl etcd members` returns just the primary CP when run against the primary, and times out as expected against the secondary CPs where etcd is not running. The etcd image was also pulled and shows up when using `talosctl -n $NODE_IP images list --namespace system`.
Custom registries are used, as well as proxy settings, which include the nodes' subnet under no_proxy (we verified connections between the CP nodes via debug pods, and also deployed goldpinger to see if there were any timeouts or long requests, but everything was green).
network config
cluster
machine
Logs
dmesg
On the impacted CP it only prints
etcd log
doesn't exist on the impacted CPs, and on the primary CP with working etcd it does not contain cluster-join-related activity or failed attempts to join.
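The logs above can be pulled with the following sketch (again hedged: `NODE` is a placeholder, and the script skips itself where `talosctl` is unavailable):

```shell
#!/usr/bin/env bash
# Fetch the kernel ring buffer and the etcd service log from one node.
fetch_logs() {
  local node="$1"
  if ! command -v talosctl >/dev/null 2>&1; then
    echo "talosctl not installed; skipping"
    return 0
  fi
  talosctl -n "$node" dmesg | tail -n 50   # recent kernel messages
  talosctl -n "$node" logs etcd            # absent while the service is stuck in preparing
}

fetch_logs "${NODE:-10.0.0.10}"
```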
Environment
Client:
Tag: v1.12.4
SHA: fc8e600
Built:
Go version: go1.25.7
OS/Arch: darwin/arm64
Server:
NODE: node_ip
Tag: v1.11.6
SHA: 6dd1430
Built:
Go version: go1.24.11
OS/Arch: linux/amd64
Enabled: RBAC