Open
Description
Environment:
- AKS version: [1.34.2]
- Node pool OS: Ubuntu
- Workload: Strimzi Kafka with strimzi-drain-cleaner ValidatingWebhookConfiguration
(failurePolicy: Ignore, timeoutSeconds: 5)
Expected behavior:
Per the Kubernetes Eviction API, a webhook denial (or 429 Too Many Requests) is a temporary
"not now" response. The AKS node upgrader should back off and retry, giving the Strimzi Cluster
Operator time to complete pod migration to the surge node before the next eviction attempt.
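To make the expected behavior concrete, here is a minimal sketch of the kind of back-off-and-retry loop described above. It is not AKS code and does not reflect how the upgrader is actually implemented. The function name `evict_with_backoff`, the `try_evict` callable (standing in for one Eviction API call that returns an HTTP status code), the retryable status set, and the delay values are all assumptions made for illustration.

```python
import time
from typing import Callable

# Assumption for this sketch: 429 plus the denial codes mentioned in this
# issue are treated as a temporary "not now" rather than a permanent failure.
RETRYABLE = {403, 422, 429}


def evict_with_backoff(
    try_evict: Callable[[], int],
    max_attempts: int = 5,
    base_delay: float = 5.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once an eviction is accepted, False if attempts run out.

    try_evict is a hypothetical callable that issues one eviction request
    and returns the HTTP status code of the response.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        status = try_evict()
        if 200 <= status < 300:
            return True  # eviction accepted
        if status not in RETRYABLE:
            raise RuntimeError(f"eviction failed with non-retryable status {status}")
        if attempt < max_attempts:
            # Wait before the next attempt so the operator can move the pod.
            sleep(delay)
            delay = min(delay * 2, max_delay)  # exponential backoff, capped
    return False
```

With the defaults above, a pod whose eviction is denied twice and then accepted would be retried after roughly 5 and 10 seconds instead of failing the upgrade on the first denial.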
Actual behavior:
- AKS node upgrader attempts to evict Kafka pod during node drain
- strimzi-drain-cleaner ValidatingWebhook returns [DENY/403/422]
- AKS upgrader removes the surge node and fails within ~ seconds
- No retry observed; Strimzi operator is not given time to move the pod
Steps to reproduce:
- Deploy Strimzi Kafka with strimzi-drain-cleaner (failurePolicy: Ignore)
- Trigger AKS node pool upgrade (patch or minor version)
- Node drain begins, eviction denied by webhook
- Upgrade fails promptly without noticeable retries
Question: Is it AKS upgrader policy to treat a ValidatingWebhook denial as a non-retryable
permanent failure? This does not align with Kubernetes Eviction API semantics.
Potentially Related: #4720