Skip to content

Node upgrade promptly aborts when ValidatingWebhook denies pod eviction #5637

@mcphersonwhite-axon

Description

@mcphersonwhite-axon

Environment:

  • AKS version: [1.34.2]
  • Node pool OS: Ubuntu
  • Workload: Strimzi Kafka with strimzi-drain-cleaner ValidatingWebhookConfiguration
    (failurePolicy: Ignore, timeoutSeconds: 5)

Expected behavior:
Per Kubernetes Eviction API spec, a webhook denial (or 429 TooManyRequests) is a temporary
"not now" response. The AKS node upgrader should backoff and retry, giving the Strimzi Cluster
Operator time to complete pod migration to the surge node before the next eviction attempt.

Actual behavior:

  • AKS node upgrader attempts to evict Kafka pod during node drain
  • strimzi-drain-cleaner ValidatingWebhook returns [DENY/403/422]
  • AKS upgrader removes the surge node and fails within ~ seconds
  • No retry observed; Strimzi operator is not given time to move the pod

Steps to reproduce:

  1. Deploy Strimzi Kafka with strimzi-drain-cleaner (failurePolicy: Ignore)
  2. Trigger AKS node pool upgrade (patch or minor version)
  3. Node drain begins, eviction denied by webhook
  4. Upgrade fails promptly without noticeable retries

Question: Is it AKS upgrader policy to treat a ValidatingWebhook denial as a non-retryable
permanent failure? This does not align with Kubernetes Eviction API semantics.

Potentially Related: #4720

Metadata

Metadata

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions