Skip to content

[BAC-492] Implement HPA transition step#68

Open
juanariastn wants to merge 13 commits intomasterfrom
bac-492-hpa-transition
Open

[BAC-492] Implement HPA transition step#68
juanariastn wants to merge 13 commits intomasterfrom
bac-492-hpa-transition

Conversation

@juanariastn
Copy link
Contributor

@juanariastn juanariastn commented Feb 12, 2026

Why? 🤔

When a service transitions from fixed spec.replicas to HPA-managed autoscaling, Kubernetes defaults the replica count to 1, causing partial downtime for services that cannot handle their workload with a single pod.
This happens because of how kubectl apply performs a three-way merge: it compares the previous manifest (stored in the last-applied-configuration annotation), the new desired manifest, and the live state. When spec.replicas was previously declared and is now omitted (because HPA should manage it), the three-way merge interprets this as an intentional deletion and removes the field — causing Kubernetes to default to 1 replica instead of preserving the current count.
Previous attempts to fix this — setting spec.replicas in the Helm template even when HPA is enabled, combined with ArgoCD ignoreDifferences — introduced a secondary issue: RespectIgnoreDifferences strips the field from the applied manifest during sync, which triggers the same three-way merge deletion, bringing replicas back to 1.

What? 📃

Helm chart microservices-v8uses the standard Helm pattern — spec.replicas is not set when autoscaling.enabled=true.
This PR adds a new pre-deploy step (argo_hpa_transition.sh) that runs before upserting the ArgoCD Application. When autoscaling.enabled: true is detected in the user's values file, the script:

  1. Checks if the Rollout exists in the cluster
  2. Reads the last-applied-configuration annotation from the Rollout
  3. If spec.replicas is present, removes it using jq and patches the annotation back
    This ensures the three-way merge sees "field was absent, still absent" and preserves the live replica count. The operation is idempotent — if the Rollout doesn't exist or spec.replicas is already absent, the step is a no-op.

This solution works analogously to https://github.com/max0ne/helm-omit-replicas-for-hpa

Additional Links 🔗

BAC-492

Summary by CodeRabbit

  • New Features

    • Staging rollouts now enable autoscaling (min 2 / max 4, target CPU 80%) for more flexible capacity.
  • Chores

    • Deployment pipeline adds a prepare-step to coordinate autoscaler handoffs during rollouts.
    • Added tooling to clean deployment metadata before enabling autoscaling, preventing replica conflicts and ensuring smooth HPA transition.
    • CI: minor environment variable value updated for staging deploy tests.

churma and others added 9 commits January 7, 2026 18:49
Implement HPA transition preparation script and update ArgoCD deployment steps to clean spec.replicas from last-applied-configuration. This change prevents conflicts during HPA transitions by ensuring the correct replica count is maintained.
Refactor the HPA transition script to replace the ARGO_PARAMETERS_FILE variable with VALUES_FILE, aligning with the Helm values file usage. Update the ArgoCD deployment steps accordingly to ensure consistency in environment variable naming and improve clarity in the script's usage instructions.
@juanariastn juanariastn self-assigned this Feb 12, 2026
@juanariastn juanariastn added enhancement New feature or request canary labels Feb 12, 2026
@coderabbitai
Copy link

coderabbitai bot commented Feb 12, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds an autoscaling block to the staging values file, inserts a pipeline step "Prepare HPA transition" to run a new script, and adds a script that removes spec.replicas from a Rollout’s kubectl.kubernetes.io/last-applied-configuration annotation to prepare for HPA handoff.

Changes

Cohort / File(s) Summary
CI values
.circleci/resources/values-test-rollout-staging.yml
Adds autoscaling block: enabled: true, minReplicas: 2, maxReplicas: 4, targetCPUUtilizationPercentage: 80.
Pipeline
src/commands/argo-deploy-application.yml
Inserts new step "Prepare HPA transition" (sets RELEASE_NAME, NAMESPACE) that runs scripts/argo_hpa_transition.sh before "Upsert Argo Application".
HPA transition script (new)
src/scripts/argo_hpa_transition.sh
Adds script that validates VALUES_FILE, RELEASE_NAME, NAMESPACE; checks autoscaling.enabled via yq and skips if not enabled; verifies Rollout exists; removes spec.replicas from the kubectl.kubernetes.io/last-applied-configuration annotation using jq and updates the Rollout annotation; includes usage, colored output, and error handling.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 2
❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Merge Conflict Detection ⚠️ Warning ❌ Merge conflicts detected (7 files):

⚔️ .circleci/resources/values-test-rollout-staging.yml (content)
⚔️ .circleci/test-deploy.yml (content)
⚔️ src/commands/argo-deploy-application.yml (content)
⚔️ src/jobs/argo-deploy.yml (content)
⚔️ src/scripts/argo_app_previous_status.sh (content)
⚔️ src/scripts/argo_cli_common.sh (content)
⚔️ src/scripts/argo_rollout_status_common.sh (content)

These conflicts must be resolved before merging into master.
Resolve conflicts locally and push changes to this branch.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically identifies the main change: implementing an HPA transition step to fix the replica count issue during autoscaling migration.
Description check ✅ Passed The description follows the template structure with well-filled Why and What sections, providing comprehensive context about the problem, solution, and implementation details.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch bac-492-hpa-transition
⚔️ Resolve merge conflicts (beta)
  • Auto-commit resolved conflicts to branch bac-492-hpa-transition
  • Create stacked PR with resolved conflicts
  • Post resolved changes as copyable diffs in a comment

No actionable comments were generated in the recent review. 🎉


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/scripts/argo_hpa_transition.sh`:
- Around line 48-55: Before invoking yq, explicitly verify the values file
exists and is readable (check values_file) and if not, print an error and exit
non‑zero; after running yq to set hpa_enabled, detect yq failure (non‑zero exit
status) instead of redirecting stderr to /dev/null and if yq fails print the yq
error and exit non‑zero so we don’t silently skip HPA handling; retain the
existing conditional that skips when hpa_enabled != "true", but ensure failures
are treated as errors (use exit 1) rather than returning success.

Comment on lines +48 to +55
# Check if autoscaling is being enabled in this deploy
local hpa_enabled
hpa_enabled=$(yq e '.autoscaling.enabled' "$values_file" 2>/dev/null)

if [[ "$hpa_enabled" != "true" ]]; then
echo -e "${GREEN}HPA not enabled in this deploy, skipping.${NC}"
return 0
fi
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# First, let's see the file structure and how VALUES_FILE is initialized
head -60 src/scripts/argo_hpa_transition.sh

Repository: TiendaNube/eks-orb

Length of output: 2361


🏁 Script executed:

#!/bin/bash
# Continue reading the script to see what happens after the autoscaling check
tail -n +56 src/scripts/argo_hpa_transition.sh | head -40

Repository: TiendaNube/eks-orb

Length of output: 1765


🏁 Script executed:

#!/bin/bash
# Search for how this script is called or referenced in the repo
rg "argo_hpa_transition" --type-list
find . -type f -name "*.sh" -o -name "*.yml" -o -name "*.yaml" | head -20
rg "argo_hpa_transition|VALUES_FILE" -l

Repository: TiendaNube/eks-orb

Length of output: 1259


🏁 Script executed:

#!/bin/bash
# Search for how argo_hpa_transition.sh is invoked
rg "argo_hpa_transition" -B 3 -A 3

Repository: TiendaNube/eks-orb

Length of output: 603


🏁 Script executed:

#!/bin/bash
# Get the full argo-deploy-application.yml to understand the context
cat -n src/commands/argo-deploy-application.yml | head -100

Repository: TiendaNube/eks-orb

Length of output: 4332


🏁 Script executed:

#!/bin/bash
# Continue reading to see how values files are handled
cat -n src/commands/argo-deploy-application.yml | sed -n '100,200p'

Repository: TiendaNube/eks-orb

Length of output: 5003


🏁 Script executed:

#!/bin/bash
# Check the test values files to see how autoscaling is configured
echo "=== Test values files ===" 
cat -n .circleci/resources/values-test-rollout-*.yml

Repository: TiendaNube/eks-orb

Length of output: 1990


🏁 Script executed:

#!/bin/bash
# Check the test-deploy.yml to see how the script is invoked with values files
cat -n .circleci/test-deploy.yml

Repository: TiendaNube/eks-orb

Length of output: 15461


🏁 Script executed:

#!/bin/bash
# Check if there are other error-handling patterns in the script we should note
cat -n src/scripts/argo_hpa_transition.sh

Repository: TiendaNube/eks-orb

Length of output: 4581


Add file existence check and explicit error handling for yq failures.

If VALUES_FILE doesn't exist or is unreadable, the script silently skips HPA handling due to yq errors being redirected to /dev/null. This masks a real problem: if autoscaling is actually enabled, the replica cleanup (which prevents Kubernetes from defaulting to 1 replica during HPA transition) will be skipped, causing a replica reset. The script should validate the file exists before reading and exit with an error if yq fails instead of silently skipping.

Suggested hardening for file existence + yq failures
   # Check if autoscaling is being enabled in this deploy
   local hpa_enabled
-  hpa_enabled=$(yq e '.autoscaling.enabled' "$values_file" 2>/dev/null)
+  if [[ ! -f "$values_file" ]]; then
+    echo -e "${RED}Error: values file '$values_file' not found.${NC}"
+    exit 2
+  fi
+  if ! hpa_enabled=$(yq e '.autoscaling.enabled // "false"' "$values_file"); then
+    echo -e "${RED}Error: failed to read autoscaling.enabled from '$values_file'.${NC}"
+    exit 2
+  fi
🤖 Prompt for AI Agents
In `@src/scripts/argo_hpa_transition.sh` around lines 48 - 55, Before invoking yq,
explicitly verify the values file exists and is readable (check values_file) and
if not, print an error and exit non‑zero; after running yq to set hpa_enabled,
detect yq failure (non‑zero exit status) instead of redirecting stderr to
/dev/null and if yq fails print the yq error and exit non‑zero so we don’t
silently skip HPA handling; retain the existing conditional that skips when
hpa_enabled != "true", but ensure failures are treated as errors (use exit 1)
rather than returning success.

Remove VALUES_FILE from HPA transition script and update usage instructions. This change simplifies the script by eliminating the need for the values file variable, ensuring clarity in the required environment variables for deployment.
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/scripts/argo_hpa_transition.sh`:
- Around line 49-63: The script currently swallows all kubectl errors; update
the rollout existence check and the last-applied fetch to use --ignore-not-found
and stop masking failures: call kubectl get rollout "$release_name" -n
"$namespace" --ignore-not-found and check its exit code (treat non-zero as a
real error and fail fast, only treat zero+empty output as "not found"), and for
last_applied run kubectl get rollout "$release_name" -n "$namespace" -o
jsonpath='{.metadata.annotations.kubectl\.kubernetes\.io/last-applied-configuration}'
--ignore-not-found without the "|| echo ''" fallback so kubectl errors surface;
keep the existing empty-string check on last_applied to skip when legitimately
absent.

Comment on lines +49 to +63
# Check if the Rollout exists (might not on first deploy)
if ! kubectl get rollout "$release_name" -n "$namespace" > /dev/null 2>&1; then
echo -e "${GREEN}Rollout '$release_name' not found (likely first deploy), skipping.${NC}"
return 0
fi

# Get the last-applied-configuration annotation
local last_applied
last_applied=$(kubectl get rollout "$release_name" -n "$namespace" \
-o jsonpath='{.metadata.annotations.kubectl\.kubernetes\.io/last-applied-configuration}' 2>/dev/null || echo "")

if [[ -z "$last_applied" ]]; then
echo -e "${GREEN}No last-applied-configuration annotation found, skipping.${NC}"
return 0
fi
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Check if the file exists and read the relevant lines
fd "argo_hpa_transition.sh" -x wc -l {}

Repository: TiendaNube/eks-orb

Length of output: 101


🏁 Script executed:

# Read the file to see the actual code context
cat -n src/scripts/argo_hpa_transition.sh | sed -n '40,75p'

Repository: TiendaNube/eks-orb

Length of output: 1749


🏁 Script executed:

# Check for kubectl usage patterns in the script to understand context
rg "kubectl" src/scripts/argo_hpa_transition.sh -A 2 -B 2

Repository: TiendaNube/eks-orb

Length of output: 1315


🏁 Script executed:

# Verify kubectl --ignore-not-found flag is standard
kubectl get --help 2>/dev/null | grep -A 2 "ignore-not-found" || echo "Checking kubectl documentation"

Repository: TiendaNube/eks-orb

Length of output: 92


🌐 Web query:

kubectl --ignore-not-found flag usage bash error handling

💡 Result:

--ignore-not-found is meant to make “object doesn’t exist” not be a fatal error (useful with set -e / CI scripts), while still failing on real errors.

What it does (exit code behavior)

  • kubectl get ... --ignore-not-found: if the object isn’t found, it prints nothing and returns exit code 0. (kubernetes.io)
  • kubectl delete ... --ignore-not-found: if the object isn’t found, it treats that as a successful delete (and returns exit code 0). Other errors still fail. (kubernetes.io)

Bash patterns

Idempotent delete (safe with set -e):

set -euo pipefail
kubectl delete configmap my-cm -n my-ns --ignore-not-found

Idempotent get (don’t fail if absent):

set -euo pipefail
kubectl get deploy my-deploy -n my-ns --ignore-not-found >/dev/null

If you need to branch on existence (without --ignore-not-found):

if kubectl get deploy my-deploy -n my-ns >/dev/null 2>&1; then
  echo "exists"
else
  echo "missing (or another error)"
fi

If you tell me which command you’re using (get, delete, wait, etc.) and whether set -e is on, I can suggest the most robust pattern for that exact case.

Citations:


Distinguish kubectl "not found" from real errors to prevent silent failures.

Lines 50 and 57–58 suppress all kubectl failures (RBAC, network, auth, parsing errors) and treat them as "not found," causing the script to silently skip cleanup. This bypasses the protection mechanism and reintroduces the three-way merge conflict risk if kubectl is misconfigured or inaccessible.

Use --ignore-not-found to distinguish true "not found" (exit 0) from real errors, and fail fast on actual kubectl failures:

Suggested fix
-  if ! kubectl get rollout "$release_name" -n "$namespace" > /dev/null 2>&1; then
-    echo -e "${GREEN}Rollout '$release_name' not found (likely first deploy), skipping.${NC}"
-    return 0
-  fi
+  local rollout_name
+  if ! rollout_name=$(kubectl get rollout "$release_name" -n "$namespace" --ignore-not-found -o name); then
+    echo -e "${RED}Failed to query rollout '$release_name' in namespace '$namespace'.${NC}"
+    return 1
+  fi
+  if [[ -z "$rollout_name" ]]; then
+    echo -e "${GREEN}Rollout '$release_name' not found (likely first deploy), skipping.${NC}"
+    return 0
+  fi
@@
-  last_applied=$(kubectl get rollout "$release_name" -n "$namespace" \
-    -o jsonpath='{.metadata.annotations.kubectl\.kubernetes\.io/last-applied-configuration}' 2>/dev/null || echo "")
+  if ! last_applied=$(kubectl get rollout "$release_name" -n "$namespace" \
+    -o jsonpath='{.metadata.annotations.kubectl\.kubernetes\.io/last-applied-configuration}'); then
+    echo -e "${RED}Failed to read last-applied-configuration annotation.${NC}"
+    return 1
+  fi
🤖 Prompt for AI Agents
In `@src/scripts/argo_hpa_transition.sh` around lines 49 - 63, The script
currently swallows all kubectl errors; update the rollout existence check and
the last-applied fetch to use --ignore-not-found and stop masking failures: call
kubectl get rollout "$release_name" -n "$namespace" --ignore-not-found and check
its exit code (treat non-zero as a real error and fail fast, only treat
zero+empty output as "not found"), and for last_applied run kubectl get rollout
"$release_name" -n "$namespace" -o
jsonpath='{.metadata.annotations.kubectl\.kubernetes\.io/last-applied-configuration}'
--ignore-not-found without the "|| echo ''" fallback so kubectl errors surface;
keep the existing empty-string check on last_applied to skip when legitimately
absent.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

canary enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants