Conversation
…c When Re-enabling HPA
Implement HPA transition preparation script and update ArgoCD deployment steps to clean spec.replicas from last-applied-configuration. This change prevents conflicts during HPA transitions by ensuring the correct replica count is maintained.
Refactor the HPA transition script to replace the ARGO_PARAMETERS_FILE variable with VALUES_FILE, aligning with the Helm values file usage. Update the ArgoCD deployment steps accordingly to ensure consistency in environment variable naming and improve clarity in the script's usage instructions.
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds an autoscaling block to the staging values file, inserts a pipeline step "Prepare HPA transition" to run a new script, and adds a script that removes Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches🧪 Generate unit tests (beta)
⚔️ Resolve merge conflicts (beta)
No actionable comments were generated in the recent review. 🎉 Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/scripts/argo_hpa_transition.sh`:
- Around line 48-55: Before invoking yq, explicitly verify the values file
exists and is readable (check values_file) and if not, print an error and exit
non‑zero; after running yq to set hpa_enabled, detect yq failure (non‑zero exit
status) instead of redirecting stderr to /dev/null and if yq fails print the yq
error and exit non‑zero so we don’t silently skip HPA handling; retain the
existing conditional that skips when hpa_enabled != "true", but ensure failures
are treated as errors (use exit 1) rather than returning success.
src/scripts/argo_hpa_transition.sh
Outdated
| # Check if autoscaling is being enabled in this deploy | ||
| local hpa_enabled | ||
| hpa_enabled=$(yq e '.autoscaling.enabled' "$values_file" 2>/dev/null) | ||
|
|
||
| if [[ "$hpa_enabled" != "true" ]]; then | ||
| echo -e "${GREEN}HPA not enabled in this deploy, skipping.${NC}" | ||
| return 0 | ||
| fi |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# First, let's see the file structure and how VALUES_FILE is initialized
head -60 src/scripts/argo_hpa_transition.shRepository: TiendaNube/eks-orb
Length of output: 2361
🏁 Script executed:
#!/bin/bash
# Continue reading the script to see what happens after the autoscaling check
tail -n +56 src/scripts/argo_hpa_transition.sh | head -40Repository: TiendaNube/eks-orb
Length of output: 1765
🏁 Script executed:
#!/bin/bash
# Search for how this script is called or referenced in the repo
rg "argo_hpa_transition" --type-list
find . -type f -name "*.sh" -o -name "*.yml" -o -name "*.yaml" | head -20
rg "argo_hpa_transition|VALUES_FILE" -lRepository: TiendaNube/eks-orb
Length of output: 1259
🏁 Script executed:
#!/bin/bash
# Search for how argo_hpa_transition.sh is invoked
rg "argo_hpa_transition" -B 3 -A 3Repository: TiendaNube/eks-orb
Length of output: 603
🏁 Script executed:
#!/bin/bash
# Get the full argo-deploy-application.yml to understand the context
cat -n src/commands/argo-deploy-application.yml | head -100Repository: TiendaNube/eks-orb
Length of output: 4332
🏁 Script executed:
#!/bin/bash
# Continue reading to see how values files are handled
cat -n src/commands/argo-deploy-application.yml | sed -n '100,200p'Repository: TiendaNube/eks-orb
Length of output: 5003
🏁 Script executed:
#!/bin/bash
# Check the test values files to see how autoscaling is configured
echo "=== Test values files ==="
cat -n .circleci/resources/values-test-rollout-*.ymlRepository: TiendaNube/eks-orb
Length of output: 1990
🏁 Script executed:
#!/bin/bash
# Check the test-deploy.yml to see how the script is invoked with values files
cat -n .circleci/test-deploy.ymlRepository: TiendaNube/eks-orb
Length of output: 15461
🏁 Script executed:
#!/bin/bash
# Check if there are other error-handling patterns in the script we should note
cat -n src/scripts/argo_hpa_transition.shRepository: TiendaNube/eks-orb
Length of output: 4581
Add file existence check and explicit error handling for yq failures.
If VALUES_FILE doesn't exist or is unreadable, the script silently skips HPA handling due to yq errors being redirected to /dev/null. This masks a real problem: if autoscaling is actually enabled, the replica cleanup (which prevents Kubernetes from defaulting to 1 replica during HPA transition) will be skipped, causing a replica reset. The script should validate the file exists before reading and exit with an error if yq fails instead of silently skipping.
Suggested hardening for file existence + yq failures
# Check if autoscaling is being enabled in this deploy
local hpa_enabled
- hpa_enabled=$(yq e '.autoscaling.enabled' "$values_file" 2>/dev/null)
+ if [[ ! -f "$values_file" ]]; then
+ echo -e "${RED}Error: values file '$values_file' not found.${NC}"
+ exit 2
+ fi
+ if ! hpa_enabled=$(yq e '.autoscaling.enabled // "false"' "$values_file"); then
+ echo -e "${RED}Error: failed to read autoscaling.enabled from '$values_file'.${NC}"
+ exit 2
+ fi🤖 Prompt for AI Agents
In `@src/scripts/argo_hpa_transition.sh` around lines 48 - 55, Before invoking yq,
explicitly verify the values file exists and is readable (check values_file) and
if not, print an error and exit non‑zero; after running yq to set hpa_enabled,
detect yq failure (non‑zero exit status) instead of redirecting stderr to
/dev/null and if yq fails print the yq error and exit non‑zero so we don’t
silently skip HPA handling; retain the existing conditional that skips when
hpa_enabled != "true", but ensure failures are treated as errors (use exit 1)
rather than returning success.
Remove VALUES_FILE from HPA transition script and update usage instructions. This change simplifies the script by eliminating the need for the values file variable, ensuring clarity in the required environment variables for deployment.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/scripts/argo_hpa_transition.sh`:
- Around line 49-63: The script currently swallows all kubectl errors; update
the rollout existence check and the last-applied fetch to use --ignore-not-found
and stop masking failures: call kubectl get rollout "$release_name" -n
"$namespace" --ignore-not-found and check its exit code (treat non-zero as a
real error and fail fast, only treat zero+empty output as "not found"), and for
last_applied run kubectl get rollout "$release_name" -n "$namespace" -o
jsonpath='{.metadata.annotations.kubectl\.kubernetes\.io/last-applied-configuration}'
--ignore-not-found without the "|| echo ''" fallback so kubectl errors surface;
keep the existing empty-string check on last_applied to skip when legitimately
absent.
| # Check if the Rollout exists (might not on first deploy) | ||
| if ! kubectl get rollout "$release_name" -n "$namespace" > /dev/null 2>&1; then | ||
| echo -e "${GREEN}Rollout '$release_name' not found (likely first deploy), skipping.${NC}" | ||
| return 0 | ||
| fi | ||
|
|
||
| # Get the last-applied-configuration annotation | ||
| local last_applied | ||
| last_applied=$(kubectl get rollout "$release_name" -n "$namespace" \ | ||
| -o jsonpath='{.metadata.annotations.kubectl\.kubernetes\.io/last-applied-configuration}' 2>/dev/null || echo "") | ||
|
|
||
| if [[ -z "$last_applied" ]]; then | ||
| echo -e "${GREEN}No last-applied-configuration annotation found, skipping.${NC}" | ||
| return 0 | ||
| fi |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# Check if the file exists and read the relevant lines
fd "argo_hpa_transition.sh" -x wc -l {}Repository: TiendaNube/eks-orb
Length of output: 101
🏁 Script executed:
# Read the file to see the actual code context
cat -n src/scripts/argo_hpa_transition.sh | sed -n '40,75p'Repository: TiendaNube/eks-orb
Length of output: 1749
🏁 Script executed:
# Check for kubectl usage patterns in the script to understand context
rg "kubectl" src/scripts/argo_hpa_transition.sh -A 2 -B 2Repository: TiendaNube/eks-orb
Length of output: 1315
🏁 Script executed:
# Verify kubectl --ignore-not-found flag is standard
kubectl get --help 2>/dev/null | grep -A 2 "ignore-not-found" || echo "Checking kubectl documentation"Repository: TiendaNube/eks-orb
Length of output: 92
🌐 Web query:
kubectl --ignore-not-found flag usage bash error handling
💡 Result:
--ignore-not-found is meant to make “object doesn’t exist” not be a fatal error (useful with set -e / CI scripts), while still failing on real errors.
What it does (exit code behavior)
kubectl get ... --ignore-not-found: if the object isn’t found, it prints nothing and returns exit code 0. (kubernetes.io)kubectl delete ... --ignore-not-found: if the object isn’t found, it treats that as a successful delete (and returns exit code 0). Other errors still fail. (kubernetes.io)
Bash patterns
Idempotent delete (safe with set -e):
set -euo pipefail
kubectl delete configmap my-cm -n my-ns --ignore-not-foundIdempotent get (don’t fail if absent):
set -euo pipefail
kubectl get deploy my-deploy -n my-ns --ignore-not-found >/dev/nullIf you need to branch on existence (without --ignore-not-found):
if kubectl get deploy my-deploy -n my-ns >/dev/null 2>&1; then
echo "exists"
else
echo "missing (or another error)"
fiIf you tell me which command you’re using (get, delete, wait, etc.) and whether set -e is on, I can suggest the most robust pattern for that exact case.
Citations:
- 1: https://kubernetes.io/docs/reference/kubectl/kubectl-cmds/?utm_source=openai
- 2: https://kubernetes.io/docs/reference/kubectl/kubectl-cmds/?utm_source=openai
Distinguish kubectl "not found" from real errors to prevent silent failures.
Lines 50 and 57–58 suppress all kubectl failures (RBAC, network, auth, parsing errors) and treat them as "not found," causing the script to silently skip cleanup. This bypasses the protection mechanism and reintroduces the three-way merge conflict risk if kubectl is misconfigured or inaccessible.
Use --ignore-not-found to distinguish true "not found" (exit 0) from real errors, and fail fast on actual kubectl failures:
Suggested fix
- if ! kubectl get rollout "$release_name" -n "$namespace" > /dev/null 2>&1; then
- echo -e "${GREEN}Rollout '$release_name' not found (likely first deploy), skipping.${NC}"
- return 0
- fi
+ local rollout_name
+ if ! rollout_name=$(kubectl get rollout "$release_name" -n "$namespace" --ignore-not-found -o name); then
+ echo -e "${RED}Failed to query rollout '$release_name' in namespace '$namespace'.${NC}"
+ return 1
+ fi
+ if [[ -z "$rollout_name" ]]; then
+ echo -e "${GREEN}Rollout '$release_name' not found (likely first deploy), skipping.${NC}"
+ return 0
+ fi
@@
- last_applied=$(kubectl get rollout "$release_name" -n "$namespace" \
- -o jsonpath='{.metadata.annotations.kubectl\.kubernetes\.io/last-applied-configuration}' 2>/dev/null || echo "")
+ if ! last_applied=$(kubectl get rollout "$release_name" -n "$namespace" \
+ -o jsonpath='{.metadata.annotations.kubectl\.kubernetes\.io/last-applied-configuration}'); then
+ echo -e "${RED}Failed to read last-applied-configuration annotation.${NC}"
+ return 1
+ fi🤖 Prompt for AI Agents
In `@src/scripts/argo_hpa_transition.sh` around lines 49 - 63, The script
currently swallows all kubectl errors; update the rollout existence check and
the last-applied fetch to use --ignore-not-found and stop masking failures: call
kubectl get rollout "$release_name" -n "$namespace" --ignore-not-found and check
its exit code (treat non-zero as a real error and fail fast, only treat
zero+empty output as "not found"), and for last_applied run kubectl get rollout
"$release_name" -n "$namespace" -o
jsonpath='{.metadata.annotations.kubectl\.kubernetes\.io/last-applied-configuration}'
--ignore-not-found without the "|| echo ''" fallback so kubectl errors surface;
keep the existing empty-string check on last_applied to skip when legitimately
absent.
Why? 🤔
When a service transitions from fixed
spec.replicasto HPA-managedautoscaling, Kubernetes defaults the replica count to 1, causing partial downtime for services that cannot handle their workload with a single pod.This happens because of how
kubectl applyperforms a three-way merge: it compares the previous manifest (stored in thelast-applied-configurationannotation), the new desired manifest, and the live state. Whenspec.replicaswas previously declared and is now omitted (because HPA should manage it), the three-way merge interprets this as an intentional deletion and removes the field — causing Kubernetes to default to 1 replica instead of preserving the current count.Previous attempts to fix this — setting
spec.replicasin the Helm template even when HPA is enabled, combined with ArgoCDignoreDifferences— introduced a secondary issue:RespectIgnoreDifferencesstrips the field from the applied manifest during sync, which triggers the same three-way merge deletion, bringing replicas back to 1.What? 📃
Helm chart
microservices-v8uses the standard Helm pattern —spec.replicasis not set whenautoscaling.enabled=true.This PR adds a new pre-deploy step (
argo_hpa_transition.sh) that runs before upserting the ArgoCD Application. Whenautoscaling.enabled: trueis detected in the user's values file, the script:last-applied-configurationannotation from the Rolloutspec.replicasis present, removes it usingjqand patches the annotation backThis ensures the three-way merge sees "field was absent, still absent" and preserves the live replica count. The operation is idempotent — if the Rollout doesn't exist or
spec.replicasis already absent, the step is a no-op.This solution works analogously to https://github.com/max0ne/helm-omit-replicas-for-hpa
Additional Links 🔗
BAC-492
Summary by CodeRabbit
New Features
Chores