Add resourceRetryStrategy for quota-aware retries in Sensor triggers #12

Merged
abdulazillow merged 3 commits into feature/zg-1.8 from abdula/AIP-9672-resource-retry-strategy-1.8
Dec 3, 2025

@abdulazillow
Collaborator

Description

Implements a resource-aware retry strategy for Sensor triggers so that resource quota errors are handled differently from other errors. When a quota limit is exceeded, the trigger can use a separate retry strategy with longer intervals, giving resources time to become available.

Adds an optional resourceRetryStrategy configuration that applies when a resource constraint error (quota exceeded, limits) is detected:

  • Automatically detects resource quota errors from Kubernetes API responses
  • Uses resourceRetryStrategy for quota errors, retryStrategy for other errors
  • Falls back to retryStrategy if resourceRetryStrategy is not configured
  • Integrates with existing DLQ (Dead Letter Queue) handling
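
The selection rule in the list above can be sketched as follows. This is an illustrative Python model, not the PR's Go code: the `Backoff` class, `QUOTA_MARKERS`, and both function names are hypothetical, and matching on the "exceeded quota" substring (the wording Kubernetes uses when a ResourceQuota rejects a request) is an assumption about how detection might work.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Backoff:
    """Hypothetical stand-in holding the two fields used in this PR's example."""
    steps: int
    duration_seconds: float


# Assumption: quota errors are recognised by message substring. The actual
# detection logic in the PR may inspect the API error differently.
QUOTA_MARKERS = ("exceeded quota",)


def is_resource_quota_error(message: str) -> bool:
    """Return True if the error message looks like a quota rejection."""
    lowered = message.lower()
    return any(marker in lowered for marker in QUOTA_MARKERS)


def select_strategy(
    error_message: str,
    retry_strategy: Backoff,
    resource_retry_strategy: Optional[Backoff] = None,
) -> Backoff:
    """Pick resourceRetryStrategy for quota errors, retryStrategy otherwise.

    Falls back to retryStrategy when resourceRetryStrategy is not configured.
    """
    if resource_retry_strategy is not None and is_resource_quota_error(error_message):
        return resource_retry_strategy
    return retry_strategy
```

Keeping the fallback inside the selector means callers never need to check whether the optional field is set.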

Configuration

apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: example-sensor
spec:
  triggers:
    - template:
        name: workflow-trigger
        argoWorkflow:
          generateName: my-workflow-
          operation: submit
      retryStrategy:
        steps: 2
        duration: 5s
      # Optional: separate retry strategy for resource quota errors
      resourceRetryStrategy:
        steps: 6
        duration: 60s
      dlqTrigger:
        template:
          name: dlq-log-trigger
          log: {}
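
To compare the two strategies in the configuration above, it can help to estimate how long each one may keep a trigger waiting. The sketch below assumes a fixed interval with one sleep per step and no backoff factor or jitter, which the PR does not confirm; `parse_duration` and `max_wait_seconds` are hypothetical helpers, and the parser handles only single-unit values such as `5s` or `2m`.

```python
# Seconds per unit for the simplified duration parser below.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(value: str) -> int:
    """Parse a single-unit duration such as '5s' or '2m' into seconds.

    Simplification: compound values like '1m30s' are not supported.
    """
    number, unit = value[:-1], value[-1]
    if unit not in _UNIT_SECONDS or not number.isdigit():
        raise ValueError(f"unsupported duration: {value!r}")
    return int(number) * _UNIT_SECONDS[unit]


def max_wait_seconds(steps: int, duration: str) -> int:
    """Upper bound on total sleep time, assuming one fixed sleep per step."""
    return steps * parse_duration(duration)


print(max_wait_seconds(2, "5s"))   # retryStrategy: 10
print(max_wait_seconds(6, "60s"))  # resourceRetryStrategy: 360
```

Under these assumptions, the example gives ordinary errors about 10 seconds of retrying and quota errors about 6 minutes before the DLQ trigger would take over.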

Backward Compatibility

✅ Fully backward compatible

  • The resourceRetryStrategy field is optional
  • Existing Sensor configurations continue to work without any changes
  • When not configured, all errors use the standard retryStrategy
  • No breaking changes to the API or existing behavior
  • No migration required for existing deployments

Test Results

  1. Resource Quota Error Handling

    • Setup: Trigger workflow submission when namespace quota is exceeded (5/5 workflows running)
    • Expected: Initial fast retries, then a switch to retries at 60s intervals using resourceRetryStrategy
    • Verified: Retries every ~60 seconds until quota slot becomes available, then successfully dispatches
  2. Non-Quota Error Handling

    • Setup: Trigger with invalid workflow name (validation error)
    • Expected: Uses standard retryStrategy with fast retries
    • Verified: Two fast retries (~5s apart), then DLQ trigger fires
  3. DLQ Integration

    • Setup: Configure DLQ trigger, exhaust all retries
    • Expected: DLQ trigger fires after retries exhausted
    • Verified: DLQ trigger successfully logs payload with CloudEvent ID for audit
  4. Backward Compatibility

    • Setup: Existing Sensor without resourceRetryStrategy field
    • Expected: All errors use retryStrategy as before
    • Verified: Default behavior unchanged, no migration needed
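
The four scenarios above can be reproduced against a small model of the retry loop. This is a minimal Python sketch under stated assumptions, not the PR's implementation: `QuotaExceeded`, `run_with_retries`, and the rule that each strategy keeps its own retry budget are all hypothetical. The `sleep` parameter is injectable so the loop can be exercised without real waiting.

```python
import time
from typing import Callable, Optional, Tuple

Strategy = Tuple[int, float]  # (steps, interval in seconds)


class QuotaExceeded(Exception):
    """Hypothetical stand-in for a resource quota rejection."""


def run_with_retries(
    action: Callable[[], None],
    retry: Strategy,
    resource_retry: Optional[Strategy] = None,
    dlq: Optional[Callable[[Exception], None]] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run action, retrying per strategy; call dlq once retries are exhausted.

    Assumption: quota errors and other errors draw on separate retry budgets.
    """
    used = {"retry": 0, "resource": 0}
    while True:
        try:
            action()
            return True
        except Exception as err:
            # Quota errors use resource_retry only when it is configured.
            use_resource = isinstance(err, QuotaExceeded) and resource_retry is not None
            key = "resource" if use_resource else "retry"
            steps, interval = resource_retry if use_resource else retry
            if used[key] >= steps:
                if dlq is not None:
                    dlq(err)  # hand the failure to the dead letter queue
                return False
            used[key] += 1
            sleep(interval)
```

Passing a list's `append` method as `sleep` records the intervals the loop would have waited, which makes each scenario easy to check in a unit test.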

- Move resource-aware retry logic from actionFunc to triggerWithRateLimit
- Simplify actionFunc to just call triggerActions
- Add DLQ handling in triggerWithRateLimit after retries exhausted
@abdulazillow changed the title from "Abdula/aip 9672 resource retry strategy 1.8" to "Add resourceRetryStrategy for quota-aware retries in Sensor triggers" on Nov 18, 2025
@abdulazillow merged commit f734daa into feature/zg-1.8 on Dec 3, 2025
1 check passed