Skip to content

fix: return all SQS messages as batch failures on non-ScaleError exceptions; enable throttle retries#5107

Draft
Copilot wants to merge 4 commits intomainfrom
copilot/fix-sqs-batch-deletion-issue
Draft

fix: return all SQS messages as batch failures on non-ScaleError exceptions; enable throttle retries#5107
Copilot wants to merge 4 commits intomainfrom
copilot/fix-sqs-batch-deletion-issue

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented Apr 14, 2026

Description

When scaleUp() throws any non-ScaleError exception (SSM ThrottlingException, EC2 Unavailable, GitHub API errors), scaleUpHandler returns an empty batchItemFailures array. SQS interprets this as "all messages processed successfully" and permanently deletes them. Jobs queue indefinitely with no runner assigned and no DLQ routing.

Root cause in lambda.ts: the else branch in the catch block never populates batchItemFailures:

// BEFORE — returns [] → SQS deletes all messages permanently
} else {
  logger.error(`...ignoring batch`, { error: e });
}

// AFTER — all messages returned as failures → SQS retries
} else {
  logger.error(`...batch will be retried via SQS.`, { error: e });
  batchItemFailures.push(...sqsMessages.map(({ messageId }) => ({ itemIdentifier: messageId })));
}

Contributing bug in auth.ts: onRateLimit and onSecondaryRateLimit handlers don't return true, so @octokit/plugin-throttling aborts immediately instead of retrying — the resulting exception then hits the empty batchItemFailures path above.

Changes:

  • lambda.ts: Populate batchItemFailures with all message IDs in the non-ScaleError catch path
  • auth.ts: Add return true to both throttle handlers to enable retry
  • lambda.test.ts: Update tests to assert correct retry behavior; add empty-batch edge case

Test Plan

  • Updated existing tests that were asserting the broken behavior (batchItemFailures: [] for non-ScaleError)
  • Added edge case test for empty batch with non-ScaleError
  • Full control-plane test suite passes (354 tests, 13 files)

Related Issues

…enable throttle retries

- lambda.ts: populate batchItemFailures for non-ScaleError exceptions so SQS
  retries messages instead of permanently deleting them
- auth.ts: return true from onRateLimit and onSecondaryRateLimit handlers so
  @octokit/plugin-throttling retries rate-limited requests
- lambda.test.ts: update tests to expect correct retry behavior and add empty
  batch edge case test
- auth.test.ts: add test for throttle handler configuration

Agent-Logs-Url: https://github.com/github-aws-runners/terraform-aws-github-runner/sessions/1d1a46e4-807c-4c6e-864d-50a9c09dd84e

Co-authored-by: Brend-Smits <15904543+Brend-Smits@users.noreply.github.com>
Comment thread lambdas/functions/control-plane/src/github/auth.test.ts Fixed
Copilot AI and others added 2 commits April 14, 2026 08:34
Co-authored-by: Brend-Smits <15904543+Brend-Smits@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix scaleUpHandler to prevent silent deletion of SQS batch fix: return all SQS messages as batch failures on non-ScaleError exceptions; enable throttle retries Apr 14, 2026
Copilot AI requested a review from Brend-Smits April 14, 2026 08:37
@github-actions
Copy link
Copy Markdown
Contributor

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Snapshot Warnings

⚠️: No snapshots were found for the head SHA 2bb21df.
Ensure that dependencies are being submitted on PR branches and consider enabling retry-on-snapshot-warnings. See the documentation for more information and troubleshooting advice.

Scanned Files

None

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

scaleUpHandler silently deletes entire SQS batch on non-ScaleError exceptions, causing permanent job loss

2 participants