Summary
When scaleUp() throws any exception that is not a ScaleError (e.g. ThrottlingException from SSM, Unavailable from EC2, GitHub API errors), the scaleUpHandler in lambda.ts returns an empty batchItemFailures array. SQS interprets an empty array as "all messages processed successfully" and permanently deletes them. Affected GitHub Actions jobs remain queued indefinitely and eventually time out after 24 hours with no runner ever being assigned.
This issue was originally described in #5024. PR #5029 was closed without merge. PR #4990 (v7.4.1) added @octokit/plugin-retry and fixed JIT config per-instance error handling but did not fix this root cause in lambda.ts.
Root Cause
lambda.ts lines 53–64:
```typescript
} catch (e) {
  if (e instanceof ScaleError) {
    batchItemFailures.push(...e.toBatchItemFailures(sqsMessages));
    logger.warn(`${e.detailedMessage} A retry will be attempted via SQS.`, { error: e });
  } else {
    logger.error(`Error processing batch (size: ${sqsMessages.length}): ${(e as Error).message}, ignoring batch`, {
      error: e,
    });
    // ← batchItemFailures is never populated here
    // ← returns [] → SQS deletes ALL messages permanently
  }
  return { batchItemFailures };
}
```
The batchItemFailures array is initialized as empty on line 41 and only populated for ScaleError exceptions. For every other exception type, it remains empty and SQS permanently removes all messages.
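The deletion semantics can be made concrete with a small model. This is a sketch, not the AWS implementation: `messagesDeletedBySqs` is a hypothetical helper that mimics how SQS, with `ReportBatchItemFailures` enabled, treats every message not listed in `batchItemFailures` as successfully processed and deletes it:

```typescript
// Hypothetical model of SQS partial-batch-response semantics (ReportBatchItemFailures).
type SQSBatchItemFailure = { itemIdentifier: string };
type SQSBatchResponse = { batchItemFailures: SQSBatchItemFailure[] };

// Given a batch and the handler's response, which messages does SQS delete?
function messagesDeletedBySqs(messageIds: string[], response: SQSBatchResponse): string[] {
  const retained = new Set(response.batchItemFailures.map((f) => f.itemIdentifier));
  return messageIds.filter((id) => !retained.has(id));
}

// An empty batchItemFailures array means every message is deleted,
// even when the handler actually failed:
messagesDeletedBySqs(['m1', 'm2'], { batchItemFailures: [] }); // ['m1', 'm2'] — all gone
messagesDeletedBySqs(['m1', 'm2'], { batchItemFailures: [{ itemIdentifier: 'm2' }] }); // ['m1']
```

This is why "do nothing on error" is the worst possible response shape: silence is indistinguishable from success.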
Contributing Bug: Rate Limit Handlers in auth.ts Never Retry
onRateLimit and onSecondaryRateLimit log the rate limit event but do not return true, which is required by @octokit/plugin-throttling to trigger a retry. The handlers return undefined implicitly, causing the throttling plugin to abort immediately and throw — which then hits the lambda.ts else branch above, deleting the entire batch.
```typescript
throttle: {
  onRateLimit: (retryAfter: number, options: Required<EndpointDefaults>) => {
    logger.warn(`GitHub rate limit: ...`);
    // missing: return true
  },
  onSecondaryRateLimit: (retryAfter: number, options: Required<EndpointDefaults>) => {
    logger.warn(`GitHub rate limit: SecondaryRateLimit ...`);
    // missing: return true
  },
},
```
Note: @octokit/plugin-retry (added in v7.4.1) handles 5xx retries from GitHub API but does not handle throttling — that is exclusively @octokit/plugin-throttling's responsibility. These two plugins serve different purposes and the throttling bug is independent.
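The plugin's contract can be illustrated with a minimal model. This is a sketch; `shouldRetry` is a hypothetical stand-in for the decision @octokit/plugin-throttling makes internally, which retries only when the handler returns exactly `true`:

```typescript
// Hypothetical model of the throttling plugin's retry decision.
type RateLimitHandler = (retryAfter: number) => boolean | void;

function shouldRetry(handler: RateLimitHandler, retryAfter: number): boolean {
  // The plugin retries only when the handler returns exactly `true`;
  // any falsy return (including implicit `undefined`) aborts and throws.
  return handler(retryAfter) === true;
}

const logOnly: RateLimitHandler = (retryAfter) => {
  console.warn(`rate limited, retry after ${retryAfter}s`);
  // implicit `return undefined` — the current auth.ts behavior
};
const logAndRetry: RateLimitHandler = (retryAfter) => {
  console.warn(`rate limited, retry after ${retryAfter}s`);
  return true;
};

shouldRetry(logOnly, 60);     // false → request aborts and throws
shouldRetry(logAndRetry, 60); // true → request is retried after 60s
```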
Observed Errors (all cause message deletion)
```text
ThrottlingException: Rate exceeded                                    ← SSM rate limit
    at putParameter
    attempts: 3, totalRetryDelay: 1198ms

Unavailable: The service is unavailable. Please try again shortly.    ← EC2 transient 503
    at AwsEc2QueryProtocol
    attempts: 3, totalRetryDelay: 191ms

POST /orgs/{org}/actions/runners/registration-token - 503 in 860ms    ← GitHub secondary rate limit
                                                           (not retried due to auth.ts bug)
```
All logged as: "Error processing batch (size: N): ..., ignoring batch" — the tell-tale sign of this bug.
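Affected invocations can be counted from exported logs. A sketch under stated assumptions: `countLostMessages` is a hypothetical helper, and it assumes the batch size appears in the message exactly as formatted by the `logger.error` call in lambda.ts:

```typescript
// Count messages silently deleted, by scanning log lines for the tell-tale error.
// The format string is taken from lambda.ts:
//   `Error processing batch (size: ${sqsMessages.length}): ..., ignoring batch`
const LOST_BATCH = /Error processing batch \(size: (\d+)\): .*, ignoring batch/;

function countLostMessages(logLines: string[]): number {
  return logLines.reduce((sum, line) => {
    const m = LOST_BATCH.exec(line);
    return m ? sum + Number(m[1]) : sum;
  }, 0);
}

countLostMessages([
  'Error processing batch (size: 3): Rate exceeded, ignoring batch',
  'scale-up completed normally',
]); // 3
```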
Impact
- GitHub Actions jobs stuck in queue for up to 24 hours then timed out
- No DLQ routing (messages never enter SQS retry/redrive flow)
- No metrics or alarms triggered (messages appear "successfully processed")
- EC2 instances may have been partially created before the exception — wasted cost with no corresponding runner
- Silent: operators have no indication messages were lost until users report stuck jobs
Corner Cases Considered
- Empty batch: sqsMessages can be empty if all records were non-SQS event sources. [].map(...) returns [] — no-op, safe.
- Partial processing before exception: If scaleUp() processes some runner groups successfully before throwing on another group, returning all messages as failures causes already-processed messages to be re-queued. For ephemeral runners, the isJobQueued check on retry prevents duplicate EC2 launches for already-running jobs. For non-ephemeral runners, duplicate runners may be launched but the scale-down Lambda handles cleanup. Correct trade-off: a duplicate runner is recoverable; a permanently lost SQS message is not.
- ScaleError partial retry semantics preserved: ScaleError.toBatchItemFailures() only retries failedInstanceCount messages, not all. This intentional partial-retry behaviour is unchanged — the fix only affects the non-ScaleError path.
- Existing test documents the bug: lambda.test.ts line 218 explicitly asserts batchItemFailures: [] for a non-ScaleError exception — this test must be updated.
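The two retry paths differ in how many messages they re-queue. A sketch of the contrast, with the caveat that `retryFailedOnly` is a hypothetical model: the real `ScaleError.toBatchItemFailures` may select messages differently, and only the `failedInstanceCount` cap is taken from the source:

```typescript
type SQSBatchItemFailure = { itemIdentifier: string };

// Non-ScaleError path in the proposed fix: retry the entire batch.
function retryAll(messageIds: string[]): SQSBatchItemFailure[] {
  return messageIds.map((id) => ({ itemIdentifier: id }));
}

// ScaleError path: retry only as many messages as instances that failed to
// launch (assumed selection: the first `failedInstanceCount` messages).
function retryFailedOnly(messageIds: string[], failedInstanceCount: number): SQSBatchItemFailure[] {
  return messageIds.slice(0, failedInstanceCount).map((id) => ({ itemIdentifier: id }));
}

retryAll(['m1', 'm2', 'm3']);           // 3 failures → whole batch re-queued
retryFailedOnly(['m1', 'm2', 'm3'], 1); // 1 failure → 2 messages deleted as processed
```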
Proposed Fix
- lambdas/functions/control-plane/src/lambda.ts
```typescript
} catch (e) {
  if (e instanceof ScaleError) {
    batchItemFailures.push(...e.toBatchItemFailures(sqsMessages));
    logger.warn(`${e.detailedMessage} A retry will be attempted via SQS.`, { error: e });
  } else {
    logger.error(
      `Error processing batch (size: ${sqsMessages.length}): ${(e as Error).message}, batch will be retried via SQS.`,
      { error: e },
    );
    batchItemFailures.push(...sqsMessages.map(({ messageId }) => ({ itemIdentifier: messageId })));
  }
  return { batchItemFailures };
}
```
- lambdas/functions/control-plane/src/github/auth.ts
```typescript
throttle: {
  onRateLimit: (retryAfter: number, options: Required<EndpointDefaults>) => {
    logger.warn(
      `GitHub rate limit: Request quota exhausted for request ${options.method} ${options.url}. Retrying after ${retryAfter} seconds.`,
    );
    return true;
  },
  onSecondaryRateLimit: (retryAfter: number, options: Required<EndpointDefaults>) => {
    logger.warn(
      `GitHub rate limit: SecondaryRateLimit detected for request ${options.method} ${options.url}. Retrying after ${retryAfter} seconds.`,
    );
    return true;
  },
},
```
- lambdas/functions/control-plane/src/lambda.test.ts
```typescript
// UPDATE — was asserting broken behavior:
it('Non scale error should return message as batch failure for retry.', async () => {
  const error = new Error('Non scale should resolve.');
  vi.mocked(scaleUp).mockRejectedValue(error);
  const result = await scaleUpHandler(sqsEvent, context);
  expect(result).toEqual({ batchItemFailures: [{ itemIdentifier: sqsRecord.messageId }] });
});

// UPDATE — was asserting batchItemFailures: []
it('Should return all messages as failures when scaleUp throws non-ScaleError', async () => {
  const records = createMultipleRecords(2);
  const multiRecordEvent: SQSEvent = { Records: records };
  vi.mocked(scaleUp).mockRejectedValue(new Error('Generic error'));
  const result = await scaleUpHandler(multiRecordEvent, context);
  expect(result).toEqual({
    batchItemFailures: [{ itemIdentifier: 'message-0' }, { itemIdentifier: 'message-1' }],
  });
});

// ADD — empty batch edge case:
it('Should return empty failures when batch is empty and scaleUp throws non-ScaleError', async () => {
  const emptyEvent: SQSEvent = { Records: [] };
  vi.mocked(scaleUp).mockRejectedValue(new Error('Generic error'));
  const result = await scaleUpHandler(emptyEvent, context);
  expect(result).toEqual({ batchItemFailures: [] });
});
```
Affected Versions
Confirmed in v7.3.x and v7.4.x. PR #4990 (v7.4.1) does not fix this issue.
Environment
GitHub Enterprise Cloud with Data Residency, ephemeral runners, JIT config disabled, org-level runners.