Summary
When scaleUp() throws any exception that is not a ScaleError (e.g. ThrottlingException from SSM, Unavailable from EC2, GitHub API errors), the scaleUpHandler in lambda.ts returns an empty batchItemFailures array. SQS interprets an empty array as "all messages processed successfully" and permanently deletes them. Affected GitHub Actions jobs remain queued indefinitely and eventually time out after 24 hours with no runner ever being assigned.
This issue was originally described in #5024. PR #5029 was closed without merge. PR #4990 (v7.4.1) added @octokit/plugin-retry and fixed JIT config per-instance error handling but did not fix this root cause in lambda.ts.
Root Cause
lambda.ts lines 53–64:
```typescript
} catch (e) {
  if (e instanceof ScaleError) {
    batchItemFailures.push(...e.toBatchItemFailures(sqsMessages));
    logger.warn(`${e.detailedMessage} A retry will be attempted via SQS.`, { error: e });
  } else {
    logger.error(`Error processing batch (size: ${sqsMessages.length}): ${(e as Error).message}, ignoring batch`, {
      error: e,
    });
    // ← batchItemFailures is never populated here
    // ← returns [] → SQS deletes ALL messages permanently
  }
  return { batchItemFailures };
}
```
The batchItemFailures array is initialized as empty on line 41 and only populated for ScaleError exceptions. For every other exception type, it remains empty and SQS permanently removes all messages.
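The deletion semantics can be made concrete with a small model. This is a sketch, not the AWS implementation: `messagesDeletedBySqs` is a hypothetical helper that mimics how SQS, with `ReportBatchItemFailures` enabled, treats every message not listed in `batchItemFailures` as successfully processed and deletes it:

```typescript
// Hypothetical model of SQS partial-batch-response semantics (ReportBatchItemFailures).
type SQSBatchItemFailure = { itemIdentifier: string };
type SQSBatchResponse = { batchItemFailures: SQSBatchItemFailure[] };

// Given a batch and the handler's response, which messages does SQS delete?
function messagesDeletedBySqs(messageIds: string[], response: SQSBatchResponse): string[] {
  const retained = new Set(response.batchItemFailures.map((f) => f.itemIdentifier));
  return messageIds.filter((id) => !retained.has(id));
}

// An empty batchItemFailures array means every message is deleted,
// even when the handler actually failed:
messagesDeletedBySqs(['m1', 'm2'], { batchItemFailures: [] }); // ['m1', 'm2'] — all gone
messagesDeletedBySqs(['m1', 'm2'], { batchItemFailures: [{ itemIdentifier: 'm2' }] }); // ['m1']
```

This is why "do nothing on error" is the worst possible response shape: silence is indistinguishable from success.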
Contributing Bug: Rate Limit Handlers in auth.ts Never Retry
onRateLimit and onSecondaryRateLimit log the rate limit event but do not return true, which is required by @octokit/plugin-throttling to trigger a retry. The handlers return undefined implicitly, causing the throttling plugin to abort immediately and throw — which then hits the lambda.ts else branch above, deleting the entire batch.
```typescript
throttle: {
  onRateLimit: (retryAfter: number, options: Required<EndpointDefaults>) => {
    logger.warn(`GitHub rate limit: ...`);
    // missing: return true
  },
  onSecondaryRateLimit: (retryAfter: number, options: Required<EndpointDefaults>) => {
    logger.warn(`GitHub rate limit: SecondaryRateLimit ...`);
    // missing: return true
  },
},
```
Note: @octokit/plugin-retry (added in v7.4.1) handles 5xx retries from GitHub API but does not handle throttling — that is exclusively @octokit/plugin-throttling's responsibility. These two plugins serve different purposes and the throttling bug is independent.
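The plugin's contract can be illustrated with a minimal model. This is a sketch; `shouldRetry` is a hypothetical stand-in for the decision @octokit/plugin-throttling makes internally, which retries only when the handler returns exactly `true`:

```typescript
// Hypothetical model of the throttling plugin's retry decision.
type RateLimitHandler = (retryAfter: number) => boolean | void;

function shouldRetry(handler: RateLimitHandler, retryAfter: number): boolean {
  // The plugin retries only when the handler returns exactly `true`;
  // any falsy return (including implicit `undefined`) aborts and throws.
  return handler(retryAfter) === true;
}

const logOnly: RateLimitHandler = (retryAfter) => {
  console.warn(`rate limited, retry after ${retryAfter}s`);
  // implicit `return undefined` — the current auth.ts behavior
};
const logAndRetry: RateLimitHandler = (retryAfter) => {
  console.warn(`rate limited, retry after ${retryAfter}s`);
  return true;
};

shouldRetry(logOnly, 60);     // false → request aborts and throws
shouldRetry(logAndRetry, 60); // true → request is retried after 60s
```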
Observed Errors (all cause message deletion)
```text
ThrottlingException: Rate exceeded                                    ← SSM rate limit
    at putParameter
    attempts: 3, totalRetryDelay: 1198ms

Unavailable: The service is unavailable. Please try again shortly.    ← EC2 transient 503
    at AwsEc2QueryProtocol
    attempts: 3, totalRetryDelay: 191ms

POST /orgs/{org}/actions/runners/registration-token - 503 in 860ms    ← GitHub secondary rate limit
                                                           (not retried due to auth.ts bug)
```
All logged as: "Error processing batch (size: N): ..., ignoring batch" — the tell-tale sign of this bug.
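Affected invocations can be counted from exported logs. A sketch under stated assumptions: `countLostMessages` is a hypothetical helper, and it assumes the batch size appears in the message exactly as formatted by the `logger.error` call in lambda.ts:

```typescript
// Count messages silently deleted, by scanning log lines for the tell-tale error.
// The format string is taken from lambda.ts:
//   `Error processing batch (size: ${sqsMessages.length}): ..., ignoring batch`
const LOST_BATCH = /Error processing batch \(size: (\d+)\): .*, ignoring batch/;

function countLostMessages(logLines: string[]): number {
  return logLines.reduce((sum, line) => {
    const m = LOST_BATCH.exec(line);
    return m ? sum + Number(m[1]) : sum;
  }, 0);
}

countLostMessages([
  'Error processing batch (size: 3): Rate exceeded, ignoring batch',
  'scale-up completed normally',
]); // 3
```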
Impact
- GitHub Actions jobs stuck in queue for up to 24 hours then timed out
- No DLQ routing (messages never enter SQS retry/redrive flow)
- No metrics or alarms triggered (messages appear "successfully processed")
- EC2 instances may have been partially created before the exception — wasted cost with no corresponding runner
- Silent: operators have no indication messages were lost until users report stuck jobs
Corner Cases Considered
- Empty batch: sqsMessages can be empty if all records were non-SQS event sources. [].map(...) returns [] — no-op, safe.
- Partial processing before exception: If scaleUp() processes some runner groups successfully before throwing on another group, returning all messages as failures causes already-processed messages to be re-queued. For ephemeral runners, the isJobQueued check on retry prevents duplicate EC2 launches for already-running jobs. For non-ephemeral runners, duplicate runners may be launched but the scale-down Lambda handles cleanup. Correct trade-off: a duplicate runner is recoverable; a permanently lost SQS message is not.
- ScaleError partial retry semantics preserved: ScaleError.toBatchItemFailures() only retries failedInstanceCount messages, not all. This intentional partial-retry behaviour is unchanged — the fix only affects the non-ScaleError path.
- Existing test documents the bug: lambda.test.ts line 218 explicitly asserts batchItemFailures: [] for a non-ScaleError exception — this test must be updated.
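The two retry paths differ in how many messages they re-queue. A sketch of the contrast, with the caveat that `retryFailedOnly` is a hypothetical model: the real `ScaleError.toBatchItemFailures` may select messages differently, and only the `failedInstanceCount` cap is taken from the source:

```typescript
type SQSBatchItemFailure = { itemIdentifier: string };

// Non-ScaleError path in the proposed fix: retry the entire batch.
function retryAll(messageIds: string[]): SQSBatchItemFailure[] {
  return messageIds.map((id) => ({ itemIdentifier: id }));
}

// ScaleError path: retry only as many messages as instances that failed to
// launch (assumed selection: the first `failedInstanceCount` messages).
function retryFailedOnly(messageIds: string[], failedInstanceCount: number): SQSBatchItemFailure[] {
  return messageIds.slice(0, failedInstanceCount).map((id) => ({ itemIdentifier: id }));
}

retryAll(['m1', 'm2', 'm3']);           // 3 failures → whole batch re-queued
retryFailedOnly(['m1', 'm2', 'm3'], 1); // 1 failure → 2 messages deleted as processed
```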
Proposed Fix
- lambdas/functions/control-plane/src/lambda.ts
```typescript
} catch (e) {
  if (e instanceof ScaleError) {
    batchItemFailures.push(...e.toBatchItemFailures(sqsMessages));
    logger.warn(`${e.detailedMessage} A retry will be attempted via SQS.`, { error: e });
  } else {
    logger.error(
      `Error processing batch (size: ${sqsMessages.length}): ${(e as Error).message}, batch will be retried via SQS.`,
      { error: e },
    );
    batchItemFailures.push(...sqsMessages.map(({ messageId }) => ({ itemIdentifier: messageId })));
  }
  return { batchItemFailures };
}
```
- lambdas/functions/control-plane/src/github/auth.ts
```typescript
throttle: {
  onRateLimit: (retryAfter: number, options: Required<EndpointDefaults>) => {
    logger.warn(
      `GitHub rate limit: Request quota exhausted for request ${options.method} ${options.url}. Retrying after ${retryAfter} seconds.`,
    );
    return true;
  },
  onSecondaryRateLimit: (retryAfter: number, options: Required<EndpointDefaults>) => {
    logger.warn(
      `GitHub rate limit: SecondaryRateLimit detected for request ${options.method} ${options.url}. Retrying after ${retryAfter} seconds.`,
    );
    return true;
  },
},
```
- lambdas/functions/control-plane/src/lambda.test.ts
```typescript
// UPDATE — was asserting broken behavior:
it('Non scale error should return message as batch failure for retry.', async () => {
  const error = new Error('Non scale should resolve.');
  vi.mocked(scaleUp).mockRejectedValue(error);
  const result = await scaleUpHandler(sqsEvent, context);
  expect(result).toEqual({ batchItemFailures: [{ itemIdentifier: sqsRecord.messageId }] });
});

// UPDATE — was asserting batchItemFailures: []
it('Should return all messages as failures when scaleUp throws non-ScaleError', async () => {
  const records = createMultipleRecords(2);
  const multiRecordEvent: SQSEvent = { Records: records };
  vi.mocked(scaleUp).mockRejectedValue(new Error('Generic error'));
  const result = await scaleUpHandler(multiRecordEvent, context);
  expect(result).toEqual({
    batchItemFailures: [{ itemIdentifier: 'message-0' }, { itemIdentifier: 'message-1' }],
  });
});

// ADD — empty batch edge case:
it('Should return empty failures when batch is empty and scaleUp throws non-ScaleError', async () => {
  const emptyEvent: SQSEvent = { Records: [] };
  vi.mocked(scaleUp).mockRejectedValue(new Error('Generic error'));
  const result = await scaleUpHandler(emptyEvent, context);
  expect(result).toEqual({ batchItemFailures: [] });
});
```
Affected Versions
Confirmed in v7.3.x and v7.4.x. PR #4990 (v7.4.1) does not fix this issue.
Environment
GitHub Enterprise Cloud with Data Residency, ephemeral runners, JIT config disabled, org-level runners.