Scale-down can terminate a runner that picks up a job between busy check and termination

We hit an issue where the scale-down lambda terminated an EC2 instance while it was actively running a GitHub Actions job (a Helm deploy, in our case). The job failed with:

```
The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.
```

We traced it back to a race condition in `removeRunner` in `scale-down.ts`. The current flow is:

1. Check if the runner is busy via the GitHub API → returns `false`
2. Deregister the runner from GitHub (delete)
3. Terminate the EC2 instance

The problem is that GitHub can assign a job to the runner between step 1 and step 2. Nothing re-checks the busy state after that point, so step 3 terminates an instance that is already running a job, and the runner gets killed mid-execution.
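To make the window concrete, here is a minimal sketch of the flow above with an in-memory fake that assigns a job right after the first busy check. The `GitHubRunners` and `Ec2` interfaces, `removeRunnerCurrent`, and `simulateRace` are hypothetical names for illustration and do not mirror the actual code in `scale-down.ts`.

```typescript
// Hypothetical interfaces for illustration only; not the module's real API.
interface GitHubRunners {
  isBusy(runnerId: number): Promise<boolean>;
  deregister(runnerId: number): Promise<void>;
}

interface Ec2 {
  terminate(instanceId: string): Promise<void>;
}

// The current order as described above: check, deregister, terminate.
async function removeRunnerCurrent(
  runnerId: number,
  instanceId: string,
  gh: GitHubRunners,
  ec2: Ec2,
): Promise<void> {
  if (await gh.isBusy(runnerId)) return; // step 1: reports false
  // Race window: a job can still be assigned to this runner here.
  await gh.deregister(runnerId); // step 2
  await ec2.terminate(instanceId); // step 3: kills the job if one was assigned
}

// In-memory fake: a job lands on the runner just after the first busy check.
async function simulateRace(): Promise<{ busy: boolean; terminated: string[] }> {
  const state = { busy: false, registered: true };
  const terminated: string[] = [];

  const gh: GitHubRunners = {
    async isBusy() {
      const answer = state.busy;
      if (state.registered) state.busy = true; // job assigned after the check
      return answer;
    },
    async deregister() {
      state.registered = false;
    },
  };
  const ec2: Ec2 = {
    async terminate(id) {
      terminated.push(id);
    },
  };

  await removeRunnerCurrent(1, "i-example", gh, ec2);
  return { busy: state.busy, terminated };
}
```

In this simulation the instance ends up in `terminated` even though the fake runner is busy, which matches what we saw in production.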

We confirmed this by matching the CloudTrail `TerminateInstances` event to the exact runner that was running the failed job. The scale-down lambda terminated the instance 13 seconds after the Helm deploy started.

### Suggested fix

Add a second busy check after deregistration, and terminate only if the runner is still idle:

1. Check busy (fast-path to skip obviously busy runners)
2. Deregister from GitHub (prevents new job assignment server-side)
3. Re-check busy state. This check is now stable since no new jobs can be assigned after deregistration
4. Terminate the EC2 instance only if the re-check reports the runner as idle

If the runner became busy between step 1 and step 2, the in-flight job still completes because the runner worker uses [job-scoped OAuth credentials](https://github.com/actions/runner/blob/main/src/Runner.Worker/JobRunner.cs#L80-L95), not the runner registration. The worker creates its own `VssConnection` from the job message and never checks registration status during execution. Deregistration only affects the listener (no new job pickup), not the worker (current job).

If the re-check finds the runner busy, we skip termination and let the instance be cleaned up as an orphan once the job finishes.
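A rough sketch of the proposed order, not a drop-in patch. The `GitHubRunners` and `Ec2` interfaces, the `RemovalOutcome` type, and `removeRunnerWithRecheck` are hypothetical names, and the sketch assumes the busy state can still be read after deregistration; how that read is implemented is left abstract behind `isBusy`.

```typescript
// Hypothetical interfaces for illustration only; not the module's real API.
interface GitHubRunners {
  // Assumption: this can still answer after the runner has been deregistered.
  isBusy(runnerId: number): Promise<boolean>;
  deregister(runnerId: number): Promise<void>;
}

interface Ec2 {
  terminate(instanceId: string): Promise<void>;
}

type RemovalOutcome = "skipped-busy" | "left-as-orphan" | "terminated";

async function removeRunnerWithRecheck(
  runnerId: number,
  instanceId: string,
  gh: GitHubRunners,
  ec2: Ec2,
): Promise<RemovalOutcome> {
  // Step 1: fast path, leave obviously busy runners alone.
  if (await gh.isBusy(runnerId)) return "skipped-busy";

  // Step 2: deregister so no new job can be assigned.
  await gh.deregister(runnerId);

  // Step 3: re-check. A job may have been picked up between steps 1 and 2.
  if (await gh.isBusy(runnerId)) {
    // Skip termination; orphan cleanup handles the instance after the job.
    return "left-as-orphan";
  }

  // Step 4: the runner is idle and can no longer receive jobs.
  await ec2.terminate(instanceId);
  return "terminated";
}
```

Returning an outcome instead of `void` is only there to make the three paths easy to log and test; the real function could keep its existing signature.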

### Environment

- Module version: 7.3.0 (also verified the same code exists in 7.5.0)
- Runner type: on-demand, non-ephemeral, repo-level
- Runner config: `aws-4-ubuntu` (c7a.xlarge)

Happy to put up a PR if this approach sounds reasonable.
