We hit an issue where the scale-down lambda terminated an EC2 instance while it was actively running a GitHub Actions job (a Helm deploy, in our case). The job failed with:
> The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.
We traced it back to a race condition in `removeRunner` in `scale-down.ts`. The current flow is:
1. Check if the runner is busy via the GitHub API → returns `false`
2. Deregister the runner from GitHub (delete)
3. Terminate the EC2 instance
The problem is that a job can be assigned to the runner between step 1 and step 2. By the time the instance is terminated, it's already running a job. The runner gets killed mid-execution.
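The race is easy to model in isolation. The sketch below is a toy simulation of the current ordering, not the module's actual API; all names (`Runner`, `removeRunnerCurrent`, the flags) are hypothetical:

```typescript
// Toy model of a runner as GitHub sees it.
interface Runner {
  busy: boolean;       // is a job currently executing?
  registered: boolean; // can GitHub still assign jobs to it?
}

function isBusy(r: Runner): boolean {
  return r.busy;
}

function deregister(r: Runner): void {
  // Server-side: after this, GitHub assigns no new jobs to the runner.
  r.registered = false;
}

function terminate(r: Runner): string {
  return r.busy ? "job killed mid-execution" : "clean termination";
}

// Current flow: check busy → deregister → terminate.
// The busy check can go stale before termination happens.
function removeRunnerCurrent(r: Runner, jobArrivesAfterCheck: boolean): string {
  if (isBusy(r)) return "skipped: busy";
  // Race window: the runner is still registered when the busy check
  // returns false, so GitHub can assign a job right here.
  if (jobArrivesAfterCheck) r.busy = true;
  deregister(r);
  return terminate(r); // terminates even though a job may now be running
}
```

With `jobArrivesAfterCheck` set, the model reproduces the observed failure: the instance is terminated while a job is in flight.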
We confirmed this by matching the CloudTrail TerminateInstances event to the exact runner that was running the failed job. The scale-down lambda terminated the instance 13 seconds after the Helm deploy started.
## Suggested fix
Swap the order: deregister first, then re-check busy state.
1. Check busy (fast path to skip obviously busy runners)
2. Deregister from GitHub (prevents new job assignment server-side)
3. Re-check busy state; this check is now stable, since no new jobs can be assigned after deregistration
If the runner became busy between step 1 and step 2, the in-flight job still completes, because the runner worker uses job-scoped OAuth credentials rather than the runner registration. The worker creates its own `VssConnection` from the job message and never checks registration status during execution. Deregistration only affects the listener (no new job pickup), not the worker (the current job).
If the re-check finds the runner busy, we skip termination and let the instance be cleaned up as an orphan once the job finishes.
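Continuing the same toy model, the reordered flow looks like this; again, the names (`RunnerState`, `removeRunnerFixed`) are hypothetical stand-ins for the real helpers in `scale-down.ts`:

```typescript
// Toy model of the runner as GitHub sees it.
interface RunnerState {
  busy: boolean;       // is a job currently executing?
  registered: boolean; // can GitHub still assign jobs to it?
}

function isRunnerBusy(r: RunnerState): boolean {
  return r.busy;
}

function deregisterRunner(r: RunnerState): void {
  // Server-side: no new jobs can be assigned after this point.
  r.registered = false;
}

function terminateInstance(_r: RunnerState): string {
  return "terminated";
}

// Reordered flow: check → deregister → re-check → terminate or skip.
function removeRunnerFixed(r: RunnerState, jobArrivesAfterCheck: boolean): string {
  if (isRunnerBusy(r)) return "skipped: busy (fast path)";

  // The same race window exists here, but it is no longer fatal.
  if (jobArrivesAfterCheck) r.busy = true;

  deregisterRunner(r);

  // Stable re-check: no new assignment is possible once deregistered,
  // so a non-busy result here cannot be invalidated later.
  if (isRunnerBusy(r)) return "skipped: became busy, orphan cleanup later";

  return terminateInstance(r);
}
```

In the raced case the fixed flow skips termination and defers to orphan cleanup; the in-flight job survives because deregistration only stops new assignments.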
## Environment
- Module version: 7.3.0 (also verified the same code exists in 7.5.0)
- Runner type: on-demand, non-ephemeral, repo-level
- Runner config: `aws-4-ubuntu` (c7a.xlarge)
Happy to put up a PR if this approach sounds reasonable.