Scale-down can terminate a runner that picks up a job between busy check and termination #5085

@JVenberg

We hit an issue where the scale-down lambda terminated an EC2 instance while it was actively running a GitHub Actions job (a Helm deploy, in our case). The job failed with:

The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.

We traced it back to a race condition in removeRunner in scale-down.ts. The current flow is:

  1. Check if the runner is busy via the GitHub API → returns false
  2. Deregister the runner from GitHub (delete)
  3. Terminate the EC2 instance

The problem is that a job can be assigned to the runner between step 1 and step 2. By the time the instance is terminated, it's already running a job. The runner gets killed mid-execution.
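To make the window concrete, here is a minimal sketch of the current order. It is not the actual code from scale-down.ts: `RunnerApi` and `removeRunnerCurrent` are hypothetical names standing in for the GitHub and EC2 calls.

```typescript
// Hypothetical stand-ins for the GitHub and EC2 calls; not the module's real API.
interface RunnerApi {
  isBusy(): Promise<boolean>;    // busy lookup via the GitHub API
  deregister(): Promise<void>;   // delete the runner registration
  terminate(): Promise<void>;    // terminate the EC2 instance
}

// Current order as described above: check, deregister, terminate.
// Returns true when the instance was terminated.
async function removeRunnerCurrent(api: RunnerApi): Promise<boolean> {
  // Step 1: skip runners that report busy.
  if (await api.isBusy()) return false;

  // Race window: a job assigned to the runner at this point is never
  // seen by the check above.

  // Step 2: deregister from GitHub.
  await api.deregister();

  // Step 3: terminate, even if a job started inside the window.
  await api.terminate();
  return true;
}
```

Nothing after step 1 looks at the busy state again, so a job that lands in the window is invisible to the rest of the function.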

We confirmed this by matching the CloudTrail TerminateInstances event to the exact runner that was running the failed job. The scale-down lambda terminated the instance 13 seconds after the Helm deploy started.

Suggested fix

Keep the initial busy check, but add a second check after deregistration and terminate only if the runner is still idle:

  1. Check busy (fast path to skip obviously busy runners)
  2. Deregister from GitHub (prevents new job assignment server-side)
  3. Re-check busy state. This check is now stable, since no new jobs can be assigned after deregistration
  4. Terminate the EC2 instance only if the re-check reports the runner idle

If the runner became busy between step 1 and step 2, the in-flight job still completes because the runner worker uses job-scoped OAuth credentials, not the runner registration. The worker creates its own VssConnection from the job message and never checks registration status during execution. Deregistration only affects the listener (no new job pickup), not the worker (current job).

If the re-check finds the runner busy, we skip termination and let the instance be cleaned up as an orphan once the job finishes.
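A rough sketch of the proposed flow, under stated assumptions: `RunnerOps`, `RemovalOutcome`, and `removeRunnerSafely` are hypothetical names, not identifiers from scale-down.ts. How the busy state is read once the runner is deregistered is left abstract here, and it is an assumption that such a lookup is still available at that point.

```typescript
// Hypothetical seams for the GitHub and EC2 calls; not the module's real API.
interface RunnerOps {
  isBusy(runnerId: number): Promise<boolean>;    // busy lookup (assumed to work after deregistration too)
  deregister(runnerId: number): Promise<void>;   // delete the runner registration
  terminate(instanceId: string): Promise<void>;  // terminate the EC2 instance
}

type RemovalOutcome = "skipped-busy" | "deregistered-still-busy" | "terminated";

async function removeRunnerSafely(
  ops: RunnerOps,
  runnerId: number,
  instanceId: string,
): Promise<RemovalOutcome> {
  // Step 1: fast path, leave obviously busy runners untouched.
  if (await ops.isBusy(runnerId)) return "skipped-busy";

  // Step 2: deregister so no new job can be assigned.
  await ops.deregister(runnerId);

  // Step 3: re-check. A job picked up between steps 1 and 2 shows up here.
  if (await ops.isBusy(runnerId)) {
    // Skip termination; the instance is left for orphan cleanup after the job.
    return "deregistered-still-busy";
  }

  // Step 4: the runner is idle and cannot receive new work.
  await ops.terminate(instanceId);
  return "terminated";
}
```

Returning a distinct outcome for the "deregistered but still busy" case lets the caller log it or count it separately from a normal termination.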

Environment

  • Module version: 7.3.0 (also verified the same code exists in 7.5.0)
  • Runner type: on-demand, non-ephemeral, repo-level
  • Runner config: aws-4-ubuntu (c7a.xlarge)

Happy to put up a PR if this approach sounds reasonable.
