Skip to content

Deploy dispatches stale commit_sha after a fresh build (UI cache out of sync) #1048

@krusche

Description

@krusche

Summary

Helios's CI/CD deploy action dispatched the prod-like-deployment.yml workflow on ls1intum/Artemis with a stale commit_sha input — the previous develop HEAD, not the current one. The workflow obediently fetched the build artifact for that older commit and deployed it, so the staging1 nodes rolled forward to yesterday's WAR instead of the just-merged PR. The deploy reported success; the smoke tests caught it because the running JVM's git.commit.id did not match the SHA we asked Helios to deploy.

Concrete instance (today, 2026-05-25)

  1. PR ls1intum/Artemis#12711 was merged into develop at 10:45:16Z. Merge commit on develop: 1258484214d47fe42b9a370b1dc860a6aba836fd.
  2. The build.yml workflow for that commit completed success at 10:57:07Z and produced an Artemis.war artifact (run 26396508993).
  3. UI signal: when opening Helios' CI/CD page for the Artemis repo, the new build for 12584842 was not visible for several minutes (suspected list caching).
  4. Triggering "Deploy to staging1" from Helios at 11:01:09Z produced Artemis deploy run 26397117774. From its Print inputs step log:
    Branch: develop
    Commit SHA: 01be782943782cb39d644d5c529223e873c8c51e   ← previous develop HEAD (PR #12778), NOT 12584842
    Environment: staging1.artemis.cit.tum.de
    Triggered by: krusche
    
  5. The workflow's check-build-status job resolved this SHA to build run 26371588020 (yesterday's build of 01be782), downloaded that artifact, and ansible rolled it to all three staging1 core nodes.
  6. After the deploy:
    • systemctl is-active artemisactive on all 6 hosts.
    • /management/health200 UP.
    • /management/info → git.commit.id01be782943782cb39d644d5c529223e873c8c51e (yesterday's commit, NOT the requested SHA).
    • /management/info → activeModuleFeatures → missing athena, apollon, ldap (because the running code predates the migration in #12711).

Re-dispatching the same workflow manually via gh workflow run … -f commit_sha=1258484214d47fe42b9a370b1dc860a6aba836fd … worked correctly, confirming the workflow itself is fine — only the SHA Helios passed was wrong.

Why this is a real problem

The deploy reported success even though the wrong code was rolled out. There is no automatic check that the deployed JVM's git.commit.id matches the requested commit_sha, so the only feedback the operator gets is the green checkmark. A more dangerous variant of the same bug would silently deploy a recent-but-not-current build to production.

Concretely on staging1 today, the migration from Spring profiles to module-feature toggles (Artemis #12711) requires a YAML pre-edit to add three *.enabled: true flags; without it, the new WAR refuses to start (getPropertyOrExitArtemis). Because Helios deployed the OLD WAR, the JVM came up fine — and the operator would only notice the regression if they specifically inspected activeModuleFeatures or the embedded git.commit.id. In a different scenario (e.g. a security fix), this could be considerably worse.

Suggested fixes

Possibly more than one of these:

  • Fix the source of the stale SHA: investigate whether Helios is caching the workflow_runs / branch HEAD response from the GitHub API (e.g. an in-memory or DB cache that isn't invalidated when a new build completes for that branch). The window between "build green for 12584842" (10:57Z) and "Helios still offers 01be782 as the deploy target" (11:01Z) suggests a cache TTL on the build listing.
  • Add a freshness check on dispatch: before dispatching, re-fetch the current HEAD of the target branch and warn / refuse if commit_sha is older than HEAD by more than N minutes (or simply does not match HEAD when deploying develop).
  • Show the actual SHA in the dispatch confirmation modal so the operator can spot a stale value before clicking through. Currently the UI lists builds by other identifiers, not the SHA itself.
  • Optional defense-in-depth on the Artemis side: extend .github/workflows/prod-like-deployment.yml with a post-deploy check that curls /management/info and asserts git.commit.id matches the input commit_sha. Then a stale-SHA deploy would at least fail loudly. (Tracked separately on the Artemis repo if you want — happy to open the PR.)

Environment

  • Helios: helios.aet.cit.tum.de
  • Affected Artemis repo: ls1intum/Artemis
  • Affected GitHub environment: staging1.artemis.cit.tum.de
  • Deploy workflow: .github/workflows/prod-like-deployment.yml

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions