Skip to content

feat(approval): reviewer-aware auto-approval + in-app approve/decline for protected envs#1098

Open
krusche wants to merge 8 commits into
stagingfrom
feat/deployment-approval-in-helios
Open

feat(approval): reviewer-aware auto-approval + in-app approve/decline for protected envs#1098
krusche wants to merge 8 commits into
stagingfrom
feat/deployment-approval-in-helios

Conversation

@krusche

@krusche krusche commented May 28, 2026

Copy link
Copy Markdown
Member

Motivation

Today, when Helios dispatches a deployment to a GitHub environment with a required-reviewer protection rule, three things go wrong:

  1. The webhook handler blindly impersonates the deployment creator and tries to approve, regardless of whether they're actually a required reviewer. When they aren't, GitHub returns 403 and the error is silently swallowed — the deployment sits in WAITING with no signal to the user.
  2. The only way to actually resolve such a pending deployment is to leave Helios, open GitHub's UI, find the workflow run, and click Review pending deployments. Reviewers forget; valuable time is lost.
  3. The Keycloak SSO session is hard-capped at 8h, so users effectively need to re-auth roughly daily — independent of the approval problem but symptomatic of the same "leaves Helios to do basic things" pain.

This PR is Phase 1 + Phase 2 of a focused approval flow that keeps everything inside Helios. Phase 3 (email one-click links) is intentionally out of scope and can land later.

What changes

Reviewer-aware auto-approval (Phase 1)

When GitHub fires deployment_status: waiting for a Helios-originated deployment, ApprovalService.reviewDeployment now branches on the env's protection rule:

Situation Outcome HeliosDeployment.autoApprovalDecision
No required-reviewer rule Do nothing — GitHub handles any wait-timer / branch policy itself NOT_APPLICABLE
Team-type reviewers present Legacy "impersonate deployer, try anyway" (Helios doesn't yet expand teams to members; punting silently would regress working setups) TEAM_REVIEWER_FALLBACK
preventSelfReview is on, or deployer not in User-type reviewer list Defer; surface in the in-app pending-approvals list for someone else to resolve DEFERRED_TO_REVIEWERS
Deployer is a User-type required reviewer Auto-approve as them via Keycloak impersonation, with a source-tagged comment so the GitHub audit trail says Auto-approved by Helios (deployer is required reviewer) AUTO_APPROVED

A bounded retry (up to 3 attempts, 100 ms + 300 ms backoff = 400 ms worst case) handles the race where the webhook arrives before the deploy-side transaction commits. The service is idempotent against webhook redelivery: once autoApprovalDecision is non-null, a second invocation is a no-op.

In-app approve / decline (Phase 2)

Two endpoints, gated on "current user is in this env's User-type required reviewer list":

  • POST /api/deployments/{id}/approve — impersonates the caller, approves on GitHub, marks any sibling pending requests CONSUMED_BY_OTHER in the same transaction.
  • POST /api/deployments/{id}/decline body {comment?} — same but state: rejected (the workflow run is killed for everyone; the UI surfaces that warning explicitly).
  • GET /api/deployments/pending-approvals — caller's outstanding requests, joined with deployment + env + repo summary.

Concurrency: SELECT FOR UPDATE on the HeliosDeployment row serialises concurrent clicks; only one reviewer's GitHub call ever lands. FAILED_AT_GITHUB distinguishes "clicked Approve" from "GitHub accepted Approve", letting the UI offer a clean retry. The transaction is annotated noRollbackFor = ResponseStatusException.class so the FAILED_AT_GITHUB audit row survives the 502 throw — without this, Spring's default rollback would silently discard the row.

DeploymentApprovalRequest entity (Phase 1 → 3)

One row per (deployment, reviewer) tracking the lifecycle (PENDING → APPROVED | DECLINED | EXPIRED | CONSUMED_BY_OTHER | FAILED_AT_GITHUB), tagged with via in {AUTO, IN_APP, EMAIL_LINK}. Auto rows are created in terminal state. Email-link Phase 3 will populate tokenHash (SHA-256 of the link token, plaintext only in the URL). Flyway V53 adds the table plus auto_approval_decision / auto_approval_at on helios_deployment for fast UI lookups + at-a-glance audit. ApprovalDecisionWriter atomically commits the audit row + deployment stamp together so a crash between the two writes cannot leave a half-state.

UI

New cross-repo /pending-approvals page (Approve / Decline per row, decline opens a dialog with an optional reason and the "this kills the workflow for everyone" warning). Badge on the profile button + a Pending approvals entry in the profile popover, both driven by the same 30 s-polled query — visible across every page since the queue is cross-repo.

Realm session lifespans

ssoSessionMaxLifespan 8 h → 30 d, ssoSessionIdleTimeout 8 h → 7 d on helios-example-realm.json. Prod Keycloak must mirror these for the user-visible effect; the example file is only the dev/staging seed.

GitHubService surface

approveDeploymentOnBehalfOfUser gains a comment parameter so the source of every approval (auto / in-app / email-link) is captured in the GitHub-side audit comment. A new rejectDeploymentOnBehalfOfUser(...) implements the decline path. The legacy 4-arg overload still exists. Token-exchange failure now throws IOException instead of silently returning, so callers can react.

Intentionally NOT in this PR

  • Email one-click approve / decline links (Phase 3). The plan lays out a token table (token_hash is already on DeploymentApprovalRequest, ready for use), but no Notification payload or template yet.
  • Team-type reviewer expansion (Phase 4). Today's "impersonate deployer, try anyway" fallback runs when team-type reviewers are present; this preserves existing setups rather than silently no-opping. Phase 4 will resolve team membership via GET /orgs/{org}/teams/{slug}/members (cached).
  • Pre-deploy confirmation modal. Mentioned in the brief; deferred. Phase 2's pending-approvals queue plus the source-tagged audit comments already cover the core requirement ("don't make me leave Helios").
  • deployment_review webhook sync. Approvals that happen outside Helios in the GitHub UI won't sync back to the pending list until Phase 4. The list is correct for Helios-driven deployments, which is the bulk of usage.
  • Admin bypass. GitHub allows repo admins to bypass reviewer rules in its UI; Helios does not replicate this in code.

Testing instructions

Prerequisites

  • Helios dev stack running locally (server on :8080, client on :4200, Postgres, Keycloak, NATS, GitHub-app webhook reachable).
  • A connected GitHub repository in Helios with at least one environment that has a required-reviewer protection rule of the User type (Settings → Environments → … → Deployment protection rules → Required reviewers).
  • Add your GitHub user as a required reviewer of that environment. For the cross-user tests, also add a second user (or use a colleague's account in a second browser profile).
  • Make sure your Helios login has gone through the GitHub IDP at least once so Keycloak holds a GitHub token for you (Keycloak Admin Console → realm → Users → you → Identity Provider Links → github should show a stored token).
  • Have a trivial deployment workflow on the repo's default branch — anything that targets the protected environment (a sleep 5 job is fine) — so you don't need to wait for a long build between attempts.

Quick DB peek commands used below (adjust connection params for your dev stack):

# Latest deployment-approval audit rows
psql -h localhost -U helios -d helios -c \
  "select id, helios_deployment_id, state, via, reviewer_login, responded_at, failure_reason
   from deployment_approval_request order by id desc limit 5;"

# Latest helios_deployment auto-approval decisions
psql -h localhost -U helios -d helios -c \
  "select id, status, auto_approval_decision, auto_approval_at
   from helios_deployment order by id desc limit 5;"

Server unit tests

cd server && ./gradlew :application-server:test \
  --tests "*ApprovalServiceTest" \
  --tests "*DeploymentReviewActionServiceTest" \
  --tests "*GitHubServiceTest"

These exercise: deployer-is-reviewer auto-approve, defer when preventSelfReview, defer when deployer not reviewer, no-rule no-op, team-fallback path, GitHub failure → FAILED_AT_GITHUB, persist/webhook race retry, missing workflowRunId skip, idempotency on replayed webhook, in-app happy-approve / happy-decline / not-a-reviewer-403 / preventSelfReview-403 / already-resolved-409 / no-runId-409 / no-rule-409 / GitHub-fail-502 / sibling-CONSUMED_BY_OTHER. (208 server tests total; all green.)

End-to-end test matrix

The matrix below is what a reviewer should walk through to confirm the feature actually does what the PR promises. Each test takes ~1 min once the prerequisites are in place.

Test 1 — Auto-approval when the deployer is a required reviewer (the headline)

Setup: the protected env's required reviewers include your account; preventSelfReview is off.

  1. In Helios, open Repo → Environments and click Deploy on the protected environment.
  2. Wait for the workflow run to reach the protected job (a few seconds with a trivial workflow).

Expected:

  • The GitHub workflow run resumes automatically within ~1 s of hitting the waiting state — no clicks needed.
  • On GitHub: open the workflow run → "Review pending deployments" log shows comment Auto-approved by Helios (deployer is required reviewer).
  • DB: helios_deployment.auto_approval_decision = AUTO_APPROVED, auto_approval_at set; one new deployment_approval_request row with state=APPROVED, via=AUTO, reviewer_login=<you>.
  • Server log: Auto-approved deployment {id} on behalf of @<you> (AUTO_APPROVED).
Test 2 — Deferral + in-app Approve (the core Phase 2 path)

Setup: the protected env's required reviewer is someone else (not you). Easiest: temporarily replace yourself as required reviewer with a colleague's account, or use a second test account in a second browser profile.

  1. As you (not a reviewer), click Deploy.
  2. Wait until the workflow reaches the protected job.
  3. Switch to the reviewer's browser. Click the profile avatar (top-right).

Expected at the reviewer's side:

  • A warning badge appears on the profile button (within ~30 s — that's the poll interval).
  • The dropdown shows a Pending approvals entry with the same badge count.
  • Clicking it lands on /pending-approvals and the deployment is in the table with env, repo (clickable), the deployer's login, branch, commit SHA, and a link to the GitHub run.
  1. As the reviewer, click Approve on that row → confirm in the dialog.

Expected:

  • The mutation returns 200; a green toast says "Approved".
  • On GitHub the run resumes; the "Review pending deployments" log shows Approved by @<reviewer> via Helios (in-app).
  • DB: the audit row state=APPROVED, via=IN_APP, responded_at set; helios_deployment.auto_approval_decision=DEFERRED_TO_REVIEWERS (the original deferral decision is preserved as the audit anchor).
  • The badge on the reviewer's profile clears within ~30 s.
Test 3 — In-app Decline rejects the workflow for everyone

Setup: same as Test 2 — a pending approval the reviewer can act on.

  1. As the reviewer, click Decline on a row.
  2. The dialog should warn explicitly: "Declining will reject the workflow run on GitHub — the deployment will not happen and the run is marked as failed."
  3. Optionally type a reason ("wrong branch"). Click Decline deployment.

Expected:

  • The dialog closes; a warn-severity toast says "Declined".
  • The GitHub workflow run goes to failure / rejected state (not just paused).
  • GitHub's "Review pending deployments" log shows Declined by @<reviewer> via Helios (in-app): wrong branch (or without : … if no reason).
  • DB: state=DECLINED, via=IN_APP, responded_at set.
Test 4 — preventSelfReview honored

Setup: the env's required reviewers include your account; preventSelfReview is on.

  1. As yourself, click Deploy.
  2. Wait for the waiting event.

Expected:

  • GitHub run does not resume — no auto-approval.
  • DB: auto_approval_decision = DEFERRED_TO_REVIEWERS.
  • /pending-approvals for you does not show this row (you can't self-review).
  • If a second required reviewer exists, they can resolve it through the same Test-2 / Test-3 paths.
Test 5 — Not-a-required-reviewer can't approve

As a logged-in user who is not in the env's required-reviewer list, attempt the API directly:

curl -i -X POST http://localhost:8080/api/deployments/{deployment-id}/approve \
  -H "Authorization: Bearer $KEYCLOAK_BEARER"

Expected: HTTP 403 Forbidden, body contains You are not a required reviewer for this environment. The UI doesn't expose the button to non-reviewers, so this guards direct API misuse.

Test 6 — Already-resolved → 409 Conflict

After Test 2 succeeds, hit the same deployment id again:

curl -i -X POST http://localhost:8080/api/deployments/{same-id}/approve \
  -H "Authorization: Bearer $KEYCLOAK_BEARER"

Expected: HTTP 409 Conflict, body contains This deployment has already been resolved. Also confirms the first-responder race protection: if two reviewers had clicked at the same time, the slower one would see this exact response.

Test 7 — Idempotency on webhook redelivery (the subtle one)

Setup: a deployment from Test 1 (already AUTO_APPROVED).

  1. GitHub → repo Settings → Webhooks → (Helios webhook) → Recent Deliveries → find the deployment_status event with state=waiting for your run → Redeliver.

Expected:

  • The Helios server log shows Skipping reviewDeployment for {id}; already decided as AUTO_APPROVED.
  • No new deployment_approval_request row is created (still exactly one APPROVED row).
  • helios_deployment.auto_approval_decision is not overwritten — it stays AUTO_APPROVED.
  • No new GitHub API call (look for absence of an approveDeploymentOnBehalfOfUser log line).
Test 8 — Env without required reviewers is left alone

Setup: deploy to an env that has no required-reviewer rule (a wait-timer-only env, or no protection rules at all).

  1. Click Deploy.

Expected:

  • GitHub handles the deployment per its own rules (wait-timer expires, or no gate).
  • Helios stamps auto_approval_decision = NOT_APPLICABLE.
  • No deployment_approval_request rows are created.
  • Server log: No required-reviewer rule for environment {name}; leaving deployment {id} to GitHub.

Visual / UX smoke

  • The /pending-approvals page handles empty list ("Nothing waiting on you right now.") and loading skeleton correctly.
  • The Approve / Decline buttons disable while a mutation is in flight (no double-submits possible by spam-clicking).
  • The decline dialog blocks until the user types a reason or clicks "Decline deployment" with no reason — both should work; the GitHub-side comment changes accordingly.
  • The badge on the profile button updates within ~30 s of the underlying list changing (refetch interval).

Operator notes

  • Mirror the realm session lifespans on prod Keycloak: Admin Console → Realm Settings → Sessions → SSO Session Idle: 7 d, SSO Session Max: 30 d. The realm-export json in this PR is only the dev/staging seed.
  • Confirm the GitHub IDP on the prod realm still has Store Tokens + Stored Tokens Readable enabled (these are baked into the example realm; verify on prod).
  • helios.clientBaseUrl per profile is used today by the email pipeline; no change here, but Phase 3 will rely on it.

krusche added 2 commits May 28, 2026 10:56
Phase 1 of the in-Helios deployment approval flow. When GitHub fires
deployment_status: waiting for a Helios-originated deployment, decide
what to do based on the env's protection rules instead of blindly
impersonating the creator (which silently 403'd today when the creator
wasn't actually a required reviewer).

The new branching in ApprovalService.reviewDeployment:

- No required-reviewer rule on the env  → NOT_APPLICABLE (GitHub handles
  any wait-timer / branch-policy gating itself).
- Team-type reviewers present  → TEAM_REVIEWER_FALLBACK: legacy
  "impersonate deployer, try anyway" path (Helios doesn't yet expand
  teams to members; punting silently would regress working setups).
- preventSelfReview is set, or the deployer isn't in the User-type
  reviewer list  → DEFERRED_TO_REVIEWERS: deployment stays WAITING on
  GitHub; Phase 2 will surface it in an in-app pending-approvals list,
  Phase 3 will email reviewers.
- Otherwise  → AUTO_APPROVED on behalf of the deployer with a
  source-tagged comment ("Auto-approved by Helios (deployer is required
  reviewer)") so the audit trail in GitHub distinguishes auto vs in-app
  vs email-link.

Companion changes:

- DeploymentApprovalRequest entity + Flyway V53: one row per (deployment,
  reviewer) tracking the decision, with state machine including
  FAILED_AT_GITHUB (clicked-but-not-yet-accepted-by-GitHub is
  distinguishable from "missed click") and Via in {AUTO, IN_APP,
  EMAIL_LINK}. Indices on (deployment, state), (reviewer, state), and a
  partial unique on token_hash for Phase 3.
- HeliosDeployment gains PENDING_DISPATCH status (reserved for a future
  refactor; today's persist-before-dispatch is already in place) plus
  autoApprovalDecision + autoApprovalAt for at-a-glance audit on the
  deployment row.
- EnvironmentService.resolveReviewers / isUserRequiredReviewer: leaner
  hot-path companions to the DTO-returning getEnvironmentReviewers.
- GitHubService.approveDeploymentOnBehalfOfUser gains a comment
  parameter (legacy 4-arg overload preserved for callers passing the
  default). New rejectDeploymentOnBehalfOfUser(...) for Phase 2's
  decline path. Token-exchange failure now throws IOException instead
  of silently returning, so callers can surface FAILED_AT_GITHUB.
- A bounded retry against the persist-vs-webhook race: three lookups
  for the HeliosDeployment with 100ms/500ms/2s backoff before treating
  the event as a non-Helios deployment.
- Keycloak realm: ssoSessionMaxLifespan 8h→30d, ssoSessionIdleTimeout
  8h→7d (the actual root cause of users getting bounced back through
  the GitHub login roughly daily). Prod Keycloak must mirror these.

Tests: new ApprovalServiceTest covering each branch + retry + GitHub
failure recovery; updated GitHubServiceTest to reflect the new
"throws on token-exchange failure" semantics and exercise both the
comment parameter and the reject path.
Phase 2 of the in-Helios deployment approval flow. Adds a cross-repo
pending-approvals page where a logged-in required reviewer can resolve
deployments waiting for them without ever opening the GitHub UI.

Server:
- DeploymentReviewActionService.approveAsCurrentUser /
  declineAsCurrentUser take a pessimistic write lock on the
  HeliosDeployment row, validate the caller is a User-type required
  reviewer for the env (and not the deployer when preventSelfReview
  is on), then impersonate the caller via the Keycloak token-exchange
  broker to POST state=approved/rejected to GitHub's
  /actions/runs/{id}/pending_deployments. On GitHub failure, persist
  FAILED_AT_GITHUB on the audit row and return 502 so the UI can offer
  a retry without conflating "user clicked Approve" with "GitHub
  accepted Approve".
- First-responder race: every successful click flips sibling PENDING
  rows for the same deployment to CONSUMED_BY_OTHER inside the same
  transaction, so the loser sees a clean "already handled" message.
- HTTP surface: GET /api/deployments/pending-approvals (caller's
  outstanding requests, joined with deployment + env + repo for the
  UI table), POST /api/deployments/{id}/approve,
  POST /api/deployments/{id}/decline {comment?}.
- HeliosDeploymentRepository.findByIdForUpdate (pessimistic write
  lock) for the lock target. DeploymentApprovalRequestRepository gains
  findPendingForReviewer (join-fetched for the list) and
  lockByDeploymentAndReviewerLogin for transactional updates.

Client:
- New /pending-approvals route + page. Table with Approve / Decline
  per row, decline opens a dialog with an optional reason and an
  explicit "this will reject the workflow run on GitHub for everyone"
  warning. Polls every 30 s so a reviewer who keeps the tab open sees
  new pending items.
- Badge on the profile button + a "Pending approvals" entry in the
  profile popover, both driven by the same query — visible across
  every page (the queue is cross-repo).

Tests: DeploymentReviewActionServiceTest covering the eight code
paths (happy approve / happy decline / not-a-reviewer 403 /
preventSelfReview 403 / already-resolved 409 / no-runId 409 /
no-rule 409 / GitHub failure 502 / sibling consumption); spec for
ProfileNavSectionComponent updated to provide a QueryClient for the
new badge query.
@codacy-production

codacy-production Bot commented May 28, 2026

Copy link
Copy Markdown

Not up to standards ⛔

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🔴 Metrics 191 complexity

Metric Results
Complexity ⚠️ 191 (≤ 20 complexity)

View in Codacy

🟢 Coverage 65.07% diff coverage · +0.51% coverage variation

Metric Results
Coverage variation +0.51% coverage variation (-1.00%)
Diff coverage 65.07% diff coverage

View coverage diff in Codacy

Coverage variation details
Coverable lines Covered lines Coverage
Common ancestor commit (23c77b6) 15036 7250 48.22%
Head commit (dfbe636) 15393 (+357) 7500 (+250) 48.72% (+0.51%)

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details
Coverable lines Covered lines Diff coverage
Pull request (#1098) 375 244 65.07%

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces Phase 1 + Phase 2 of an in-Helios deployment-approval flow for protected GitHub environments. It replaces today's blind "impersonate the deployer" auto-approval with a branch that respects the env's required-reviewer rule, and adds an in-app approve/decline UI so reviewers no longer need to bounce to the GitHub UI. It also extends the Keycloak example realm's SSO session lifespans (8h → 7d idle / 30d max).

Changes:

  • Adds DeploymentApprovalRequest entity + Flyway V53 (with auto_approval_decision / auto_approval_at on helios_deployment); ApprovalService now branches on ReviewerResolution with a bounded retry for the webhook-vs-dispatch race.
  • Adds DeploymentReviewActionService + DeploymentApprovalController (/api/deployments/{id}/approve|decline, /pending-approvals) with SELECT FOR UPDATE row locking; GitHubService.approveDeploymentOnBehalfOfUser gains a comment parameter and a new rejectDeploymentOnBehalfOfUser; token-exchange null now throws IOException.
  • Adds the /pending-approvals Angular page, regenerated openapi client, profile-nav badge polling, and bumps helios-example-realm.json SSO session lifespans.

Reviewed changes

Copilot reviewed 29 out of 29 changed files in this pull request and generated no comments.

Show a summary per file
File Description
server/.../approval/ApprovalService.java Reviewer-aware auto-approval branching + 3-attempt persist retry
server/.../approval/DeploymentReviewActionService.java In-app approve/decline with pessimistic deployment-row lock + sibling consume
server/.../approval/DeploymentApprovalController.java REST endpoints for approve / decline / pending-approvals
server/.../approval/DeploymentApprovalRequest{,Repository}.java New entity + repo with locked-by-reviewer and pending-for-reviewer queries
server/.../approval/PendingApprovalDto.java, ReviewDeploymentRequest.java DTOs for list + decline body
server/.../environment/EnvironmentService.java resolveReviewers / isUserRequiredReviewer helpers + ReviewerResolution record
server/.../github/GitHubService.java New comment overload, rejectDeploymentOnBehalfOfUser, null token now throws
server/.../heliosdeployment/HeliosDeployment.java autoApprovalDecision field, AutoApprovalDecision enum, PENDING_DISPATCH status
server/.../heliosdeployment/HeliosDeploymentRepository.java findByIdForUpdate with PESSIMISTIC_WRITE lock
server/.../deployment/LatestDeploymentUnion.java Maps new PENDING_DISPATCH status to REQUESTED
server/.../db/migration/V53__deployment_approval_request.sql Creates table, indexes, and helios_deployment columns
server/.../openapi.yaml Adds approval endpoints and DTO schemas
server/.../GitHubServiceTest.java Updates token-exchange-null assertion, adds comment-passthrough and reject tests
server/.../ApprovalServiceTest.java, DeploymentReviewActionServiceTest.java Net-new coverage for auto-approval branches and in-app actions
client/src/app/pages/pending-approvals/* New cross-repo pending-approvals page with approve/decline
client/src/app/components/profile-nav-section/* Pending-approvals badge driven by polled query
client/src/app/app.routes.ts Adds /pending-approvals route
client/src/app/core/modules/openapi/* Regenerated client (types, sdk, schemas, tanstack helpers, index)
helios-example-realm.json Bumps SSO idle (8h→7d) and max lifespan (8h→30d)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Fixes uncovered by a careful second-pass review of the approval flow.
Reliability-grade issues; no functional addition.

- **Idempotency against webhook redelivery (P0).** ApprovalService now
  short-circuits at the top when HeliosDeployment.autoApprovalDecision
  is already set. Without this guard, a replayed deployment_status
  webhook would re-enter the auto-approve branch, append a spurious
  FAILED_AT_GITHUB audit row (because GitHub 422s an already-approved
  deployment), and *overwrite* AUTO_APPROVED with DEFERRED_TO_REVIEWERS.

- **Audit row survived the BAD_GATEWAY throw (P0).** Annotated
  DeploymentReviewActionService's @transactional methods with
  noRollbackFor = ResponseStatusException.class. Spring's default
  rollback-on-RuntimeException silently discarded the FAILED_AT_GITHUB
  row that the in-app path persists just before re-throwing as 502 —
  the UI had no audit trail to retry against.

- **Atomic audit-row + deployment-stamp (P2).** Extracted the
  finalisation pair into a new ApprovalDecisionWriter @service with one
  transaction per outcome (recordSuccess / recordFailure /
  recordDecisionOnly). A crash between the two writes can no longer
  leave the database half-stamped. Done as a separate bean because
  @transactional via self-invocation in ApprovalService would be
  bypassed by Spring's proxy.

- **JPQL nested-enum literal → parameter (P1).** Repository's
  findPendingForReviewer used the non-portable `$` separator
  (`...DeploymentApprovalRequest$State.PENDING`); switched to a typed
  State parameter and renamed to findByReviewerAndState. Also added
  `left join fetch d.creator` so PendingApprovalDto.from doesn't trip
  LazyInitialization if creator is ever flipped to LAZY or OSIV is
  turned off in prod hardening.

- **Controller readOnly transaction (P1).** DeploymentApprovalController
  .myPendingApprovals is now @transactional(readOnly = true) so the
  Hibernate session stays open across the DTO mapping (defensive against
  OSIV=false).

- **Removed unused PENDING_DISPATCH (P1).** The enum value was declared
  but never assigned, and several callers (EnvironmentService
  .isActiveDeploymentStatus, DeploymentService.canRedeploy, the stuck-
  deployment recovery queries) didn't handle it — silent reliability
  gaps the moment anything started producing it. Reintroduce together
  with its producer when persist-before-dispatch is actually wired.

- **expires_at nullable (P3).** AUTO and IN_APP rows have no TTL; the
  old NOT NULL forced us to seed `now()`, which made them look already-
  expired to any future sweep filtering on expires_at. Migration now
  leaves the column nullable; entity matches.

- **Shorter retry budget (P2).** Persist-vs-webhook race retry tightened
  from {100ms, 500ms, 2000ms} (2.6s worst case) to {100ms, 300ms} (400ms
  worst case) — easily enough for the in-process transaction commit
  race, and far less of a NATS dispatcher latency tail.

- **Client consistency (P4).** Converted the four leftover `*ngIf` uses
  in pending-approvals.component.html to the `@if` control flow that
  the rest of the codebase uses.

ApprovalServiceTest reworked to mock the new ApprovalDecisionWriter and
to cover the idempotency-skip path. The DeploymentReviewActionServiceTest
still covers FAILED_AT_GITHUB via its 502-throw assertion, but the
true rollback-vs-audit-survival behavior is only verifiable in a real
DB integration test — flagged for follow-up; the noRollbackFor annotation
plus a code comment ensures the production behavior is correct.
krusche added a commit that referenced this pull request May 28, 2026
The PR's V52__add_workflow_job_and_runner_inventory.sql collides with
V52__add_deployment_workflow_run_id_index.sql that landed on staging
later. Flyway rejects duplicate version numbers
(CompositeMigrationResolver line 93), which is what tanked
server-tests and validate-migrations on the last CI run.

Bumped to V54 (V53 is reserved for the in-flight #1098 approval-flow
migration, so this avoids a second collision when that lands).

No SQL changes — pure file rename. Migration content is identical.
krusche and others added 2 commits May 28, 2026 14:56
When a deployment is deferred to required reviewers (DEFERRED_TO_REVIEWERS
branch), email every User-type reviewer with a "Review in Helios" button
that lands on /pending-approvals?focus=<deploymentId>. The link is just a
deep-link — auth is the existing Keycloak login, not a one-shot token —
so the same endpoints used by the in-app flow (Phase 2) handle the
approve/decline POST. No new attack surface; no token TTL to manage.

Server
- New notification type DEPLOYMENT_APPROVAL_REQUEST with Flyway V54 that
  widens the chk_notification_type CHECK and backfills enabled=true rows
  in notification_preference (default-on for existing users; the
  per-user toggle works from day one).
- DeploymentApprovalRequestPayload record carries username, env, repo,
  creator, branch, sha, workflow URL, deploymentId — the template
  resolves ${heliosBaseUrl}/pending-approvals?focus=${deploymentId}
  from the notification service's configured client_url.
- ApprovalService.notifyReviewers() runs inside DEFERRED_TO_REVIEWERS:
  for each User-type reviewer (skipping creator when preventSelfReview),
  upserts a PENDING DeploymentApprovalRequest row with emailSentAt and
  publishes via NatsNotificationPublisherService. Reviewers without a
  Helios User row are logged and skipped — Team-type expansion is the
  Phase 4 gap and is acknowledged in code.

Client
- /pending-approvals reads ?focus=<id> via ActivatedRoute → toSignal,
  scrolls the matching row into view, and pulses a 2.4s highlight.
  The query-param signal handles both pre- and post-login arrivals
  cleanly (login redirect interleaves before mount).
- alreadyFocusedId guards against the 30s polling refetch re-scrolling
  the user mid-read.

Tests
- ApprovalServiceTest: persistsPendingRowAndPublishesEmailForEachReviewer,
  skipsCreatorWhenPreventSelfReviewBlocksThem, skipsReviewersWithoutHeliosUserRow.
- NotificationPreferenceServiceTest: replaced the hard-coded times(2)
  with Type.values().length - 1 so future enum additions don't break it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@krusche

krusche commented May 29, 2026

Copy link
Copy Markdown
Member Author

@Claudia-Anthropica review

@Claudia-Anthropica

Copy link
Copy Markdown
Collaborator

@krusche on it! Taking a look now 👀

@Claudia-Anthropica Claudia-Anthropica left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@krusche The feature direction looks solid, but the approval flow still has a couple of correctness gaps around retries and webhook redelivery. CI is also still failing because the generated OpenAPI spec differs from openapi.yaml for the new notification enum.

Carefully re-reviewed PR #1098 across concurrency, security, data model,
transactions/error-handling, GitHub integration, reviewer resolution,
frontend, notifications, and tests. This commit lands the CI-blocking
fixes, the resolved checkstyle review threads, and the confirmed defects.

CI unblock
- Regenerate openapi.yaml + the generated TS client: Phase 3 added
  DEPLOYMENT_APPROVAL_REQUEST to NotificationPreference.Type but the spec
  and client union were stale, failing the "Validate OpenAPI Spec" check.

Checkstyle (the 5 open reviewdog threads)
- Rename rejectsCallWhenUserIsNotARequiredReviewer /
  defersWhenDeployerIsNotARequiredReviewer to drop the "AR" consecutive
  caps (AbbreviationAsWordInName).
- Add the missing summary sentence to DeploymentReviewActionService.review.
- Make bob/carol/row final (VariableDeclarationUsageDistance).

Confirmed defects
- notifyReviewers: HTML-escape the GitHub-/user-sourced free-text fields
  (login, env, repo, branch, sha) before they reach the email template —
  the notification module substitutes placeholders into HTML with no
  escaping, so a crafted branch name could break or inject into the mail.
- Auto-approve failure on the happy path now also notifies reviewers: a
  transient GitHub/token failure stamped DEFERRED_TO_REVIEWERS but never
  created pending rows or sent email, stranding the deployment. It now
  mirrors the explicit defer branch.
- resolveReviewers: an unparseable reviewers blob no longer collapses to a
  present-but-empty resolution (which silently deferred to zero reviewers
  and notified nobody); it now returns Optional.empty so the deployment is
  left for manual review and the failure is logged loudly.
- user-settings: add the DEPLOYMENT_APPROVAL_REQUEST label so the
  backfilled, default-on preference toggle no longer renders blank.
- privacy policy: disclose the new DEPLOYMENT_APPROVAL_REQUEST event.

Docs / tests
- Document the mixed User+Team reviewer precedence (team-fallback wins),
  the null prevent_self_review default, and emailSentAt's enqueue (not
  delivery) semantics.
- Strengthen the deferral test to assert the email payload's deploymentId,
  environmentName and creatorLogin (not any()); assert the decline path
  returns a DECLINED/IN_APP row.

Verified: server tests (Approval/DeploymentReviewAction/GitHub/
NotificationPreference/Environment) green, checkstyle 10.21.0 clean on all
touched files, client build + lint + 198 unit tests green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@Claudia-Anthropica Claudia-Anthropica left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@krusche The previous blockers are still open in the current code, so I am keeping this at changes requested. The approval decision still is not serialized before the GitHub call, deferred reviewer rows are still not persisted atomically with the decision stamp, and failed GitHub attempts still disappear from the pending-approvals retry queue.

Addresses three substantive review comments on the approval flow:

- FAILED_AT_GITHUB rows are now surfaced in the in-app pending-approvals
  list (and therefore the profile badge, which counts the list). The PR
  promises these rows let the UI offer a clean retry, but the query only
  returned PENDING, so after a transient GitHub failure the reviewer's
  queue looked empty while the deployment was still waiting. The list now
  includes PENDING + FAILED_AT_GITHUB; clicking Approve/Decline reuses the
  same row to retry. New findByReviewerAndStateIn join-fetch query.

- ApprovalDecisionWriter now re-loads the deployment under a SELECT FOR
  UPDATE lock and re-checks that no decision was recorded before writing.
  The webhook auto path reads the row unlocked then calls GitHub before
  persisting, so a concurrent in-app approve/decline (which holds the row
  lock for its whole transaction) could otherwise overwrite a recorded
  decision or write a duplicate audit row. The first decision now wins;
  a losing writer becomes a no-op.

- The deferral branch now persists the per-reviewer pending rows (and
  publishes the emails) BEFORE stamping DEFERRED_TO_REVIEWERS. The stamp
  is the idempotency gate: stamping first and then failing in
  notifyReviewers would let a redelivered webhook short-circuit, stranding
  the deployment with no in-app rows and nobody notified. Stamping last
  leaves autoApprovalDecision null on failure so redelivery re-runs the
  deferral.

Reverted the earlier notify-on-auto-fallback change: on the happy path the
deployer is themselves a required reviewer, so the FAILED_AT_GITHUB row now
surfaces in their own pending list for retry — and the notify-on-fallback
conflicted with the lock-recheck above (it would email reviewers for a
deployment a concurrent in-app click had already resolved).

Verified: Approval/DeploymentReviewAction/GitHub/NotificationPreference/
Environment tests green; checkstyle 10.21.0 clean; no OpenAPI drift.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants