feat(queue): GitHub Actions queue monitoring, ETA, and SLO alerts#1046

Open
krusche wants to merge 9 commits into staging from feat/queue-monitoring

krusche commented May 18, 2026

Member

Summary

Adds queue visibility to Helios — per-label-set queue depth, self-hosted runner inventory, build-time percentiles, per-job ETA, stuck-job classification, and SLO alerts — closing the regression Artemis maintainers see vs the previous Bamboo dashboard.

All scheduled jobs and controllers are gated by helios.queue.enabled, which defaults to false. The Flyway migration runs regardless (table creation only); rollback is a forward-fix migration, not a flag flip.

What's in here

  • DB (V51): workflow_job, runner, queue_wait_stat, queue_alert_rule, queue_alert_event, with partial index on status='queued', GIN on labels, FK to repository(repository_id).
  • Ingestion: GitHubWorkflowJobMessageHandler now also persists durable workflow_job rows + a Caffeine hot index, wrapped in try/catch so a failure cannot poison the existing deployment-timing path. New GitHubSelfHostedRunnerMessageHandler for org-level events.
  • Reconcilers (each @ConditionalOnProperty helios.queue.enabled): runner inventory (60s), in-progress job filler with last_reconcile_attempt_at backoff (30s), stuck-job classifier (60s), hourly p50/p90/p95 rollup (5min), 30-day backfill (admin-triggered, self-throttling to 180 req/min).
  • REST layer: EtagCache + GitHubRestClient with If-None-Match / 304 reuse and rate-limit metrics.
  • ETA: label-superset capacity, 3s Caffeine cache, configurable GitHub-hosted concurrency ceiling.
  • Alerts: QueueAlertEvaluator with dedup-via-open-events, quiet-hours cron, email channel + template; pluggable AlertChannel interface.
  • Controllers: WorkflowQueueController (depth/jobs/stats/alerts CRUD/backfill, mutations gated by @EnforceAtLeastWritePermission) and RunnerController (@PreAuthorize("isAuthenticated()")).
  • Client: ThemeService extracted from MainLayoutComponent so the new HeliosLineChartComponent (PrimeNG <p-chart> + Chart.js) can react to dark-mode toggles. Routes: /repo/:id/ci-cd/queue (overview, runners, stats, alerts) + admin-only top-level /queue.
  • Tests: 70 new (55 server, 15 client); 426-test server suite + client suite green.
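
The If-None-Match / 304 reuse named in the REST layer item can be illustrated with a minimal Java sketch. This is an assumption-laden illustration, not the PR's EtagCache: the class name EtagCacheSketch, the methods ifNoneMatch and resolve, and the example URL are all hypothetical, and the HTTP call itself is left out so the sketch runs without network access.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of conditional-GET caching; not the PR's EtagCache. */
final class EtagCacheSketch {

    /** One cached response: the validator the server returned and the body it belongs to. */
    record Entry(String etag, String body) {}

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();

    /** Value to send as If-None-Match for this URL, if the URL was fetched before. */
    Optional<String> ifNoneMatch(String url) {
        return Optional.ofNullable(entries.get(url)).map(Entry::etag);
    }

    /**
     * Resolves the body for a response. A 304 reuses the cached body;
     * a 200 that carries an ETag replaces the cache entry.
     */
    String resolve(String url, int status, String etag, String body) {
        if (status == 304) {
            Entry cached = entries.get(url);
            if (cached == null) {
                throw new IllegalStateException("304 without a cached entry for " + url);
            }
            return cached.body();
        }
        if (status == 200 && etag != null) {
            entries.put(url, new Entry(etag, body));
        }
        return body;
    }

    public static void main(String[] args) {
        EtagCacheSketch cache = new EtagCacheSketch();
        String url = "https://api.github.com/orgs/example/actions/runners"; // illustrative URL
        cache.resolve(url, 200, "\"abc\"", "[runner-1]");      // first fetch populates the cache
        System.out.println(cache.ifNoneMatch(url).orElse("<none>"));
        System.out.println(cache.resolve(url, 304, null, null)); // 304 returns the cached body
    }
}
```

A caller would send the value from ifNoneMatch as the If-None-Match request header and pass the response status, ETag header, and body to resolve.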

Known follow-ups (from deep code review, fix before enabling in prod)

  1. WorkflowJobBackfillService.start() calls this.runAsync() — Spring AOP doesn't proxy self-invocations; the admin POST will block.
  2. HeliosLineChartComponent uses type: 'time' but no module imports chartjs-adapter-date-fns — stats page will throw at runtime.
  3. Manual queueApi() / runnerApi() HttpClient calls don't go through BearerInterceptor — auth will 403 once @PreAuthorize is reached.
  4. LabelSets.hash uses empty separator → ["a","bc"] collides with ["ab","c"].
  5. QueueIndexService decrements on every redelivery of in_progress/completed — counter drifts.
  6. QueueAlertEvaluator.inQuietHours treats cron as a fire-moment, not a duration window — only suppresses for one minute per night.
  7. WorkflowQueueController.stats averages per-bucket p95s (statistically wrong).
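
Follow-up 4 can be reproduced in isolation. The sketch below is a hypothetical stand-in for LabelSets.hash, not the PR's code: the class name LabelHashSketch, the separator parameter, and the choice of a newline as delimiter are assumptions, and it uses SHA-256 only for illustration. It shows that joining labels with an empty separator maps both inputs to the string "abc", while a delimiter that cannot occur inside a label keeps them apart.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.List;

/** Hypothetical stand-in for a label-set hash; not the PR's LabelSets. */
final class LabelHashSketch {

    /** Joins the labels with the given separator and returns the hex SHA-256 of the result. */
    static String hash(List<String> labels, String separator) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] joined = String.join(separator, labels).getBytes(StandardCharsets.UTF_8);
            return HexFormat.of().formatHex(digest.digest(joined));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is available on every compliant JVM
        }
    }

    public static void main(String[] args) {
        List<String> first = List.of("a", "bc");
        List<String> second = List.of("ab", "c");
        // Empty separator: both lists join to "abc", so the hashes are equal.
        System.out.println(hash(first, "").equals(hash(second, "")));     // true
        // Newline separator (assumed never to appear in a label): the inputs differ.
        System.out.println(hash(first, "\n").equals(hash(second, "\n"))); // false
    }
}
```

A length-prefixed encoding would be an alternative if no character can be ruled out of label values.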

The detailed plan lives at /Users/krusche/.claude/plans/can-you-clone-the-twinkly-cosmos.md; reviews of the plan + the implementation were performed before opening this PR.

Test plan

  • ./gradlew :application-server:test — full server suite passes locally (426 tests).
  • cd client && npx vitest run — client unit suite passes locally.
  • ./gradlew flywayValidate against a copy of staging — V51 applies cleanly.
  • Manual staging walkthrough once GitHub App permissions are updated (subscribe self_hosted_runner, add org administration:read):
    • Set helios.queue.enabled=true; webhook job appears in /ci-cd/queue within 5s.
    • Take a self-hosted runner offline; transitions to OFFLINE within 60s.
    • Trigger a low QUEUE_P95_OVER rule; email fires once and auto-clears.
    • POST /api/queue/admin/backfill; rate-limit metric stays healthy.
  • Address the 7 known follow-ups above before promoting to staging.

🤖 Generated with Claude Code

Adds visibility into GitHub Actions queue state — per-label-set depth,
self-hosted runner inventory, build-time percentiles, per-job ETA,
stuck-job classification, and SLO alerts — closing the regression
Artemis maintainers see versus their previous Bamboo dashboard.

Server (Spring Boot, gated by helios.queue.enabled=false):
- V51 migration: workflow_job, runner, queue_wait_stat, queue_alert_rule,
  queue_alert_event (with partial index on status='queued', GIN on labels,
  FK to repository(repository_id))
- WorkflowJobPersistenceService runs alongside the existing deployment
  timing path inside try/catch so failures cannot poison NATS redelivery
- New GitHubSelfHostedRunnerMessageHandler for org-level events
- EtagCache + GitHubRestClient with If-None-Match + 304 reuse and
  rate-limit metrics; reconcilers (runner inventory, in-progress job
  filler with last_reconcile_attempt_at backoff, hourly p50/p90/p95
  rollup, 30-day backfill that self-throttles to 180 req/min)
- QueueEtaService with label-superset capacity, 3s Caffeine cache,
  configurable GitHub-hosted concurrency ceiling
- StuckJobClassifier (PENDING_APPROVAL / NO_RUNNER_ONLINE / RUNNERS_BUSY
  / CONCURRENCY_LOCK / UNKNOWN), WorkflowYamlCache (snakeyaml)
- QueueAlertEvaluator with dedup via open events, quiet-hours cron
- WorkflowQueueController + RunnerController + DTOs
- Email template + QueueAlertEmailPayload, 3 new NotificationPreference.Type
  values, findUsersByTypeEnabled query
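
The "label-superset capacity" used by QueueEtaService rests on a simple rule: a runner can take a job only if the runner's labels include every label the job requests. The Java sketch below illustrates that rule under assumptions; LabelSupersetSketch, its Runner record, and the capacity method are hypothetical names and do not mirror the PR's entities or its busy/idle handling.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of label-superset matching; not the PR's QueueEtaService. */
final class LabelSupersetSketch {

    /** Minimal runner view: name, labels, and whether it is currently online. */
    record Runner(String name, Set<String> labels, boolean online) {}

    /** A runner is eligible when its labels are a superset of the job's requested labels. */
    static boolean canServe(Runner runner, Set<String> jobLabels) {
        return runner.labels().containsAll(jobLabels);
    }

    /** Counts online runners that could pick up a job with the given labels. */
    static long capacity(List<Runner> runners, Set<String> jobLabels) {
        return runners.stream()
            .filter(Runner::online)
            .filter(r -> canServe(r, jobLabels))
            .count();
    }

    public static void main(String[] args) {
        List<Runner> runners = List.of(
            new Runner("r1", Set.of("self-hosted", "linux", "x64"), true),
            new Runner("r2", Set.of("self-hosted", "linux", "x64", "gpu"), true),
            new Runner("r3", Set.of("self-hosted", "linux", "x64", "gpu"), false));
        System.out.println(capacity(runners, Set.of("self-hosted", "linux"))); // 2
        System.out.println(capacity(runners, Set.of("self-hosted", "gpu")));   // 1
    }
}
```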

Client (Angular 20):
- ThemeService extracted from MainLayoutComponent so the new
  HeliosLineChartComponent (PrimeNG p-chart + Chart.js) can observe
  dark-mode toggles
- /repo/:repositoryId/ci-cd/queue routes: overview, runners, stats,
  alerts; admin-only top-level /queue
- Manual queue.api.ts pending OpenAPI regen against the new controllers

Tests: 70 new (55 server, 15 client) + 2 pre-existing tests adjusted
for the payload-record arity change and the 3 new enum values. Full
server suite (426 tests) and client suite green.

Known follow-ups from deep review (tracked but not yet fixed):
- @Async self-invocation in WorkflowJobBackfillService.start()
- Chart.js time-axis adapter import missing
- Manual queueApi() lacks BearerInterceptor wiring
- LabelSets.hash separator collision for adjacency boundary inputs
- QueueIndexService drifts on redelivered status updates
- QueueAlertEvaluator.inQuietHours treats cron as a moment, not a window

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 18, 2026 13:07
krusche requested a review from a team as a code owner May 18, 2026 13:07
codacy-production Bot commented May 18, 2026

Not up to standards ⛔

🟢 Issues 0 issues

Results:
0 new issues

🔴 Metrics 751 complexity

Metric Results
Complexity ⚠️ 751 (≤ 20 complexity)

🟢 Coverage 65.79% diff coverage · +2.21% coverage variation

Metric Results
Coverage variation +2.21% coverage variation (-1.00%)
Diff coverage 65.79% diff coverage

Coverage variation details

| Commit | Coverable lines | Covered lines | Coverage |
| --- | --- | --- | --- |
| Common ancestor commit (aef5d83) | 8429 | 3961 | 46.99% |
| Head commit (3e7e351) | 9490 (+1061) | 4669 (+708) | 49.20% (+2.21%) |

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details

| Scope | Coverable lines | Covered lines | Diff coverage |
| --- | --- | --- | --- |
| Pull request (#1046) | 1061 | 698 | 65.79% |

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%

Replaces / augments the previous smoke-test scaffolding with tests
that exercise full code paths and pin behaviour matching real bugs
called out in the deep review.

Server (8 new + 2 expanded):
- InProgressJobReconcilerFullPathTest — happy path: REST call fires,
  runner_id/labels/runner_kind filled in; last_reconcile_attempt_at
  touched even when REST returns 304/empty; one REST call per unique
  workflow run regardless of job count.
- EmailAlertChannelTest — recipient resolution per rule kind, per-user
  failure isolation, no-recipients no-op, RUNNER_OFFLINE → correct
  preference type.
- StuckJobClassifierEndToEndTest — exercises the public classify()
  loop end-to-end, asserts is_stuck/queued_reason/stuck_detected_at
  persistence for each candidate.
- WorkflowJobBackfillServiceTest — running flag toggle, double-start
  semantics (sentinel for the @Async self-invocation bug).
- WorkflowJobPersistenceServiceTest — added idempotent-re-upsert and
  status-case-preservation tests (the partial index WHERE
  status='queued' is case-sensitive in Postgres).
- QueueIndexServiceDriftTest — @Disabled sentinels for the
  redelivery-drift bug (PR #1046 follow-up #5).
- QuietHoursWindowTest — @Disabled sentinels for the cron-as-moment
  vs window bug (PR #1046 follow-up #6).
- QueueStatsAveragingTest — @Disabled sentinel for the
  per-bucket-percentile averaging bug (PR #1046 follow-up #7).

Client (1 new + 1 expanded):
- helios-line-chart.component.spec.ts — datasets per series, palette
  uniqueness, options rebuild when dark mode toggles.
- theme.service.spec.ts — adds DOM-side-effect coverage (the
  dark-mode-enabled class on <html> follows the signal via effect()).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

Adds GitHub Actions queue monitoring to Helios, spanning persistence, reconciliation, alerting, REST APIs, notification delivery, documentation, and Angular queue dashboards.

Changes:

  • Adds queue/runner database schema, ingestion, reconciliation, ETA, stuck classification, and alert evaluation.
  • Adds REST controllers and Angular pages/components for queue depth, jobs, runners, stats, and alerts.
  • Adds queue alert email templates, notification preference types, admin docs, and supporting tests.

Reviewed changes

Copilot reviewed 85 out of 85 changed files in this pull request and generated 45 comments.

Show a summary per file
File Description
server/notification/src/main/resources/email-templates/queue-alert.html Adds queue alert email template.
server/application-server/src/test/java/de/tum/cit/aet/helios/workflow/queue/WorkflowJobPersistenceServiceTest.java Tests workflow job persistence derivation/upsert behavior.
server/application-server/src/test/java/de/tum/cit/aet/helios/workflow/queue/web/WorkflowQueueControllerTest.java Tests queue depth/jobs controller responses.
server/application-server/src/test/java/de/tum/cit/aet/helios/workflow/queue/web/RunnerControllerTest.java Tests runner listing/pool endpoints.
server/application-server/src/test/java/de/tum/cit/aet/helios/workflow/queue/StuckJobClassifierTest.java Tests stuck-job classification paths.
server/application-server/src/test/java/de/tum/cit/aet/helios/workflow/queue/reconcile/RunnerInventoryReconcilerTest.java Tests runner inventory reconciliation.
server/application-server/src/test/java/de/tum/cit/aet/helios/workflow/queue/reconcile/InProgressJobReconcilerTest.java Tests in-progress job reconciler no-op/backoff behavior.
server/application-server/src/test/java/de/tum/cit/aet/helios/workflow/queue/QueueIndexServiceTest.java Tests hot queue counter behavior.
server/application-server/src/test/java/de/tum/cit/aet/helios/workflow/queue/QueueEtaServiceTest.java Tests ETA capacity and hosted-runner behavior.
server/application-server/src/test/java/de/tum/cit/aet/helios/workflow/queue/LabelSetsTest.java Tests label canonicalization/hash/runner kind helpers.
server/application-server/src/test/java/de/tum/cit/aet/helios/workflow/queue/github/GitHubSelfHostedRunnerMessageHandlerTest.java Tests self-hosted runner webhook handling.
server/application-server/src/test/java/de/tum/cit/aet/helios/workflow/queue/alert/QueueAlertEvaluatorTest.java Tests alert firing, dedup, clearing, and quiet-hours logic.
server/application-server/src/test/java/de/tum/cit/aet/helios/workflow/github/GitHubWorkflowJobTimingServiceTest.java Updates workflow job payload construction for new fields.
server/application-server/src/test/java/de/tum/cit/aet/helios/workflow/github/GitHubWorkflowJobMessageHandlerTest.java Tests workflow job handler ordering/failure isolation.
server/application-server/src/test/java/de/tum/cit/aet/helios/notification/NotificationPreferenceServiceTest.java Updates notification preference default expectations.
server/application-server/src/test/java/de/tum/cit/aet/helios/github/EtagCacheTest.java Tests ETag cache behavior.
server/application-server/src/main/resources/db/migration/V51__add_workflow_job_and_runner_inventory.sql Adds queue, runner, stats, and alert tables/indexes.
server/application-server/src/main/resources/application-staging.yml Adds staging queue/GitHub REST configuration.
server/application-server/src/main/resources/application-prod.yml Adds production queue/GitHub REST configuration.
server/application-server/src/main/resources/application-dev.yml Adds development queue/GitHub REST configuration.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/WorkflowYamlCache.java Adds cached workflow YAML fetch/parse helper.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/WorkflowJobRepository.java Adds workflow job repository queries.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/WorkflowJobPersistenceService.java Persists workflow_job webhook payloads.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/WorkflowJob.java Adds workflow job JPA entity.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/web/WorkflowQueueController.java Adds queue stats/jobs/depth/alerts/backfill API.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/web/RunnerController.java Adds runner inventory API.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/web/QueueDtos.java Adds queue/runner/alert DTO records.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/StuckJobClassifier.java Adds scheduled stuck-job classification.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/RunnerRepository.java Adds runner repository queries/update.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/Runner.java Adds runner JPA entity.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/reconcile/WorkflowJobBackfillService.java Adds admin-triggered workflow job backfill.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/reconcile/RunnerInventoryReconciler.java Adds scheduled runner inventory polling.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/reconcile/QueueWaitStatRollup.java Adds hourly queue/run percentile rollup.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/reconcile/InProgressJobReconciler.java Adds job runner/label reconciliation.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/QueueWaitStatRepository.java Adds queue stats repository queries.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/QueueWaitStat.java Adds queue stats JPA entity.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/QueueIndexService.java Adds hot queue counter service.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/QueueEtaService.java Adds queue ETA calculation service.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/QueueAlertRuleRepository.java Adds alert rule repository.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/QueueAlertRule.java Adds alert rule JPA entity.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/QueueAlertEventRepository.java Adds alert event repository.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/QueueAlertEvent.java Adds alert event JPA entity.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/LabelSets.java Adds label canonicalization/hash helpers.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/github/GitHubSelfHostedRunnerPayload.java Adds self-hosted runner webhook payload model.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/github/GitHubSelfHostedRunnerMessageHandler.java Adds self-hosted runner webhook handler.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/alert/QueueAlertEvaluator.java Adds scheduled queue alert evaluator.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/alert/EmailAlertChannel.java Adds email alert channel implementation.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/alert/AlertChannel.java Adds alert channel interface.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/github/GitHubWorkflowJobPayload.java Extends workflow job payload fields.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/github/GitHubWorkflowJobMessageHandler.java Adds queue persistence/indexing to workflow job handling.
server/application-server/src/main/java/de/tum/cit/aet/helios/notification/NotificationPreferenceRepository.java Adds enabled-user lookup for notification type.
server/application-server/src/main/java/de/tum/cit/aet/helios/notification/NotificationPreference.java Adds queue alert notification preference types.
server/application-server/src/main/java/de/tum/cit/aet/helios/notification/email/QueueAlertEmailPayload.java Adds queue alert email payload.
server/application-server/src/main/java/de/tum/cit/aet/helios/github/GitHubRestClient.java Adds ETag-aware GitHub REST client.
server/application-server/src/main/java/de/tum/cit/aet/helios/github/EtagCache.java Adds conditional GET cache.
docs/admin/queue-monitoring.rst Adds queue monitoring rollout/admin docs.
client/src/app/pages/queue/runner-list/runner-list.component.ts Adds runner list page.
client/src/app/pages/queue/queue.routes.ts Adds queue child routes.
client/src/app/pages/queue/queue.api.ts Adds manual queue/runner API wrapper.
client/src/app/pages/queue/queue-stats/queue-stats.component.ts Adds queue stats page.
client/src/app/pages/queue/queue-overview.component.ts Adds queue overview page.
client/src/app/pages/queue/queue-alerts/queue-alerts.component.ts Adds alert rules/events page.
client/src/app/pages/queue/components/runner-pool-panel.component.ts Adds runner pool panel component.
client/src/app/pages/queue/components/runner-pool-panel.component.spec.ts Adds runner pool component tests.
client/src/app/pages/queue/components/queued-reason-chip.component.ts Adds queued reason chip component.
client/src/app/pages/queue/components/queued-reason-chip.component.spec.ts Adds queued reason chip tests.
client/src/app/pages/queue/components/queued-jobs-table.component.ts Adds queued jobs table component.
client/src/app/pages/queue/components/queued-jobs-table.component.spec.ts Adds queued jobs table tests.
client/src/app/pages/queue/components/queue-depth-panel.component.ts Adds queue depth panel component.
client/src/app/pages/queue/components/queue-depth-panel.component.spec.ts Adds queue depth panel tests.
client/src/app/pages/main-layout/main-layout.component.ts Moves dark-mode state into ThemeService.
client/src/app/core/services/theme.service.ts Adds shared theme service.
client/src/app/core/services/theme.service.spec.ts Adds theme service tests.
client/src/app/components/navigation-bar/navigation-bar.component.ts Adds queue navigation entries.
client/src/app/components/charts/helios-line-chart.component.ts Adds shared line chart wrapper.
client/src/app/app.routes.ts Adds repo queue and admin queue routes.
client/package.json Adds chart/date dependencies.

Comment on lines +123 to +130
CREATE TABLE queue_alert_rule (
    id BIGSERIAL PRIMARY KEY,
    kind VARCHAR(32) NOT NULL,
    threshold_seconds INT,
    window_minutes INT NOT NULL DEFAULT 5,
    repository_id BIGINT,
    label_set_hash CHAR(40),
    channels TEXT[] NOT NULL DEFAULT '{EMAIL}',
Comment on lines +111 to +113
    CONSTRAINT uq_queue_wait_stat_natural
        UNIQUE (repository_id, workflow_name, job_name, head_branch,
                label_set_hash, bucket_start)
Comment on lines +41 to +44
@RestController
@RequestMapping("/api/queue")
@RequiredArgsConstructor
public class WorkflowQueueController {
Comment on lines +242 to +244
@EnforceAtLeastWritePermission
@PostMapping("/admin/backfill")
public ResponseEntity<String> startBackfill() {
Comment on lines +219 to +222
return ruleRepository.findById(id).map(rule -> {
    applyDto(rule, body);
    rule.setRepositoryId(repoId);
    return ResponseEntity.ok(toDto(ruleRepository.save(rule)));
Comment on lines +86 to +90
/** Returns rate-limit remaining from the most recent response, or -1 if unknown. */
public int rateLimitRemaining() {
    // Caller-driven monitoring point; relies on Sentry breadcrumbs / metrics in production.
    return -1;
}
Comment on lines +171 to +178
private Integer medianRunDuration(Long repositoryId) {
    // Cheap fallback: median over the last 50 completed jobs in this repo.
    List<WorkflowJob> recent = workflowJobRepository
        .findByRepositoryIdAndStatus(repositoryId, "completed")
        .stream()
        .filter(j -> j.getRunDurationSeconds() != null)
        .limit(50)
        .toList();
Comment on lines +165 to +167
CREATE INDEX idx_queue_alert_event_open
    ON queue_alert_event (rule_id)
    WHERE cleared_at IS NULL;
Comment on lines +249 to +256
private void applyDto(QueueAlertRule rule, AlertRuleDto body) {
    rule.setKind(QueueAlertRule.Kind.valueOf(body.kind()));
    rule.setThresholdSeconds(body.thresholdSeconds());
    rule.setWindowMinutes(body.windowMinutes() == null ? 5 : body.windowMinutes());
    rule.setLabelSetHash(body.labelSetHash());
    rule.setChannels(body.channels() == null ? List.of("EMAIL") : body.channels());
    rule.setEnabled(body.enabled());
    rule.setQuietHoursCron(body.quietHoursCron());
Comment on lines +47 to +50
Map<List<String>, List<Runner>> byLabels = new HashMap<>();
for (Runner r : runnerRepository.findAll()) {
    byLabels.computeIfAbsent(r.getLabels() == null ? List.of() : r.getLabels(),
        k -> new ArrayList<>()).add(r);
krusche and others added 4 commits May 19, 2026 15:45
Resolves the bulk of the review comments from the deep review +
Copilot reviewer + checkstyle on PR #1046.

Schema (V51):
- Extend chk_notification_type so the 3 new enum values can be saved
- queue_wait_stat natural-key columns NOT NULL DEFAULT '' so ON CONFLICT
  dedups correctly (Postgres NULL-distinct semantics would have inserted
  duplicates indefinitely)
- queue_alert_event open-event partial index made UNIQUE so concurrent
  evaluator threads can't race and double-fire emails
- Rename quiet_hours_cron → quiet_window (cron-as-moment was wrong; new
  semantics are HH:mm-HH:mm local-time ranges)
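
To make the HH:mm-HH:mm semantics concrete, here is a minimal Java sketch of a window check that also handles ranges crossing midnight. It is an illustration under assumptions rather than the evaluator's code: QuietWindowSketch and inQuietWindow are hypothetical names, the end of the window is treated as exclusive, a zero-length window is treated as "never quiet", and malformed input simply throws.

```java
import java.time.LocalTime;

/** Hypothetical sketch of an HH:mm-HH:mm quiet-window check; not the PR's evaluator. */
final class QuietWindowSketch {

    /** Returns true when {@code now} falls inside the window; the end time is exclusive. */
    static boolean inQuietWindow(String window, LocalTime now) {
        String[] parts = window.split("-", 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected HH:mm-HH:mm but got: " + window);
        }
        // LocalTime.parse accepts ISO local times such as "22:00" and throws on anything else.
        LocalTime start = LocalTime.parse(parts[0].trim());
        LocalTime end = LocalTime.parse(parts[1].trim());
        if (start.equals(end)) {
            return false; // assumption: a zero-length window suppresses nothing
        }
        if (start.isBefore(end)) {
            // Same-day window, e.g. 09:00-17:00.
            return !now.isBefore(start) && now.isBefore(end);
        }
        // Overnight window, e.g. 22:00-06:00: quiet from start to midnight and from midnight to end.
        return !now.isBefore(start) || now.isBefore(end);
    }

    public static void main(String[] args) {
        System.out.println(inQuietWindow("22:00-06:00", LocalTime.of(23, 30))); // true
        System.out.println(inQuietWindow("22:00-06:00", LocalTime.of(12, 0)));  // false
    }
}
```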

Runtime correctness:
- WorkflowJobBackfillService: dispatch through a separate proxied
  WorkflowJobBackfillExecutor (Spring @Async self-invocation bug fix);
  URL-encode the `created>=...` filter; paginate /actions/runs/{id}/jobs;
  add an abort() flag; rollupRange() historical buckets after backfill
- InProgressJobReconciler: paginate run jobs endpoint too
- RunnerInventoryReconciler: explicit empty-inventory handling — empty
  list now marks all online runners offline (previously skipped)
- GitHubSelfHostedRunnerMessageHandler: verify org.login matches config;
  add `deleted` to the action switch (GitHub's actual removal action);
  canonicalize labels before save so RunnerController.pools groups
  correctly even when label order differs
- QueueIndexService: per-job state tracking so duplicate webhook
  delivery doesn't drift the counter
- QueueEtaService: cache by job id (depends on per-job position, not
  just label-set); exclude the job being estimated from queueAhead;
  return null ETA when capacity is 0; replace
  findByRepositoryIdAndStatus + .limit(50) in Java with a real ORDER BY
  + bounded query
- QueueAlertEvaluator: parse quiet windows as HH:mm-HH:mm ranges
  (handles overnight); apply rule.labelSetHash to all 3 measurements;
  org-wide rules now go to findForRuleWindow (which honours NULL
  repositoryId) instead of substituting 0L; STUCK_JOBS_OVER counts only
  rows that are still status='queued'
- QueueAlertRule.Kind.unit() — explicit SECONDS vs COUNT to remove the
  "threshold in seconds" misnomer for runner-offline / stuck-jobs
- Stats endpoint: sample-weighted percentile aggregate (closer
  approximation; documented limitation vs raw-sample percentile)
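
The difference between a plain average of per-bucket p95 values and a sample-weighted aggregate can be seen with two buckets of very different size. The Java sketch below is illustrative only: WeightedPercentileSketch, the Bucket record, and the sample numbers are hypothetical and are not taken from the stats endpoint. As the commit message notes, the weighted value is still an approximation; an exact p95 would need the raw samples.

```java
import java.util.List;

/** Hypothetical comparison of unweighted vs sample-weighted p95 aggregation. */
final class WeightedPercentileSketch {

    /** One rollup bucket: how many jobs it saw and their p95 wait in seconds. */
    record Bucket(long sampleCount, double p95Seconds) {}

    /** Plain mean of the bucket p95s; a 2-job bucket counts as much as a 1000-job bucket. */
    static double unweightedMean(List<Bucket> buckets) {
        return buckets.stream().mapToDouble(Bucket::p95Seconds).average().orElse(Double.NaN);
    }

    /** Mean of the bucket p95s weighted by each bucket's sample count. */
    static double sampleWeightedMean(List<Bucket> buckets) {
        long total = buckets.stream().mapToLong(Bucket::sampleCount).sum();
        if (total == 0) {
            return Double.NaN;
        }
        double weighted = buckets.stream()
            .mapToDouble(b -> b.sampleCount() * b.p95Seconds())
            .sum();
        return weighted / total;
    }

    public static void main(String[] args) {
        // A busy hour with short waits and a nearly idle hour with two slow jobs.
        List<Bucket> buckets = List.of(new Bucket(1000, 30.0), new Bucket(2, 600.0));
        System.out.println(unweightedMean(buckets));     // dominated by the tiny bucket
        System.out.println(sampleWeightedMean(buckets)); // stays close to the busy hour
    }
}
```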

Security / scoping:
- WorkflowQueueController and RunnerController feature-gated by
  helios.queue.enabled (matches the schedulers)
- updateRule / deleteRule now scoped by (id, repositoryId) — caller
  can't edit or delete a rule from another repo by guessing its id
- Backfill endpoint upgraded to @EnforceAdmin
- AlertRuleDto: @NotNull + @Pattern + @Min validation actually wired up
- /jobs endpoint: pageable LIMIT pushed into SQL, capped at 500

Client:
- ThemeService DOM toggle (already extracted)
- HeliosLineChartComponent: import 'chartjs-adapter-date-fns' for side
  effect (the time-scale would otherwise throw at runtime)
- queue.api.ts: rename quietHoursCron → quietWindow
- queue-alerts: per-kind threshold-unit label ("seconds" vs "count");
  null-id template guard so strict templates pass; quietWindow input
- queue-stats: filters → signals (effect now refetches on change);
  toSignal(paramMap) so repositoryId is reactive (repo-switch
  no longer leaves stale polling)
- queue-overview: org-wide /queue route uses orgDepth instead of
  spinning forever waiting for a non-existent repositoryId; reactive
  repositoryId
- queue-alerts: reactive repositoryId
- app.routes: top-level /queue exposes only the overview (stats and
  alerts require a repositoryId)
- navigation-bar: admin-only "Org Queue" entry
- eslint.config.js: ignore dist/ (was failing lint on generated output)
- yarn.lock regenerated for chart.js / date-fns / chartjs-adapter
- openapi.yaml regenerated; SDK regenerated via npm run generate:openapi

OpenAPI profile:
- helios.queue.enabled=true so the new controllers are scanned
- GitHubRestClient MeterRegistry made optional (uses SimpleMeterRegistry
  when actuator isn't auto-configured, e.g. in the openapi profile)

Tests:
- QueueIndexServiceDriftTest: re-enabled (was @Disabled); 4 active tests
  covering redelivery, separate jobs, completion of unknown job
- QuietHoursWindowTest: rewritten against the new HH:mm-HH:mm semantics
  (same-day, overnight, invalid input)
- QueueStatsAveragingTest: re-enabled (sample-weighted percentile)
- WorkflowJobBackfillServiceTest: rewritten to verify the proxied async
  dispatch + idempotent start() + abort() short-circuit
- QueueEtaServiceTest: new test for queueAhead=0 single-job-queue case
  (ETA ≈ 0 instead of one full p50run as before)
- QueueIndexServiceTest: status case-handling updated to match new
  JobState.fromStatus mapping
- All WebMvcTest slices now declare helios.queue.enabled=true so the
  feature-gated controllers load
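
The drift scenarios named for QueueIndexServiceDriftTest (redelivery, separate jobs, completion of an unknown job) can be sketched with a counter that remembers the last state seen per job. This is a hypothetical illustration, not the PR's QueueIndexService: QueueCounterSketch, its JobState enum, and onStatus are assumed names, and the sketch neither evicts finished jobs nor guards against out-of-order delivery.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-job state tracking so webhook redelivery cannot drift the counter. */
final class QueueCounterSketch {

    enum JobState { QUEUED, IN_PROGRESS, COMPLETED }

    private final Map<Long, JobState> lastSeen = new HashMap<>();
    private int queued;

    /** Applies a webhook status; repeating the same status for a job changes nothing. */
    synchronized void onStatus(long jobId, JobState next) {
        JobState previous = lastSeen.put(jobId, next);
        if (previous == next) {
            return; // redelivery of a status already applied
        }
        if (previous == JobState.QUEUED) {
            queued--; // the job left the queue
        }
        if (next == JobState.QUEUED) {
            queued++; // the job entered the queue
        }
    }

    synchronized int queuedCount() {
        return queued;
    }

    public static void main(String[] args) {
        QueueCounterSketch counter = new QueueCounterSketch();
        counter.onStatus(1L, JobState.QUEUED);
        counter.onStatus(1L, JobState.QUEUED);      // duplicate delivery, ignored
        counter.onStatus(2L, JobState.QUEUED);
        counter.onStatus(1L, JobState.IN_PROGRESS);
        counter.onStatus(1L, JobState.IN_PROGRESS); // duplicate delivery, ignored
        counter.onStatus(99L, JobState.COMPLETED);  // job never seen as queued
        System.out.println(counter.queuedCount());  // 1
    }
}
```

Because the counter only moves on a transition into or out of QUEUED, a completion event for a job that was never observed as queued leaves it untouched.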

Full suite (446 server + 20 client) green locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…_to_helios_deployment)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- LabelSets.hash: SHA-1 → SHA-256 (Codacy critical security finding;
  hash is for bucketing, not crypto, but static analysis can't tell)
- Widen label_set_hash column CHAR(40) → CHAR(64) to fit SHA-256 hex
- Fix 6 minor checkstyle/comprehensibility findings: import order,
  line length, variable-declaration distance, Javadoc summary period
- Regenerate openapi.yaml + client SDK to reflect the column widening

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

There hasn't been any activity on this pull request recently. Therefore, this pull request has been automatically marked as stale and will be closed if no further activity occurs within seven days. Thank you for your contributions.

github-actions Bot added the stale label May 27, 2026
krusche added 3 commits May 27, 2026 22:32
The PR's V52__add_workflow_job_and_runner_inventory.sql collides with
V52__add_deployment_workflow_run_id_index.sql that landed on staging
later. Flyway rejects duplicate version numbers
(CompositeMigrationResolver line 93), which is what tanked
server-tests and validate-migrations on the last CI run.

Bumped to V54 (V53 is reserved for the in-flight #1098 approval-flow
migration, so this avoids a second collision when that lands).

No SQL changes — pure file rename. Migration content is identical.
github-actions Bot commented May 28, 2026

🚨 Client Code Validation Failed 🚨

The client code in /client/src/app/core/modules/openapi is not up-to-date.
Please regenerate the client code by running:

cd ./client
pnpm generate:openapi

Commit and push the updated files.

bassner commented May 28, 2026

Member

@Claudia-Anthropica review

github-actions Bot removed the stale label May 28, 2026
github-actions Bot commented Jun 5, 2026

There hasn't been any activity on this pull request recently. Therefore, this pull request has been automatically marked as stale and will be closed if no further activity occurs within seven days. Thank you for your contributions.

github-actions Bot added the stale label Jun 5, 2026