Skip to content

ci: efficiency — flaky integration tests + gate AI reviews on CI status #70

Description

@EricAndrechek

Context

We're seeing real cost + noise from our CI pipeline that's worth tightening before the repo goes public. This issue bundles all the CI-hygiene work and the post-#69 cleanup tasks into a single follow-up PR that, once merged, lets us re-add Lint/Test/Integration Tests to the ruleset's required status checks.

Scope updated 2026-04-23 to bundle the golangci-lint flake fix, dropping pull_request from the dual-trigger workflows, and SDK + E2E test coverage into one PR. Reason: all these changes are small workflow tweaks, related by concern, low-risk, and Taite isn't online for a few hours — so one larger PR is more efficient than splitting.


Scope (all bundled into one PR that closes this issue)

1. Fix TestDLQIntegration flakiness (root-cause, not retry-cover)

Two observed failure modes:

Both point at the same root cause: testcontainer readiness race for ClickHouse. Most likely causes:

  • Process ready vs. connection ready timing on startup
  • Port contention / cleanup race with prior tests
  • Missing t.Helper() or t.Cleanup() discipline

Acceptance: TestDLQIntegration runs 10× locally and 10× in CI without a single failure.

2. Gate claude-review.yml on CI pass/skip

Currently Claude review fires on pull_request: opened / synchronize regardless of CI state — tokens get spent on PRs the human will just bounce back as "come back when CI is green."

Implementation options:

  • (A) Switch trigger to workflow_run: { workflows: [CI], types: [completed] } + check conclusion in a first step. Clean semantics; trade-off is workflow_run has reduced default permissions (need to explicitly grant pull-requests: write).
  • (B) Keep pull_request trigger but add a first-step guard that queries gh api .../check-runs and exits 0 if any required check is failure / cancelled. Simpler permission model, slight latency cost (workflow spins up a runner just to decide to skip).
  • (C) Both: gate + sticky comment explaining "CI red; AI review skipped until green."

Acceptance: Claude review doesn't post on PRs whose required CI checks are failing. Post-fix re-push that goes green triggers review normally.

Out of scope: Gemini (managed App) and Copilot (per-seat).

3. Fix golangci-lint config verify external-schema flake

Observed multiple times on main and PRs:

[.golangci.yml] validate: compile schema: failing loading
"https://golangci-lint.run/jsonschema/golangci.v2.11.jsonschema.json":
context deadline exceeded

golangci-lint-action runs golangci-lint config verify by default, which fetches a schema from golangci-lint.run. When that service is slow or the runner's network blips, Lint fails for reasons unrelated to our Go code.

Fix: pass verify: false to the golangci-lint-action step in ci.yml. Skips the schema-validate pre-flight; the actual linter run is unaffected. One-line change.

4. Drop pull_request trigger from pr-title.yml + label.yml

Those workflows were temporarily dual-triggered (both pull_request AND pull_request_target) in #69 as a transition. Now that pull_request_target is on main and confirmed to fire, drop the pull_request trigger. Stops running each check twice on internal PRs, removes the sticky-comment TOCTOU race Claude flagged.

Also: revert the concurrency-group ${{ github.event_name }} suffix once there's only one trigger — no longer needed.

5. Add SDK + E2E test jobs to CI

Gap surfaced by #63 (Dependabot SDK bump): make test-sdk and make test-e2e exist in the Makefile but aren't run by ci.yml. An SDK-only PR currently runs zero TypeScript tests automatically.

Proposed additions:

  • sdk-test job: runs on pull_request with paths: [clients/ts/**, tests/sdk/**]. Executes make test-sdk. Fast (~30s).
  • e2e job: runs on pull_request with paths covering the full ingest surface (internal/ingest/**, internal/api/**, clients/ts/**, tests/sdk/**). Executes make test-e2e. Slower (starts Docker containers), gate behind a label (e.g. run-e2e) or only on ready_for_review to manage runner cost.

Acceptance: SDK bump PRs exercise the SDK unit tests. PRs touching the ingest → CH → query path can optionally trigger E2E via label.

6. Smarter CI (lower priority — pick what fits)

Grab-bag of ideas:

  • Path-based CI skips via paths: / paths-ignore: on ci.yml. Docs-only PRs shouldn't run Go integration tests. Already partially possible; audit and expand.
  • Test parallelization via gotestsum --packages with explicit splits, or go test -parallel N. Could reduce Test job from ~4m → ~2m.
  • Flaky-test auto-retry: gotestsum --rerun-fails=2 --packages ./... in test-integration. Belt-and-suspenders after ci: bump golangci/golangci-lint-action from 7 to 9 #1 root-cause work; not a substitute.
  • Cache warming: actions/cache for Go modules + build cache.
  • Change-aware test selection via go list -deps: only run Go tests for packages whose files (or transitive deps) changed. Probably over-engineering for our pace.

Once this PR merges — ruleset update

After verification on main + at least 3 clean CI runs, re-add these to the ruleset's required_status_checks:

  • Lint
  • Test
  • Integration Tests

Single gh api PUT /repos/Wave-RF/WaveHouse/rulesets/15353356 call.


References

Out of scope (tracked separately)


— Posted by Claude Code on behalf of @EricAndrechek

Metadata

Metadata

Assignees

Labels

area/infraCI, build, deploy, Docker, releaseenhancementNew feature or request

Type

No type

Fields

No fields configured for issues without a type.

Projects

Status
Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions