
fix(bootstrap): improve startup reliability for multi-replica deploys#4444

Draft
gandhipratik203 wants to merge 6 commits into main from
fix/4051-alembic-pgbouncer-transaction-pool

Conversation

@gandhipratik203

Summary

Fixes issue #4051 — Alembic migration advisory lock hangs when multiple gateway replicas start through PgBouncer in transaction-pooling mode. The session-scoped pg_advisory_lock is orphaned across PgBouncer's backend handoffs; the N-th gateway pod sees the lock held by an effectively-dead session and spins in its retry loop until the ~10-minute timeout fires.

This PR closes the bug along two layers and ships them together so the chart is safe under both deployment patterns:

  • Layer 1 (correctness) — bootstrap_db.main() now probes alembic_version before attempting the advisory lock. When the schema is at the script directory's head, the lock acquisition is skipped entirely. Replicas 2..N never participate in the race.
  • Layer 2 (operator ergonomics) — new MCPGATEWAY_SKIP_MIGRATIONS settings flag suppresses the in-pod bootstrap_db.main() call entirely. The mcp-stack Helm chart wires it from migration.enabled so the contract "the pre-install Job owns migrations, app pods skip" is enforced at the chart layer rather than left to operators to coordinate.

Library defaults preserve backward compatibility: a direct docker run mcpgateway:latest continues to bootstrap its own schema. Only the chart and (future) compose overlays opt into SKIP_MIGRATIONS=true.

Related: existing migration.hostKey: host / portKey: port knobs in charts/mcp-stack/profiles/ocp/values-pgo.yaml already let the migration Job bypass PgBouncer; that workaround stays in place but is no longer load-bearing.

Approach

The fix was built reproduction-first: rather than implement against the issue text alone, the bug was first reproduced deterministically in docker-compose (tests/reproduction/issue-4051/) and then again on a real OpenShift cluster against CrunchyData PGO + transaction-pool PgBouncer. Both reproductions are documented in todo/issue-4051-ocp-reproduction.md, and the compose-side artifact lives permanently in the repo as a test fixture.

For each layer (L1 fast-path, L2 chart-wired skip), the cycle was:

   1. write a failing test that captures the invariant ──────► RED
   2. implement just enough to make it pass ─────────────────► GREEN
   3. verify on the OCP cluster against pre-fix and post-fix
      images, side-by-side, on the same stack ──────────────► EVIDENCE
   4. commit

The regression test for the bug — test_bootstrap_db_skips_lock_when_schema_already_at_head — is reliably RED pre-fix (240s timeout, 29/60 retry attempts visible in logs) and reliably GREEN post-fix (~1.4s). It's not a flaky timing-based test: it synthesizes the held-lock condition deterministically by holding the advisory lock in a distinct, still-alive Postgres session for the duration of the second bootstrap_db.main() call. That guarantees the bootstrap's PgBouncer-side session sees pg_try_advisory_lock return FALSE regardless of PgBouncer's internal backend assignment — without that guarantee, PgBouncer can hand the bootstrap the same server backend that the holder used, and pg_try_advisory_lock succeeds via PostgreSQL's reentrant-within-session semantics, masking the bug. (We landed on this design after observing a passing test that should have failed; the false-positive is documented as test_reentrant_acquire_through_same_pgbouncer_is_not_a_counter_example so a future reader who hits the same gotcha understands what they're seeing.)


Gaps closed

HIGH — Bug correctness

Gap 1 (HIGH) — Advisory lock orphan via PgBouncer transaction-pool handoffs: bootstrap_db.advisory_lock always called pg_try_advisory_lock and retried up to 60 times with exponential backoff. When PgBouncer reused a backend whose previous client took the lock, new clients saw it held by an unreachable session and spun until timeout. Fixed with an _alembic_at_head probe that runs before the lock — when the DB is at the script-directory head, the lock is never attempted. The probe uses Alembic's MigrationContext.get_current_heads() and ScriptDirectory.from_config(cfg).get_heads() for the canonical source of truth; any error during probing falls through to the existing slow path so empty / partial / out-of-date databases still go through the explicit handling.
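The probe's decision logic can be sketched as a pure predicate. These helper names (`schema_at_head`, `probe_at_head`) are illustrative, not the PR's actual functions; in the real implementation the inputs come from Alembic's MigrationContext.get_current_heads() and ScriptDirectory.from_config(cfg).get_heads():

```python
from typing import Iterable

def schema_at_head(db_heads: Iterable[str], script_heads: Iterable[str]) -> bool:
    """True only when the DB's stamped revisions exactly match the script
    directory's heads; anything ambiguous must take the slow path."""
    db, script = set(db_heads), set(script_heads)
    # An empty alembic_version (fresh DB) or an empty script directory
    # never counts as "at head".
    return bool(script) and db == script

def probe_at_head(get_db_heads, get_script_heads) -> bool:
    """Fail closed: any probe error falls through to the advisory-lock
    slow path instead of skipping the lock."""
    try:
        return schema_at_head(get_db_heads(), get_script_heads())
    except Exception:
        return False
```

The fail-closed wrapper is the important design point: only a positive, error-free "exactly at head" answer may skip the lock.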

Gap 2 (HIGH) — Worker-level fan-out under-stated in the issue text: each gateway pod runs N gunicorn workers (default 8 in the OCP profile), and each worker calls bootstrap_db.main() in its own lifespan. So a 3-replica deployment produces 24 racing acquirers, not 3. Reproduction on OpenShift confirmed the fan-out: pre-fix, the same pod's logs show interleaved Acquired lock on attempt N and Lock held by another instance, attempt N/60 at the same timestamps — different workers in the same gunicorn process. The L1 fast-path eliminates the race for replicas 2..N entirely; for replica 1 the within-pod race is bounded to a sub-second seed window.

Gap 3 (HIGH) — Chart safety today depends on the OCP-only workaround: migration.hostKey: host / portKey: port route the pre-install Job direct to Postgres (bypassing PgBouncer), masking the bug whenever the Job is enabled. A community user setting migration.enabled=false (because they manage migrations externally, or run the chart in compose-flavoured contexts) loses that protection. Layer 1 makes migration.enabled=false safe.

HIGH — Operator ergonomics

Gap 4 (HIGH) — In-pod bootstrap runs even when a dedicated migration runner has already populated the schema: redundant probe activity per worker, plus the within-pod race for replica 1's seed window. Fixed with MCPGATEWAY_SKIP_MIGRATIONS=true which short-circuits the lifespan call to bootstrap_db.main() and emits a single audit-log line so operators can see the choice in pod logs.
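A minimal sketch of the lifespan gate, assuming env-var parsing as a stand-in for the real settings field (function names here are hypothetical):

```python
import asyncio
import logging
import os

logger = logging.getLogger("mcpgateway.bootstrap")

def _skip_migrations() -> bool:
    # Stand-in for the settings field: default False; only the literal
    # "true" (case-insensitive) flips it on -- no accidental coercion.
    return os.environ.get("MCPGATEWAY_SKIP_MIGRATIONS", "false").strip().lower() == "true"

async def run_initial_db_bootstrap(bootstrap) -> bool:
    """Return True when in-pod bootstrap ran, False when it was skipped."""
    if _skip_migrations():
        # The single audit-log line operators see in pod logs.
        logger.info("Skipping in-pod migration bootstrap (MCPGATEWAY_SKIP_MIGRATIONS=true)")
        return False
    await bootstrap()
    return True
```

With the flag unset, a plain `docker run` pod still bootstraps its own schema; the chart flips the flag only when its pre-install Job owns migrations.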

Gap 5 (HIGH) — No chart wiring for the SKIP/Job pairing: even if the env var existed, expecting operators to set both migration.enabled=true AND MCPGATEWAY_SKIP_MIGRATIONS=true is a foot-gun (forgetting one means redundant in-pod bootstrap; forgetting the other means schema never populated). Fixed in charts/mcp-stack/templates/deployment-mcpgateway.yaml: the env var is rendered as {{ ternary "true" "false" .Values.migration.enabled }} and placed after extraEnv so a user override cannot accidentally undo the contract.

MEDIUM — Test infrastructure

Gap 6 (MEDIUM) — No regression test for the advisory-lock invariant: no existing test exercised concurrent bootstrap_db.main() callers through PgBouncer. Fixed with tests/integration/test_migrations_under_transaction_pool.py containing five @pytest.mark.integration tests: three pinning the PgBouncer mechanism (so the substrate behavior is captured for posterity even if PgBouncer changes), one regression gate (synthesizes the orphaned-lock condition and asserts bootstrap_db.main() returns within 10s — fails in 240s pre-fix), and one idempotency invariant (asserts a second call after schema is at head never enters the advisory_lock context manager). Tests skip cleanly without --with-integration.

Gap 7 (MEDIUM) — No reproduction artifact in the repo: future debuggers had nothing to anchor on. Fixed with tests/reproduction/issue-4051/ containing a docker-compose stack (postgres + pgbouncer + gateway), a deterministic demonstrate_orphan.sh (proves the PgBouncer mechanism in 3 psql calls), and a reproduce.sh (scales gateway replicas through PgBouncer and reports per-pod hangs). The compose file lives at tests/integration/fixtures/transaction_pool/ so the integration tests can drive the same stack. The shell scripts are slated for deletion before this PR merges; their evidence will be preserved in the issue thread.

LOW — Documentation and observability

Gap 8 (LOW) — Two readiness markers, one log-line: the L1 fast-path returns without emitting "Database ready" (only the slow-path return logs it). Operators reading pod logs see either the slow-path marker or the fast-path skip line, not a single canonical "bootstrap finished" line. Tracked as a tiny follow-up; not blocking.


Architecture

Before — every replica × every worker races for the lock

                        DB schema starts populated (alembic_version at head)
                                                │
                                                ▼
   ┌────────── pod 1 (8 gunicorn workers) ──────────┐
   │  W1 → SELECT pg_try_advisory_lock(42…42) → t   │ ─┐
   │  W2 → SELECT pg_try_advisory_lock(42…42) → f   │ ─┤
   │  W3 → SELECT pg_try_advisory_lock(42…42) → f   │ ─┤
   │  …                                              │  │
   └─────────────────────────────────────────────────┘  │
                                                        │ all 24 workers
   ┌────────── pod 2 (8 gunicorn workers) ──────────┐  │ converge through
   │  W1..W8 → all 'f', spin in retry loop          │ ─┤ the same pgbouncer
   └─────────────────────────────────────────────────┘  │ pool
                                                        │
   ┌────────── pod 3 (8 gunicorn workers) ──────────┐  │
   │  W1..W8 → all 'f', spin in retry loop          │ ─┘
   └─────────────────────────────────────────────────┘
                                                        │
                                                        ▼
                        ┌──────────────────────────────────────────┐
                        │ PgBouncer (transaction pool)             │
                        │   each transaction commit → backend      │
                        │   handed to next client                  │
                        │   advisory lock outlives client session  │
                        └──────────────────────────────────────────┘
                                                        │
                                                        ▼
                        ┌──────────────────────────────────────────┐
                        │ Postgres                                  │
                        │   1 backend granted advisory lock         │
                        │   N–1 backends "idle in transaction" on   │
                        │   pg_try_advisory_lock(42424242424242)    │
                        └──────────────────────────────────────────┘

   Observable on OCP (issue reporter + our reproduction):
     · pods cycle 0/1 Running → CrashLoopBackOff → … → Ready over ~13 min
     · 7 restarts per pod (×3 = 21 restart events)
     · 60+ visible "Lock held by another instance" retries per restart
     · helm release status: failed

After Layer 1 — fast-path before the lock

   ┌────────── pod 1 worker ─────────────────────────────────────┐
   │  bootstrap_db.main()                                         │
   │    ┌─ probe (read-only) ─────────────────────────────────┐  │
   │    │  alembic_version == script_dir.get_heads() ?        │  │
   │    │       │                                              │  │
   │    │       ├── YES → return early, no lock involved      │  │
   │    │       │                                              │  │
   │    │       └── NO → fall through to advisory_lock path   │  │
   │    │              (existing slow path, unchanged)         │  │
   │    └─────────────────────────────────────────────────────┘  │
   └──────────────────────────────────────────────────────────────┘

   On a fresh deploy (empty DB), only the FIRST worker that wins the
   lock runs the slow path; replicas 2..N probe AFTER the seed and
   skip the lock entirely.

   Observed on OCP (verified end-to-end on context-forge-test-pg1):
     · pod 1 (seed):    0 retries → completes in ~1s
     · pods 2 & 3:      0 retries, 8 fast-path firings each
     · all 3 pods 1/1 Ready in 56 seconds, 0 restarts
     · helm release status: deployed

After Layer 2 — pods don't even probe when the Job owns it

                                                       Helm pre-install Job
                                                       ─────────────────────
                                                       direct to postgres :5432
                                                       runs bootstrap_db,
                                                       exits 0
                                                              │
                                                              ▼
                                              alembic_version at head
                                                              │
                                                              ▼
   ┌────────── gateway Deployment (chart-rendered) ────────────┐
   │  env:                                                      │
   │    - MCPGATEWAY_SKIP_MIGRATIONS: "true"   ← from           │
   │                                              .Values.      │
   │                                              migration.    │
   │                                              enabled       │
   │                                                            │
   │  lifespan:                                                 │
   │    wait_for_db_ready()                                     │
   │    if settings.mcpgateway_skip_migrations:                 │
   │        logger.info("Skipping in-pod migration              │
   │                     bootstrap (MCPGATEWAY_                 │
   │                     SKIP_MIGRATIONS=true)…")               │
   │        return                                              │
   │    else:                                                   │
   │        await bootstrap_db.main()                           │
   └────────────────────────────────────────────────────────────┘

   No probe. No lock. No worker fan-out.
   Single audit-log line per pod so operators can see the choice.

Test results

The change ships with three layers of verification, each addressing a different failure mode:

   ┌──────────────────────────────────────────────────────────────────────┐
   │ Layer            │ Catches                                            │
   ├──────────────────┼────────────────────────────────────────────────────┤
   │ Unit (13 tests)  │ Fast-path predicate logic; settings field          │
   │                  │ defaults; lifespan helper gate; chart render       │
   │                  │ outputs the right env var.                         │
   │                  │                                                    │
   │                  │ A future refactor that breaks ANY of these gets a  │
   │                  │ red unit test in seconds.                          │
   ├──────────────────┼────────────────────────────────────────────────────┤
   │ Integration      │ The actual bug surface: bootstrap_db running       │
   │ (5 tests)        │ against a real Postgres + PgBouncer stack with     │
   │                  │ the held-lock precondition synthesized.            │
   │                  │                                                    │
   │                  │ Reliably RED on main (240s timeout). Reliably      │
   │                  │ GREEN with the L1 fast-path. Plus 3 mechanism      │
   │                  │ tests pinning PgBouncer's transaction-pool         │
   │                  │ semantics for documentation-as-code.               │
   ├──────────────────┼────────────────────────────────────────────────────┤
   │ Cluster smoke    │ End-to-end on OpenShift with CrunchyData PGO +     │
   │ (manual)         │ transaction-pool PgBouncer. Same image, same       │
   │                  │ stack, only the gateway code differs between       │
   │                  │ pre- and post-fix runs. Concrete numbers below.    │
   └──────────────────────────────────────────────────────────────────────┘

The integration tests sit behind --with-integration; the unit tests run on every CI pass. The cluster smoke is documented evidence, not automated, and lives as todo/issue-4051-ocp-reproduction.md for the PR record.

Reproduction harness — how the bug was demonstrated end-to-end

Two artifacts live under tests/reproduction/issue-4051/. They will be removed before this PR merges (their evidence preserved in the issue thread):

   demonstrate_orphan.sh   ── deterministic mechanism proof (3 psql calls)
                              · step 1: through PgBouncer, take advisory lock,
                                disconnect
                              · step 2: directly query pg_locks → lock STILL
                                held by orphaned backend
                              · step 3: from a fresh Postgres session,
                                pg_try_advisory_lock → FALSE
                              always passes; documents PgBouncer behavior

   reproduce.sh            ── replica-scaling symptom reproduction
                              · scales 3 gateway replicas through PgBouncer
                              · watches each pod's logs for the retry storm
                              · dumps pg_locks + pg_stat_activity on hang
                              fails on pre-fix images, succeeds on post-fix

The compose stack itself was promoted to tests/integration/fixtures/transaction_pool/docker-compose.yml so the integration test can drive the same infrastructure that the shell scripts do. PgBouncer's default_pool_size is deliberately small (2) so total client connections comfortably exceed it, forcing backend multiplexing — without that pressure, the bug was hard to trigger locally because bootstrap can finish before any backend handoff happens.
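The two pooler settings that matter can be illustrated with a fragment in PgBouncer's ini format. This is a sketch, not the fixture's actual file; host and database names are placeholders:

```ini
; Illustrative fragment only -- the real fixture lives under
; tests/integration/fixtures/transaction_pool/; names below are placeholders.
[databases]
mcpgateway = host=postgres port=5432 dbname=mcpgateway

[pgbouncer]
listen_port = 6432
; Backend is handed to the next client at every transaction boundary:
pool_mode = transaction
; Deliberately tiny so client connections far exceed server backends,
; forcing the backend multiplexing the bug needs:
default_pool_size = 2
```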

Why two harnesses for one bug? demonstrate_orphan.sh proves the PgBouncer mechanism exists (orphaned advisory locks); reproduce.sh proves the operational symptom exists (replicas hang). Together they answer two questions any reviewer might ask: "is this a real PgBouncer property?" and "does it actually cause user-visible breakage?" — without making us guess timing.

Why the regression test is deterministic (not flaky)

Three ideas had to land before test_bootstrap_db_skips_lock_when_schema_already_at_head was reliably RED on main:

   ❌ First attempt — race the bootstrap calls
       Open 3 concurrent gateway processes, hope one wins the lock and
       the others spin. Worked sometimes (lucky timing) → fragile.

   ❌ Second attempt — synthesize an orphan via PgBouncer
       Take the lock through PgBouncer, disconnect, then call
       bootstrap_db.main() through PgBouncer. PgBouncer reused the same
       server backend for the bootstrap → pg_try_advisory_lock returned
       TRUE (reentrant within the same Postgres session) → FALSE
       POSITIVE (test passed when bug was present).

   ✅ Final design — hold the lock in a distinct postgres session
       Open a direct postgres connection (NOT through PgBouncer), take
       the advisory lock, KEEP the connection alive. Now run
       bootstrap_db.main() through PgBouncer. The bootstrap's session
       is guaranteed different from the holder's session, so
       pg_try_advisory_lock MUST return FALSE → bootstrap retries →
       240s pytest.mark.timeout fires. RED, every time.

The middle attempt's false-positive is captured as a permanent test (test_reentrant_acquire_through_same_pgbouncer_is_not_a_counter_example) so future readers who see "but I called pg_try_advisory_lock through PgBouncer and it returned TRUE!" understand the reentrant-session gotcha.
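The reentrance gotcha can be shown with a toy model in plain Python — not real Postgres or PgBouncer, just a session-scoped lock whose try-acquire is reentrant within the holding backend session:

```python
class AdvisoryLockModel:
    """Toy model of a session-scoped advisory lock: ownership is tracked
    per backend session id, and try-acquire is reentrant within it."""
    def __init__(self):
        self.holder = None  # backend session id currently holding the lock

    def try_lock(self, session_id: int) -> bool:
        if self.holder is None or self.holder == session_id:
            self.holder = session_id  # reentrant within the same session
            return True
        return False

lock = AdvisoryLockModel()

# Second attempt (false positive): the holder took the lock through
# pgbouncer on backend 1 and disconnected; pgbouncer later hands the SAME
# backend to the bootstrap, so the reentrant acquire succeeds and masks
# the bug.
assert lock.try_lock(session_id=1)      # holder's transaction, backend 1
assert lock.try_lock(session_id=1)      # bootstrap reuses backend 1: TRUE

# Final design (deterministic): the holder keeps a DIRECT postgres
# connection alive on backend 1; bootstrap comes through pgbouncer on a
# guaranteed-different backend, so try-acquire MUST fail.
assert not lock.try_lock(session_id=2)  # bootstrap on backend 2: FALSE
```

The model makes the invariant explicit: determinism comes from pinning the holder to a distinct, still-alive session, not from timing.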

The same logic underpins the _clean_orphaned_lock autouse fixture: it runs pg_terminate_backend on any session still holding the test lock id between tests, so a single failed test cannot wedge the next one behind a leaked orphan.

Unit — `_alembic_at_head` truth table (5 tests)
tests/unit/mcpgateway/test_bootstrap_db.py::TestAlembicAtHead

  test_returns_true_when_db_heads_match_script_heads          PASSED
  test_returns_false_when_db_has_no_alembic_version           PASSED
  test_returns_false_when_db_head_does_not_match_script_head  PASSED
  test_returns_false_when_script_directory_has_no_heads       PASSED
  test_returns_false_on_probe_exception                       PASSED

5 passed

Pins each branch of the fast-path predicate independently of the integration test, so a future refactor that widens or narrows the trigger condition is caught locally.

Unit — `MCPGATEWAY_SKIP_MIGRATIONS` settings field (3 tests)
tests/unit/mcpgateway/test_config.py

  test_skip_migrations_defaults_to_false        PASSED
  test_skip_migrations_env_true_flips_flag      PASSED
  test_skip_migrations_env_false_keeps_flag_off PASSED

3 passed

Library default stays False (the docker run mcpgateway happy path), env var override flips True, explicit False matches the default. No accidental coercion.

Unit — lifespan helper respects the flag (2 tests)
tests/unit/mcpgateway/test_main.py

  test_run_initial_db_bootstrap_invokes_bootstrap_when_flag_off  PASSED
  test_run_initial_db_bootstrap_skips_when_flag_on               PASSED

2 passed

The skip-case test also asserts the audit-log line is emitted so operators can see the choice in pod logs.

Unit — Helm chart render (3 tests)
tests/unit/charts/test_deployment_mcpgateway.py

  test_skip_migrations_set_to_true_when_migration_job_enabled        PASSED
  test_skip_migrations_off_when_migration_job_disabled               PASSED
  test_skip_migrations_default_matches_migration_enabled_default     PASSED

3 passed

Tests run helm template as a subprocess, parse the rendered manifests, and assert on the gateway Deployment's env block. Skipped automatically when helm is not on PATH; no cluster needed.

Integration — PgBouncer mechanism + bootstrap regression (5 tests, gated by `--with-integration`)
tests/integration/test_migrations_under_transaction_pool.py

  test_session_advisory_lock_persists_across_pgbouncer_client_disconnect  PASSED
  test_orphaned_lock_blocks_a_fresh_postgres_session                       PASSED
  test_reentrant_acquire_through_same_pgbouncer_is_not_a_counter_example  PASSED
  test_bootstrap_db_skips_lock_when_schema_already_at_head                 PASSED
  test_bootstrap_db_is_idempotent_once_schema_is_at_head                   PASSED

5 passed in 1.47s   (post-fix)
1 failed in 240.34s (pre-fix, hits pytest.mark.timeout — confirmed RED on main)

The mechanism tests document what PgBouncer does (lock orphaning is a real, named property of transaction-pool handoffs) so a future reader who sees "but pg_try_advisory_lock returned TRUE through PgBouncer!" understands the reentrance gotcha. The regression tests gate the actual fix.

Run with the docker-compose stack from tests/integration/fixtures/transaction_pool/docker-compose.yml.

OCP verification — L1 fix on real OpenShift + CrunchyData PGO (cluster: context-forge-test-pg1.cp.fyre.ibm.com)
Image:    docker.io/gandhipratik203/mcp-context-forge:4051-l1-fix-1777129453
Stack:    mcp-stack chart, default values + migration.enabled=false
Initial:  empty DB (DROP SCHEMA public CASCADE; CREATE SCHEMA public; …)

                                       PRE-FIX            POST-L1
                                       ───────            ───────
   Helm release status                 failed             deployed
   Pod restarts before Ready           7 each (21 total)  0
   Wall-clock to 3/3 Ready             ~13 min            56 sec
   Pods using fast-path                n/a                2 of 3, every worker
   Total visible retry messages        64+                28 (all confined to one pod's
                                                              worker-race during a
                                                              sub-second seed window)
   Advisory locks held at sample time  0 (after storm)    0 (no storm)

Per-pod log signature — post-L1
─────────────────────────────────
Pod 9jvtd  (won the seed race — slow path expected here):
  Empty DB detected:           1
  Acquired migration lock:     8
  Lock held by another:       28   ← within-pod race during ~1s seed window
  Schema already at head:      0
  Database ready:              8

Pods 2p9ls and jvb4t  (replicas 2 & 3):
  Empty DB detected:           0
  Acquired migration lock:     0   ← skipped the lock entirely
  Lock held by another:        0   ← no retries
  Schema already at head:      8   ← fast-path fired for every worker
  Database ready:              0

Full reproduction-vs-fix evidence in todo/issue-4051-ocp-reproduction.md.

OCP verification — L2 (chart-wired SKIP_MIGRATIONS) on real OpenShift
Image:    docker.io/gandhipratik203/mcp-context-forge:4051-l1-l2-fix-1777186436
Stack:    mcp-stack chart, DEFAULT values (migration.enabled=true), so the
          chart's pre-install Job runs the migration and the gateway
          Deployment is rendered with MCPGATEWAY_SKIP_MIGRATIONS=true.
Initial:  empty DB (DROP SCHEMA public CASCADE; CREATE SCHEMA public; …)

                                       PRE-FIX            POST-L1            POST-L2
                                       ───────            ───────            ───────
   Helm release status                 failed             deployed           deployed*
   Pod restarts before Ready           7 each (21 total)  0                  0
   Wall-clock to 3/3 Ready             ~13 min            56 sec             91 sec
   bootstrap_db activity in pods       very high          slow-path on 1     ZERO
   Advisory lock acquisitions in pods  many               8 (1 pod's race)   0
   Workers using fast-path             n/a                16 of 24           n/a (skipped)
   Workers using SKIP path             n/a                n/a                24 of 24
   Migration Job runs                  no (disabled)      no (disabled)      yes (Complete)

   * register-fast-time post-install Job failed; it's unrelated to the
     migration/skip story (registers the test fast-time-server via gateway
     API call). Doesn't affect L2 contract verification. Tracked separately.

Per-pod log signature — post-L2 (all 3 pods identical)
──────────────────────────────────────────────────────
  Skipping in-pod migration bootstrap:   8   ← one per gunicorn worker
  Acquired migration lock:               0
  Schema already at Alembic head:        0
  Lock held by another instance:         0
  Empty DB detected:                     0
  Database ready:                        0

DB state after pods Ready:
  Tables in public schema:               61
  alembic_version:                       d3e4f5a6b7c8 (at head)
  Advisory locks held:                   0

The 3-way pre-fix / post-L1 / post-L2 comparison is the headline win. The chart's contract — migration Job runs → app pods skip — is honored automatically; operators don't have to remember to set both migration.enabled AND MCPGATEWAY_SKIP_MIGRATIONS because the chart wires them together.

Full reproduction-vs-fix evidence in todo/issue-4051-ocp-reproduction.md.


Additional improvements

  • Idempotent post-migration bootstrap extracted to _run_post_migration_bootstrap() — admin-user creation, default-role seeding, and orphaned-resource assignment are now called from both the fast-path and slow-path branches. Previously these were only inside the advisory_lock block, so a fast-path skip would have missed them on a partial-startup edge case (replica 1 stamps head then crashes before bootstrapping the admin user → replica 2 fast-paths and runs the bootstrap steps to fill the gap).

  • The reproduction harness at tests/reproduction/issue-4051/docker-compose.repro.yml was promoted to a permanent test fixture under tests/integration/fixtures/transaction_pool/. Shell scripts (reproduce.sh, demonstrate_orphan.sh, README.md) are temporary; they will be removed before this PR merges, with their evidence preserved in the issue thread.

  • Module docstrings reference the issue number — both the helper functions and the integration test module carry an inline reference to issue #4051 ([BUG]: Alembic migration advisory lock hangs when multiple gateway pods start through PgBouncer (transaction pooling mode)) so a future reader greps once and lands in the right context.
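The partial-startup repair described in the first bullet can be sketched as a toy model, with a dict standing in for the database. All names are illustrative, and in the real code the slow-path half runs inside the advisory lock:

```python
import asyncio

async def _run_post_migration_bootstrap(db: dict) -> None:
    # Idempotent steps shared by BOTH branches: admin-user creation,
    # default-role seeding, orphaned-resource assignment.
    db.setdefault("admin_user", True)
    db.setdefault("default_roles", True)
    db.setdefault("orphans_assigned", True)

async def main(db: dict, run_migrations) -> str:
    if db.get("at_head", False):
        # Fast path: no lock, but the shared bootstrap still runs, so a
        # replica that stamped head then crashed before seeding is
        # repaired by the next caller.
        await _run_post_migration_bootstrap(db)
        return "fast-path"
    await run_migrations(db)  # real code holds the advisory lock here
    await _run_post_migration_bootstrap(db)
    return "slow-path"
```

Calling `main` on a database that is at head but missing the admin user (replica 1 crashed mid-bootstrap) fills the gap without ever touching the lock.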


Limitations

  1. Fast-path returns without emitting "Database ready" — the canonical readiness marker only fires on the slow-path return. Operators reading pod logs see either Database ready or Schema already at Alembic head; skipping migration lock. Both signal "bootstrap finished," but inconsistently. Captured as Gap 8 above; trivial follow-up.

  2. Layer 3 not included — the original plan considered a table-based mutex fallback for environments where migration_database_url (direct-Postgres bypass) isn't available AND the chart's pre-install Job is disabled. With L1+L2 in place, that combination is well-covered: L1 makes the in-pod path safe, L2 lets the chart guarantee the in-pod path is skipped when the Job runs. It can be reopened if a real-world report justifies it.

  3. DROP SCHEMA in PGO setups requires ownership restoration — observed during the OCP reproduction. CrunchyData PGO assigns schema ownership to a service user; a manual DROP SCHEMA public CASCADE; CREATE SCHEMA public; re-creates the schema owned by postgres, so the service user can't create tables in it until ownership is restored:

    ALTER SCHEMA public OWNER TO admin;
    GRANT ALL ON SCHEMA public TO admin;
    GRANT ALL ON SCHEMA public TO public;

    Documented in the OCP reproduction notes; not a code fix.

  4. Pod readiness can be misleading during pre-fix cycling — observed pods show 1/1 Ready even while several gunicorn workers are still in their retry loop. One worker's lifespan completing seems sufficient to satisfy the readiness probe. With L1+L2 the storm window is gone; this is captured here for operator awareness when reading historical pre-fix logs.


Future work

Compose-side L2 (deferred)

The root docker-compose.yml was left unchanged. It's currently safe thanks to L1 alone (the fast-path makes in-pod bootstrap pgbouncer-safe). Adding a dedicated migrate service in compose plus MCPGATEWAY_SKIP_MIGRATIONS=true on the gateway would mirror the chart's contract one-to-one — useful for users who want production-shaped semantics in dev. Tracked as a separate issue if there's appetite; deliberately out-of-scope here to keep docker-compose.yml touch-free for existing users.

Unify the readiness log line

Per Gap 8 above. Either:

  • emit "Database ready" from both paths (fast and slow), or
  • replace it with a single richer line carrying the path taken ("Database ready (fast-path)" / "Database ready (slow-path)").

Either change is a one-line edit; deferred so this PR stays scoped to the bug fix.

MIGRATION_DATABASE_URL (only if requested)

The Helm chart already exposes migration.hostKey / portKey to point the pre-install Job direct to Postgres (bypassing PgBouncer). A code-level MIGRATION_DATABASE_URL setting would let docker-compose users do the same without two DATABASE_URL definitions. Not necessary for the bug fix; deferred unless someone explicitly asks for it.


Closes #4051

Adds tests/integration/test_migrations_under_transaction_pool.py — three
@pytest.mark.integration assertions pinning the PgBouncer transaction-pool
behavior that makes bootstrap_db.main() hang when gateway replicas start
concurrently:

 - a disconnected pgbouncer client's advisory lock persists on Postgres,
 - a fresh Postgres session is blocked from acquiring the orphaned lock,
 - a same-backend reuse via pgbouncer is reentrant (documents the gotcha
   so future readers don't mistake TRUE for "no bug").

Tests are gated behind --with-integration so CI default runs are
unaffected. The fixture stack (postgres + pgbouncer with pool_mode=
transaction and default_pool_size=2) lives at
tests/integration/fixtures/transaction_pool/docker-compose.yml.

Also adds tests/reproduction/issue-4051/ with demonstrate_orphan.sh
(deterministic mechanism proof) and reproduce.sh (replica-scaling symptom
reproduction). These exist for PR-review evidence and local debugging
during development; they are expected to be removed when the fix PR
lands, with the evidence preserved in the issue thread.

No runtime changes.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Multiple gateway replicas starting concurrently through a transaction-
pooling connection pooler (PgBouncer pool_mode=transaction, pgcat, etc.)
can stall on bootstrap for the full retry window. A previous replica's
pg_advisory_lock
is session-scoped on its Postgres server backend; when the pooler returns
that backend to its pool across a transaction boundary without clearing
the lock, the next replica's pg_try_advisory_lock sees it held by an
orphaned session and spins through its retry loop until the ~10-minute
timeout fires.
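The pre-fix failure mode is just the shape of that loop. An illustrative stand-in, not the repo's actual retry code: `try_lock` here stands in for a pg_try_advisory_lock round-trip, and the constants approximate the ~10-minute window:

```python
import time


def acquire_with_retry(try_lock, timeout_s: float = 600, interval_s: float = 2.0) -> bool:
    """Poll try_lock until it reports acquired or the timeout fires.
    With the lock orphaned on a pooled backend, try_lock never returns
    True, so every contending replica burns the full timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if try_lock():
            return True
        time.sleep(interval_s)
    return False
```

An orphaned lock means `try_lock` is a constant False, and the caller only learns that after `timeout_s` has elapsed, which is the observed symptom.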

Add a fast-path in bootstrap_db.main() that probes alembic_version via
MigrationContext before attempting the advisory lock: when the DB is
already at the Alembic script directory's head, skip the lock entirely
and run only the idempotent post-migration bootstrap steps (team
visibility normalization, admin user, default roles, orphaned resource
assignment). Replicas 2..N of a multi-pod deployment take this branch on
every normal restart, so the advisory-lock orphan is never contended.

The lock is still acquired for the slow path (empty DB, version
mismatch, or any probe error), preserving the existing concurrent-
migration protection. Post-migration bootstrap was extracted into
_run_post_migration_bootstrap() so both paths share it.
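The probe's decision reduces to a heads comparison. A minimal pure-function sketch (the function name is hypothetical; in the real helper the inputs come from Alembic's MigrationContext.get_current_heads() and the script directory's heads, and any probe exception is caught upstream and mapped to False):

```python
def should_skip_migration_lock(current_heads: tuple, script_heads: tuple) -> bool:
    """Return True only when the DB is exactly at the script directory's
    head(s); every other state falls through to the locked slow path."""
    if not script_heads:    # defensive: script directory has zero heads
        return False
    if not current_heads:   # empty DB or missing alembic_version table
        return False
    return set(current_heads) == set(script_heads)


# At head: fast-path fires, the advisory lock is never touched.
print(should_skip_migration_lock(("abc123",), ("abc123",)))  # True
# Version mismatch: migrations actually need to run under the lock.
print(should_skip_migration_lock(("old456",), ("abc123",)))  # False
```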

Adds tests/integration/test_migrations_under_transaction_pool.py::
test_bootstrap_db_skips_lock_when_schema_already_at_head as a
regression gate: it synthesizes the held-lock precondition via a
distinct direct-to-postgres session and runs bootstrap through
pgbouncer. Pre-fix, the test times out after 240s; post-fix, it
completes in under a second.

Closes #4051

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Adds TestAlembicAtHead with five unit tests covering the bool outcomes
that decide whether main() skips the migration advisory lock:

 - at head → True (fast-path fires)
 - no alembic_version in DB → False (slow path handles empty/partial DBs)
 - head mismatch → False (migrations actually need to run)
 - script directory has zero heads → False (defensive)
 - any probe exception → False (fallback to slow path)

Direct coverage independent of the integration test; narrows the blast
radius of a future refactor to this one helper.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Adds test_bootstrap_db_is_idempotent_once_schema_is_at_head as an
additional invariant for the issue #4051 fast-path: after one successful
bootstrap, a subsequent call must NOT enter advisory_lock at all — it
should take the fast-path exclusively.

The test wraps bootstrap_db.advisory_lock with a spy and asserts zero
entries on the second invocation. This catches regressions where a
future refactor accidentally re-routes bootstrap through the lock even
when the schema is at head (the specific failure mode that would bring
issue #4051 back under transaction-pooling connection poolers).

Also asserts the second call completes under 3s and leaves no advisory
lock held.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
…otstrap

Adds a settings flag (`mcpgateway_skip_migrations`, default False) that
gates the gateway's lifespan call to bootstrap_db.main(). When set,
the gateway logs that it is yielding migration ownership to a separate
runner and skips the in-pod path entirely.

This is the operator-facing half of the issue #4051 fix. The Layer 1
fast-path (already on this branch) makes in-pod bootstrap safe under
transaction-pool connection poolers; this flag lets a Helm chart or
init-container deployment turn the in-pod path off entirely once a
dedicated migration runner has populated the schema. Library default
stays False so direct `docker run mcpgateway` continues to bootstrap
itself with no operator action.

Wires the gate via a small lifespan helper (`_run_initial_db_bootstrap`)
to keep the lifespan body unchanged and the gate trivially testable.
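A minimal sketch of that gate, assuming plain env parsing in place of the real settings machinery. The names here are stand-ins for `_run_initial_db_bootstrap` and the settings flag, and the log line is illustrative, not the gateway's actual message:

```python
import logging
import os
from typing import Callable, Optional

log = logging.getLogger("mcpgateway.bootstrap")


def _parse_bool(value: Optional[str], default: bool = False) -> bool:
    # Rough stand-in for pydantic-settings boolean coercion.
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


def run_initial_db_bootstrap(bootstrap_main: Callable[[], None]) -> bool:
    """Return True when the in-pod bootstrap ran, False when ownership
    was yielded to an external migration runner."""
    if _parse_bool(os.environ.get("MCPGATEWAY_SKIP_MIGRATIONS")):
        log.info("skip_migrations=true: yielding migration ownership to external runner")
        return False
    bootstrap_main()
    return True
```

Keeping the gate in a tiny helper like this is what makes the five tests below cheap: they only need to flip the env var and count calls.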

Tests:
 - settings.skip_migrations defaults False
 - MCPGATEWAY_SKIP_MIGRATIONS=true → True
 - MCPGATEWAY_SKIP_MIGRATIONS=false → False
 - lifespan helper invokes bootstrap_db when flag off
 - lifespan helper skips bootstrap_db AND logs an audit line when on
Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Wires the gateway Deployment env var introduced in the prior commit
into the chart so the pairing "migration Job runs → app pods skip" is
enforced at the chart layer rather than relying on operators to set
both flags consistently. The env entry is placed after extraEnv so a
user-supplied override cannot accidentally undo the contract.

Behavior:
  migration.enabled=true   → MCPGATEWAY_SKIP_MIGRATIONS="true"
  migration.enabled=false  → MCPGATEWAY_SKIP_MIGRATIONS="false"
  default values           → migration.enabled=true → SKIP="true"
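In template form the wiring amounts to ordering the chart-derived entry after extraEnv. An illustrative fragment (the actual template path, values keys, and indentation in charts/mcp-stack may differ):

```yaml
# Illustrative Deployment env block. The chart-derived entry comes after
# extraEnv so a duplicate user-supplied name cannot override the contract.
env:
  {{- with .Values.mcpgateway.extraEnv }}
  {{- toYaml . | nindent 12 }}
  {{- end }}
  - name: MCPGATEWAY_SKIP_MIGRATIONS
    value: {{ .Values.migration.enabled | quote }}
```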

Adds tests/unit/charts/test_deployment_mcpgateway.py with three
helm-template assertions pinning each branch. Tests skip automatically
when helm is not on PATH; no cluster needed. Issue #4051 Layer 2.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
@gandhipratik203 gandhipratik203 changed the title fix(bootstrap): skip Alembic advisory lock when schema is at head fix(bootstrap): improve startup reliability for multi-replica deploys Apr 26, 2026
Development

Successfully merging this pull request may close these issues.

[BUG]: Alembic migration advisory lock hangs when multiple gateway pods start through PgBouncer (transaction pooling mode)
