
fix(bootstrap): improve startup reliability for multi-replica deploys#4444

Draft
gandhipratik203 wants to merge 6 commits into main from
fix/4051-alembic-pgbouncer-transaction-pool

Conversation

@gandhipratik203

Summary

Fixes issue #4051 — Alembic migration advisory lock hangs when multiple gateway replicas start through PgBouncer in transaction-pooling mode. The session-scoped pg_advisory_lock is orphaned across PgBouncer's backend handoffs; the N-th gateway pod sees the lock held by an effectively-dead session and spins in its retry loop until the ~10-minute timeout fires.

This PR closes the bug along two layers and ships them together so the chart is safe under both deployment patterns:

  • Layer 1 (correctness) — bootstrap_db.main() now probes alembic_version before attempting the advisory lock. When the schema is at the script directory's head, the lock acquisition is skipped entirely. Replicas 2..N never participate in the race.
  • Layer 2 (operator ergonomics) — new MCPGATEWAY_SKIP_MIGRATIONS settings flag suppresses the in-pod bootstrap_db.main() call entirely. The mcp-stack Helm chart wires it from migration.enabled so the contract "the pre-install Job owns migrations, app pods skip" is enforced at the chart layer rather than left to operators to coordinate.

Library defaults preserve backward compatibility: a direct docker run mcpgateway:latest continues to bootstrap its own schema. Only the chart and (future) compose overlays opt into SKIP_MIGRATIONS=true.

Related: existing migration.hostKey: host / portKey: port knobs in charts/mcp-stack/profiles/ocp/values-pgo.yaml already let the migration Job bypass PgBouncer; that workaround stays in place but is no longer load-bearing.

Approach

The fix was built reproduction-first: rather than implement against the issue text alone, the bug was first reproduced deterministically in docker-compose (tests/reproduction/issue-4051/) and then again on a real OpenShift cluster against CrunchyData PGO + transaction-pool PgBouncer. Both reproductions are documented in todo/issue-4051-ocp-reproduction.md, and the compose-side artifact lives permanently in the repo as a test fixture.

For each layer (L1 fast-path, L2 chart-wired skip), the cycle was:

   1. write a failing test that captures the invariant ──────► RED
   2. implement just enough to make it pass ─────────────────► GREEN
   3. verify on the OCP cluster against pre-fix and post-fix
      images, side-by-side, on the same stack ──────────────► EVIDENCE
   4. commit

The regression test for the bug — test_bootstrap_db_skips_lock_when_schema_already_at_head — is reliably RED pre-fix (240s timeout, 29/60 retry attempts visible in logs) and reliably GREEN post-fix (~1.4s). It's not a flaky timing-based test: it synthesizes the held-lock condition deterministically by holding the advisory lock in a distinct, still-alive Postgres session for the duration of the second bootstrap_db.main() call. That guarantees the bootstrap's PgBouncer-side session sees pg_try_advisory_lock return FALSE regardless of PgBouncer's internal backend assignment — without that guarantee, PgBouncer can hand the bootstrap the same server backend that the holder used, and pg_try_advisory_lock succeeds via PostgreSQL's reentrant-within-session semantics, masking the bug. (We landed on this design after observing a passing test that should have failed; the false-positive is documented as test_reentrant_acquire_through_same_pgbouncer_is_not_a_counter_example so a future reader who hits the same gotcha understands what they're seeing.)


Gaps closed

HIGH — Bug correctness

Gap 1 (HIGH) — Advisory lock orphan via PgBouncer transaction-pool handoffs: bootstrap_db.advisory_lock always called pg_try_advisory_lock and retried up to 60 times with exponential backoff. When PgBouncer reused a backend whose previous client took the lock, new clients saw it held by an unreachable session and spun until timeout. Fixed with an _alembic_at_head probe that runs before the lock — when the DB is at the script-directory head, the lock is never attempted. The probe uses Alembic's MigrationContext.get_current_heads() and ScriptDirectory.from_config(cfg).get_heads() for the canonical source of truth; any error during probing falls through to the existing slow path so empty / partial / out-of-date databases still go through the explicit handling.
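The probe's decision logic can be sketched as a pure predicate. These helper names (`schema_at_head`, `probe_at_head`) are illustrative, not the PR's actual functions; in the real implementation the inputs come from Alembic's MigrationContext.get_current_heads() and ScriptDirectory.from_config(cfg).get_heads():

```python
from typing import Iterable

def schema_at_head(db_heads: Iterable[str], script_heads: Iterable[str]) -> bool:
    """True only when the DB's stamped revisions exactly match the script
    directory's heads; anything ambiguous must take the slow path."""
    db, script = set(db_heads), set(script_heads)
    # An empty alembic_version (fresh DB) or an empty script directory
    # never counts as "at head".
    return bool(script) and db == script

def probe_at_head(get_db_heads, get_script_heads) -> bool:
    """Fail closed: any probe error falls through to the advisory-lock
    slow path instead of skipping the lock."""
    try:
        return schema_at_head(get_db_heads(), get_script_heads())
    except Exception:
        return False
```

The fail-closed wrapper is the important design point: only a positive, error-free "exactly at head" answer may skip the lock.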

Gap 2 (HIGH) — Worker-level fan-out under-stated in the issue text: each gateway pod runs N gunicorn workers (default 8 in the OCP profile), and each worker calls bootstrap_db.main() in its own lifespan. So a 3-replica deployment produces 24 racing acquirers, not 3. Reproduction on OpenShift confirmed the fan-out: pre-fix, the same pod's logs show interleaved Acquired lock on attempt N and Lock held by another instance, attempt N/60 at the same timestamps — different workers in the same gunicorn process. The L1 fast-path eliminates the race for replicas 2..N entirely; for replica 1 the within-pod race is bounded to a sub-second seed window.

Gap 3 (HIGH) — Chart safety today depends on the OCP-only workaround: migration.hostKey: host / portKey: port route the pre-install Job direct to Postgres (bypassing PgBouncer), masking the bug whenever the Job is enabled. A community user setting migration.enabled=false (because they manage migrations externally, or run the chart in compose-flavoured contexts) loses that protection. Layer 1 makes migration.enabled=false safe.

HIGH — Operator ergonomics

Gap 4 (HIGH) — In-pod bootstrap runs even when a dedicated migration runner has already populated the schema: redundant probe activity per worker, plus the within-pod race for replica 1's seed window. Fixed with MCPGATEWAY_SKIP_MIGRATIONS=true which short-circuits the lifespan call to bootstrap_db.main() and emits a single audit-log line so operators can see the choice in pod logs.
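A minimal sketch of the lifespan gate, assuming env-var parsing as a stand-in for the real settings field (function names here are hypothetical):

```python
import asyncio
import logging
import os

logger = logging.getLogger("mcpgateway.bootstrap")

def _skip_migrations() -> bool:
    # Stand-in for the settings field: default False; only the literal
    # "true" (case-insensitive) flips it on -- no accidental coercion.
    return os.environ.get("MCPGATEWAY_SKIP_MIGRATIONS", "false").strip().lower() == "true"

async def run_initial_db_bootstrap(bootstrap) -> bool:
    """Return True when in-pod bootstrap ran, False when it was skipped."""
    if _skip_migrations():
        # The single audit-log line operators see in pod logs.
        logger.info("Skipping in-pod migration bootstrap (MCPGATEWAY_SKIP_MIGRATIONS=true)")
        return False
    await bootstrap()
    return True
```

With the flag unset, a plain `docker run` pod still bootstraps its own schema; the chart flips the flag only when its pre-install Job owns migrations.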

Gap 5 (HIGH) — No chart wiring for the SKIP/Job pairing: even if the env var existed, expecting operators to set both migration.enabled=true AND MCPGATEWAY_SKIP_MIGRATIONS=true is a foot-gun (forgetting one means redundant in-pod bootstrap; forgetting the other means schema never populated). Fixed in charts/mcp-stack/templates/deployment-mcpgateway.yaml: the env var is rendered as {{ ternary "true" "false" .Values.migration.enabled }} and placed after extraEnv so a user override cannot accidentally undo the contract.

MEDIUM — Test infrastructure

Gap 6 (MEDIUM) — No regression test for the advisory-lock invariant: no existing test exercised concurrent bootstrap_db.main() callers through PgBouncer. Fixed with tests/integration/test_migrations_under_transaction_pool.py containing five @pytest.mark.integration tests: three pinning the PgBouncer mechanism (so the substrate behavior is captured for posterity even if PgBouncer changes), one regression gate (synthesizes the orphaned-lock condition and asserts bootstrap_db.main() returns within 10s — fails in 240s pre-fix), and one idempotency invariant (asserts a second call after schema is at head never enters the advisory_lock context manager). Tests skip cleanly without --with-integration.

Gap 7 (MEDIUM) — No reproduction artifact in the repo: future debuggers had nothing to anchor on. Fixed with tests/reproduction/issue-4051/ containing a docker-compose stack (postgres + pgbouncer + gateway), a deterministic demonstrate_orphan.sh (proves the PgBouncer mechanism in 3 psql calls), and a reproduce.sh (scales gateway replicas through PgBouncer and reports per-pod hangs). The compose file lives at tests/integration/fixtures/transaction_pool/ so the integration tests can drive the same stack. The shell scripts are slated for deletion before this PR merges; their evidence will be preserved in the issue thread.

LOW — Documentation and observability

Gap 8 (LOW) — Two readiness markers, one log-line: the L1 fast-path returns without emitting "Database ready" (only the slow-path return logs it). Operators reading pod logs see either the slow-path marker or the fast-path skip line, not a single canonical "bootstrap finished" line. Tracked as a tiny follow-up; not blocking.


Architecture

Before — every replica × every worker races for the lock

                        DB schema starts populated (alembic_version at head)
                                                │
                                                ▼
   ┌────────── pod 1 (8 gunicorn workers) ──────────┐
   │  W1 → SELECT pg_try_advisory_lock(42…42) → t   │ ─┐
   │  W2 → SELECT pg_try_advisory_lock(42…42) → f   │ ─┤
   │  W3 → SELECT pg_try_advisory_lock(42…42) → f   │ ─┤
   │  …                                              │  │
   └─────────────────────────────────────────────────┘  │
                                                        │ all 24 workers
   ┌────────── pod 2 (8 gunicorn workers) ──────────┐  │ converge through
   │  W1..W8 → all 'f', spin in retry loop          │ ─┤ the same pgbouncer
   └─────────────────────────────────────────────────┘  │ pool
                                                        │
   ┌────────── pod 3 (8 gunicorn workers) ──────────┐  │
   │  W1..W8 → all 'f', spin in retry loop          │ ─┘
   └─────────────────────────────────────────────────┘
                                                        │
                                                        ▼
                        ┌──────────────────────────────────────────┐
                        │ PgBouncer (transaction pool)             │
                        │   each transaction commit → backend      │
                        │   handed to next client                  │
                        │   advisory lock outlives client session  │
                        └──────────────────────────────────────────┘
                                                        │
                                                        ▼
                        ┌──────────────────────────────────────────┐
                        │ Postgres                                  │
                        │   1 backend granted advisory lock         │
                        │   N–1 backends "idle in transaction" on   │
                        │   pg_try_advisory_lock(42424242424242)    │
                        └──────────────────────────────────────────┘

   Observable on OCP (issue reporter + our reproduction):
     · pods cycle 0/1 Running → CrashLoopBackOff → … → Ready over ~13 min
     · 7 restarts per pod (×3 = 21 restart events)
     · 60+ visible "Lock held by another instance" retries per restart
     · helm release status: failed

After Layer 1 — fast-path before the lock

   ┌────────── pod 1 worker ─────────────────────────────────────┐
   │  bootstrap_db.main()                                         │
   │    ┌─ probe (read-only) ─────────────────────────────────┐  │
   │    │  alembic_version == script_dir.get_heads() ?        │  │
   │    │       │                                              │  │
   │    │       ├── YES → return early, no lock involved      │  │
   │    │       │                                              │  │
   │    │       └── NO → fall through to advisory_lock path   │  │
   │    │              (existing slow path, unchanged)         │  │
   │    └─────────────────────────────────────────────────────┘  │
   └──────────────────────────────────────────────────────────────┘

   On a fresh deploy (empty DB), only the FIRST worker that wins the
   lock runs the slow path; replicas 2..N probe AFTER the seed and
   skip the lock entirely.

   Observed on OCP (verified end-to-end on context-forge-test-pg1):
     · pod 1 (seed):    0 retries → completes in ~1s
     · pods 2 & 3:      0 retries, 8 fast-path firings each
     · all 3 pods 1/1 Ready in 56 seconds, 0 restarts
     · helm release status: deployed

After Layer 2 — pods don't even probe when the Job owns it

                                                       Helm pre-install Job
                                                       ─────────────────────
                                                       direct to postgres :5432
                                                       runs bootstrap_db,
                                                       exits 0
                                                              │
                                                              ▼
                                              alembic_version at head
                                                              │
                                                              ▼
   ┌────────── gateway Deployment (chart-rendered) ────────────┐
   │  env:                                                      │
   │    - MCPGATEWAY_SKIP_MIGRATIONS: "true"   ← from           │
   │                                              .Values.      │
   │                                              migration.    │
   │                                              enabled       │
   │                                                            │
   │  lifespan:                                                 │
   │    wait_for_db_ready()                                     │
   │    if settings.mcpgateway_skip_migrations:                 │
   │        logger.info("Skipping in-pod migration              │
   │                     bootstrap (MCPGATEWAY_                 │
   │                     SKIP_MIGRATIONS=true)…")               │
   │        return                                              │
   │    else:                                                   │
   │        await bootstrap_db.main()                           │
   └────────────────────────────────────────────────────────────┘

   No probe. No lock. No worker fan-out.
   Single audit-log line per pod so operators can see the choice.

Test results

The change ships with three layers of verification, each addressing a different failure mode:

   ┌──────────────────────────────────────────────────────────────────────┐
   │ Layer            │ Catches                                            │
   ├──────────────────┼────────────────────────────────────────────────────┤
   │ Unit (13 tests)  │ Fast-path predicate logic; settings field          │
   │                  │ defaults; lifespan helper gate; chart render       │
   │                  │ outputs the right env var.                         │
   │                  │                                                    │
   │                  │ A future refactor that breaks ANY of these gets a  │
   │                  │ red unit test in seconds.                          │
   ├──────────────────┼────────────────────────────────────────────────────┤
   │ Integration      │ The actual bug surface: bootstrap_db running       │
   │ (5 tests)        │ against a real Postgres + PgBouncer stack with     │
   │                  │ the held-lock precondition synthesized.            │
   │                  │                                                    │
   │                  │ Reliably RED on main (240s timeout). Reliably      │
   │                  │ GREEN with the L1 fast-path. Plus 3 mechanism      │
   │                  │ tests pinning PgBouncer's transaction-pool         │
   │                  │ semantics for documentation-as-code.               │
   ├──────────────────┼────────────────────────────────────────────────────┤
   │ Cluster smoke    │ End-to-end on OpenShift with CrunchyData PGO +     │
   │ (manual)         │ transaction-pool PgBouncer. Same image, same       │
   │                  │ stack, only the gateway code differs between       │
   │                  │ pre- and post-fix runs. Concrete numbers below.    │
   └──────────────────────────────────────────────────────────────────────┘

The integration tests sit behind --with-integration; the unit tests run on every CI pass. The cluster smoke is documented evidence, not automated, and lives as todo/issue-4051-ocp-reproduction.md for the PR record.

Reproduction harness — how the bug was demonstrated end-to-end

Two artifacts live under tests/reproduction/issue-4051/. They will be removed before this PR merges (their evidence preserved in the issue thread):

   demonstrate_orphan.sh   ── deterministic mechanism proof (3 psql calls)
                              · step 1: through PgBouncer, take advisory lock,
                                disconnect
                              · step 2: directly query pg_locks → lock STILL
                                held by orphaned backend
                              · step 3: from a fresh Postgres session,
                                pg_try_advisory_lock → FALSE
                              always passes; documents PgBouncer behavior

   reproduce.sh            ── replica-scaling symptom reproduction
                              · scales 3 gateway replicas through PgBouncer
                              · watches each pod's logs for the retry storm
                              · dumps pg_locks + pg_stat_activity on hang
                              fails on pre-fix images, succeeds on post-fix

The compose stack itself was promoted to tests/integration/fixtures/transaction_pool/docker-compose.yml so the integration test can drive the same infrastructure that the shell scripts do. PgBouncer's default_pool_size is deliberately small (2) so total client connections comfortably exceed it, forcing backend multiplexing — without that pressure, the bug was hard to trigger locally because bootstrap can finish before any backend handoff happens.
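The two pooler settings that matter can be illustrated with a fragment in PgBouncer's ini format. This is a sketch, not the fixture's actual file; host and database names are placeholders:

```ini
; Illustrative fragment only -- the real fixture lives under
; tests/integration/fixtures/transaction_pool/; names below are placeholders.
[databases]
mcpgateway = host=postgres port=5432 dbname=mcpgateway

[pgbouncer]
listen_port = 6432
; Backend is handed to the next client at every transaction boundary:
pool_mode = transaction
; Deliberately tiny so client connections far exceed server backends,
; forcing the backend multiplexing the bug needs:
default_pool_size = 2
```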

Why two harnesses for one bug? demonstrate_orphan.sh proves the PgBouncer mechanism exists (orphaned advisory locks); reproduce.sh proves the operational symptom exists (replicas hang). Together they answer two questions any reviewer might ask: "is this a real PgBouncer property?" and "does it actually cause user-visible breakage?" — without making us guess timing.

Why the regression test is deterministic (not flaky)

Three ideas had to land before test_bootstrap_db_skips_lock_when_schema_already_at_head was reliably RED on main:

   ❌ First attempt — race the bootstrap calls
       Open 3 concurrent gateway processes, hope one wins the lock and
       the others spin. Worked sometimes (lucky timing) → fragile.

   ❌ Second attempt — synthesize an orphan via PgBouncer
       Take the lock through PgBouncer, disconnect, then call
       bootstrap_db.main() through PgBouncer. PgBouncer reused the same
       server backend for the bootstrap → pg_try_advisory_lock returned
       TRUE (reentrant within the same Postgres session) → FALSE
       POSITIVE (test passed when bug was present).

   ✅ Final design — hold the lock in a distinct postgres session
       Open a direct postgres connection (NOT through PgBouncer), take
       the advisory lock, KEEP the connection alive. Now run
       bootstrap_db.main() through PgBouncer. The bootstrap's session
       is guaranteed different from the holder's session, so
       pg_try_advisory_lock MUST return FALSE → bootstrap retries →
       240s pytest.mark.timeout fires. RED, every time.

The middle attempt's false-positive is captured as a permanent test (test_reentrant_acquire_through_same_pgbouncer_is_not_a_counter_example) so future readers who see "but I called pg_try_advisory_lock through PgBouncer and it returned TRUE!" understand the reentrant-session gotcha.
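The reentrance gotcha can be shown with a toy model in plain Python — not real Postgres or PgBouncer, just a session-scoped lock whose try-acquire is reentrant within the holding backend session:

```python
class AdvisoryLockModel:
    """Toy model of a session-scoped advisory lock: ownership is tracked
    per backend session id, and try-acquire is reentrant within it."""
    def __init__(self):
        self.holder = None  # backend session id currently holding the lock

    def try_lock(self, session_id: int) -> bool:
        if self.holder is None or self.holder == session_id:
            self.holder = session_id  # reentrant within the same session
            return True
        return False

lock = AdvisoryLockModel()

# Second attempt (false positive): the holder took the lock through
# pgbouncer on backend 1 and disconnected; pgbouncer later hands the SAME
# backend to the bootstrap, so the reentrant acquire succeeds and masks
# the bug.
assert lock.try_lock(session_id=1)      # holder's transaction, backend 1
assert lock.try_lock(session_id=1)      # bootstrap reuses backend 1: TRUE

# Final design (deterministic): the holder keeps a DIRECT postgres
# connection alive on backend 1; bootstrap comes through pgbouncer on a
# guaranteed-different backend, so try-acquire MUST fail.
assert not lock.try_lock(session_id=2)  # bootstrap on backend 2: FALSE
```

The model makes the invariant explicit: determinism comes from pinning the holder to a distinct, still-alive session, not from timing.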

The same logic underpins the _clean_orphaned_lock autouse fixture: it runs pg_terminate_backend on any session still holding the test lock id between tests, so a single failed test cannot wedge the next one behind a leaked orphan.

Unit — `_alembic_at_head` truth table (5 tests)
tests/unit/mcpgateway/test_bootstrap_db.py::TestAlembicAtHead

  test_returns_true_when_db_heads_match_script_heads          PASSED
  test_returns_false_when_db_has_no_alembic_version           PASSED
  test_returns_false_when_db_head_does_not_match_script_head  PASSED
  test_returns_false_when_script_directory_has_no_heads       PASSED
  test_returns_false_on_probe_exception                       PASSED

5 passed

Pins each branch of the fast-path predicate independently of the integration test, so a future refactor that widens or narrows the trigger condition is caught locally.

Unit — `MCPGATEWAY_SKIP_MIGRATIONS` settings field (3 tests)
tests/unit/mcpgateway/test_config.py

  test_skip_migrations_defaults_to_false        PASSED
  test_skip_migrations_env_true_flips_flag      PASSED
  test_skip_migrations_env_false_keeps_flag_off PASSED

3 passed

Library default stays False (the docker run mcpgateway happy path), env var override flips True, explicit False matches the default. No accidental coercion.

Unit — lifespan helper respects the flag (2 tests)
tests/unit/mcpgateway/test_main.py

  test_run_initial_db_bootstrap_invokes_bootstrap_when_flag_off  PASSED
  test_run_initial_db_bootstrap_skips_when_flag_on               PASSED

2 passed

The skip-case test also asserts the audit-log line is emitted so operators can see the choice in pod logs.

Unit — Helm chart render (3 tests)
tests/unit/charts/test_deployment_mcpgateway.py

  test_skip_migrations_set_to_true_when_migration_job_enabled        PASSED
  test_skip_migrations_off_when_migration_job_disabled               PASSED
  test_skip_migrations_default_matches_migration_enabled_default     PASSED

3 passed

Tests run helm template as a subprocess, parse the rendered manifests, and assert on the gateway Deployment's env block. Skipped automatically when helm is not on PATH; no cluster needed.

Integration — PgBouncer mechanism + bootstrap regression (5 tests, gated by `--with-integration`)
tests/integration/test_migrations_under_transaction_pool.py

  test_session_advisory_lock_persists_across_pgbouncer_client_disconnect  PASSED
  test_orphaned_lock_blocks_a_fresh_postgres_session                       PASSED
  test_reentrant_acquire_through_same_pgbouncer_is_not_a_counter_example  PASSED
  test_bootstrap_db_skips_lock_when_schema_already_at_head                 PASSED
  test_bootstrap_db_is_idempotent_once_schema_is_at_head                   PASSED

5 passed in 1.47s   (post-fix)
1 failed in 240.34s (pre-fix, hits pytest.mark.timeout — confirmed RED on main)

The mechanism tests document what PgBouncer does (lock orphaning is a real, named property of transaction-pool handoffs) so a future reader who sees "but pg_try_advisory_lock returned TRUE through PgBouncer!" understands the reentrance gotcha. The regression tests gate the actual fix.

Run with the docker-compose stack from tests/integration/fixtures/transaction_pool/docker-compose.yml.

OCP verification — L1 fix on real OpenShift + CrunchyData PGO (cluster: context-forge-test-pg1.cp.fyre.ibm.com)
Image:    docker.io/gandhipratik203/mcp-context-forge:4051-l1-fix-1777129453
Stack:    mcp-stack chart, default values + migration.enabled=false
Initial:  empty DB (DROP SCHEMA public CASCADE; CREATE SCHEMA public; …)

                                       PRE-FIX            POST-L1
                                       ───────            ───────
   Helm release status                 failed             deployed
   Pod restarts before Ready           7 each (21 total)  0
   Wall-clock to 3/3 Ready             ~13 min            56 sec
   Pods using fast-path                n/a                2 of 3, every worker
   Total visible retry messages        64+                28 (all confined to one pod's
                                                              worker-race during a
                                                              sub-second seed window)
   Advisory locks held at sample time  0 (after storm)    0 (no storm)

Per-pod log signature — post-L1
─────────────────────────────────
Pod 9jvtd  (won the seed race — slow path expected here):
  Empty DB detected:           1
  Acquired migration lock:     8
  Lock held by another:       28   ← within-pod race during ~1s seed window
  Schema already at head:      0
  Database ready:              8

Pods 2p9ls and jvb4t  (replicas 2 & 3):
  Empty DB detected:           0
  Acquired migration lock:     0   ← skipped the lock entirely
  Lock held by another:        0   ← no retries
  Schema already at head:      8   ← fast-path fired for every worker
  Database ready:              0

Full reproduction-vs-fix evidence in todo/issue-4051-ocp-reproduction.md.

OCP verification — L2 (chart-wired SKIP_MIGRATIONS) on real OpenShift
Image:    docker.io/gandhipratik203/mcp-context-forge:4051-l1-l2-fix-1777186436
Stack:    mcp-stack chart, DEFAULT values (migration.enabled=true), so the
          chart's pre-install Job runs the migration and the gateway
          Deployment is rendered with MCPGATEWAY_SKIP_MIGRATIONS=true.
Initial:  empty DB (DROP SCHEMA public CASCADE; CREATE SCHEMA public; …)

                                       PRE-FIX            POST-L1            POST-L2
                                       ───────            ───────            ───────
   Helm release status                 failed             deployed           deployed*
   Pod restarts before Ready           7 each (21 total)  0                  0
   Wall-clock to 3/3 Ready             ~13 min            56 sec             91 sec
   bootstrap_db activity in pods       very high          slow-path on 1     ZERO
   Advisory lock acquisitions in pods  many               8 (1 pod's race)   0
   Workers using fast-path             n/a                16 of 24           n/a (skipped)
   Workers using SKIP path             n/a                n/a                24 of 24
   Migration Job runs                  no (disabled)      no (disabled)      yes (Complete)

   * register-fast-time post-install Job failed; it's unrelated to the
     migration/skip story (registers the test fast-time-server via gateway
     API call). Doesn't affect L2 contract verification. Tracked separately.

Per-pod log signature — post-L2 (all 3 pods identical)
──────────────────────────────────────────────────────
  Skipping in-pod migration bootstrap:   8   ← one per gunicorn worker
  Acquired migration lock:               0
  Schema already at Alembic head:        0
  Lock held by another instance:         0
  Empty DB detected:                     0
  Database ready:                        0

DB state after pods Ready:
  Tables in public schema:               61
  alembic_version:                       d3e4f5a6b7c8 (at head)
  Advisory locks held:                   0

The 3-way pre-fix / post-L1 / post-L2 comparison is the headline win. The chart's contract — migration Job runs → app pods skip — is honored automatically; operators don't have to remember to set both migration.enabled AND MCPGATEWAY_SKIP_MIGRATIONS because the chart wires them together.

Full reproduction-vs-fix evidence in todo/issue-4051-ocp-reproduction.md.


Additional improvements

  • Idempotent post-migration bootstrap extracted to _run_post_migration_bootstrap() — admin-user creation, default-role seeding, and orphaned-resource assignment are now called from both the fast-path and slow-path branches. Previously these were only inside the advisory_lock block, so a fast-path skip would have missed them on a partial-startup edge case (replica 1 stamps head then crashes before bootstrapping the admin user → replica 2 fast-paths and runs the bootstrap steps to fill the gap).

  • The reproduction harness at tests/reproduction/issue-4051/docker-compose.repro.yml was promoted to a permanent test fixture under tests/integration/fixtures/transaction_pool/. Shell scripts (reproduce.sh, demonstrate_orphan.sh, README.md) are temporary; they will be removed before this PR merges, with their evidence preserved in the issue thread.

  • Module docstrings reference the issue number — both the helper functions and the integration test module carry an inline reference to issue #4051 ([BUG]: Alembic migration advisory lock hangs when multiple gateway pods start through PgBouncer (transaction pooling mode)) so a future reader greps once and lands in the right context.
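The partial-startup repair described in the first bullet can be sketched as a toy model, with a dict standing in for the database. All names are illustrative, and in the real code the slow-path half runs inside the advisory lock:

```python
import asyncio

async def _run_post_migration_bootstrap(db: dict) -> None:
    # Idempotent steps shared by BOTH branches: admin-user creation,
    # default-role seeding, orphaned-resource assignment.
    db.setdefault("admin_user", True)
    db.setdefault("default_roles", True)
    db.setdefault("orphans_assigned", True)

async def main(db: dict, run_migrations) -> str:
    if db.get("at_head", False):
        # Fast path: no lock, but the shared bootstrap still runs, so a
        # replica that stamped head then crashed before seeding is
        # repaired by the next caller.
        await _run_post_migration_bootstrap(db)
        return "fast-path"
    await run_migrations(db)  # real code holds the advisory lock here
    await _run_post_migration_bootstrap(db)
    return "slow-path"
```

Calling `main` on a database that is at head but missing the admin user (replica 1 crashed mid-bootstrap) fills the gap without ever touching the lock.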


Limitations

  1. Fast-path returns without emitting "Database ready" — the canonical readiness marker only fires on the slow-path return. Operators reading pod logs see either Database ready or Schema already at Alembic head; skipping migration lock. Both signal "bootstrap finished," but inconsistently. Captured as Gap 8 above; trivial follow-up.

  2. Layer 3 not included — the original plan considered a table-based mutex fallback for environments where migration_database_url (direct-Postgres bypass) isn't available AND the chart's pre-install Job is disabled. With L1+L2 in place, that combination is well-covered: L1 makes the in-pod path safe, L2 lets the chart guarantee the in-pod path is skipped when the Job runs. It can be reopened if a real-world report justifies it.

  3. DROP SCHEMA in PGO setups requires ownership restoration — observed during the OCP reproduction. CrunchyData PGO assigns schema ownership to a service user; a manual DROP SCHEMA public CASCADE; CREATE SCHEMA public; re-creates the schema owned by postgres, so the service user can't create tables in it until ownership is restored:

    ALTER SCHEMA public OWNER TO admin;
    GRANT ALL ON SCHEMA public TO admin;
    GRANT ALL ON SCHEMA public TO public;

    Documented in the OCP reproduction notes; not a code fix.

  4. Pod readiness can be misleading during pre-fix cycling — observed pods show 1/1 Ready even while several gunicorn workers are still in their retry loop. One worker's lifespan completing seems sufficient to satisfy the readiness probe. With L1+L2 the storm window is gone; this is captured here for operator awareness when reading historical pre-fix logs.


Future work

Compose-side L2 (deferred)

The root docker-compose.yml was left unchanged. It's currently safe thanks to L1 alone (the fast-path makes in-pod bootstrap pgbouncer-safe). Adding a dedicated migrate service in compose plus MCPGATEWAY_SKIP_MIGRATIONS=true on the gateway would mirror the chart's contract one-to-one — useful for users who want production-shaped semantics in dev. Tracked as a separate issue if there's appetite; deliberately out-of-scope here to keep docker-compose.yml touch-free for existing users.

Unify the readiness log line

Per Gap 8 above. Either:

  • emit "Database ready" from both paths (fast and slow), or
  • replace it with a single richer line carrying the path taken ("Database ready (fast-path)" / "Database ready (slow-path)").

Either change is a one-line edit; deferred so this PR stays scoped to the bug fix.

MIGRATION_DATABASE_URL (only if requested)

The Helm chart already exposes migration.hostKey / portKey to point the pre-install Job direct to Postgres (bypassing PgBouncer). A code-level MIGRATION_DATABASE_URL setting would let docker-compose users do the same without two DATABASE_URL definitions. Not necessary for the bug fix; deferred unless someone explicitly asks for it.


Closes #4051

Adds tests/integration/test_migrations_under_transaction_pool.py — three
@pytest.mark.integration assertions pinning the PgBouncer transaction-pool
behavior that makes bootstrap_db.main() hang when gateway replicas start
concurrently:

 - a disconnected pgbouncer client's advisory lock persists on Postgres,
 - a fresh Postgres session is blocked from acquiring the orphaned lock,
 - a same-backend reuse via pgbouncer is reentrant (documents the gotcha
   so future readers don't mistake TRUE for "no bug").

Tests are gated behind --with-integration so CI default runs are
unaffected. The fixture stack (postgres + pgbouncer with pool_mode=
transaction and default_pool_size=2) lives at
tests/integration/fixtures/transaction_pool/docker-compose.yml.

Also adds tests/reproduction/issue-4051/ with demonstrate_orphan.sh
(deterministic mechanism proof) and reproduce.sh (replica-scaling symptom
reproduction). These exist for PR-review evidence and local debugging
during development; they are expected to be removed when the fix PR
lands, with the evidence preserved in the issue thread.

No runtime changes.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Multiple gateway replicas starting concurrently through a transaction-
pooling connection pooler (PgBouncer pool_mode=transaction, pgcat, etc.)
can stall on bootstrap for the full retry window. A previous replica's
pg_advisory_lock
is session-scoped on its Postgres server backend; when the pooler returns
that backend to its pool across a transaction boundary without clearing
the lock, the next replica's pg_try_advisory_lock sees it held by an
orphaned session and spins through its retry loop until the ~10-minute
timeout fires.
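The pre-fix failure mode is just the shape of that loop. An illustrative stand-in, not the repo's actual retry code: `try_lock` here stands in for a pg_try_advisory_lock round-trip, and the constants approximate the ~10-minute window:

```python
import time


def acquire_with_retry(try_lock, timeout_s: float = 600, interval_s: float = 2.0) -> bool:
    """Poll try_lock until it reports acquired or the timeout fires.
    With the lock orphaned on a pooled backend, try_lock never returns
    True, so every contending replica burns the full timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if try_lock():
            return True
        time.sleep(interval_s)
    return False
```

An orphaned lock means `try_lock` is a constant False, and the caller only learns that after `timeout_s` has elapsed, which is the observed symptom.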

Add a fast-path in bootstrap_db.main() that probes alembic_version via
MigrationContext before attempting the advisory lock: when the DB is
already at the Alembic script directory's head, skip the lock entirely
and run only the idempotent post-migration bootstrap steps (team
visibility normalization, admin user, default roles, orphaned resource
assignment). Replicas 2..N of a multi-pod deployment take this branch on
every normal restart, so the advisory-lock orphan is never contended.

The lock is still acquired for the slow path (empty DB, version
mismatch, or any probe error), preserving the existing concurrent-
migration protection. Post-migration bootstrap was extracted into
_run_post_migration_bootstrap() so both paths share it.
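The probe's decision reduces to a heads comparison. A minimal pure-function sketch (the function name is hypothetical; in the real helper the inputs come from Alembic's MigrationContext.get_current_heads() and the script directory's heads, and any probe exception is caught upstream and mapped to False):

```python
def should_skip_migration_lock(current_heads: tuple, script_heads: tuple) -> bool:
    """Return True only when the DB is exactly at the script directory's
    head(s); every other state falls through to the locked slow path."""
    if not script_heads:    # defensive: script directory has zero heads
        return False
    if not current_heads:   # empty DB or missing alembic_version table
        return False
    return set(current_heads) == set(script_heads)


# At head: fast-path fires, the advisory lock is never touched.
print(should_skip_migration_lock(("abc123",), ("abc123",)))  # True
# Version mismatch: migrations actually need to run under the lock.
print(should_skip_migration_lock(("old456",), ("abc123",)))  # False
```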

Adds tests/integration/test_migrations_under_transaction_pool.py::
test_bootstrap_db_skips_lock_when_schema_already_at_head as a
regression gate: it synthesizes the held-lock precondition via a
distinct direct-to-postgres session and runs bootstrap through
pgbouncer. Pre-fix, the test times out after 240s; post-fix, it
completes in under a second.

Closes #4051

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Adds TestAlembicAtHead with five unit tests covering the bool outcomes
that decide whether main() skips the migration advisory lock:

 - at head → True (fast-path fires)
 - no alembic_version in DB → False (slow path handles empty/partial DBs)
 - head mismatch → False (migrations actually need to run)
 - script directory has zero heads → False (defensive)
 - any probe exception → False (fallback to slow path)

Direct coverage independent of the integration test; narrows the blast
radius of a future refactor to this one helper.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Adds test_bootstrap_db_is_idempotent_once_schema_is_at_head as an
additional invariant for the issue #4051 fast-path: after one successful
bootstrap, a subsequent call must NOT enter advisory_lock at all — it
should take the fast-path exclusively.

The test wraps bootstrap_db.advisory_lock with a spy and asserts zero
entries on the second invocation. This catches regressions where a
future refactor accidentally re-routes bootstrap through the lock even
when the schema is at head (the specific failure mode that would bring
issue #4051 back under transaction-pooling connection poolers).

Also asserts the second call completes under 3s and leaves no advisory
lock held.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
…otstrap

Adds a settings flag (`mcpgateway_skip_migrations`, default False) that
gates the gateway's lifespan call to bootstrap_db.main(). When set,
the gateway logs that it is yielding migration ownership to a separate
runner and skips the in-pod path entirely.

This is the operator-facing half of the issue #4051 fix. The Layer 1
fast-path (already on this branch) makes in-pod bootstrap safe under
transaction-pool connection poolers; this flag lets a Helm chart or
init-container deployment turn the in-pod path off entirely once a
dedicated migration runner has populated the schema. Library default
stays False so direct `docker run mcpgateway` continues to bootstrap
itself with no operator action.

Wires the gate via a small lifespan helper (`_run_initial_db_bootstrap`)
to keep the lifespan body unchanged and the gate trivially testable.
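A minimal sketch of that gate, assuming plain env parsing in place of the real settings machinery. The names here are stand-ins for `_run_initial_db_bootstrap` and the settings flag, and the log line is illustrative, not the gateway's actual message:

```python
import logging
import os
from typing import Callable, Optional

log = logging.getLogger("mcpgateway.bootstrap")


def _parse_bool(value: Optional[str], default: bool = False) -> bool:
    # Rough stand-in for pydantic-settings boolean coercion.
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


def run_initial_db_bootstrap(bootstrap_main: Callable[[], None]) -> bool:
    """Return True when the in-pod bootstrap ran, False when ownership
    was yielded to an external migration runner."""
    if _parse_bool(os.environ.get("MCPGATEWAY_SKIP_MIGRATIONS")):
        log.info("skip_migrations=true: yielding migration ownership to external runner")
        return False
    bootstrap_main()
    return True
```

Keeping the gate in a tiny helper like this is what makes the five tests below cheap: they only need to flip the env var and count calls.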

Tests:
 - settings.skip_migrations defaults False
 - MCPGATEWAY_SKIP_MIGRATIONS=true → True
 - MCPGATEWAY_SKIP_MIGRATIONS=false → False
 - lifespan helper invokes bootstrap_db when flag off
 - lifespan helper skips bootstrap_db AND logs an audit line when on
Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Wires the gateway Deployment env var introduced in the prior commit
into the chart so the pairing "migration Job runs → app pods skip" is
enforced at the chart layer rather than relying on operators to set
both flags consistently. The env entry is placed after extraEnv so a
user-supplied override cannot accidentally undo the contract.

Behavior:
  migration.enabled=true   → MCPGATEWAY_SKIP_MIGRATIONS="true"
  migration.enabled=false  → MCPGATEWAY_SKIP_MIGRATIONS="false"
  default values           → migration.enabled=true → SKIP="true"
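In template form the wiring amounts to ordering the chart-derived entry after extraEnv. An illustrative fragment (the actual template path, values keys, and indentation in charts/mcp-stack may differ):

```yaml
# Illustrative Deployment env block. The chart-derived entry comes after
# extraEnv so a duplicate user-supplied name cannot override the contract.
env:
  {{- with .Values.mcpgateway.extraEnv }}
  {{- toYaml . | nindent 12 }}
  {{- end }}
  - name: MCPGATEWAY_SKIP_MIGRATIONS
    value: {{ .Values.migration.enabled | quote }}
```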

Adds tests/unit/charts/test_deployment_mcpgateway.py with three
helm-template assertions pinning each branch. Tests skip automatically
when helm is not on PATH; no cluster needed. Issue #4051 Layer 2.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
@gandhipratik203 gandhipratik203 changed the title fix(bootstrap): skip Alembic advisory lock when schema is at head fix(bootstrap): improve startup reliability for multi-replica deploys Apr 26, 2026
Development

Successfully merging this pull request may close these issues.

[BUG]: Alembic migration advisory lock hangs when multiple gateway pods start through PgBouncer (transaction pooling mode)
