fix(bootstrap): improve startup reliability for multi-replica deploys #4444
Draft
gandhipratik203 wants to merge 6 commits into main from
Conversation
Adds tests/integration/test_migrations_under_transaction_pool.py — three @pytest.mark.integration assertions pinning the PgBouncer transaction-pool behavior that makes bootstrap_db.main() hang when gateway replicas start concurrently:

- a disconnected pgbouncer client's advisory lock persists on Postgres,
- a fresh Postgres session is blocked from acquiring the orphaned lock,
- a same-backend reuse via pgbouncer is reentrant (documents the gotcha so future readers don't mistake TRUE for "no bug").

Tests are gated behind --with-integration so default CI runs are unaffected. The fixture stack (postgres + pgbouncer with pool_mode=transaction and default_pool_size=2) lives at tests/integration/fixtures/transaction_pool/docker-compose.yml.

Also adds tests/reproduction/issue-4051/ with demonstrate_orphan.sh (deterministic mechanism proof) and reproduce.sh (replica-scaling symptom reproduction). These exist for PR-review evidence and local debugging during development; they are expected to be removed when the fix PR lands, with the evidence preserved in the issue thread.

No runtime changes.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
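The --with-integration gate described above is typically wired through a conftest.py hook pair; the sketch below is illustrative (the option name matches the commit text, but the hook bodies are assumptions, not the repository's actual conftest):

```python
# conftest.py (sketch): skip @pytest.mark.integration tests unless the
# suite is invoked with --with-integration.
import pytest


def pytest_addoption(parser):
    parser.addoption(
        "--with-integration",
        action="store_true",
        default=False,
        help="run tests that need the postgres+pgbouncer compose stack",
    )


def pytest_collection_modifyitems(config, items):
    if config.getoption("--with-integration"):
        return  # gate open: leave integration tests collected as-is
    skip = pytest.mark.skip(reason="needs --with-integration")
    for item in items:
        if "integration" in item.keywords:
            item.add_marker(skip)
```

With this shape, a plain `pytest` run reports the integration tests as skipped rather than erroring on a missing docker-compose stack.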
Multiple gateway replicas starting concurrently through a transaction-pooling connection pooler (PgBouncer pool_mode=transaction, pgcat, etc.) can hang indefinitely on bootstrap. A previous replica's pg_advisory_lock is session-scoped on its Postgres server backend; when the pooler returns that backend to its pool across a transaction boundary without clearing the lock, the next replica's pg_try_advisory_lock sees it held by an orphaned session and spins through its retry loop until the ~10-minute timeout fires.

Add a fast path in bootstrap_db.main() that probes alembic_version via MigrationContext before attempting the advisory lock: when the DB is already at the Alembic script directory's head, skip the lock entirely and run only the idempotent post-migration bootstrap steps (team visibility normalization, admin user, default roles, orphaned resource assignment). Replicas 2..N of a multi-pod deployment take this branch on every normal restart, so the advisory-lock orphan is never contended. The lock is still acquired for the slow path (empty DB, version mismatch, or any probe error), preserving the existing concurrent-migration protection. Post-migration bootstrap was extracted into _run_post_migration_bootstrap() so both paths share it.

Adds tests/integration/test_migrations_under_transaction_pool.py::test_bootstrap_db_skips_lock_when_schema_already_at_head as a regression gate: it synthesizes the held-lock precondition via a distinct direct-to-postgres session and runs bootstrap through pgbouncer. Pre-fix, the test times out after 240s; post-fix, it completes in under a second.

Closes #4051

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
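The control flow this commit describes can be sketched as follows. Only main()'s branch structure and the helper names _alembic_at_head / advisory_lock / _run_post_migration_bootstrap come from the commit text; the injected-callable shape below is an illustrative stand-in, not the shipped code:

```python
# Sketch of the bootstrap fast path (illustrative, not the shipped code).
# The probe, lock, migration runner, and post-bootstrap steps are injected
# so the flow itself can be exercised without a database.
def bootstrap_main(alembic_at_head, advisory_lock, run_migrations, post_bootstrap):
    if alembic_at_head():
        # Fast path: schema already at head, so the advisory lock is never
        # touched and PgBouncer backend handoffs cannot orphan it on us.
        post_bootstrap()
        return "fast-path"
    # Slow path: empty DB, version mismatch, or probe error. Keep the
    # existing concurrent-migration protection.
    with advisory_lock():
        run_migrations()
        post_bootstrap()
    return "slow-path"
```

The key property is that post_bootstrap (the idempotent seeding steps) runs on both branches, while run_migrations and the lock are confined to the slow path.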
Adds TestAlembicAtHead with five unit tests covering the bool outcomes that decide whether main() skips the migration advisory lock:

- at head → True (fast-path fires)
- no alembic_version in DB → False (slow path handles empty/partial DBs)
- head mismatch → False (migrations actually need to run)
- script directory has zero heads → False (defensive)
- any probe exception → False (fallback to slow path)

Direct coverage independent of the integration test; narrows the blast radius of a future refactor to this one helper.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
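The truth table above reduces to a small predicate. A minimal sketch, assuming the helper is handed the database's current heads and the script directory's heads (the real helper wraps this in Alembic calls and a try/except; the function name here is hypothetical):

```python
def heads_at_head(db_heads, script_heads):
    """Illustrative core of an _alembic_at_head-style check.

    True only when the script directory has at least one head and the
    database reports exactly that set; anything else falls back to the
    slow path (which takes the advisory lock and runs migrations).
    """
    if not script_heads:  # zero heads in the script dir: defensive False
        return False
    if not db_heads:      # no alembic_version rows: empty/partial DB
        return False
    return set(db_heads) == set(script_heads)
```

The exception branch of the truth table lives outside this core: any error while fetching either head set is caught by the caller and treated as False.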
Adds test_bootstrap_db_is_idempotent_once_schema_is_at_head as an additional invariant for the issue #4051 fast-path: after one successful bootstrap, a subsequent call must NOT enter advisory_lock at all — it should take the fast-path exclusively. The test wraps bootstrap_db.advisory_lock with a spy and asserts zero entries on the second invocation.

This catches regressions where a future refactor accidentally re-routes bootstrap through the lock even when the schema is at head (the specific failure mode that would bring issue #4051 back under transaction-pooling connection poolers). Also asserts the second call completes under 3s and leaves no advisory lock held.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
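The "zero entries" assertion can be made with a wrapper that counts context-manager entries. A sketch of the spy pattern, where advisory_lock below is a stand-in rather than the gateway's real helper:

```python
from contextlib import contextmanager


@contextmanager
def advisory_lock():  # stand-in for the real bootstrap_db.advisory_lock
    yield


class ContextSpy:
    """Wraps a context-manager factory and counts how often it is entered."""

    def __init__(self, factory):
        self.factory = factory
        self.entries = 0

    @contextmanager
    def __call__(self):
        self.entries += 1
        with self.factory():
            yield


spy = ContextSpy(advisory_lock)
with spy():
    pass
# A test would monkeypatch bootstrap_db.advisory_lock with such a spy, call
# main() twice, and assert spy.entries did not grow on the second call.
```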
…otstrap

Adds a settings flag (`mcpgateway_skip_migrations`, default False) that gates the gateway's lifespan call to bootstrap_db.main(). When set, the gateway logs that it is yielding migration ownership to a separate runner and skips the in-pod path entirely.

This is the operator-facing half of the issue #4051 fix. The L1 fast-path (already on this branch) makes in-pod bootstrap safe under transaction-pool connection poolers; this flag lets a Helm chart or init-container deployment turn the in-pod path off entirely once a dedicated migration runner has populated the schema. Library default stays False so direct `docker run mcpgateway` continues to bootstrap itself with no operator action.

Wires the gate via a small lifespan helper (`_run_initial_db_bootstrap`) to keep the lifespan body unchanged and the gate trivially testable.

Tests:
- settings.skip_migrations defaults False
- MCPGATEWAY_SKIP_MIGRATIONS=true → True
- MCPGATEWAY_SKIP_MIGRATIONS=false → False
- lifespan helper invokes bootstrap_db when flag off
- lifespan helper skips bootstrap_db AND logs an audit line when on

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
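The flag's env-var semantics (unset means False, recognised strings flip it, nothing is silently coerced) can be illustrated with a strict stdlib-only parser. The real gateway presumably reads this through its settings framework; the function below is an assumption for illustration only:

```python
import os

_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off"}


def skip_migrations(environ=os.environ):
    """Read MCPGATEWAY_SKIP_MIGRATIONS strictly (illustrative parser).

    Unset -> False (the plain docker-run happy path); recognised strings
    flip the flag; anything else raises instead of being coerced.
    """
    raw = environ.get("MCPGATEWAY_SKIP_MIGRATIONS")
    if raw is None:
        return False
    value = raw.strip().lower()
    if value in _TRUE:
        return True
    if value in _FALSE:
        return False
    raise ValueError(f"MCPGATEWAY_SKIP_MIGRATIONS: unrecognised value {raw!r}")
```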
Wires the gateway Deployment env var introduced in the prior commit into the chart so the pairing "migration Job runs → app pods skip" is enforced at the chart layer rather than relying on operators to set both flags consistently. The env entry is placed after extraEnv so a user-supplied override cannot accidentally undo the contract.

Behavior:
- migration.enabled=true → MCPGATEWAY_SKIP_MIGRATIONS="true"
- migration.enabled=false → MCPGATEWAY_SKIP_MIGRATIONS="false"
- default values → migration.enabled=true → SKIP="true"

Adds tests/unit/charts/test_deployment_mcpgateway.py with three helm-template assertions pinning each branch. Tests skip automatically when helm is not on PATH; no cluster needed.

Issue #4051 Layer 2.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
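The wiring this commit describes would look roughly like the fragment below. Only the ternary expression and the placement after extraEnv follow the commit text; the surrounding structure, indentation, and the .Values.mcpgateway.extraEnv path are assumptions for illustration:

```yaml
# charts/mcp-stack/templates/deployment-mcpgateway.yaml (fragment, illustrative)
env:
  {{- with .Values.mcpgateway.extraEnv }}
  {{- toYaml . | nindent 12 }}
  {{- end }}
  # Rendered after extraEnv so a user-supplied override cannot undo the
  # migration-Job/skip-flag pairing.
  - name: MCPGATEWAY_SKIP_MIGRATIONS
    value: {{ ternary "true" "false" .Values.migration.enabled | quote }}
```

Sprig's ternary takes the true value, the false value, then the condition, so the single migration.enabled toggle drives both the Job and the skip flag.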
Summary
Fixes issue #4051 — Alembic migration advisory lock hangs when multiple gateway replicas start through PgBouncer in transaction-pooling mode. The session-scoped pg_advisory_lock is orphaned across PgBouncer's backend handoffs; the N-th gateway pod sees the lock held by an effectively-dead session and spins in its retry loop until the ~10-minute timeout fires.

This PR closes the bug along two layers and ships them together so the chart is safe under both deployment patterns:

- bootstrap_db.main() now probes alembic_version before attempting the advisory lock. When the schema is at the script directory's head, the lock acquisition is skipped entirely. Replicas 2..N never participate in the race.
- The MCPGATEWAY_SKIP_MIGRATIONS settings flag suppresses the in-pod bootstrap_db.main() call entirely. The mcp-stack Helm chart wires it from migration.enabled so the contract "the pre-install Job owns migrations, app pods skip" is enforced at the chart layer rather than left to operators to coordinate.

Library defaults preserve backward compatibility: a direct docker run mcpgateway:latest continues to bootstrap its own schema. Only the chart and (future) compose overlays opt into SKIP_MIGRATIONS=true.

Related: existing migration.hostKey: host / portKey: port knobs in charts/mcp-stack/profiles/ocp/values-pgo.yaml already let the migration Job bypass PgBouncer; that workaround stays in place but is no longer load-bearing.

Approach
The fix was built reproduction-first: rather than implement against the issue text alone, the bug was first reproduced deterministically in docker-compose (tests/reproduction/issue-4051/) and then again on a real OpenShift cluster against CrunchyData PGO + transaction-pool PgBouncer. Both reproductions are documented in todo/issue-4051-ocp-reproduction.md, and the compose-side artifact lives permanently in the repo as a test fixture.

For each layer (L1 fast-path, L2 chart-wired skip), the cycle was:
The regression test for the bug — test_bootstrap_db_skips_lock_when_schema_already_at_head — is reliably RED pre-fix (240s timeout, 29/60 retry attempts visible in logs) and reliably GREEN post-fix (~1.4s). It's not a flaky timing-based test: it synthesizes the held-lock condition deterministically by holding the advisory lock in a distinct, still-alive Postgres session for the duration of the second bootstrap_db.main() call. That guarantees the bootstrap's PgBouncer-side session sees pg_try_advisory_lock return FALSE regardless of PgBouncer's internal backend assignment — without that guarantee, PgBouncer can hand the bootstrap the same server backend that the holder used, and pg_try_advisory_lock succeeds via PostgreSQL's reentrant-within-session semantics, masking the bug. (We landed on this design after observing a passing test that should have failed; the false positive is documented as test_reentrant_acquire_through_same_pgbouncer_is_not_a_counter_example so a future reader who hits the same gotcha understands what they're seeing.)

Gaps closed
HIGH — Bug correctness
Gap 1 (HIGH) — Advisory lock orphan via PgBouncer transaction-pool handoffs: bootstrap_db.advisory_lock always called pg_try_advisory_lock and retried up to 60 times with exponential backoff. When PgBouncer reused a backend whose previous client took the lock, new clients saw it held by an unreachable session and spun until timeout. Fixed with a _alembic_at_head probe that runs before the lock — when the DB is at the script-directory head, the lock is never attempted. The probe uses Alembic's MigrationContext.get_current_heads() and ScriptDirectory.from_config(cfg).get_heads() as the canonical source of truth; any error during probing falls through to the existing slow path so empty / partial / out-of-date databases still go through the explicit handling.

Gap 2 (HIGH) — Worker-level fan-out understated in the issue text: each gateway pod runs N gunicorn workers (default 8 in the OCP profile), and each worker calls bootstrap_db.main() in its own lifespan. So a 3-replica deployment produces 24 racing acquirers, not 3. Reproduction on OpenShift confirmed the fan-out: pre-fix, the same pod's logs show interleaved "Acquired lock on attempt N" and "Lock held by another instance, attempt N/60" at the same timestamps — different workers in the same gunicorn process. The L1 fast-path eliminates the race for replicas 2..N entirely; for replica 1 the within-pod race is bounded to a sub-second seed window.

Gap 3 (HIGH) — Chart safety today depends on the OCP-only workaround: migration.hostKey: host / portKey: port route the pre-install Job directly to Postgres (bypassing PgBouncer), masking the bug whenever the Job is enabled. A community user setting migration.enabled=false (because they manage migrations externally, or run the chart in compose-flavoured contexts) loses that protection. Layer 1 makes migration.enabled=false safe.

HIGH — Operator ergonomics
Gap 4 (HIGH) — In-pod bootstrap runs even when a dedicated migration runner has already populated the schema: redundant probe activity per worker, plus the within-pod race for replica 1's seed window. Fixed with MCPGATEWAY_SKIP_MIGRATIONS=true, which short-circuits the lifespan call to bootstrap_db.main() and emits a single audit-log line so operators can see the choice in pod logs.

Gap 5 (HIGH) — No chart wiring for the SKIP/Job pairing: even if the env var existed, expecting operators to set both migration.enabled=true AND MCPGATEWAY_SKIP_MIGRATIONS=true is a foot-gun (forgetting one means redundant in-pod bootstrap; forgetting the other means the schema is never populated). Fixed in charts/mcp-stack/templates/deployment-mcpgateway.yaml: the env var is rendered as {{ ternary "true" "false" .Values.migration.enabled }} and placed after extraEnv so a user override cannot accidentally undo the contract.

MEDIUM — Test infrastructure
Gap 6 (MEDIUM) — No regression test for the advisory-lock invariant: no existing test exercised concurrent bootstrap_db.main() callers through PgBouncer. Fixed with tests/integration/test_migrations_under_transaction_pool.py containing five @pytest.mark.integration tests: three pinning the PgBouncer mechanism (so the substrate behavior is captured for posterity even if PgBouncer changes), one regression gate (synthesizes the orphaned-lock condition and asserts bootstrap_db.main() returns within 10s — fails in 240s pre-fix), and one idempotency invariant (asserts a second call after the schema is at head never enters the advisory_lock context manager). Tests skip cleanly without --with-integration.

Gap 7 (MEDIUM) — No reproduction artifact in the repo: future debuggers had nothing to anchor on. Fixed with tests/reproduction/issue-4051/ containing a docker-compose stack (postgres + pgbouncer + gateway), a deterministic demonstrate_orphan.sh (proves the PgBouncer mechanism in 3 psql calls), and a reproduce.sh (scales gateway replicas through PgBouncer and reports per-pod hangs). The compose file lives at tests/integration/fixtures/transaction_pool/ so the integration tests can drive the same stack. The shell scripts are slated for deletion before this PR merges; their evidence will be preserved in the issue thread.

LOW — Documentation and observability
Gap 8 (LOW) — Two readiness markers, one log line: the L1 fast-path returns without emitting "Database ready" (only the slow-path return logs it). Operators reading pod logs see either the slow-path marker or the fast-path skip line, not a single canonical "bootstrap finished" line. Tracked as a tiny follow-up; not blocking.

Architecture
Before — every replica × every worker races for the lock
After Layer 1 — fast-path before the lock
After Layer 2 — pods don't even probe when the Job owns it
Test results
The change ships with three layers of verification, each addressing a different failure mode:
The integration tests sit behind --with-integration; the unit tests run on every CI pass. The cluster smoke is documented evidence, not automated, and lives as todo/issue-4051-ocp-reproduction.md for the PR record.

Reproduction harness — how the bug was demonstrated end-to-end
Two artifacts live under tests/reproduction/issue-4051/. They will be removed before this PR merges (their evidence preserved in the issue thread):

The compose stack itself was promoted to tests/integration/fixtures/transaction_pool/docker-compose.yml so the integration test can drive the same infrastructure that the shell scripts do. PgBouncer's default_pool_size is deliberately small (2) so total client connections comfortably exceed it, forcing backend multiplexing — without that pressure, the bug was hard to trigger locally because bootstrap can finish before any backend handoff happens.

Why two harnesses for one bug?
demonstrate_orphan.sh proves the PgBouncer mechanism exists (orphaned advisory locks); reproduce.sh proves the operational symptom exists (replicas hang). Together they answer two questions any reviewer might ask: "is this a real PgBouncer property?" and "does it actually cause user-visible breakage?" — without making us guess timing.

Why the regression test is deterministic (not flaky)
Three ideas had to land before test_bootstrap_db_skips_lock_when_schema_already_at_head was reliably RED on main:

The middle attempt's false positive is captured as a permanent test (test_reentrant_acquire_through_same_pgbouncer_is_not_a_counter_example) so future readers who see "but I called pg_try_advisory_lock through PgBouncer and it returned TRUE!" understand the reentrant-session gotcha.

The same logic underpins the _clean_orphaned_lock autouse fixture: it runs pg_terminate_backend on any session still holding the test lock id between tests, so a single failed test cannot wedge the next one behind a leaked orphan.

Unit — `_alembic_at_head` truth table (5 tests)
Pins each branch of the fast-path predicate independently of the integration test, so a future refactor that widens or narrows the trigger condition is caught locally.
Unit — `MCPGATEWAY_SKIP_MIGRATIONS` settings field (3 tests)
Library default stays False (the docker run mcpgateway happy path), env var override flips True, explicit False matches the default. No accidental coercion.

Unit — lifespan helper respects the flag (2 tests)
The skip-case test also asserts the audit-log line is emitted so operators can see the choice in pod logs.
Unit — Helm chart render (3 tests)
Tests run helm template as a subprocess, parse the rendered manifests, and assert on the gateway Deployment's env block. Skipped automatically when helm is not on PATH; no cluster needed.

Integration — PgBouncer mechanism + bootstrap regression (5 tests, gated by `--with-integration`)
The mechanism tests document what PgBouncer does (lock orphaning is a real, named property of transaction-pool handoffs) so a future reader who sees "but pg_try_advisory_lock returned TRUE through PgBouncer!" understands the reentrance gotcha. The regression tests gate the actual fix.
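The reentrance gotcha can be modeled without a database: advisory locks are keyed per server backend session, and a backend that already holds a lock id re-acquires it successfully. A toy model of those semantics (assumptions throughout; this only mirrors the behavior the tests pin, it is not PgBouncer or Postgres code):

```python
class Backend:
    """Toy model of one Postgres server backend's session-scoped advisory locks."""

    _holders = {}  # lock_id -> Backend currently holding it (class-wide "cluster" state)

    def try_advisory_lock(self, lock_id):
        holder = Backend._holders.get(lock_id)
        if holder is None:
            Backend._holders[lock_id] = self
            return True
        # Reentrant within the same session: the holder re-acquires fine.
        return holder is self


# pool_mode=transaction: the pooler may hand two clients the SAME backend...
shared = Backend()
assert shared.try_advisory_lock(4051)     # client A takes the lock
assert shared.try_advisory_lock(4051)     # client B, same backend: TRUE (the gotcha)

# ...or a DIFFERENT backend, where the orphaned lock blocks for real.
other = Backend()
assert not other.try_advisory_lock(4051)  # the bug: held by an orphaned session
```

This is why a counter-example test that happens to be routed to the holder's backend passes misleadingly, and why the regression test pins the holder in a distinct direct-to-postgres session.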
Run with the docker-compose stack from tests/integration/fixtures/transaction_pool/docker-compose.yml.

OCP verification — L1 fix on real OpenShift + CrunchyData PGO (cluster: context-forge-test-pg1.cp.fyre.ibm.com)
Full reproduction-vs-fix evidence in todo/issue-4051-ocp-reproduction.md.

OCP verification — L2 (chart-wired SKIP_MIGRATIONS) on real OpenShift
The 3-way pre-fix / post-L1 / post-L2 comparison is the headline win. The chart's contract — migration Job runs → app pods skip — is honored automatically; operators don't have to remember to set both migration.enabled AND MCPGATEWAY_SKIP_MIGRATIONS because the chart wires them together.

Full reproduction-vs-fix evidence in todo/issue-4051-ocp-reproduction.md.

Additional improvements
Idempotent post-migration bootstrap extracted to _run_post_migration_bootstrap() — admin-user creation, default-role seeding, and orphaned-resource assignment are now called from both the fast-path and slow-path branches. Previously these were only inside the advisory_lock block, so a fast-path skip would have missed them on a partial-startup edge case (replica 1 stamps head then crashes before bootstrapping the admin user → replica 2 fast-paths and runs the bootstrap steps to fill the gap).

Reproduction harness lives at tests/reproduction/issue-4051/ — docker-compose.repro.yml was promoted to a permanent test fixture under tests/integration/fixtures/transaction_pool/. Shell scripts (reproduce.sh, demonstrate_orphan.sh, README.md) are temporary; they will be removed before this PR merges, with their evidence preserved in the issue thread.

Module docstrings reference the issue number — both the helper functions and the integration test module carry an inline reference to issue #4051 ([BUG]: Alembic migration advisory lock hangs when multiple gateway pods start through PgBouncer (transaction pooling mode)) so a future reader greps once and lands in the right context.

Limitations
Fast-path returns without emitting "Database ready" — the canonical readiness marker only fires on the slow-path return. Operators reading pod logs see either Database ready or Schema already at Alembic head; skipping migration lock. Both signal "bootstrap finished," but inconsistently. Captured as Gap 8 above; trivial follow-up.

Layer 3 not included — the original plan considered a table-based mutex fallback for environments where migration_database_url (direct-Postgres bypass) isn't available AND the chart's pre-install Job is disabled. With L1+L2 in place, that combination is well covered: L1 makes the in-pod path safe, L2 lets the chart guarantee the in-pod path is skipped when the Job runs. Will reopen if a real-world report justifies it.

DROP SCHEMA in PGO setups requires ownership restoration — observed during the OCP reproduction. CrunchyData PGO assigns schema ownership to a service user; a manual DROP SCHEMA public CASCADE; CREATE SCHEMA public; re-creates the schema owned by postgres, so the service user can't create tables in it until ownership is restored:
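The restoration statements were elided in this rendering of the PR; a typical sequence looks like the following sketch (the service_user role name is an assumption — PGO installs use their own provisioned role names):

```sql
-- Illustrative: restore ownership after a manual DROP/CREATE of public.
-- Replace service_user with the role PGO provisioned for the app.
ALTER SCHEMA public OWNER TO service_user;
GRANT ALL ON SCHEMA public TO service_user;
```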
Pod readiness can be misleading during pre-fix cycling — observed pods show
1/1 Readyeven while several gunicorn workers are still in their retry loop. One worker's lifespan completing seems sufficient to satisfy the readiness probe. With L1+L2 the storm window is gone; this is captured here for operator awareness when reading historical pre-fix logs.Future work
Compose-side L2 (deferred)
The root docker-compose.yml was left unchanged. It's currently safe thanks to L1 alone (the fast-path makes in-pod bootstrap pgbouncer-safe). Adding a dedicated migrate service in compose plus MCPGATEWAY_SKIP_MIGRATIONS=true on the gateway would mirror the chart's contract one-to-one — useful for users who want production-shaped semantics in dev. Tracked as a separate issue if there's appetite; deliberately out of scope here to keep docker-compose.yml touch-free for existing users.

Unify the readiness log line
Per Gap 8 above. Either:

- emit "Database ready" from both paths (fast and slow), or
- emit path-tagged variants ("Database ready (fast-path)" / "Database ready (slow-path)").

Either change is a one-line edit; deferred so this PR stays scoped to the bug fix.

MIGRATION_DATABASE_URL (only if requested)

The Helm chart already exposes migration.hostKey/portKey to point the pre-install Job directly to Postgres (bypassing PgBouncer). A code-level MIGRATION_DATABASE_URL setting would let docker-compose users do the same without two DATABASE_URL definitions. Not necessary for the bug fix; deferred unless someone explicitly asks for it.

Closes #4051