fix: serialize heavy index claim to prevent same-table deadlock by longyincug · Pull Request #403 · mantlenetworkio/blockscout

longyincug · 2026-05-22T10:45:01Z

No description provided.

`UpdateInternalTransactionsPrimaryKey` swaps the `internal_transactions` primary key from `(block_hash, block_index)` to `(block_number, transaction_index, index)`. If the fetcher runs while the migration is still pending, it can import rows that violate the new unique constraint, and the subsequent `ADD PRIMARY KEY USING INDEX` step will fail.

Gate every fetcher entry point on `InternalTransactionHelper.primary_key_updated?`:

- `Indexer.Fetcher.InternalTransaction.async_fetch/5` short-circuits to `:ok` (including the `for_contract_creator?` path) so nothing is queued.
- `Indexer.Fetcher.InternalTransaction.init/3` returns the initial accumulator instead of streaming pending block / transaction operations, leaving the `BufferedTask` queue empty until the migration finishes.
- `Indexer.Fetcher.InternalTransaction.run/2` returns `:ok` without fetching or importing.
- `Indexer.Fetcher.OnDemand.InternalTransaction` extends `internal_transactions_fetching_disabled?` with the same check, so all on-demand entry points (`fetch_latest`, `fetch_by_transaction`, `fetch_by_block`, `fetch_by_address`, `should_fetch?`) return empty results.

Tests cover the new short-circuit in both fetchers and reset the `BackgroundMigrations` cache flag in `on_exit` so they do not bleed across the suite.
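The gating pattern described above can be sketched roughly as follows. This is an illustration only, not the PR's code: `GateSketch`, `set_primary_key_updated/1`, and the simplified `async_fetch/2` and `init/3` signatures are hypothetical, and the flag is kept in `:persistent_term` here, whereas the real helper reads the `BackgroundMigrations` cache.

```elixir
defmodule GateSketch do
  @moduledoc "Hypothetical sketch: gate fetcher entry points on a migration flag."

  # Stand-in for InternalTransactionHelper.primary_key_updated?/0.
  # Assumption: the flag is stored in :persistent_term for this sketch only.
  def primary_key_updated?, do: :persistent_term.get({__MODULE__, :pk_updated}, false)

  def set_primary_key_updated(value) when is_boolean(value),
    do: :persistent_term.put({__MODULE__, :pk_updated}, value)

  # Analogue of async_fetch: queue nothing while the migration is pending.
  def async_fetch(block_numbers, enqueue_fun) do
    if primary_key_updated?() do
      enqueue_fun.(block_numbers)
    else
      :ok
    end
  end

  # Analogue of init/3: return the initial accumulator untouched while the
  # migration is pending, so the task queue starts empty.
  def init(initial_acc, reducer, pending_entries) do
    if primary_key_updated?() do
      Enum.reduce(pending_entries, initial_acc, reducer)
    else
      initial_acc
    end
  end
end
```

Because every entry point consults the same predicate, flipping the flag once the migration completes re-enables all paths together.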

…lock

`HeavyDbIndexOperation`'s readiness check and the `MigrationStatus.set_status(name, "started")` write were two separate queries with nothing in between, so multiple GenServers booting at the same time on the same table could each observe "no one is started yet" and then each flip itself to `started`. They would then run `CREATE INDEX CONCURRENTLY` / `DROP INDEX CONCURRENTLY` against the same table simultaneously, which PostgreSQL deadlocks on. Once the DDL failed, the rows stayed pinned at `started` forever, with every GenServer indefinitely waiting on the others through `running_other_heavy_migration_exists?`.

Observed in production after deploy: three `internal_transactions` heavy migrations (`drop_..._created_contract_address_hash_partial_index`, `create_..._block_number_transaction_index_index_index`, `drop_..._from_address_hash_index`) all set `started` within the same millisecond, then sat there while the indexer logged deadlock errors. `pg_stat_activity` showed no active DDL and `pg_index` had no invalid leftovers; only the `migrations_status` rows were stuck.

Wrap the ready check + status write in a `Repo.transaction` guarded by `pg_try_advisory_xact_lock`, keyed on `:erlang.phash2({:heavy_index_table_slot, table_name})` so the lock is per-table, not global. The lock is a transaction-scoped advisory lock (auto-released on commit), so it never spans the actual DDL: the second GenServer's next tick simply sees `running_other_heavy_migration_exists?` return true once the first has committed and exits cleanly to retry later.

Use `Repo.query/2` (not the bang variant) plus an explicit `$1::bigint` cast. The non-bang form keeps Postgrex/DBConnection errors as `{:error, _}` instead of raising out of the transaction and crashing the GenServer, and the cast removes any chance of Postgres failing to resolve the `pg_try_advisory_xact_lock(bigint)` overload when Postgrex encodes the `phash2` result as int4 or unknown.

Note: existing `started` rows in production are not unstuck by this patch. They must be cleared manually (UPDATE to `completed` for migrations whose target index state is already reached, DELETE otherwise) before restart.
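A minimal sketch of the claim step, under stated assumptions: `HeavyIndexClaimSketch`, `try_claim/4`, and the `ready_fun` / `mark_started_fun` callbacks are hypothetical stand-ins for the readiness check and the `MigrationStatus.set_status(name, "started")` write, and `repo` is any module exposing `Ecto.Repo`-style `transaction/1` and `query/2`. The return atoms are illustrative, not the PR's.

```elixir
defmodule HeavyIndexClaimSketch do
  # Per-table lock key. :erlang.phash2/1 returns an integer in 0..2^27-1,
  # which fits the bigint argument of pg_try_advisory_xact_lock.
  def lock_key(table_name), do: :erlang.phash2({:heavy_index_table_slot, table_name})

  # Runs the readiness check and the "started" write inside one transaction,
  # guarded by a transaction-scoped advisory lock for the given table.
  def try_claim(repo, table_name, ready_fun, mark_started_fun) do
    repo.transaction(fn ->
      # Non-bang query: driver errors come back as {:error, _} instead of raising.
      case repo.query("SELECT pg_try_advisory_xact_lock($1::bigint)", [lock_key(table_name)]) do
        {:ok, %{rows: [[true]]}} ->
          if ready_fun.() do
            mark_started_fun.()
            :claimed
          else
            :not_ready
          end

        {:ok, %{rows: [[false]]}} ->
          # Another process holds the table slot; retry on a later tick.
          :lock_busy

        {:error, reason} ->
          {:query_failed, reason}
      end
    end)
  end
end
```

With a real Ecto repo, the function's return value arrives wrapped as `{:ok, value}`, and the advisory lock is released when the transaction ends, before any `CREATE INDEX CONCURRENTLY` or `DROP INDEX CONCURRENTLY` would run.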

The HotSmartContracts fetcher hard-coded a 30-day chain-age gate before writing to `hot_smart_contracts_daily`, which made the 1d/7d/30d scales of `/api/v2/stats/hot-smart-contracts` return empty on freshly launched chains. Replace the module attribute with a runtime helper backed by `INDEXER_HOT_SMART_CONTRACTS_MIN_CHAIN_AGE_DAYS` (default 30, preserving upstream behavior). Operators can lower it for new chains so daily aggregation starts immediately.
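A runtime helper of this kind might look like the sketch below. It is an assumption-laden illustration: `MinChainAgeSketch` and `old_enough?/2` are hypothetical names, the helper reads the environment variable directly rather than going through the application's config layer, and measuring chain age from the first block's date is assumed.

```elixir
defmodule MinChainAgeSketch do
  @env_var "INDEXER_HOT_SMART_CONTRACTS_MIN_CHAIN_AGE_DAYS"
  @default_days 30

  # Read at call time, so operators can change the threshold without recompiling.
  # Unset, non-numeric, or negative values fall back to the default.
  def min_chain_age_days do
    case Integer.parse(System.get_env(@env_var, "")) do
      {days, ""} when days >= 0 -> days
      _ -> @default_days
    end
  end

  # True once the chain's first block is at least the configured number of days old.
  def old_enough?(first_block_date, today \\ Date.utc_today()) do
    Date.diff(today, first_block_date) >= min_chain_age_days()
  end
end
```

Setting the variable to `0` in this sketch lets aggregation begin on the chain's first day, while leaving it unset keeps the 30-day gate.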

Blocks with tens of thousands of transactions exceeded the 65535-parameter PostgreSQL protocol limit in a single `insert_all`, crash-looping the internal transactions DeleteQueue and amplifying indexer memory usage. Chunk the inserts via `Repo.safe_insert_all`, wrapped in a transaction to keep them atomic for callers without one, and select only the transaction fields consumed by the internal transactions fetcher instead of full structs.
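The chunking arithmetic can be illustrated with a small sketch. `ChunkedInsertSketch` and its functions are hypothetical and do not reproduce `Repo.safe_insert_all`; `repo` is assumed to expose `Ecto.Repo`-style `transaction/1` and `insert_all/2`, and each row is assumed to bind `fields_per_row` parameters.

```elixir
defmodule ChunkedInsertSketch do
  # PostgreSQL's wire protocol carries the bind-parameter count in a 16-bit field.
  @max_params 65_535

  # Split rows so that rows_per_chunk * fields_per_row stays within the limit.
  def chunk_rows(rows, fields_per_row) when fields_per_row > 0 do
    Enum.chunk_every(rows, max(div(@max_params, fields_per_row), 1))
  end

  # Insert all chunks inside one transaction so the batch stays atomic even
  # when the caller has not opened a transaction itself.
  def insert_all_chunked(repo, table, rows, fields_per_row) do
    repo.transaction(fn ->
      rows
      |> chunk_rows(fields_per_row)
      |> Enum.reduce(0, fn chunk, total ->
        {count, _returned} = repo.insert_all(table, chunk)
        total + count
      end)
    end)
  end
end
```

For example, 20,000 rows with 4 fields each would need 80,000 parameters in a single statement; with a per-chunk cap of `div(65_535, 4) = 16_383` rows they split into chunks of 16,383 and 3,617 rows.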

The DeleteQueue transaction deletes internal transactions and re-inserts pending operations for whole blocks. For massive blocks (tens of thousands of transactions) this exceeds the default 60s repo timeout, so the pool kills the connection mid-transaction and the batch retries forever. Inner queries already ran with `timeout: :infinity`; apply the same to the wrapping transaction.
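As a rough sketch of the change, assuming `repo` exposes an `Ecto.Repo`-style `transaction/2` that accepts a `:timeout` option (`DeleteQueueTimeoutSketch` and `run_batch/2` are hypothetical names, not the PR's code):

```elixir
defmodule DeleteQueueTimeoutSketch do
  # Run the whole delete-and-reinsert batch under one transaction whose
  # timeout matches the :infinity timeout already used by the inner queries.
  def run_batch(repo, work_fun) when is_function(work_fun, 0) do
    repo.transaction(work_fun, timeout: :infinity)
  end
end
```

The trade-off is that a stuck batch can no longer be rescued by the timeout, so this suits work that is known to be long but finite.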

longyincug added 5 commits May 22, 2026 18:21

longyincug wants to merge 5 commits into `main` from `mantle-arsia`
