fix: serialize heavy index claim to prevent same-table deadlock #403
Open
longyincug wants to merge 5 commits into
`UpdateInternalTransactionsPrimaryKey` swaps the `internal_transactions`
primary key from `(block_hash, block_index)` to
`(block_number, transaction_index, index)`. If the fetcher runs while
the migration is still pending, it can import rows that violate the
new unique constraint, and the subsequent `ADD PRIMARY KEY USING INDEX`
step will fail.
Gate every fetcher entry point on
`InternalTransactionHelper.primary_key_updated?`:
- `Indexer.Fetcher.InternalTransaction.async_fetch/5` short-circuits
to `:ok` (including the `for_contract_creator?` path) so nothing is
queued.
- `Indexer.Fetcher.InternalTransaction.init/3` returns the initial
accumulator instead of streaming pending block / transaction
operations, leaving the `BufferedTask` queue empty until the
migration finishes.
- `Indexer.Fetcher.InternalTransaction.run/2` returns `:ok` without
fetching or importing.
- `Indexer.Fetcher.OnDemand.InternalTransaction` extends
`internal_transactions_fetching_disabled?` with the same check, so
all on-demand entry points (`fetch_latest`, `fetch_by_transaction`,
`fetch_by_block`, `fetch_by_address`, `should_fetch?`) return empty
results.
Tests cover the new short-circuit in both fetchers and reset the
`BackgroundMigrations` cache flag in `on_exit` so they do not bleed
across the suite.
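
The gating pattern described above can be sketched as follows. This is a minimal, self-contained illustration, not the code from this PR: `PrimaryKeyGateSketch`, its `:persistent_term` flag, and the `{:queued, _}` return value are hypothetical stand-ins for `InternalTransactionHelper.primary_key_updated?` and the real fetcher entry points.

```elixir
defmodule PrimaryKeyGateSketch do
  # Hypothetical stand-in for the cached BackgroundMigrations flag.
  @flag {__MODULE__, :primary_key_updated}

  def set_primary_key_updated(value) when is_boolean(value) do
    :persistent_term.put(@flag, value)
  end

  # Defaults to false, i.e. "migration still pending".
  def primary_key_updated?, do: :persistent_term.get(@flag, false)

  # Entry point that would normally enqueue work. While the migration
  # is pending it short-circuits to :ok and queues nothing.
  def async_fetch(block_numbers) do
    if primary_key_updated?() do
      # The tagged tuple only makes the sketch observable; it is an assumption.
      {:queued, block_numbers}
    else
      :ok
    end
  end

  # Worker callback: skips fetching and importing entirely while gated.
  def run(entries, fetch_fun) when is_function(fetch_fun, 1) do
    if primary_key_updated?() do
      fetch_fun.(entries)
    else
      :ok
    end
  end
end
```

Because every entry point consults the same predicate, flipping the flag once the migration completes re-enables all of them together.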
…lock
`HeavyDbIndexOperation`'s readiness check and the
`MigrationStatus.set_status(name, "started")` write were two separate
queries with nothing in between, so multiple GenServers booting at the
same time on the same table could each observe "no one is started yet"
and then each flip itself to `started`. They would then run
`CREATE INDEX CONCURRENTLY` / `DROP INDEX CONCURRENTLY` against the
same table simultaneously, which PostgreSQL deadlocks on. Once the
DDL failed the rows stayed pinned at `started` forever, with every
GenServer indefinitely waiting on the others through
`running_other_heavy_migration_exists?`.
Observed in production after deploy: three `internal_transactions`
heavy migrations (`drop_..._created_contract_address_hash_partial_index`,
`create_..._block_number_transaction_index_index_index`,
`drop_..._from_address_hash_index`) all set `started` within the same
millisecond, then sat there while the indexer logged deadlock errors;
`pg_stat_activity` showed no active DDL and `pg_index` had no invalid
leftovers; only the `migrations_status` rows were stuck.
Wrap the ready check + status write in a `Repo.transaction` guarded by
`pg_try_advisory_xact_lock`, keyed on
`:erlang.phash2({:heavy_index_table_slot, table_name})` so the lock is
per-table, not global. The lock is a transaction-scoped advisory lock
(auto-released when the transaction commits or rolls back), so it never
spans the actual DDL. The second GenServer's next tick simply sees
`running_other_heavy_migration_exists?` return true once the first
has committed, and exits cleanly to retry later.
Use `Repo.query/2` (not the bang variant) plus an explicit `$1::bigint`
cast: the non-bang form keeps Postgrex/DBConnection errors as
`{:error, _}` instead of raising out of the transaction and crashing
the GenServer, and the cast removes any chance of Postgres failing to
resolve the `pg_try_advisory_xact_lock(bigint)` overload when Postgrex
encodes the `phash2` result as int4 or unknown.
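
A rough sketch of the claim step, under stated assumptions: `HeavyIndexClaimSketch`, `try_claim/4`, and the injected functions are hypothetical names chosen so the example runs without a database. In the real code, `query_fun` would be `Repo.query/2`, and the whole call would sit inside `Repo.transaction/1` so the advisory lock is released when that transaction ends.

```elixir
defmodule HeavyIndexClaimSketch do
  # Explicit ::bigint cast so Postgres can resolve the
  # pg_try_advisory_xact_lock(bigint) overload.
  @lock_sql "SELECT pg_try_advisory_xact_lock($1::bigint)"

  # Per-table key; :erlang.phash2/1 returns an integer in 0..2^27-1.
  def lock_key(table_name) do
    :erlang.phash2({:heavy_index_table_slot, table_name})
  end

  # query_fun:        (sql, params) -> {:ok, %{rows: rows}} | {:error, reason}
  # ready_fun:        () -> boolean, true when no other heavy migration runs
  # mark_started_fun: () -> any, writes the "started" status
  def try_claim(table_name, query_fun, ready_fun, mark_started_fun) do
    case query_fun.(@lock_sql, [lock_key(table_name)]) do
      {:ok, %{rows: [[true]]}} ->
        # We hold the lock: the check and the write are now serialized.
        if ready_fun.() do
          mark_started_fun.()
          :claimed
        else
          :not_ready
        end

      {:ok, %{rows: [[false]]}} ->
        # Another process holds the lock; retry on a later tick.
        :lock_busy

      {:error, reason} ->
        # The non-bang query keeps failures as values instead of raising.
        {:error, reason}
    end
  end
end
```

Only the readiness check and the status write happen under the lock. The DDL itself runs afterwards, outside the transaction.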
Note: existing `started` rows in production are not unstuck by this
patch — they must be cleared manually (UPDATE to `completed` for
migrations whose target index state is already reached, DELETE
otherwise) before restart.
The HotSmartContracts fetcher hard-coded a 30-day chain-age gate before writing to `hot_smart_contracts_daily`, which made the 1d/7d/30d scales of `/api/v2/stats/hot-smart-contracts` return empty on freshly launched chains. Replace the module attribute with a runtime helper backed by `INDEXER_HOT_SMART_CONTRACTS_MIN_CHAIN_AGE_DAYS` (default 30, preserving upstream behavior). Operators can lower it for new chains so daily aggregation starts immediately.
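
One way such a runtime helper could look, as a sketch only: the module and function names are hypothetical, and reading the variable directly with `System.get_env/1` is an assumption (the actual change may route it through application config). The `>=` comparison against the chain's first block date is likewise assumed.

```elixir
defmodule MinChainAgeSketch do
  @env_var "INDEXER_HOT_SMART_CONTRACTS_MIN_CHAIN_AGE_DAYS"
  # Default of 30 days preserves the previously hard-coded behavior.
  @default_days 30

  # Read at call time rather than compile time, so operators can change
  # the value without rebuilding the release.
  def min_chain_age_days do
    case System.get_env(@env_var) do
      nil ->
        @default_days

      raw ->
        case Integer.parse(String.trim(raw)) do
          {days, ""} when days >= 0 -> days
          # Fall back to the default on malformed or negative input.
          _ -> @default_days
        end
    end
  end

  # True once the chain is at least the configured number of days old.
  def chain_old_enough?(first_block_date, today \\ Date.utc_today()) do
    Date.diff(today, first_block_date) >= min_chain_age_days()
  end
end
```

With the variable set to `0`, a chain launched today would pass the gate immediately.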
Blocks with tens of thousands of transactions exceeded the 65535-parameter limit of the PostgreSQL wire protocol in a single `insert_all`, crash-looping the internal transactions DeleteQueue and amplifying indexer memory usage. Chunk the inserts via `Repo.safe_insert_all`, wrapped in a transaction to keep them atomic for callers that do not already have one, and select only the transaction fields consumed by the internal transactions fetcher instead of full structs.
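
The arithmetic behind the chunking can be sketched as below. This is an illustration under assumptions, not the body of `Repo.safe_insert_all`: `ChunkedInsertSketch` is a hypothetical module, rows are assumed to be maps with identical keys, and each field is assumed to consume exactly one bind parameter.

```elixir
defmodule ChunkedInsertSketch do
  # A single statement can bind at most 65535 parameters.
  @max_params 65_535

  def chunk_rows([]), do: []

  def chunk_rows([first | _] = rows) do
    # One bind parameter per column per row.
    columns = max(map_size(first), 1)
    rows_per_chunk = max(div(@max_params, columns), 1)
    Enum.chunk_every(rows, rows_per_chunk)
  end
end

# Usage: 50_000 rows of 3 columns need 150_000 parameters in total,
# so they are split into chunks of at most 21_845 rows each.
rows = for i <- 1..50_000, do: %{block_number: i, transaction_index: 0, index: 0}
chunks = ChunkedInsertSketch.chunk_rows(rows)
IO.inspect(Enum.map(chunks, &length/1), label: "chunk sizes")
```

Each chunk would then go through its own `insert_all` call; running those calls inside one transaction is what keeps the overall insert atomic.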
The DeleteQueue transaction deletes internal transactions and re-inserts pending operations for whole blocks. For massive blocks (tens of thousands of transactions) this exceeds the default 60s repo timeout, so the pool kills the connection mid-transaction and the batch retries forever. Inner queries already ran with `timeout: :infinity`; apply the same to the wrapping transaction.
No description provided.