Skip to content

Comments

Add option to validate sst files in the background on DB open#14322

Open
joshkang97 wants to merge 2 commits intofacebook:mainfrom
joshkang97:async_file_open
Open

Add option to validate sst files in the background on DB open#14322
joshkang97 wants to merge 2 commits intofacebook:mainfrom
joshkang97:async_file_open

Conversation

@joshkang97
Copy link
Contributor

@joshkang97 joshkang97 commented Feb 11, 2026

Summary

Add open_files_async option for faster DB startup. When enabled, SST file opening and validation is deferred to a background thread after DB::Open returns, reducing startup latency for databases with many SST files. WAL recovery remains synchronous.

To support this, FindTable is extended with a pinning mechanism that stores the cache handle directly on FileMetaData and sets FileDescriptor::table_reader atomically, so subsequent reads skip cache lookups. FileDescriptor::table_reader is changed from a raw pointer to std::atomic<TableReader*> with acquire/release ordering to safely handle concurrent access between the background opener and read threads.

Should validations fail, the only effect is that the handle will not get pinned. Future read requests will lookup the table reader again, and if any validations fail there it will get propagated to the user (basically existing behavior if max_open_files > 0).

Note that this feature is only useful when max_open_files=-1, because otherwise we always limit file opening to max 16 files, so DB open should never take that long.

Key changes

  • New open_files_async DB option with C and Java API bindings
  • BGWorkAsyncFileOpen background worker that opens all SST files post-DB::Open, with shutdown awareness
  • Add VersionBuilder::LoadTableHandlersFromBase, which loads table handlers from base storage instead of version edits.
  • FindTable extended with pin_table_handle param — when true, pins the table reader on FileMetaData so Get/MultiGet/Iterator skip redundant cache lookups. Also now returns the TableReader for some code simplification.
    • Note we explicitly add a parameter for this rather than make it default because there are some callers that create temporary FileMetadata's, and making it default would require them to properly clean up table handles.
    • Also FindTable does the pinned lookup now instead of having the callers do it
  • FileDescriptor::table_reader changed to std::atomic<TableReader*>. Use acquire release semantics for safe multithreaded handling.
  • VersionEditHandler gains skip_load_table_files so version recovery skips preloading when async open is enabled

Test Plan

  • OpenFilesAsyncTest parameterized over max_open_files (-1, 10), num_flushes (1, 5, 20), and ReadType (Get, MultiGet, Iterator)
    • AfterRead: reads happen before async opener, verifying lazy open and that the opener sees already-pinned readers
    • BeforeRead: async opener completes first, verifying reads use pre-loaded table readers
    • Shutdown: DB closes before async opener starts, verifying clean cancellation
    • Error: injected FindTable errors, verifying graceful fallback to lazy open
  • New option in crash test

Benchmark

Setup a DB with many sstfiles

./db_bench -benchmarks=fillseq -disable_auto_compactions=true -write_buffer_size=1000
./db_bench -use_existing_db=true -benchmarks=readrandom -reads=1 -report_open_timing=true -open_files_async=true

OpenDb:     48.4508 milliseconds
./db_bench -use_existing_db=true -benchmarks=readrandom -reads=1 -report_open_timing=true -open_files_async=false

OpenDb:     152.467 milliseconds

@meta-cla meta-cla bot added the CLA Signed label Feb 11, 2026
@joshkang97 joshkang97 force-pushed the async_file_open branch 11 times, most recently from 9bfa63c to abae004 Compare February 13, 2026 19:24
@joshkang97 joshkang97 marked this pull request as ready for review February 17, 2026 23:38
@meta-codesync
Copy link

meta-codesync bot commented Feb 17, 2026

@joshkang97 has imported this pull request. If you are a Meta employee, you can view this in D93538033.

@xingbowang
Copy link
Contributor

Codex

Review Findings for bf6e4b0f5ad12a5a787b435f9c86651b4d858967

This document consolidates all findings from the review of commit bf6e4b0f5ad12a5a787b435f9c86651b4d858967.

Findings (Ordered by Severity)

1) High: FIFO TTL compaction can delete files too aggressively before async table-open finishes

  • Area: FIFO compaction with open_files_async=true
  • Why it matters: Before table properties are loaded, TTL selection may still enqueue files for deletion. This can cause compactions that do not respect intended TTL semantics and can remove data prematurely.
  • Code paths involved:
    • Recovery skips eager table loading when open_files_async is enabled.
    • Compaction scheduling can begin before background file opening completes.
    • FIFO TTL picker can continue adding files even when table reader/properties are unavailable.

2) High: Data race / UB risk in async file opener due mutex contract violation

  • Area: Background async file open worker
  • Why it matters: GetLatestMutableCFOptions() is documented as requiring DB mutex. Async worker reads it without mutex and then uses the reference during table loading. Concurrent option updates can race and produce undefined behavior.
  • Code paths involved: BGWorkAsyncFileOpen() obtains mutable CF options without holding DB mutex.

3) High: OpenForReadOnly() can crash in compacted-path with open_files_async=true

  • Area: read-only open via CompactedDBImpl
  • Why it matters: CompactedDBImpl::Get/MultiGet dereference fd.table_reader directly. With open_files_async=true, recovery may skip loading table readers, so pointer can be null and trigger SIGSEGV.
  • Reproduction status: Confirmed by unit test (see "Test Evidence" below).

4) Medium: GetCreationTimeOfOldestFile() invariant breaks with async open

  • Area: metadata API assumptions
  • Why it matters: API remains available when max_open_files == -1, but code asserts all table readers are loaded. Under async open, this may not be true immediately after open, leading to assertion failure in debug builds.
  • Code paths involved: Version::GetCreationTimeOfOldestFile() asserts non-null table reader.

5) Medium: Option validation inconsistency across open modes

  • Area: mode-specific option validation
  • Why it matters: DBImpl::ValidateOptions() rejects open_files_async=true with finite max_open_files, but read-only/secondary/follower paths do not consistently apply this validation. Same option tuple behaves differently by open mode.
  • Observed behavior:
    • Writable open path validates and rejects.
    • Read-only and secondary opens currently accept in tested paths.
    • Follower path likely affected similarly, but Linux-only tests are required to execute.

6) Medium: Async file-opening job scheduled in HIGH pool can delay flushes

  • Area: startup scheduling / thread-pool priority
  • Why it matters: Async file-open job is submitted to HIGH-priority pool, same pool used by flush jobs. On low thread counts this can increase startup write latency or stall flush progress.

7) Medium: open_files_async semantics are effectively inconsistent/partial in secondary/follower

  • Area: ReactiveVersionSet/ManifestTailer recovery path
  • Why it matters: Secondary/follower open flows do not use the same recovery + warmup path as primary open. Behavior diverges from primary semantics and from user expectations that the option applies uniformly.

Suggested Fix Directions

  1. Protect compaction correctness under async open

    • Avoid TTL decisions based on missing table properties.
    • Gate TTL deletion logic until required metadata is available, or conservatively skip.
  2. Fix thread-safety in async opener

    • Avoid reading GetLatestMutableCFOptions() without mutex.
    • Copy needed fields under mutex into per-task immutable context.
  3. Harden compacted read-only path

    • Ensure table readers are loaded before use, or lazily open on first read.
    • Add null checks before dereference and return non-fatal status if unavailable.
  4. Unify option validation

    • Apply open_files_async / max_open_files validation consistently in read-only, secondary, and follower open APIs.
  5. Revisit scheduling priority

    • Consider LOW pool or throttled scheduling for async file opening to avoid flush interference.
  6. Add coverage for mode-specific semantics

    • Read-only + compacted-path tests (already added)
    • Secondary/follower consistency tests across platforms
    • API behavior tests for metadata methods immediately after open

Copy link
Contributor

@xingbowang xingbowang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Should we add some metrics to track how many files is pending open. How many files are opened, how long does it take to open one sst file in the context of background open?
  2. Once a corruption got detected during async open, it should call error_handler_.SetBGError(), so that the error is surfaced as soon as possible. We want to avoid running a database with corrupted data too long.

I still have a few more files to review table_cache.cc, db_test.cc and db_compaction_test.cc

@@ -0,0 +1 @@
Add a new option `open_files_async`. The existing behavior is on DB open, we open all sst files and do basic validations. For very large DBs with many ssts, this may take very long. This option performs these validations instead in the background.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be more precise, many sst is not the issue. The issue is high latency remote file system.

Comment on lines +635 to +636
// databases with many SST files. This means if max_open_files=-1 and
// open_files_async=true, a corrupted SST will not show up in DBOpen, but
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the code, this option only works if max_open_files == -1, otherwise, it returns invalid argument. Should we emphasize it in the comment here? Meantime, I am curious why does it require it to be -1?

// background after DB::Open returns. This reduces DB open time for
// databases with many SST files. This means if max_open_files=-1 and
// open_files_async=true, a corrupted SST will not show up in DBOpen, but
// instead will show up in any read requests that read from that file.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"instead will show up in any operation that accesses that file."
this is more precise, as compaction could touch it as well.

ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->ClearAllCallBacks();
}

TEST_P(OpenFilesAsyncTest, Error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't quite testing a truly corrupted file — it injects the corruption status after FindTable succeeds (via the AfterFindTable callback), so the table reader is already pinned. It only exercises the error logging path in BGWorkAsyncFileOpen, not the scenario where a file genuinely can't be opened and later reads fail. You can try this.

  1. Write data and flush to create SST files
  2. Close the DB
  3. Get the SST file paths (e.g., via GetLiveFilesMetaData before closing, or by scanning the directory for .sst files)
  4. Call test::CorruptFile() on the SST
  5. Reopen with open_files_async = true
  6. Verify that reads against the corrupted file return an error

Copy link
Contributor

@xingbowang xingbowang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went through all of the files. Thanks for adding this change. It looks very good as initial commit.

Some additional unit test gap analysis from AI.

Gaps and concerns:

  1. FindTable cache-hit + pin path (line 249-269) — NOT covered
    This is the else branch where cache_.Lookup(key) returns a hit, but the file isn't yet pinned. The test would need a scenario where a file is in the cache (from a previous lookup) but not pinned, and then FindTable is called with pin_table_handle=true. None of the current tests create this condition because with max_open_files=-1, files are always pinned on first access.
  2. Concurrent pinning race (line 258 — table_reader_handle != nullptr under mutex) — NOT covered
    This is the branch at line 258 where two threads try to pin the same file simultaneously: one wins and the other releases its duplicate handle. The AfterRead test creates a race between reads and async open, but because reads and async open iterate files sequentially, it's unlikely they'd race on the same file.
  3. Genuine file corruption — NOT covered
    As discussed earlier, the Error test injects corruption via sync point after FindTable succeeds, so the table reader is already pinned. It doesn't test what happens when a file genuinely can't be opened (e.g., corrupt SST on disk). A test using test::CorruptFile() would verify the real error path: background open fails, fd.table_reader remains null, subsequent reads encounter the error.
  4. Multiple column families — NOT covered
    ScheduleAsyncFileOpening iterates all column families and BGWorkAsyncFileOpen processes each Version. All tests use only the default CF. A test with multiple CFs would exercise the loop at lines 2738-2752.
  5. open_files_async=false (no regression) — NOT explicitly covered
    There's no test that verifies the synchronous open path still works correctly after the refactoring of FindTable and LoadTableHandlers. The existing tests should cover this implicitly, but there's no explicit before/after comparison.
  6. LevelIterator pinning (maybe_pin_table_handle=true at line 1300) — NOT directly tested by new tests
    The new code passes maybe_pin_table_handle=true to NewIterator for LevelIterator. The Iterator read type in the parameterized tests uses db_->NewIterator() which does go through this path, but there's no assertion specifically verifying that files accessed through level iteration get pinned.
  7. shutting_down_ flag during LoadTableHandlersHelper — partially covered
    The Shutdown test forces async open to start after ~DBImpl begins waiting, so shutting_down_ is true and LoadTableHandlersHelper bails early (via the stop atomic). But ASSERT_EQ(file_opens, 0) only checks that no files were opened — it doesn't verify the stop flag was actually what prevented them (vs. some other reason like ordering).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants