Add option to validate sst files in the background on DB open by joshkang97 · Pull Request #14322 · facebook/rocksdb

joshkang97 · 2026-02-11T19:58:17Z

Summary

Add open_files_async option for faster DB startup. When enabled, SST file opening and validation is deferred to a background thread after DB::Open returns, reducing startup latency for databases with many SST files. WAL recovery remains synchronous.

To support this, FindTable is extended with a pinning mechanism that stores the cache handle directly on FileMetaData and sets FileDescriptor::table_reader atomically, so subsequent reads skip cache lookups. FileDescriptor::table_reader is changed from a raw pointer to std::atomic<TableReader*> with acquire/release ordering to safely handle concurrent access between the background opener and read threads.

Should validations fail, the only effect is that the handle will not get pinned. Future read requests will lookup the table reader again, and if any validations fail there it will get propagated to the user (basically existing behavior if max_open_files > 0).

Note that this feature is only useful when max_open_files=-1, because otherwise we always limit file opening to max 16 files, so DB open should never take that long.

Key changes

New open_files_async DB option with C and Java API bindings
BGWorkAsyncFileOpen background worker that opens all SST files post-DB::Open, with shutdown awareness
Add VersionBuilder::LoadTableHandlersFromBase, which loads table handlers from base storage instead of version edits.
FindTable extended with pin_table_handle param — when true, pins the table reader on FileMetaData so Get/MultiGet/Iterator skip redundant cache lookups. Also now returns the TableReader for some code simplification.
- Note we explicitly add a parameter for this rather than make it default because there are some callers that create temporary FileMetadata's, and making it default would require them to properly clean up table handles.
- Also FindTable does the pinned lookup now instead of having the callers do it
FileDescriptor::table_reader changed to std::atomic<TableReader*>. Use acquire release semantics for safe multithreaded handling.
VersionEditHandler gains skip_load_table_files so version recovery skips preloading when async open is enabled

Test Plan

OpenFilesAsyncTest parameterized over max_open_files (-1, 10), num_flushes (1, 5, 20), and ReadType (Get, MultiGet, Iterator)
- AfterRead: reads happen before async opener, verifying lazy open and that the opener sees already-pinned readers
- BeforeRead: async opener completes first, verifying reads use pre-loaded table readers
- Shutdown: DB closes before async opener starts, verifying clean cancellation
- Error: injected FindTable errors, verifying graceful fallback to lazy open
New option in crash test

Benchmark

Setup a DB with many sstfiles

./db_bench -benchmarks=fillseq -disable_auto_compactions=true -write_buffer_size=1000

./db_bench -use_existing_db=true -benchmarks=readrandom -reads=1 -report_open_timing=true -open_files_async=true

OpenDb:     48.4508 milliseconds

./db_bench -use_existing_db=true -benchmarks=readrandom -reads=1 -report_open_timing=true -open_files_async=false

OpenDb:     152.467 milliseconds

meta-codesync · 2026-02-17T23:38:44Z

@joshkang97 has imported this pull request. If you are a Meta employee, you can view this in D93538033.

xingbowang · 2026-02-18T23:39:50Z

Codex

Review Findings for `bf6e4b0f5ad12a5a787b435f9c86651b4d858967`

This document consolidates all findings from the review of commit bf6e4b0f5ad12a5a787b435f9c86651b4d858967.

Findings (Ordered by Severity)

1) High: FIFO TTL compaction can delete files too aggressively before async table-open finishes

Area: FIFO compaction with open_files_async=true
Why it matters: Before table properties are loaded, TTL selection may still enqueue files for deletion. This can cause compactions that do not respect intended TTL semantics and can remove data prematurely.
Code paths involved:
- Recovery skips eager table loading when open_files_async is enabled.
- Compaction scheduling can begin before background file opening completes.
- FIFO TTL picker can continue adding files even when table reader/properties are unavailable.

2) High: Data race / UB risk in async file opener due mutex contract violation

Area: Background async file open worker
Why it matters: GetLatestMutableCFOptions() is documented as requiring DB mutex. Async worker reads it without mutex and then uses the reference during table loading. Concurrent option updates can race and produce undefined behavior.
Code paths involved: BGWorkAsyncFileOpen() obtains mutable CF options without holding DB mutex.

3) High: `OpenForReadOnly()` can crash in compacted-path with `open_files_async=true`

Area: read-only open via CompactedDBImpl
Why it matters: CompactedDBImpl::Get/MultiGet dereference fd.table_reader directly. With open_files_async=true, recovery may skip loading table readers, so pointer can be null and trigger SIGSEGV.
Reproduction status: Confirmed by unit test (see "Test Evidence" below).

4) Medium: `GetCreationTimeOfOldestFile()` invariant breaks with async open

Area: metadata API assumptions
Why it matters: API remains available when max_open_files == -1, but code asserts all table readers are loaded. Under async open, this may not be true immediately after open, leading to assertion failure in debug builds.
Code paths involved: Version::GetCreationTimeOfOldestFile() asserts non-null table reader.

5) Medium: Option validation inconsistency across open modes

Area: mode-specific option validation
Why it matters: DBImpl::ValidateOptions() rejects open_files_async=true with finite max_open_files, but read-only/secondary/follower paths do not consistently apply this validation. Same option tuple behaves differently by open mode.
Observed behavior:
- Writable open path validates and rejects.
- Read-only and secondary opens currently accept in tested paths.
- Follower path likely affected similarly, but Linux-only tests are required to execute.

6) Medium: Async file-opening job scheduled in HIGH pool can delay flushes

Area: startup scheduling / thread-pool priority
Why it matters: Async file-open job is submitted to HIGH-priority pool, same pool used by flush jobs. On low thread counts this can increase startup write latency or stall flush progress.

7) Medium: `open_files_async` semantics are effectively inconsistent/partial in secondary/follower

Area: ReactiveVersionSet/ManifestTailer recovery path
Why it matters: Secondary/follower open flows do not use the same recovery + warmup path as primary open. Behavior diverges from primary semantics and from user expectations that the option applies uniformly.

Suggested Fix Directions

Protect compaction correctness under async open
- Avoid TTL decisions based on missing table properties.
- Gate TTL deletion logic until required metadata is available, or conservatively skip.
Fix thread-safety in async opener
- Avoid reading GetLatestMutableCFOptions() without mutex.
- Copy needed fields under mutex into per-task immutable context.
Harden compacted read-only path
- Ensure table readers are loaded before use, or lazily open on first read.
- Add null checks before dereference and return non-fatal status if unavailable.
Unify option validation
- Apply open_files_async / max_open_files validation consistently in read-only, secondary, and follower open APIs.
Revisit scheduling priority
- Consider LOW pool or throttled scheduling for async file opening to avoid flush interference.
Add coverage for mode-specific semantics
- Read-only + compacted-path tests (already added)
- Secondary/follower consistency tests across platforms
- API behavior tests for metadata methods immediately after open

xingbowang

Should we add some metrics to track how many files is pending open. How many files are opened, how long does it take to open one sst file in the context of background open?
Once a corruption got detected during async open, it should call error_handler_.SetBGError(), so that the error is surfaced as soon as possible. We want to avoid running a database with corrupted data too long.

I still have a few more files to review table_cache.cc, db_test.cc and db_compaction_test.cc

xingbowang · 2026-02-19T00:09:17Z

unreleased_history/new_features/async_file_open

@@ -0,0 +1 @@
+Add a new option `open_files_async`. The existing behavior is on DB open, we open all sst files and do basic validations. For very large DBs with many ssts, this may take very long. This option performs these validations instead in the background. 


To be more precise, many sst is not the issue. The issue is high latency remote file system.

db/db_impl/db_impl.h

xingbowang · 2026-02-19T00:16:06Z

include/rocksdb/options.h

+  // databases with many SST files. This means if max_open_files=-1 and
+  // open_files_async=true, a corrupted SST will not show up in DBOpen, but


Looking at the code, this option only works if max_open_files == -1, otherwise, it returns invalid argument. Should we emphasize it in the comment here? Meantime, I am curious why does it require it to be -1?

xingbowang · 2026-02-19T00:18:45Z

include/rocksdb/options.h

+  // background after DB::Open returns. This reduces DB open time for
+  // databases with many SST files. This means if max_open_files=-1 and
+  // open_files_async=true, a corrupted SST will not show up in DBOpen, but
+  // instead will show up in any read requests that read from that file.


"instead will show up in any operation that accesses that file."
this is more precise, as compaction could touch it as well.

xingbowang · 2026-02-19T00:22:53Z

db/db_test.cc

+  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->ClearAllCallBacks();
+}
+
+TEST_P(OpenFilesAsyncTest, Error) {


This isn't quite testing a truly corrupted file — it injects the corruption status after FindTable succeeds (via the AfterFindTable callback), so the table reader is already pinned. It only exercises the error logging path in BGWorkAsyncFileOpen, not the scenario where a file genuinely can't be opened and later reads fail. You can try this.

Write data and flush to create SST files

Close the DB

Get the SST file paths (e.g., via GetLiveFilesMetaData before closing, or by scanning the directory for .sst files)

Call test::CorruptFile() on the SST

Reopen with open_files_async = true

Verify that reads against the corrupted file return an error

db/version_builder.cc

db/version_builder.h

xingbowang

I went through all of the files. Thanks for adding this change. It looks very good as initial commit.

Some additional unit test gap analysis from AI.

Gaps and concerns:

FindTable cache-hit + pin path (line 249-269) — NOT covered
This is the else branch where cache_.Lookup(key) returns a hit, but the file isn't yet pinned. The test would need a scenario where a file is in the cache (from a previous lookup) but not pinned, and then FindTable is called with pin_table_handle=true. None of the current tests create this condition because with max_open_files=-1, files are always pinned on first access.
Concurrent pinning race (line 258 — table_reader_handle != nullptr under mutex) — NOT covered
This is the branch at line 258 where two threads try to pin the same file simultaneously: one wins and the other releases its duplicate handle. The AfterRead test creates a race between reads and async open, but because reads and async open iterate files sequentially, it's unlikely they'd race on the same file.
Genuine file corruption — NOT covered
As discussed earlier, the Error test injects corruption via sync point after FindTable succeeds, so the table reader is already pinned. It doesn't test what happens when a file genuinely can't be opened (e.g., corrupt SST on disk). A test using test::CorruptFile() would verify the real error path: background open fails, fd.table_reader remains null, subsequent reads encounter the error.
Multiple column families — NOT covered
ScheduleAsyncFileOpening iterates all column families and BGWorkAsyncFileOpen processes each Version. All tests use only the default CF. A test with multiple CFs would exercise the loop at lines 2738-2752.
open_files_async=false (no regression) — NOT explicitly covered
There's no test that verifies the synchronous open path still works correctly after the refactoring of FindTable and LoadTableHandlers. The existing tests should cover this implicitly, but there's no explicit before/after comparison.
LevelIterator pinning (maybe_pin_table_handle=true at line 1300) — NOT directly tested by new tests
The new code passes maybe_pin_table_handle=true to NewIterator for LevelIterator. The Iterator read type in the parameterized tests uses db_->NewIterator() which does go through this path, but there's no assertion specifically verifying that files accessed through level iteration get pinned.
shutting_down_ flag during LoadTableHandlersHelper — partially covered
The Shutdown test forces async open to start after ~DBImpl begins waiting, so shutting_down_ is true and LoadTableHandlersHelper bails early (via the stop atomic). But ASSERT_EQ(file_opens, 0) only checks that no files were opened — it doesn't verify the stop flag was actually what prevented them (vs. some other reason like ordering).

db/table_cache.cc

db/table_cache.h

meta-cla bot added the CLA Signed label Feb 11, 2026

joshkang97 force-pushed the async_file_open branch 11 times, most recently from 9bfa63c to abae004 Compare February 13, 2026 19:24

initial commit

bf6e4b0

joshkang97 force-pushed the async_file_open branch from abae004 to bf6e4b0 Compare February 13, 2026 22:31

Merge branch 'main' into async_file_open

f3741ce

joshkang97 requested review from pdillinger and xingbowang February 17, 2026 23:37

joshkang97 marked this pull request as ready for review February 17, 2026 23:38

xingbowang reviewed Feb 19, 2026

View reviewed changes

db/table_cache.cc Show resolved Hide resolved

db/table_cache.cc Show resolved Hide resolved

db/table_cache.cc Show resolved Hide resolved

db/table_cache.cc Show resolved Hide resolved

db/table_cache.h Show resolved Hide resolved

		@@ -0,0 +1 @@
		Add a new option `open_files_async`. The existing behavior is on DB open, we open all sst files and do basic validations. For very large DBs with many ssts, this may take very long. This option performs these validations instead in the background.

		// databases with many SST files. This means if max_open_files=-1 and
		// open_files_async=true, a corrupted SST will not show up in DBOpen, but

Comments

Conversation

joshkang97 commented Feb 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Key changes

Test Plan

Benchmark

Uh oh!

meta-codesync bot commented Feb 17, 2026

Uh oh!

xingbowang commented Feb 18, 2026

Review Findings for bf6e4b0f5ad12a5a787b435f9c86651b4d858967

Findings (Ordered by Severity)

1) High: FIFO TTL compaction can delete files too aggressively before async table-open finishes

2) High: Data race / UB risk in async file opener due mutex contract violation

3) High: OpenForReadOnly() can crash in compacted-path with open_files_async=true

4) Medium: GetCreationTimeOfOldestFile() invariant breaks with async open

5) Medium: Option validation inconsistency across open modes

6) Medium: Async file-opening job scheduled in HIGH pool can delay flushes

7) Medium: open_files_async semantics are effectively inconsistent/partial in secondary/follower

Suggested Fix Directions

Uh oh!

xingbowang left a comment

Choose a reason for hiding this comment

Uh oh!

xingbowang Feb 19, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

xingbowang Feb 19, 2026

Choose a reason for hiding this comment

Uh oh!

xingbowang Feb 19, 2026

Choose a reason for hiding this comment

Uh oh!

xingbowang Feb 19, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

xingbowang left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

joshkang97 commented Feb 11, 2026 •

edited

Loading

Review Findings for `bf6e4b0f5ad12a5a787b435f9c86651b4d858967`

3) High: `OpenForReadOnly()` can crash in compacted-path with `open_files_async=true`

4) Medium: `GetCreationTimeOfOldestFile()` invariant breaks with async open

7) Medium: `open_files_async` semantics are effectively inconsistent/partial in secondary/follower