Add option to validate sst files in the background on DB open#14322
Add option to validate sst files in the background on DB open#14322joshkang97 wants to merge 2 commits intofacebook:mainfrom
Conversation
9bfa63c to
abae004
Compare
abae004 to
bf6e4b0
Compare
|
@joshkang97 has imported this pull request. If you are a Meta employee, you can view this in D93538033. |
|
Codex Review Findings for
|
xingbowang
left a comment
There was a problem hiding this comment.
- Should we add some metrics to track how many files is pending open. How many files are opened, how long does it take to open one sst file in the context of background open?
- Once a corruption got detected during async open, it should call error_handler_.SetBGError(), so that the error is surfaced as soon as possible. We want to avoid running a database with corrupted data too long.
I still have a few more files to review table_cache.cc, db_test.cc and db_compaction_test.cc
| @@ -0,0 +1 @@ | |||
| Add a new option `open_files_async`. The existing behavior is on DB open, we open all sst files and do basic validations. For very large DBs with many ssts, this may take very long. This option performs these validations instead in the background. | |||
There was a problem hiding this comment.
To be more precise, many sst is not the issue. The issue is high latency remote file system.
| // databases with many SST files. This means if max_open_files=-1 and | ||
| // open_files_async=true, a corrupted SST will not show up in DBOpen, but |
There was a problem hiding this comment.
Looking at the code, this option only works if max_open_files == -1, otherwise, it returns invalid argument. Should we emphasize it in the comment here? Meantime, I am curious why does it require it to be -1?
| // background after DB::Open returns. This reduces DB open time for | ||
| // databases with many SST files. This means if max_open_files=-1 and | ||
| // open_files_async=true, a corrupted SST will not show up in DBOpen, but | ||
| // instead will show up in any read requests that read from that file. |
There was a problem hiding this comment.
"instead will show up in any operation that accesses that file."
this is more precise, as compaction could touch it as well.
| ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->ClearAllCallBacks(); | ||
| } | ||
|
|
||
| TEST_P(OpenFilesAsyncTest, Error) { |
There was a problem hiding this comment.
This isn't quite testing a truly corrupted file — it injects the corruption status after FindTable succeeds (via the AfterFindTable callback), so the table reader is already pinned. It only exercises the error logging path in BGWorkAsyncFileOpen, not the scenario where a file genuinely can't be opened and later reads fail. You can try this.
- Write data and flush to create SST files
- Close the DB
- Get the SST file paths (e.g., via GetLiveFilesMetaData before closing, or by scanning the directory for .sst files)
- Call test::CorruptFile() on the SST
- Reopen with open_files_async = true
- Verify that reads against the corrupted file return an error
xingbowang
left a comment
There was a problem hiding this comment.
I went through all of the files. Thanks for adding this change. It looks very good as initial commit.
Some additional unit test gap analysis from AI.
Gaps and concerns:
- FindTable cache-hit + pin path (line 249-269) — NOT covered
This is the else branch where cache_.Lookup(key) returns a hit, but the file isn't yet pinned. The test would need a scenario where a file is in the cache (from a previous lookup) but not pinned, and then FindTable is called with pin_table_handle=true. None of the current tests create this condition because with max_open_files=-1, files are always pinned on first access. - Concurrent pinning race (line 258 — table_reader_handle != nullptr under mutex) — NOT covered
This is the branch at line 258 where two threads try to pin the same file simultaneously: one wins and the other releases its duplicate handle. The AfterRead test creates a race between reads and async open, but because reads and async open iterate files sequentially, it's unlikely they'd race on the same file. - Genuine file corruption — NOT covered
As discussed earlier, the Error test injects corruption via sync point after FindTable succeeds, so the table reader is already pinned. It doesn't test what happens when a file genuinely can't be opened (e.g., corrupt SST on disk). A test using test::CorruptFile() would verify the real error path: background open fails, fd.table_reader remains null, subsequent reads encounter the error. - Multiple column families — NOT covered
ScheduleAsyncFileOpening iterates all column families and BGWorkAsyncFileOpen processes each Version. All tests use only the default CF. A test with multiple CFs would exercise the loop at lines 2738-2752. - open_files_async=false (no regression) — NOT explicitly covered
There's no test that verifies the synchronous open path still works correctly after the refactoring of FindTable and LoadTableHandlers. The existing tests should cover this implicitly, but there's no explicit before/after comparison. - LevelIterator pinning (maybe_pin_table_handle=true at line 1300) — NOT directly tested by new tests
The new code passes maybe_pin_table_handle=true to NewIterator for LevelIterator. The Iterator read type in the parameterized tests uses db_->NewIterator() which does go through this path, but there's no assertion specifically verifying that files accessed through level iteration get pinned. - shutting_down_ flag during LoadTableHandlersHelper — partially covered
The Shutdown test forces async open to start after ~DBImpl begins waiting, so shutting_down_ is true and LoadTableHandlersHelper bails early (via the stop atomic). But ASSERT_EQ(file_opens, 0) only checks that no files were opened — it doesn't verify the stop flag was actually what prevented them (vs. some other reason like ordering).
Summary
Add
open_files_asyncoption for faster DB startup. When enabled, SST file opening and validation is deferred to a background thread afterDB::Openreturns, reducing startup latency for databases with many SST files. WAL recovery remains synchronous.To support this,
FindTableis extended with a pinning mechanism that stores the cache handle directly onFileMetaDataand setsFileDescriptor::table_readeratomically, so subsequent reads skip cache lookups.FileDescriptor::table_readeris changed from a raw pointer tostd::atomic<TableReader*>with acquire/release ordering to safely handle concurrent access between the background opener and read threads.Should validations fail, the only effect is that the handle will not get pinned. Future read requests will lookup the table reader again, and if any validations fail there it will get propagated to the user (basically existing behavior if max_open_files > 0).
Note that this feature is only useful when max_open_files=-1, because otherwise we always limit file opening to max 16 files, so DB open should never take that long.
Key changes
open_files_asyncDB option with C and Java API bindingsBGWorkAsyncFileOpenbackground worker that opens all SST files post-DB::Open, with shutdown awarenessVersionBuilder::LoadTableHandlersFromBase, which loads table handlers from base storage instead of version edits.FindTableextended withpin_table_handleparam — when true, pins the table reader onFileMetaDataso Get/MultiGet/Iterator skip redundant cache lookups. Also now returns theTableReaderfor some code simplification.FileDescriptor::table_readerchanged tostd::atomic<TableReader*>. Use acquire release semantics for safe multithreaded handling.VersionEditHandlergainsskip_load_table_filesso version recovery skips preloading when async open is enabledTest Plan
OpenFilesAsyncTestparameterized overmax_open_files(-1, 10),num_flushes(1, 5, 20), andReadType(Get, MultiGet, Iterator)FindTableerrors, verifying graceful fallback to lazy openBenchmark
Setup a DB with many sstfiles