Skip to content

Comments

Add corruption guards against hardware bit flips#14354

Open
xingbowang wants to merge 1 commit intofacebook:mainfrom
xingbowang:2026_02_18_corruption_622561
Open

Add corruption guards against hardware bit flips#14354
xingbowang wants to merge 1 commit intofacebook:mainfrom
xingbowang:2026_02_18_corruption_622561

Conversation

@xingbowang
Copy link
Contributor

@xingbowang xingbowang commented Feb 19, 2026

Summary:
Investigation of a production SST corruption revealed that a single hardware bit flip (bit 32) in a value Slice's size_ field during compaction caused a 5.6MB value to be treated as 4.3GB. The corruption was silent because:

  • BlockBuilder's varint encoding uses static_cast<uint32_t>(value.size()), which truncated the corrupted size back to the correct value
  • buffer_.append(value.data(), value.size()) used the full 64-bit corrupted size, appending 4GB of adjacent heap memory into the block
  • The block checksum was computed over the corrupted data, so it matched
  • Compression was bypassed (block exceeded kCompressionSizeLimit)
  • paranoid_file_checks was disabled

This change adds 2 layers of defense:

  1. BlockBuilder uint32 truncation guard: In AddWithLastKeyImpl, detect when value.size() exceeds uint32_t range before the varint encoding silently truncates it. This catches bit flips in bits 32-63 and aborts immediately, preventing the massive buffer_.append from occurring. This applies to both flush and compaction paths since all key-value pairs flow through BlockBuilder::AddWithLastKeyImpl.

  2. Compaction output/input size ratio check: New option max_compaction_output_to_input_ratio (default: 10, 0 to disable) in MutableCFOptions. After compaction, if total output size exceeds this ratio times total input size, return Status::Corruption. This catches cases where corrupted values inflate the output file far beyond what input data justifies. This check applies to compaction only (not flush), since flush writes from a memtable and does not have input files for comparison. However, flush output is still protected by guards Miss Spelling in README #1 and OSX compilation #2.

Test Plan:
Unit test

@meta-cla meta-cla bot added the CLA Signed label Feb 19, 2026
@xingbowang xingbowang force-pushed the 2026_02_18_corruption_622561 branch 3 times, most recently from 9bcc95a to b72a991 Compare February 20, 2026 15:20
@meta-codesync
Copy link

meta-codesync bot commented Feb 20, 2026

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D93868104.

@xingbowang xingbowang marked this pull request as ready for review February 20, 2026 15:35
Summary:
Investigation of a production SST corruption revealed that a single hardware
bit flip (bit 32) in a value Slice's size_ field during compaction caused a
5.6MB value to be treated as 4.3GB. The corruption was silent because:
- BlockBuilder's varint encoding uses static_cast<uint32_t>(value.size()),
  which truncated the corrupted size back to the correct value
- buffer_.append(value.data(), value.size()) used the full 64-bit corrupted
  size, appending 4GB of adjacent heap memory into the block
- The block checksum was computed over the corrupted data, so it matched
- paranoid_file_checks was disabled

This change adds 2 layers of defense:

1. **Value size uint32 truncation guard**: In BlockBasedTableBuilder::Add(),
   detect when value.size() exceeds uint32_t range before passing it to
   BlockBuilder where the varint encoding would silently truncate it.
   Returns Status::Corruption so the flush/compaction can abort gracefully
   without crashing the process. This applies to both flush and compaction
   paths since all key-value pairs flow through BlockBasedTableBuilder::Add().

2. **Compaction output/input size ratio check**: New option
   max_compaction_output_to_input_ratio (default: 10, 0 to disable) in
   MutableCFOptions. After compaction, if total output size exceeds this ratio
   times total input size, return Status::Corruption. This catches cases where
   corrupted values inflate the output file far beyond what input data justifies.
   This check applies to compaction only (not flush), since flush writes from a
   memtable and does not have input files for comparison. However, flush output
   is still protected by guards facebook#1 and facebook#2.

Additionally, both the flush loop (db/builder.cc) and the compaction output
loop (db/compaction/compaction_outputs.cc) now check builder->status() after
each Add() call. Previously, a corruption error set inside the builder was
only surfaced when Finish() was called, causing the loop to continue
iterating through potentially millions of keys doing wasted work.

Test Plan:
Unit test
@xingbowang xingbowang force-pushed the 2026_02_18_corruption_622561 branch from b72a991 to de62d01 Compare February 20, 2026 17:19
@meta-codesync
Copy link

meta-codesync bot commented Feb 20, 2026

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D93868104.


// Count consecutive transient in-memory corruption errors (e.g., hardware
// bit flips). On first occurrence, allow retry. On second, escalate to fatal.
int transient_corruption_retry_count_ = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if we run into this corruption in one compaction then while retrying if we detect another corruption from another compaction (or flush), we will give up I guess (even though retry may go through for both).

if (log_io_s.ok()) {
if (ShouldRetryTransientCorruption(s, cfd->GetName().c_str())) {
// Transient corruption: memtable has been rolled back and
// imm_flush_needed is set, so the flush will be rescheduled.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Asking for my own understanding - where are we setting imm_flush_needed back to true? Or do we call RollbackMemtableFlush() in this flow?

// Default: 10 (output cannot exceed 10x input size)
//
// Dynamically changeable through SetOptions() API.
uint64_t max_compaction_output_to_input_ratio = 10;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I wonder if we want to prevent users from setting this to 1 inadvertently.

@hx235 hx235 self-requested a review February 21, 2026 01:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants