Separate keys and values in data blocks #14287
joshkang97 wants to merge 16 commits into facebook:main from
Conversation
75efe6c to 5d82802 (compare)
5d82802 to f63fb86 (compare)
c87a047 to 96d18c6 (compare)
@joshkang97 has imported this pull request. If you are a Meta employee, you can view this in D92103024.
xingbowang left a comment
Overall looks pretty good.
The diff is missing C and Java bindings for the new option.
I am also curious whether we have test coverage on changing the restart interval when separate_key_value is enabled. Maybe this is covered by an existing parameterized test case. Please confirm.
71be3b2 to 65f04b0 (compare)
Looks like the format check still fails. Run

@xingbowang, this is pretty weird

I guess
include/rocksdb/table.h
Outdated
// improve read performance at a cost of a varint per restart interval (~1 bit
// per key by default). It has also been shown to improve compression. Small
// values or low block_restart_interval may prefer to set this as false.
// Requires format_version >= 8.
format_version is generally for passive features, when you want to clearly phase out doing things the old way. Depending on the universality of the evidence, this could be considered an "always on in the future" feature, but I'm not sure we've reached that conclusion. In general, if a feature is opt-in and there's a reasonable failure scenario on downgrade to an incompatible release (no quiet data corruption), it doesn't need a format_version.
For example, Anand's work on custom table readers and indexes didn't use a format_version bump. Ribbon filters did not use a format_version bump. The passive feature in format_version=7 is tracking the set of compression types used in the file, which enables choosing optimized decompressors, with important side benefits in good error messaging (from #13659).
In general, try to choose between an option or a format_version, not both.
Looking at the details a bit more, I believe this was overlooked in removing format_version stuff: "and there's a reasonable failure scenario on downgrade to an incompatible release (no quiet data corruption)". It looks like "use_separated_kv_storage" is determined by a table property, which would be ignored by older versions and they would read data in a corrupt way.
For fixing, aside from format_version, there is a place in the footer that could be used to mark the new feature, but it's not a great fit. https://github.com/facebook/rocksdb/blob/10.11.fb/table/format.cc#L214 Ideally each block would be marked with its own metadata about how to interpret that block, to reduce extra plumbing and in case it makes sense to vary the strategy per block in the future. Currently there's a marker for data block hash index, but there's no existing reserved space in that uint32 for new features like that.
But I've been able to work around that in what I believe is a quite reasonable way, in #14332, without needing a new format_version. Please rework this change to work with that one. You might still include a table property for debugging and statistical purposes.
For benchmarks, I would also like to see
(a) What is the difference in uncompressed SST file size? (After compaction, to avoid issues with differences in obsolete data)
-> The importance of the uncompressed size is memory usage
(b) Run a similar comparison with small values, like 16 bytes.
(c) With profiling, where is the CPU difference? Is it in indexing or something else like decompression?
(c2) Run a similar comparison with compression disabled
(d) Is there a compressed size improvement with LZ4, ZSTD default level 3, and ZSTD level 6, and 16KB block size?
(e) What does the improvement look like with variable key size (modify db_bench)?
(f) What about with range queries returning several key-values on average?
(g) What about CPU (time) in compaction? Generating the blocks is not as local of an operation.
If we continue to get strong signals, we could consider making this format_version=8 instead of a named option, to phase out the old format (and avoid option explosion).
table/block_based/block.cc
Outdated
DecodeFixed32(data() + size() - sizeof(uint32_t));

if (use_separated_kv_storage) {
  if (value_offset) {
This if doesn't make sense to me. It seems like the footer should always be two fixed32 when use_separated_kv_storage.
…es (#14332)

Summary: I'm implementing this intending it to be used for #14287

Refactor the data block footer encoding/decoding to use a struct-based Encode/Decode API (DataBlockFooter), reserving the top 4 bits of the footer for metadata:
- Bit 31: Hash index present (kDataBlockBinaryAndHash) - existing use
- Bits 28-30: Reserved for future features

Comments have some detail for why it is safe to assume no practical existing SST files would use these newly reserved bits. And for forward compatibility, existing versions detect (non-zero) use of these new bits as impossibly large num_restarts and report "bad block contents". Not perfect, but not bad.

Key changes:
- Replace PackIndexTypeAndNumRestarts/UnPackIndexTypeAndNumRestarts with DataBlockFooter::EncodeTo/DecodeFrom methods
- DecodeFrom returns a detailed error when reserved bits are set, enabling graceful failure on newer format versions
- Reduce kMaxNumRestarts from 2^31-1 to 2^28-1 (268M), which is adequate for the maximum possible restarts in a 4GiB block
- Add GetCorruptionStatus() to Block for detailed error messages (Note that we are sensitive to the size of Block objects, so have to avoid adding unnecessary new members.)
- Remove obsolete kMaxBlockSizeSupportedByHashIndex size checks

Pull Request resolved: #14332

Test Plan:
- Existing unit tests and format compatibility test
- Add test for reserved bit detection (ReservedBitInDataBlockFooter)

Reviewed By: joshkang97
Differential Revision: D93293152
Pulled By: pdillinger
fbshipit-source-id: b65a83e96bb09a98fb9b8b2dd9f754653ca7ed4d
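The reserved-bit scheme described in this commit message can be sketched roughly as follows. This is a hypothetical illustration of the packing, not the actual `DataBlockFooter` implementation; the function names and exact masks are made up for the example:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative layout matching the commit message: bits 0-27 hold
// num_restarts, bits 28-30 are reserved, bit 31 marks the hash index.
constexpr uint32_t kMaxNumRestarts = (1u << 28) - 1;  // 268M; enough for a 4GiB block
constexpr uint32_t kHashIndexBit = 1u << 31;          // kDataBlockBinaryAndHash
constexpr uint32_t kReservedMask = 0x7u << 28;        // bits 28-30

// Hypothetical analogue of DataBlockFooter::EncodeTo for the packed word.
uint32_t EncodeFooterWord(uint32_t num_restarts, bool hash_index) {
  assert(num_restarts <= kMaxNumRestarts);
  return num_restarts | (hash_index ? kHashIndexBit : 0u);
}

// Hypothetical analogue of DecodeFrom: fail loudly when reserved bits are
// set, so older code reports corruption instead of misreading new blocks.
bool DecodeFooterWord(uint32_t word, uint32_t* num_restarts, bool* hash_index) {
  if ((word & kReservedMask) != 0) {
    return false;  // graceful "bad block contents" error path
  }
  *num_restarts = word & kMaxNumRestarts;
  *hash_index = (word & kHashIndexBit) != 0;
  return true;
}
```

The key property is that any use of bits 28-30 decodes as an error rather than as a plausible (but wrong) restart count.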
pdillinger left a comment
LGTM with some minor tweaks
DataBlockHashIndex data_block_hash_index_;

// Pointer to values section, nullptr if not using separated KV
const char* values_section_{nullptr};
Block is currently 80 bytes (no internal fragmentation). This would grow it to 88 bytes, which under jemalloc internal fragmentation would be 96 bytes. For 16KB block, that's a 0.1% increase in block cache memory size (major memory user) for index and data blocks, which is not nothing.
I don't want to derail this feature, but we have enough opt-in features here that everyone is paying for in memory overheads that I think it's worth trying to (re-)optimize sometime.
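For scale, the 0.1% figure follows from simple arithmetic, assuming jemalloc's 16-byte-spaced small size classes as described in the comment above (80-byte and 96-byte classes):

```cpp
// Per-block memory overhead of growing the Block object from an 80-byte
// allocation to the 96-byte jemalloc size class, relative to a 16KB data
// block (illustrative arithmetic only).
constexpr double BlockOverheadFraction(double old_alloc_bytes,
                                       double new_alloc_bytes,
                                       double data_block_bytes) {
  return (new_alloc_bytes - old_alloc_bytes) / data_block_bytes;
}
// BlockOverheadFraction(80, 96, 16 * 1024) is about 0.00098, i.e. ~0.1%.
```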
// - The high 4 bits are reserved for metadata/features:
//   - Bit 31: Hash index present (kDataBlockBinaryAndHash)
//   - Bits 28-30: Reserved for future features
//   - Bit 30: Separated KV storage (keys and values stored in separate
You found a bug with bit 30 and old versions rejecting new data, right?
//     sections). When true, values_section_offset indicates where the values
//     section begins within the block data.
bool separated_kv = false;
uint32_t values_section_offset = 0;
Likely better struct layout by reversing the order of these fields.
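The layout suggestion can be illustrated with a generic example (hypothetical structs, not the actual footer type): placing a `bool` before a `uint32_t` forces alignment padding, while grouping the larger aligned members first lets the small ones pack together:

```cpp
#include <cstdint>

// A bool placed before each uint32_t costs 3 bytes of padding per pair.
struct Scattered {
  bool flag_a;        // 1 byte + 3 bytes padding
  uint32_t offset_a;  // 4 bytes
  bool flag_b;        // 1 byte + 3 bytes padding
  uint32_t offset_b;  // 4 bytes
};                    // typically 16 bytes

// Grouping the 4-byte members first leaves only tail padding.
struct Grouped {
  uint32_t offset_a;  // 4 bytes
  uint32_t offset_b;  // 4 bytes
  bool flag_a;        // 1 byte
  bool flag_b;        // 1 byte + 2 bytes tail padding
};                    // typically 12 bytes
```

Exact sizes depend on the ABI, but on common platforms the grouped layout is strictly smaller.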
// Set up values_section_ from footer if separated KV storage is used
if (size != 0 && footer.separated_kv) {
  if (footer.values_section_offset > restart_offset_) {
    size = 0;

const char* TEST_GetKVChecksum() const { return kv_checksum_; }

 private:

// Whether the SST file uses separated key/value storage in data blocks (0 =
// false).
uint64_t separated_kv_in_data_block = 0;
I prefer not to have slightly different names for essentially the same thing:
- separated_kv_in_data_block
- separate_key_value_in_data_block
This property might not age well if we start to mix separation strategies based on factors in the data, such as avoiding separation for small values.
@@ -0,0 +1 @@
Add a new table option `separate_key_value_in_data_block`. When set to true, keys and values will be stored separately in the data block, which can result in a higher CPU cache hit rate and better compression. Works best with data blocks that have sufficient restart intervals and large values.
Previous versions of RocksDB will reject files written using this option.
Summary
Introduce a new table option for separated key-value storage in data blocks, along with a format_version bump to 8.
This PR implements a new SST block format where keys and values are stored in separate sections within data blocks, rather than interleaved. Keys are stored first, followed by all values in a contiguous section. The motivation is a better CPU cache hit rate during seeks and potentially better compression.
The additional storage cost is a varint per restart point, plus 4 additional bytes in the block footer. For a data block with a restart interval of 16, that is approximately 1 bit of overhead per entry. Compression actually performs better, resulting in ~3% storage savings in benchmarks.
For now I've opted not to separate KVs in non-data blocks, since the restart interval for those blocks is typically 1, and values are typically small and probably better inlined.
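As a rough sketch of why one varint per restart point suffices: a reader can recover every entry's value offset from the offset stored at the restart point plus the running sum of preceding value sizes, since values are contiguous in the values section. This is illustrative code under those assumptions, not the actual iterator implementation:

```cpp
#include <cstdint>
#include <vector>

// Given the value_offset stored at a restart point and the value sizes of
// the entries in that restart interval, reconstruct each entry's value
// offset: entry i's value starts at prev_value_offset + prev_value_size.
std::vector<uint32_t> ReconstructValueOffsets(
    uint32_t restart_value_offset, const std::vector<uint32_t>& value_sizes) {
  std::vector<uint32_t> offsets;
  offsets.reserve(value_sizes.size());
  uint32_t off = restart_value_offset;
  for (uint32_t sz : value_sizes) {
    offsets.push_back(off);  // this entry's value starts here
    off += sz;               // the next value follows contiguously
  }
  return offsets;
}
```

For example, a restart point whose values section offset is 100 and whose interval holds values of sizes 5, 3, and 7 yields entry offsets 100, 105, and 108.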
New block layout
Entry Format
- At restart points: `value_offset` is stored explicitly (it is only stored at restart points, to save space).
- At non-restart points: the value offset is derived as `prev_value_offset + prev_value_size`.
Key Changes
- Track `cur_entry_idx_`, which was previously only used for per-kv checksum purposes. In this new format, we also need to know the `block_restart_interval`, which was previously also only calculated for per-kv checksums.
- Add `data_block_restart_interval`, `index_block_restart_interval`, and `separate_key_value_in_data_block` to table properties.
Test Plan
Benchmark
Varying Value Size
Write: db_bench --num=1000000 --format_version=8 --separate_key_value_in_data_block=<bool> --value_size=<X> --benchmarks=fillrandom,compact
Read: db_bench --db=$DB --use_existing_db --benchmarks=readrandom,readseq
Varying Block Restart Interval
Write: db_bench --num=1000000 --format_version=8 --separate_key_value_in_data_block=<bool> --block_restart_interval=<X> --benchmarks=fillrandom,compact
Read: db_bench --db=$DB --use_existing_db --benchmarks=readrandom,readseq
Varying Compression
Write: db_bench --num=1000000 --format_version=8 --separate_key_value_in_data_block=<bool> --compression_type=<X> [--compression_level=<N>] --benchmarks=fillrandom,compact
Read: db_bench --db=$DB --use_existing_db --benchmarks=readrandom,readseq
Varying Block Size
Write: db_bench --num=1000000 --format_version=8 --separate_key_value_in_data_block=<bool> --block_size=<X> --benchmarks=fillrandom,compact
Read: db_bench --db=$DB --use_existing_db --benchmarks=readrandom,readseq
Varying Key Size
Write: db_bench --num=1000000 --format_version=8 --separate_key_value_in_data_block=<bool> --min_key_size=10 --max_key_size=100 --benchmarks=fillrandom,compact
Read: db_bench --db=$DB --use_existing_db --benchmarks=readrandom,readseq
CPU Profile Notes