Separate keys and values in data blocks #14287
joshkang97 wants to merge 16 commits into facebook:main from
Conversation
75efe6c to 5d82802 (compare)
5d82802 to f63fb86 (compare)
c87a047 to 96d18c6 (compare)
@joshkang97 has imported this pull request. If you are a Meta employee, you can view this in D92103024.
xingbowang left a comment
Overall looks pretty good.
The diff is missing C and Java bindings for the new option.
I am also curious whether we have test coverage on changing the restart interval when separate_key_value is enabled. Maybe this is covered by an existing parameterized test case. Please confirm.
71be3b2 to 65f04b0 (compare)
Looks like the format check still fails. Run

@xingbowang, this is pretty weird

I guess
include/rocksdb/table.h
Outdated
// improve read performance at a cost of a varint per restart interval (~1 bit
// per key by default). It has also been shown to improve compression. Small
// values or low block_restart_interval may prefer to set this as false.
// Requires format_version >= 8.
format_version is generally for passive features, when you want to clearly phase out doing things the old way. Depending on the universality of the evidence, this could be considered an "always on in the future" feature, but I'm not sure we've reached that conclusion. In general, if a feature is opt-in and there's a reasonable failure scenario on downgrade to an incompatible release (no quiet data corruption), it doesn't need a format_version.
For example, Anand's work on custom table readers and indexes didn't use a format_version bump. Ribbon filters did not use a format_version bump. The passive feature in format_version=7 is tracking the set of compression types used in the file, which enables choosing optimized decompressors, with important side benefits in good error messaging (from #13659).
In general, try to choose between an option or a format_version, not both.
Looking at the details a bit more, I believe this was overlooked in removing format_version stuff: "and there's a reasonable failure scenario on downgrade to an incompatible release (no quiet data corruption)". It looks like "use_separated_kv_storage" is determined by a table property, which would be ignored by older versions and they would read data in a corrupt way.
For fixing, aside from format_version, there is a place in the footer that could be used to mark the new feature, but it's not a great fit. https://github.com/facebook/rocksdb/blob/10.11.fb/table/format.cc#L214 Ideally each block would be marked with its own metadata about how to interpret that block, to reduce extra plumbing and in case it makes sense to vary the strategy per block in the future. Currently there's a marker for data block hash index, but there's no existing reserved space in that uint32 for new features like that.
But I've been able to work around that in what I believe is a quite reasonable way, in #14332, without needing a new format_version. Please rework this change to work with that one. You might still include a table property for debugging and statistical purposes.
For benchmarks, I would also like to see
(a) What is the difference in uncompressed SST file size? (After compaction, to avoid issues with differences in obsolete data)
-> The importance of the uncompressed size is memory usage
(b) Run a similar comparison with small values, like 16 bytes.
(c) With profiling, where is the CPU difference? Is it in indexing or something else like decompression?
(c2) Run a similar comparison with compression disabled
(d) Is there a compressed size improvement with LZ4, ZSTD default level 3, and ZSTD level 6, and 16KB block size?
(e) What does the improvement look like with variable key size (modify db_bench)?
(f) What about with range queries returning several key-values on average?
(g) What about CPU (time) in compaction? Generating the blocks is not as local of an operation.
If we continue to get strong signals, we could consider making this format_version=8 instead of a named option, to phase out the old format (and avoid option explosion).
table/block_based/block.cc
Outdated
DecodeFixed32(data() + size() - sizeof(uint32_t));

if (use_separated_kv_storage) {
  if (value_offset) {
This if doesn't make sense to me. It seems like the footer should always be two fixed32 when use_separated_kv_storage.
…es (#14332)

Summary: I'm implementing this intending it to be used for #14287

Refactor the data block footer encoding/decoding to use a struct-based Encode/Decode API (DataBlockFooter), reserving the top 4 bits of the footer for metadata:
- Bit 31: Hash index present (kDataBlockBinaryAndHash) - existing use
- Bits 28-30: Reserved for future features

Comments have some detail for why it is safe to assume no practical existing SST files would use these newly reserved bits. And for forward compatibility, existing versions detect (non-zero) use of these new bits as impossibly large num_restarts and report "bad block contents". Not perfect, but not bad.

Key changes:
- Replace PackIndexTypeAndNumRestarts/UnPackIndexTypeAndNumRestarts with DataBlockFooter::EncodeTo/DecodeFrom methods
- DecodeFrom returns a detailed error when reserved bits are set, enabling graceful failure on newer format versions
- Reduce kMaxNumRestarts from 2^31-1 to 2^28-1 (268M), which is adequate for the maximum possible restarts in a 4GiB block
- Add GetCorruptionStatus() to Block for detailed error messages (Note that we are sensitive to the size of Block objects, so have to avoid adding unnecessary new members.)
- Remove obsolete kMaxBlockSizeSupportedByHashIndex size checks

Pull Request resolved: #14332

Test Plan:
- Existing unit tests and format compatibility test
- Add test for reserved bit detection (ReservedBitInDataBlockFooter)

Reviewed By: joshkang97
Differential Revision: D93293152
Pulled By: pdillinger
fbshipit-source-id: b65a83e96bb09a98fb9b8b2dd9f754653ca7ed4d
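The reserved-bit scheme described in this commit message can be sketched roughly as follows. This is a hypothetical illustration of the packing, not the actual `DataBlockFooter` implementation; the function names and exact masks are made up for the example:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative layout matching the commit message: bits 0-27 hold
// num_restarts, bits 28-30 are reserved, bit 31 marks the hash index.
constexpr uint32_t kMaxNumRestarts = (1u << 28) - 1;  // 268M; enough for a 4GiB block
constexpr uint32_t kHashIndexBit = 1u << 31;          // kDataBlockBinaryAndHash
constexpr uint32_t kReservedMask = 0x7u << 28;        // bits 28-30

// Hypothetical analogue of DataBlockFooter::EncodeTo for the packed word.
uint32_t EncodeFooterWord(uint32_t num_restarts, bool hash_index) {
  assert(num_restarts <= kMaxNumRestarts);
  return num_restarts | (hash_index ? kHashIndexBit : 0u);
}

// Hypothetical analogue of DecodeFrom: fail loudly when reserved bits are
// set, so older code reports corruption instead of misreading new blocks.
bool DecodeFooterWord(uint32_t word, uint32_t* num_restarts, bool* hash_index) {
  if ((word & kReservedMask) != 0) {
    return false;  // graceful "bad block contents" error path
  }
  *num_restarts = word & kMaxNumRestarts;
  *hash_index = (word & kHashIndexBit) != 0;
  return true;
}
```

The key property is that any use of bits 28-30 decodes as an error rather than as a plausible (but wrong) restart count.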
pdillinger left a comment
LGTM with some minor tweaks
DataBlockHashIndex data_block_hash_index_;

// Pointer to values section, nullptr if not using separated KV
const char* values_section_{nullptr};
Block is currently 80 bytes (no internal fragmentation). This would grow it to 88 bytes, which under jemalloc internal fragmentation would be 96 bytes. For 16KB block, that's a 0.1% increase in block cache memory size (major memory user) for index and data blocks, which is not nothing.
I don't want to derail this feature, but we have enough opt-in features here that everyone is paying for in memory overheads that I think it's worth trying to (re-)optimize sometime.
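For scale, the 0.1% figure follows from simple arithmetic, assuming jemalloc's 16-byte-spaced small size classes as described in the comment above (80-byte and 96-byte classes):

```cpp
// Per-block memory overhead of growing the Block object from an 80-byte
// allocation to the 96-byte jemalloc size class, relative to a 16KB data
// block (illustrative arithmetic only).
constexpr double BlockOverheadFraction(double old_alloc_bytes,
                                       double new_alloc_bytes,
                                       double data_block_bytes) {
  return (new_alloc_bytes - old_alloc_bytes) / data_block_bytes;
}
// BlockOverheadFraction(80, 96, 16 * 1024) is about 0.00098, i.e. ~0.1%.
```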
// - The high 4 bits are reserved for metadata/features:
//   - Bit 31: Hash index present (kDataBlockBinaryAndHash)
//   - Bits 28-30: Reserved for future features
//   - Bit 30: Separated KV storage (keys and values stored in separate
You found a bug with bit 30 and old versions rejecting new data, right?
//     sections). When true, values_section_offset indicates where the values
//     section begins within the block data.
bool separated_kv = false;
uint32_t values_section_offset = 0;
Likely better struct layout by reversing the order of these fields.
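The layout suggestion can be illustrated with a generic example (hypothetical structs, not the actual footer type): placing a `bool` before a `uint32_t` forces alignment padding, while grouping the larger aligned members first lets the small ones pack together:

```cpp
#include <cstdint>

// A bool placed before each uint32_t costs 3 bytes of padding per pair.
struct Scattered {
  bool flag_a;        // 1 byte + 3 bytes padding
  uint32_t offset_a;  // 4 bytes
  bool flag_b;        // 1 byte + 3 bytes padding
  uint32_t offset_b;  // 4 bytes
};                    // typically 16 bytes

// Grouping the 4-byte members first leaves only tail padding.
struct Grouped {
  uint32_t offset_a;  // 4 bytes
  uint32_t offset_b;  // 4 bytes
  bool flag_a;        // 1 byte
  bool flag_b;        // 1 byte + 2 bytes tail padding
};                    // typically 12 bytes
```

Exact sizes depend on the ABI, but on common platforms the grouped layout is strictly smaller.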
// Set up values_section_ from footer if separated KV storage is used
if (size != 0 && footer.separated_kv) {
  if (footer.values_section_offset > restart_offset_) {
    size = 0;

const char* TEST_GetKVChecksum() const { return kv_checksum_; }

 private:

// Whether the SST file uses separated key/value storage in data blocks (0 =
// false).
uint64_t separated_kv_in_data_block = 0;
I prefer not to have slightly different names for essentially the same thing:
- separated_kv_in_data_block
- separate_key_value_in_data_block
This property might not age well if we start to mix separation strategies based on factors in the data, such as avoiding separation for small values.
@@ -0,0 +1 @@
Add a new table option `separate_key_value_in_data_block`. When set to true, keys and values will be stored separately in the data block, which can result in a higher CPU cache hit rate and better compression. Works best with data blocks that have sufficient restart intervals and large values.
Previous versions of RocksDB will reject files written using this option.
Summary
Introduce a new table option for separated key-value storage in data blocks, along with a format_version bump to 8.
This PR implements a new SST block format where keys and values are stored in separate sections within data blocks, rather than interleaved. Keys are stored first, followed by all values in a contiguous section. The motivation is a better CPU cache hit rate during seeks and potentially better compression.
The additional storage cost is a varint per restart point, plus 4 additional bytes in the block footer. For a data block with a restart interval of 16, that is approximately 1 bit of overhead per entry. Compression actually performs better, resulting in ~3% storage savings in benchmarks.
For now I've opted not to separate KVs in non-data blocks, since the restart interval for those blocks is typically 1, and values are typically small and probably better inlined.
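As a rough sketch of why one varint per restart point suffices: a reader can recover every entry's value offset from the offset stored at the restart point plus the running sum of preceding value sizes, since values are contiguous in the values section. This is illustrative code under those assumptions, not the actual iterator implementation:

```cpp
#include <cstdint>
#include <vector>

// Given the value_offset stored at a restart point and the value sizes of
// the entries in that restart interval, reconstruct each entry's value
// offset: entry i's value starts at prev_value_offset + prev_value_size.
std::vector<uint32_t> ReconstructValueOffsets(
    uint32_t restart_value_offset, const std::vector<uint32_t>& value_sizes) {
  std::vector<uint32_t> offsets;
  offsets.reserve(value_sizes.size());
  uint32_t off = restart_value_offset;
  for (uint32_t sz : value_sizes) {
    offsets.push_back(off);  // this entry's value starts here
    off += sz;               // the next value follows contiguously
  }
  return offsets;
}
```

For example, a restart point whose values section offset is 100 and whose interval holds values of sizes 5, 3, and 7 yields entry offsets 100, 105, and 108.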
New block layout
Entry Format
- At restart points: `value_offset` is stored explicitly (it is only stored at restart points, to save space).
- At non-restart points: the value offset is derived as `prev_value_offset + prev_value_size`.
Key Changes
- Track `cur_entry_idx_`, which was previously only used for per-kv checksum purposes. In this new format, we also need to know the `block_restart_interval`, which was previously also only calculated for per-kv checksums.
- Add `data_block_restart_interval`, `index_block_restart_interval`, and `separate_key_value_in_data_block` to table properties.
Test Plan
Benchmark
Varying Value Size
Write: db_bench --num=1000000 --format_version=8 --separate_key_value_in_data_block=<bool> --value_size=<X> --benchmarks=fillrandom,compact
Read: db_bench --db=$DB --use_existing_db --benchmarks=readrandom,readseq
Varying Block Restart Interval
Write: db_bench --num=1000000 --format_version=8 --separate_key_value_in_data_block=<bool> --block_restart_interval=<X> --benchmarks=fillrandom,compact
Read: db_bench --db=$DB --use_existing_db --benchmarks=readrandom,readseq
Varying Compression
Write: db_bench --num=1000000 --format_version=8 --separate_key_value_in_data_block=<bool> --compression_type=<X> [--compression_level=<N>] --benchmarks=fillrandom,compact
Read: db_bench --db=$DB --use_existing_db --benchmarks=readrandom,readseq
Varying Block Size
Write: db_bench --num=1000000 --format_version=8 --separate_key_value_in_data_block=<bool> --block_size=<X> --benchmarks=fillrandom,compact
Read: db_bench --db=$DB --use_existing_db --benchmarks=readrandom,readseq
Varying Key Size
Write: db_bench --num=1000000 --format_version=8 --separate_key_value_in_data_block=<bool> --min_key_size=10 --max_key_size=100 --benchmarks=fillrandom,compact
Read: db_bench --db=$DB --use_existing_db --benchmarks=readrandom,readseq
CPU Profile Notes