Enable data sieving for chunks that can't be cached #6111

jhendersonHDF · 2025-12-15T23:31:37Z

Fixed an issue that prevented use of a data sieve buffer for I/O on dataset chunks when those chunks couldn't be cached by the library. This issue could result in worst-case behavior of I/O on a single data element at a time when chunks are non-contiguous with respect to memory layout.

Added a test to attempt to catch performance regressions in I/O on dataset chunks that are non-contiguous with respect to memory layout.

Updated the External File List logic to set the data sieve buffer size to the smaller of the dataset size and the size set in the FAPL, matching the logic elsewhere in the library.
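
To make the worst case described above concrete, here is a minimal, self-contained sketch of the idea behind data sieving. It does not use HDF5 at all: `fake_file_t`, `sieve_t`, `raw_read`, `sieved_read`, and `count_io_calls` are hypothetical names invented for illustration, and the 64-byte buffer and 256-byte "file" are arbitrary. It only counts how many low-level reads are needed to gather single elements at a fixed stride, with and without an intermediate buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a file: a byte array plus a count of I/O calls. */
typedef struct {
    const unsigned char *data;
    size_t               size;
    size_t               io_calls;
} fake_file_t;

/* Hypothetical sieve buffer state (illustrative only, not HDF5's layout). */
typedef struct {
    unsigned char buf[64];
    size_t        loc; /* file offset of buf[0]; meaningful only if len > 0 */
    size_t        len; /* number of valid bytes currently in buf */
} sieve_t;

/* One low-level read counts as one I/O call. */
static void raw_read(fake_file_t *f, size_t off, void *dst, size_t n)
{
    assert(off + n <= f->size);
    memcpy(dst, f->data + off, n);
    f->io_calls++;
}

/* Read n bytes at off through the sieve buffer, refilling it on a miss. */
static void sieved_read(fake_file_t *f, sieve_t *s, size_t off, void *dst, size_t n)
{
    assert(n <= sizeof(s->buf));
    assert(off + n <= f->size);

    if (s->len == 0 || off < s->loc || off + n > s->loc + s->len) {
        size_t fill = f->size - off;

        if (fill > sizeof(s->buf))
            fill = sizeof(s->buf);
        raw_read(f, off, s->buf, fill); /* one large read instead of many small ones */
        s->loc = off;
        s->len = fill;
    }
    memcpy(dst, s->buf + (off - s->loc), n);
}

/* Gather one byte every `stride` bytes from a 256-byte file and
 * return the number of low-level I/O calls that were needed. */
size_t count_io_calls(int use_sieve, size_t stride)
{
    unsigned char storage[256];
    unsigned char out;
    fake_file_t   f = {storage, sizeof(storage), 0};
    sieve_t       s = {{0}, 0, 0};
    size_t        off;

    assert(stride > 0);
    memset(storage, 0xAB, sizeof(storage));

    for (off = 0; off < f.size; off += stride) {
        if (use_sieve)
            sieved_read(&f, &s, off, &out, 1);
        else
            raw_read(&f, off, &out, 1);
    }
    return f.io_calls;
}
```

With a stride of 4, the unbuffered path issues one call per element, while the buffered path issues one call per 64-byte refill. The real library code is considerably more involved; this only illustrates why losing the sieve buffer can degrade to element-at-a-time I/O.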

Important

Re-enable data sieve buffer for non-cached dataset chunks to improve I/O performance and add a test for performance regressions.

Behavior:
- Re-enable data sieve buffer for I/O on non-cached dataset chunks in H5D__chunk_read() and H5D__chunk_write() in H5Dchunk.c.
- Update H5D__efl_construct() in H5Defl.c to set sieve buffer size to the smaller of dataset size and FAPL size.
- Add test chunk_non_contig_mem_io in io_perf.c to catch performance regressions for non-contiguous memory layout chunks.
Misc:
- Update CHANGELOG.md to document the performance fix.
- Add io_perf to H5_EXPRESS_TESTS in CMakeLists.txt.

This description was created by ellipsis-dev for cb23842. You can customize this summary. It will automatically update as commits are pushed.

jhendersonHDF · 2025-12-15T23:34:36Z

Related writeup:
HDF5_chunked_dataset_IO_performance_issue.pdf

Note that this is a rather quick fix for the issue; some refactoring of the chunked I/O code would be better in the long run.

test/io_perf.c

src/H5Dchunk.c

jhendersonHDF · 2025-12-15T23:36:30Z

src/H5Dchunk.c

+         */
+        dset_info->dset->shared->cache.contig.sieve_buf_size = H5F_SIEVE_BUF_SIZE(dset_info->dset->oloc.file);
+        dset_info->dset->shared->cache.contig.sieve_loc      = HADDR_UNDEF;
+        dset_info->dset->shared->cache.contig.sieve_size     = 0;


These lines force re-reads into the sieve buffer on each I/O operation to account for cases such as the extent of a dataset changing between writes and reads.

So the sieve buffer isn't persistent for chunked datasets?

Should we free the sieve buffer after I/O since the data in it is useless?

The sieve buffer is currently freed when the dataset is closed, but we could free it after I/O if you think that's better; it just seemed more efficient to keep it around. Setting sieve_loc to HADDR_UNDEF and sieve_size to 0 is just to make sure that the library doesn't read stale data from the sieve buffer in the case where the dataset extent changes. Note the sieve_size here is just the value tracking how much data is currently in the buffer.

Since we don't need the data in the buffer, keeping it around effectively treats it like a buffer on the free list. The sieve buffer is already integrated with the free list, however, so I think we should just free it after I/O (it will be held open by the H5FL code if appropriate).
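
For readers following this thread, the invalidation being discussed can be sketched in isolation. This is a hedged illustration only: `sieve_state_t`, `ADDR_UNDEF`, `sieve_invalidate`, and `sieve_can_serve` are hypothetical names, with `ADDR_UNDEF` standing in for the library's `HADDR_UNDEF`, and the field names merely echo the ones in the diff above. The point is that resetting the location and the valid-byte count makes every lookup miss, while the allocated capacity is left alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for HADDR_UNDEF (assumption: an all-ones "no address" sentinel). */
#define ADDR_UNDEF UINT64_MAX

/* Hypothetical mirror of the sieve-related fields mentioned in the diff. */
typedef struct {
    uint64_t sieve_loc;      /* file address of the buffered region */
    size_t   sieve_size;     /* bytes of valid data currently in the buffer */
    size_t   sieve_buf_size; /* allocated capacity of the buffer */
} sieve_state_t;

/* Mark the buffer contents as unusable without touching its capacity. */
void sieve_invalidate(sieve_state_t *s)
{
    assert(s != NULL);
    s->sieve_loc  = ADDR_UNDEF;
    s->sieve_size = 0;
}

/* Return 1 if [addr, addr + len) lies entirely inside the valid buffered data. */
int sieve_can_serve(const sieve_state_t *s, uint64_t addr, size_t len)
{
    assert(s != NULL);
    if (s->sieve_loc == ADDR_UNDEF || s->sieve_size == 0)
        return 0; /* nothing valid is buffered, so force a re-read */
    return addr >= s->sieve_loc && addr + len <= s->sieve_loc + s->sieve_size;
}
```

After `sieve_invalidate()`, a request that previously hit the buffer misses and must go back to the file, which is the behavior wanted when the dataset extent may have changed between operations.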

jhendersonHDF · 2025-12-15T23:38:11Z

src/H5Defl.c

+    /* Get the sieve buffer size for this dataset - the smaller of the dataset size and
+     * the sieve buffer size from the FAPL is used
+     */
+    dset->shared->cache.contig.sieve_buf_size =


Updated the EFL code here to set the sieve buffer size to the smaller of the dataset size and the FAPL-specified size. This just matches the logic elsewhere in the library and what's currently documented for H5Pset_sieve_buf_size():

Internally, the library checks the storage sizes of the datasets in
the file. It picks the smaller one between the size from the file
access property and the size of the dataset to allocate the sieve
buffer for the dataset in order to save memory usage.
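
The rule quoted above amounts to taking a minimum. A minimal sketch, assuming both sizes are already known in bytes; `pick_sieve_buf_size` is a hypothetical helper for illustration and is not a function in the library:

```c
#include <stddef.h>

/* Hypothetical helper: choose how many bytes to allocate for a dataset's
 * sieve buffer, given the dataset's storage size and the size requested
 * through the file access property list. */
size_t pick_sieve_buf_size(size_t dataset_bytes, size_t fapl_sieve_bytes)
{
    /* Never allocate more than the dataset itself could fill. */
    return dataset_bytes < fapl_sieve_bytes ? dataset_bytes : fapl_sieve_bytes;
}
```

For example, a 4096-byte dataset with a 65536-byte setting would get a 4096-byte buffer, while a much larger dataset would get the full 65536 bytes.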

jhendersonHDF · 2025-12-15T23:45:40Z

test/io_perf.c

The test in this file is designed so that, when a sieve buffer is not used, it runs long enough to cause a CMake timeout at a TestExpress level of 0 but still runs very quickly at a TestExpress level of 3. We currently test at a TestExpress level of 0 for every PR, but those tests should probably be moved to a scheduled run as originally intended, since tests like this can be problematic when run at that level for every PR.

fortnern · 2026-01-16T16:00:32Z

I would consider this a new feature, not a fix

fortnern · 2026-01-16T16:01:41Z

Has data sieving ever been used in this case before? The AI seems to think so but I think it's wrong.

src/H5Dint.c

fortnern

Marking "request changes", but really just requesting discussion/responses to my comments; then we'll decide if changes are needed.

jhendersonHDF · 2026-01-16T16:11:05Z

> I would consider this a new feature, not a fix

I can see that, but it was very clearly a bug caused by an oversight in the way the chunking code was written, which led to drastic performance problems.

> Has data sieving ever been used in this case before?

I haven't gone through the history much, but randomly picking 1.10.3 as a test branch shows the same problem.

release_docs/CHANGELOG.md

jhendersonHDF · 2026-01-16T21:18:39Z

src/H5Dchunk.c

 H5FL_EXTERN(H5S_sel_iter_t);

+/* Declare the external PQ free list for the sieve buffer information */
+H5FL_BLK_EXTERN(sieve_buf);


With the sieve buffer being managed by H5Dchunk.c as well, it needs a reference to the free list. This is a bit messy as the sieve buffer is still almost completely controlled by H5Dcontig.c, but refactoring that seems outside the scope of this PR.

jhendersonHDF · 2026-01-16T21:19:46Z

src/H5Dchunk.c


 done:
+    /* Free dataset sieve buffer and reset cached fields */
+    if (dset_info->dset->shared->cache.sieve.sieve_buf) {


While the resetting of fields in these sections is mostly redundant with the same resets above, I wanted to keep both so the context (comment) isn't lost in potential future refactoring.

jhendersonHDF added the Component - C Library and Component - Documentation labels Dec 15, 2025

github-project-automation bot added this to HDF5 - TRIAGE & TRACK Dec 15, 2025

jhendersonHDF added the Component - Testing label Dec 15, 2025

github-project-automation bot moved this to To be triaged in HDF5 - TRIAGE & TRACK Dec 15, 2025

ellipsis-dev bot reviewed Dec 15, 2025

jhendersonHDF commented Dec 15, 2025

jhendersonHDF commented Dec 15, 2025

jhendersonHDF force-pushed the chunked_io_perf_issue branch from 7f37860 to cb23842 Compare December 15, 2025 23:39

jhendersonHDF commented Dec 15, 2025

jhendersonHDF assigned brtnfld, fortnern and mattjala Dec 15, 2025

jhendersonHDF marked this pull request as ready for review December 16, 2025 05:04

jhendersonHDF requested review from bmribler, brtnfld, byrnHDF, derobins, fortnern, glennsong09, lrknox, mattjala, qkoziol and vchoi-hdfgroup as code owners December 16, 2025 05:04

mattjala previously approved these changes Dec 17, 2025

lrknox reviewed Dec 18, 2025

jhendersonHDF dismissed mattjala’s stale review via f5e42d8 December 18, 2025 16:35

jhendersonHDF force-pushed the chunked_io_perf_issue branch from cb23842 to f5e42d8 Compare December 18, 2025 16:35

jhendersonHDF requested a review from mattjala December 18, 2025 16:35

mattjala previously approved these changes Dec 23, 2025

lrknox previously approved these changes Jan 16, 2026

fortnern reviewed Jan 16, 2026

fortnern requested changes Jan 16, 2026

github-project-automation bot moved this from To be triaged to In progress in HDF5 - TRIAGE & TRACK Jan 16, 2026

fortnern reviewed Jan 16, 2026

ajelenak added this to the HDF5 2.1.0 milestone Jan 16, 2026

jhendersonHDF dismissed stale reviews from lrknox and mattjala via 70943da January 16, 2026 20:38

jhendersonHDF force-pushed the chunked_io_perf_issue branch from b1a99b2 to 680331c Compare January 16, 2026 20:40

Address review comments

24a543e

jhendersonHDF force-pushed the chunked_io_perf_issue branch from 680331c to 24a543e Compare January 16, 2026 20:57

jhendersonHDF commented Jan 16, 2026

jhendersonHDF requested review from fortnern, lrknox and mattjala January 16, 2026 22:18

fortnern approved these changes Jan 20, 2026
