There are so many open issues that it can feel quite intimidating! This freezes action and creates a vicious cycle in which more issues lead to more paralysis, which leads to more issues.
I asked Claude Sonnet 4.6 to look at our issues and suggest what could be done to close large numbers of them. Here's the analysis:
Top 10 Bugfixes/Improvements to Close the Most Open Issues on Unidata/netcdf-c
Analysis of all 285 open issues on Unidata/netcdf-c, clustered by root cause. Issues often span multiple categories, so a single fix in a high-impact area has outsized effect.
1. Fix nc_get/put_vars Stride Performance (~15 issues)
Categories: Performance, Windows perf, nccopy slowness
The NCDEFAULT_get/put_vars code path is the single most complained-about performance bottleneck. It causes:
- Strided reads 100–1000× slower than contiguous (Speed up NCDEFAULT_get/put_vars code #1381, Slicing very slow in 4.7.4 (reported) #1757, NetCDF slow writes when using the stride parameter. #1877, nc_get_vars incredibly slow in Windows compared to Linux #2721)
- nccopy regressions between versions ('nccopy' much slower in 4.7.4 vs. 4.6.1 #1947, Using nccopy, setting deflate level to 0 ignores chunking specification #391)
- nc_open slowdowns (Slower nc_open with version 4.9.3 #3183)
- Chunking + compression write time growth (nc_put_var_double execution time increases in subsequent runs when a variable is written with chunking and compression #2750)
- NC_SHARE performance changes (Performance change with NC_SHARE #1773)
- HDF5 stride semantic mismatch with unlimited dims (HDF5 stride has different semantics than netcdf stride semantics wrt unlimited. #1380)
Fix: Rewrite NCDEFAULT_get/put_vars to use chunk-aware bulk I/O instead of element-by-element dispatch. Also reconcile HDF5 vs netCDF stride semantics for unlimited dimensions.
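To make the difference between element-by-element dispatch and bulk I/O concrete, here is a minimal 1-D sketch in plain C. It is not the netcdf-c code: `contig_reader`, `read_strided_elementwise`, `read_strided_bulk`, and `array_reader` are hypothetical names, and the callback stands in for whatever contiguous read the storage layer offers. The bulk variant reads the bounding range once and subsamples in memory.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical callback: read n contiguous elements starting at index
 * `start` into buf; returns 0 on success. */
typedef int (*contig_reader)(void *ctx, size_t start, size_t n, double *buf);

/* Slow pattern: one dispatch (one read call) per requested element. */
int read_strided_elementwise(contig_reader rd, void *ctx, size_t start,
                             size_t count, size_t stride, double *out)
{
    if (stride == 0) return -1;
    for (size_t i = 0; i < count; i++) {
        int rc = rd(ctx, start + i * stride, 1, &out[i]);
        if (rc != 0) return rc;
    }
    return 0;
}

/* Bulk pattern: a single read of the bounding range, then an in-memory
 * subsample. Trades temporary memory for far fewer read calls. */
int read_strided_bulk(contig_reader rd, void *ctx, size_t start,
                      size_t count, size_t stride, double *out)
{
    if (count == 0) return 0;
    if (stride == 0) return -1;
    size_t span = (count - 1) * stride + 1;   /* elements covered */
    double *tmp = malloc(span * sizeof *tmp);
    if (tmp == NULL) return -1;
    int rc = rd(ctx, start, span, tmp);
    if (rc == 0) {
        for (size_t i = 0; i < count; i++)
            out[i] = tmp[i * stride];
    }
    free(tmp);
    return rc;
}

/* Demo reader backed by an in-memory array; ctx points at the array. */
int array_reader(void *ctx, size_t start, size_t n, double *buf)
{
    const double *src = ctx;
    for (size_t i = 0; i < n; i++)
        buf[i] = src[start + i];
    return 0;
}
```

Both functions return the same values; the bulk version issues one read instead of `count` reads. For very large strides the bounding range can be much bigger than the result, which is why a real fix would presumably work per chunk rather than over the whole range.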
2. Modernize CMake Build System & Fix Windows/MSVC Builds (~35 unique issues)
Categories: CMake (22), Windows/MSVC (20), Static/Linking (19) — heavy overlap
The build system is the #1 source of user frustration. Recurring themes:
- MSVC 2022 build failures (build netcdf with CMAKE under MSVC 2022 #2697, build netcdf 4.9.3 with hdf5-1.14.3 in windows using MSVC 2022 using cmake #2877, build version 4.9.3 with MSVC 2022 and CMAKE GUI #3172, Intel OneAPI on Windows compile failure #3260)
- Static build link order wrong (For static build, libraries are listed in wrong order #3120, Unresolved external using static build on Windows #3129)
- _MSC_VER vs _WIN32 misuse breaks MinGW (I/O issues in mingw due to _MSC_VER misuse #1105, _MSC_VER -> _WIN32 #1108)
- Incomplete DLL exports (Incomplete library exports on Windows #554)
- CMake config file inconsistencies (Inconsistent NetCDF naming in CMake Config file #1140, ConfigPackageLocation needs to be absolute path #2893, HDF5 not found if using hdf5-config.cmake #877)
- Dependency detection failures (curl, szip, blosc, HDF5) (Build failure on Windows when using CURL (e.g. error C2061: syntax error: identifier 'curl_socket_t') #3148, Can't manually specify szip library location when using cmake #2570, cmake build fails when Blosc is enabled #2268, [CMAKE] build with HDF5 with external zlib #3099)
- TTL_LIBS debug/optimized keyword issue ([CMake] Variable TTL_LIBS debug/optimized keywords are getting eaten #1579)
Fix: CMake modernization (#2713) using proper targets, find_package configs, and generator expressions. Fix Windows symbol exports with a single .def file or __declspec audit. This alone would close 30+ issues.
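As an illustration of what a `__declspec` audit might converge on, here is a minimal sketch of an export macro. It keys the platform decision on `_WIN32` (defined by both MSVC and MinGW) and reserves `_MSC_VER` for compiler-specific workarounds. The names `DEMO_API`, `DEMO_BUILD_DLL`, `DEMO_USE_DLL`, and `demo_add` are hypothetical and are not the macros netcdf-c uses.

```c
/* Platform decision: _WIN32 covers MSVC and MinGW alike. */
#if defined(_WIN32)
#  if defined(DEMO_BUILD_DLL)
#    define DEMO_API __declspec(dllexport)   /* building the DLL */
#  elif defined(DEMO_USE_DLL)
#    define DEMO_API __declspec(dllimport)   /* consuming the DLL */
#  else
#    define DEMO_API                         /* static library: no decoration */
#  endif
#else
#  define DEMO_API                           /* non-Windows platforms */
#endif

/* Compiler decision: only genuinely MSVC-specific workarounds go here. */
#if defined(_MSC_VER)
#  define DEMO_IS_MSVC 1
#else
#  define DEMO_IS_MSVC 0
#endif

/* Every public function carries the same macro in its declaration. */
DEMO_API int demo_add(int a, int b);

int demo_add(int a, int b)
{
    return a + b;
}
```

Keeping the static case undecorated matters because `dllimport` on a symbol that is actually linked statically is a common source of unresolved externals on Windows.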
3. Fix NCZarr Interoperability & Correctness (~25 issues)
Categories: NCZarr/Zarr (25), overlaps with S3 and filters
NCZarr is the newest major feature and has the most open bugs per feature:
- Can't read Xarray-generated Zarr (data values not right when opening Xarray-generated zarr with ncdump #2449, NCZarr does not support reading of many zarr files #2474, Interoperability with Xarray Zarr #3214)
- Scalars not handled (Zarr scalars aren't not correctly handled #3108)
- Empty variables/metadata parsing (Bad parsing of Zarr metadata in case empty variable (v. 4.9.2) #2748, NCZarr reading a Zarr file with no variable name #2603)
- String variables fail (Issue creating zarr from datasets with string variables #2259, netCDF fails on zarr file with variable length array #2516)
- Dimension names lost in noxarray mode (Dimension names are lost in mode nczarr,noxarray #2647)
- Parallel read/write not supported (NCZarr parallel read/write support #2657)
- Filter support incomplete (NCZarr Support for Zarr Filters #2006, Zarr copy fails on write - variable uncompressable #3075)
- In-memory Zarr not supported (In-memory support for nczarr. #2079)
Fix: Zarr V2 spec compliance audit + Xarray interop testing. Most of these are metadata-handling bugs in libnczarr. A systematic pass through the Zarr V2 spec would close ~15 issues.
4. Fix DAP2/DAP4 Client Bugs (~24 issues)
Categories: DAP/OPeNDAP (24), overlaps with ncdump, authentication
DAP issues cluster into three sub-problems:
- Authentication/cookies broken: unable to authenticate OpenDAP #1966, OpenDAP access with Authentication #1833, unable to use thredds access on Windows, problem with cookies #2380, While working with thredds server, netcdf-c library creates thousands of /tmp/occookie* files #2184 (thousands of /tmp/occookie* files), Cookie file cannot be read and written: (null) #1827
- DAP4 correctness: Segfault on string vars (Segmentation fault when reading string variable over DAP4 #3042), checksum inconsistency (Inconsistent Checksum behavior in DAP4 #3151), attribute name escaping (ncdump DAP4 response fails to handle the attribute name that contains special characters #3115), HTTP headers too large (ncdump issues HTTP GET requests with headers that are too big (>8k bytes) #3113)
- DAP2 regressions: 'ncdap_tst_remote3' test failure in Linux #2242, Wrong return status when opendaping... #1074, IRI/LDEO OPeNDAP endpoint for NOAA ESSRT errors in NetCDF >= 4.7.3 #2013, Potential error of data type between opendap access and fileserver access #175
Fix: (a) Rewrite the cookie/auth handling to use libcurl's cookie jar properly. (b) Fix DAP4 string/attribute parsing. These two fixes would close ~15 DAP issues.
5. Fix Big-Endian / s390x Support (~10 issues, blocks CI)
Categories: Big-Endian (10), overlaps with test infrastructure
Every release breaks on big-endian:
- ncx.m4 byte-swap bugs (Bug in ncx.m4 only exposed under certain (big-endian) build systems #3286)
- Endianness test failures (endianess tests fail when run on big-endian system #3284, tst_h5_endians fails on s390x #3062)
- ncdump fails (ncdump fails on s390x #2670)
- Test suite failures (Multiple test failures on Big Endian/s390x systems #2696, netcdf-c-4.7.4 testsuite fails on s390x (big-endian) #1896, nczarr_test_run_ut_mapapi fails on s390x (big-endian) #1987)
- byteswap8 static declaration conflict (4.7.4 fails to build on big endian architectures (error: static declaration of ‘byteswap8’ follows non-static declaration) #1687)
Fix: Add a big-endian CI workflow (#3282) and fix the ncx.m4 byte-swap code. Most of these are the same root cause — untested byte-swap paths. A CI + ncx.m4 fix would close all 10.
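One way to avoid untested byte-swap paths altogether is to decode with shifts, which gives the same result on little- and big-endian hosts and needs no conditional swap. The sketch below is not the ncx.m4 code; the function names are hypothetical, and `load_be_double` assumes the host `double` is IEEE-754 binary64 with the same byte order as its 64-bit integers.

```c
#include <stdint.h>
#include <string.h>

/* Decode a big-endian 64-bit integer from 8 bytes; host-order independent. */
uint64_t load_be64(const unsigned char *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Encode a 64-bit integer as 8 big-endian bytes; host-order independent. */
void store_be64(unsigned char *p, uint64_t v)
{
    for (int i = 7; i >= 0; i--) {
        p[i] = (unsigned char)(v & 0xFF);
        v >>= 8;
    }
}

/* Decode a big-endian double (assumes IEEE-754 binary64 on the host). */
double load_be_double(const unsigned char *p)
{
    uint64_t bits = load_be64(p);
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Runtime probe, useful for asserting in tests which path was exercised. */
int host_is_big_endian(void)
{
    const uint16_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 0;
}
```

Because the shift-based routines have a single code path, the same unit test covers both host byte orders; a big-endian CI job then checks the rest of the library rather than this helper.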
6. Fix VLEN/Compound Type Handling (~9 issues)
Categories: VLEN/Compound (9), overlaps with memory safety
VLEN types are a persistent source of crashes and data corruption:
- Crash reading VLEN with unlimited dim (Crash on reading NC_VLEN variable with unlimited dimension #2181)
- HDF error with VLEN + fill value + chunking (HDF error on reading back NC_VLEN variable with fill value and chunking #2212)
- charvlenbug test failure (Error with new "charvlenbug" test #2160)
- Compound types >= 64KB fail (Attempting to create a variable using a compound with size larger or equal than 2**16 bytes fails #2738)
- Nested compounds fail in-memory (Cannot create nested compound types when creating netCDF file in memory #1489)
- No documented fill value behavior (Question: default fillValue for NC_VLEN types? #2068, Best practices for missing values (aka _FillValues) within VLENs? #1011)
Fix: Audit the VLEN reclaim/allocation paths in libhdf5 and libdispatch. The crashes (#2181, #2496) and the charvlenbug (#2160) likely share a root cause in how VLEN memory is managed during read-back with unlimited dimensions.
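The kind of invariant such an audit would look for can be sketched in isolation: every VLEN slot is either `{0, NULL}` or owns exactly one heap block, so reclaiming is always safe, even for slots that were never filled. The struct below mirrors the public `nc_vlen_t` layout (a length plus a pointer), but `demo_vlen_t` and the three functions are hypothetical and are not the library's reclaim code.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in with the same shape as nc_vlen_t. */
typedef struct {
    size_t len;   /* number of elements */
    void  *p;     /* owned heap block, or NULL when len == 0 */
} demo_vlen_t;

/* Allocate n zeroed slots so that unread entries are safe to reclaim.
 * (Relies on all-bits-zero being a null pointer, true on mainstream hosts.) */
demo_vlen_t *demo_vlen_alloc(size_t n)
{
    return calloc(n, sizeof(demo_vlen_t));
}

/* Replace the contents of one slot with a private copy of data. */
int demo_vlen_set(demo_vlen_t *v, const double *data, size_t len)
{
    double *copy = NULL;
    if (v == NULL || (len > 0 && data == NULL)) return -1;
    if (len > 0) {
        copy = malloc(len * sizeof *copy);
        if (copy == NULL) return -1;
        memcpy(copy, data, len * sizeof *copy);
    }
    free(v->p);          /* release any previous block */
    v->p = copy;
    v->len = len;
    return 0;
}

/* Free every slot's block and reset it, so a second reclaim is harmless. */
void demo_vlen_reclaim(demo_vlen_t *v, size_t n)
{
    if (v == NULL) return;
    for (size_t i = 0; i < n; i++) {
        free(v[i].p);    /* free(NULL) is a no-op */
        v[i].p = NULL;
        v[i].len = 0;
    }
}
```

With an unlimited dimension the number of slots can change between write and read-back, which is one place where a slot left uninitialised would turn into a crash on reclaim; whether that is the actual root cause here is an open question.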
7. Harden Memory Safety in libhdf5 (~14 issues)
Categories: Memory/Crash/Segfault (14)
Multiple fuzzer-found and user-reported crashes:
- 4 closely related reports, three of them in get_attached_info alone (SEGV in get_attached_info netcdf/libhdf5/hdf5open.c #2664, heap-buffer-overflow in get_attached_info netcdf/libhdf5/hdf5open.c #2666, memcpy-param-overlap in get_attached_info netcdf/libhdf5/hdf5open.c #2667) and one in NC4_get_vars (heap-buffer-overflow in NC4_get_vars netcdf/libhdf5/hdf5var.c #2668): heap overflows, SEGV, memcpy overlap
- Memory leaks in hashmap (detected memory leaks in NC_hashmapnew netcdf/libdispatch/nchashmap.c #2665) and file open/close (Memory leak with opening/closing of files? #2626)
- Segfault on corrupt files (SegV with corrupt netCDF files #2436)
- Segfault on byte-range read (Segmentation violation during byte range reading of netcdf4 variable #3044)
- Thread-safety crash (Crash when opening same NC4 file from different threads BUT under global mutex #2496)
Fix: Add bounds checking and NULL guards in hdf5open.c:get_attached_info() and hdf5var.c:NC4_get_vars(). Fix the hashmap leak. This is ~4 functions that account for 8+ crash reports.
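The three failure classes reported here (NULL dereference, heap overflow, overlapping `memcpy`) can all be guarded by one small copy helper. This is a generic sketch, not a patch for `get_attached_info()` or `NC4_get_vars()`; `checked_copy` is a hypothetical name.

```c
#include <stddef.h>
#include <string.h>

/* Copy len bytes from src + src_off into dst.
 * Returns 0 on success, -1 if any argument is unsafe. */
int checked_copy(void *dst, size_t dst_size,
                 const void *src, size_t src_size,
                 size_t src_off, size_t len)
{
    /* NULL guards: corrupt files can leave pointers unset. */
    if (dst == NULL || src == NULL) return -1;

    /* Source bounds, written so that src_off + len cannot wrap around. */
    if (src_off > src_size || len > src_size - src_off) return -1;

    /* Destination bounds. */
    if (len > dst_size) return -1;

    /* memmove is defined for overlapping ranges, unlike memcpy. */
    memmove(dst, (const unsigned char *)src + src_off, len);
    return 0;
}
```

Routing copies whose lengths come from file metadata through a helper like this turns a fuzzer-triggered crash into an error return the caller can report.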
8. Fix Filter/Plugin Path & Discovery (~25 issues)
Categories: Filter/Plugin/Compression (25)
Plugin handling is broken in multiple ways:
- Plugin path from configure not auto-added (cause plugin path specified in configure step to be automatically added to the netcdf plugin path #3025)
- nccopy can't find filter plugins (filter plugin path with nccopy #3048)
- make install can't write to plugin dir (Make install cannot write to plugin directory #2381)
- Classic-only build doesn't handle filters/quantization (classic only build not dealing with filters/quantization well #3020)
- zstd can't be toggled on/off (zstd is an optional deps that cannot be turned on or off #2831)
- Bzip2/Bz2 CMake confusion (Bzip2/Bz2 confusion in Cmake build #2717)
- Filter tests run when HDF5 doesn't support them (filter tests should not run when HDF5 doesn't support them. #1517)
- MSVC filter test build failure (Filter Testing with MSVC fails to build. #1245)
Fix: Centralize plugin path resolution — one function that checks HDF5_PLUGIN_PATH, configure-time path, and install-time path in order. Fix the CMake find_package for optional compression libs. Would close ~12 issues.
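The "one function, fixed order" idea could look roughly like the sketch below. It assumes first-match semantics (the first non-empty candidate wins); a real implementation might instead concatenate all candidates into one search path. `first_nonempty` and `resolve_plugin_path` are hypothetical names, and the two path arguments stand in for values that the build system would bake in.

```c
#include <stddef.h>
#include <stdlib.h>

/* Return the first candidate that is neither NULL nor empty, else NULL. */
const char *first_nonempty(const char *const *candidates, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (candidates[i] != NULL && candidates[i][0] != '\0')
            return candidates[i];
    }
    return NULL;
}

/* Single point of truth for plugin lookup, in priority order:
 *   1. HDF5_PLUGIN_PATH from the environment
 *   2. the path given at configure time
 *   3. the path plugins were installed to */
const char *resolve_plugin_path(const char *configure_path,
                                const char *install_path)
{
    const char *candidates[3];
    candidates[0] = getenv("HDF5_PLUGIN_PATH");
    candidates[1] = configure_path;
    candidates[2] = install_path;
    return first_nonempty(candidates, 3);
}
```

If the library and tools such as nccopy both called the same resolver, they could not disagree about where plugins live.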
9. Thread Safety (~7 issues, high user impact)
Categories: Thread Safety (7)
Thread safety has been requested since 2017 (#382) and remains unfixed:
- HDF5 error stack not thread-safe (Thread safety and the HDF5 error stack in 4.9.3 #3193)
- Crash under global mutex (Crash when opening same NC4 file from different threads BUT under global mutex #2496)
- Test race conditions (parallel test race condition between ref_ctest/ref_ctest64 and ncdump_tst_output #3272, Race conditions sneaking in to concurrent testing #2564)
- Core request for thread-safe library (Thread safety Part 1 #1373, Thread Safe netcdf-c library #382)
Fix: Implement per-thread HDF5 error stack isolation and audit global state in libdispatch. Even partial thread safety (read-only concurrent access) would satisfy most users and close 5+ issues.
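Per-thread isolation of error state can be illustrated with C11 thread-local storage: each thread gets its own copy of the variables, so one thread's failure cannot overwrite what another thread is about to read. This is a generic sketch with hypothetical `demo_*` names; it does not use or replace the HDF5 error-stack API.

```c
#include <stdio.h>
#include <string.h>

#define DEMO_ERR_MAX 128

/* One copy of these per thread (C11 _Thread_local). */
static _Thread_local int  demo_err_code;
static _Thread_local char demo_err_msg[DEMO_ERR_MAX];

/* Record an error for the calling thread only. */
void demo_set_error(int code, const char *msg)
{
    demo_err_code = code;
    snprintf(demo_err_msg, sizeof demo_err_msg, "%s", msg ? msg : "");
}

/* Fetch the calling thread's last error; msg may be NULL. */
int demo_last_error(const char **msg)
{
    if (msg != NULL) *msg = demo_err_msg;
    return demo_err_code;
}

/* Reset the calling thread's error state. */
void demo_clear_error(void)
{
    demo_err_code = 0;
    demo_err_msg[0] = '\0';
}
```

Thread-local error state covers only reporting. Shared tables in the dispatch layer would still need a lock or a read-only guarantee, which is why read-only concurrent access is the more modest first target.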
10. Documentation Overhaul (~14 issues)
Categories: Documentation (14)
Recent audit found massive gaps:
- Missing doxygen in libdap4, libdap2, libsrc, liblib (much missing documentation in libsrcp #3274, documentation issues in liblib and examples #3275, missing documentation in libdap2 #3278, missing doxygen in libdap4 #3280)
- Doc typos in libhdf5/libhdf4 (doc typos and minor mistakes in libhdf5 and libhdf4 #3267, documentation typos and minor mistakes in netcdf-4 doxygen docs #3266)
- Auth docs out of date ([docs] auth.html is out of date and is not "officially" included in docs site #2952)
- Error code table missing (Re-add error code table in documentation. #2483)
- Website organization (Organisation of documentation website #2566)
- Broken links (Documentation -- Users Guide has some incorrect links #2358, Not working URL in README.md #1711)
Fix: A systematic doxygen pass through the public API headers + fixing the doc build system (#2581) would close all 14 in one effort.
Summary Table
| # | Fix | Issues Closed | Key Issue Numbers |
|---|---|---|---|
| 1 | Stride/vars performance rewrite | ~15 | #1381, #1757, #1877, #2721, #1947 |
| 2 | CMake modernization + Windows fixes | ~35 | #2713, #554, #1108, #3172, #2697 |
| 3 | NCZarr spec compliance & Xarray interop | ~25 | #2449, #3214, #3108, #2474, #2657 |
| 4 | DAP2/DAP4 auth + correctness fixes | ~24 | #3042, #1966, #3113, #3151, #2184 |
| 5 | Big-endian CI + ncx.m4 fix | ~10 | #3286, #3284, #3282, #2696, #1687 |
| 6 | VLEN/Compound type handling | ~9 | #2181, #2212, #2738, #2160, #1489 |
| 7 | Memory safety in libhdf5 | ~14 | #2664-2668, #2626, #2436, #3044 |
| 8 | Filter/plugin path & discovery | ~12 | #3025, #3048, #2381, #2831, #1245 |
| 9 | Thread safety (at least read-only) | ~7 | #382, #1373, #3193, #2496 |
| 10 | Documentation overhaul | ~14 | #3274, #3278, #2483, #2952, #2566 |
Total unique issues addressable: ~130-150 out of 285 (many issues span multiple categories, so the raw sum double-counts). The CMake/Windows cluster (#2) and NCZarr (#3) are the two highest-leverage targets by sheer volume.