
PoC Set variables async #11171

Open
BorisTheBrave wants to merge 10 commits into pydata:main from BorisTheBrave:set_variables_async

Conversation


@BorisTheBrave BorisTheBrave commented Feb 14, 2026

This is my proposal for resolving #10622.

As suggested here, the idea is:

  • Internally, we use zarr's async API
  • Change the sync methods of xarray to use zarr's sync() function, which opts xarray methods into zarr's built-in event loop

The advantage of using the async API is that it becomes easy to resolve parallelism issues such as #9455. I demonstrate this for Store.set_variables. It works for users without any changes on their part, and it can be applied incrementally to the codebase without poisoning everything with async methods. It doesn't require any new capabilities from zarr, or any significant structural changes.

zarr's synchronous methods are simply wrappers around the async ones anyway, so this just lifts the wrapper one level higher.
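To make that concrete, here is a minimal sketch of the pattern (not the actual PR code), assuming zarr-python 3's AsyncGroup.create_array and its internal zarr.core.sync.sync helper; the helper names below are mine:

import asyncio

from zarr.core.sync import sync  # zarr's bridge from sync callers onto its event loop

async def _async_set_variable(async_group, name, variable):
    # Hypothetical per-variable helper: create the zarr array (metadata only)
    # for one xarray variable using zarr's async API.
    return await async_group.create_array(name, shape=variable.shape, dtype=variable.dtype)

async def _async_set_variables(async_group, variables):
    # Issue all metadata writes concurrently so the store round-trips overlap
    # instead of running one after another.
    await asyncio.gather(
        *(_async_set_variable(async_group, name, var) for name, var in variables.items())
    )

def set_variables(async_group, variables):
    # The public entry point stays synchronous: it just drives the async
    # implementation on zarr's built-in event loop.
    return sync(_async_set_variables(async_group, variables))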

The specific change here only parallelizes writing the metadata of a dataset's variables. It doesn't help much for deeply nested trees, and it does not parallelize writing data; that still has to go through dask, which the user must opt into. One consequence is that I've re-ordered write operations: all metadata is now written first, then all data, rather than interleaving the two. This is causing a few tests to fail.

I've had to duplicate a lot of code into separate v3 and v2 branches, since xarray supports both major versions of zarr but only v3 exposes an async API.

Another downside is that, because I'm co-opting zarr's event loop, this approach only works in the zarr backend. That limits its scope compared with some other solutions, and makes it a bit more awkward to implement (the async code can't easily leak into the rest of the xarray code base).

This is just a PoC for discussion. A full implementation might:

  • Add tests
  • Change existing tests to understand the new order of data writing
  • Expose async versions of xarray's sync API
  • Extend the async area of the code to more things
  • Figure out async ArrayWriter
  • Think about limiting concurrency
  • Avoid duplicating code, either:
    • Drop support for v2
    • Create an async->sync store shim so the async code can be the only path (I've prepared def _zarr_async_group() for this); a rough sketch of the idea follows this list
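For the shim option, here is a rough illustration of the idea (class and method names are mine, not the PR's): wrap a synchronous zarr v2 group behind the same async surface the v3 path uses, so the async code can remain the only path:

import asyncio

class AsyncGroupShim:
    """Presents a synchronous (zarr v2) group through an async interface."""

    def __init__(self, sync_group):
        self._group = sync_group

    async def create_array(self, name, *, shape, dtype):
        # Run the blocking v2 call in a worker thread so other metadata writes
        # scheduled with asyncio.gather can still make progress concurrently.
        return await asyncio.to_thread(
            self._group.create_dataset, name, shape=shape, dtype=dtype
        )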

I did some quick performance benchmarking to confirm that this change has the desired effect:

# %%
import shutil
import time

import numpy as np
import xarray as xr
from zarr.storage import LocalStore
from zarr.testing.store import LatencyStore

# Set up a dataset with many variables
data_vars = {f'x{i}': (('a', 'b'), np.arange(6).reshape(2, 3)) for i in range(10)}
ds = xr.Dataset(data_vars)
ds = ds.chunk()  # Parallelize writing the data via dask

# Start from an empty store
shutil.rmtree('test.zarr', ignore_errors=True)

# Wrap the store in a latency store to make concurrency obvious
store = LocalStore('test.zarr')
store = LatencyStore(store, set_latency=1)

start = time.time()
ds.to_zarr(store, mode='w')
end = time.time()
print(f"Time taken: {end - start:.3f} seconds")

Before: Time taken: 24.215 seconds
After: Time taken: 6.131 seconds

github-actions bot added the topic-backends, topic-zarr, topic-typing, and io labels Feb 14, 2026
io = [
"netCDF4>=1.6.0",
"h5netcdf>=1.4.0",
"h5netcdf[h5py]>=1.4.0",

@BorisTheBrave (Author)

I've isolated these toml changes to a separate PR #11172

