Support sharding through config and raster_write_kwargs#1106

Open
melonora wants to merge 30 commits into scverse:main from melonora:support_sharding

@melonora melonora commented Apr 14, 2026

This PR adds the following:

  • Passing keyword arguments for zarr.create_array directly as raster_write_kwargs to io functions such as .write and .write_element. This also makes it possible to write sharded arrays. Support for anndata sharding will be added in a follow-up PR.
  • Proper docstrings for the new raster_write_kwargs argument.
  • Extending the current config to include raster_chunks and raster_shards. The config can now be stored in a default location or a custom location. Additionally, environment variables can be set to temporarily override the stored values.
  • Adding zarrs as a dependency and enabling the codec by default to allow for faster io when writing shards. Whether we should do this or make it opt-in for advanced users is a discussion point.
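
To make the intended override order concrete, here is a minimal sketch of how defaults, a stored config file, and environment variables could be layered. It is not the code in src/spatialdata/config.py. The JSON file format, the `SPATIALDATA_` prefix, and the helper names `load_settings` and `_parse_shape` are assumptions made for illustration; only the keys raster_chunks and raster_shards come from this PR.

```python
import json
import os
from pathlib import Path

# Hypothetical names: the real file layout and environment variable
# names used by spatialdata are not shown in this conversation.
DEFAULTS = {"raster_chunks": None, "raster_shards": None}
ENV_PREFIX = "SPATIALDATA_"  # assumed prefix, for illustration only


def _parse_shape(text):
    """Turn a string such as "256,256" into the tuple (256, 256)."""
    return tuple(int(part) for part in text.split(","))


def load_settings(config_path=None, environ=None):
    """Resolve settings as: defaults < config file < environment variables."""
    environ = os.environ if environ is None else environ
    settings = dict(DEFAULTS)

    # 1. Values persisted in a JSON config file, if that file exists.
    if config_path is not None and Path(config_path).is_file():
        stored = json.loads(Path(config_path).read_text())
        for key in DEFAULTS:
            if stored.get(key) is not None:
                settings[key] = tuple(stored[key])

    # 2. Environment variables override the file, but only for this process.
    for key in DEFAULTS:
        raw = environ.get(ENV_PREFIX + key.upper())
        if raw:
            settings[key] = _parse_shape(raw)

    return settings
```

With this layering, unsetting the environment variable restores the stored value without touching the file, which is what "temporarily override" suggests.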

Additional changes

  • The minimum supported version of dask is 2026.3.0. Only this version provides the API in a way that avoids the risk of zarr format 2 being written into a zarr v3 group and vice versa. It also includes the setting that prevents the collapse of dask dataframe partitions after reading from parquet.
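
For anyone who wants to check their environment before testing, a small guard along these lines should do. This is a sketch, not something this PR adds: `parse_version`, `meets_minimum`, and `installed_dask_meets_minimum` are hypothetical helpers, and the parser only handles purely numeric versions (no pre-release suffixes).

```python
from importlib.metadata import PackageNotFoundError, version

MIN_DASK = "2026.3.0"  # minimum stated in this PR


def parse_version(text):
    """Parse a purely numeric version such as "2026.3.0" into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(installed, minimum=MIN_DASK):
    """Compare numerically, so that "2026.10.0" counts as newer than "2026.3.0"."""
    return parse_version(installed) >= parse_version(minimum)


def installed_dask_meets_minimum():
    """Return True/False for the installed dask, or None if dask is not installed."""
    try:
        return meets_minimum(version("dask"))
    except PackageNotFoundError:
        return None
```

Comparing integer tuples rather than raw strings matters for calendar versions, since a plain string comparison would sort "2026.10.0" before "2026.3.0".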

@LucaMarconato


melonora commented Apr 14, 2026

Failing atm because the required ome-zarr version has not been released yet. You can test locally with ome-zarr-py from main.

Also, need to add support for zarrs to improve speed of shard io
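
For context on why shard io matters: in zarr v3 a shard is a single stored object that holds several inner chunks, and the shard shape has to be a whole multiple of the chunk shape along every axis. The sketch below only does that arithmetic. `shard_layout` is a hypothetical helper, not part of spatialdata or zarr, and the shapes in the usage note are made-up examples.

```python
import math


def shard_layout(shape, chunks, shards):
    """Return (chunks_per_shard, n_shards) for an array written with sharding."""
    if not (len(shape) == len(chunks) == len(shards)):
        raise ValueError("shape, chunks and shards must have the same number of dimensions")

    # Each shard must contain a whole number of inner chunks along every axis.
    for chunk, shard in zip(chunks, shards):
        if shard % chunk != 0:
            raise ValueError(f"shard size {shard} is not a multiple of chunk size {chunk}")

    chunks_per_shard = math.prod(shard // chunk for chunk, shard in zip(chunks, shards))
    # Partial shards at the array edge still count as one stored object each.
    n_shards = math.prod(math.ceil(size / shard) for size, shard in zip(shape, shards))
    return chunks_per_shard, n_shards
```

For a (3, 4096, 4096) image with chunks of (1, 256, 256) and shards of (1, 1024, 1024), this gives 16 chunks per shard and 48 stored objects, compared with 768 objects if every chunk were stored separately. Assuming the kwargs are forwarded unchanged to zarr.create_array, such a layout would be requested with something like raster_write_kwargs={"chunks": (1, 256, 256), "shards": (1, 1024, 1024)}.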


codecov Bot commented Apr 14, 2026

Codecov Report

❌ Patch coverage is 90.26549% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.95%. Comparing base (cf91ad5) to head (03aafc9).

Files with missing lines            Patch %    Lines
src/spatialdata/config.py           87.50%     9 Missing ⚠️
src/spatialdata/_core/_utils.py     89.47%     2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1106      +/-   ##
==========================================
+ Coverage   91.93%   91.95%   +0.02%     
==========================================
  Files          51       51              
  Lines        7772     7881     +109     
==========================================
+ Hits         7145     7247     +102     
- Misses        627      634       +7     
Files with missing lines                Coverage Δ
src/spatialdata/__init__.py             96.00% <100.00%> (+0.34%) ⬆️
src/spatialdata/_core/spatialdata.py    92.01% <100.00%> (+0.08%) ⬆️
src/spatialdata/_io/io_raster.py        94.59% <100.00%> (+2.50%) ⬆️
src/spatialdata/_core/_utils.py         97.01% <89.47%> (-2.99%) ⬇️
src/spatialdata/config.py               88.60% <87.50%> (-11.40%) ⬇️

Only these versions are supported because they use the zarr API correctly inside dask and also
make it possible to set the tune optimization. The latter is required to prevent errors caused by dask partitions collapsing
when data is read back in from parquet.

Mr-Milk commented Apr 15, 2026

Should we also allow the control of sharding for anndata?

@melonora

Yes, but not as part of this PR. I will adjust the config though to accommodate.

@melonora melonora marked this pull request as ready for review April 27, 2026 14:33
@melonora melonora requested a review from LucaMarconato April 27, 2026 14:45