The best way to distribute large scientific datasets is via the Cloud, in Cloud-Optimized formats [1]. But often this data is stuck in archival pre-Cloud file formats such as netCDF.
VirtualiZarr [2] makes it easy to create "Virtual" Zarr datacubes, allowing performant access to archival data as if it were in the Cloud-Optimized Zarr format, without duplicating any data.
Please see the documentation.
- Create virtual references pointing to bytes inside an archival file with `open_virtual_dataset`.
- Supports a range of archival file formats, including netCDF4 and HDF5, and has a pluggable system for supporting new formats.
- Access data via the zarr-python API by reading from the zarr-compatible `ManifestStore`.
- Combine data from multiple files into one larger datacube using xarray's combining functions, such as `xarray.concat`.
- Commit the virtual references to storage either using the Kerchunk references specification or the Icechunk transactional storage engine.
- Users access the virtual datacube simply as a single zarr-compatible store using `xarray.open_zarr`.
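To make the idea of a "virtual reference" concrete, here is a minimal, library-free sketch. It does not use VirtualiZarr's API: the names `ChunkRef` and `read_chunk`, and the stand-in archival file, are hypothetical and exist only for illustration. Each reference records where a chunk's bytes live (path, offset, length), so nothing is copied out of the original file.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRef:
    """Hypothetical virtual reference: where a chunk's bytes live in an archival file."""

    path: str
    offset: int
    length: int


def read_chunk(ref: ChunkRef) -> bytes:
    """Fetch only the referenced byte range, leaving the original file untouched."""
    with open(ref.path, "rb") as f:
        f.seek(ref.offset)
        return f.read(ref.length)


# Build a stand-in "archival file": a 4-byte header followed by two 8-byte chunks.
tmpdir = tempfile.mkdtemp()
archive_path = os.path.join(tmpdir, "archive.bin")
with open(archive_path, "wb") as f:
    f.write(b"HEAD" + b"A" * 8 + b"B" * 8)

# The manifest maps Zarr-style chunk keys to byte ranges inside that file.
manifest = {
    "0": ChunkRef(archive_path, offset=4, length=8),
    "1": ChunkRef(archive_path, offset=12, length=8),
}

# Resolve every reference by reading just its byte range.
chunks = {key: read_chunk(ref) for key, ref in manifest.items()}
```

A real reader would also need the array metadata (shape, dtype, codecs) to decode those bytes; this sketch covers only the byte-range lookup.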
VirtualiZarr grew out of discussions on the Kerchunk repository, and is an attempt to provide the game-changing power of Kerchunk in a Zarr-native way, with a familiar array-like API.
You now have a choice between using VirtualiZarr and Kerchunk: VirtualiZarr provides almost all the same features as Kerchunk.
VirtualiZarr version 1 (mostly) achieves feature parity with kerchunk's logic for combining datasets, providing an easier way to manipulate kerchunk references in memory and generate kerchunk reference files on disk.
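As a rough illustration of what a reference file on disk can look like, the sketch below writes chunk references as JSON. It assumes the general shape of the Kerchunk version 1 layout (a `version` field plus a `refs` mapping from keys to `[url, offset, length]`). The helper `to_reference_dict`, the array name, and the URL are hypothetical, and the Zarr metadata entries that a complete reference set carries are omitted.

```python
import json


def to_reference_dict(array_name, chunk_refs):
    """Lay out (url, offset, length) entries under 'refs', keyed by '<array>/<chunk key>'."""
    refs = {
        f"{array_name}/{chunk_key}": [url, offset, length]
        for chunk_key, (url, offset, length) in chunk_refs.items()
    }
    return {"version": 1, "refs": refs}


# Hypothetical byte ranges for two chunks of one array inside a single file.
chunk_refs = {
    "0.0": ("s3://example-bucket/air.nc", 4096, 1024),
    "0.1": ("s3://example-bucket/air.nc", 5120, 1024),
}

reference_dict = to_reference_dict("air", chunk_refs)
serialized = json.dumps(reference_dict, indent=2)
```

In practice you would let VirtualiZarr or Kerchunk generate such files rather than building them by hand.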
VirtualiZarr version 2 (unreleased) will bring:
- Zarr v3 support,
- A pluggable system of "parsers" for virtualizing custom file formats,
- The `ManifestStore` abstraction, which allows for loading data without serializing to Kerchunk/Icechunk first,
- Integration with `obstore`,
- Reference parsing that doesn't rely on kerchunk under the hood.
Future VirtualiZarr development will focus on generalizing and upstreaming useful concepts into the Zarr specification, the Zarr-Python library, Xarray, and possibly some new packages.
We have a lot of ideas, including:
- Zarr-native on-disk chunk manifest format
- "Virtual concatenation" of separate Zarr arrays
- ManifestArrays as an intermediate layer in-memory in Zarr-Python
- Separating CF-related Codecs from xarray
If you see other opportunities then we would love to hear your ideas!
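The "virtual concatenation" idea in the list above can be sketched without any library: joining two chunk manifests along an axis only requires renumbering chunk keys, not moving data. The functions and file names below are hypothetical, and the sketch assumes both arrays share a chunk shape and that the first array's length along the axis is a whole number of chunks.

```python
def parse_key(key):
    """Turn a chunk key such as '1.0' into an index tuple (1, 0)."""
    return tuple(int(part) for part in key.split("."))


def format_key(index):
    """Turn an index tuple (1, 0) back into a chunk key '1.0'."""
    return ".".join(str(i) for i in index)


def concat_manifests(first, second, axis=0):
    """Concatenate two manifests by shifting the second's chunk indices along `axis`."""
    # Number of chunks the first manifest spans along the concatenation axis.
    offset = max(parse_key(key)[axis] for key in first) + 1
    combined = dict(first)
    for key, ref in second.items():
        index = list(parse_key(key))
        index[axis] += offset
        combined[format_key(index)] = ref
    return combined


# Each entry is (path, byte offset, byte length); the values are made up.
jan = {"0.0": ("jan.nc", 100, 50), "1.0": ("jan.nc", 150, 50)}
feb = {"0.0": ("feb.nc", 100, 50)}

combined = concat_manifests(jan, feb, axis=0)
```

The byte ranges themselves are unchanged, so the combined manifest still points at the original files.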
- 2024/11/21 - MET Office Architecture Guild - Tom Nicholas - Slides
- 2024/11/13 - Cloud-Native Geospatial conference - Raphael Hagen - Slides
- 2024/07/24 - ESIP Meeting - Sean Harkins - Event / Recording
- 2024/05/15 - Pangeo showcase - Tom Nicholas - Event / Recording / Slides
This package was originally developed by Tom Nicholas whilst working at [C]Worthy, who deserve credit for allowing him to prioritise a generalizable open-source solution to the dataset virtualization problem. VirtualiZarr is now a community-owned multi-stakeholder project.
Apache 2.0
Footnotes
1. Cloud-Native Repositories for Big Scientific Data, Abernathey et al., Computing in Science & Engineering.
2. (Pronounced "Virtual-Eye-Zarr" - like "virtualizer" but more piratey 🦜)