Load scientific data
Scientific datasets are often stored in formats like HDF5 or Zarr (chunked, N-dimensional arrays). 🤗 Datasets can stream these formats directly from local files (and remote filesystems supported by fsspec) without converting the full dataset to Arrow first.
HDF5
>>> from datasets import load_dataset
>>> ds = load_dataset("hdf5", data_files=["/path/to/data.h5"], split="train", streaming=True)
>>> print(next(iter(ds)))
{'temperature': 280.12, 'pressure': 101325.0, ...}
Zarr
Zarr support is currently experimental. Install it with pip install "datasets[zarr]" (or pip install zarr).
Zarr stores are directory-based. You can point data_files to either the Zarr store root directory (recommended for convenience) or the Zarr root metadata file:
- Zarr store root directory: .../store.zarr (auto-detects metadata)
- Zarr v3: .../store.zarr/zarr.json
- Zarr v2 (consolidated): .../store.zarr/.zmetadata
For Zarr v2 non-consolidated stores, pass the store root (.../store.zarr) with consolidated=False.
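To make the accepted path forms concrete, here is a minimal sketch that labels a data_files entry by its final path component. The helper name classify_zarr_path and its return labels are hypothetical and are not part of 🤗 Datasets; the library's own detection logic may differ.

```python
from pathlib import PurePosixPath


def classify_zarr_path(path: str) -> str:
    """Label a data_files entry by the Zarr layout it points to.

    Hypothetical helper for illustration only; not a 🤗 Datasets API.
    """
    name = PurePosixPath(path).name
    if name == "zarr.json":
        # Zarr v3 root metadata file.
        return "v3-metadata"
    if name == ".zmetadata":
        # Zarr v2 consolidated metadata file.
        return "v2-consolidated-metadata"
    # Anything else is treated as the store root directory.
    return "store-root"
```

Under these assumptions, classify_zarr_path("/path/to/store.zarr/zarr.json") returns "v3-metadata", while classify_zarr_path("/path/to/store.zarr") returns "store-root".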
>>> from datasets import load_dataset
>>> ds = load_dataset("zarr", data_files=["/path/to/store.zarr"], split="train", streaming=True)
>>> print(next(iter(ds)))
{'int32': 0, 'float32': 0.0, 'matrix_2d': [[...], ...]}
In streaming mode, Zarr is sharded by row ranges aligned to axis-0 chunk boundaries (instead of one shard per metadata file). You can control this with:
- rows_per_shard: target number of rows per shard (rounded up to chunk boundaries)
- target_num_shards: target number of shards across each input store
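The rounding of rows_per_shard to chunk boundaries can be pictured with a small standalone sketch. It is not the library's sharding code: the function name shard_row_ranges and the chunk_rows parameter (the axis-0 chunk length) are assumptions made for illustration.

```python
from typing import List, Tuple


def shard_row_ranges(num_rows: int, chunk_rows: int, rows_per_shard: int) -> List[Tuple[int, int]]:
    """Split [0, num_rows) into half-open row ranges aligned to chunk boundaries.

    Illustrative only; names and behavior are assumptions, not 🤗 Datasets internals.
    """
    if num_rows < 0 or chunk_rows <= 0 or rows_per_shard <= 0:
        raise ValueError("num_rows must be >= 0; chunk_rows and rows_per_shard must be > 0")
    # Round the target up to the next multiple of the axis-0 chunk length.
    chunks_per_shard = (rows_per_shard + chunk_rows - 1) // chunk_rows
    aligned_rows = chunks_per_shard * chunk_rows
    # Every shard starts on a chunk boundary; the last one may be shorter.
    return [
        (start, min(start + aligned_rows, num_rows))
        for start in range(0, num_rows, aligned_rows)
    ]
```

For example, with 1000 rows, an axis-0 chunk length of 128, and rows_per_shard=300, the target is rounded up to 384 rows, giving the ranges (0, 384), (384, 768), and (768, 1000).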
If a root group contains arrays with different axis-0 lengths (for example coordinate arrays and time-varying arrays together), 🤗 Datasets automatically infers the primary row dimension (preferring N-D data arrays) and ignores non-row-aligned arrays. You can also target a specific subgroup with group=... for more control.
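One plausible way to think about this inference is sketched below. The heuristic (a vote on the axis-0 length among arrays with more than one dimension) and the name select_row_aligned are assumptions for illustration; the actual rules used by 🤗 Datasets may differ.

```python
from collections import Counter
from typing import Dict, List, Tuple


def select_row_aligned(shapes: Dict[str, Tuple[int, ...]]) -> Tuple[int, List[str]]:
    """Pick a primary row count and the arrays that share it on axis 0.

    Hypothetical heuristic for illustration; not the 🤗 Datasets implementation.
    """
    # Scalars (empty shapes) have no axis 0 and cannot be row-aligned.
    candidates = {name: shape for name, shape in shapes.items() if len(shape) >= 1}
    if not candidates:
        raise ValueError("no arrays with at least one dimension")
    # Prefer N-D data arrays when deciding; fall back to 1-D arrays otherwise.
    n_dim = {name: shape for name, shape in candidates.items() if len(shape) > 1}
    voters = n_dim or candidates
    num_rows = Counter(shape[0] for shape in voters.values()).most_common(1)[0][0]
    # Keep every array whose axis-0 length matches; the rest are ignored.
    kept = [name for name, shape in candidates.items() if shape[0] == num_rows]
    return num_rows, kept
```

Given a group with time of shape (365,), lat of shape (180,), lon of shape (360,), and temperature of shape (365, 180, 360), this sketch selects 365 rows and keeps time and temperature, dropping the lat and lon coordinate arrays.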
You can also load from the Hub via the hf:// protocol:
>>> from datasets import DownloadConfig, load_dataset
>>> download_config = DownloadConfig(storage_options={"hf": {"token": None}}) # set token for private/gated repos
>>> ds = load_dataset(
... "zarr",
... data_files=["hf://datasets//@main/path/to/store.zarr"],
... split="train",
... streaming=True,
... download_config=download_config,
... )