	# Load scientific data

	Scientific datasets are often stored in formats like HDF5 or Zarr (chunked, N-dimensional arrays). 🤗 Datasets can stream these formats directly from local files (and remote filesystems supported by `fsspec`) without converting the full dataset to Arrow first.

	## HDF5

	```py
	>>> from datasets import load_dataset
	>>> ds = load_dataset("hdf5", data_files=["/path/to/data.h5"], split="train", streaming=True)
	>>> print(next(iter(ds)))
	{'temperature': 280.12, 'pressure': 101325.0, ...}
	```

	## Zarr

	Zarr support is currently experimental. Install it with `pip install "datasets[zarr]"` (or `pip install zarr`).

	Zarr stores are directory-based. You can point `data_files` to either the Zarr store root directory (recommended for convenience) or the Zarr root metadata file:

	- Zarr store root directory: `.../store.zarr` (auto-detects metadata)
	- Zarr v3: `.../store.zarr/zarr.json`
	- Zarr v2 (consolidated): `.../store.zarr/.zmetadata`

	For Zarr v2 non-consolidated stores, pass the store root (`.../store.zarr`) with `consolidated=False`.
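The accepted forms of a `data_files` entry listed above differ only in their final path component. As a minimal sketch (the helper `describe_zarr_entry` is hypothetical and not part of 🤗 Datasets), they can be told apart like this:

```python
def describe_zarr_entry(path: str) -> str:
    """Label a `data_files` entry by the form it takes.

    Hypothetical helper for illustration only; the library performs its own detection.
    """
    # Take the last path component, ignoring a trailing slash on directories.
    name = path.rstrip("/").rsplit("/", 1)[-1]
    if name == "zarr.json":
        return "zarr v3 metadata file"
    if name == ".zmetadata":
        return "zarr v2 consolidated metadata file"
    # Anything else is treated as the store root directory.
    return "store root directory"


for entry in ("/path/to/store.zarr", "/path/to/store.zarr/zarr.json", "/path/to/store.zarr/.zmetadata"):
    print(entry, "->", describe_zarr_entry(entry))
```

A non-consolidated Zarr v2 store has no `.zmetadata` file, which is why only the store root form applies to it.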

	```py
	>>> from datasets import load_dataset
	>>> ds = load_dataset("zarr", data_files=["/path/to/store.zarr"], split="train", streaming=True)
	>>> print(next(iter(ds)))
	{'int32': 0, 'float32': 0.0, 'matrix_2d': [[...], ...]}
	```

	In streaming mode, each Zarr store is split into shards by row range, with shard boundaries aligned to axis-0 chunk boundaries (instead of one shard per metadata file). You can control this with:

	- `rows_per_shard`: target number of rows per shard (rounded up to chunk boundaries)
	- `target_num_shards`: target number of shards per input store
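To make the chunk alignment concrete, here is a minimal sketch of how row ranges could be derived from these two options. It is an illustration under stated assumptions, not the library's implementation: the function `shard_row_ranges` is hypothetical, and giving `rows_per_shard` precedence over `target_num_shards` is an assumption.

```python
import math


def shard_row_ranges(num_rows, chunk_rows, rows_per_shard=None, target_num_shards=None):
    """Split [0, num_rows) into (start, stop) ranges that begin on axis-0 chunk edges.

    Hypothetical helper: `chunk_rows` is the axis-0 chunk length of the store.
    """
    if num_rows <= 0 or chunk_rows <= 0:
        raise ValueError("num_rows and chunk_rows must be positive")
    num_chunks = math.ceil(num_rows / chunk_rows)
    if rows_per_shard is not None:
        # Round the requested shard size up to a whole number of chunks.
        chunks_per_shard = max(1, math.ceil(rows_per_shard / chunk_rows))
    elif target_num_shards is not None:
        if target_num_shards <= 0:
            raise ValueError("target_num_shards must be positive")
        # Spread the chunks as evenly as possible over the requested shard count.
        chunks_per_shard = max(1, math.ceil(num_chunks / target_num_shards))
    else:
        # Assumption: with no option set, the whole store is a single shard.
        chunks_per_shard = num_chunks
    step = chunks_per_shard * chunk_rows
    # The last shard is clipped to the end of the array.
    return [(start, min(start + step, num_rows)) for start in range(0, num_rows, step)]


print(shard_row_ranges(1000, 100, rows_per_shard=250))
# [(0, 300), (300, 600), (600, 900), (900, 1000)]
```

With 100-row chunks, a request for 250 rows per shard is rounded up to 300 rows, so no shard starts in the middle of a chunk.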

	If a root group contains arrays with different axis-0 lengths (for example, coordinate arrays alongside time-varying arrays), 🤗 Datasets automatically infers the primary row dimension (preferring N-D data arrays) and ignores arrays that are not aligned to it. For more control, target a specific subgroup with `group=...`.
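The following sketch shows one plausible heuristic for this kind of inference. It is not the library's actual logic: the function `split_row_aligned` is hypothetical, and reading "N-D data arrays" as arrays with two or more dimensions is an assumption.

```python
from collections import Counter


def split_row_aligned(shapes):
    """Pick a primary axis-0 length and split array names into kept and ignored.

    `shapes` maps array name -> shape tuple. Hypothetical helper for illustration.
    """
    # Scalars (empty shape) have no axis 0 and can never be row-aligned.
    arrays = {name: shape for name, shape in shapes.items() if len(shape) >= 1}
    if not arrays:
        raise ValueError("no arrays with a row dimension")
    # Assumption: prefer arrays with 2+ dimensions when choosing the row length.
    multi_dim = {name: shape for name, shape in arrays.items() if len(shape) >= 2}
    candidates = multi_dim or arrays
    # Use the most common axis-0 length among the candidates.
    num_rows, _ = Counter(shape[0] for shape in candidates.values()).most_common(1)[0]
    kept = sorted(name for name, shape in arrays.items() if shape[0] == num_rows)
    ignored = sorted(name for name in shapes if name not in kept)
    return num_rows, kept, ignored


shapes = {
    "time": (365,),
    "lat": (180,),
    "lon": (360,),
    "temperature": (365, 180, 360),
    "pressure": (365, 180, 360),
}
print(split_row_aligned(shapes))
# (365, ['pressure', 'temperature', 'time'], ['lat', 'lon'])
```

Here the two 3-D arrays determine the row count of 365, `time` is kept because it shares that length, and the `lat` and `lon` coordinate arrays are ignored.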

	You can also load from the Hub via the `hf://` protocol:

	```py
	>>> from datasets import DownloadConfig, load_dataset
	>>> download_config = DownloadConfig(storage_options={"hf": {"token": None}}) # set token for private/gated repos
	>>> ds = load_dataset(
	...     "zarr",
	...     data_files=["hf://datasets//@main/path/to/store.zarr"],
	...     split="train",
	...     streaming=True,
	...     download_config=download_config,
	... )
	```
