# Data Pipeline
**The Problem:** HuggingFace `datasets` doesn't natively support NIfTI/BIDS neuroimaging formats.

**The Solution:** `neuroimaging-go-brrrr` extends `datasets` with a `Nifti()` feature type.
## What is neuroimaging-go-brrrr?
```
┌─────────────────────────────────────────────────────────────────────────────┐
│             neuroimaging-go-brrrr EXTENDS HUGGINGFACE DATASETS              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  pip install datasets                pip install neuroimaging-go-brrrr      │
│  ────────────────────                ─────────────────────────────────      │
│  Standard HuggingFace                EXTENDS datasets with:                 │
│  • Images, text, audio               • Nifti() feature type for .nii.gz     │
│  • Parquet/Arrow storage             • BIDS directory parsing               │
│  • Hub integration                   • Upload utilities (BIDS→Hub)          │
│                                      • Validation utilities                 │
│                                      • Bug workarounds for upstream issues  │
│                                                                             │
│  When you install neuroimaging-go-brrrr, you get:                           │
│  • A patched datasets library with Nifti() support (pinned git commit)      │
│  • bids_hub module for upload/validation                                    │
│  • All upstream bug workarounds in one place                                │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
**Key insight:** `neuroimaging-go-brrrr` pins to a specific commit of `datasets` that includes `Nifti()` support:

```toml
# From neuroimaging-go-brrrr/pyproject.toml
[tool.uv.sources]
datasets = { git = "https://github.com/huggingface/datasets.git", rev = "004a5bf4..." }
```
## The Two Pipelines
### Pipeline 1: UPLOAD (How Data Gets to HuggingFace)
```
┌───────────────────┐      ┌──────────────────────┐      ┌─────────────────────┐
│  Local BIDS       │      │  neuroimaging-go-    │      │  HuggingFace Hub    │
│  Directory        │ ───▶ │  brrrr (bids_hub)    │ ───▶ │  hugging-science/   │
│  (Zenodo)         │      │                      │      │  isles24-stroke     │
└───────────────────┘      │  • build_isles24_    │      └─────────────────────┘
                           │    file_table()      │
                           │  • Nifti() features  │
                           │  • push_to_hub()     │
                           └──────────────────────┘
```
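In outline, the upload side chains the three `bids_hub` helpers named in the diagram. The sketch below uses stub functions standing in for the real ones; the signatures, return shapes, and subject naming are all assumptions, not `bids_hub`'s actual API:

```python
# Sketch of the upload flow. The three helpers below are STUBS standing in
# for the bids_hub functions of the same names; signatures are assumed.
def build_isles24_file_table(bids_root):
    """Stub: the real helper scans a BIDS tree into a per-subject file table."""
    # Subject naming here is hypothetical.
    return [{"subject": "sub-0001", "dwi": f"{bids_root}/sub-0001/dwi.nii.gz"}]

def get_isles24_features():
    """Stub: the real helper returns a datasets.Features with Nifti() columns."""
    return {"dwi": "Nifti()"}

def push_dataset_to_hub(table, features, repo_id):
    """Stub: the real helper builds the dataset and pushes it to the Hub."""
    return f"pushed {len(table)} row(s) to {repo_id}"

result = push_dataset_to_hub(
    build_isles24_file_table("/data/isles24-bids"),
    get_isles24_features(),
    "hugging-science/isles24-stroke",
)
```

The point is the ordering: build the file table from the BIDS tree, attach the `Nifti()` feature schema, then push; the schema is what makes the Hub copy decodable back into nibabel images.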
### Pipeline 2: CONSUMPTION (How This Demo Loads Data)

**The correct pattern:**

```python
from datasets import load_dataset

# neuroimaging-go-brrrr provides the patched datasets build with Nifti() support
ds = load_dataset("hugging-science/isles24-stroke", split="train")

# Access data -- Nifti() returns nibabel.Nifti1Image objects
example = ds[0]
dwi = example["dwi"]                  # nibabel.Nifti1Image (NOT a numpy array)
adc = example["adc"]                  # nibabel.Nifti1Image
lesion_mask = example["lesion_mask"]  # nibabel.Nifti1Image

# To get a numpy array:  dwi.get_fdata()
# To save to a file:     dwi.to_filename("dwi.nii.gz")
```
This is the intended consumption pattern. It should just work because:

- `neuroimaging-go-brrrr` provides the patched `datasets` with `Nifti()` support
- The dataset was uploaded with `Nifti()` features
- `Nifti(decode=True)` returns nibabel images with the affine/header preserved
## Current State: REFACTOR NEEDED
**Problem:** `stroke-deepisles-demo` currently has a hand-rolled workaround in `data/adapter.py` that bypasses `datasets.load_dataset()`. This workaround uses `HfFileSystem` + `pyarrow` directly to download individual parquet files.
**Why this is wrong:**

- Duplicates bug workarounds that should live in `neuroimaging-go-brrrr`
- Doesn't use the `Nifti()` feature type properly
- Harder to maintain: fixes need to happen in multiple places
**The fix:**

- Delete the custom `HuggingFaceDataset` adapter in `data/adapter.py`
- Use the standard `datasets.load_dataset()` consumption pattern
- If there are bugs, fix them in `neuroimaging-go-brrrr`, not locally
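After the fix, the loader shrinks to a thin wrapper. A minimal sketch (the function name and signature are illustrative, not the repo's actual API):

```python
# Hypothetical replacement for the hand-rolled adapter; name and signature
# are illustrative only.
def load_isles24(split: str = "train"):
    """Load ISLES24 through the standard datasets pattern, no custom adapter."""
    # Import inside the function so the patched build from
    # neuroimaging-go-brrrr is only required when data is actually loaded.
    from datasets import load_dataset
    return load_dataset("hugging-science/isles24-stroke", split=split)
```

Any remaining decode bugs would then surface inside `datasets` itself, where they can be fixed once in `neuroimaging-go-brrrr` instead of being patched around locally.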
## Dependency Relationship
```
stroke-deepisles-demo (this repo)
  │
  └── neuroimaging-go-brrrr @ v0.2.1
        │
        ├── datasets @ git commit 004a5bf4... (patched with Nifti())
        ├── huggingface-hub
        └── bids_hub module (upload + validation utilities)
```
The consumption should flow through the standard pattern:
```
stroke-deepisles-demo
        │
        │  from datasets import load_dataset
        │  ds = load_dataset("hugging-science/isles24-stroke")
        ▼
neuroimaging-go-brrrr (provides patched datasets)
        │
        │  Nifti() feature type handles lazy loading
        ▼
HuggingFace Hub (isles24-stroke dataset)
```
## Dataset Info
| Property | Value |
|---|---|
| Dataset ID | `hugging-science/isles24-stroke` |
| Subjects | 149 |
| Modalities | DWI, ADC, Lesion Mask, NCCT, CTA, CTP, Perfusion Maps |
| Source | Zenodo 17652035 |
## What bids_hub Provides
```
┌─────────────────────────────────────────────────────────────────────────────┐
│                      neuroimaging-go-brrrr (bids_hub)                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  FOR UPLOADING:                     FOR CONSUMING:                          │
│  ──────────────                     ──────────────                          │
│  build_isles24_file_table()         Patched datasets with Nifti()           │
│  get_isles24_features()               └─ Use standard load_dataset()        │
│  push_dataset_to_hub()                                                      │
│  validate_isles24_download()        ISLES24_EXPECTED_COUNTS                 │
│                                       └─ Can use for sanity checking        │
│                                                                             │
│  We DON'T use these in this demo.                                           │
│  Dataset already uploaded.                                                  │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
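`ISLES24_EXPECTED_COUNTS` can back a small post-download sanity check. A sketch of what that check might look like (only the constant's name comes from `bids_hub`; the helper below and the modality-to-count dict shape are assumptions):

```python
def check_counts(actual: dict, expected: dict) -> list:
    """Return mismatch messages; an empty list means the counts look sane."""
    return [
        f"{key}: expected {expected[key]}, got {actual.get(key, 0)}"
        for key in expected
        if actual.get(key, 0) != expected[key]
    ]

# Assumed shape: modality name -> number of subjects with that file present.
problems = check_counts({"dwi": 149, "adc": 148}, {"dwi": 149, "adc": 149})
```

Running the check right after `load_dataset()` catches truncated or partial downloads before any model code touches the data.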
## Related Documentation
## TODO: Refactor Data Loading
The current hand-rolled adapter in `data/adapter.py` should be replaced with standard `datasets.load_dataset()` consumption. This refactor should:

- Remove the `HuggingFaceDataset` class from `data/adapter.py`
- Update `data/loader.py` to use `datasets.load_dataset()`
- Remove pre-computed constants in `data/constants.py` (no longer needed)
- Test that `Nifti()` lazy loading works correctly
- If bugs are found, report/fix them in `neuroimaging-go-brrrr`