
# Data Pipeline

**The problem:** HuggingFace `datasets` doesn't natively support NIfTI/BIDS neuroimaging formats. **The solution:** neuroimaging-go-brrrr extends `datasets` with a `Nifti()` feature type.


## What is neuroimaging-go-brrrr?

```
┌─────────────────────────────────────────────────────────────────────────────────┐
β”‚              neuroimaging-go-brrrr EXTENDS HUGGINGFACE DATASETS                 β”‚
├──────────────────────────────────────────────────────────────────────────────────
β”‚                                                                                 β”‚
β”‚   pip install datasets              pip install neuroimaging-go-brrrr           β”‚
β”‚   ────────────────────              ─────────────────────────────────           β”‚
β”‚   Standard HuggingFace              EXTENDS datasets with:                      β”‚
β”‚   β€’ Images, text, audio             β€’ Nifti() feature type for .nii.gz          β”‚
β”‚   β€’ Parquet/Arrow storage           β€’ BIDS directory parsing                    β”‚
β”‚   β€’ Hub integration                 β€’ Upload utilities (BIDS β†’ Hub)             β”‚
β”‚                                     β€’ Validation utilities                      β”‚
β”‚                                     β€’ Bug workarounds for upstream issues       β”‚
β”‚                                                                                 β”‚
β”‚   When you install neuroimaging-go-brrrr, you get:                              β”‚
β”‚   β€’ A patched datasets library with Nifti() support (pinned git commit)         β”‚
β”‚   β€’ bids_hub module for upload/validation                                       β”‚
β”‚   β€’ All upstream bug workarounds in one place                                   β”‚
β”‚                                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

**Key insight:** neuroimaging-go-brrrr pins `datasets` to a specific git commit that includes `Nifti()` support:

```toml
# From neuroimaging-go-brrrr/pyproject.toml
[tool.uv.sources]
datasets = { git = "https://github.com/huggingface/datasets.git", rev = "004a5bf4..." }
```

## The Two Pipelines

### Pipeline 1: UPLOAD (How Data Gets to HuggingFace)

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Local BIDS     β”‚     β”‚  neuroimaging-go-    β”‚     β”‚   HuggingFace Hub   β”‚
β”‚  Directory      β”‚ ──► β”‚  brrrr (bids_hub)    β”‚ ──► β”‚   hugging-science/  β”‚
β”‚  (Zenodo)       β”‚     β”‚                      β”‚     β”‚   isles24-stroke    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚  β€’ build_isles24_    β”‚     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                        β”‚    file_table()      β”‚
                        β”‚  β€’ Nifti() features  β”‚
                        β”‚  β€’ push_to_hub()     β”‚
                        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
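The file-table step in the diagram can be sketched roughly as follows. This is a simplified illustration only: it assumes a flat `sub-*/` layout with one file per modality, and the function and column names are hypothetical. The real implementation is `bids_hub`'s `build_isles24_file_table()`, which handles the actual ISLES'24 BIDS structure.

```python
# Rough sketch of building a per-subject file table from a BIDS-like tree.
# Hypothetical names; the canonical version is bids_hub.build_isles24_file_table().
from pathlib import Path


def build_file_table(bids_root):
    """Collect one row of modality file paths per sub-* directory."""
    rows = []
    for subject_dir in sorted(Path(bids_root).glob("sub-*")):
        rows.append({
            "subject_id": subject_dir.name,
            "dwi": str(subject_dir / "dwi.nii.gz"),
            "adc": str(subject_dir / "adc.nii.gz"),
            "lesion_mask": str(subject_dir / "lesion_mask.nii.gz"),
        })
    return rows
```

On upload, each path column in such a table is then declared as a `Nifti()` feature before `push_to_hub()`, which is what makes consumption return nibabel images later.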

### Pipeline 2: CONSUMPTION (How This Demo Loads Data)

**The correct pattern:**

```python
from datasets import load_dataset

# neuroimaging-go-brrrr provides the patched datasets with Nifti() support
ds = load_dataset("hugging-science/isles24-stroke", split="train")

# Access data - Nifti() returns nibabel.Nifti1Image objects
example = ds[0]
dwi = example["dwi"]                  # nibabel.Nifti1Image (NOT a numpy array)
adc = example["adc"]                  # nibabel.Nifti1Image
lesion_mask = example["lesion_mask"]  # nibabel.Nifti1Image

# To get a numpy array: dwi.get_fdata()
# To save to a file:    dwi.to_filename("dwi.nii.gz")
```

This is the intended consumption pattern. It should just work because:

  1. neuroimaging-go-brrrr provides the patched `datasets` with `Nifti()` support
  2. The dataset was uploaded with `Nifti()` features
  3. `Nifti(decode=True)` returns nibabel images with the affine and header preserved
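Why does a preserved affine matter? The affine maps voxel indices to scanner-space millimetre coordinates. A minimal sketch with made-up values (not ISLES'24 data):

```python
import numpy as np

# Illustrative 2 mm isotropic affine. In practice the values come from each
# image's NIfTI header, which Nifti(decode=True) keeps intact.
affine = np.array([
    [2.0, 0.0, 0.0,  -90.0],
    [0.0, 2.0, 0.0, -126.0],
    [0.0, 0.0, 2.0,  -72.0],
    [0.0, 0.0, 0.0,    1.0],
])


def voxel_to_world(affine, ijk):
    """Map a voxel index (i, j, k) to scanner-space mm coordinates."""
    return (affine @ np.append(ijk, 1.0))[:3]

# Storing only raw arrays would discard this mapping, breaking registration
# and any lesion-overlap metric computed in world space.
```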

## Current State: REFACTOR NEEDED

**Problem:** stroke-deepisles-demo currently carries a hand-rolled workaround in `data/adapter.py` that bypasses `datasets.load_dataset()`, using `HfFileSystem` and `pyarrow` directly to download individual Parquet files.

Why this is wrong:

  1. It duplicates bug workarounds that should live in neuroimaging-go-brrrr
  2. It doesn't use the `Nifti()` feature type as intended
  3. It is harder to maintain: fixes must happen in multiple places

The fix:

  1. Delete the custom `HuggingFaceDataset` adapter in `data/adapter.py`
  2. Use the standard `datasets.load_dataset()` consumption pattern
  3. If there are bugs, fix them in neuroimaging-go-brrrr, not locally
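After the fix, the loading code could collapse to a thin wrapper. The sketch below is hypothetical (the final function name and signature are up to the refactor) and assumes neuroimaging-go-brrrr's patched `datasets` is installed:

```python
# Hypothetical replacement for the hand-rolled adapter. The import is kept
# inside the function so modules that never touch data avoid the dependency.
def load_stroke_dataset(split="train"):
    """Load the ISLES'24 stroke dataset via the standard pattern."""
    from datasets import load_dataset  # patched build with Nifti() support
    return load_dataset("hugging-science/isles24-stroke", split=split)
```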

## Dependency Relationship

```
stroke-deepisles-demo (this repo)
        β”‚
        └── neuroimaging-go-brrrr @ v0.2.1
                β”‚
                β”œβ”€β”€ datasets @ git commit 004a5bf4... (patched with Nifti())
                β”œβ”€β”€ huggingface-hub
                └── bids_hub module (upload + validation utilities)
```

The consumption should flow through the standard pattern:

```
stroke-deepisles-demo
        β”‚
        β”‚ from datasets import load_dataset
        β”‚ ds = load_dataset("hugging-science/isles24-stroke")
        β–Ό
neuroimaging-go-brrrr (provides patched datasets)
        β”‚
        β”‚ Nifti() feature type handles lazy loading
        β–Ό
HuggingFace Hub (isles24-stroke dataset)
```

## Dataset Info

| Property   | Value                                                  |
| ---------- | ------------------------------------------------------ |
| Dataset ID | `hugging-science/isles24-stroke`                       |
| Subjects   | 149                                                    |
| Modalities | DWI, ADC, Lesion Mask, NCCT, CTA, CTP, Perfusion Maps  |
| Source     | Zenodo 17652035                                        |

## What `bids_hub` Provides

```
┌─────────────────────────────────────────────────────────────────────────────────┐
β”‚                    neuroimaging-go-brrrr (bids_hub)                             β”‚
├──────────────────────────────────────────────────────────────────────────────────
β”‚                                                                                 β”‚
β”‚   FOR UPLOADING:                       FOR CONSUMING:                           β”‚
β”‚   ──────────────                       ──────────────                           β”‚
β”‚   build_isles24_file_table()           Patched datasets with Nifti()            β”‚
β”‚   get_isles24_features()               └── Use standard load_dataset()          β”‚
β”‚   push_dataset_to_hub()                                                         β”‚
β”‚                                        validate_isles24_download()              β”‚
β”‚   We DON'T use these in this demo.     └── ISLES24_EXPECTED_COUNTS              β”‚
β”‚   Dataset already uploaded.            └── Can use for sanity checking          β”‚
β”‚                                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

## Related Documentation


## TODO: Refactor Data Loading

The current hand-rolled adapter in `data/adapter.py` should be replaced with standard `datasets.load_dataset()` consumption. This refactor should:

  1. Remove the `HuggingFaceDataset` class from `data/adapter.py`
  2. Update `data/loader.py` to use `datasets.load_dataset()`
  3. Remove the pre-computed constants in `data/constants.py` (no longer needed)
  4. Test that `Nifti()` lazy loading works correctly
  5. If bugs are found, report/fix them in neuroimaging-go-brrrr
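For step 4's testing, a cheap first check is that a loaded split matches the published subject count (149, per the table above). A minimal sketch; the canonical validator is `bids_hub`'s `validate_isles24_download()`, and `EXPECTED_SUBJECTS` here simply mirrors the doc's number:

```python
# Minimal sanity check, mirroring the idea behind ISLES24_EXPECTED_COUNTS.
# The value 149 comes from the Dataset Info table in this document.
EXPECTED_SUBJECTS = 149


def check_subject_count(n_examples, expected=EXPECTED_SUBJECTS):
    """Raise if a loaded split doesn't have the expected number of subjects."""
    if n_examples != expected:
        raise ValueError(f"expected {expected} subjects, got {n_examples}")
    return True

# Usage after loading, e.g.: check_subject_count(len(ds))
```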