stroke-viewer-frontend/docs/specs/02-phase-1-data-access.md

phase 1: data access layer

purpose

Implement a data loading layer that provides typed access to ISLES24 neuroimaging cases. This phase is split into sub-phases due to a critical discovery: the upstream dataset is not properly formatted for HuggingFace consumption.

critical discovery (2025-12-04)

YongchengYAO/ISLES24-MR-Lite is NOT a proper HuggingFace Dataset.

| What we expected | What actually exists |
| --- | --- |
| load_dataset() returns Dataset with columns | load_dataset() FAILS with "no data" |
| Columns: dwi, adc, mask, participant_id | No columns - just raw ZIP files |
| Parquet/Arrow format | Three ZIP archives dumped on HF |

Evidence: data/discovery/isles24_schema_report.txt

This means the demo must be built in phases:

  1. Phase 1A: Local file loader (works NOW with extracted data)
  2. Phase 1B: Test Tobias's Nifti() feature on local files (proves loading works)
  3. Phase 1C: Upload properly to HuggingFace (future - proves production pipeline)
  4. Phase 1D: Consume via Tobias's fork (future - proves full round-trip)

phase 1a: local file loader (CURRENT PRIORITY)

data location

data/isles24/                       # Git-ignored
β”œβ”€β”€ Images-DWI/                     # 149 files
β”‚   └── sub-stroke{XXXX}_ses-02_dwi.nii.gz
β”œβ”€β”€ Images-ADC/                     # 149 files
β”‚   └── sub-stroke{XXXX}_ses-02_adc.nii.gz
└── Masks/                          # 149 files
    └── sub-stroke{XXXX}_ses-02_lesion-msk.nii.gz

file naming convention (BIDS-like)

| Component | Pattern | Example |
| --- | --- | --- |
| Subject ID | sub-stroke{XXXX} | sub-stroke0005 |
| Session | ses-02 | Always "02" in this dataset |
| Modality | dwi, adc, lesion-msk | - |
| Extension | .nii.gz | Compressed NIfTI |

Subject ID regex: sub-stroke(\d{4})_ses-02_.*\.nii\.gz

Note: Subject IDs have gaps (e.g., 0018 missing). Range is 0001-0189, total 149 cases.
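The convention can be sanity-checked with a quick sketch using the pattern above (standard library only):

```python
import re

# Pattern from the convention above; group 1 captures the 4-digit subject number.
SUBJECT_RE = re.compile(r"sub-stroke(\d{4})_ses-02_.*\.nii\.gz")

for name in (
    "sub-stroke0005_ses-02_dwi.nii.gz",         # valid DWI
    "sub-stroke0005_ses-02_lesion-msk.nii.gz",  # valid mask
    "sub-stroke5_ses-02_dwi.nii.gz",            # too few digits -> no match
):
    m = SUBJECT_RE.fullmatch(name)
    print(name, "->", m.group(1) if m else None)
```

Note that fullmatch (rather than search) is used so a stray prefix or suffix in a filename cannot slip through.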

deliverables

  • src/stroke_deepisles_demo/data/loader.py - Rewrite with local mode
  • src/stroke_deepisles_demo/data/adapter.py - Rewrite for file-based access
  • src/stroke_deepisles_demo/data/staging.py - Already correct, no changes
  • Unit tests with synthetic fixtures
  • Integration test with actual extracted data

interfaces

data/loader.py

"""Load ISLES24 data from local directory or HuggingFace Hub."""

from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from stroke_deepisles_demo.data.adapter import LocalDataset


@dataclass
class DatasetInfo:
    """Metadata about the dataset."""

    source: str  # "local" or HF dataset ID
    num_cases: int
    modalities: list[str]
    has_ground_truth: bool


def load_isles_dataset(
    source: str | Path = "data/isles24",
    *,
    local_mode: bool = True,  # Default to local for now
) -> LocalDataset:
    """
    Load ISLES24 dataset.

    Args:
        source: Local directory path or HuggingFace dataset ID
        local_mode: If True, treat source as local directory

    Returns:
        Dataset-like object providing case access

    Raises:
        DataLoadError: If data cannot be loaded
    """
    if local_mode or isinstance(source, Path):
        return _load_from_local_directory(Path(source))
    # Future: return _load_from_huggingface(source)
    raise NotImplementedError("HuggingFace mode not yet implemented")


def _load_from_local_directory(data_dir: Path) -> LocalDataset:
    """
    Load cases from extracted local files.

    Expects structure:
        data_dir/
        β”œβ”€β”€ Images-DWI/sub-stroke{XXXX}_ses-02_dwi.nii.gz
        β”œβ”€β”€ Images-ADC/sub-stroke{XXXX}_ses-02_adc.nii.gz
        └── Masks/sub-stroke{XXXX}_ses-02_lesion-msk.nii.gz
    """
    ...

data/adapter.py

"""Provide typed access to ISLES24 cases."""

from __future__ import annotations

import re
from dataclasses import dataclass
from pathlib import Path
from collections.abc import Iterator

from stroke_deepisles_demo.core.types import CaseFiles


@dataclass
class LocalDataset:
    """File-based dataset for local ISLES24 data."""

    data_dir: Path
    cases: dict[str, CaseFiles]  # subject_id -> files

    def __len__(self) -> int:
        return len(self.cases)

    def __iter__(self) -> Iterator[str]:
        return iter(self.cases.keys())

    def list_case_ids(self) -> list[str]:
        """Return sorted list of subject IDs."""
        return sorted(self.cases.keys())

    def get_case(self, case_id: str | int) -> CaseFiles:
        """Get files for a case by ID or index."""
        if isinstance(case_id, int):
            case_id = self.list_case_ids()[case_id]
        return self.cases[case_id]


# Subject ID extraction
SUBJECT_PATTERN = re.compile(r"sub-(stroke\d{4})_ses-\d+_.*\.nii\.gz")


def parse_subject_id(filename: str) -> str | None:
    """Extract subject ID from BIDS filename."""
    match = SUBJECT_PATTERN.match(filename)
    return f"sub-{match.group(1)}" if match else None


def build_local_dataset(data_dir: Path) -> LocalDataset:
    """
    Scan directory and build case mapping.

    Matches DWI + ADC + Mask files by subject ID.
    """
    dwi_dir = data_dir / "Images-DWI"
    adc_dir = data_dir / "Images-ADC"
    mask_dir = data_dir / "Masks"

    cases: dict[str, CaseFiles] = {}

    # Scan DWI files to get subject IDs
    for dwi_file in dwi_dir.glob("*.nii.gz"):
        subject_id = parse_subject_id(dwi_file.name)
        if not subject_id:
            continue

        # Find matching ADC and Mask
        adc_file = adc_dir / dwi_file.name.replace("_dwi.", "_adc.")
        mask_file = mask_dir / dwi_file.name.replace("_dwi.", "_lesion-msk.")

        if not adc_file.exists():
            continue  # Skip incomplete cases

        cases[subject_id] = CaseFiles(
            dwi=dwi_file,
            adc=adc_file,
            ground_truth=mask_file if mask_file.exists() else None,
        )

    return LocalDataset(data_dir=data_dir, cases=cases)
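Because the pairing above only inspects filenames, it can be smoke-tested against a throwaway directory of empty placeholder files, no real NIfTI data needed. A self-contained sketch that reproduces the name-based matching:

```python
import tempfile
from pathlib import Path

# Throwaway tree mimicking the ISLES24 layout; empty files stand in for volumes.
root = Path(tempfile.mkdtemp())
for sub in ("Images-DWI", "Images-ADC", "Masks"):
    (root / sub).mkdir()
for num in (1, 5):
    sid = f"sub-stroke{num:04d}"
    (root / "Images-DWI" / f"{sid}_ses-02_dwi.nii.gz").touch()
    (root / "Images-ADC" / f"{sid}_ses-02_adc.nii.gz").touch()
    (root / "Masks" / f"{sid}_ses-02_lesion-msk.nii.gz").touch()
# Deliberately incomplete case: DWI without ADC should be skipped.
(root / "Images-DWI" / "sub-stroke0009_ses-02_dwi.nii.gz").touch()

# Reproduce the name-based pairing used by build_local_dataset above.
paired = {}
for dwi in (root / "Images-DWI").glob("*.nii.gz"):
    adc = root / "Images-ADC" / dwi.name.replace("_dwi.", "_adc.")
    if adc.exists():
        paired[dwi.name.split("_")[0]] = (dwi, adc)

print(sorted(paired))  # -> ['sub-stroke0001', 'sub-stroke0005']
```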

synthetic fixture structure

Unit tests MUST use fixtures that replicate the exact directory structure. Add to tests/conftest.py:

from pathlib import Path

import nibabel as nib
import numpy as np
import pytest


@pytest.fixture
def synthetic_isles_dir(temp_dir: Path) -> Path:
    """
    Create synthetic ISLES24-like directory structure.

    Structure:
        temp_dir/
        β”œβ”€β”€ Images-DWI/
        β”‚   β”œβ”€β”€ sub-stroke0001_ses-02_dwi.nii.gz
        β”‚   └── sub-stroke0002_ses-02_dwi.nii.gz
        β”œβ”€β”€ Images-ADC/
        β”‚   β”œβ”€β”€ sub-stroke0001_ses-02_adc.nii.gz
        β”‚   └── sub-stroke0002_ses-02_adc.nii.gz
        └── Masks/
            β”œβ”€β”€ sub-stroke0001_ses-02_lesion-msk.nii.gz
            └── sub-stroke0002_ses-02_lesion-msk.nii.gz
    """
    dwi_dir = temp_dir / "Images-DWI"
    adc_dir = temp_dir / "Images-ADC"
    mask_dir = temp_dir / "Masks"

    dwi_dir.mkdir()
    adc_dir.mkdir()
    mask_dir.mkdir()

    for subject_num in [1, 2]:
        subject_id = f"sub-stroke{subject_num:04d}"

        # Create DWI
        dwi_data = np.random.rand(10, 10, 5).astype(np.float32)
        dwi_img = nib.Nifti1Image(dwi_data, affine=np.eye(4))
        nib.save(dwi_img, dwi_dir / f"{subject_id}_ses-02_dwi.nii.gz")

        # Create ADC
        adc_data = np.random.rand(10, 10, 5).astype(np.float32) * 2000
        adc_img = nib.Nifti1Image(adc_data, affine=np.eye(4))
        nib.save(adc_img, adc_dir / f"{subject_id}_ses-02_adc.nii.gz")

        # Create Mask
        mask_data = (np.random.rand(10, 10, 5) > 0.9).astype(np.uint8)
        mask_img = nib.Nifti1Image(mask_data, affine=np.eye(4))
        nib.save(mask_img, mask_dir / f"{subject_id}_ses-02_lesion-msk.nii.gz")

    return temp_dir

tdd plan

# tests/data/test_loader.py

def test_load_from_local_returns_local_dataset(synthetic_isles_dir):
    """Local mode returns LocalDataset."""
    ...

def test_load_from_local_finds_all_cases(synthetic_isles_dir):
    """Finds all cases in synthetic structure."""
    ...

# tests/data/test_adapter.py

def test_parse_subject_id_extracts_correctly():
    """Extracts subject ID from BIDS filename."""
    assert parse_subject_id("sub-stroke0005_ses-02_dwi.nii.gz") == "sub-stroke0005"

def test_build_local_dataset_matches_files(synthetic_isles_dir):
    """Matches DWI, ADC, Mask by subject ID."""
    ...

def test_get_case_returns_case_files(synthetic_isles_dir):
    """get_case returns CaseFiles with correct paths."""
    ...

done criteria (phase 1a)

  • uv run pytest tests/data/ -v passes
  • Can load all 149 cases from data/isles24/
  • list_case_ids() returns 149 subject IDs
  • get_case("sub-stroke0005") returns valid CaseFiles
  • Type checking passes: uv run mypy src/stroke_deepisles_demo/data/

phase 1b: test tobias's nifti feature (NEXT)

purpose

Verify that Tobias's Nifti() feature type from the datasets fork can correctly load/parse NIfTI files. This proves the loading part of the consumption pipeline works, even though the download part is broken.

approach

# Test script to verify Nifti() feature works on local files
from datasets import Features, Value
from datasets.features import Nifti  # From Tobias's fork

# Create a simple dataset from local files
features = Features({
    "subject_id": Value("string"),
    "dwi": Nifti(),
    "adc": Nifti(),
    "mask": Nifti(),
})

# Load a single case and verify Nifti() decodes correctly

done criteria (phase 1b)

  • Tobias's Nifti() feature loads local .nii.gz files
  • Decoded NIfTI has correct shape/dtype
  • Can access voxel data via nibabel-like interface

phase 1c: proper huggingface upload (FUTURE)

purpose

Re-upload ISLES24 data to HuggingFace properly using the arc-aphasia-bids approach. This proves the production pipeline works.

approach

  1. Use BIDS loader from Tobias's fork
  2. Create proper parquet schema with columns:
    • subject: string
    • session: string
    • dwi: Nifti()
    • adc: Nifti()
    • mask: Nifti()
  3. Upload to new HuggingFace repo (e.g., The-Obstacle-Is-The-Way/ISLES24-BIDS)

done criteria (phase 1c)

  • Dataset uploaded to HuggingFace with proper schema
  • HuggingFace dataset viewer shows data correctly
  • load_dataset("new-repo-id") returns Dataset with expected columns

phase 1d: consumption verification (FUTURE)

purpose

Verify the full round-trip: download from HuggingFace using Tobias's fork and consume the data end-to-end.

approach

from datasets import load_dataset

# This should work after Phase 1C
ds = load_dataset("The-Obstacle-Is-The-Way/ISLES24-BIDS")
case = ds["train"][0]
print(case["dwi"].shape)  # Should work!

new adapter function

When Phase 1D is implemented, adapter.py will need a new function alongside build_local_dataset:

def adapt_hf_case(hf_row: dict) -> CaseFiles:
    """
    Adapt a HuggingFace Dataset row to CaseFiles.

    Args:
        hf_row: Row from load_dataset() with columns:
            - dwi: Nifti feature (nibabel-like object)
            - adc: Nifti feature
            - mask: Nifti feature
            - subject: str

    Returns:
        CaseFiles with materialized paths or nibabel objects
    """
    # Implementation depends on how Nifti() feature exposes data
    # May need to write to temp files or pass nibabel objects directly
    ...

This maintains the same CaseFiles contract for downstream phases regardless of data source.

done criteria (phase 1d)

  • load_dataset() works on properly uploaded dataset
  • adapt_hf_case() function converts HF rows to CaseFiles
  • Full demo runs with HuggingFace consumption (not just local files)
  • Documents the pitfall for future projects

dependencies

No new dependencies needed beyond Phase 0.

notes

  • The original adapter.py assumed an HF Dataset with columns - COMPLETELY WRONG for this dataset
  • The original loader.py called load_dataset() directly - FAILS on this dataset
  • staging.py is still correct - it just needs CaseFiles with paths