Commit 363ba14 (unverified) · Parent(s): 4157a29 · committed by VibecoderMcSwaggins

feat(data): integrate HuggingFace dataset as primary data source (#11)

## Summary
Integrates HuggingFace dataset `hugging-science/isles24-stroke` as the primary data source.

### Key Changes:
- Added `HuggingFaceDataset` adapter with temp-file caching and cleanup
- Updated `load_isles_dataset()` to auto-detect local vs HF mode
- Added comprehensive mocked unit tests for HF adapter
- Extended `Dataset` protocol with context manager support

### CodeRabbit Findings Addressed:
1. ✅ Sort `list_case_ids()` return value in HuggingFaceDataset
2. ✅ Simplified auto-detection heuristic (removed parent.exists() check)
3. ✅ Use context manager in integration test
4. ❌ Rejected: patch target change (lazy import makes current approach correct)

### Test Results:
- 125 tests pass
- ruff clean
- mypy clean

.gitignore CHANGED
@@ -212,3 +212,6 @@ data/isles24/
 # Discovery artifacts (schema reports, samples)
 data/discovery/
 data/scratch/
+
+# macOS
+.DS_Store
README.md CHANGED
@@ -12,7 +12,7 @@ short_description: Ischemic stroke lesion segmentation using DeepISLES
 models:
 - isleschallenge/deepisles
 datasets:
-- YongchengYAO/ISLES24-MR-Lite
+- hugging-science/isles24-stroke
 tags:
 - medical-imaging
 - stroke
@@ -29,7 +29,7 @@ tags:
 [![Code style: ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
 [![Checked with mypy](http://www.mypy-lang.org/static/mypy_badge.svg)](http://mypy-lang.org/)

-A demonstration pipeline and UI for ischemic stroke lesion segmentation using **DeepISLES** and **ISLES24-MR-Lite** data.
+A demonstration pipeline and UI for ischemic stroke lesion segmentation using **DeepISLES** and **ISLES'24** data.

 This project provides a complete end-to-end workflow:
 1. **Data Loading**: Lazy-loading of NIfTI neuroimaging data from HuggingFace.
data/README.md CHANGED
@@ -1,39 +1,52 @@
 # Data Directory

-This folder contains local neuroimaging data for the stroke-deepisles-demo project.
-
-## Structure
-
-```text
-data/
-├── README.md       # This file (tracked)
-├── isles24/        # ISLES24 NIfTI files (gitignored)
-│   ├── Images-DWI/ # DWI volumes (149 files)
-│   ├── Images-ADC/ # ADC maps (149 files)
-│   └── Masks/      # Ground truth lesion masks (149 files)
-└── discovery/      # Schema reports (gitignored)
-    └── isles24_schema_report.txt
+This folder is for local development data only. The primary data source is HuggingFace.
+
+## Data Source
+
+**Primary**: [hugging-science/isles24-stroke](https://huggingface.co/datasets/hugging-science/isles24-stroke)
+
+The dataset is automatically downloaded and cached by HuggingFace when you run:
+
+```python
+from stroke_deepisles_demo.data import load_isles_dataset
+
+# Loads from HuggingFace (default)
+dataset = load_isles_dataset()
+
+# Access cases
+case = dataset.get_case(0)  # or dataset.get_case("sub-stroke0001")
 ```

-## Setup
-
-1. Download ISLES24-MR-Lite from [HuggingFace](https://huggingface.co/datasets/YongchengYAO/ISLES24-MR-Lite)
-2. Extract the ZIP files into `data/isles24/`:
-   - `Images-DWI.zip` → `data/isles24/Images-DWI/`
-   - `Images-ADC.zip` → `data/isles24/Images-ADC/`
-   - `Masks.zip` → `data/isles24/Masks/`
-
-## File Naming Convention
-
-Files follow BIDS-like naming:
-```text
-sub-stroke{XXXX}_ses-02_{modality}.nii.gz
+## HuggingFace Cache Location
+
+Data is cached at: `~/.cache/huggingface/datasets/hugging-science___isles24-stroke/`
+
+## Dataset Contents
+
+149 acute ischemic stroke cases with:
+- **Imaging**: DWI, ADC, CT, CTA, perfusion maps (tmax, mtt, cbf, cbv)
+- **Masks**: lesion_mask, lvo_mask, cow_segmentation
+- **Clinical**: age, sex, nihss_admission, mrs_admission, mrs_3month
+
+## Local Development (Optional)
+
+For offline development, you can still use a local directory:
+
+```python
+dataset = load_isles_dataset("path/to/local/data", local_mode=True)
 ```

-Example: `sub-stroke0005_ses-02_dwi.nii.gz`
+Expected structure for local mode:
+```text
+data/
+├── Images-DWI/  # DWI volumes
+├── Images-ADC/  # ADC maps
+└── Masks/       # Ground truth lesion masks
+```

 ## Notes

-- All data files are gitignored to avoid committing large binaries
-- The `discovery/` folder contains schema reports from data exploration scripts
-- See `docs/specs/02-phase-1-data-access.md` for detailed data loading documentation
+- All data files are gitignored
+- On HuggingFace Spaces, data loads automatically from the HF cache
+- See dataset card for citation requirements
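The write-to-temp-then-cleanup lifecycle that the HuggingFace adapter uses for NIfTI bytes can be illustrated with a minimal stdlib-only sketch (the `TempCaseStore` class here is hypothetical, not part of the package):

```python
import shutil
import tempfile
from pathlib import Path


class TempCaseStore:
    """Minimal sketch of the shared-temp-directory caching pattern."""

    def __init__(self) -> None:
        self._temp_dir: Path | None = None

    def __enter__(self) -> "TempCaseStore":
        return self

    def __exit__(self, *args: object) -> None:
        self.cleanup()

    def write_case(self, subject_id: str, payload: bytes) -> Path:
        # Lazily create one shared temp directory on first access
        if self._temp_dir is None:
            self._temp_dir = Path(tempfile.mkdtemp(prefix="isles24_demo_"))
        path = self._temp_dir / f"{subject_id}.nii.gz"
        path.write_bytes(payload)
        return path

    def cleanup(self) -> None:
        # Remove the whole directory tree; safe to call repeatedly
        if self._temp_dir and self._temp_dir.exists():
            shutil.rmtree(self._temp_dir, ignore_errors=True)
        self._temp_dir = None
```

Using the class as a context manager guarantees the temp directory is removed even if processing raises, which is why the adapters recommend `with load_isles_dataset() as ds:` over manual cleanup.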
docs/dataset-card/isles24-stroke.md ADDED
@@ -0,0 +1,179 @@
+---
+license: cc-by-nc-sa-4.0
+task_categories:
+- image-segmentation
+tags:
+- medical
+- neuroimaging
+- stroke
+- CT
+- MRI
+- perfusion
+- ISLES
+- BIDS
+size_categories:
+- n<1K
+---
+
+# ISLES'24 Stroke Training Dataset
+
+Multi-center longitudinal multimodal acute ischemic stroke training dataset from the ISLES'24 Challenge.
+
+## Dataset Description
+
+- **Source:** [Zenodo Record 17652035](https://zenodo.org/records/17652035) (v7, November 2025)
+- **Challenge:** [ISLES 2024](https://isles-24.grand-challenge.org/)
+- **Paper:** [Riedel et al., arXiv:2408.11142](https://arxiv.org/abs/2408.11142)
+- **License:** CC BY-NC-SA 4.0
+- **Size:** 99 GB (compressed)
+
+## Overview
+
+149 acute ischemic stroke training cases with:
+- **Admission imaging (ses-01):** Non-contrast CT, CT angiography, 4D CT perfusion
+- **Follow-up imaging (ses-02):** Post-treatment MRI (DWI, ADC)
+- **Clinical data:** Demographics, patient history, admission NIHSS, 3-month mRS outcomes
+- **Annotations:** Infarct masks, large vessel occlusion masks, Circle of Willis anatomy
+
+> **Note:** The ISLES'24 paper describes a training set of 150 cases; the Zenodo v7 training archive contains 149 publicly released subjects.
+
+## Dataset Structure
+
+### Imaging Modalities
+
+| Session | Modality | Description |
+|---------|----------|-------------|
+| ses-01 (Acute) | `ncct` | Non-contrast CT |
+| ses-01 (Acute) | `cta` | CT Angiography |
+| ses-01 (Acute) | `ctp` | 4D CT Perfusion time series |
+| ses-01 (Acute) | `tmax` | Time-to-maximum perfusion map |
+| ses-01 (Acute) | `mtt` | Mean transit time map |
+| ses-01 (Acute) | `cbf` | Cerebral blood flow map |
+| ses-01 (Acute) | `cbv` | Cerebral blood volume map |
+| ses-02 (Follow-up) | `dwi` | Diffusion-weighted MRI |
+| ses-02 (Follow-up) | `adc` | Apparent diffusion coefficient |
+
+### Derivative Masks
+
+| Mask | Description |
+|------|-------------|
+| `lesion_mask` | Binary infarct segmentation (from follow-up MRI) |
+| `lvo_mask` | Large vessel occlusion mask (from CTA) |
+| `cow_mask` | Circle of Willis anatomy (multi-label, auto-generated from CTA) |
+
+### Clinical Variables
+
+Clinical variables are extracted from per-subject XLSX files in the `phenotype/` directory:
+
+| Variable | Source File | Description |
+|----------|-------------|-------------|
+| `age` | demographic_baseline.xlsx | Patient age at admission |
+| `sex` | demographic_baseline.xlsx | Patient sex (M/F) |
+| `nihss_admission` | demographic_baseline.xlsx | NIH Stroke Scale score at admission |
+| `mrs_admission` | demographic_baseline.xlsx | Modified Rankin Scale at admission |
+| `mrs_3month` | outcome.xlsx | Modified Rankin Scale at 3 months (primary outcome) |
+
+## Usage
+
+```python
+from datasets import load_dataset
+
+ds = load_dataset("hugging-science/isles24-stroke", split="train")
+
+# Access a subject
+example = ds[0]
+print(example["subject_id"])       # "sub-stroke0001"
+print(example["ncct"])             # Non-contrast CT array
+print(example["dwi"])              # Diffusion-weighted MRI
+print(example["lesion_mask"])      # Ground truth segmentation
+print(example["nihss_admission"])  # NIH Stroke Scale at admission
+print(example["mrs_3month"])       # Modified Rankin Scale at 3 months
+```
+
+## Data Organization
+
+The source data follows BIDS structure. This tree shows the actual Zenodo v7 layout:
+
+```
+train/
+├── clinical_data-description.xlsx
+├── raw_data/
+│   └── sub-stroke0001/
+│       └── ses-01/
+│           ├── sub-stroke0001_ses-01_ncct.nii.gz
+│           ├── sub-stroke0001_ses-01_cta.nii.gz
+│           ├── sub-stroke0001_ses-01_ctp.nii.gz
+│           └── perfusion-maps/
+│               ├── sub-stroke0001_ses-01_tmax.nii.gz
+│               ├── sub-stroke0001_ses-01_mtt.nii.gz
+│               ├── sub-stroke0001_ses-01_cbf.nii.gz
+│               └── sub-stroke0001_ses-01_cbv.nii.gz
+├── derivatives/
+│   └── sub-stroke0001/
+│       ├── ses-01/
+│       │   ├── perfusion-maps/
+│       │   │   ├── sub-stroke0001_ses-01_space-ncct_tmax.nii.gz
+│       │   │   ├── sub-stroke0001_ses-01_space-ncct_mtt.nii.gz
+│       │   │   ├── sub-stroke0001_ses-01_space-ncct_cbf.nii.gz
+│       │   │   └── sub-stroke0001_ses-01_space-ncct_cbv.nii.gz
+│       │   ├── sub-stroke0001_ses-01_space-ncct_cta.nii.gz
+│       │   ├── sub-stroke0001_ses-01_space-ncct_ctp.nii.gz
+│       │   ├── sub-stroke0001_ses-01_space-ncct_lvo-msk.nii.gz
+│       │   └── sub-stroke0001_ses-01_space-ncct_cow-msk.nii.gz
+│       └── ses-02/
+│           ├── sub-stroke0001_ses-02_space-ncct_dwi.nii.gz
+│           ├── sub-stroke0001_ses-02_space-ncct_adc.nii.gz
+│           └── sub-stroke0001_ses-02_space-ncct_lesion-msk.nii.gz
+└── phenotype/
+    └── sub-stroke0001/
+        ├── ses-01/
+        └── ses-02/
+```
+
+## Citation
+
+When using this dataset, please cite:
+
+```bibtex
+@article{riedel2024isles,
+  title={ISLES'24 -- A Real-World Longitudinal Multimodal Stroke Dataset},
+  author={Riedel, Evamaria Olga and de la Rosa, Ezequiel and Baran, The Anh and
+          Hernandez Petzsche, Moritz and Baazaoui, Hakim and Yang, Kaiyuan and
+          Musio, Fabio Antonio and Huang, Houjing and Robben, David and
+          Seia, Joaquin Oscar and Wiest, Roland and Reyes, Mauricio and
+          Su, Ruisheng and Zimmer, Claus and Boeckh-Behrens, Tobias and
+          Berndt, Maria and Menze, Bjoern and Rueckert, Daniel and
+          Wiestler, Benedikt and Wegener, Susanne and Kirschke, Jan Stefan},
+  journal={arXiv preprint arXiv:2408.11142},
+  year={2024}
+}
+
+@article{delarosa2024isles,
+  title={ISLES'24: Final Infarct Prediction with Multimodal Imaging and Clinical Data. Where Do We Stand?},
+  author={de la Rosa, Ezequiel and Su, Ruisheng and Reyes, Mauricio and
+          Wiest, Roland and Riedel, Evamaria Olga and Kofler, Florian and
+          others and Menze, Bjoern},
+  journal={arXiv preprint arXiv:2408.10966},
+  year={2024}
+}
+```
+
+If using Circle of Willis masks, also cite:
+
+```bibtex
+@article{yang2023benchmarking,
+  title={Benchmarking the CoW with the TopCoW Challenge: Topology-Aware Anatomical
+         Segmentation of the Circle of Willis for CTA and MRA},
+  author={Yang, Kaiyuan and Musio, Fabio and Ma, Yue and Juchler, Norman and
+          Paetzold, Johannes C and Al-Maskari, Rami and others and Menze, Bjoern},
+  journal={arXiv preprint arXiv:2312.17670},
+  year={2023}
+}
+```
+
+## Related Resources
+
+- [ISLES 2024 Challenge](https://isles-24.grand-challenge.org/)
+- [Zenodo Dataset (DOI: 10.5281/zenodo.17652035)](https://doi.org/10.5281/zenodo.17652035)
+- [Dataset Paper (arXiv:2408.11142)](https://arxiv.org/abs/2408.11142)
+- [Challenge Paper (arXiv:2408.10966)](https://arxiv.org/abs/2408.10966)
src/stroke_deepisles_demo/data/adapter.py CHANGED
@@ -3,14 +3,17 @@
 from __future__ import annotations

 import re
-from dataclasses import dataclass
-from typing import TYPE_CHECKING
+import shutil
+import tempfile
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import TYPE_CHECKING, Any, Self

+from stroke_deepisles_demo.core.exceptions import DataLoadError
 from stroke_deepisles_demo.core.logging import get_logger

 if TYPE_CHECKING:
     from collections.abc import Iterator
-    from pathlib import Path

     from stroke_deepisles_demo.core.types import CaseFiles

@@ -19,7 +22,15 @@ logger = get_logger(__name__)

 @dataclass
 class LocalDataset:
-    """File-based dataset for local ISLES24 data."""
+    """File-based dataset for local ISLES24 data.
+
+    Can be used as a context manager for consistency with HuggingFaceDataset,
+    though no cleanup is needed for local files.
+
+    Example:
+        with build_local_dataset(path) as ds:
+            case = ds.get_case(0)
+    """

     data_dir: Path
     cases: dict[str, CaseFiles]  # subject_id -> files
@@ -30,6 +41,13 @@ class LocalDataset:
     def __iter__(self) -> Iterator[str]:
         return iter(self.cases.keys())

+    def __enter__(self) -> Self:
+        return self
+
+    def __exit__(self, *args: object) -> None:
+        # No cleanup needed for local files
+        pass
+
     def list_case_ids(self) -> list[str]:
         """Return sorted list of subject IDs."""
         return sorted(self.cases.keys())
@@ -40,6 +58,10 @@ class LocalDataset:
         case_id = self.list_case_ids()[case_id]
         return self.cases[case_id]

+    def cleanup(self) -> None:
+        """No-op for local dataset (files are not temporary)."""
+        pass
+

 # Subject ID extraction
 SUBJECT_PATTERN = re.compile(r"sub-(stroke\d{4})_ses-\d+_.*\.nii\.gz")
@@ -111,3 +133,154 @@ def build_local_dataset(data_dir: Path) -> LocalDataset:

     logger.info("Loaded %d cases from %s", len(cases), data_dir)
     return LocalDataset(data_dir=data_dir, cases=cases)
+
+
+# =============================================================================
+# HuggingFace Dataset Adapter
+# =============================================================================
+
+
+@dataclass
+class HuggingFaceDataset:
+    """Dataset adapter for HuggingFace ISLES24 dataset.
+
+    Wraps the HuggingFace dataset and provides the same interface as LocalDataset.
+    When get_case() is called, writes NIfTI bytes to temp files and returns paths.
+
+    IMPORTANT: Use as a context manager to ensure temp files are cleaned up:
+
+        with load_isles_dataset() as ds:
+            case = ds.get_case(0)
+            # ... process case ...
+        # temp files automatically cleaned up
+
+    Or call cleanup() manually when done.
+    """
+
+    dataset_id: str
+    _hf_dataset: Any = field(repr=False)
+    _case_ids: list[str] = field(default_factory=list)
+    _temp_dir: Path | None = field(default=None, repr=False)
+    _cached_cases: dict[str, CaseFiles] = field(default_factory=dict, repr=False)
+
+    def __len__(self) -> int:
+        return len(self._hf_dataset)
+
+    def __iter__(self) -> Iterator[str]:
+        return iter(self._case_ids)
+
+    def __enter__(self) -> Self:
+        return self
+
+    def __exit__(self, *args: object) -> None:
+        self.cleanup()
+
+    def list_case_ids(self) -> list[str]:
+        """Return sorted list of subject IDs."""
+        return sorted(self._case_ids)
+
+    def get_case(self, case_id: str | int) -> CaseFiles:
+        """Get files for a case by ID or index.
+
+        Writes NIfTI bytes to temp files on first access; returns cached paths
+        on subsequent calls for the same case.
+
+        Raises:
+            DataLoadError: If HuggingFace data is malformed or missing required fields.
+        """
+        if isinstance(case_id, int):
+            idx = case_id
+            subject_id = self._case_ids[idx]
+        else:
+            subject_id = case_id
+            idx = self._case_ids.index(subject_id)
+
+        # Return cached case if already materialized
+        if subject_id in self._cached_cases:
+            return self._cached_cases[subject_id]
+
+        # Create shared temp directory on first use
+        if self._temp_dir is None:
+            self._temp_dir = Path(tempfile.mkdtemp(prefix="isles24_hf_"))
+            logger.debug("Created temp directory: %s", self._temp_dir)
+
+        # Get the HuggingFace example
+        example = self._hf_dataset[idx]
+
+        # Create case subdirectory
+        case_dir = self._temp_dir / subject_id
+        case_dir.mkdir(exist_ok=True)
+
+        # Write NIfTI files to temp directory
+        dwi_path = case_dir / f"{subject_id}_ses-02_dwi.nii.gz"
+        adc_path = case_dir / f"{subject_id}_ses-02_adc.nii.gz"
+        mask_path = case_dir / f"{subject_id}_ses-02_lesion-msk.nii.gz"
+
+        # Extract bytes with defensive error handling
+        try:
+            dwi_bytes = example["dwi"]["bytes"]
+            adc_bytes = example["adc"]["bytes"]
+        except (KeyError, TypeError) as e:
+            raise DataLoadError(
+                f"Malformed HuggingFace data for {subject_id}: missing 'dwi' or 'adc' bytes. "
+                f"The dataset schema may have changed. Error: {e}"
+            ) from e
+
+        # Write the gzipped NIfTI bytes
+        dwi_path.write_bytes(dwi_bytes)
+        adc_path.write_bytes(adc_bytes)
+
+        case_files: CaseFiles = {
+            "dwi": dwi_path,
+            "adc": adc_path,
+        }
+
+        # Write lesion mask if available
+        try:
+            mask_data = example.get("lesion_mask")
+            if mask_data and mask_data.get("bytes"):
+                mask_path.write_bytes(mask_data["bytes"])
+                case_files["ground_truth"] = mask_path
+        except (KeyError, TypeError):
+            # Mask is optional, log and continue
+            logger.debug("No lesion mask available for %s", subject_id)
+
+        # Cache for subsequent calls
+        self._cached_cases[subject_id] = case_files
+
+        return case_files
+
+    def cleanup(self) -> None:
+        """Remove temp directory and clear cache."""
+        if self._temp_dir and self._temp_dir.exists():
+            shutil.rmtree(self._temp_dir, ignore_errors=True)
+            logger.debug("Cleaned up temp directory: %s", self._temp_dir)
+        self._temp_dir = None
+        self._cached_cases.clear()
+
+
+def build_huggingface_dataset(dataset_id: str) -> HuggingFaceDataset:
+    """
+    Load ISLES24 dataset from HuggingFace Hub.
+
+    Args:
+        dataset_id: HuggingFace dataset identifier (e.g., "hugging-science/isles24-stroke")
+
+    Returns:
+        HuggingFaceDataset providing case access
+    """
+    from datasets import load_dataset
+
+    logger.info("Loading HuggingFace dataset: %s", dataset_id)
+    hf_dataset = load_dataset(dataset_id, split="train")
+
+    # Extract case IDs
+    case_ids = [example["subject_id"] for example in hf_dataset]
+
+    logger.info("Loaded %d cases from HuggingFace: %s", len(case_ids), dataset_id)
+
+    return HuggingFaceDataset(
+        dataset_id=dataset_id,
+        _hf_dataset=hf_dataset,
+        _case_ids=case_ids,
+    )
src/stroke_deepisles_demo/data/loader.py CHANGED
@@ -4,10 +4,29 @@ from __future__ import annotations

 from dataclasses import dataclass
 from pathlib import Path
-from typing import TYPE_CHECKING
+from typing import TYPE_CHECKING, Protocol, Self

 if TYPE_CHECKING:
-    from stroke_deepisles_demo.data.adapter import LocalDataset
+    from stroke_deepisles_demo.core.types import CaseFiles
+
+
+class Dataset(Protocol):
+    """Protocol for dataset access.
+
+    All dataset implementations support context manager usage for proper cleanup:
+
+        with load_isles_dataset() as ds:
+            case = ds.get_case(0)
+            # ... process case ...
+        # cleanup happens automatically
+    """
+
+    def __len__(self) -> int: ...
+    def __enter__(self) -> Self: ...
+    def __exit__(self, *args: object) -> None: ...
+    def list_case_ids(self) -> list[str]: ...
+    def get_case(self, case_id: str | int) -> CaseFiles: ...
+    def cleanup(self) -> None: ...


 @dataclass
@@ -20,28 +39,61 @@ class DatasetInfo:
     has_ground_truth: bool


+# Default HuggingFace dataset ID
+DEFAULT_HF_DATASET = "hugging-science/isles24-stroke"
+
+
 def load_isles_dataset(
-    source: str | Path = "data/isles24",
+    source: str | Path | None = None,
     *,
-    local_mode: bool = True,  # Default to local for now
-) -> LocalDataset:
+    local_mode: bool | None = None,
+) -> Dataset:
     """
-    Load ISLES24 dataset.
+    Load ISLES24 dataset from local directory or HuggingFace Hub.

     Args:
-        source: Local directory path or HuggingFace dataset ID
-        local_mode: If True, treat source as local directory
+        source: Local directory path or HuggingFace dataset ID.
+            If None, uses HuggingFace dataset by default.
+        local_mode: If True, treat source as local directory.
+            If None, auto-detect based on source type.

     Returns:
-        Dataset-like object providing case access
+        Dataset-like object providing case access. Use as context manager
+        for automatic cleanup of temp files (important for HuggingFace mode).
+
+    Examples:
+        # Load from HuggingFace with automatic cleanup (recommended)
+        with load_isles_dataset() as ds:
+            case = ds.get_case(0)

-    Raises:
-        NotImplementedError: If non-local mode is requested
+        # Load from local directory
+        ds = load_isles_dataset("data/isles24", local_mode=True)
+
+        # Load specific HuggingFace dataset
+        ds = load_isles_dataset("hugging-science/isles24-stroke")
     """
-    if local_mode or isinstance(source, Path):
+    # Auto-detect mode if not specified
+    if local_mode is None:
+        if source is None:
+            local_mode = False  # Default to HuggingFace
+        elif isinstance(source, Path):
+            local_mode = True
+        else:
+            # String: check if it's an existing local path
+            # Only select local mode if the path itself exists
+            # (avoids misclassifying HF dataset IDs like "org/dataset")
+            source_path = Path(source)
+            local_mode = source_path.exists()
+
+    if local_mode:
         from stroke_deepisles_demo.data.adapter import build_local_dataset

+        if source is None:
+            source = "data/isles24"
         return build_local_dataset(Path(source))

-    # Future: return _load_from_huggingface(source)
-    raise NotImplementedError("HuggingFace mode not yet implemented")
+    # HuggingFace mode
+    from stroke_deepisles_demo.data.adapter import build_huggingface_dataset
+
+    dataset_id = source if source else DEFAULT_HF_DATASET
+    return build_huggingface_dataset(str(dataset_id))
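The `Dataset` Protocol relies on structural typing: any class that implements the listed methods conforms without inheriting from `Dataset`. A minimal self-contained illustration (the `runtime_checkable` decorator and the `InMemoryDataset` class here are illustrative, not part of the project):

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Dataset(Protocol):
    """Structural interface: any class with these methods conforms."""

    def list_case_ids(self) -> list[str]: ...
    def cleanup(self) -> None: ...


class InMemoryDataset:
    # Note: no inheritance from Dataset; conformance is purely structural
    def __init__(self) -> None:
        self._ids = ["sub-stroke0002", "sub-stroke0001"]

    def list_case_ids(self) -> list[str]:
        return sorted(self._ids)

    def cleanup(self) -> None:
        self._ids.clear()
```

This is why `LocalDataset` and `HuggingFaceDataset` can both be returned as `Dataset` without sharing a base class; mypy checks the method signatures structurally.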
tests/data/test_hf_adapter.py ADDED
@@ -0,0 +1,182 @@
+"""Unit tests for HuggingFace dataset adapter with mocked HF dataset."""
+
+from __future__ import annotations
+
+from typing import Any
+from unittest.mock import MagicMock, patch
+
+import pytest
+
+from stroke_deepisles_demo.core.exceptions import DataLoadError
+from stroke_deepisles_demo.data.adapter import HuggingFaceDataset, build_huggingface_dataset
+
+
+def create_mock_hf_example(subject_id: str, include_mask: bool = True) -> dict[str, Any]:
+    """Create a mock HuggingFace dataset example."""
+    example: dict[str, Any] = {
+        "subject_id": subject_id,
+        "dwi": {"bytes": b"fake_dwi_nifti_data", "path": f"{subject_id}_dwi.nii.gz"},
+        "adc": {"bytes": b"fake_adc_nifti_data", "path": f"{subject_id}_adc.nii.gz"},
+    }
+    if include_mask:
+        example["lesion_mask"] = {
+            "bytes": b"fake_mask_nifti_data",
+            "path": f"{subject_id}_lesion-msk.nii.gz",
+        }
+    else:
+        example["lesion_mask"] = None
+    return example
+
+
+@pytest.fixture
+def mock_hf_dataset() -> MagicMock:
+    """Create a mock HuggingFace dataset with 3 subjects."""
+    examples = [
+        create_mock_hf_example("sub-stroke0001"),
+        create_mock_hf_example("sub-stroke0002"),
+        create_mock_hf_example("sub-stroke0003", include_mask=False),
+    ]
+
+    mock_ds = MagicMock()
+    mock_ds.__len__ = MagicMock(return_value=len(examples))
+    mock_ds.__iter__ = MagicMock(return_value=iter(examples))
+    mock_ds.__getitem__ = MagicMock(side_effect=lambda i: examples[i])
+
+    return mock_ds
+
+
+class TestHuggingFaceDataset:
+    """Tests for HuggingFaceDataset class."""
+
+    def test_get_case_writes_files_to_temp_dir(self, mock_hf_dataset: MagicMock) -> None:
+        """Test that get_case writes NIfTI bytes to temp files."""
+        case_ids = ["sub-stroke0001", "sub-stroke0002", "sub-stroke0003"]
+        ds = HuggingFaceDataset(
+            dataset_id="test/dataset",
+            _hf_dataset=mock_hf_dataset,
+            _case_ids=case_ids,
+        )
+
+        try:
+            case = ds.get_case(0)
+
+            assert "dwi" in case
+            assert "adc" in case
+            assert case["dwi"].exists()
+            assert case["adc"].exists()
+            assert case["dwi"].read_bytes() == b"fake_dwi_nifti_data"
+            assert case["adc"].read_bytes() == b"fake_adc_nifti_data"
+        finally:
+            ds.cleanup()
+
+    def test_get_case_includes_ground_truth_when_available(
+        self, mock_hf_dataset: MagicMock
+    ) -> None:
+        """Test that ground truth is included when lesion_mask is present."""
+        case_ids = ["sub-stroke0001", "sub-stroke0002", "sub-stroke0003"]
+        ds = HuggingFaceDataset(
+            dataset_id="test/dataset",
+            _hf_dataset=mock_hf_dataset,
+            _case_ids=case_ids,
+        )
+
+        try:
+            case = ds.get_case(0)  # Has mask
+            assert "ground_truth" in case
+            assert case["ground_truth"].read_bytes() == b"fake_mask_nifti_data"
+
+            case_no_mask = ds.get_case(2)  # No mask
+            assert "ground_truth" not in case_no_mask
+        finally:
+            ds.cleanup()
+
+    def test_get_case_caches_results(self, mock_hf_dataset: MagicMock) -> None:
+        """Test that get_case returns cached paths on subsequent calls."""
+        case_ids = ["sub-stroke0001", "sub-stroke0002", "sub-stroke0003"]
+        ds = HuggingFaceDataset(
+            dataset_id="test/dataset",
+            _hf_dataset=mock_hf_dataset,
+            _case_ids=case_ids,
+        )
+
+        try:
+            case1 = ds.get_case(0)
+            case2 = ds.get_case(0)
+
+            # Same object returned (cached)
+            assert case1 is case2
+
+            # Dataset was only accessed once
+            assert mock_hf_dataset.__getitem__.call_count == 1
+        finally:
+            ds.cleanup()
+
+    def test_context_manager_cleans_up_temp_files(self, mock_hf_dataset: MagicMock) -> None:
+        """Test that using context manager cleans up temp files."""
+        case_ids = ["sub-stroke0001"]
+        ds = HuggingFaceDataset(
+            dataset_id="test/dataset",
+            _hf_dataset=mock_hf_dataset,
+            _case_ids=case_ids,
+        )
+
+        with ds:
+            case = ds.get_case(0)
+            temp_dir = case["dwi"].parent.parent
+            assert temp_dir.exists()
+
+        # After context exit, temp dir should be gone
+        assert not temp_dir.exists()
+
+    def test_cleanup_clears_cache(self, mock_hf_dataset: MagicMock) -> None:
+        """Test that cleanup clears the case cache."""
+        case_ids = ["sub-stroke0001"]
+        ds = HuggingFaceDataset(
+            dataset_id="test/dataset",
+            _hf_dataset=mock_hf_dataset,
+            _case_ids=case_ids,
+        )
+
+        ds.get_case(0)
+        assert len(ds._cached_cases) == 1
+
+        ds.cleanup()
+        assert len(ds._cached_cases) == 0
+
+    def test_get_case_raises_data_load_error_on_malformed_data(self) -> None:
+        """Test that get_case raises DataLoadError for malformed HF data."""
+        # Create mock with missing 'bytes' key
+        malformed_example = {"subject_id": "sub-stroke0001", "dwi": {}, "adc": {}}
+        mock_ds = MagicMock()
+        mock_ds.__len__ = MagicMock(return_value=1)
+        mock_ds.__getitem__ = MagicMock(return_value=malformed_example)
+
+        ds = HuggingFaceDataset(
+            dataset_id="test/dataset",
+            _hf_dataset=mock_ds,
+            _case_ids=["sub-stroke0001"],
+        )
+
+        try:
+            with pytest.raises(DataLoadError, match="Malformed HuggingFace data"):
+                ds.get_case(0)
+        finally:
+            ds.cleanup()
+
+
+class TestBuildHuggingFaceDataset:
+    """Tests for build_huggingface_dataset function."""
+
+    @patch("datasets.load_dataset")
+    def test_loads_dataset_from_hub(self, mock_load_dataset: MagicMock) -> None:
+        """Test that build_huggingface_dataset calls load_dataset correctly."""
+        mock_ds = MagicMock()
+        mock_ds.__iter__ = MagicMock(return_value=iter([{"subject_id": "sub-stroke0001"}]))
+        mock_load_dataset.return_value = mock_ds
+
+        result = build_huggingface_dataset("test/my-dataset")
+
+        mock_load_dataset.assert_called_once_with("test/my-dataset", split="train")
+        assert isinstance(result, HuggingFaceDataset)
+        assert result.dataset_id == "test/my-dataset"
+        assert result._case_ids == ["sub-stroke0001"]
tests/data/test_loader.py CHANGED
@@ -5,8 +5,9 @@ from __future__ import annotations
 from typing import TYPE_CHECKING

 import pytest
+from datasets.exceptions import DatasetNotFoundError

-from stroke_deepisles_demo.data.adapter import LocalDataset
+from stroke_deepisles_demo.data.adapter import HuggingFaceDataset, LocalDataset
 from stroke_deepisles_demo.data.loader import load_isles_dataset

 if TYPE_CHECKING:
@@ -27,7 +28,16 @@ def test_load_from_local_finds_all_cases(synthetic_isles_dir: Path) -> None:
     assert dataset.list_case_ids() == ["sub-stroke0001", "sub-stroke0002"]


-def test_load_raises_not_implemented_for_hf() -> None:
-    """Test that HF mode raises NotImplementedError."""
-    with pytest.raises(NotImplementedError):
-        load_isles_dataset(source="fake/dataset", local_mode=False)
+def test_load_hf_raises_on_invalid_dataset() -> None:
+    """Test that loading a non-existent HF dataset raises DatasetNotFoundError."""
+    with pytest.raises(DatasetNotFoundError):
+        load_isles_dataset(source="fake/nonexistent-dataset", local_mode=False)
+
+
+@pytest.mark.integration
+def test_load_from_huggingface_returns_hf_dataset() -> None:
+    """Test that loading from HuggingFace returns a HuggingFaceDataset."""
+    with load_isles_dataset() as dataset:  # Default is HuggingFace mode
+        assert isinstance(dataset, HuggingFaceDataset)
+        assert len(dataset) == 149
+        assert dataset.list_case_ids()[0] == "sub-stroke0001"