# phase 1: data access layer

## purpose

Implement a data loading layer that provides typed access to ISLES24 neuroimaging cases. This phase is split into sub-phases due to a critical discovery: the upstream dataset is not properly formatted for HuggingFace consumption.

## critical discovery (2025-12-04)

**`YongchengYAO/ISLES24-MR-Lite` is NOT a proper HuggingFace Dataset.**

| What we expected | What actually exists |
|------------------|---------------------|
| `load_dataset()` returns Dataset with columns | `load_dataset()` FAILS with "no data" |
| Columns: `dwi`, `adc`, `mask`, `participant_id` | No columns - just raw ZIP files |
| Parquet/Arrow format | Three ZIP archives dumped on HF |

**Evidence**: `data/discovery/isles24_schema_report.txt`

This means the demo must be built in phases:

1. **Phase 1A**: Local file loader (works NOW with extracted data)
2. **Phase 1B**: Test Tobias's `Nifti()` feature on local files (proves loading works)
3. **Phase 1C**: Upload properly to HuggingFace (future - proves production pipeline)
4. **Phase 1D**: Consume via Tobias's fork (future - proves full round-trip)

---
## phase 1a: local file loader (CURRENT PRIORITY)

### data location

```
data/isles24/                          # Git-ignored
├── Images-DWI/                        # 149 files
│   └── sub-stroke{XXXX}_ses-02_dwi.nii.gz
├── Images-ADC/                        # 149 files
│   └── sub-stroke{XXXX}_ses-02_adc.nii.gz
└── Masks/                             # 149 files
    └── sub-stroke{XXXX}_ses-02_lesion-msk.nii.gz
```
### file naming convention (BIDS-like)

| Component | Pattern | Example |
|-----------|---------|---------|
| Subject ID | `sub-stroke{XXXX}` | `sub-stroke0005` |
| Session | `ses-02` | Always "02" in this dataset |
| Modality | `dwi`, `adc`, `lesion-msk` | - |
| Extension | `.nii.gz` | Compressed NIfTI |

**Subject ID regex**: `sub-(stroke\d{4})_ses-\d+_.*\.nii\.gz`. This is the pattern compiled in `adapter.py`: the capture group, prefixed with `sub-`, forms the subject ID (e.g., `sub-stroke0005`). The session is always `02` in this dataset, but the pattern accepts any session number.

**Note**: Subject IDs have gaps (e.g., 0018 is missing). The range is 0001-0189, for a total of 149 cases.
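
The naming convention and the gap note can be checked mechanically. The following is a minimal sketch and not part of the planned modules: `FILENAME_PATTERN`, `parse_filename`, and `missing_subject_numbers` are hypothetical names, and the pattern is deliberately stricter than the subject ID regex in that it accepts only the three modalities from the table.

```python
from __future__ import annotations

import re

# Hypothetical pattern built from the naming table: subject, session, modality.
FILENAME_PATTERN = re.compile(
    r"sub-stroke(?P<number>\d{4})_ses-(?P<session>\d{2})_"
    r"(?P<modality>dwi|adc|lesion-msk)\.nii\.gz"
)


def parse_filename(name: str) -> dict[str, str] | None:
    """Split a BIDS-like filename into its components, or return None."""
    match = FILENAME_PATTERN.fullmatch(name)
    if match is None:
        return None
    return {
        "subject_id": f"sub-stroke{match['number']}",
        "session": match["session"],
        "modality": match["modality"],
    }


def missing_subject_numbers(names: list[str], first: int = 1, last: int = 189) -> list[int]:
    """List subject numbers in [first, last] that have no matching filename."""
    present = {
        int(match["number"])
        for name in names
        if (match := FILENAME_PATTERN.fullmatch(name))
    }
    return [number for number in range(first, last + 1) if number not in present]
```

Running `missing_subject_numbers` over the names in `Images-DWI/` is one way to confirm which IDs in the 0001-0189 range are absent before trusting the 149-case count.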
### deliverables

- [ ] `src/stroke_deepisles_demo/data/loader.py` - Rewrite with local mode
- [ ] `src/stroke_deepisles_demo/data/adapter.py` - Rewrite for file-based access
- [ ] `src/stroke_deepisles_demo/data/staging.py` - Already correct, no changes
- [ ] Unit tests with synthetic fixtures
- [ ] Integration test with actual extracted data

### interfaces

#### `data/loader.py`
```python
"""Load ISLES24 data from local directory or HuggingFace Hub."""

from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from stroke_deepisles_demo.data.adapter import LocalDataset


@dataclass
class DatasetInfo:
    """Metadata about the dataset."""

    source: str  # "local" or HF dataset ID
    num_cases: int
    modalities: list[str]
    has_ground_truth: bool


def load_isles_dataset(
    source: str | Path = "data/isles24",
    *,
    local_mode: bool = True,  # Default to local for now
) -> LocalDataset:
    """
    Load ISLES24 dataset.

    Args:
        source: Local directory path or HuggingFace dataset ID
        local_mode: If True, treat source as local directory

    Returns:
        Dataset-like object providing case access

    Raises:
        DataLoadError: If data cannot be loaded
        NotImplementedError: If HuggingFace mode is requested (not yet implemented)
    """
    if local_mode or isinstance(source, Path):
        return _load_from_local_directory(Path(source))

    # Future: return _load_from_huggingface(source)
    raise NotImplementedError("HuggingFace mode not yet implemented")


def _load_from_local_directory(data_dir: Path) -> LocalDataset:
    """
    Load cases from extracted local files.

    Expects structure:
        data_dir/
        ├── Images-DWI/sub-stroke{XXXX}_ses-02_dwi.nii.gz
        ├── Images-ADC/sub-stroke{XXXX}_ses-02_adc.nii.gz
        └── Masks/sub-stroke{XXXX}_ses-02_lesion-msk.nii.gz
    """
    ...
```
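
The body of `_load_from_local_directory` is left open above. One thing it could do before scanning is validate the layout so that a wrong path fails with a clear `DataLoadError` rather than an empty dataset. The sketch below is an assumption about how that might look: `check_local_layout` and `REQUIRED_SUBDIRS` are hypothetical names, the `DataLoadError` class is a stand-in for the project's own exception (its real location is not shown here), and `Masks/` is treated as optional because the adapter allows cases without ground truth.

```python
from __future__ import annotations

from pathlib import Path

# Assumption: DWI and ADC are mandatory, Masks/ is optional (ground truth may be absent).
REQUIRED_SUBDIRS = ("Images-DWI", "Images-ADC")


class DataLoadError(Exception):
    """Stand-in for the project's DataLoadError."""


def check_local_layout(data_dir: Path) -> None:
    """Raise DataLoadError if data_dir lacks the expected subdirectories."""
    if not data_dir.is_dir():
        raise DataLoadError(f"Data directory not found: {data_dir}")
    missing = [name for name in REQUIRED_SUBDIRS if not (data_dir / name).is_dir()]
    if missing:
        raise DataLoadError(
            f"{data_dir} is missing required subdirectories: {', '.join(missing)}"
        )
```

With a check like this in place, `_load_from_local_directory` could validate first and then delegate to `build_local_dataset` from `adapter.py`.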
#### `data/adapter.py`

```python
"""Provide typed access to ISLES24 cases."""

from __future__ import annotations

import re
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator

from stroke_deepisles_demo.core.types import CaseFiles


@dataclass
class LocalDataset:
    """File-based dataset for local ISLES24 data."""

    data_dir: Path
    cases: dict[str, CaseFiles]  # subject_id -> files

    def __len__(self) -> int:
        return len(self.cases)

    def __iter__(self) -> Iterator[str]:
        return iter(self.cases.keys())

    def list_case_ids(self) -> list[str]:
        """Return sorted list of subject IDs."""
        return sorted(self.cases.keys())

    def get_case(self, case_id: str | int) -> CaseFiles:
        """Get files for a case by ID or index."""
        if isinstance(case_id, int):
            # Integer indices refer to positions in the sorted ID list
            case_id = self.list_case_ids()[case_id]
        return self.cases[case_id]


# Subject ID extraction
SUBJECT_PATTERN = re.compile(r"sub-(stroke\d{4})_ses-\d+_.*\.nii\.gz")


def parse_subject_id(filename: str) -> str | None:
    """Extract subject ID from BIDS filename."""
    match = SUBJECT_PATTERN.match(filename)
    return f"sub-{match.group(1)}" if match else None


def build_local_dataset(data_dir: Path) -> LocalDataset:
    """
    Scan directory and build case mapping.

    Matches DWI + ADC + Mask files by subject ID.
    """
    dwi_dir = data_dir / "Images-DWI"
    adc_dir = data_dir / "Images-ADC"
    mask_dir = data_dir / "Masks"

    cases: dict[str, CaseFiles] = {}

    # Scan DWI files to get subject IDs (sorted so iteration order is deterministic)
    for dwi_file in sorted(dwi_dir.glob("*.nii.gz")):
        subject_id = parse_subject_id(dwi_file.name)
        if not subject_id:
            continue

        # Find matching ADC and Mask
        adc_file = adc_dir / dwi_file.name.replace("_dwi.", "_adc.")
        mask_file = mask_dir / dwi_file.name.replace("_dwi.", "_lesion-msk.")

        if not adc_file.exists():
            continue  # Skip incomplete cases

        cases[subject_id] = CaseFiles(
            dwi=dwi_file,
            adc=adc_file,
            ground_truth=mask_file if mask_file.exists() else None,
        )

    return LocalDataset(data_dir=data_dir, cases=cases)
```
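
Because `build_local_dataset` skips incomplete cases silently, a case count below 149 gives no hint about which files are absent. A small diagnostic along the following lines could help; it is a sketch only, `SUFFIXES`, `subject_ids_in`, and `incomplete_cases` are hypothetical names, and it assumes the fixed `ses-02` session described in the naming convention.

```python
from __future__ import annotations

from pathlib import Path

# Subdirectory -> filename suffix, assuming the fixed "ses-02" session.
SUFFIXES = {
    "Images-DWI": "_ses-02_dwi.nii.gz",
    "Images-ADC": "_ses-02_adc.nii.gz",
    "Masks": "_ses-02_lesion-msk.nii.gz",
}


def subject_ids_in(directory: Path, suffix: str) -> set[str]:
    """Collect subject IDs (filename minus suffix) found in one subdirectory."""
    if not directory.is_dir():
        return set()
    return {path.name[: -len(suffix)] for path in directory.glob(f"*{suffix}")}


def incomplete_cases(data_dir: Path) -> dict[str, list[str]]:
    """Map each subject ID to the subdirectories that lack a file for it."""
    found = {
        subdir: subject_ids_in(data_dir / subdir, suffix)
        for subdir, suffix in SUFFIXES.items()
    }
    all_ids = set().union(*found.values())
    report: dict[str, list[str]] = {}
    for subject_id in sorted(all_ids):
        missing = [subdir for subdir, ids in found.items() if subject_id not in ids]
        if missing:
            report[subject_id] = missing
    return report
```

An empty report for `data/isles24/` would mean every subject has all three files, which is what the 149/149/149 file counts suggest.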
### synthetic fixture structure

Unit tests MUST use fixtures that replicate the **exact** directory structure. Add to `tests/conftest.py`:

```python
from pathlib import Path

import nibabel as nib
import numpy as np
import pytest


@pytest.fixture
def synthetic_isles_dir(temp_dir: Path) -> Path:
    """
    Create synthetic ISLES24-like directory structure.

    Structure:
        temp_dir/
        ├── Images-DWI/
        │   ├── sub-stroke0001_ses-02_dwi.nii.gz
        │   └── sub-stroke0002_ses-02_dwi.nii.gz
        ├── Images-ADC/
        │   ├── sub-stroke0001_ses-02_adc.nii.gz
        │   └── sub-stroke0002_ses-02_adc.nii.gz
        └── Masks/
            ├── sub-stroke0001_ses-02_lesion-msk.nii.gz
            └── sub-stroke0002_ses-02_lesion-msk.nii.gz
    """
    dwi_dir = temp_dir / "Images-DWI"
    adc_dir = temp_dir / "Images-ADC"
    mask_dir = temp_dir / "Masks"

    dwi_dir.mkdir()
    adc_dir.mkdir()
    mask_dir.mkdir()

    for subject_num in [1, 2]:
        subject_id = f"sub-stroke{subject_num:04d}"

        # Create DWI
        dwi_data = np.random.rand(10, 10, 5).astype(np.float32)
        dwi_img = nib.Nifti1Image(dwi_data, affine=np.eye(4))
        nib.save(dwi_img, dwi_dir / f"{subject_id}_ses-02_dwi.nii.gz")

        # Create ADC
        adc_data = np.random.rand(10, 10, 5).astype(np.float32) * 2000
        adc_img = nib.Nifti1Image(adc_data, affine=np.eye(4))
        nib.save(adc_img, adc_dir / f"{subject_id}_ses-02_adc.nii.gz")

        # Create Mask
        mask_data = (np.random.rand(10, 10, 5) > 0.9).astype(np.uint8)
        mask_img = nib.Nifti1Image(mask_data, affine=np.eye(4))
        nib.save(mask_img, mask_dir / f"{subject_id}_ses-02_lesion-msk.nii.gz")

    return temp_dir
```
### tdd plan

```python
# tests/data/test_loader.py

def test_load_from_local_returns_local_dataset(synthetic_isles_dir):
    """Local mode returns LocalDataset."""
    ...


def test_load_from_local_finds_all_cases(synthetic_isles_dir):
    """Finds all cases in synthetic structure."""
    ...


# tests/data/test_adapter.py
from stroke_deepisles_demo.data.adapter import parse_subject_id


def test_parse_subject_id_extracts_correctly():
    """Extracts subject ID from BIDS filename."""
    assert parse_subject_id("sub-stroke0005_ses-02_dwi.nii.gz") == "sub-stroke0005"


def test_build_local_dataset_matches_files(synthetic_isles_dir):
    """Matches DWI, ADC, Mask by subject ID."""
    ...


def test_get_case_returns_case_files(synthetic_isles_dir):
    """get_case returns CaseFiles with correct paths."""
    ...
```
### done criteria (phase 1a)

- [ ] `uv run pytest tests/data/ -v` passes
- [ ] Can load all 149 cases from `data/isles24/`
- [ ] `list_case_ids()` returns 149 subject IDs
- [ ] `get_case("sub-stroke0005")` returns valid CaseFiles
- [ ] Type checking passes: `uv run mypy src/stroke_deepisles_demo/data/`

---
## phase 1b: test tobias's nifti feature (NEXT)

### purpose

Verify that Tobias's `Nifti()` feature type from the datasets fork can correctly load/parse NIfTI files. This proves the **loading** part of the consumption pipeline works, even though the **download** part is broken.

### approach

```python
# Test script to verify Nifti() feature works on local files
from datasets import Features, Value
from datasets.features import Nifti  # From Tobias's fork

# Create a simple dataset from local files
features = Features({
    "subject_id": Value("string"),
    "dwi": Nifti(),
    "adc": Nifti(),
    "mask": Nifti(),
})

# Load a single case and verify Nifti() decodes correctly
```
### done criteria (phase 1b)

- [ ] Tobias's `Nifti()` feature loads local `.nii.gz` files
- [ ] Decoded NIfTI has correct shape/dtype
- [ ] Can access voxel data via nibabel-like interface
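
Checking "correct shape/dtype" needs a reference value that does not come from the feature under test. One option, sketched here with the standard library only, is to read the shape and on-disk datatype straight from the NIfTI-1 header. `read_nifti1_shape_dtype` and `DTYPE_NAMES` are hypothetical names, the sketch assumes the files are gzip-compressed NIfTI-1 (348-byte header) rather than NIfTI-2, and the dtype it reports is the stored one, which may differ from the dtype of an array a loader hands back after casting or scaling.

```python
from __future__ import annotations

import gzip
import struct
from pathlib import Path

# Subset of NIfTI-1 datatype codes; extend as needed.
DTYPE_NAMES = {2: "uint8", 4: "int16", 8: "int32", 16: "float32", 64: "float64"}


def read_nifti1_shape_dtype(path: Path) -> tuple[tuple[int, ...], str]:
    """Return (shape, on-disk dtype name) from a gzip-compressed NIfTI-1 header."""
    with gzip.open(path, "rb") as fh:
        header = fh.read(348)  # NIfTI-1 headers are 348 bytes long
    if len(header) < 348:
        raise ValueError(f"{path} is too short to hold a NIfTI-1 header")

    # sizeof_hdr (int32 at offset 0) must equal 348; use it to detect byte order.
    for endian in ("<", ">"):
        if struct.unpack_from(f"{endian}i", header, 0)[0] == 348:
            break
    else:
        raise ValueError(f"{path} does not start with a NIfTI-1 header")

    # dim[0] is the number of dimensions; dim[1:] holds the extents (8 x int16 at offset 40).
    dim = struct.unpack_from(f"{endian}8h", header, 40)
    ndim = dim[0]
    if not 1 <= ndim <= 7:
        raise ValueError(f"{path} reports an invalid dimension count: {ndim}")
    shape = tuple(dim[1 : 1 + ndim])

    # datatype code is an int16 at offset 70.
    (datatype,) = struct.unpack_from(f"{endian}h", header, 70)
    return shape, DTYPE_NAMES.get(datatype, f"code {datatype}")
```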
---

## phase 1c: proper huggingface upload (FUTURE)

### purpose

Re-upload ISLES24 data to HuggingFace **properly** using the arc-aphasia-bids approach. This proves the **production** pipeline works.

### approach

1. Use BIDS loader from Tobias's fork
2. Create proper parquet schema with columns:
   - `subject`: string
   - `session`: string
   - `dwi`: Nifti()
   - `adc`: Nifti()
   - `mask`: Nifti()
3. Upload to new HuggingFace repo (e.g., `The-Obstacle-Is-The-Way/ISLES24-BIDS`)
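
As a rough illustration of step 2, the local layout can be turned into a column-oriented mapping with the five planned column names. This is a sketch under assumptions: `build_upload_columns` is a hypothetical helper, the session is stored as `"ses-02"` (the schema only says "string"), only complete cases are included, and the image columns hold file paths. How those paths are turned into `Nifti()` columns depends on the fork's API, which this document does not show, so that step is not attempted here.

```python
from __future__ import annotations

from pathlib import Path


def build_upload_columns(data_dir: Path, session: str = "ses-02") -> dict[str, list[str]]:
    """Build a column-oriented mapping (column name -> values) from local files."""
    columns: dict[str, list[str]] = {
        "subject": [],
        "session": [],
        "dwi": [],
        "adc": [],
        "mask": [],
    }
    for dwi in sorted((data_dir / "Images-DWI").glob(f"*_{session}_dwi.nii.gz")):
        subject = dwi.name.split("_", 1)[0]  # e.g. "sub-stroke0005"
        adc = data_dir / "Images-ADC" / f"{subject}_{session}_adc.nii.gz"
        mask = data_dir / "Masks" / f"{subject}_{session}_lesion-msk.nii.gz"
        if not (adc.exists() and mask.exists()):
            continue  # Assumption: upload only cases with all three files
        columns["subject"].append(subject)
        columns["session"].append(session)
        columns["dwi"].append(str(dwi))
        columns["adc"].append(str(adc))
        columns["mask"].append(str(mask))
    return columns
```

Every list in the result has the same length, one entry per complete case, which is the shape a column-wise dataset constructor generally expects.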
### done criteria (phase 1c)

- [ ] Dataset uploaded to HuggingFace with proper schema
- [ ] HuggingFace dataset viewer shows data correctly
- [ ] `load_dataset("new-repo-id")` returns Dataset with expected columns

---

## phase 1d: consumption verification (FUTURE)

### purpose

Verify the full round-trip: Download from HuggingFace using Tobias's fork.

### approach

```python
from datasets import load_dataset

# This should work after Phase 1C
ds = load_dataset("The-Obstacle-Is-The-Way/ISLES24-BIDS")
case = ds["train"][0]
print(case["dwi"].shape)  # Should work!
```
### new adapter function

When Phase 1D is implemented, `adapter.py` will need a new function alongside `build_local_dataset`:

```python
def adapt_hf_case(hf_row: dict) -> CaseFiles:
    """
    Adapt a HuggingFace Dataset row to CaseFiles.

    Args:
        hf_row: Row from load_dataset() with columns:
            - dwi: Nifti feature (nibabel-like object)
            - adc: Nifti feature
            - mask: Nifti feature
            - subject: str

    Returns:
        CaseFiles with materialized paths or nibabel objects
    """
    # Implementation depends on how Nifti() feature exposes data
    # May need to write to temp files or pass nibabel objects directly
    ...
```

This maintains the same `CaseFiles` contract for downstream phases regardless of data source.
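
If the "write to temp files" route is taken, the adaptation could look roughly like the sketch below. It rests on assumptions that Phase 1B still has to confirm: that each decoded image column is a nibabel-style object with a `to_filename()` method, and that rows carry a `subject` value such as `sub-stroke0005`. `CaseFilesSketch` is a stand-in for the real `CaseFiles` (whose definition is not shown in this document), and `materialize_hf_row` is a hypothetical name.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
from typing import Any


@dataclass
class CaseFilesSketch:
    """Stand-in for core.types.CaseFiles; field names follow build_local_dataset."""

    dwi: Path
    adc: Path
    ground_truth: Path | None = None


def materialize_hf_row(hf_row: dict[str, Any], out_dir: Path) -> CaseFilesSketch:
    """Write a row's images into out_dir and return their paths."""
    subject = hf_row["subject"]

    def save(column: str, suffix: str) -> Path | None:
        image = hf_row.get(column)
        if image is None:
            return None
        dest = out_dir / f"{subject}_ses-02_{suffix}.nii.gz"
        image.to_filename(dest)  # Assumes a nibabel-style image object
        return dest

    dwi = save("dwi", "dwi")
    adc = save("adc", "adc")
    if dwi is None or adc is None:
        raise ValueError(f"Row for {subject} lacks a DWI or ADC image")
    return CaseFilesSketch(dwi=dwi, adc=adc, ground_truth=save("mask", "lesion-msk"))
```

Writing files under the same names as the local layout keeps `staging.py` unaware of whether a case came from disk or from the Hub.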
### done criteria (phase 1d)

- [ ] `load_dataset()` works on properly uploaded dataset
- [ ] `adapt_hf_case()` function converts HF rows to CaseFiles
- [ ] Full demo runs with HuggingFace consumption (not just local files)
- [ ] Pitfall documented for future projects

---

## dependencies

No new dependencies needed beyond Phase 0.

## notes

- The original `adapter.py` assumed HF Dataset with columns - COMPLETELY WRONG
- The original `loader.py` called `load_dataset()` directly - FAILS on this dataset
- `staging.py` is still correct - it just needs `CaseFiles` with paths