# phase 1: data access layer

## purpose

Implement a data loading layer that provides typed access to ISLES24 neuroimaging cases. This phase is split into sub-phases due to a critical discovery: the upstream dataset is not properly formatted for HuggingFace consumption.

## critical discovery (2025-12-04)

**`YongchengYAO/ISLES24-MR-Lite` is NOT a proper HuggingFace Dataset.**

| What we expected | What actually exists |
|------------------|---------------------|
| `load_dataset()` returns Dataset with columns | `load_dataset()` FAILS with "no data" |
| Columns: `dwi`, `adc`, `mask`, `participant_id` | No columns - just raw ZIP files |
| Parquet/Arrow format | Three ZIP archives dumped on HF |

**Evidence**: `data/discovery/isles24_schema_report.txt`

This means the demo must be built in phases:

1. **Phase 1A**: Local file loader (works NOW with extracted data)
2. **Phase 1B**: Test Tobias's `Nifti()` feature on local files (proves loading works)
3. **Phase 1C**: Upload properly to HuggingFace (future - proves production pipeline)
4. **Phase 1D**: Consume via Tobias's fork (future - proves full round-trip)

---

## phase 1a: local file loader (CURRENT PRIORITY)

### data location

```
data/isles24/          # Git-ignored
├── Images-DWI/        # 149 files
│   └── sub-stroke{XXXX}_ses-02_dwi.nii.gz
├── Images-ADC/        # 149 files
│   └── sub-stroke{XXXX}_ses-02_adc.nii.gz
└── Masks/             # 149 files
    └── sub-stroke{XXXX}_ses-02_lesion-msk.nii.gz
```

### file naming convention (BIDS-like)

| Component | Pattern | Example |
|-----------|---------|---------|
| Subject ID | `sub-stroke{XXXX}` | `sub-stroke0005` |
| Session | `ses-02` | Always "02" in this dataset |
| Modality | `dwi`, `adc`, `lesion-msk` | - |
| Extension | `.nii.gz` | Compressed NIfTI |

**Subject ID regex**: `sub-stroke(\d{4})_ses-02_.*\.nii\.gz`

**Note**: Subject IDs have gaps (e.g., 0018 missing). Range is 0001-0189, total 149 cases.

### deliverables

- [ ] `src/stroke_deepisles_demo/data/loader.py` - Rewrite with local mode
- [ ] `src/stroke_deepisles_demo/data/adapter.py` - Rewrite for file-based access
- [ ] `src/stroke_deepisles_demo/data/staging.py` - Already correct, no changes
- [ ] Unit tests with synthetic fixtures
- [ ] Integration test with actual extracted data

### interfaces

#### `data/loader.py`

```python
"""Load ISLES24 data from local directory or HuggingFace Hub."""

from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from stroke_deepisles_demo.data.adapter import LocalDataset


@dataclass
class DatasetInfo:
    """Metadata about the dataset."""

    source: str  # "local" or HF dataset ID
    num_cases: int
    modalities: list[str]
    has_ground_truth: bool


def load_isles_dataset(
    source: str | Path = "data/isles24",
    *,
    local_mode: bool = True,  # Default to local for now
) -> LocalDataset:
    """
    Load ISLES24 dataset.

    Args:
        source: Local directory path or HuggingFace dataset ID
        local_mode: If True, treat source as local directory

    Returns:
        Dataset-like object providing case access

    Raises:
        DataLoadError: If data cannot be loaded
    """
    if local_mode or isinstance(source, Path):
        return _load_from_local_directory(Path(source))
    # Future: return _load_from_huggingface(source)
    raise NotImplementedError("HuggingFace mode not yet implemented")


def _load_from_local_directory(data_dir: Path) -> LocalDataset:
    """
    Load cases from extracted local files.

    Expects structure:
        data_dir/
        ├── Images-DWI/sub-stroke{XXXX}_ses-02_dwi.nii.gz
        ├── Images-ADC/sub-stroke{XXXX}_ses-02_adc.nii.gz
        └── Masks/sub-stroke{XXXX}_ses-02_lesion-msk.nii.gz
    """
    ...
```
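For reference, a minimal sketch of how `_load_from_local_directory` could be filled in, assuming it delegates the directory scan to `build_local_dataset` from `adapter.py` (defined below) and that `DataLoadError` is the project's own exception type named in the docstring above:

```python
def _load_from_local_directory(data_dir: Path) -> LocalDataset:
    """Load cases from extracted local files (sketch)."""
    # Runtime import: the module-level adapter import above is
    # TYPE_CHECKING-only, so names from adapter.py are fetched here.
    from stroke_deepisles_demo.data.adapter import build_local_dataset

    if not data_dir.is_dir():
        # DataLoadError: the data-layer exception listed under Raises above.
        raise DataLoadError(f"ISLES24 directory not found: {data_dir}")
    dataset = build_local_dataset(data_dir)
    if len(dataset) == 0:
        raise DataLoadError(f"No complete cases found under {data_dir}")
    return dataset
```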
#### `data/adapter.py`

```python
"""Provide typed access to ISLES24 cases."""

from __future__ import annotations

import re
from collections.abc import Iterator
from dataclasses import dataclass
from pathlib import Path

from stroke_deepisles_demo.core.types import CaseFiles


@dataclass
class LocalDataset:
    """File-based dataset for local ISLES24 data."""

    data_dir: Path
    cases: dict[str, CaseFiles]  # subject_id -> files

    def __len__(self) -> int:
        return len(self.cases)

    def __iter__(self) -> Iterator[str]:
        return iter(self.cases.keys())

    def list_case_ids(self) -> list[str]:
        """Return sorted list of subject IDs."""
        return sorted(self.cases.keys())

    def get_case(self, case_id: str | int) -> CaseFiles:
        """Get files for a case by ID or index."""
        if isinstance(case_id, int):
            case_id = self.list_case_ids()[case_id]
        return self.cases[case_id]


# Subject ID extraction
SUBJECT_PATTERN = re.compile(r"sub-(stroke\d{4})_ses-\d+_.*\.nii\.gz")


def parse_subject_id(filename: str) -> str | None:
    """Extract subject ID from BIDS filename."""
    match = SUBJECT_PATTERN.match(filename)
    return f"sub-{match.group(1)}" if match else None


def build_local_dataset(data_dir: Path) -> LocalDataset:
    """
    Scan directory and build case mapping.

    Matches DWI + ADC + Mask files by subject ID.
    """
    dwi_dir = data_dir / "Images-DWI"
    adc_dir = data_dir / "Images-ADC"
    mask_dir = data_dir / "Masks"

    cases: dict[str, CaseFiles] = {}

    # Scan DWI files to get subject IDs
    for dwi_file in dwi_dir.glob("*.nii.gz"):
        subject_id = parse_subject_id(dwi_file.name)
        if not subject_id:
            continue

        # Find matching ADC and Mask
        adc_file = adc_dir / dwi_file.name.replace("_dwi.", "_adc.")
        mask_file = mask_dir / dwi_file.name.replace("_dwi.", "_lesion-msk.")

        if not adc_file.exists():
            continue  # Skip incomplete cases

        cases[subject_id] = CaseFiles(
            dwi=dwi_file,
            adc=adc_file,
            ground_truth=mask_file if mask_file.exists() else None,
        )

    return LocalDataset(data_dir=data_dir, cases=cases)
```
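As a quick sanity check against the extracted data (not itself a deliverable), the adapter could be exercised like this; the paths and counts assume the `data/isles24/` layout above:

```python
from pathlib import Path

from stroke_deepisles_demo.data.adapter import build_local_dataset

dataset = build_local_dataset(Path("data/isles24"))
print(len(dataset))                  # Expect 149 complete cases
print(dataset.list_case_ids()[:3])   # ['sub-stroke0001', ...]

case = dataset.get_case("sub-stroke0005")
print(case.dwi)           # .../Images-DWI/sub-stroke0005_ses-02_dwi.nii.gz
print(case.ground_truth)  # Mask path, or None if the case has no mask
```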
### synthetic fixture structure

Unit tests MUST use fixtures that replicate the **exact** directory structure.

Add to `tests/conftest.py`:

```python
import nibabel as nib
import numpy as np
import pytest


@pytest.fixture
def synthetic_isles_dir(temp_dir: Path) -> Path:
    """
    Create synthetic ISLES24-like directory structure.

    Structure:
        temp_dir/
        ├── Images-DWI/
        │   ├── sub-stroke0001_ses-02_dwi.nii.gz
        │   └── sub-stroke0002_ses-02_dwi.nii.gz
        ├── Images-ADC/
        │   ├── sub-stroke0001_ses-02_adc.nii.gz
        │   └── sub-stroke0002_ses-02_adc.nii.gz
        └── Masks/
            ├── sub-stroke0001_ses-02_lesion-msk.nii.gz
            └── sub-stroke0002_ses-02_lesion-msk.nii.gz
    """
    dwi_dir = temp_dir / "Images-DWI"
    adc_dir = temp_dir / "Images-ADC"
    mask_dir = temp_dir / "Masks"
    dwi_dir.mkdir()
    adc_dir.mkdir()
    mask_dir.mkdir()

    for subject_num in [1, 2]:
        subject_id = f"sub-stroke{subject_num:04d}"

        # Create DWI
        dwi_data = np.random.rand(10, 10, 5).astype(np.float32)
        dwi_img = nib.Nifti1Image(dwi_data, affine=np.eye(4))
        nib.save(dwi_img, dwi_dir / f"{subject_id}_ses-02_dwi.nii.gz")

        # Create ADC
        adc_data = np.random.rand(10, 10, 5).astype(np.float32) * 2000
        adc_img = nib.Nifti1Image(adc_data, affine=np.eye(4))
        nib.save(adc_img, adc_dir / f"{subject_id}_ses-02_adc.nii.gz")

        # Create Mask
        mask_data = (np.random.rand(10, 10, 5) > 0.9).astype(np.uint8)
        mask_img = nib.Nifti1Image(mask_data, affine=np.eye(4))
        nib.save(mask_img, mask_dir / f"{subject_id}_ses-02_lesion-msk.nii.gz")

    return temp_dir
```

### tdd plan

```python
# tests/data/test_loader.py

def test_load_from_local_returns_local_dataset(synthetic_isles_dir):
    """Local mode returns LocalDataset."""
    ...

def test_load_from_local_finds_all_cases(synthetic_isles_dir):
    """Finds all cases in synthetic structure."""
    ...


# tests/data/test_adapter.py

def test_parse_subject_id_extracts_correctly():
    """Extracts subject ID from BIDS filename."""
    assert parse_subject_id("sub-stroke0005_ses-02_dwi.nii.gz") == "sub-stroke0005"

def test_build_local_dataset_matches_files(synthetic_isles_dir):
    """Matches DWI, ADC, Mask by subject ID."""
    ...

def test_get_case_returns_case_files(synthetic_isles_dir):
    """get_case returns CaseFiles with correct paths."""
    ...
```

### done criteria (phase 1a)

- [ ] `uv run pytest tests/data/ -v` passes
- [ ] Can load all 149 cases from `data/isles24/`
- [ ] `list_case_ids()` returns 149 subject IDs
- [ ] `get_case("sub-stroke0005")` returns valid CaseFiles
- [ ] Type checking passes: `uv run mypy src/stroke_deepisles_demo/data/`

---

## phase 1b: test tobias's nifti feature (NEXT)

### purpose

Verify that Tobias's `Nifti()` feature type from the datasets fork can correctly load/parse NIfTI files. This proves the **loading** part of the consumption pipeline works, even though the **download** part is broken.

### approach

```python
# Test script to verify Nifti() feature works on local files
from datasets import Dataset, Features, Value
from datasets.features import Nifti  # From Tobias's fork

# Create a simple dataset from local files
features = Features({
    "subject_id": Value("string"),
    "dwi": Nifti(),
    "adc": Nifti(),
    "mask": Nifti(),
})

# Load a single case and verify Nifti() decodes correctly.
# Assumes Nifti() accepts path strings and decodes on access, the way
# the upstream Image() feature does - to be confirmed against the fork.
ds = Dataset.from_dict(
    {
        "subject_id": ["sub-stroke0001"],
        "dwi": ["data/isles24/Images-DWI/sub-stroke0001_ses-02_dwi.nii.gz"],
        "adc": ["data/isles24/Images-ADC/sub-stroke0001_ses-02_adc.nii.gz"],
        "mask": ["data/isles24/Masks/sub-stroke0001_ses-02_lesion-msk.nii.gz"],
    },
    features=features,
)
case = ds[0]  # Decoding should happen here; check shape/dtype of case["dwi"]
```

### done criteria (phase 1b)

- [ ] Tobias's `Nifti()` feature loads local `.nii.gz` files
- [ ] Decoded NIfTI has correct shape/dtype
- [ ] Can access voxel data via nibabel-like interface

---

## phase 1c: proper huggingface upload (FUTURE)

### purpose

Re-upload ISLES24 data to HuggingFace **properly** using the arc-aphasia-bids approach. This proves the **production** pipeline works.

### approach

1. Use BIDS loader from Tobias's fork
2. Create proper parquet schema with columns:
   - `subject`: string
   - `session`: string
   - `dwi`: Nifti()
   - `adc`: Nifti()
   - `mask`: Nifti()
3. Upload to new HuggingFace repo (e.g., `The-Obstacle-Is-The-Way/ISLES24-BIDS`) - see the sketch after this list
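A minimal sketch of what that upload could look like. It reuses the Phase 1A adapter rather than the fork's BIDS loader, and assumes the fork's `Nifti()` feature embeds file bytes at upload time the way upstream `Image()` does; the repo ID comes from the plan above, everything else is illustrative:

```python
from pathlib import Path

from datasets import Dataset, Features, Value
from datasets.features import Nifti  # From Tobias's fork

from stroke_deepisles_demo.data.adapter import build_local_dataset

local = build_local_dataset(Path("data/isles24"))

rows: dict[str, list] = {"subject": [], "session": [], "dwi": [], "adc": [], "mask": []}
for case_id in local.list_case_ids():
    case = local.get_case(case_id)
    rows["subject"].append(case_id)
    rows["session"].append("ses-02")          # Constant in this dataset
    rows["dwi"].append(str(case.dwi))
    rows["adc"].append(str(case.adc))
    rows["mask"].append(str(case.ground_truth))  # All 149 cases ship with masks

features = Features({
    "subject": Value("string"),
    "session": Value("string"),
    "dwi": Nifti(),
    "adc": Nifti(),
    "mask": Nifti(),
})

hf_ds = Dataset.from_dict(rows, features=features)
hf_ds.push_to_hub("The-Obstacle-Is-The-Way/ISLES24-BIDS")
```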
### done criteria (phase 1c)

- [ ] Dataset uploaded to HuggingFace with proper schema
- [ ] HuggingFace dataset viewer shows data correctly
- [ ] `load_dataset("new-repo-id")` returns Dataset with expected columns

---

## phase 1d: consumption verification (FUTURE)

### purpose

Verify the full round-trip: download from HuggingFace using Tobias's fork.

### approach

```python
from datasets import load_dataset

# This should work after Phase 1C
ds = load_dataset("The-Obstacle-Is-The-Way/ISLES24-BIDS")
case = ds["train"][0]
print(case["dwi"].shape)  # Should work!
```

### new adapter function

When Phase 1D is implemented, `adapter.py` will need a new function alongside `build_local_dataset`:

```python
def adapt_hf_case(hf_row: dict) -> CaseFiles:
    """
    Adapt a HuggingFace Dataset row to CaseFiles.

    Args:
        hf_row: Row from load_dataset() with columns:
            - dwi: Nifti feature (nibabel-like object)
            - adc: Nifti feature
            - mask: Nifti feature
            - subject: str

    Returns:
        CaseFiles with materialized paths or nibabel objects
    """
    # Implementation depends on how Nifti() feature exposes data
    # May need to write to temp files or pass nibabel objects directly
    ...
```

This maintains the same `CaseFiles` contract for downstream phases regardless of data source. A sketch of one possible implementation appears at the end of this document.

### done criteria (phase 1d)

- [ ] `load_dataset()` works on properly uploaded dataset
- [ ] `adapt_hf_case()` function converts HF rows to CaseFiles
- [ ] Full demo runs with HuggingFace consumption (not just local files)
- [ ] Documents the pitfall for future projects

---

## dependencies

No new dependencies needed beyond Phase 0.

## notes

- The original `adapter.py` assumed HF Dataset with columns - COMPLETELY WRONG
- The original `loader.py` called `load_dataset()` directly - FAILS on this dataset
- `staging.py` is still correct - it just needs `CaseFiles` with paths
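For reference, one possible shape for `adapt_hf_case`, following the temp-file strategy mentioned in the stub's comments. It assumes the `Nifti()` feature decodes each column to a nibabel image object; both that and the exact `CaseFiles` fields need to be checked against the fork and `core/types.py` before Phase 1D:

```python
import tempfile
from pathlib import Path

import nibabel as nib

from stroke_deepisles_demo.core.types import CaseFiles


def adapt_hf_case(hf_row: dict) -> CaseFiles:
    """Adapt a HuggingFace Dataset row to CaseFiles (sketch).

    Materializes each decoded image to a temp file so downstream code
    (e.g., staging.py) keeps working with filesystem paths.
    """
    out_dir = Path(tempfile.mkdtemp(prefix=hf_row["subject"]))
    paths: dict[str, Path] = {}
    for key in ("dwi", "adc", "mask"):
        img = hf_row[key]  # Assumed nibabel-like object from Nifti()
        path = out_dir / f"{hf_row['subject']}_{key}.nii.gz"
        nib.save(img, path)
        paths[key] = path
    return CaseFiles(
        dwi=paths["dwi"],
        adc=paths["adc"],
        ground_truth=paths["mask"],
    )
```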