# phase 1: data access layer
## purpose
Implement a data loading layer that provides typed access to ISLES24 neuroimaging cases. This phase is split into sub-phases due to a critical discovery: the upstream dataset is not properly formatted for HuggingFace consumption.
## critical discovery (2025-12-04)
**`YongchengYAO/ISLES24-MR-Lite` is NOT a proper HuggingFace Dataset.**
| What we expected | What actually exists |
|------------------|---------------------|
| `load_dataset()` returns Dataset with columns | `load_dataset()` FAILS with "no data" |
| Columns: `dwi`, `adc`, `mask`, `participant_id` | No columns - just raw ZIP files |
| Parquet/Arrow format | Three ZIP archives dumped on HF |
**Evidence**: `data/discovery/isles24_schema_report.txt`
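For context, a minimal repro of the failure mode (a sketch; the exact error text varies by `datasets` version and is illustrative, not verbatim):
```python
from datasets import load_dataset

try:
    ds = load_dataset("YongchengYAO/ISLES24-MR-Lite")
except Exception as exc:
    # The repo contains only ZIP archives, so datasets finds no loadable data files
    print(f"load_dataset failed: {exc}")
```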
This means the demo must be built in phases:
1. **Phase 1A**: Local file loader (works NOW with extracted data)
2. **Phase 1B**: Test Tobias's `Nifti()` feature on local files (proves loading works)
3. **Phase 1C**: Upload properly to HuggingFace (future - proves production pipeline)
4. **Phase 1D**: Consume via Tobias's fork (future - proves full round-trip)
---
## phase 1a: local file loader (CURRENT PRIORITY)
### data location
```
data/isles24/                  # Git-ignored
├── Images-DWI/                # 149 files
│   └── sub-stroke{XXXX}_ses-02_dwi.nii.gz
├── Images-ADC/                # 149 files
│   └── sub-stroke{XXXX}_ses-02_adc.nii.gz
└── Masks/                     # 149 files
    └── sub-stroke{XXXX}_ses-02_lesion-msk.nii.gz
```
### file naming convention (BIDS-like)
| Component | Pattern | Example |
|-----------|---------|---------|
| Subject ID | `sub-stroke{XXXX}` | `sub-stroke0005` |
| Session | `ses-02` | Always "02" in this dataset |
| Modality | `dwi`, `adc`, `lesion-msk` | - |
| Extension | `.nii.gz` | Compressed NIfTI |
**Subject ID regex**: `sub-stroke(\d{4})_ses-02_.*\.nii\.gz`
**Note**: Subject IDs have gaps (e.g., 0018 is missing). IDs span 0001-0189, but only 149 cases exist.
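A quick sanity check of the pattern (minimal sketch; the variable name is illustrative, and the capture group grabs the four-digit number):
```python
import re

SUBJECT_ID_RE = re.compile(r"sub-stroke(\d{4})_ses-02_.*\.nii\.gz")

match = SUBJECT_ID_RE.match("sub-stroke0005_ses-02_dwi.nii.gz")
assert match is not None and match.group(1) == "0005"
```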
### deliverables
- [ ] `src/stroke_deepisles_demo/data/loader.py` - Rewrite with local mode
- [ ] `src/stroke_deepisles_demo/data/adapter.py` - Rewrite for file-based access
- [ ] `src/stroke_deepisles_demo/data/staging.py` - Verify no changes needed (already correct)
- [ ] Unit tests with synthetic fixtures
- [ ] Integration test with actual extracted data
### interfaces
#### `data/loader.py`
```python
"""Load ISLES24 data from local directory or HuggingFace Hub."""
from __future__ import annotations
from dataclasses import dataclass
from pathlib import Path
from typing import TYPE_CHECKING
if TYPE_CHECKING:
from stroke_deepisles_demo.data.adapter import LocalDataset
@dataclass
class DatasetInfo:
"""Metadata about the dataset."""
source: str # "local" or HF dataset ID
num_cases: int
modalities: list[str]
has_ground_truth: bool
def load_isles_dataset(
source: str | Path = "data/isles24",
*,
local_mode: bool = True, # Default to local for now
) -> LocalDataset:
"""
Load ISLES24 dataset.
Args:
source: Local directory path or HuggingFace dataset ID
local_mode: If True, treat source as local directory
Returns:
Dataset-like object providing case access
Raises:
DataLoadError: If data cannot be loaded
"""
if local_mode or isinstance(source, Path):
return _load_from_local_directory(Path(source))
# Future: return _load_from_huggingface(source)
raise NotImplementedError("HuggingFace mode not yet implemented")
def _load_from_local_directory(data_dir: Path) -> LocalDataset:
"""
Load cases from extracted local files.
Expects structure:
data_dir/
βββ Images-DWI/sub-stroke{XXXX}_ses-02_dwi.nii.gz
βββ Images-ADC/sub-stroke{XXXX}_ses-02_adc.nii.gz
βββ Masks/sub-stroke{XXXX}_ses-02_lesion-msk.nii.gz
"""
...
```
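The docstring above references `DataLoadError`, which none of this phase's files define. A minimal sketch, assuming it lives somewhere like a shared errors module (the location is hypothetical):
```python
# Hypothetical location: src/stroke_deepisles_demo/core/errors.py
class DataLoadError(RuntimeError):
    """Raised when the ISLES24 dataset cannot be loaded from the given source."""
```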
#### `data/adapter.py`
```python
"""Provide typed access to ISLES24 cases."""
from __future__ import annotations
import re
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator
from stroke_deepisles_demo.core.types import CaseFiles
@dataclass
class LocalDataset:
"""File-based dataset for local ISLES24 data."""
data_dir: Path
cases: dict[str, CaseFiles] # subject_id -> files
def __len__(self) -> int:
return len(self.cases)
def __iter__(self) -> Iterator[str]:
return iter(self.cases.keys())
def list_case_ids(self) -> list[str]:
"""Return sorted list of subject IDs."""
return sorted(self.cases.keys())
def get_case(self, case_id: str | int) -> CaseFiles:
"""Get files for a case by ID or index."""
if isinstance(case_id, int):
case_id = self.list_case_ids()[case_id]
return self.cases[case_id]
# Subject ID extraction
SUBJECT_PATTERN = re.compile(r"sub-(stroke\d{4})_ses-\d+_.*\.nii\.gz")
def parse_subject_id(filename: str) -> str | None:
"""Extract subject ID from BIDS filename."""
match = SUBJECT_PATTERN.match(filename)
return f"sub-{match.group(1)}" if match else None
def build_local_dataset(data_dir: Path) -> LocalDataset:
"""
Scan directory and build case mapping.
Matches DWI + ADC + Mask files by subject ID.
"""
dwi_dir = data_dir / "Images-DWI"
adc_dir = data_dir / "Images-ADC"
mask_dir = data_dir / "Masks"
cases: dict[str, CaseFiles] = {}
# Scan DWI files to get subject IDs
for dwi_file in dwi_dir.glob("*.nii.gz"):
subject_id = parse_subject_id(dwi_file.name)
if not subject_id:
continue
# Find matching ADC and Mask
adc_file = adc_dir / dwi_file.name.replace("_dwi.", "_adc.")
mask_file = mask_dir / dwi_file.name.replace("_dwi.", "_lesion-msk.")
if not adc_file.exists():
continue # Skip incomplete cases
cases[subject_id] = CaseFiles(
dwi=dwi_file,
adc=adc_file,
ground_truth=mask_file if mask_file.exists() else None,
)
return LocalDataset(data_dir=data_dir, cases=cases)
```
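Expected usage once implemented (a sketch against the interfaces above; case IDs and counts are illustrative):
```python
from pathlib import Path

ds = build_local_dataset(Path("data/isles24"))
print(len(ds))                 # Expected: 149
print(ds.list_case_ids()[:3])  # e.g., ["sub-stroke0001", "sub-stroke0002", ...]
case = ds.get_case("sub-stroke0005")
print(case.dwi, case.adc, case.ground_truth)
```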
### synthetic fixture structure
Unit tests MUST use fixtures that replicate the **exact** directory structure. Add to `tests/conftest.py`:
```python
import nibabel as nib
import numpy as np
import pytest
from pathlib import Path


@pytest.fixture
def synthetic_isles_dir(temp_dir: Path) -> Path:
    """
    Create synthetic ISLES24-like directory structure.

    Structure:
    temp_dir/
    ├── Images-DWI/
    │   ├── sub-stroke0001_ses-02_dwi.nii.gz
    │   └── sub-stroke0002_ses-02_dwi.nii.gz
    ├── Images-ADC/
    │   ├── sub-stroke0001_ses-02_adc.nii.gz
    │   └── sub-stroke0002_ses-02_adc.nii.gz
    └── Masks/
        ├── sub-stroke0001_ses-02_lesion-msk.nii.gz
        └── sub-stroke0002_ses-02_lesion-msk.nii.gz
    """
    dwi_dir = temp_dir / "Images-DWI"
    adc_dir = temp_dir / "Images-ADC"
    mask_dir = temp_dir / "Masks"
    dwi_dir.mkdir()
    adc_dir.mkdir()
    mask_dir.mkdir()

    for subject_num in [1, 2]:
        subject_id = f"sub-stroke{subject_num:04d}"

        # Create DWI
        dwi_data = np.random.rand(10, 10, 5).astype(np.float32)
        dwi_img = nib.Nifti1Image(dwi_data, affine=np.eye(4))
        nib.save(dwi_img, dwi_dir / f"{subject_id}_ses-02_dwi.nii.gz")

        # Create ADC
        adc_data = np.random.rand(10, 10, 5).astype(np.float32) * 2000
        adc_img = nib.Nifti1Image(adc_data, affine=np.eye(4))
        nib.save(adc_img, adc_dir / f"{subject_id}_ses-02_adc.nii.gz")

        # Create Mask
        mask_data = (np.random.rand(10, 10, 5) > 0.9).astype(np.uint8)
        mask_img = nib.Nifti1Image(mask_data, affine=np.eye(4))
        nib.save(mask_img, mask_dir / f"{subject_id}_ses-02_lesion-msk.nii.gz")

    return temp_dir
```
### tdd plan
```python
# tests/data/test_loader.py

def test_load_from_local_returns_local_dataset(synthetic_isles_dir):
    """Local mode returns LocalDataset."""
    ...

def test_load_from_local_finds_all_cases(synthetic_isles_dir):
    """Finds all cases in synthetic structure."""
    ...


# tests/data/test_adapter.py

def test_parse_subject_id_extracts_correctly():
    """Extracts subject ID from BIDS filename."""
    assert parse_subject_id("sub-stroke0005_ses-02_dwi.nii.gz") == "sub-stroke0005"

def test_build_local_dataset_matches_files(synthetic_isles_dir):
    """Matches DWI, ADC, Mask by subject ID."""
    ...

def test_get_case_returns_case_files(synthetic_isles_dir):
    """get_case returns CaseFiles with correct paths."""
    ...
```
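As a sketch of how one of the elided bodies might read (assuming the fixture and `build_local_dataset` above):
```python
def test_build_local_dataset_matches_files(synthetic_isles_dir):
    """Matches DWI, ADC, Mask by subject ID."""
    ds = build_local_dataset(synthetic_isles_dir)
    assert ds.list_case_ids() == ["sub-stroke0001", "sub-stroke0002"]
    case = ds.get_case("sub-stroke0001")
    assert case.dwi.exists()
    assert case.adc.exists()
    assert case.ground_truth is not None and case.ground_truth.exists()
```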
### done criteria (phase 1a)
- [ ] `uv run pytest tests/data/ -v` passes
- [ ] Can load all 149 cases from `data/isles24/`
- [ ] `list_case_ids()` returns 149 subject IDs
- [ ] `get_case("sub-stroke0005")` returns valid CaseFiles
- [ ] Type checking passes: `uv run mypy src/stroke_deepisles_demo/data/`
---
## phase 1b: test tobias's nifti feature (NEXT)
### purpose
Verify that Tobias's `Nifti()` feature type from the datasets fork can correctly load/parse NIfTI files. This proves the **loading** part of the consumption pipeline works, even though the **download** part is broken.
### approach
```python
# Test script to verify Nifti() feature works on local files
from datasets import Features, Value
from datasets.features import Nifti  # From Tobias's fork

# Create a simple dataset from local files
features = Features({
    "subject_id": Value("string"),
    "dwi": Nifti(),
    "adc": Nifti(),
    "mask": Nifti(),
})

# Load a single case and verify Nifti() decodes correctly
```
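A concrete sketch of that check; whether `Nifti()` accepts file paths on encode (the way upstream `Image()`/`Audio()` do) is exactly what this phase verifies, so treat the call pattern as an assumption:
```python
from datasets import Dataset

# `features` as defined in the block above
ds = Dataset.from_dict(
    {
        "subject_id": ["sub-stroke0005"],
        "dwi": ["data/isles24/Images-DWI/sub-stroke0005_ses-02_dwi.nii.gz"],
        "adc": ["data/isles24/Images-ADC/sub-stroke0005_ses-02_adc.nii.gz"],
        "mask": ["data/isles24/Masks/sub-stroke0005_ses-02_lesion-msk.nii.gz"],
    },
    features=features,
)
decoded = ds[0]["dwi"]  # Expect a nibabel-like object if decoding works
print(type(decoded), getattr(decoded, "shape", None))
```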
### done criteria (phase 1b)
- [ ] Tobias's `Nifti()` feature loads local `.nii.gz` files
- [ ] Decoded NIfTI has correct shape/dtype
- [ ] Can access voxel data via nibabel-like interface
---
## phase 1c: proper huggingface upload (FUTURE)
### purpose
Re-upload ISLES24 data to HuggingFace **properly** using the arc-aphasia-bids approach. This proves the **production** pipeline works.
### approach
1. Use BIDS loader from Tobias's fork
2. Create proper parquet schema with columns:
- `subject`: string
- `session`: string
- `dwi`: Nifti()
- `adc`: Nifti()
- `mask`: Nifti()
3. Upload to a new HuggingFace repo (e.g., `The-Obstacle-Is-The-Way/ISLES24-BIDS`); see the sketch below
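A minimal upload sketch, assuming the fork's `Nifti()` encodes from file paths and is compatible with `push_to_hub`:
```python
from datasets import Dataset, Features, Value
from datasets.features import Nifti  # From Tobias's fork

features = Features({
    "subject": Value("string"),
    "session": Value("string"),
    "dwi": Nifti(),
    "adc": Nifti(),
    "mask": Nifti(),
})

# One illustrative row; in practice, build these lists from the Phase 1A loader
rows = {
    "subject": ["sub-stroke0005"],
    "session": ["ses-02"],
    "dwi": ["data/isles24/Images-DWI/sub-stroke0005_ses-02_dwi.nii.gz"],
    "adc": ["data/isles24/Images-ADC/sub-stroke0005_ses-02_adc.nii.gz"],
    "mask": ["data/isles24/Masks/sub-stroke0005_ses-02_lesion-msk.nii.gz"],
}
ds = Dataset.from_dict(rows, features=features)
ds.push_to_hub("The-Obstacle-Is-The-Way/ISLES24-BIDS")
```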
### done criteria (phase 1c)
- [ ] Dataset uploaded to HuggingFace with proper schema
- [ ] HuggingFace dataset viewer shows data correctly
- [ ] `load_dataset("new-repo-id")` returns Dataset with expected columns
---
## phase 1d: consumption verification (FUTURE)
### purpose
Verify the full round-trip: download from HuggingFace using Tobias's fork and decode into usable cases.
### approach
```python
from datasets import load_dataset
# This should work after Phase 1C
ds = load_dataset("The-Obstacle-Is-The-Way/ISLES24-BIDS")
case = ds["train"][0]
print(case["dwi"].shape) # Should work!
```
### new adapter function
When Phase 1D is implemented, `adapter.py` will need a new function alongside `build_local_dataset`:
```python
def adapt_hf_case(hf_row: dict) -> CaseFiles:
    """
    Adapt a HuggingFace Dataset row to CaseFiles.

    Args:
        hf_row: Row from load_dataset() with columns:
            - dwi: Nifti feature (nibabel-like object)
            - adc: Nifti feature
            - mask: Nifti feature
            - subject: str

    Returns:
        CaseFiles with materialized paths or nibabel objects
    """
    # Implementation depends on how Nifti() feature exposes data
    # May need to write to temp files or pass nibabel objects directly
    ...
```
This maintains the same `CaseFiles` contract for downstream phases regardless of data source.
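A minimal sketch of the temp-file route, assuming decoded rows hold nibabel images (which support `to_filename()`) and that `CaseFiles` stores paths; the helper name is hypothetical:
```python
import tempfile
from pathlib import Path

from stroke_deepisles_demo.core.types import CaseFiles


def adapt_hf_case_via_tempfiles(hf_row: dict) -> CaseFiles:
    """Hypothetical variant: materialize decoded nibabel images to temp files."""
    tmp = Path(tempfile.mkdtemp(prefix=f"{hf_row['subject']}_"))
    paths: dict[str, Path] = {}
    for key in ("dwi", "adc", "mask"):
        out = tmp / f"{hf_row['subject']}_{key}.nii.gz"
        hf_row[key].to_filename(str(out))  # nibabel images can write themselves out
        paths[key] = out
    return CaseFiles(dwi=paths["dwi"], adc=paths["adc"], ground_truth=paths["mask"])
```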
### done criteria (phase 1d)
- [ ] `load_dataset()` works on properly uploaded dataset
- [ ] `adapt_hf_case()` function converts HF rows to CaseFiles
- [ ] Full demo runs with HuggingFace consumption (not just local files)
- [ ] The pitfall is documented for future projects
---
## dependencies
No new dependencies needed beyond Phase 0.
## notes
- The original `adapter.py` assumed an HF Dataset with columns - COMPLETELY WRONG
- The original `loader.py` called `load_dataset()` directly - FAILS on this dataset
- `staging.py` is still correct - it just needs `CaseFiles` with paths