# phase 1: data access layer
## purpose
Implement a data loading layer that provides typed access to ISLES24 neuroimaging cases. This phase is split into sub-phases due to a critical discovery: the upstream dataset is not properly formatted for HuggingFace consumption.
## critical discovery (2025-12-04)
**`YongchengYAO/ISLES24-MR-Lite` is NOT a proper HuggingFace Dataset.**
| What we expected | What actually exists |
|------------------|---------------------|
| `load_dataset()` returns Dataset with columns | `load_dataset()` FAILS with "no data" |
| Columns: `dwi`, `adc`, `mask`, `participant_id` | No columns - just raw ZIP files |
| Parquet/Arrow format | Three ZIP archives dumped on HF |
**Evidence**: `data/discovery/isles24_schema_report.txt`
This means the demo must be built in phases:
1. **Phase 1A**: Local file loader (works NOW with extracted data)
2. **Phase 1B**: Test Tobias's `Nifti()` feature on local files (proves loading works)
3. **Phase 1C**: Upload properly to HuggingFace (future - proves production pipeline)
4. **Phase 1D**: Consume via Tobias's fork (future - proves full round-trip)
---
## phase 1a: local file loader (CURRENT PRIORITY)
### data location
```
data/isles24/                 # Git-ignored
β”œβ”€β”€ Images-DWI/               # 149 files
β”‚   └── sub-stroke{XXXX}_ses-02_dwi.nii.gz
β”œβ”€β”€ Images-ADC/               # 149 files
β”‚   └── sub-stroke{XXXX}_ses-02_adc.nii.gz
└── Masks/                    # 149 files
    └── sub-stroke{XXXX}_ses-02_lesion-msk.nii.gz
```
### file naming convention (BIDS-like)
| Component | Pattern | Example |
|-----------|---------|---------|
| Subject ID | `sub-stroke{XXXX}` | `sub-stroke0005` |
| Session | `ses-02` | Always "02" in this dataset |
| Modality | `dwi`, `adc`, `lesion-msk` | - |
| Extension | `.nii.gz` | Compressed NIfTI |
**Subject ID regex**: `sub-(stroke\d{4})_ses-\d+_.*\.nii\.gz` (as compiled in `adapter.py`; group 1 captures `strokeXXXX`)
**Note**: Subject IDs have gaps (e.g., 0018 missing). Range is 0001-0189, total 149 cases.
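The naming convention can be exercised standalone. The snippet below mirrors the `parse_subject_id` helper specified for `data/adapter.py` later in this document:

```python
from __future__ import annotations
import re

# Pattern as compiled in adapter.py: group 1 captures "strokeXXXX"
SUBJECT_PATTERN = re.compile(r"sub-(stroke\d{4})_ses-\d+_.*\.nii\.gz")

def parse_subject_id(filename: str) -> str | None:
    """Extract 'sub-strokeXXXX' from a BIDS-like filename, or None."""
    match = SUBJECT_PATTERN.match(filename)
    return f"sub-{match.group(1)}" if match else None

print(parse_subject_id("sub-stroke0005_ses-02_dwi.nii.gz"))         # sub-stroke0005
print(parse_subject_id("sub-stroke0042_ses-02_lesion-msk.nii.gz"))  # sub-stroke0042
print(parse_subject_id("not-a-bids-file.txt"))                      # None
```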
### deliverables
- [ ] `src/stroke_deepisles_demo/data/loader.py` - Rewrite with local mode
- [ ] `src/stroke_deepisles_demo/data/adapter.py` - Rewrite for file-based access
- [ ] `src/stroke_deepisles_demo/data/staging.py` - Already correct, no changes
- [ ] Unit tests with synthetic fixtures
- [ ] Integration test with actual extracted data
### interfaces
#### `data/loader.py`
```python
"""Load ISLES24 data from local directory or HuggingFace Hub."""

from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from stroke_deepisles_demo.data.adapter import LocalDataset


@dataclass
class DatasetInfo:
    """Metadata about the dataset."""

    source: str  # "local" or HF dataset ID
    num_cases: int
    modalities: list[str]
    has_ground_truth: bool


def load_isles_dataset(
    source: str | Path = "data/isles24",
    *,
    local_mode: bool = True,  # Default to local for now
) -> LocalDataset:
    """
    Load ISLES24 dataset.

    Args:
        source: Local directory path or HuggingFace dataset ID
        local_mode: If True, treat source as local directory

    Returns:
        Dataset-like object providing case access

    Raises:
        DataLoadError: If data cannot be loaded
    """
    if local_mode or isinstance(source, Path):
        return _load_from_local_directory(Path(source))
    # Future: return _load_from_huggingface(source)
    raise NotImplementedError("HuggingFace mode not yet implemented")


def _load_from_local_directory(data_dir: Path) -> LocalDataset:
    """
    Load cases from extracted local files.

    Expects structure:
        data_dir/
        β”œβ”€β”€ Images-DWI/sub-stroke{XXXX}_ses-02_dwi.nii.gz
        β”œβ”€β”€ Images-ADC/sub-stroke{XXXX}_ses-02_adc.nii.gz
        └── Masks/sub-stroke{XXXX}_ses-02_lesion-msk.nii.gz
    """
    ...
```
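For orientation, the `CaseFiles` contract from `core.types` (defined in Phase 0) is assumed to look roughly like the sketch below. The field names are inferred from how `adapter.py` constructs it; the actual Phase 0 definition governs and may differ:

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path

# Hypothetical sketch of core.types.CaseFiles -- field names inferred from
# how build_local_dataset() constructs it; the real Phase 0 definition governs.
@dataclass(frozen=True)
class CaseFiles:
    dwi: Path
    adc: Path
    ground_truth: Path | None = None  # Mask may be absent

case = CaseFiles(
    dwi=Path("data/isles24/Images-DWI/sub-stroke0005_ses-02_dwi.nii.gz"),
    adc=Path("data/isles24/Images-ADC/sub-stroke0005_ses-02_adc.nii.gz"),
)
print(case.ground_truth)  # None
```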
#### `data/adapter.py`
```python
"""Provide typed access to ISLES24 cases."""

from __future__ import annotations

import re
from collections.abc import Iterator
from dataclasses import dataclass
from pathlib import Path

from stroke_deepisles_demo.core.types import CaseFiles


@dataclass
class LocalDataset:
    """File-based dataset for local ISLES24 data."""

    data_dir: Path
    cases: dict[str, CaseFiles]  # subject_id -> files

    def __len__(self) -> int:
        return len(self.cases)

    def __iter__(self) -> Iterator[str]:
        return iter(self.cases.keys())

    def list_case_ids(self) -> list[str]:
        """Return sorted list of subject IDs."""
        return sorted(self.cases.keys())

    def get_case(self, case_id: str | int) -> CaseFiles:
        """Get files for a case by ID or index."""
        if isinstance(case_id, int):
            case_id = self.list_case_ids()[case_id]
        return self.cases[case_id]


# Subject ID extraction
SUBJECT_PATTERN = re.compile(r"sub-(stroke\d{4})_ses-\d+_.*\.nii\.gz")


def parse_subject_id(filename: str) -> str | None:
    """Extract subject ID from BIDS filename."""
    match = SUBJECT_PATTERN.match(filename)
    return f"sub-{match.group(1)}" if match else None


def build_local_dataset(data_dir: Path) -> LocalDataset:
    """
    Scan directory and build case mapping.

    Matches DWI + ADC + Mask files by subject ID.
    """
    dwi_dir = data_dir / "Images-DWI"
    adc_dir = data_dir / "Images-ADC"
    mask_dir = data_dir / "Masks"

    cases: dict[str, CaseFiles] = {}
    # Scan DWI files to get subject IDs
    for dwi_file in dwi_dir.glob("*.nii.gz"):
        subject_id = parse_subject_id(dwi_file.name)
        if not subject_id:
            continue
        # Find matching ADC and Mask
        adc_file = adc_dir / dwi_file.name.replace("_dwi.", "_adc.")
        mask_file = mask_dir / dwi_file.name.replace("_dwi.", "_lesion-msk.")
        if not adc_file.exists():
            continue  # Skip incomplete cases
        cases[subject_id] = CaseFiles(
            dwi=dwi_file,
            adc=adc_file,
            ground_truth=mask_file if mask_file.exists() else None,
        )
    return LocalDataset(data_dir=data_dir, cases=cases)
```
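The name-substitution matching rule above can be sanity-checked standalone with empty placeholder files (no real NIfTI data or project imports needed); this is a minimal sketch, not the adapter itself:

```python
import tempfile
from pathlib import Path

# Build a throwaway tree with the expected layout (placeholder files only)
root = Path(tempfile.mkdtemp())
for d in ("Images-DWI", "Images-ADC", "Masks"):
    (root / d).mkdir()
for sub in ("sub-stroke0001", "sub-stroke0002"):
    (root / "Images-DWI" / f"{sub}_ses-02_dwi.nii.gz").touch()
    (root / "Images-ADC" / f"{sub}_ses-02_adc.nii.gz").touch()
    (root / "Masks" / f"{sub}_ses-02_lesion-msk.nii.gz").touch()

# Same name-substitution matching as build_local_dataset()
cases = {}
for dwi in sorted((root / "Images-DWI").glob("*.nii.gz")):
    adc = root / "Images-ADC" / dwi.name.replace("_dwi.", "_adc.")
    mask = root / "Masks" / dwi.name.replace("_dwi.", "_lesion-msk.")
    if adc.exists():  # Skip incomplete cases, as in the adapter
        subject_id = dwi.name.split("_")[0]
        cases[subject_id] = (dwi, adc, mask if mask.exists() else None)

print(sorted(cases))  # ['sub-stroke0001', 'sub-stroke0002']
```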
### synthetic fixture structure
Unit tests MUST use fixtures that replicate the **exact** directory structure. Add to `tests/conftest.py`:
```python
import nibabel as nib
import numpy as np
import pytest
from pathlib import Path


@pytest.fixture
def synthetic_isles_dir(temp_dir: Path) -> Path:
    """
    Create synthetic ISLES24-like directory structure.

    Structure:
        temp_dir/
        β”œβ”€β”€ Images-DWI/
        β”‚   β”œβ”€β”€ sub-stroke0001_ses-02_dwi.nii.gz
        β”‚   └── sub-stroke0002_ses-02_dwi.nii.gz
        β”œβ”€β”€ Images-ADC/
        β”‚   β”œβ”€β”€ sub-stroke0001_ses-02_adc.nii.gz
        β”‚   └── sub-stroke0002_ses-02_adc.nii.gz
        └── Masks/
            β”œβ”€β”€ sub-stroke0001_ses-02_lesion-msk.nii.gz
            └── sub-stroke0002_ses-02_lesion-msk.nii.gz
    """
    dwi_dir = temp_dir / "Images-DWI"
    adc_dir = temp_dir / "Images-ADC"
    mask_dir = temp_dir / "Masks"
    dwi_dir.mkdir()
    adc_dir.mkdir()
    mask_dir.mkdir()

    for subject_num in [1, 2]:
        subject_id = f"sub-stroke{subject_num:04d}"
        # Create DWI
        dwi_data = np.random.rand(10, 10, 5).astype(np.float32)
        dwi_img = nib.Nifti1Image(dwi_data, affine=np.eye(4))
        nib.save(dwi_img, dwi_dir / f"{subject_id}_ses-02_dwi.nii.gz")
        # Create ADC
        adc_data = np.random.rand(10, 10, 5).astype(np.float32) * 2000
        adc_img = nib.Nifti1Image(adc_data, affine=np.eye(4))
        nib.save(adc_img, adc_dir / f"{subject_id}_ses-02_adc.nii.gz")
        # Create Mask
        mask_data = (np.random.rand(10, 10, 5) > 0.9).astype(np.uint8)
        mask_img = nib.Nifti1Image(mask_data, affine=np.eye(4))
        nib.save(mask_img, mask_dir / f"{subject_id}_ses-02_lesion-msk.nii.gz")

    return temp_dir
```
### tdd plan
```python
# tests/data/test_loader.py

def test_load_from_local_returns_local_dataset(synthetic_isles_dir):
    """Local mode returns LocalDataset."""
    ...

def test_load_from_local_finds_all_cases(synthetic_isles_dir):
    """Finds all cases in synthetic structure."""
    ...


# tests/data/test_adapter.py

def test_parse_subject_id_extracts_correctly():
    """Extracts subject ID from BIDS filename."""
    assert parse_subject_id("sub-stroke0005_ses-02_dwi.nii.gz") == "sub-stroke0005"

def test_build_local_dataset_matches_files(synthetic_isles_dir):
    """Matches DWI, ADC, Mask by subject ID."""
    ...

def test_get_case_returns_case_files(synthetic_isles_dir):
    """get_case returns CaseFiles with correct paths."""
    ...
```
### done criteria (phase 1a)
- [ ] `uv run pytest tests/data/ -v` passes
- [ ] Can load all 149 cases from `data/isles24/`
- [ ] `list_case_ids()` returns 149 subject IDs
- [ ] `get_case("sub-stroke0005")` returns valid CaseFiles
- [ ] Type checking passes: `uv run mypy src/stroke_deepisles_demo/data/`
---
## phase 1b: test tobias's nifti feature (NEXT)
### purpose
Verify that Tobias's `Nifti()` feature type from the datasets fork can correctly load/parse NIfTI files. This proves the **loading** part of the consumption pipeline works, even though the **download** part is broken.
### approach
```python
# Test script to verify Nifti() feature works on local files
from datasets import Features, Value
from datasets.features import Nifti  # From Tobias's fork

# Create a simple dataset from local files
features = Features({
    "subject_id": Value("string"),
    "dwi": Nifti(),
    "adc": Nifti(),
    "mask": Nifti(),
})

# Load a single case and verify Nifti() decodes correctly
```
### done criteria (phase 1b)
- [ ] Tobias's `Nifti()` feature loads local `.nii.gz` files
- [ ] Decoded NIfTI has correct shape/dtype
- [ ] Can access voxel data via nibabel-like interface
---
## phase 1c: proper huggingface upload (FUTURE)
### purpose
Re-upload ISLES24 data to HuggingFace **properly** using the arc-aphasia-bids approach. This proves the **production** pipeline works.
### approach
1. Use BIDS loader from Tobias's fork
2. Create proper parquet schema with columns:
- `subject`: string
- `session`: string
- `dwi`: Nifti()
- `adc`: Nifti()
- `mask`: Nifti()
3. Upload to new HuggingFace repo (e.g., `The-Obstacle-Is-The-Way/ISLES24-BIDS`)
### done criteria (phase 1c)
- [ ] Dataset uploaded to HuggingFace with proper schema
- [ ] HuggingFace dataset viewer shows data correctly
- [ ] `load_dataset("new-repo-id")` returns Dataset with expected columns
---
## phase 1d: consumption verification (FUTURE)
### purpose
Verify the full round-trip: Download from HuggingFace using Tobias's fork.
### approach
```python
from datasets import load_dataset

# This should work after Phase 1C
ds = load_dataset("The-Obstacle-Is-The-Way/ISLES24-BIDS")
case = ds["train"][0]
print(case["dwi"].shape)  # Should work!
```
### new adapter function
When Phase 1D is implemented, `adapter.py` will need a new function alongside `build_local_dataset`:
```python
def adapt_hf_case(hf_row: dict) -> CaseFiles:
    """
    Adapt a HuggingFace Dataset row to CaseFiles.

    Args:
        hf_row: Row from load_dataset() with columns:
            - dwi: Nifti feature (nibabel-like object)
            - adc: Nifti feature
            - mask: Nifti feature
            - subject: str

    Returns:
        CaseFiles with materialized paths or nibabel objects
    """
    # Implementation depends on how Nifti() feature exposes data
    # May need to write to temp files or pass nibabel objects directly
    ...
```
This maintains the same `CaseFiles` contract for downstream phases regardless of data source.
### done criteria (phase 1d)
- [ ] `load_dataset()` works on properly uploaded dataset
- [ ] `adapt_hf_case()` function converts HF rows to CaseFiles
- [ ] Full demo runs with HuggingFace consumption (not just local files)
- [ ] Documents the pitfall for future projects
---
## dependencies
No new dependencies needed beyond Phase 0.
## notes
- The original `adapter.py` assumed an HF Dataset with columns - COMPLETELY WRONG
- The original `loader.py` called `load_dataset()` directly - FAILS on this dataset
- `staging.py` is still correct - it just needs `CaseFiles` with paths