stroke-viewer-frontend / docs /specs /archive /08-bug-hf-spaces-dataset-loop.md

	# Bug Spec: HuggingFace Spaces Dataset Loading Issues

	Status: Root Causes Identified → Comprehensive Fix Ready
	Priority: P0 (Blocks deployment)
	Branch: `fix/pipeline-resource-leak`
	Date: 2025-12-08
	Updated: 2025-12-08

	## Executive Summary

	Two distinct bugs prevent the HuggingFace Spaces deployment from working:

	| Bug | Symptom | Root Cause | Impact | Fix |
	|-----|---------|------------|--------|-----|
	| #1 | Dropdown never populates | PyArrow streaming bug | App hangs at startup | Pre-computed case IDs |
	| #2 | OOM on case selection | `load_dataset()` downloads 99GB | App crashes on first use | HfFileSystem + pyarrow |

	Both bugs stem from fundamental incompatibilities between the `datasets` library and our 99GB parquet dataset on resource-constrained HF Spaces hardware.

	---

	## Bug #1: Streaming Iteration Hang

	### Summary

	The dropdown never populates because `load_dataset(..., streaming=True)` hangs indefinitely on parquet datasets. This is a known PyArrow bug, not a HuggingFace datasets bug.

	### The Bug Chain

	1. Our code calls `load_dataset("hugging-science/isles24-stroke", streaming=True)`
	2. HF datasets internally uses `ParquetFileFragment.to_batches()` for streaming
	3. PyArrow hangs when iterating batches from parquet with partial consumption
	4. Result: Script hangs forever, never returns case IDs

	### Upstream Issues

	- PyArrow Issue: [apache/arrow#45214](https://github.com/apache/arrow/issues/45214) - Root cause
	- HF Datasets Issue: [huggingface/datasets#7467](https://github.com/huggingface/datasets/issues/7467) - HF tracking
	- Status: Open, no fix ETA
	- Maintainer: @lhoestq (HF datasets core dev) correctly escalated to PyArrow team

	### Minimal Reproduction (Pure PyArrow, no HF)

	```python
	import pyarrow.dataset as ds

	file = "test-00000-of-00003.parquet"
	with open(file, "rb") as f:
	    parquet_fragment = ds.ParquetFileFormat().make_fragment(f)
	    for record_batch in parquet_fragment.to_batches():
	        print(len(record_batch))
	        break  # ← Partial consumption causes hang
	# Script hangs here forever
	```

	Because the hang reproduces with PyArrow alone, without `datasets` installed, the bug lies in PyArrow, not in HuggingFace datasets.
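When re-checking whether a given PyArrow release still hangs, it can help to run the reproduction in a child process with a timeout so the check itself cannot block. The following is a minimal sketch, assuming the reproduction is passed in as a source string; `hangs` and `timeout_s` are hypothetical names, not part of the project.

```python
import subprocess
import sys


def hangs(snippet: str, timeout_s: float = 30.0) -> bool:
    """Return True if `snippet` does not finish within `timeout_s` seconds.

    Hypothetical helper: runs the snippet in a fresh interpreter so a hang
    in the child cannot block the caller.
    """
    try:
        subprocess.run(
            [sys.executable, "-c", snippet],
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        return True
    return False
```

Passing the reproduction script above as `snippet` should return `True` for as long as the upstream bug is present in the installed PyArrow build; the timeout value is a judgment call.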

	### Fix: Pre-computed Case ID List

	Why this is professional, not hacky:

	1. ISLES24 is a static challenge dataset - case IDs will never change
	2. Industry standard - many production ML systems pre-define dataset indices
	3. Zero startup latency - dropdown populates instantly
	4. No network dependency - works offline for dropdown population
	5. Bypasses upstream bug - doesn't depend on PyArrow fix timeline

	---

	## Bug #2: Full Dataset OOM on Case Access

	### Summary

	Even after fixing Bug #1, the application would crash immediately upon selecting a case. The current `get_case()` implementation calls:

	```python
	# adapter.py:213
	self._hf_dataset = load_dataset(self.dataset_id, split="train")
	```

	This attempts to download the entire 99GB dataset into memory, which OOMs on HF Spaces.

	### Why This Wasn't Caught

	The bug document initially focused on the dropdown hang (Bug #1). Bug #2 would only manifest after Bug #1 was fixed and a user actually selected a case.

	### Investigation Results

	| Approach | Result | Time | Memory |
	|----------|--------|------|--------|
	| `load_dataset(..., streaming=True)` | HANGS | ∞ | N/A |
	| `load_dataset(...)` (full download) | OOMs | ~10 min | 99GB+ |
	| `HfFileSystem` + `pyarrow` (single file) | WORKS | 1.7s | ~50MB |

	### Dataset Structure Discovery

	Critical finding: Each case is stored in a separate parquet file:

	- 149 parquet files named `train-00000-of-00149.parquet` through `train-00148-of-00149.parquet`
	- Each file = one case (~600-700MB raw data per case)
	- Schema: `subject_id`, `dwi`, `adc`, `lesion_mask` (NIfTI bytes stored as binary)

	This means we can directly access individual cases without loading the full dataset!
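The file naming scheme can be captured in one small function. This is a sketch, assuming the `datasets/{dataset_id}/data/` prefix used elsewhere in this spec; `shard_path` and `NUM_SHARDS` are hypothetical names introduced for illustration.

```python
NUM_SHARDS = 149  # number of parquet files reported in this spec


def shard_path(dataset_id: str, idx: int, num_shards: int = NUM_SHARDS) -> str:
    """Build the HfFileSystem path of the parquet file at 0-based index `idx`."""
    if not 0 <= idx < num_shards:
        raise IndexError(f"shard index {idx} out of range [0, {num_shards})")
    return f"datasets/{dataset_id}/data/train-{idx:05d}-of-{num_shards:05d}.parquet"
```

Centralizing the format string means the `00149` suffix lives in one place and an out-of-range index fails before any network request is made.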

	### Fix: Direct Parquet Access via HfFileSystem

	```python
	from huggingface_hub import HfFileSystem
	import pyarrow.parquet as pq

	dataset_id = "hugging-science/isles24-stroke"
	idx = 0  # parquet file index of the case to load

	fs = HfFileSystem()
	fpath = f"datasets/{dataset_id}/data/train-{idx:05d}-of-00149.parquet"

	with fs.open(fpath, 'rb') as f:
	    pf = pq.ParquetFile(f)
	    table = pf.read(columns=['subject_id', 'dwi', 'adc', 'lesion_mask'])
	# Extract ~50MB for one case in ~2 seconds
	```

	Benefits:
	- Downloads only the single case needed (~50MB vs 99GB)
	- Completes in 1.7 seconds (vs hanging or OOM)
	- No dependency on `datasets` library for data access
	- Bypasses both PyArrow streaming bug and memory constraints

	---

	## Comprehensive Fix Implementation

	### 1. Create `constants.py` with case ID → file index mapping

	```python
	# src/stroke_deepisles_demo/data/constants.py

	# Pre-computed case IDs for ISLES24 dataset (static challenge dataset)
	# Extracted via HfFileSystem enumeration on 2025-12-08
	ISLES24_CASE_IDS: tuple[str, ...] = (
	    # Abbreviated here; the full list of 149 IDs appears later in this spec
	    "sub-stroke0001", "sub-stroke0002", ..., "sub-stroke0189"
	)

	# Mapping from case ID to parquet file index (0-indexed)
	ISLES24_CASE_INDEX: dict[str, int] = {
	    case_id: idx for idx, case_id in enumerate(ISLES24_CASE_IDS)
	}
	```
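Since the list is maintained by hand, a cheap sanity check at import or test time can catch copy-paste mistakes. This is a sketch only; `validate_case_ids` and `CASE_ID_PATTERN` are hypothetical names, and the `sub-strokeNNNN` pattern is inferred from the IDs listed in this spec.

```python
import re
from typing import Sequence

# Assumed ID shape, inferred from the listed IDs: "sub-stroke" + four digits.
CASE_ID_PATTERN = re.compile(r"sub-stroke\d{4}")


def validate_case_ids(case_ids: Sequence[str], expected_count: int = 149) -> None:
    """Raise ValueError if the static ID list looks malformed."""
    if len(case_ids) != expected_count:
        raise ValueError(f"expected {expected_count} case IDs, got {len(case_ids)}")
    if len(set(case_ids)) != len(case_ids):
        raise ValueError("duplicate case IDs found")
    malformed = [cid for cid in case_ids if not CASE_ID_PATTERN.fullmatch(cid)]
    if malformed:
        raise ValueError(f"malformed case IDs: {malformed}")
```

Calling `validate_case_ids(ISLES24_CASE_IDS)` from a unit test would fail fast if an entry is dropped or duplicated, which matters because the tuple position doubles as the parquet file index.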

	### 2. Rewrite `HuggingFaceDataset.get_case()` to use HfFileSystem

	Replace `load_dataset()` call with direct parquet access:

	```python
	def get_case(self, case_id: str | int) -> CaseFiles:
	    from huggingface_hub import HfFileSystem
	    import pyarrow.parquet as pq

	    idx = self._case_index[case_id]
	    fpath = f"datasets/{self.dataset_id}/data/train-{idx:05d}-of-00149.parquet"

	    fs = HfFileSystem()
	    with fs.open(fpath, 'rb') as f:
	        table = pq.ParquetFile(f).read(columns=['dwi', 'adc', 'lesion_mask'])
	    # Extract bytes and write to temp files...
	```
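The final "write to temp files" step is elided above. One possible shape for it is sketched below, assuming each column value has already been extracted as raw `bytes`; `write_volumes` is a hypothetical helper, not the project's actual implementation. It picks the file suffix by checking for the gzip magic bytes, since the spec does not say whether the stored NIfTI payloads are compressed.

```python
import tempfile
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def write_volumes(volumes: dict[str, bytes], case_id: str) -> dict[str, Path]:
    """Write each named NIfTI payload to a fresh temp directory.

    Returns a mapping from volume name (e.g. "dwi") to the written path.
    """
    out_dir = Path(tempfile.mkdtemp(prefix=f"{case_id}-"))
    paths: dict[str, Path] = {}
    for name, payload in volumes.items():
        # Compressed payloads get ".nii.gz"; everything else is assumed plain NIfTI.
        suffix = ".nii.gz" if payload[:2] == GZIP_MAGIC else ".nii"
        path = out_dir / f"{name}{suffix}"
        path.write_bytes(payload)
        paths[name] = path
    return paths
```

The caller is responsible for removing the directory when the case is no longer needed. How the bytes are pulled out of `table` depends on the column type in the dataset, so that part is deliberately left out of the sketch.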

	### 3. Remove all `load_dataset()` calls from HuggingFace path

	The `datasets` library is completely bypassed for the HuggingFace workflow.

	---

	## All 149 Case IDs (Extracted via HfFileSystem)

	```
	sub-stroke0001, sub-stroke0002, sub-stroke0003, sub-stroke0004, sub-stroke0005,
	sub-stroke0006, sub-stroke0007, sub-stroke0008, sub-stroke0009, sub-stroke0010,
	sub-stroke0011, sub-stroke0012, sub-stroke0013, sub-stroke0014, sub-stroke0015,
	sub-stroke0016, sub-stroke0017, sub-stroke0019, sub-stroke0020, sub-stroke0021,
	sub-stroke0022, sub-stroke0025, sub-stroke0026, sub-stroke0027, sub-stroke0028,
	sub-stroke0030, sub-stroke0033, sub-stroke0036, sub-stroke0037, sub-stroke0038,
	sub-stroke0040, sub-stroke0043, sub-stroke0045, sub-stroke0047, sub-stroke0048,
	sub-stroke0049, sub-stroke0052, sub-stroke0053, sub-stroke0054, sub-stroke0055,
	sub-stroke0057, sub-stroke0062, sub-stroke0066, sub-stroke0068, sub-stroke0070,
	sub-stroke0071, sub-stroke0073, sub-stroke0074, sub-stroke0075, sub-stroke0076,
	sub-stroke0077, sub-stroke0078, sub-stroke0079, sub-stroke0080, sub-stroke0081,
	sub-stroke0082, sub-stroke0083, sub-stroke0084, sub-stroke0085, sub-stroke0086,
	sub-stroke0087, sub-stroke0088, sub-stroke0089, sub-stroke0090, sub-stroke0091,
	sub-stroke0092, sub-stroke0093, sub-stroke0094, sub-stroke0095, sub-stroke0096,
	sub-stroke0097, sub-stroke0098, sub-stroke0099, sub-stroke0100, sub-stroke0101,
	sub-stroke0102, sub-stroke0103, sub-stroke0104, sub-stroke0105, sub-stroke0106,
	sub-stroke0107, sub-stroke0108, sub-stroke0109, sub-stroke0110, sub-stroke0111,
	sub-stroke0112, sub-stroke0113, sub-stroke0114, sub-stroke0115, sub-stroke0116,
	sub-stroke0117, sub-stroke0118, sub-stroke0119, sub-stroke0133, sub-stroke0134,
	sub-stroke0135, sub-stroke0136, sub-stroke0137, sub-stroke0138, sub-stroke0139,
	sub-stroke0140, sub-stroke0141, sub-stroke0142, sub-stroke0143, sub-stroke0144,
	sub-stroke0145, sub-stroke0146, sub-stroke0147, sub-stroke0148, sub-stroke0149,
	sub-stroke0150, sub-stroke0151, sub-stroke0152, sub-stroke0153, sub-stroke0154,
	sub-stroke0155, sub-stroke0156, sub-stroke0157, sub-stroke0158, sub-stroke0159,
	sub-stroke0161, sub-stroke0162, sub-stroke0163, sub-stroke0164, sub-stroke0165,
	sub-stroke0166, sub-stroke0167, sub-stroke0168, sub-stroke0169, sub-stroke0170,
	sub-stroke0171, sub-stroke0172, sub-stroke0173, sub-stroke0174, sub-stroke0175,
	sub-stroke0176, sub-stroke0177, sub-stroke0178, sub-stroke0179, sub-stroke0180,
	sub-stroke0181, sub-stroke0182, sub-stroke0183, sub-stroke0184, sub-stroke0185,
	sub-stroke0186, sub-stroke0187, sub-stroke0188, sub-stroke0189
	```

	---

	## Environment

	- Space: `VibecoderMcSwaggins/stroke-deepisles-demo`
	- Hardware: T4-small GPU (limited memory)
	- Dataset: `hugging-science/isles24-stroke` (149 parquet files, ~99GB total)
	- Dependencies:
	  - `datasets @ git+https://github.com/CloseChoice/datasets.git@c1c15aa...` (fork with Nifti support)
	  - `pyarrow` (inherited, contains Bug #1)
	  - `huggingface_hub` (used for Bug #2 fix)

	---

	## References

	- [PyArrow Issue #45214](https://github.com/apache/arrow/issues/45214) - Bug #1 root cause
	- [PyArrow Issue #43604](https://github.com/apache/arrow/issues/43604) - Related hang issue
	- [HF Datasets Issue #7467](https://github.com/huggingface/datasets/issues/7467) - HF tracking issue
	- [HF Datasets Issue #7357](https://github.com/huggingface/datasets/issues/7357) - Original report

	---

	## Checklist

	1. [x] Identify Bug #1 root cause (PyArrow streaming hang)
	2. [x] Identify Bug #2 root cause (OOM on full download)
	3. [x] Extract all 149 case IDs via HfFileSystem
	4. [x] Validate direct parquet access works (1.7s per case)
	5. [x] Implement pre-computed case ID list (`constants.py`)
	6. [x] Rewrite `get_case()` to use HfFileSystem + pyarrow
	7. [x] Update tests
	8. [ ] Test on HF Spaces
	9. [ ] Monitor PyArrow issue for upstream fix