
Bug Spec: HuggingFace Spaces Dataset Loading Issues

Status: Root Causes Identified → Comprehensive Fix Ready
Priority: P0 (Blocks deployment)
Branch: fix/pipeline-resource-leak
Date: 2025-12-08
Updated: 2025-12-08

Executive Summary

Two distinct bugs prevent the HuggingFace Spaces deployment from working:

| Bug | Symptom | Root Cause | Impact | Fix |
| --- | --- | --- | --- | --- |
| #1 | Dropdown never populates | PyArrow streaming bug | App hangs at startup | Pre-computed case IDs |
| #2 | OOM on case selection | load_dataset() downloads 99GB | App crashes on first use | HfFileSystem + pyarrow |

Both bugs stem from fundamental incompatibilities between the datasets library and our 99GB parquet dataset on resource-constrained HF Spaces hardware.


Bug #1: Streaming Iteration Hang

Summary

The dropdown never populates because load_dataset(..., streaming=True) hangs indefinitely on parquet datasets. This is a known PyArrow bug, not a HuggingFace datasets bug.

The Bug Chain

  1. Our code calls load_dataset("hugging-science/isles24-stroke", streaming=True)
  2. HF datasets internally uses ParquetFileFragment.to_batches() for streaming
  3. PyArrow hangs when iterating batches from parquet with partial consumption
  4. Result: Script hangs forever, never returns case IDs
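
For reference, a minimal sketch of the application-level call pattern that triggers the hang (the split name and the single-example iteration here are illustrative assumptions, not the exact adapter code):

from datasets import load_dataset

# Streaming load of the parquet-backed ISLES24 dataset. Iteration goes through
# PyArrow's batch reader and never returns once consumption is partial.
stream = load_dataset("hugging-science/isles24-stroke", split="train", streaming=True)
first_example = next(iter(stream))  # hangs here on HF Spaces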

Upstream Issues

Minimal Reproduction (Pure PyArrow, no HF)

import pyarrow.dataset as ds

file = "test-00000-of-00003.parquet"
with open(file, "rb") as f:
    parquet_fragment = ds.ParquetFileFormat().make_fragment(f)
    for record_batch in parquet_fragment.to_batches():
        print(len(record_batch))
        break  # ← Partial consumption causes hang
# Script hangs here forever

This proves the bug is in PyArrow's C++ layer, not HuggingFace datasets.

Fix: Pre-computed Case ID List

Why this is professional, not hacky:

  1. ISLES24 is a static challenge dataset - case IDs will never change
  2. Industry standard - many production ML systems pre-define dataset indices
  3. Zero startup latency - dropdown populates instantly
  4. No network dependency - works offline for dropdown population
  5. Bypasses upstream bug - doesn't depend on PyArrow fix timeline
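
As a concrete illustration, a minimal sketch of dropdown population from the pre-computed list (the list_cases helper is a hypothetical name; the actual constants module is specified under "Comprehensive Fix Implementation" below):

from stroke_deepisles_demo.data.constants import ISLES24_CASE_IDS

def list_cases() -> list[str]:
    # No network call and no dataset iteration: the dropdown fills instantly
    # from the static tuple of 149 case IDs.
    return list(ISLES24_CASE_IDS)

print(list_cases()[:3])  # ['sub-stroke0001', 'sub-stroke0002', 'sub-stroke0003']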

Bug #2: Full Dataset OOM on Case Access

Summary

Even after fixing Bug #1, the application would crash immediately upon selecting a case. The current get_case() implementation calls:

# adapter.py:213
self._hf_dataset = load_dataset(self.dataset_id, split="train")

This attempts to download the entire 99GB dataset into memory, which OOMs on HF Spaces.

Why This Wasn't Caught

The bug document initially focused on the dropdown hang (Bug #1). Bug #2 would only manifest after Bug #1 was fixed and a user actually selected a case.

Investigation Results

| Approach | Result | Time | Memory |
| --- | --- | --- | --- |
| load_dataset(..., streaming=True) | HANGS | ∞ | N/A |
| load_dataset(...) (full download) | OOMs | ~10 min | 99GB+ |
| HfFileSystem + pyarrow (single file) | WORKS | 1.7s | ~50MB |

Dataset Structure Discovery

Critical finding: Each case is stored in a separate parquet file:

  • 149 parquet files named train-00000-of-00149.parquet through train-00148-of-00149.parquet
  • Each file = one case (~600-700MB raw data per case)
  • Schema: subject_id, dwi, adc, lesion_mask (NIfTI bytes stored as binary)

This means we can directly access individual cases without loading the full dataset!
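
A hedged sketch of how the per-case files and their schema can be confirmed via HfFileSystem enumeration (the glob pattern follows the file naming above; only the parquet footer is read for the schema check):

from huggingface_hub import HfFileSystem
import pyarrow.parquet as pq

fs = HfFileSystem()

# Enumerate the per-case parquet files under the dataset's data/ directory.
files = sorted(fs.glob("datasets/hugging-science/isles24-stroke/data/train-*.parquet"))
print(len(files))  # expected: 149

# Inspect the schema of one file without reading the row data.
with fs.open(files[0], "rb") as f:
    schema = pq.ParquetFile(f).schema_arrow
    print(schema.names)  # expected: ['subject_id', 'dwi', 'adc', 'lesion_mask']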

Fix: Direct Parquet Access via HfFileSystem

from huggingface_hub import HfFileSystem
import pyarrow.parquet as pq

# Example values so the snippet runs standalone; in the app these come from
# the selected case and the dataset adapter configuration.
dataset_id = "hugging-science/isles24-stroke"
idx = 0  # parquet file index for the selected case (0-indexed)

fs = HfFileSystem()
fpath = f"datasets/{dataset_id}/data/train-{idx:05d}-of-00149.parquet"

with fs.open(fpath, 'rb') as f:
    pf = pq.ParquetFile(f)
    table = pf.read(columns=['subject_id', 'dwi', 'adc', 'lesion_mask'])
    # Extract ~50MB for one case in ~2 seconds

Benefits:

  • Downloads only the single case needed (~50MB vs 99GB)
  • Completes in 1.7 seconds (vs hanging or OOM)
  • No dependency on datasets library for data access
  • Bypasses both PyArrow streaming bug and memory constraints
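
To sanity-check the timing and memory claims on the target hardware, a rough sketch (the perf_counter measurement is an assumption added here, not the original investigation script):

import time
from huggingface_hub import HfFileSystem
import pyarrow.parquet as pq

fs = HfFileSystem()
fpath = "datasets/hugging-science/isles24-stroke/data/train-00000-of-00149.parquet"

start = time.perf_counter()
with fs.open(fpath, "rb") as f:
    table = pq.ParquetFile(f).read(columns=["subject_id", "dwi", "adc", "lesion_mask"])
elapsed = time.perf_counter() - start

# Expect on the order of a couple of seconds and tens of MB for a single case.
print(f"rows={table.num_rows} bytes={table.nbytes} elapsed={elapsed:.1f}s")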

Comprehensive Fix Implementation

1. Create constants.py with case ID → file index mapping

# src/stroke_deepisles_demo/data/constants.py

# Pre-computed case IDs for ISLES24 dataset (static challenge dataset)
# Extracted via HfFileSystem enumeration on 2025-12-08
ISLES24_CASE_IDS: tuple[str, ...] = (
    "sub-stroke0001", "sub-stroke0002", ..., "sub-stroke0189"
)

# Mapping from case ID to parquet file index (0-indexed)
ISLES24_CASE_INDEX: dict[str, int] = {
    case_id: idx for idx, case_id in enumerate(ISLES24_CASE_IDS)
}

2. Rewrite HuggingFaceDataset.get_case() to use HfFileSystem

Replace load_dataset() call with direct parquet access:

def get_case(self, case_id: str | int) -> CaseFiles:
    from huggingface_hub import HfFileSystem
    import pyarrow.parquet as pq

    idx = self._case_index[case_id]
    fpath = f"datasets/{self.dataset_id}/data/train-{idx:05d}-of-00149.parquet"

    fs = HfFileSystem()
    with fs.open(fpath, 'rb') as f:
        table = pq.ParquetFile(f).read(columns=['dwi', 'adc', 'lesion_mask'])
        # Extract bytes and write to temp files...
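
The elided extraction step above could look roughly like the following sketch; the temp-file layout, the .nii.gz suffix, and the helper name are illustrative assumptions, and CaseFiles construction is omitted:

import tempfile
from pathlib import Path

def _write_case_to_tempfiles(table, case_id: str) -> dict[str, Path]:
    # Each selected column holds one NIfTI file as raw bytes; the single row
    # corresponds to the one case stored in this parquet file.
    out_dir = Path(tempfile.mkdtemp(prefix=f"{case_id}-"))
    paths: dict[str, Path] = {}
    for column in ("dwi", "adc", "lesion_mask"):
        nifti_bytes = table.column(column)[0].as_py()   # bytes of the stored NIfTI
        path = out_dir / f"{case_id}_{column}.nii.gz"   # suffix assumed
        path.write_bytes(nifti_bytes)
        paths[column] = path
    return paths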

3. Remove all load_dataset() calls from HuggingFace path

The datasets library is completely bypassed for the HuggingFace workflow.


All 149 Case IDs (Extracted via HfFileSystem)

sub-stroke0001, sub-stroke0002, sub-stroke0003, sub-stroke0004, sub-stroke0005,
sub-stroke0006, sub-stroke0007, sub-stroke0008, sub-stroke0009, sub-stroke0010,
sub-stroke0011, sub-stroke0012, sub-stroke0013, sub-stroke0014, sub-stroke0015,
sub-stroke0016, sub-stroke0017, sub-stroke0019, sub-stroke0020, sub-stroke0021,
sub-stroke0022, sub-stroke0025, sub-stroke0026, sub-stroke0027, sub-stroke0028,
sub-stroke0030, sub-stroke0033, sub-stroke0036, sub-stroke0037, sub-stroke0038,
sub-stroke0040, sub-stroke0043, sub-stroke0045, sub-stroke0047, sub-stroke0048,
sub-stroke0049, sub-stroke0052, sub-stroke0053, sub-stroke0054, sub-stroke0055,
sub-stroke0057, sub-stroke0062, sub-stroke0066, sub-stroke0068, sub-stroke0070,
sub-stroke0071, sub-stroke0073, sub-stroke0074, sub-stroke0075, sub-stroke0076,
sub-stroke0077, sub-stroke0078, sub-stroke0079, sub-stroke0080, sub-stroke0081,
sub-stroke0082, sub-stroke0083, sub-stroke0084, sub-stroke0085, sub-stroke0086,
sub-stroke0087, sub-stroke0088, sub-stroke0089, sub-stroke0090, sub-stroke0091,
sub-stroke0092, sub-stroke0093, sub-stroke0094, sub-stroke0095, sub-stroke0096,
sub-stroke0097, sub-stroke0098, sub-stroke0099, sub-stroke0100, sub-stroke0101,
sub-stroke0102, sub-stroke0103, sub-stroke0104, sub-stroke0105, sub-stroke0106,
sub-stroke0107, sub-stroke0108, sub-stroke0109, sub-stroke0110, sub-stroke0111,
sub-stroke0112, sub-stroke0113, sub-stroke0114, sub-stroke0115, sub-stroke0116,
sub-stroke0117, sub-stroke0118, sub-stroke0119, sub-stroke0133, sub-stroke0134,
sub-stroke0135, sub-stroke0136, sub-stroke0137, sub-stroke0138, sub-stroke0139,
sub-stroke0140, sub-stroke0141, sub-stroke0142, sub-stroke0143, sub-stroke0144,
sub-stroke0145, sub-stroke0146, sub-stroke0147, sub-stroke0148, sub-stroke0149,
sub-stroke0150, sub-stroke0151, sub-stroke0152, sub-stroke0153, sub-stroke0154,
sub-stroke0155, sub-stroke0156, sub-stroke0157, sub-stroke0158, sub-stroke0159,
sub-stroke0161, sub-stroke0162, sub-stroke0163, sub-stroke0164, sub-stroke0165,
sub-stroke0166, sub-stroke0167, sub-stroke0168, sub-stroke0169, sub-stroke0170,
sub-stroke0171, sub-stroke0172, sub-stroke0173, sub-stroke0174, sub-stroke0175,
sub-stroke0176, sub-stroke0177, sub-stroke0178, sub-stroke0179, sub-stroke0180,
sub-stroke0181, sub-stroke0182, sub-stroke0183, sub-stroke0184, sub-stroke0185,
sub-stroke0186, sub-stroke0187, sub-stroke0188, sub-stroke0189

Environment

  • Space: VibecoderMcSwaggins/stroke-deepisles-demo
  • Hardware: T4-small GPU (limited memory)
  • Dataset: hugging-science/isles24-stroke (149 parquet files, ~99GB total)
  • Dependencies:
    • datasets @ git+https://github.com/CloseChoice/datasets.git@c1c15aa... (fork with Nifti support)
    • pyarrow (inherited, contains Bug #1)
    • huggingface_hub (used for Bug #2 fix)

References


Checklist

  1. Identify Bug #1 root cause (PyArrow streaming hang)
  2. Identify Bug #2 root cause (OOM on full download)
  3. Extract all 149 case IDs via HfFileSystem
  4. Validate direct parquet access works (1.7s per case)
  5. Implement pre-computed case ID list (constants.py)
  6. Rewrite get_case() to use HfFileSystem + pyarrow
  7. Update tests
  8. Test on HF Spaces
  9. Monitor PyArrow issue for upstream fix