# Bug Spec: HuggingFace Spaces Dataset Loading Issues

**Status:** Root Causes Identified → Comprehensive Fix Ready
**Priority:** P0 (blocks deployment)
**Branch:** `fix/pipeline-resource-leak`
**Date:** 2025-12-08
**Updated:** 2025-12-08
## Executive Summary

Two distinct bugs prevent the HuggingFace Spaces deployment from working:

| Bug | Symptom | Root Cause | Impact | Fix |
|---|---|---|---|---|
| #1 | Dropdown never populates | PyArrow streaming bug | App hangs at startup | Pre-computed case IDs |
| #2 | OOM on case selection | `load_dataset()` downloads 99GB | App crashes on first use | HfFileSystem + pyarrow |

Both bugs stem from fundamental incompatibilities between the `datasets` library and our 99GB parquet dataset on resource-constrained HF Spaces hardware.
## Bug #1: Streaming Iteration Hang

### Summary

The dropdown never populates because `load_dataset(..., streaming=True)` hangs indefinitely on parquet datasets. This is a known PyArrow bug, not a HuggingFace `datasets` bug.

### The Bug Chain

1. Our code calls `load_dataset("hugging-science/isles24-stroke", streaming=True)`.
2. HF `datasets` internally uses `ParquetFileFragment.to_batches()` for streaming.
3. PyArrow hangs when iterating batches from parquet with partial consumption.
4. Result: the script hangs forever and never returns case IDs.
### Upstream Issues
- PyArrow Issue: apache/arrow#45214 - Root cause
- HF Datasets Issue: huggingface/datasets#7467 - HF tracking
- Status: Open, no fix ETA
- Maintainer: @lhoestq (HF datasets core dev) correctly escalated to PyArrow team
### Minimal Reproduction (Pure PyArrow, no HF)

```python
import pyarrow.dataset as ds

file = "test-00000-of-00003.parquet"
with open(file, "rb") as f:
    parquet_fragment = ds.ParquetFileFormat().make_fragment(f)
    for record_batch in parquet_fragment.to_batches():
        print(len(record_batch))
        break  # ← Partial consumption causes hang
# Script hangs here forever
```

This proves the bug is in PyArrow's C++ layer, not in HuggingFace `datasets`.
### Fix: Pre-computed Case ID List

Why this is professional, not hacky:

- ISLES24 is a static challenge dataset: the case IDs will never change
- Industry standard: many production ML systems pre-define dataset indices
- Zero startup latency: the dropdown populates instantly (see the sketch after this list)
- No network dependency: dropdown population works offline
- Bypasses the upstream bug: does not depend on the PyArrow fix timeline
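As a concrete illustration of the zero-latency point, here is a minimal sketch of dropdown population from the pre-computed list. It assumes the Space's UI is Gradio (the spec does not show the UI code) and imports `ISLES24_CASE_IDS` from the `constants.py` module defined in the fix below:

```python
import gradio as gr

# Import path from the constants.py fix below; the Gradio UI itself
# is an assumption, not shown in this spec
from stroke_deepisles_demo.data.constants import ISLES24_CASE_IDS

with gr.Blocks() as demo:
    # Choices come from a static tuple: no network call, no load_dataset(),
    # so the dropdown renders instantly even if the Hub is unreachable
    case_dropdown = gr.Dropdown(
        choices=list(ISLES24_CASE_IDS),
        label="ISLES24 case",
    )

demo.launch()
```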
## Bug #2: Full Dataset OOM on Case Access

### Summary

Even after fixing Bug #1, the application would crash immediately upon selecting a case. The current `get_case()` implementation calls:

```python
# adapter.py:213
self._hf_dataset = load_dataset(self.dataset_id, split="train")
```

This attempts to download the entire 99GB dataset into memory, which OOMs on HF Spaces.
### Why This Wasn't Caught
The bug document initially focused on the dropdown hang (Bug #1). Bug #2 would only manifest after Bug #1 was fixed and a user actually selected a case.
### Investigation Results

| Approach | Result | Time | Memory |
|---|---|---|---|
| `load_dataset(..., streaming=True)` | HANGS | ∞ | N/A |
| `load_dataset(...)` (full download) | OOMs | ~10 min | 99GB+ |
| HfFileSystem + pyarrow (single file) | WORKS | 1.7s | ~50MB |
### Dataset Structure Discovery

Critical finding: each case is stored in a separate parquet file:

- 149 parquet files, named `train-00000-of-00149.parquet` through `train-00148-of-00149.parquet`
- Each file = one case (~600-700MB of raw data per case)
- Schema: `subject_id`, `dwi`, `adc`, `lesion_mask` (NIfTI bytes stored as binary)

This means we can directly access individual cases without loading the full dataset!
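For reference, the enumeration that produced this finding can be sketched as follows, assuming the shard naming above. `fs.ls` is the standard fsspec listing call on `HfFileSystem`, and reading only the `subject_id` column keeps each request tiny:

```python
from huggingface_hub import HfFileSystem
import pyarrow.parquet as pq

fs = HfFileSystem()

# Enumerate the 149 shards; sorting keeps file index == list position
files = sorted(fs.ls("datasets/hugging-science/isles24-stroke/data", detail=False))

case_ids = []
for fpath in files:
    with fs.open(fpath, "rb") as f:
        # Read only the tiny subject_id column, not the ~600-700MB of images
        table = pq.ParquetFile(f).read(columns=["subject_id"])
    case_ids.append(table.column("subject_id")[0].as_py())

print(len(case_ids))  # expect 149
```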
### Fix: Direct Parquet Access via HfFileSystem

```python
from huggingface_hub import HfFileSystem
import pyarrow.parquet as pq

fs = HfFileSystem()
fpath = f"datasets/{dataset_id}/data/train-{idx:05d}-of-00149.parquet"
with fs.open(fpath, "rb") as f:
    pf = pq.ParquetFile(f)
    table = pf.read(columns=["subject_id", "dwi", "adc", "lesion_mask"])
# Extracts ~50MB for one case in ~2 seconds
```
Benefits:

- Downloads only the single case needed (~50MB vs 99GB)
- Completes in 1.7 seconds (vs hanging or OOMing; see the timing sketch after this list)
- No dependency on the `datasets` library for data access
- Bypasses both the PyArrow streaming bug and the memory constraints
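To reproduce the 1.7-second figure, a minimal timing harness under the same assumptions (shard path layout and column names as above); the exact number will vary with network conditions:

```python
import time

from huggingface_hub import HfFileSystem
import pyarrow.parquet as pq


def time_single_case(dataset_id: str, idx: int) -> float:
    """Fetch one case's image columns and return elapsed wall-clock seconds."""
    fs = HfFileSystem()
    fpath = f"datasets/{dataset_id}/data/train-{idx:05d}-of-00149.parquet"
    start = time.perf_counter()
    with fs.open(fpath, "rb") as f:
        pq.ParquetFile(f).read(columns=["dwi", "adc", "lesion_mask"])
    return time.perf_counter() - start


print(f"{time_single_case('hugging-science/isles24-stroke', 0):.1f}s")
```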
## Comprehensive Fix Implementation

### 1. Create `constants.py` with the case ID → file index mapping

```python
# src/stroke_deepisles_demo/data/constants.py
# Pre-computed case IDs for the ISLES24 dataset (static challenge dataset).
# Extracted via HfFileSystem enumeration on 2025-12-08.

ISLES24_CASE_IDS: tuple[str, ...] = (
    "sub-stroke0001", "sub-stroke0002", ..., "sub-stroke0189"
)

# Mapping from case ID to parquet file index (0-indexed)
ISLES24_CASE_INDEX: dict[str, int] = {
    case_id: idx for idx, case_id in enumerate(ISLES24_CASE_IDS)
}
```
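The mapping is only valid while the tuple order matches the shard order on the Hub, so a few sanity checks (a suggested addition, e.g. in a unit test) make silent drift loud:

```python
# Guard against the dataset being re-sharded or re-uploaded:
# a misaligned tuple would make get_case() fetch the wrong file
assert len(ISLES24_CASE_IDS) == 149
assert ISLES24_CASE_IDS[0] == "sub-stroke0001"
assert ISLES24_CASE_INDEX["sub-stroke0189"] == 148
```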
### 2. Rewrite `HuggingFaceDataset.get_case()` to use HfFileSystem

Replace the `load_dataset()` call with direct parquet access:

```python
def get_case(self, case_id: str | int) -> CaseFiles:
    from huggingface_hub import HfFileSystem
    import pyarrow.parquet as pq

    idx = self._case_index[case_id]
    fpath = f"datasets/{self.dataset_id}/data/train-{idx:05d}-of-00149.parquet"
    fs = HfFileSystem()
    with fs.open(fpath, "rb") as f:
        table = pq.ParquetFile(f).read(columns=["dwi", "adc", "lesion_mask"])
    # Extract bytes and write to temp files...
```
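The elided extraction step could look like the sketch below. It assumes each binary column holds one raw NIfTI (`.nii.gz`) blob per shard and that `CaseFiles` wraps per-modality file paths; the helper name and the `CaseFiles` fields are hypothetical, since the spec does not show them:

```python
import tempfile
from pathlib import Path


def _write_nifti(raw: bytes, name: str, out_dir: Path) -> Path:
    # Hypothetical helper: persist one modality's NIfTI bytes to disk
    path = out_dir / f"{name}.nii.gz"  # assumed extension for the blobs
    path.write_bytes(raw)
    return path


# Continuing from the read above: each shard holds exactly one row (one case)
out_dir = Path(tempfile.mkdtemp(prefix="isles24_case_"))
paths = {
    col: _write_nifti(table.column(col)[0].as_py(), col, out_dir)
    for col in ("dwi", "adc", "lesion_mask")
}
# return CaseFiles(dwi=paths["dwi"], adc=paths["adc"], lesion_mask=paths["lesion_mask"])
```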
### 3. Remove all `load_dataset()` calls from the HuggingFace path

The `datasets` library is completely bypassed for the HuggingFace workflow.

## All 149 Case IDs (Extracted via HfFileSystem)
sub-stroke0001, sub-stroke0002, sub-stroke0003, sub-stroke0004, sub-stroke0005,
sub-stroke0006, sub-stroke0007, sub-stroke0008, sub-stroke0009, sub-stroke0010,
sub-stroke0011, sub-stroke0012, sub-stroke0013, sub-stroke0014, sub-stroke0015,
sub-stroke0016, sub-stroke0017, sub-stroke0019, sub-stroke0020, sub-stroke0021,
sub-stroke0022, sub-stroke0025, sub-stroke0026, sub-stroke0027, sub-stroke0028,
sub-stroke0030, sub-stroke0033, sub-stroke0036, sub-stroke0037, sub-stroke0038,
sub-stroke0040, sub-stroke0043, sub-stroke0045, sub-stroke0047, sub-stroke0048,
sub-stroke0049, sub-stroke0052, sub-stroke0053, sub-stroke0054, sub-stroke0055,
sub-stroke0057, sub-stroke0062, sub-stroke0066, sub-stroke0068, sub-stroke0070,
sub-stroke0071, sub-stroke0073, sub-stroke0074, sub-stroke0075, sub-stroke0076,
sub-stroke0077, sub-stroke0078, sub-stroke0079, sub-stroke0080, sub-stroke0081,
sub-stroke0082, sub-stroke0083, sub-stroke0084, sub-stroke0085, sub-stroke0086,
sub-stroke0087, sub-stroke0088, sub-stroke0089, sub-stroke0090, sub-stroke0091,
sub-stroke0092, sub-stroke0093, sub-stroke0094, sub-stroke0095, sub-stroke0096,
sub-stroke0097, sub-stroke0098, sub-stroke0099, sub-stroke0100, sub-stroke0101,
sub-stroke0102, sub-stroke0103, sub-stroke0104, sub-stroke0105, sub-stroke0106,
sub-stroke0107, sub-stroke0108, sub-stroke0109, sub-stroke0110, sub-stroke0111,
sub-stroke0112, sub-stroke0113, sub-stroke0114, sub-stroke0115, sub-stroke0116,
sub-stroke0117, sub-stroke0118, sub-stroke0119, sub-stroke0133, sub-stroke0134,
sub-stroke0135, sub-stroke0136, sub-stroke0137, sub-stroke0138, sub-stroke0139,
sub-stroke0140, sub-stroke0141, sub-stroke0142, sub-stroke0143, sub-stroke0144,
sub-stroke0145, sub-stroke0146, sub-stroke0147, sub-stroke0148, sub-stroke0149,
sub-stroke0150, sub-stroke0151, sub-stroke0152, sub-stroke0153, sub-stroke0154,
sub-stroke0155, sub-stroke0156, sub-stroke0157, sub-stroke0158, sub-stroke0159,
sub-stroke0161, sub-stroke0162, sub-stroke0163, sub-stroke0164, sub-stroke0165,
sub-stroke0166, sub-stroke0167, sub-stroke0168, sub-stroke0169, sub-stroke0170,
sub-stroke0171, sub-stroke0172, sub-stroke0173, sub-stroke0174, sub-stroke0175,
sub-stroke0176, sub-stroke0177, sub-stroke0178, sub-stroke0179, sub-stroke0180,
sub-stroke0181, sub-stroke0182, sub-stroke0183, sub-stroke0184, sub-stroke0185,
sub-stroke0186, sub-stroke0187, sub-stroke0188, sub-stroke0189
## Environment

- Space: `VibecoderMcSwaggins/stroke-deepisles-demo`
- Hardware: T4-small GPU (limited memory)
- Dataset: `hugging-science/isles24-stroke` (149 parquet files, ~99GB total)
- Dependencies:
  - `datasets @ git+https://github.com/CloseChoice/datasets.git@c1c15aa...` (fork with NIfTI support)
  - `pyarrow` (inherited; contains Bug #1)
  - `huggingface_hub` (used for the Bug #2 fix)
## References

- PyArrow Issue apache/arrow#45214: Bug #1 root cause
- PyArrow Issue apache/arrow#43604: related hang issue
- HF Datasets Issue huggingface/datasets#7467: HF tracking issue
- HF Datasets Issue huggingface/datasets#7357: original report
## Checklist

- Identify Bug #1 root cause (PyArrow streaming hang)
- Identify Bug #2 root cause (OOM on full download)
- Extract all 149 case IDs via HfFileSystem
- Validate direct parquet access works (1.7s per case)
- Implement pre-computed case ID list (`constants.py`)
- Rewrite `get_case()` to use HfFileSystem + pyarrow
- Update tests
- Test on HF Spaces
- Monitor the PyArrow issue for an upstream fix