fix(data): use streaming mode to fix HF Spaces dataset hang (#15)
* docs: add bug spec for HF Spaces dataset loading loop (P0)
* fix(data): use streaming mode for case ID enumeration to fix HF Spaces hang
Root cause: load_dataset() without streaming downloads and processes ALL
149 NIfTI files before returning, causing "Generating train split" to hang
at ~63 examples on HF Spaces due to resource limits.
Fix:
- Use streaming=True in build_huggingface_dataset() to quickly enumerate
case IDs without downloading binary data
- Lazy-load full dataset only when get_case() is actually called
- Add show_error=True to launch() for better error visibility
This fixes the P0 blocker where the dropdown never populated because
dataset processing exceeded HF Spaces timeout/resource limits.
* chore: address CodeRabbit review feedback
- Fix misleading "metadata-only" comment in adapter.py
- Add code fence language to bug spec markdown
- Convert bare URL to markdown link
@@ -0,0 +1,166 @@
# Bug Spec: HuggingFace Spaces Dataset Loading Loop

**Status:** Open
**Priority:** P0 (Blocks deployment)
**Branch:** `debug/hf-spaces-dataset-error`
**Date:** 2025-12-08

## Observed Behavior

The container enters an infinite restart loop:

1. Application starts successfully (`Running on local URL: http://0.0.0.0:7860`)
2. Dataset download completes (`Downloading data: 100%|██████████| 149/149`)
3. "Generating train split" begins processing
4. **Container restarts** (new `Application Startup` timestamp)
5. Cycle repeats indefinitely

The "Select Case" dropdown **never** populates. Users see the "Preparing Space" spinner forever.

## Environment

- **Space:** `VibecoderMcSwaggins/stroke-deepisles-demo`
- **Hardware:** T4-small GPU
- **Base Image:** `isleschallenge/deepisles:latest`
- **Dataset:** `hugging-science/isles24-stroke` (149 NIfTI files, ~2-5 MB each)
- **Commit:** `a2223b1`

## Timeline from Logs

```text
16:43:33 - Application Startup
16:43:33 - Initializing dataset...
16:43:33 - Downloading data: 0%
16:48:10 - Downloading data: 100% (149/149) [~5 min]
16:48:10 - Generating train split: starts
16:56:53 - Application Startup (RESTART - lost all progress)
16:56:53 - Downloading data: 0% (starts over)
```

## Hypotheses

### H1: Memory OOM during train split generation
- Processing 149 NIfTI files into HF Dataset format
- Each file is loaded into memory for processing
- T4-small may have limited RAM
- **Evidence:** The restart happens during the "Generating train split" phase

### H2: Disk space exhaustion
- HF Spaces ephemeral storage limit (~50 GB, based on the org space error)
- The DeepISLES base image is large
- Dataset download + cache + processing temp files add up
- **Evidence:** The org space explicitly failed with "storage limit exceeded (50G)"

### H3: Gradio demo.load() timeout
- Does `demo.load()` have an internal timeout?
- Does 7+ minutes of dataset loading exceed it?
- **Evidence:** The UI shows "Preparing Space" during the load

### H4: HF Spaces health check failure
- Even though port 7860 is bound, the health check may require a response
- Does the long-running `demo.load()` block the event loop?
- **Evidence:** The container restarts after ~13 min total

### H5: Exception swallowed during train split
- Our try/except returns `gr.Dropdown(info=f"Error: {e}")`
- But Gradio shows a generic "Error", not our message
- Something crashes before our handler runs

## Code Under Suspicion

### `src/stroke_deepisles_demo/ui/app.py:34-56`
```python
def initialize_case_selector() -> gr.Dropdown:
    try:
        logger.info("Initializing dataset for case selector...")
        case_ids = list_case_ids()  # <-- This triggers full dataset load

        if not case_ids:
            return gr.Dropdown(choices=[], info="No cases found in dataset.")

        return gr.Dropdown(
            choices=case_ids,
            value=case_ids[0],
            info="Choose a case from isles24-stroke dataset",
            interactive=True,
        )
    except Exception as e:
        logger.exception("Failed to initialize dataset")
        return gr.Dropdown(choices=[], info=f"Error loading data: {e!s}")
```

### `src/stroke_deepisles_demo/data/loader.py`
- `list_case_ids()` calls `load_isles_dataset()`
- `load_isles_dataset()` calls HF `load_dataset()` (non-streaming)
- The full dataset is downloaded and processed into memory

## Potential Fixes

### Fix 1: Streaming Mode (Recommended)
```python
# Instead of:
ds = load_dataset("hugging-science/isles24-stroke")

# Use streaming:
ds = load_dataset("hugging-science/isles24-stroke", streaming=True)
case_ids = [ex["case_id"] for ex in ds]  # Iterate without full load
```
- **Pros:** Zero disk usage, immediate start
- **Cons:** No random access; must iterate

### Fix 2: Lazy case ID loading
- Only load case IDs, not the full dataset
- Use the HF Hub API to list files without downloading (see the sketch below)
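
A minimal sketch of this approach, assuming the `huggingface_hub` client and that case IDs appear as `sub-strokeNNNN` components in the repo's file paths (the repo layout is an unverified assumption):

```python
# Hypothetical Fix 2 sketch: derive case IDs from the file listing alone;
# no dataset download and no "Generating train split" phase at startup.
import re

from huggingface_hub import HfApi


def list_case_ids_via_hub(dataset_id: str) -> list[str]:
    # Lists paths in the dataset repo without fetching any file contents.
    files = HfApi().list_repo_files(dataset_id, repo_type="dataset")
    ids: set[str] = set()
    for path in files:
        match = re.search(r"sub-stroke\d+", path)  # assumed ID pattern
        if match:
            ids.add(match.group(0))
    return sorted(ids)
```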

### Fix 3: Pre-computed case ID list
- Hardcode or cache the 149 case IDs
- Skip dataset enumeration entirely for the dropdown (see the sketch below)
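
A minimal sketch, assuming the zero-padded `sub-strokeNNNN` naming holds for all 149 cases (not verified against the dataset):

```python
# Hypothetical Fix 3 sketch: the dropdown never touches the dataset at startup.
KNOWN_CASE_IDS: list[str] = [f"sub-stroke{i:04d}" for i in range(1, 150)]


def list_case_ids() -> list[str]:
    # No download, no train split generation; just return the cached list.
    return KNOWN_CASE_IDS
```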

### Fix 4: Persistent Storage
- Enable the HF Spaces Persistent Storage add-on
- Cache survives restarts
- **Cons:** Costs money; doesn't fix the root cause

### Fix 5: Background thread with timeout
- Run the dataset load in a background thread
- Show "Loading..." in the dropdown immediately
- Update the dropdown when ready (if ever); see the sketch below
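
A minimal sketch of this idea; `refresh_dropdown` is a hypothetical handler that the UI would have to poll (e.g., via a refresh button), and `list_case_ids` is the existing loader call:

```python
# Hypothetical Fix 5 sketch: enumerate cases off the main thread so Gradio
# binds port 7860 (and can answer health checks) immediately.
import threading

import gradio as gr

from stroke_deepisles_demo.data.loader import list_case_ids  # existing loader

_case_ids: list[str] = []
_load_error: Exception | None = None


def _load_in_background() -> None:
    global _load_error
    try:
        _case_ids.extend(list_case_ids())
    except Exception as exc:
        _load_error = exc


threading.Thread(target=_load_in_background, daemon=True).start()


def refresh_dropdown() -> gr.Dropdown:
    # Wire to a "Refresh" button or polling event in the Blocks UI.
    if _load_error is not None:
        return gr.Dropdown(choices=[], info=f"Error loading data: {_load_error}")
    if not _case_ids:
        return gr.Dropdown(choices=[], info="Still loading cases...")
    return gr.Dropdown(choices=_case_ids, value=_case_ids[0], interactive=True)
```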

## Investigation Needed

1. **Get actual error:** What exception/signal causes the restart?
   - Need HF Spaces runtime logs (not just container logs)
   - Check for the OOM killer, SIGKILL, etc.

2. **Measure resource usage:**
   - Disk usage during download/processing
   - Memory usage during train split generation

3. **Test streaming mode locally** (see the sketch after this list):
   - Does `streaming=True` work with our dataset?
   - Can we still get case IDs?

4. **Check Gradio `demo.load()` behavior:**
   - Is there a timeout?
   - Does a long-running load block health checks?
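
A throwaway script for item 3; it prints the available columns instead of assuming them, since the exact ID field name (`case_id` vs. `subject_id`) is part of what needs confirming:

```python
# Quick local check of streaming mode (investigation item 3).
from datasets import load_dataset

ds = load_dataset("hugging-science/isles24-stroke", split="train", streaming=True)
first = next(iter(ds))  # should yield one example without a full download
print(sorted(first.keys()))  # confirm which columns exist
print(first.get("subject_id") or first.get("case_id"))  # candidate ID fields
```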

## Reproduction Steps

1. Go to [the demo space](https://huggingface.co/spaces/VibecoderMcSwaggins/stroke-deepisles-demo)
2. Open the Logs tab
3. Watch the download complete (~5 min)
4. Watch "Generating train split" start
5. Observe the container restart (at the ~7-13 min mark)
6. See the download start over from 0%

## Related Issues

- The org space (`hugging-science/stroke-deepisles-demo`) failed with an explicit "storage limit exceeded (50G)" error
- This suggests disk space IS a factor
- The personal space may have the same limit but hit it more slowly

## Next Steps

1. [ ] Get deep analysis from a senior reviewer / external agent
2. [ ] Test streaming mode locally
3. [ ] Add resource monitoring/logging
4. [ ] Consider the pre-computed case ID approach as a quick fix

@@ -164,7 +164,7 @@ class HuggingFaceDataset:
     _cached_cases: dict[str, CaseFiles] = field(default_factory=dict, repr=False)

     def __len__(self) -> int:
-        return len(self._hf_dataset)
+        return len(self._case_ids)

     def __iter__(self) -> Iterator[str]:
         return iter(self._case_ids)

@@ -204,6 +204,15 @@ class HuggingFaceDataset:
         self._temp_dir = Path(tempfile.mkdtemp(prefix="isles24_hf_"))
         logger.debug("Created temp directory: %s", self._temp_dir)

+        # Lazy load full dataset on first get_case() call
+        # This defers the expensive download until actually needed
+        if self._hf_dataset is None:
+            from datasets import load_dataset
+
+            logger.info("Loading full dataset for case access (lazy load)...")
+            self._hf_dataset = load_dataset(self.dataset_id, split="train")
+            logger.info("Full dataset loaded: %d examples", len(self._hf_dataset))
+
         # Get the HuggingFace example
         example = self._hf_dataset[idx]

@@ -263,6 +272,9 @@ def build_huggingface_dataset(dataset_id: str) -> HuggingFaceDataset:
     """
     Load ISLES24 dataset from HuggingFace Hub.

+    Uses streaming mode to quickly enumerate case IDs without downloading
+    the full dataset. Actual data is downloaded lazily when get_case() is called.
+
     Args:
         dataset_id: HuggingFace dataset identifier (e.g., "hugging-science/isles24-stroke")

@@ -272,15 +284,23 @@ def build_huggingface_dataset(dataset_id: str) -> HuggingFaceDataset:
     from datasets import load_dataset

     logger.info("Loading HuggingFace dataset: %s", dataset_id)
-    hf_dataset = load_dataset(dataset_id, split="train")

-    # …
-    …
+    # Use streaming to quickly get case IDs without downloading full dataset
+    # This avoids the "Generating train split" phase that hangs on HF Spaces
+    logger.info("Streaming dataset to enumerate case IDs...")
+    streaming_ds = load_dataset(dataset_id, split="train", streaming=True)
+
+    # Extract case IDs from streaming dataset (accesses only subject_id field,
+    # deferring heavy binary NIfTI downloads to get_case())
+    case_ids = []
+    for example in streaming_ds:
+        case_ids.append(example["subject_id"])

-    logger.info("…
+    logger.info("Found %d cases from HuggingFace: %s", len(case_ids), dataset_id)

+    # Return dataset with lazy loading - full data downloaded only when get_case() called
     return HuggingFaceDataset(
         dataset_id=dataset_id,
-        _hf_dataset=hf_dataset,
+        _hf_dataset=None,  # Lazy load on first get_case()
         _case_ids=case_ids,
     )
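
For reference, a quick local smoke test of the new code path; the import path is a guess based on the module layout cited in the spec above:

```python
# Hypothetical smoke test: case ID enumeration should now be fast and
# should not trigger any NIfTI downloads.
from stroke_deepisles_demo.data.loader import build_huggingface_dataset  # path assumed

ds = build_huggingface_dataset("hugging-science/isles24-stroke")
print(len(ds), "cases")  # expect 149
print(next(iter(ds)))  # first case ID, e.g. "sub-stroke0001"
```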

@@ -232,4 +232,5 @@ if __name__ == "__main__":
         share=settings.gradio_share,
         theme=gr.themes.Soft(),
         css="footer {visibility: hidden}",
+        show_error=True,  # Show full Python tracebacks in UI for debugging
     )

@@ -169,14 +169,19 @@ class TestBuildHuggingFaceDataset:

     @patch("datasets.load_dataset")
     def test_loads_dataset_from_hub(self, mock_load_dataset: MagicMock) -> None:
-        """Test that build_huggingface_dataset …
-        …
-        …
-        …
+        """Test that build_huggingface_dataset uses streaming to enumerate case IDs."""
+        mock_streaming_ds = MagicMock()
+        mock_streaming_ds.__iter__ = MagicMock(
+            return_value=iter([{"subject_id": "sub-stroke0001"}])
+        )
+        mock_load_dataset.return_value = mock_streaming_ds

         result = build_huggingface_dataset("test/my-dataset")

-        …
+        # Should use streaming mode for initial case ID enumeration
+        mock_load_dataset.assert_called_once_with("test/my-dataset", split="train", streaming=True)
         assert isinstance(result, HuggingFaceDataset)
         assert result.dataset_id == "test/my-dataset"
         assert result._case_ids == ["sub-stroke0001"]
+        # Dataset should be None initially (lazy load)
+        assert result._hf_dataset is None