VibecoderMcSwaggins committed on
Commit e244238 · unverified · 1 Parent(s): a2223b1

fix(data): use streaming mode to fix HF Spaces dataset hang (#15)


* docs: add bug spec for HF Spaces dataset loading loop (P0)

* fix(data): use streaming mode for case ID enumeration to fix HF Spaces hang

Root cause: load_dataset() without streaming downloads and processes ALL
149 NIfTI files before returning, causing "Generating train split" to hang
at ~63 examples on HF Spaces due to resource limits.

Fix:
- Use streaming=True in build_huggingface_dataset() to quickly enumerate
case IDs without downloading binary data
- Lazy-load full dataset only when get_case() is actually called
- Add show_error=True to launch() for better error visibility

This fixes the P0 blocker where the dropdown never populated because
dataset processing exceeded HF Spaces timeout/resource limits.

* chore: address CodeRabbit review feedback

- Fix misleading "metadata-only" comment in adapter.py
- Add code fence language to bug spec markdown
- Convert bare URL to markdown link

docs/specs/08-bug-hf-spaces-dataset-loop.md ADDED
@@ -0,0 +1,166 @@
+ # Bug Spec: HuggingFace Spaces Dataset Loading Loop
+
+ **Status:** Open
+ **Priority:** P0 (Blocks deployment)
+ **Branch:** `debug/hf-spaces-dataset-error`
+ **Date:** 2025-12-08
+
+ ## Observed Behavior
+
+ Container enters an infinite restart loop:
+ 1. Application starts successfully (`Running on local URL: http://0.0.0.0:7860`)
+ 2. Dataset download completes (`Downloading data: 100%|██████████| 149/149`)
+ 3. "Generating train split" begins processing
+ 4. **Container restarts** (new `Application Startup` timestamp)
+ 5. Cycle repeats indefinitely
+
+ The "Select Case" dropdown **never** populates. Users see the "Preparing Space" spinner forever.
+
+ ## Environment
+
+ - **Space:** `VibecoderMcSwaggins/stroke-deepisles-demo`
+ - **Hardware:** T4-small GPU
+ - **Base Image:** `isleschallenge/deepisles:latest`
+ - **Dataset:** `hugging-science/isles24-stroke` (149 NIfTI files, ~2-5 MB each)
+ - **Commit:** `a2223b1`
+
+ ## Timeline from Logs
+
+ ```text
+ 16:43:33 - Application Startup
+ 16:43:33 - Initializing dataset...
+ 16:43:33 - Downloading data: 0%
+ 16:48:10 - Downloading data: 100% (149/149) [~5 min]
+ 16:48:10 - Generating train split: starts
+ 16:56:53 - Application Startup (RESTART - lost all progress)
+ 16:56:53 - Downloading data: 0% (starts over)
+ ```
+
+ ## Hypotheses
+
+ ### H1: Memory OOM during train split generation
+ - Processing 149 NIfTI files into HF Dataset format
+ - Each file is loaded into memory for processing
+ - T4-small may have limited RAM
+ - **Evidence:** Restart happens during the "Generating train split" phase
+
+ ### H2: Disk space exhaustion
+ - HF Spaces ephemeral storage limit (~50 GB, based on the org space error)
+ - DeepISLES base image is large
+ - Dataset download + cache + processing temp files
+ - **Evidence:** Org space explicitly failed with "storage limit exceeded (50G)"
+
+ ### H3: Gradio demo.load() timeout
+ - Does `demo.load()` have an internal timeout?
+ - Does 7+ minutes of dataset loading exceed a limit?
+ - **Evidence:** UI shows "Preparing Space" during load
+
+ ### H4: HF Spaces health check failure
+ - Even though port 7860 is bound, the health check may require a response
+ - Does the long-running `demo.load()` block the event loop?
+ - **Evidence:** Container restarts after ~13 min total
+
+ ### H5: Exception swallowed during train split
+ - Our try/except returns `gr.Dropdown(info=f"Error: {e}")`
+ - But Gradio shows a generic "Error", not our message
+ - Something crashes before our handler runs
+
+ ## Code Under Suspicion
+
+ ### `src/stroke_deepisles_demo/ui/app.py:34-56`
+ ```python
+ def initialize_case_selector() -> gr.Dropdown:
+     try:
+         logger.info("Initializing dataset for case selector...")
+         case_ids = list_case_ids()  # <-- This triggers a full dataset load
+
+         if not case_ids:
+             return gr.Dropdown(choices=[], info="No cases found in dataset.")
+
+         return gr.Dropdown(
+             choices=case_ids,
+             value=case_ids[0],
+             info="Choose a case from isles24-stroke dataset",
+             interactive=True,
+         )
+     except Exception as e:
+         logger.exception("Failed to initialize dataset")
+         return gr.Dropdown(choices=[], info=f"Error loading data: {e!s}")
+ ```
+
+ ### `src/stroke_deepisles_demo/data/loader.py`
+ - `list_case_ids()` calls `load_isles_dataset()`
+ - `load_isles_dataset()` calls HF `load_dataset()` (non-streaming)
+ - The full dataset is downloaded and processed into memory
+
+ ## Potential Fixes
+
+ ### Fix 1: Streaming Mode (Recommended)
+ ```python
+ from datasets import load_dataset
+
+ # Instead of:
+ ds = load_dataset("hugging-science/isles24-stroke")
+
+ # Use streaming:
+ ds = load_dataset("hugging-science/isles24-stroke", streaming=True)
+ case_ids = [ex["case_id"] for ex in ds]  # Iterate without a full load
+ ```
+ - **Pros:** Zero disk usage, immediate start
+ - **Cons:** No random access; must iterate
+
+ ### Fix 2: Lazy case ID loading
+ - Only load case IDs, not the full dataset
+ - Use the HF Hub API to list files without downloading (see the sketch below)
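+
+ A minimal sketch of Fix 2, assuming case IDs can be recovered from repo file
+ paths (the `sub-strokeNNNN` pattern is borrowed from our test fixtures and is
+ an assumption about the actual layout):
+
+ ```python
+ import re
+
+ from huggingface_hub import list_repo_files
+
+ # Lists file paths via the Hub API; no dataset files are downloaded.
+ files = list_repo_files("hugging-science/isles24-stroke", repo_type="dataset")
+
+ # Hypothetical: derive case IDs from path components like "sub-stroke0001".
+ case_ids = sorted({m.group(0) for f in files if (m := re.search(r"sub-stroke\d{4}", f))})
+ ```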
+
+ ### Fix 3: Pre-computed case ID list
+ - Hardcode or cache the 149 case IDs
+ - Skip dataset enumeration entirely for the dropdown (a sketch follows below)
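+
+ A minimal sketch of Fix 3: enumerate IDs once (e.g., locally via Fix 1's
+ streaming mode), commit them as a small JSON file, and read that file at
+ startup. The `case_ids.json` path and helper name are hypothetical:
+
+ ```python
+ import json
+ from pathlib import Path
+
+ CASE_ID_FILE = Path(__file__).parent / "case_ids.json"
+
+ def load_precomputed_case_ids() -> list[str]:
+     # No network access or dataset processing at startup; just read the cached list.
+     return json.loads(CASE_ID_FILE.read_text())
+ ```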
+
+ ### Fix 4: Persistent Storage
+ - Enable the HF Spaces Persistent Storage add-on
+ - Cache survives restarts
+ - **Cons:** Costs money; doesn't fix the root cause
+
+ ### Fix 5: Background thread with timeout
+ - Run the dataset load in a background thread
+ - Show "Loading..." in the dropdown immediately
+ - Update the dropdown when ready (if ever); a rough sketch follows below
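+
+ A rough sketch of Fix 5 (helper names are hypothetical, and the refresh
+ callback is only one of several ways to push the update in Gradio):
+
+ ```python
+ import threading
+
+ import gradio as gr
+
+ from stroke_deepisles_demo.data.loader import list_case_ids
+
+ _case_ids: list[str] = []
+
+ def _load_in_background() -> None:
+     # Runs off the main thread so startup and health checks are not blocked.
+     _case_ids.extend(list_case_ids())  # the slow call
+
+ threading.Thread(target=_load_in_background, daemon=True).start()
+
+ def refresh_dropdown() -> gr.Dropdown:
+     # Wire this to a refresh button or timer; returns an updated dropdown.
+     if not _case_ids:
+         return gr.Dropdown(choices=[], info="Loading cases...")
+     return gr.Dropdown(choices=_case_ids, value=_case_ids[0], interactive=True)
+ ```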
+
+ ## Investigation Needed
+
+ 1. **Get actual error:** What exception/signal causes the restart?
+    - Need HF Spaces runtime logs (not just container logs)
+    - Check for the OOM killer, SIGKILL, etc.
+
+ 2. **Measure resource usage** (see the logging sketch after this list):
+    - Disk usage during download/processing
+    - Memory usage during train split generation
+
+ 3. **Test streaming mode locally:**
+    - Does `streaming=True` work with our dataset?
+    - Can we still get case IDs?
+
+ 4. **Check Gradio demo.load() behavior:**
+    - Is there a timeout?
+    - Does a long-running load block health checks?
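+
+ For item 2, a quick instrumentation sketch using `shutil.disk_usage` from the
+ standard library and `psutil` (assuming `psutil` is available in the image):
+
+ ```python
+ import logging
+ import shutil
+
+ import psutil
+
+ logger = logging.getLogger(__name__)
+
+ def log_resource_usage(label: str) -> None:
+     # Snapshot disk and memory so restarts can be correlated with resource limits.
+     disk = shutil.disk_usage("/")
+     mem = psutil.virtual_memory()
+     logger.info(
+         "[%s] disk %.1f/%.1f GB used, RAM %.1f/%.1f GB used",
+         label, disk.used / 1e9, disk.total / 1e9, mem.used / 1e9, mem.total / 1e9,
+     )
+ ```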
+
+ ## Reproduction Steps
+
+ 1. Go to [the demo space](https://huggingface.co/spaces/VibecoderMcSwaggins/stroke-deepisles-demo)
+ 2. Open the Logs tab
+ 3. Watch the download complete (~5 min)
+ 4. Watch "Generating train split" start
+ 5. Observe the container restart (~7-13 min mark)
+ 6. See the download start over from 0%
+
+ ## Related Issues
+
+ - Org space (`hugging-science/stroke-deepisles-demo`) failed with an explicit "storage limit exceeded (50G)"
+ - This suggests disk space IS a factor
+ - Personal space may have the same limit but hit it more slowly
+
+ ## Next Steps
+
+ 1. [ ] Get deep analysis from senior reviewer / external agent
+ 2. [ ] Test streaming mode locally
+ 3. [ ] Add resource monitoring/logging
+ 4. [ ] Consider the pre-computed case ID approach as a quick fix
src/stroke_deepisles_demo/data/adapter.py CHANGED
@@ -164,7 +164,7 @@ class HuggingFaceDataset:
     _cached_cases: dict[str, CaseFiles] = field(default_factory=dict, repr=False)
 
     def __len__(self) -> int:
-        return len(self._hf_dataset)
+        return len(self._case_ids)
 
     def __iter__(self) -> Iterator[str]:
         return iter(self._case_ids)
@@ -204,6 +204,15 @@ class HuggingFaceDataset:
        self._temp_dir = Path(tempfile.mkdtemp(prefix="isles24_hf_"))
        logger.debug("Created temp directory: %s", self._temp_dir)
 
+        # Lazy-load the full dataset on the first get_case() call.
+        # This defers the expensive download until it is actually needed.
+        if self._hf_dataset is None:
+            from datasets import load_dataset
+
+            logger.info("Loading full dataset for case access (lazy load)...")
+            self._hf_dataset = load_dataset(self.dataset_id, split="train")
+            logger.info("Full dataset loaded: %d examples", len(self._hf_dataset))
+
        # Get the HuggingFace example
        example = self._hf_dataset[idx]
 
@@ -263,6 +272,9 @@ def build_huggingface_dataset(dataset_id: str) -> HuggingFaceDataset:
    """
    Load ISLES24 dataset from HuggingFace Hub.
 
+    Uses streaming mode to quickly enumerate case IDs without downloading
+    the full dataset. Actual data is downloaded lazily when get_case() is called.
+
    Args:
        dataset_id: HuggingFace dataset identifier (e.g., "hugging-science/isles24-stroke")
 
@@ -272,15 +284,23 @@ def build_huggingface_dataset(dataset_id: str) -> HuggingFaceDataset:
    from datasets import load_dataset
 
    logger.info("Loading HuggingFace dataset: %s", dataset_id)
-    hf_dataset = load_dataset(dataset_id, split="train")
 
-    # Extract case IDs
-    case_ids = [example["subject_id"] for example in hf_dataset]
+    # Use streaming to quickly get case IDs without downloading the full dataset.
+    # This avoids the "Generating train split" phase that hangs on HF Spaces.
+    logger.info("Streaming dataset to enumerate case IDs...")
+    streaming_ds = load_dataset(dataset_id, split="train", streaming=True)
+
+    # Extract case IDs from the streaming dataset (accesses only the subject_id
+    # field, deferring heavy binary NIfTI downloads to get_case())
+    case_ids = []
+    for example in streaming_ds:
+        case_ids.append(example["subject_id"])
 
-    logger.info("Loaded %d cases from HuggingFace: %s", len(case_ids), dataset_id)
+    logger.info("Found %d cases from HuggingFace: %s", len(case_ids), dataset_id)
 
+    # Return dataset with lazy loading; full data is downloaded only when get_case() is called.
    return HuggingFaceDataset(
        dataset_id=dataset_id,
-        _hf_dataset=hf_dataset,
+        _hf_dataset=None,  # Lazy load on first get_case()
        _case_ids=case_ids,
    )
src/stroke_deepisles_demo/ui/app.py CHANGED
@@ -232,4 +232,5 @@ if __name__ == "__main__":
        share=settings.gradio_share,
        theme=gr.themes.Soft(),
        css="footer {visibility: hidden}",
+        show_error=True,  # Show full Python tracebacks in UI for debugging
    )
tests/data/test_hf_adapter.py CHANGED
@@ -169,14 +169,19 @@ class TestBuildHuggingFaceDataset:
 
    @patch("datasets.load_dataset")
    def test_loads_dataset_from_hub(self, mock_load_dataset: MagicMock) -> None:
-        """Test that build_huggingface_dataset calls load_dataset correctly."""
-        mock_ds = MagicMock()
-        mock_ds.__iter__ = MagicMock(return_value=iter([{"subject_id": "sub-stroke0001"}]))
-        mock_load_dataset.return_value = mock_ds
+        """Test that build_huggingface_dataset uses streaming to enumerate case IDs."""
+        mock_streaming_ds = MagicMock()
+        mock_streaming_ds.__iter__ = MagicMock(
+            return_value=iter([{"subject_id": "sub-stroke0001"}])
+        )
+        mock_load_dataset.return_value = mock_streaming_ds
 
        result = build_huggingface_dataset("test/my-dataset")
 
-        mock_load_dataset.assert_called_once_with("test/my-dataset", split="train")
+        # Should use streaming mode for initial case ID enumeration
+        mock_load_dataset.assert_called_once_with("test/my-dataset", split="train", streaming=True)
        assert isinstance(result, HuggingFaceDataset)
        assert result.dataset_id == "test/my-dataset"
        assert result._case_ids == ["sub-stroke0001"]
+        # Dataset should be None initially (lazy load)
+        assert result._hf_dataset is None