VibecoderMcSwaggins committed on
Commit e244238 · unverified · 1 Parent(s): a2223b1

fix(data): use streaming mode to fix HF Spaces dataset hang (#15)


* docs: add bug spec for HF Spaces dataset loading loop (P0)

* fix(data): use streaming mode for case ID enumeration to fix HF Spaces hang

Root cause: load_dataset() without streaming downloads and processes ALL
149 NIfTI files before returning, causing "Generating train split" to hang
at ~63 examples on HF Spaces due to resource limits.

Fix:
- Use streaming=True in build_huggingface_dataset() to quickly enumerate
case IDs without downloading binary data
- Lazy-load full dataset only when get_case() is actually called
- Add show_error=True to launch() for better error visibility

This fixes the P0 blocker where the dropdown never populated because
dataset processing exceeded HF Spaces timeout/resource limits.

* chore: address CodeRabbit review feedback

- Fix misleading "metadata-only" comment in adapter.py
- Add code fence language to bug spec markdown
- Convert bare URL to markdown link

docs/specs/08-bug-hf-spaces-dataset-loop.md ADDED
@@ -0,0 +1,166 @@
+ # Bug Spec: HuggingFace Spaces Dataset Loading Loop
+
+ **Status:** Open
+ **Priority:** P0 (Blocks deployment)
+ **Branch:** `debug/hf-spaces-dataset-error`
+ **Date:** 2025-12-08
+
+ ## Observed Behavior
+
+ Container enters an infinite restart loop:
+ 1. Application starts successfully (`Running on local URL: http://0.0.0.0:7860`)
+ 2. Dataset download completes (`Downloading data: 100%|██████████| 149/149`)
+ 3. "Generating train split" begins processing
+ 4. **Container restarts** (new `Application Startup` timestamp)
+ 5. Cycle repeats indefinitely
+
+ The "Select Case" dropdown **never** populates. Users see the "Preparing Space" spinner forever.
+
+ ## Environment
+
+ - **Space:** `VibecoderMcSwaggins/stroke-deepisles-demo`
+ - **Hardware:** T4-small GPU
+ - **Base Image:** `isleschallenge/deepisles:latest`
+ - **Dataset:** `hugging-science/isles24-stroke` (149 NIfTI files, ~2-5 MB each)
+ - **Commit:** `a2223b1`
+
+ ## Timeline from Logs
+
+ ```text
+ 16:43:33 - Application Startup
+ 16:43:33 - Initializing dataset...
+ 16:43:33 - Downloading data: 0%
+ 16:48:10 - Downloading data: 100% (149/149) [~5 min]
+ 16:48:10 - Generating train split: starts
+ 16:56:53 - Application Startup (RESTART - lost all progress)
+ 16:56:53 - Downloading data: 0% (starts over)
+ ```
+
+ ## Hypotheses
+
+ ### H1: Memory OOM during train split generation
+ - Processing 149 NIfTI files into HF Dataset format
+ - Each file is loaded into memory for processing
+ - T4-small may have limited RAM
+ - **Evidence:** Restart happens during the "Generating train split" phase
+
+ ### H2: Disk space exhaustion
+ - HF Spaces ephemeral storage limit (~50 GB, based on the org space error)
+ - DeepISLES base image is large
+ - Dataset download + cache + processing temp files
+ - **Evidence:** Org space explicitly failed with "storage limit exceeded (50G)"
+
+ ### H3: Gradio demo.load() timeout
+ - Does `demo.load()` have an internal timeout?
+ - Does 7+ minutes of dataset loading exceed a limit?
+ - **Evidence:** UI shows "Preparing Space" during load
+
+ ### H4: HF Spaces health check failure
+ - Even though port 7860 is bound, the health check may require a response
+ - Does the long-running `demo.load()` block the event loop?
+ - **Evidence:** Container restarts after ~13 min total
+
+ ### H5: Exception swallowed during train split
+ - Our try/except returns `gr.Dropdown(info=f"Error: {e}")`
+ - But Gradio shows a generic "Error", not our message
+ - Something crashes before our handler runs
+
+ ## Code Under Suspicion
+
+ ### `src/stroke_deepisles_demo/ui/app.py:34-56`
+ ```python
+ def initialize_case_selector() -> gr.Dropdown:
+     try:
+         logger.info("Initializing dataset for case selector...")
+         case_ids = list_case_ids()  # <-- This triggers a full dataset load
+
+         if not case_ids:
+             return gr.Dropdown(choices=[], info="No cases found in dataset.")
+
+         return gr.Dropdown(
+             choices=case_ids,
+             value=case_ids[0],
+             info="Choose a case from isles24-stroke dataset",
+             interactive=True,
+         )
+     except Exception as e:
+         logger.exception("Failed to initialize dataset")
+         return gr.Dropdown(choices=[], info=f"Error loading data: {e!s}")
+ ```
+
+ ### `src/stroke_deepisles_demo/data/loader.py`
+ - `list_case_ids()` calls `load_isles_dataset()`
+ - `load_isles_dataset()` calls HF `load_dataset()` (non-streaming)
+ - The full dataset is downloaded and processed into memory
+
+ ## Potential Fixes
+
+ ### Fix 1: Streaming Mode (Recommended)
+ ```python
+ from datasets import load_dataset
+
+ # Instead of:
+ ds = load_dataset("hugging-science/isles24-stroke")
+
+ # Use streaming:
+ ds = load_dataset("hugging-science/isles24-stroke", streaming=True)
+ case_ids = [ex["case_id"] for ex in ds]  # Iterate without a full load
+ ```
+ - **Pros:** Zero disk usage, immediate start
+ - **Cons:** No random access; must iterate
+
+ ### Fix 2: Lazy case ID loading
+ - Only load case IDs, not the full dataset
+ - Use the HF Hub API to list files without downloading (see the sketch below)
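+
+ A minimal sketch of Fix 2, assuming case IDs can be recovered from repo file
+ paths (the `sub-strokeNNNN` pattern is borrowed from our test fixtures and is
+ an assumption about the actual layout):
+
+ ```python
+ import re
+
+ from huggingface_hub import list_repo_files
+
+ # Lists file paths via the Hub API; no dataset files are downloaded.
+ files = list_repo_files("hugging-science/isles24-stroke", repo_type="dataset")
+
+ # Hypothetical: derive case IDs from path components like "sub-stroke0001".
+ case_ids = sorted({m.group(0) for f in files if (m := re.search(r"sub-stroke\d{4}", f))})
+ ```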
+
+ ### Fix 3: Pre-computed case ID list
+ - Hardcode or cache the 149 case IDs
+ - Skip dataset enumeration entirely for the dropdown (a sketch follows below)
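+
+ A minimal sketch of Fix 3: enumerate IDs once (e.g., locally via Fix 1's
+ streaming mode), commit them as a small JSON file, and read that file at
+ startup. The `case_ids.json` path and helper name are hypothetical:
+
+ ```python
+ import json
+ from pathlib import Path
+
+ CASE_ID_FILE = Path(__file__).parent / "case_ids.json"
+
+ def load_precomputed_case_ids() -> list[str]:
+     # No network access or dataset processing at startup; just read the cached list.
+     return json.loads(CASE_ID_FILE.read_text())
+ ```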
+
+ ### Fix 4: Persistent Storage
+ - Enable the HF Spaces Persistent Storage add-on
+ - Cache survives restarts
+ - **Cons:** Costs money; doesn't fix the root cause
+
+ ### Fix 5: Background thread with timeout
+ - Run the dataset load in a background thread
+ - Show "Loading..." in the dropdown immediately
+ - Update the dropdown when ready (if ever); a rough sketch follows below
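+
+ A rough sketch of Fix 5 (helper names are hypothetical, and the refresh
+ callback is only one of several ways to push the update in Gradio):
+
+ ```python
+ import threading
+
+ import gradio as gr
+
+ from stroke_deepisles_demo.data.loader import list_case_ids
+
+ _case_ids: list[str] = []
+
+ def _load_in_background() -> None:
+     # Runs off the main thread so startup and health checks are not blocked.
+     _case_ids.extend(list_case_ids())  # the slow call
+
+ threading.Thread(target=_load_in_background, daemon=True).start()
+
+ def refresh_dropdown() -> gr.Dropdown:
+     # Wire this to a refresh button or timer; returns an updated dropdown.
+     if not _case_ids:
+         return gr.Dropdown(choices=[], info="Loading cases...")
+     return gr.Dropdown(choices=_case_ids, value=_case_ids[0], interactive=True)
+ ```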
+
+ ## Investigation Needed
+
+ 1. **Get actual error:** What exception/signal causes the restart?
+    - Need HF Spaces runtime logs (not just container logs)
+    - Check for the OOM killer, SIGKILL, etc.
+
+ 2. **Measure resource usage** (see the logging sketch after this list):
+    - Disk usage during download/processing
+    - Memory usage during train split generation
+
+ 3. **Test streaming mode locally:**
+    - Does `streaming=True` work with our dataset?
+    - Can we still get case IDs?
+
+ 4. **Check Gradio demo.load() behavior:**
+    - Is there a timeout?
+    - Does a long-running load block health checks?
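+
+ For item 2, a quick instrumentation sketch using `shutil.disk_usage` from the
+ standard library and `psutil` (assuming `psutil` is available in the image):
+
+ ```python
+ import logging
+ import shutil
+
+ import psutil
+
+ logger = logging.getLogger(__name__)
+
+ def log_resource_usage(label: str) -> None:
+     # Snapshot disk and memory so restarts can be correlated with resource limits.
+     disk = shutil.disk_usage("/")
+     mem = psutil.virtual_memory()
+     logger.info(
+         "[%s] disk %.1f/%.1f GB used, RAM %.1f/%.1f GB used",
+         label, disk.used / 1e9, disk.total / 1e9, mem.used / 1e9, mem.total / 1e9,
+     )
+ ```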
+
+ ## Reproduction Steps
+
+ 1. Go to [the demo space](https://huggingface.co/spaces/VibecoderMcSwaggins/stroke-deepisles-demo)
+ 2. Open the Logs tab
+ 3. Watch the download complete (~5 min)
+ 4. Watch "Generating train split" start
+ 5. Observe the container restart (~7-13 min mark)
+ 6. See the download start over from 0%
+
+ ## Related Issues
+
+ - Org space (`hugging-science/stroke-deepisles-demo`) failed with an explicit "storage limit exceeded (50G)"
+ - This suggests disk space IS a factor
+ - Personal space may have the same limit but hit it more slowly
+
+ ## Next Steps
+
+ 1. [ ] Get deep analysis from senior reviewer / external agent
+ 2. [ ] Test streaming mode locally
+ 3. [ ] Add resource monitoring/logging
+ 4. [ ] Consider the pre-computed case ID approach as a quick fix
src/stroke_deepisles_demo/data/adapter.py CHANGED
@@ -164,7 +164,7 @@ class HuggingFaceDataset:
     _cached_cases: dict[str, CaseFiles] = field(default_factory=dict, repr=False)
 
     def __len__(self) -> int:
-        return len(self._hf_dataset)
+        return len(self._case_ids)
 
     def __iter__(self) -> Iterator[str]:
         return iter(self._case_ids)
@@ -204,6 +204,15 @@ class HuggingFaceDataset:
        self._temp_dir = Path(tempfile.mkdtemp(prefix="isles24_hf_"))
        logger.debug("Created temp directory: %s", self._temp_dir)
 
+        # Lazy-load the full dataset on the first get_case() call.
+        # This defers the expensive download until it is actually needed.
+        if self._hf_dataset is None:
+            from datasets import load_dataset
+
+            logger.info("Loading full dataset for case access (lazy load)...")
+            self._hf_dataset = load_dataset(self.dataset_id, split="train")
+            logger.info("Full dataset loaded: %d examples", len(self._hf_dataset))
+
        # Get the HuggingFace example
        example = self._hf_dataset[idx]
 
@@ -263,6 +272,9 @@ def build_huggingface_dataset(dataset_id: str) -> HuggingFaceDataset:
    """
    Load ISLES24 dataset from HuggingFace Hub.
 
+    Uses streaming mode to quickly enumerate case IDs without downloading
+    the full dataset. Actual data is downloaded lazily when get_case() is called.
+
    Args:
        dataset_id: HuggingFace dataset identifier (e.g., "hugging-science/isles24-stroke")
 
@@ -272,15 +284,23 @@ def build_huggingface_dataset(dataset_id: str) -> HuggingFaceDataset:
    from datasets import load_dataset
 
    logger.info("Loading HuggingFace dataset: %s", dataset_id)
-    hf_dataset = load_dataset(dataset_id, split="train")
 
-    # Extract case IDs
-    case_ids = [example["subject_id"] for example in hf_dataset]
+    # Use streaming to quickly get case IDs without downloading the full dataset.
+    # This avoids the "Generating train split" phase that hangs on HF Spaces.
+    logger.info("Streaming dataset to enumerate case IDs...")
+    streaming_ds = load_dataset(dataset_id, split="train", streaming=True)
+
+    # Extract case IDs from the streaming dataset (accesses only the subject_id
+    # field, deferring heavy binary NIfTI downloads to get_case())
+    case_ids = []
+    for example in streaming_ds:
+        case_ids.append(example["subject_id"])
 
-    logger.info("Loaded %d cases from HuggingFace: %s", len(case_ids), dataset_id)
+    logger.info("Found %d cases from HuggingFace: %s", len(case_ids), dataset_id)
 
+    # Return dataset with lazy loading; full data is downloaded only when get_case() is called.
    return HuggingFaceDataset(
        dataset_id=dataset_id,
-        _hf_dataset=hf_dataset,
+        _hf_dataset=None,  # Lazy load on first get_case()
        _case_ids=case_ids,
    )
src/stroke_deepisles_demo/ui/app.py CHANGED
@@ -232,4 +232,5 @@ if __name__ == "__main__":
        share=settings.gradio_share,
        theme=gr.themes.Soft(),
        css="footer {visibility: hidden}",
+        show_error=True,  # Show full Python tracebacks in UI for debugging
    )
tests/data/test_hf_adapter.py CHANGED
@@ -169,14 +169,19 @@ class TestBuildHuggingFaceDataset:
 
    @patch("datasets.load_dataset")
    def test_loads_dataset_from_hub(self, mock_load_dataset: MagicMock) -> None:
-        """Test that build_huggingface_dataset calls load_dataset correctly."""
-        mock_ds = MagicMock()
-        mock_ds.__iter__ = MagicMock(return_value=iter([{"subject_id": "sub-stroke0001"}]))
-        mock_load_dataset.return_value = mock_ds
+        """Test that build_huggingface_dataset uses streaming to enumerate case IDs."""
+        mock_streaming_ds = MagicMock()
+        mock_streaming_ds.__iter__ = MagicMock(
+            return_value=iter([{"subject_id": "sub-stroke0001"}])
+        )
+        mock_load_dataset.return_value = mock_streaming_ds
 
        result = build_huggingface_dataset("test/my-dataset")
 
-        mock_load_dataset.assert_called_once_with("test/my-dataset", split="train")
+        # Should use streaming mode for initial case ID enumeration
+        mock_load_dataset.assert_called_once_with("test/my-dataset", split="train", streaming=True)
        assert isinstance(result, HuggingFaceDataset)
        assert result.dataset_id == "test/my-dataset"
        assert result._case_ids == ["sub-stroke0001"]
+        # Dataset should be None initially (lazy load)
+        assert result._hf_dataset is None