fix: address P1/P2 bugs from audit (BUG-005/006/008/010/011) (#40)
Fixes validated from first principles by reading the actual source code:

BUG-005 (P1): Exception handling converts 400→500
- Added `except HTTPException: raise` before `except Exception:` in
  create_segment_job() to preserve intentional HTTP status codes

BUG-006 (P1): Unbounded concurrent inference
- Added MAX_CONCURRENT_JOBS=10 config constant
- Added get_active_job_count() to JobStore
- Added 503 response when the limit is reached (prevents OOM/GPU exhaustion)

BUG-008 (P2): Temp directory leak in convenience functions
- get_case() and list_case_ids() now use the context manager
- Prevents unbounded /tmp growth from HuggingFace temp files

BUG-010 (P2): Case ID logged despite sensitivity comment
- Removed case_id from the completion log message
- Comment at line 118 says "don't log case_id" - now consistent

BUG-011 (P2): Mock filenames don't match real backend
- Mock now uses {case_id}_dwi.nii.gz (matches pipeline.py:123)
- Mock now uses lesion_msk.nii.gz (matches direct.py:85)

Test results:
- ruff check: All passed
- mypy: No issues in 7 files
- tsc --noEmit: Passed
- vitest: 70/70 tests passed

* docs: update audit status, add NEXT-CONCERNS, apply CodeRabbit feedback

- Updated BUGS-AUDIT-2024-12-12.md:
  - Removed BUG-007 (no auth) - intentional for the public demo
  - Added a fix status table showing which bugs are fixed
  - Updated status from AWAITING REVIEW to FIXED
- Created NEXT-CONCERNS.md for deferred items:
  - BUG-009 (config drift) - needs investigation
  - BUG-012 (loose pinning) - mitigated by lock files
- Applied CodeRabbit nitpicks:
  - Clarified the MAX_CONCURRENT_JOBS comment (queue depth vs serialization)
  - Simplified the handlers.ts comment (removed file:line references)
|
@@ -0,0 +1,208 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Bug Audit Report - 2024-12-12
|
| 2 |
+
|
| 3 |
+
Consolidated audit findings validated from first principles by reading source code.
|
| 4 |
+
**Status: FIXED** - P1 and P2 bugs addressed in PR #40.
|
| 5 |
+
|
| 6 |
+
---
|
| 7 |
+
|
| 8 |
+
## P1 - High Priority (Functional Bugs)
|

### BUG-005: Exception Handling Converts 400→500

**File:** `src/stroke_deepisles_demo/api/routes.py:127-129`

**Issue:** The exception handler catches ALL exceptions including HTTPException,
converting intentional 400 Bad Request responses into 500 Internal Server Errors.

```python
# Lines 96-99: Raises 400 for invalid case_id
if body.case_id not in valid_cases:
    raise HTTPException(
        status_code=400,
        detail=f"Invalid case ID: '{body.case_id}'. Use GET /api/cases for available cases.",
    )

# Lines 127-129: BUG - catches the HTTPException above and converts to 500
except Exception:
    logger.exception("Failed to create segmentation job")
    raise HTTPException(status_code=500, detail="Failed to create segmentation job") from None
```

**Impact:** Frontend receives 500 errors for invalid input, making debugging difficult.

**Fix:** Add `except HTTPException: raise` before `except Exception:` to re-raise HTTP exceptions.
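
The ordering requirement can be illustrated without FastAPI. A minimal sketch of the pattern in plain Python, where `HTTPException` is a stand-in class (not FastAPI's), so the status-code behavior can be checked directly:

```python
class HTTPException(Exception):
    """Stand-in for fastapi.HTTPException (illustration only)."""

    def __init__(self, status_code: int, detail: str = ""):
        self.status_code = status_code
        self.detail = detail


def create_job(case_id: str, valid_cases: set[str]) -> str:
    try:
        if case_id not in valid_cases:
            raise HTTPException(status_code=400, detail=f"Invalid case ID: '{case_id}'")
        return "job-created"
    except HTTPException:
        # The BUG-005 fix: re-raise intentional HTTP errors unchanged.
        # Without this clause, the handler below would turn 400 into 500.
        raise
    except Exception:
        raise HTTPException(status_code=500, detail="Failed to create job") from None


try:
    create_job("bogus", {"case_001"})
except HTTPException as exc:
    print(exc.status_code)  # 400 is preserved, not converted to 500
```

Because `except` clauses are tried in order, the narrower `HTTPException` clause must come first; a bare `except Exception` placed first would shadow it.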

---

### BUG-006: Unbounded Concurrent Inference

**Files:** `src/stroke_deepisles_demo/api/routes.py`, `src/stroke_deepisles_demo/api/main.py`

**Issue:** No rate limiting, semaphores, or concurrency controls exist. Every request
spawns a GPU-bound background task without limits.

**Validation:** `grep -r "semaphore|rate.?limit|concurrent|max.*jobs|throttl"` returns 0 hits.

**Impact:**
- N concurrent requests = N GPU inference tasks = OOM or GPU memory exhaustion
- Denial of service risk (intentional or accidental)
- HF Spaces free tier could be overwhelmed

**Fix Options:**
1. Add `asyncio.Semaphore(1)` to limit concurrent inference
2. Add a job queue depth limit (reject if more than N pending jobs)
3. Add rate limiting middleware (slowapi or similar)
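
Option 2 (the one PR #40 adopts) can be sketched in plain Python. `MiniJobStore` and `try_accept` below are illustrative stand-ins for the real `JobStore` and route handler, not code from the PR:

```python
import threading
from enum import Enum

MAX_CONCURRENT_JOBS = 10  # queue-depth cap, not GPU serialization


class JobStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"


class MiniJobStore:
    """Illustrative stand-in for the PR's JobStore."""

    def __init__(self) -> None:
        self._jobs: dict[str, JobStatus] = {}
        self._lock = threading.Lock()

    def create_job(self, job_id: str) -> None:
        with self._lock:
            self._jobs[job_id] = JobStatus.PENDING

    def complete_job(self, job_id: str) -> None:
        with self._lock:
            self._jobs[job_id] = JobStatus.COMPLETED

    def get_active_job_count(self) -> int:
        # Count pending + running under the lock, mirroring the PR's fix
        with self._lock:
            return sum(
                1 for s in self._jobs.values()
                if s in (JobStatus.PENDING, JobStatus.RUNNING)
            )


def try_accept(store: MiniJobStore, job_id: str) -> bool:
    """Reject (the route returns 503; here: False) when the cap is reached."""
    if store.get_active_job_count() >= MAX_CONCURRENT_JOBS:
        return False
    store.create_job(job_id)
    return True
```

Completed jobs drop out of the count, so capacity is reclaimed as jobs finish; the cap bounds queue depth rather than serializing GPU work.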

---

## P2 - Medium Priority (Resource Leaks, Misconfigurations)

### BUG-008: Temp Directory Leak in Convenience Functions

**File:** `src/stroke_deepisles_demo/data/__init__.py:20-34`

**Issue:** `get_case()` and `list_case_ids()` create dataset instances without using
the context manager, causing temp directory accumulation.

```python
def get_case(case_id: str | int) -> CaseFiles:
    dataset = load_isles_dataset()  # Creates temp dir internally
    return dataset.get_case(case_id)  # cleanup() never called!

def list_case_ids() -> list[str]:
    dataset = load_isles_dataset()  # Creates temp dir internally
    return dataset.list_case_ids()  # cleanup() never called!
```

**Validation:** The loader.py docstring explicitly says:
> "Use as context manager: `with load_isles_dataset() as ds: ...` for automatic cleanup"

**Impact:** Unbounded /tmp growth over time, eventual disk exhaustion.

**Fix:** Refactor to use the context manager or call `dataset.cleanup()` explicitly.
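
The cleanup guarantee can be demonstrated with a toy dataset class. `ToyDataset` is a stand-in for the real loader; the assumption (stated here, not taken from the source) is only that the real loader allocates a temp dir and frees it in `cleanup()`:

```python
import shutil
import tempfile
from pathlib import Path


class ToyDataset:
    """Stand-in for the real ISLES dataset loader (illustration only)."""

    def __init__(self) -> None:
        # Assumed behavior: loader extracts files into a temp dir like this
        self.tmp_dir = Path(tempfile.mkdtemp(prefix="isles-"))

    def list_case_ids(self) -> list[str]:
        return ["case_001", "case_002"]

    def cleanup(self) -> None:
        shutil.rmtree(self.tmp_dir, ignore_errors=True)

    def __enter__(self) -> "ToyDataset":
        return self

    def __exit__(self, *exc_info: object) -> None:
        self.cleanup()


def list_case_ids() -> list[str]:
    # The BUG-008 fix shape: the context manager guarantees cleanup() runs,
    # even if the body raises
    with ToyDataset() as ds:
        return ds.list_case_ids()
```

The key property is that `__exit__` runs on both normal return and exception, which a bare `dataset = ...; return ...` sequence cannot guarantee.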

---

### BUG-009: Results Directory Configuration Drift

**Files:**
- `src/stroke_deepisles_demo/core/config.py:93` → `results_dir: Path = Path("./results")`
- `src/stroke_deepisles_demo/api/config.py:10` → `RESULTS_DIR = Path("/tmp/stroke-results")`

**Issue:** Two different default values for the results directory in two config modules.
The API uses `/tmp/stroke-results`; core uses `./results`.

**Impact:**
- Confusion about the canonical location
- Risk of code using the wrong config and writing to the wrong location
- core/config.py's `./results` won't work on HF Spaces (not writable)

**Fix:** Consolidate to a single source of truth. Remove one or make them reference each other.

---

### BUG-010: Case ID Logged Despite Sensitivity Comment

**File:** `src/stroke_deepisles_demo/api/routes.py:264`

**Issue:** The code explicitly logs case_id despite the comment on line 118 saying not to:

```python
# Line 118 (comment):
# Note: Don't log case_id as it may be sensitive (medical domain)

# Line 264 (violation):
logger.info(
    "Job %s completed: case=%s, dice=%.3f, time=%.1fs",
    job_id,
    case_id,  # <-- Logged here
    result.dice_score or 0,
    result.elapsed_seconds,
)
```

**Impact:** Potential PHI/PII exposure in logs if case IDs contain patient identifiers.

**Fix:** Remove `case=%s` from the log format string and remove the `case_id` argument.
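
The corrected call can be exercised with stdlib logging alone. The identifiers and values below are made up for illustration; the log line matches the fixed format string:

```python
import io
import logging

# Capture log output in memory so it can be inspected
log_buffer = io.StringIO()
handler = logging.StreamHandler(log_buffer)
logger = logging.getLogger("demo.routes")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

job_id = "abc123"
case_id = "sub-stroke0001"  # hypothetical sensitive identifier, must NOT appear in logs
dice_score = 0.847
elapsed_seconds = 45.0

# Note: Don't log case_id as it may be sensitive (medical domain)
logger.info("Job %s completed: dice=%.3f, time=%.1fs", job_id, dice_score, elapsed_seconds)

print(log_buffer.getvalue().strip())
# → Job abc123 completed: dice=0.847, time=45.0s
```

Using `%`-style lazy arguments (rather than an f-string) also keeps the formatting cost out of disabled log levels, which the codebase's `logger.info("... %s ...", ...)` style already follows.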

---

### BUG-011: Frontend Mock Filenames Don't Match Real Backend

**Files:**
- `frontend/src/mocks/handlers.ts:167-168`
- `src/stroke_deepisles_demo/api/routes.py:246-256`
- `src/stroke_deepisles_demo/pipeline.py:123`
- `src/stroke_deepisles_demo/inference/direct.py:85`

**Issue:** The mock uses different filenames than the actual backend output:

| File Type  | Mock (handlers.ts)  | Real Backend           |
|------------|---------------------|------------------------|
| DWI        | `dwi.nii.gz`        | `{case_id}_dwi.nii.gz` |
| Prediction | `prediction.nii.gz` | `lesion_msk.nii.gz`    |

**Validation:**
- Mock: ``dwiUrl: `${API_BASE}/files/${jobId}/${updatedJob.caseId}/dwi.nii.gz` ``
- Backend pipeline.py:123: `dwi_dest = results_dir / f"{resolved_case_id}_dwi.nii.gz"`
- Backend direct.py:85: `expected_path = output_dir / "lesion_msk.nii.gz"`

**Impact:** Frontend tests pass but fail in production. Integration gap.

**Fix:** Update the mock to match actual backend output filenames.
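
One way to keep this contract testable on the backend side is to derive the expected names from a single helper and assert against it in tests. `artifact_filenames` is a suggestion, not code from the PR:

```python
def artifact_filenames(case_id: str) -> dict[str, str]:
    """Expected backend output names for a job's artifacts.

    Mirrors the conventions the audit cites: pipeline.py:123 for the DWI
    copy and direct.py:85 for the prediction mask.
    """
    return {
        "dwi": f"{case_id}_dwi.nii.gz",     # case-specific name
        "prediction": "lesion_msk.nii.gz",  # fixed name from DeepISLES output
    }


names = artifact_filenames("case_042")
print(names["dwi"])  # case_042_dwi.nii.gz
```

Both the real file-writing code and any mock fixtures could consume one source of truth like this, so a rename breaks a test instead of production.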

---

## P3 - Low Priority (Observability, Best Practices)

### BUG-012: Loose Dependency Pinning

**Files:**
- `pyproject.toml` - uses `>=` ranges (e.g., `fastapi>=0.115.0`)
- `frontend/package.json` - uses `^` ranges (e.g., `"react": "^19.2.0"`)

**Issue:** Dependencies use loose version ranges instead of exact pins.

**Impact:** Non-reproducible builds. Breaking changes in minor updates could cause silent failures.

**Note:** Lock files (uv.lock, package-lock.json) may mitigate this if checked in.

---

## Validated as FALSE (Not Bugs)

### CLAIM: .env tracked despite .gitignore
**Status:** FALSE
**Validation:** `git ls-files .env` returns empty. The file is not tracked.

### CLAIM: CORS missing .static.hf.space variant
**Status:** LIKELY FALSE
**Validation:** HuggingFace Static Spaces use the `.hf.space` domain, not `.static.hf.space`.
The current CORS config includes `https://vibecodermcswaggins-stroke-viewer-frontend.hf.space`.

---

## Fix Status

| Bug     | Status      | PR                   |
|---------|-------------|----------------------|
| BUG-005 | ✅ FIXED    | #40                  |
| BUG-006 | ✅ FIXED    | #40                  |
| BUG-008 | ✅ FIXED    | #40                  |
| BUG-009 | 🔄 DEFERRED | See NEXT-CONCERNS.md |
| BUG-010 | ✅ FIXED    | #40                  |
| BUG-011 | ✅ FIXED    | #40                  |
| BUG-012 | 🔄 DEFERRED | See NEXT-CONCERNS.md |

## Notes

1. **BUG-004 (StaticFiles CORS bypass)** was already fixed in the codebase via the `files.py` router.
   The fix comment at `api/main.py:131-132` documents this.

2. **BUG-007 (No Authentication)** was removed from the audit - intentional design for a public demo.

---

**Audited by:** Claude Code
**Date:** 2024-12-12
**Fixed:** 2024-12-12 (PR #40)

---

**New file:** `NEXT-CONCERNS.md`

# Next Concerns - To Investigate

Deferred items from BUGS-AUDIT-2024-12-12.md that need deeper investigation.

---

## BUG-009: Results Directory Configuration Drift

**Priority:** P2 (Medium)

### Current State

Two files define where results go:
- `core/config.py:93` → `results_dir: Path = Path("./results")`
- `api/config.py:10` → `RESULTS_DIR = Path("/tmp/stroke-results")`

### Investigation Needed

1. **Does anything actually use `core/config.py`'s `results_dir`?**
   - Grep the codebase for imports from `core/config`
   - If nothing uses it, delete the duplicate
   - If something uses it, consolidate to a single source of truth

2. **Gold standard pattern:**
   ```python
   # api/config.py (SSOT)
   RESULTS_DIR = Path("/tmp/stroke-results")

   # core/config.py (if needed)
   from stroke_deepisles_demo.api.config import RESULTS_DIR
   ```

### Action

Run the investigation, then either:
- Delete the unused `core/config.py` results_dir, OR
- Make it import from `api/config.py`
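
Step 1 can be scripted rather than run as a manual grep. `find_importers` below is a sketch; the regex and the `src/` path assume the layout described above:

```python
import re
from pathlib import Path


def find_importers(root: Path, pattern: str) -> list[Path]:
    """Return .py files under root whose text matches `pattern` (a regex)."""
    rx = re.compile(pattern)
    return [
        p for p in sorted(root.rglob("*.py"))
        if rx.search(p.read_text(encoding="utf-8", errors="ignore"))
    ]


# Hypothetical invocation against the real repo:
# hits = find_importers(Path("src"), r"from\s+stroke_deepisles_demo\.core\.config\s+import")
# If hits is empty, the duplicate results_dir can likely be deleted.
```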

---

## BUG-012: Loose Dependency Pinning

**Priority:** P3 (Low)

### Current State

- `pyproject.toml` uses `>=` ranges (e.g., `fastapi>=0.115.0`)
- `package.json` uses `^` ranges (e.g., `"react": "^19.2.0"`)

### Reality Check

Lock files exist and are committed:
- `uv.lock` → Backend installs are deterministic
- `package-lock.json` → Frontend installs are deterministic

**Builds ARE reproducible** if you use the lock files.

### Investigation Needed

1. Verify `uv.lock` is committed: `git ls-files uv.lock`
2. Verify Docker uses `uv sync` (respects the lock), not `uv pip install` (ignores it)
3. Verify CI uses the lock files

### Action

If lock files are committed and used correctly:
- Close as "mitigated by lock files"
- Optionally add a README note about reproducible builds

If lock files are NOT being used:
- Fix CI/Docker to use them
- Document the requirement

---

**Created:** 2024-12-12
**Status:** Pending investigation

---

The code changes in PR #40, per file:

`frontend/src/mocks/handlers.ts`:

```diff
@@ -159,13 +159,14 @@ export const handlers = [
 
     // Include result if completed
     if (updatedJob.status === "completed") {
+      // Filenames must match actual backend output format
      response.result = {
        caseId: updatedJob.caseId,
        diceScore: 0.847,
        volumeMl: 15.32,
        elapsedSeconds: updatedJob.fastMode ? 12.5 : 45.0,
-       dwiUrl: `${API_BASE}/files/${jobId}/${updatedJob.caseId}/dwi.nii.gz`,
-       predictionUrl: `${API_BASE}/files/${jobId}/${updatedJob.caseId}/prediction.nii.gz`,
+       dwiUrl: `${API_BASE}/files/${jobId}/${updatedJob.caseId}/${updatedJob.caseId}_dwi.nii.gz`,
+       predictionUrl: `${API_BASE}/files/${jobId}/${updatedJob.caseId}/lesion_msk.nii.gz`,
      };
    }
```

`src/stroke_deepisles_demo/api/config.py`:

```diff
@@ -8,3 +8,9 @@ from pathlib import Path
 # Results directory for job outputs (must be in /tmp for HF Spaces)
 # CRITICAL: This is the single source of truth. Import this instead of hardcoding.
 RESULTS_DIR = Path("/tmp/stroke-results")
+
+# Maximum active jobs (pending + running) accepted by the API
+# This limits how many jobs can be queued/running at once, NOT serialized GPU execution
+# T4 GPU (16GB) can handle ~1-2 concurrent DeepISLES inferences safely
+# Value of 10 allows reasonable queue depth while preventing unbounded accumulation
+MAX_CONCURRENT_JOBS = 10
```

`src/stroke_deepisles_demo/api/job_store.py`:

```diff
@@ -150,6 +150,18 @@ class JobStore:
         """
         return bool(job_id) and _SAFE_JOB_ID_PATTERN.match(job_id) is not None
 
+    def get_active_job_count(self) -> int:
+        """Return the number of active (pending or running) jobs.
+
+        Used for concurrency limiting to prevent GPU memory exhaustion.
+        """
+        with self._lock:
+            return sum(
+                1
+                for job in self._jobs.values()
+                if job.status in (JobStatus.PENDING, JobStatus.RUNNING)
+            )
+
     def create_job(self, job_id: str, case_id: str, fast_mode: bool) -> Job:
         """Create a new job in PENDING status.
 
```

`src/stroke_deepisles_demo/api/routes.py`:

```diff
@@ -15,7 +15,7 @@ import uuid
 
 from fastapi import APIRouter, BackgroundTasks, HTTPException, Request
 
-from stroke_deepisles_demo.api.config import RESULTS_DIR
+from stroke_deepisles_demo.api.config import MAX_CONCURRENT_JOBS, RESULTS_DIR
 from stroke_deepisles_demo.api.job_store import JobStatus, get_job_store
 from stroke_deepisles_demo.api.schemas import (
     CasesResponse,
@@ -90,7 +90,15 @@ def create_segment_job(
     - Returning immediately avoids timeout errors
     """
     try:
+        # Concurrency limit to prevent GPU memory exhaustion (BUG-006 fix)
+        store = get_job_store()
+        if store.get_active_job_count() >= MAX_CONCURRENT_JOBS:
+            raise HTTPException(
+                status_code=503,
+                detail="Server busy: too many active jobs. Please try again later.",
+            )
+
         # Validate case_id exists before creating job
         valid_cases = list_case_ids()
         if body.case_id not in valid_cases:
             raise HTTPException(
@@ -100,7 +108,6 @@ def create_segment_job(
 
     # Use full UUID hex for uniqueness (no truncation)
     job_id = uuid.uuid4().hex
-    store = get_job_store()
     backend_url = get_backend_base_url(request)
 
     # Create job record
@@ -124,6 +131,10 @@ def create_segment_job(
             message=f"Segmentation job queued for {body.case_id}",
         )
 
+    except HTTPException:
+        # Re-raise HTTP exceptions (400, 404, 503, etc.) as-is
+        # Without this, they'd be caught by `except Exception` and converted to 500
+        raise
     except Exception:
         logger.exception("Failed to create segmentation job")
         raise HTTPException(status_code=500, detail="Failed to create segmentation job") from None
@@ -260,10 +271,10 @@ def run_segmentation_job(
     # Mark as completed
     store.complete_job(job_id, result_data)
 
+    # Note: Don't log case_id as it may be sensitive (medical domain)
     logger.info(
-        "Job %s completed: case=%s, dice=%.3f, time=%.1fs",
+        "Job %s completed: dice=%.3f, time=%.1fs",
         job_id,
-        case_id,
         result.dice_score or 0,
         result.elapsed_seconds,
     )
```

`src/stroke_deepisles_demo/data/__init__.py`:

```diff
@@ -21,14 +21,21 @@ def get_case(case_id: str | int) -> CaseFiles:
     """
     Load a single case by ID or index.
 
+    Uses context manager to ensure HuggingFace temp files are cleaned up.
+    This prevents unbounded disk usage from accumulating temp NIfTI files.
+
     Returns:
         CaseFiles dictionary
     """
-    dataset = load_isles_dataset()
-    return dataset.get_case(case_id)
+    with load_isles_dataset() as dataset:
+        return dataset.get_case(case_id)
 
 
 def list_case_ids() -> list[str]:
-    """List all available case IDs."""
-    dataset = load_isles_dataset()
-    return dataset.list_case_ids()
+    """List all available case IDs.
+
+    Uses context manager to ensure HuggingFace temp files are cleaned up.
+    This prevents unbounded disk usage from accumulating temp NIfTI files.
+    """
+    with load_isles_dataset() as dataset:
+        return dataset.list_case_ids()
```