Bug 003: Gateway Timeout Risk for Long ML Inference
Status: FIXED
Date Found: 2025-12-12
Date Fixed: 2025-12-12
Severity: Medium (was causing intermittent failures)
Summary
HuggingFace Spaces has an approximately 60-second proxy/gateway timeout. The DeepISLES ML inference typically takes 30-60 seconds in fast mode, which was causing intermittent 504 Gateway Timeout errors.
Solution: Implemented async job queue pattern with client-side polling.
Original Problem
HF Spaces Timeout Behavior
From HuggingFace community forums:
- "When requests take longer than a minute, users get a 504 timeout error"
- "After the POST request, the inference is run, but the API does not get the result since it's long timed out by then"
Symptoms
When this issue occurred:
- User clicks "Run Segmentation"
- UI shows "Processing..." for ~60 seconds
- Browser receives 504 Gateway Timeout
- Error displayed: "Segmentation failed: Gateway Timeout"
- Backend may still complete the inference (results exist but response lost)
Solution: Async Job Queue Pattern
Architecture
BEFORE (Synchronous - Timeout Risk):
Frontend Backend
|--POST /api/segment------->|
| (30-60s wait) |
|<--200 OK + results--------| # TIMEOUT!
AFTER (Async with Polling - No Timeout):
Frontend Backend
|--POST /api/segment------->|
|<--202 + jobId (<1s)-------|
|--GET /api/jobs/{id}------>|
|<--200 {progress: 30%}----|
|--GET /api/jobs/{id}------>|
|<--200 {progress: 70%}----|
|--GET /api/jobs/{id}------>|
|<--200 {complete, result}--|
Implementation
Backend Changes
Job Store (src/stroke_deepisles_demo/api/job_store.py)
- In-memory job storage with thread-safe operations
- Automatic cleanup of old jobs (1 hour TTL)
- Progress tracking with status updates
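The job store described above can be sketched as follows. This is a minimal illustration, not the actual job_store.py implementation; the class and method names (`JobStore`, `create_job`, `update`, `get`, `cleanup`) are assumptions.

```python
import threading
import time
import uuid


class JobStore:
    """Sketch of a thread-safe in-memory job store with TTL cleanup.

    Names and fields here are illustrative assumptions, not the
    real job_store.py API.
    """

    def __init__(self, ttl_seconds=3600):
        self._jobs = {}
        self._lock = threading.Lock()
        self._ttl = ttl_seconds  # 1-hour TTL by default, per the doc

    def create_job(self):
        job_id = uuid.uuid4().hex
        with self._lock:
            self._jobs[job_id] = {
                "status": "pending",
                "progress": 0,
                "result": None,
                "created_at": time.time(),
            }
        return job_id

    def update(self, job_id, **fields):
        # Progress/status updates from the background inference task.
        with self._lock:
            self._jobs[job_id].update(fields)

    def get(self, job_id):
        # Return a copy so callers never mutate shared state.
        with self._lock:
            job = self._jobs.get(job_id)
            return dict(job) if job else None

    def cleanup(self):
        # Drop jobs older than the TTL.
        cutoff = time.time() - self._ttl
        with self._lock:
            stale = [j for j, v in self._jobs.items() if v["created_at"] < cutoff]
            for job_id in stale:
                del self._jobs[job_id]
```

A lock around every access keeps the store safe when the background inference thread and request handlers touch it concurrently.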
Routes (src/stroke_deepisles_demo/api/routes.py)
- POST /api/segment returns 202 with a job ID immediately
- GET /api/jobs/{job_id} returns current status/progress/results
- Background task executes inference
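The two routes can be sketched end to end with a plain dict and a background thread standing in for the real framework wiring; the handler names, response fields, and fake inference steps below are assumptions for illustration only.

```python
import threading
import time
import uuid

JOBS = {}
LOCK = threading.Lock()


def run_inference(job_id):
    # Stand-in for the DeepISLES call: just steps progress to 100%.
    for pct in (30, 70, 100):
        time.sleep(0.01)
        with LOCK:
            JOBS[job_id]["progress"] = pct
    with LOCK:
        JOBS[job_id].update(status="complete", result={"mask": "..."})


def post_segment():
    """Handler sketch for POST /api/segment: 202 + jobId in under a second."""
    job_id = uuid.uuid4().hex
    with LOCK:
        JOBS[job_id] = {"status": "running", "progress": 0, "result": None}
    # Kick off inference in the background and return immediately.
    threading.Thread(target=run_inference, args=(job_id,), daemon=True).start()
    return 202, {"jobId": job_id}


def get_job(job_id):
    """Handler sketch for GET /api/jobs/{job_id}: current status/progress."""
    with LOCK:
        job = JOBS.get(job_id)
    return (200, dict(job)) if job else (404, None)
```

Because POST returns before inference starts, the gateway sees a sub-second response and the long-running work happens outside any request/response cycle.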
Schemas (src/stroke_deepisles_demo/api/schemas.py)
- CreateJobResponse for job creation
- JobStatusResponse for polling
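The shapes of the two responses might look roughly like this, sketched with stdlib dataclasses; the field names are assumptions, not the actual schemas.py definitions.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class CreateJobResponse:
    """Returned by POST /api/segment along with the 202 status."""
    job_id: str


@dataclass
class JobStatusResponse:
    """Returned by GET /api/jobs/{job_id} on each poll."""
    job_id: str
    status: str               # e.g. "pending" | "running" | "complete" | "failed"
    progress: int             # 0-100
    result: Optional[Any] = None   # populated once status is "complete"
    error: Optional[str] = None    # populated if status is "failed"
```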
Frontend Changes
Types (frontend/src/types/index.ts)
- JobStatus, CreateJobResponse, JobStatusResponse
API Client (frontend/src/api/client.ts)
- createSegmentJob() creates a job
- getJobStatus() polls for status
Hook (frontend/src/hooks/useSegmentation.ts)
- Polls every 2 seconds
- Tracks progress, status, elapsed time
- Handles completion and errors
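The hook's polling loop, transliterated to Python for illustration (the real hook is TypeScript); `poll_job` and `get_status` are hypothetical names.

```python
import time


def poll_job(get_status, interval_s=2.0, timeout_s=300.0):
    """Call get_status() every interval_s seconds until the job reaches
    a terminal state, mirroring the 2-second polling described above.

    get_status is any callable returning a dict with a "status" key.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status()
        if status["status"] in ("complete", "failed"):
            return status
        time.sleep(interval_s)
    raise TimeoutError("job did not finish before timeout_s elapsed")
```

A client-side timeout like `timeout_s` is worth keeping even though each individual poll request is short, so a stuck job cannot leave the UI polling forever.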
Components
- ProgressIndicator shows the progress bar and status
- App integrates the progress display and cancel button
Spec Document
Full specification: docs/specs/async-job-queue.md
Performance Impact
| Metric | Before (Sync) | After (Async) |
|---|---|---|
| Initial response time | 30-60s | <1s |
| Total request count | 1 | ~15-30 (polling) |
| Timeout risk | HIGH | NONE |
| User feedback | None during wait | Real-time progress |
Files Changed
Backend
- src/stroke_deepisles_demo/api/job_store.py (NEW)
- src/stroke_deepisles_demo/api/schemas.py
- src/stroke_deepisles_demo/api/routes.py
- src/stroke_deepisles_demo/api/main.py
Frontend
- frontend/src/types/index.ts
- frontend/src/api/client.ts
- frontend/src/hooks/useSegmentation.ts
- frontend/src/components/ProgressIndicator.tsx (NEW)
- frontend/src/App.tsx
- frontend/src/mocks/handlers.ts
Tests
- frontend/src/api/__tests__/client.test.ts
- frontend/src/hooks/__tests__/useSegmentation.test.tsx
- frontend/src/App.test.tsx
Verification
After fix:
- Deploy the backend to HF Spaces
- Refresh the frontend
- Run segmentation on any case
- Observe the progress bar updating in real time
- Results display after completion, with no timeout errors