# Bug 003: Gateway Timeout Risk for Long ML Inference

**Status**: FIXED
**Date Found**: 2025-12-12
**Date Fixed**: 2025-12-12
**Severity**: Medium (was causing intermittent failures)

---

## Summary

HuggingFace Spaces enforces a proxy/gateway timeout of approximately 60 seconds. DeepISLES
ML inference typically takes 30-60 seconds in fast mode, so runs near the upper end of that
range intermittently exceeded the limit and failed with 504 Gateway Timeout errors.

**Solution**: Implemented an async job queue pattern with client-side polling.

## Original Problem

### HF Spaces Timeout Behavior

From HuggingFace community forums:

- "When requests take longer than a minute, users get a 504 timeout error"
- "After the POST request, the inference is run, but the API does not get the result
  since it's long timed out by then"

### Symptoms

When this issue occurred:

1. User clicks "Run Segmentation"
2. UI shows "Processing..." for ~60 seconds
3. Browser receives 504 Gateway Timeout
4. Error displayed: "Segmentation failed: Gateway Timeout"
5. Backend may still complete the inference (results exist but response lost)

## Solution: Async Job Queue Pattern

### Architecture

```text
BEFORE (Synchronous - Timeout Risk):
  Frontend                    Backend
     |--POST /api/segment------->|
     |       (30-60s wait)       |
     |<--200 OK + results--------|  # TIMEOUT!

AFTER (Async with Polling - No Timeout):
  Frontend                    Backend
     |--POST /api/segment------->|
     |<--202 + jobId (<1s)-------|
     |--GET /api/jobs/{id}------>|
     |<--200 {progress: 30%}-----|
     |--GET /api/jobs/{id}------>|
     |<--200 {progress: 70%}-----|
     |--GET /api/jobs/{id}------>|
     |<--200 {complete, result}--|
```

### Implementation

#### Backend Changes

1. **Job Store** (`src/stroke_deepisles_demo/api/job_store.py`)
   - In-memory job storage with thread-safe operations
   - Automatic cleanup of old jobs (1 hour TTL)
   - Progress tracking with status updates
2. **Routes** (`src/stroke_deepisles_demo/api/routes.py`)
   - `POST /api/segment` returns 202 with job ID immediately
   - `GET /api/jobs/{job_id}` returns current status/progress/results
   - Background task executes inference
3. **Schemas** (`src/stroke_deepisles_demo/api/schemas.py`)
   - `CreateJobResponse` for job creation
   - `JobStatusResponse` for polling

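
The backend pieces above can be illustrated with a minimal, standard-library-only sketch. This is not the contents of `job_store.py` or `routes.py`: the names `Job`, `JobStore`, `submit`, and the status strings are hypothetical, and a plain thread stands in for the FastAPI background task. It only shows the shape of the pattern: create a job, return immediately, and let a worker update progress behind a lock.

```python
import threading
import time
import uuid
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional, Tuple

JOB_TTL_SECONDS = 3600.0  # 1 hour TTL, as listed above


@dataclass
class Job:
    id: str
    status: str = "pending"  # hypothetical values: pending, running, completed, failed
    progress: int = 0
    result: Optional[Any] = None
    error: Optional[str] = None
    created_at: float = field(default_factory=time.time)


class JobStore:
    """In-memory job storage; a lock guards every access to the dict."""

    def __init__(self, ttl_seconds: float = JOB_TTL_SECONDS) -> None:
        self._jobs: Dict[str, Job] = {}
        self._lock = threading.Lock()
        self._ttl = ttl_seconds

    def create(self) -> Job:
        job = Job(id=uuid.uuid4().hex)
        with self._lock:
            self._jobs[job.id] = job
        return job

    def update(self, job_id: str, **changes: Any) -> None:
        with self._lock:
            job = self._jobs[job_id]
            for name, value in changes.items():
                setattr(job, name, value)

    def get(self, job_id: str) -> Optional[Job]:
        with self._lock:
            return self._jobs.get(job_id)

    def cleanup_expired(self, now: Optional[float] = None) -> int:
        """Drop jobs older than the TTL and return how many were removed."""
        now = time.time() if now is None else now
        with self._lock:
            expired = [jid for jid, job in self._jobs.items()
                       if now - job.created_at > self._ttl]
            for jid in expired:
                del self._jobs[jid]
        return len(expired)


def submit(
    store: JobStore,
    run_inference: Callable[[Callable[[int], None]], Any],
) -> Tuple[Job, threading.Thread]:
    """Create a job, start the work in a background thread, and return at once."""
    job = store.create()

    def worker() -> None:
        store.update(job.id, status="running")
        try:
            # run_inference receives a callback for reporting percent complete.
            result = run_inference(lambda pct: store.update(job.id, progress=pct))
            store.update(job.id, status="completed", progress=100, result=result)
        except Exception as exc:
            store.update(job.id, status="failed", error=str(exc))

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return job, thread
```

In a route handler, the equivalent of `submit` would run inside `POST /api/segment` and respond with 202 plus `job.id`, while `GET /api/jobs/{job_id}` would serialize whatever `store.get(job_id)` returns.
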
#### Frontend Changes

1. **Types** (`frontend/src/types/index.ts`)
   - `JobStatus`, `CreateJobResponse`, `JobStatusResponse`
2. **API Client** (`frontend/src/api/client.ts`)
   - `createSegmentJob()` - creates job
   - `getJobStatus()` - polls for status
3. **Hook** (`frontend/src/hooks/useSegmentation.ts`)
   - Polls every 2 seconds
   - Tracks progress, status, elapsed time
   - Handles completion and errors
4. **Components**
   - `ProgressIndicator` - shows progress bar and status
   - `App` - integrates progress display and cancel button

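
The polling behaviour of the hook can be sketched independently of React. The real hook is TypeScript; the loop below is written in Python only to illustrate the control flow. `poll_until_done`, the `status`/`progress`/`result`/`error` keys, and the `max_wait` cap are assumptions for illustration, not the actual `JobStatusResponse` schema.

```python
import time
from typing import Any, Callable, Dict

POLL_INTERVAL_SECONDS = 2.0  # matches the 2-second interval listed above


def poll_until_done(
    get_status: Callable[[], Dict[str, Any]],
    on_progress: Callable[[int], Any] = lambda pct: None,
    interval: float = POLL_INTERVAL_SECONDS,
    max_wait: float = 600.0,  # assumed overall cap; not taken from the source
    sleep: Callable[[float], Any] = time.sleep,
) -> Any:
    """Poll get_status() until the job completes, fails, or max_wait elapses."""
    waited = 0.0
    while True:
        status = get_status()
        on_progress(status.get("progress", 0))
        if status["status"] == "completed":
            return status.get("result")
        if status["status"] == "failed":
            raise RuntimeError(status.get("error", "job failed"))
        if waited >= max_wait:
            raise TimeoutError(f"job still running after {max_wait:.0f}s")
        sleep(interval)
        waited += interval
```

Passing `sleep` and `get_status` as parameters keeps the loop testable without a network or real delays; each individual request stays short, which is what keeps it under the gateway limit.
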
### Spec Document

Full specification: `docs/specs/async-job-queue.md`

## Performance Impact

| Metric | Before (Sync) | After (Async) |
|--------|---------------|---------------|
| Initial response time | 30-60s | <1s |
| Total request count | 1 | ~15-30 (polling every 2s) |
| Timeout risk | HIGH | NONE |
| User feedback | None during wait | Real-time progress |

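
The request-count row follows from the 2-second polling interval and the 30-60 second inference time. A small sketch of that arithmetic, assuming one poll per interval (the function name is hypothetical, and the initial `POST` adds one more request on top of the polls):

```python
import math


def estimated_polls(inference_seconds: float, poll_interval: float = 2.0) -> int:
    """Approximate number of status polls issued while one inference runs."""
    return math.ceil(inference_seconds / poll_interval)


print(estimated_polls(30), estimated_polls(60))  # prints "15 30"
```
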
## Files Changed

### Backend

- `src/stroke_deepisles_demo/api/job_store.py` (NEW)
- `src/stroke_deepisles_demo/api/schemas.py`
- `src/stroke_deepisles_demo/api/routes.py`
- `src/stroke_deepisles_demo/api/main.py`

### Frontend

- `frontend/src/types/index.ts`
- `frontend/src/api/client.ts`
- `frontend/src/hooks/useSegmentation.ts`
- `frontend/src/components/ProgressIndicator.tsx` (NEW)
- `frontend/src/App.tsx`
- `frontend/src/mocks/handlers.ts`

### Tests

- `frontend/src/api/__tests__/client.test.ts`
- `frontend/src/hooks/__tests__/useSegmentation.test.tsx`
- `frontend/src/App.test.tsx`

## Verification

After fix:

1. Deploy backend to HF Spaces
2. Refresh frontend
3. Run segmentation on any case
4. Observe progress bar updating in real-time
5. Results display after completion - NO timeout errors

## References

- [FastAPI Background Tasks](https://fastapi.tiangolo.com/tutorial/background-tasks/)
- [FastAPI Polling Strategy](https://openillumi.com/en/en-fastapi-long-task-progress-polling/)
- [504 Gateway Timeout - HF Forums](https://discuss.huggingface.co/t/504-gateway-timeout-with-http-request/24018)
- [Real Time Polling in React Query 2025](https://samwithcode.in/tutorial/react-js/real-time-polling-in-react-query-2025)