# Bug 003: Gateway Timeout Risk for Long ML Inference
**Status**: FIXED
**Date Found**: 2025-12-12
**Date Fixed**: 2025-12-12
**Severity**: Medium (was causing intermittent failures)
---
## Summary
HuggingFace Spaces has an approximately 60-second proxy/gateway timeout. DeepISLES
ML inference typically takes 30-60 seconds in fast mode, so runs near the upper end of
that range reached the limit and failed intermittently with 504 Gateway Timeout errors.
**Solution**: Implemented an async job queue pattern with client-side polling.
## Original Problem
### HF Spaces Timeout Behavior
From HuggingFace community forums:
- "When requests take longer than a minute, users get a 504 timeout error"
- "After the POST request, the inference is run, but the API does not get the result
since it's long timed out by then"
### Symptoms
When this issue occurred:
1. User clicks "Run Segmentation"
2. UI shows "Processing..." for ~60 seconds
3. Browser receives 504 Gateway Timeout
4. Error displayed: "Segmentation failed: Gateway Timeout"
5. Backend may still complete the inference (results exist but response lost)
## Solution: Async Job Queue Pattern
### Architecture
```text
BEFORE (Synchronous - Timeout Risk):
Frontend Backend
|--POST /api/segment------->|
| (30-60s wait) |
|<--200 OK + results--------| # TIMEOUT: proxy returns 504 at ~60s, response lost
AFTER (Async with Polling - No Timeout):
Frontend Backend
|--POST /api/segment------->|
|<--202 + jobId (<1s)-------|
|--GET /api/jobs/{id}------>|
|<--200 {progress: 30%}----|
|--GET /api/jobs/{id}------>|
|<--200 {progress: 70%}----|
|--GET /api/jobs/{id}------>|
|<--200 {complete, result}--|
```
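To make the AFTER flow concrete, here is a minimal, framework-free sketch of the two endpoints. It is not the project's `routes.py`: the names `JOBS`, `post_segment`, `get_job`, and `fake_inference`, the status strings, and the response field names are all assumptions chosen for illustration, and a plain thread stands in for the background task.

```python
import threading
import time
import uuid

# Hypothetical in-memory job table (assumption; the real store is job_store.py).
JOBS: dict[str, dict] = {}
JOBS_LOCK = threading.Lock()


def fake_inference(case_id: str) -> dict:
    """Stand-in for the slow DeepISLES call (assumption: it returns a dict)."""
    time.sleep(0.05)
    return {"case_id": case_id, "mask": "placeholder"}


def _run(job_id: str, case_id: str) -> None:
    """Background worker: stores either the result or the error on the job."""
    try:
        result = fake_inference(case_id)
        update = {"status": "completed", "progress": 100, "result": result}
    except Exception as exc:
        update = {"status": "failed", "error": str(exc)}
    with JOBS_LOCK:
        JOBS[job_id].update(update)


def post_segment(case_id: str) -> tuple[int, dict]:
    """POST /api/segment: register the job, start work, return 202 at once."""
    job_id = uuid.uuid4().hex
    with JOBS_LOCK:
        JOBS[job_id] = {"status": "running", "progress": 0, "result": None}
    threading.Thread(target=_run, args=(job_id, case_id), daemon=True).start()
    return 202, {"jobId": job_id}


def get_job(job_id: str) -> tuple[int, dict]:
    """GET /api/jobs/{id}: return a snapshot of the job, or 404 if unknown."""
    with JOBS_LOCK:
        job = JOBS.get(job_id)
        if job is None:
            return 404, {"detail": "job not found"}
        return 200, dict(job)
```

The point of the split is that neither handler waits on inference, so every HTTP exchange finishes far inside the proxy's window regardless of how long the model runs.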
### Implementation
#### Backend Changes
1. **Job Store** (`src/stroke_deepisles_demo/api/job_store.py`)
- In-memory job storage with thread-safe operations
- Automatic cleanup of old jobs (1 hour TTL)
- Progress tracking with status updates
2. **Routes** (`src/stroke_deepisles_demo/api/routes.py`)
- `POST /api/segment` returns 202 with job ID immediately
- `GET /api/jobs/{job_id}` returns current status/progress/results
- Background task executes inference
3. **Schemas** (`src/stroke_deepisles_demo/api/schemas.py`)
- `CreateJobResponse` for job creation
- `JobStatusResponse` for polling
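The job store described above could look roughly like the following. This is a sketch under assumptions, not the contents of `job_store.py`: the `Job` fields, the `JobStore` method names, and the injectable `clock` parameter (added here only to make expiry easy to test) are hypothetical. Only the thread safety, progress tracking, and 1-hour TTL come from the description.

```python
import threading
import time
import uuid
from dataclasses import dataclass, replace
from typing import Any, Callable, Optional


@dataclass
class Job:
    id: str
    status: str = "pending"          # assumed states: pending/running/completed/failed
    progress: int = 0                # percent, 0-100
    result: Optional[Any] = None
    error: Optional[str] = None
    created_at: float = 0.0


class JobStore:
    """In-memory job table guarded by a lock, with TTL-based cleanup."""

    def __init__(self, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.time) -> None:
        self._jobs: dict[str, Job] = {}
        self._lock = threading.Lock()
        self._ttl = ttl_seconds
        self._clock = clock

    def create(self) -> str:
        """Register a new pending job and return its ID."""
        job = Job(id=uuid.uuid4().hex, created_at=self._clock())
        with self._lock:
            self._jobs[job.id] = job
        return job.id

    def update(self, job_id: str, **fields: Any) -> None:
        """Set status, progress, result, or error on an existing job."""
        with self._lock:
            job = self._jobs[job_id]
            for name, value in fields.items():
                setattr(job, name, value)

    def get(self, job_id: str) -> Optional[Job]:
        """Return a copy so callers cannot mutate stored state unlocked."""
        with self._lock:
            job = self._jobs.get(job_id)
            return replace(job) if job is not None else None

    def cleanup(self) -> int:
        """Delete jobs older than the TTL; return how many were removed."""
        cutoff = self._clock() - self._ttl
        with self._lock:
            expired = [jid for jid, j in self._jobs.items() if j.created_at < cutoff]
            for jid in expired:
                del self._jobs[jid]
        return len(expired)
```

In this sketch `cleanup()` has to be called by something, for example a periodic timer or opportunistically on each `create()`; which trigger the real store uses is not stated here.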
#### Frontend Changes
1. **Types** (`frontend/src/types/index.ts`)
- `JobStatus`, `CreateJobResponse`, `JobStatusResponse`
2. **API Client** (`frontend/src/api/client.ts`)
- `createSegmentJob()` - creates job
- `getJobStatus()` - polls for status
3. **Hook** (`frontend/src/hooks/useSegmentation.ts`)
- Polls every 2 seconds
- Tracks progress, status, elapsed time
- Handles completion and errors
4. **Components**
- `ProgressIndicator` - shows progress bar and status
- `App` - integrates progress display and cancel button
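The hook itself is TypeScript, but the control flow it implements can be illustrated in a few lines of Python. The sketch below is an assumption-laden illustration, not a port of `useSegmentation.ts`: `poll_job`, its parameters, the terminal status names, and the overall `timeout_s` cap are hypothetical. Only the 2-second interval is taken from the description above.

```python
import time
from typing import Callable

# Assumed terminal states; the real JobStatus values may differ.
TERMINAL = {"completed", "failed"}


def poll_job(
    get_status: Callable[[], dict],
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    on_progress: Callable[[dict], None] = lambda status: None,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Call get_status every interval_s seconds until the job finishes.

    Raises TimeoutError if the job is still running after timeout_s.
    """
    start = clock()
    while True:
        status = get_status()
        on_progress(status)  # e.g. update a progress bar
        if status.get("status") in TERMINAL:
            return status
        if clock() - start >= timeout_s:
            raise TimeoutError(f"job still running after {timeout_s}s")
        sleep(interval_s)
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real delays; a cancel button would correspond to an extra check inside the loop that stops polling early.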
### Spec Document
Full specification: `docs/specs/async-job-queue.md`
## Performance Impact
| Metric | Before (Sync) | After (Async) |
|--------|--------------|---------------|
| Initial response time | 30-60s | <1s |
| Total request count | 1 | ~15-30 (polling) |
| Timeout risk | High (inference time approaches the ~60s limit) | Minimal (each request returns in <1s) |
| User feedback | None during wait | Real-time progress |
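The ~15-30 request estimate follows from dividing the inference time by the 2-second polling interval. A quick check, where `expected_polls` is a hypothetical helper written only for this calculation:

```python
import math


def expected_polls(inference_s: float, interval_s: float = 2.0) -> int:
    """Approximate number of status polls issued while a job is running."""
    return math.ceil(inference_s / interval_s)


for duration in (30, 60):
    print(f"{duration}s inference -> ~{expected_polls(duration)} polls")
# prints:
# 30s inference -> ~15 polls
# 60s inference -> ~30 polls
```

The initial `POST /api/segment` adds one more request on top of these polls.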
## Files Changed
### Backend
- `src/stroke_deepisles_demo/api/job_store.py` (NEW)
- `src/stroke_deepisles_demo/api/schemas.py`
- `src/stroke_deepisles_demo/api/routes.py`
- `src/stroke_deepisles_demo/api/main.py`
### Frontend
- `frontend/src/types/index.ts`
- `frontend/src/api/client.ts`
- `frontend/src/hooks/useSegmentation.ts`
- `frontend/src/components/ProgressIndicator.tsx` (NEW)
- `frontend/src/App.tsx`
- `frontend/src/mocks/handlers.ts`
### Tests
- `frontend/src/api/__tests__/client.test.ts`
- `frontend/src/hooks/__tests__/useSegmentation.test.tsx`
- `frontend/src/App.test.tsx`
## Verification
To verify the fix:
1. Deploy backend to HF Spaces
2. Refresh frontend
3. Run segmentation on any case
4. Observe progress bar updating in real-time
5. Results display after completion - NO timeout errors
## References
- [FastAPI Background Tasks](https://fastapi.tiangolo.com/tutorial/background-tasks/)
- [FastAPI Polling Strategy](https://openillumi.com/en/en-fastapi-long-task-progress-polling/)
- [504 Gateway Timeout - HF Forums](https://discuss.huggingface.co/t/504-gateway-timeout-with-http-request/24018)
- [Real Time Polling in React Query 2025](https://samwithcode.in/tutorial/react-js/real-time-polling-in-react-query-2025)