# Bug 003: Gateway Timeout Risk for Long ML Inference

**Status**: FIXED
**Date Found**: 2025-12-12
**Date Fixed**: 2025-12-12
**Severity**: Medium (was causing intermittent failures)

---

## Summary

HuggingFace Spaces has an approximately 60-second proxy/gateway timeout. DeepISLES ML
inference typically takes 30-60 seconds in fast mode, so runs near the upper end of that
range could reach the limit, which caused intermittent 504 Gateway Timeout errors.

**Solution**: Implemented an async job queue pattern with client-side polling, so no single HTTP request stays open for the duration of inference.

## Original Problem

### HF Spaces Timeout Behavior

From HuggingFace community forums:
- "When requests take longer than a minute, users get a 504 timeout error"
- "After the POST request, the inference is run, but the API does not get the result
   since it's long timed out by then"

### Symptoms

When this issue occurred:
1. User clicks "Run Segmentation"
2. UI shows "Processing..." for ~60 seconds
3. Browser receives 504 Gateway Timeout
4. Error displayed: "Segmentation failed: Gateway Timeout"
5. The backend may still complete the inference, so results exist but the response never reaches the client

## Solution: Async Job Queue Pattern

### Architecture

```text
BEFORE (Synchronous - Timeout Risk):
  Frontend                    Backend
     |--POST /api/segment------->|
     |       (30-60s wait)       |
     |<--200 OK + results--------|  # may exceed ~60s gateway limit -> 504

AFTER (Async with Polling - No Timeout):
  Frontend                    Backend
     |--POST /api/segment------->|
     |<--202 + jobId (<1s)-------|
     |--GET /api/jobs/{id}------>|
     |<--200 {progress: 30%}-----|
     |--GET /api/jobs/{id}------>|
     |<--200 {progress: 70%}-----|
     |--GET /api/jobs/{id}------>|
     |<--200 {complete, result}--|
```
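
The "AFTER" flow can be sketched without a web framework. This is a minimal illustration under stated assumptions, not the project's route code: `submit_segment`, `get_job`, `fake_inference`, and the module-level `JOBS` dict are hypothetical names, and a plain thread stands in for whatever background-task mechanism the backend uses.

```python
import threading
import time
import uuid

# Hypothetical in-memory state; the real backend keeps this in its job store.
JOBS = {}
JOBS_LOCK = threading.Lock()


def fake_inference(job_id, steps=3, step_seconds=0.01):
    """Stand-in for the long-running inference; reports progress as it goes."""
    for i in range(1, steps + 1):
        time.sleep(step_seconds)
        with JOBS_LOCK:
            JOBS[job_id]["progress"] = int(100 * i / steps)
    with JOBS_LOCK:
        JOBS[job_id].update(status="completed", result={"mask": "placeholder"})


def submit_segment():
    """Mimics POST /api/segment: register the job, start work, answer 202 at once."""
    job_id = uuid.uuid4().hex
    with JOBS_LOCK:
        JOBS[job_id] = {"status": "running", "progress": 0, "result": None}
    worker = threading.Thread(target=fake_inference, args=(job_id,), daemon=True)
    worker.start()
    # The worker is returned only so this example can wait on it.
    return 202, {"jobId": job_id}, worker


def get_job(job_id):
    """Mimics GET /api/jobs/{id}: return a snapshot of the job, or 404."""
    with JOBS_LOCK:
        job = JOBS.get(job_id)
        if job is None:
            return 404, {"detail": "job not found"}
        return 200, dict(job)
```

The point of the design is that `submit_segment` never waits on the inference, so every individual HTTP exchange stays far below the gateway limit.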

### Implementation

#### Backend Changes

1. **Job Store** (`src/stroke_deepisles_demo/api/job_store.py`)
   - In-memory job storage with thread-safe operations
   - Automatic cleanup of old jobs (1 hour TTL)
   - Progress tracking with status updates

2. **Routes** (`src/stroke_deepisles_demo/api/routes.py`)
   - `POST /api/segment` returns 202 with job ID immediately
   - `GET /api/jobs/{job_id}` returns current status/progress/results
   - Background task executes inference

3. **Schemas** (`src/stroke_deepisles_demo/api/schemas.py`)
   - `CreateJobResponse` for job creation
   - `JobStatusResponse` for polling
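
The job store described above can be approximated with the standard library alone. The sketch below is an assumption about its general shape, not the contents of `job_store.py`: the `Job` fields, the `JobStore` method names, and the status strings are hypothetical, while the one-hour TTL comes from the description above.

```python
import threading
import time
import uuid
from dataclasses import dataclass, field, replace
from typing import Any, Dict, Optional


@dataclass
class Job:
    id: str
    status: str = "pending"  # hypothetical status values
    progress: int = 0
    result: Optional[Any] = None
    error: Optional[str] = None
    created_at: float = field(default_factory=time.time)


class JobStore:
    """In-memory store; a single lock guards every read and write."""

    def __init__(self, ttl_seconds: float = 3600.0):
        self._ttl = ttl_seconds
        self._jobs: Dict[str, Job] = {}
        self._lock = threading.Lock()

    def create(self) -> Job:
        job = Job(id=uuid.uuid4().hex)
        with self._lock:
            self._jobs[job.id] = job
        return replace(job)  # hand out a copy, not the stored object

    def update(self, job_id: str, **changes: Any) -> None:
        with self._lock:
            job = self._jobs[job_id]
            for name, value in changes.items():
                setattr(job, name, value)

    def get(self, job_id: str) -> Optional[Job]:
        with self._lock:
            job = self._jobs.get(job_id)
            return replace(job) if job is not None else None

    def cleanup(self, now: Optional[float] = None) -> int:
        """Delete jobs older than the TTL and return how many were removed."""
        now = time.time() if now is None else now
        with self._lock:
            expired = [
                job_id
                for job_id, job in self._jobs.items()
                if now - job.created_at > self._ttl
            ]
            for job_id in expired:
                del self._jobs[job_id]
        return len(expired)
```

Returning copies from `create` and `get` keeps callers from mutating shared state outside the lock. Because the store lives in process memory, jobs are lost on restart and are not shared across multiple worker processes.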

#### Frontend Changes

1. **Types** (`frontend/src/types/index.ts`)
   - `JobStatus`, `CreateJobResponse`, `JobStatusResponse`

2. **API Client** (`frontend/src/api/client.ts`)
   - `createSegmentJob()` - creates job
   - `getJobStatus()` - polls for status

3. **Hook** (`frontend/src/hooks/useSegmentation.ts`)
   - Polls every 2 seconds
   - Tracks progress, status, elapsed time
   - Handles completion and errors

4. **Components**
   - `ProgressIndicator` - shows progress bar and status
   - `App` - integrates progress display and cancel button
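
The hook's polling logic is written in TypeScript; the loop below restates the same idea in Python purely as an illustration. `poll_until_done`, the `"completed"` and `"failed"` status strings, and the overall 300-second cap are assumptions, and only the 2-second interval comes from the description above.

```python
import time
from typing import Any, Callable, Dict


def poll_until_done(
    fetch_status: Callable[[], Dict[str, Any]],
    interval_seconds: float = 2.0,
    timeout_seconds: float = 300.0,  # assumed overall cap, not from the source
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Any]:
    """Call fetch_status repeatedly until the job completes, fails, or times out."""
    started = clock()
    while True:
        status = fetch_status()
        if status.get("status") == "completed":
            return status
        if status.get("status") == "failed":
            raise RuntimeError(status.get("error", "job failed"))
        if clock() - started >= timeout_seconds:
            raise TimeoutError("job did not finish in time")
        sleep(interval_seconds)
```

Passing `sleep` and `clock` as parameters lets a test drive the loop without real delays, which mirrors how the frontend tests can rely on mocked handlers.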

### Spec Document

Full specification: `docs/specs/async-job-queue.md`

## Performance Impact

| Metric | Before (Sync) | After (Async) |
|--------|--------------|---------------|
| Initial response time | 30-60s | <1s |
| Total request count | 1 | ~15-30 (polling) |
| Timeout risk | HIGH | NONE |
| User feedback | None during wait | Real-time progress |
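
The ~15-30 request figure follows from dividing the 30-60 second inference time by the 2-second polling interval. A rough check, assuming one poll per interval and not counting the initial POST (`expected_poll_count` is a hypothetical helper):

```python
import math


def expected_poll_count(inference_seconds: float, poll_interval_seconds: float = 2.0) -> int:
    """Approximate number of status polls issued while inference is running."""
    return math.ceil(inference_seconds / poll_interval_seconds)


low = expected_poll_count(30)   # 30 s / 2 s = 15 polls
high = expected_poll_count(60)  # 60 s / 2 s = 30 polls
```

The exact count can differ by one depending on whether the client polls immediately after job creation or waits a full interval first.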

## Files Changed

### Backend
- `src/stroke_deepisles_demo/api/job_store.py` (NEW)
- `src/stroke_deepisles_demo/api/schemas.py`
- `src/stroke_deepisles_demo/api/routes.py`
- `src/stroke_deepisles_demo/api/main.py`

### Frontend
- `frontend/src/types/index.ts`
- `frontend/src/api/client.ts`
- `frontend/src/hooks/useSegmentation.ts`
- `frontend/src/components/ProgressIndicator.tsx` (NEW)
- `frontend/src/App.tsx`
- `frontend/src/mocks/handlers.ts`

### Tests
- `frontend/src/api/__tests__/client.test.ts`
- `frontend/src/hooks/__tests__/useSegmentation.test.tsx`
- `frontend/src/App.test.tsx`

## Verification

After fix:
1. Deploy backend to HF Spaces
2. Refresh frontend
3. Run segmentation on any case
4. Observe progress bar updating in real-time
5. Results display after completion, with no 504 timeout errors

## References

- [FastAPI Background Tasks](https://fastapi.tiangolo.com/tutorial/background-tasks/)
- [FastAPI Polling Strategy](https://openillumi.com/en/en-fastapi-long-task-progress-polling/)
- [504 Gateway Timeout - HF Forums](https://discuss.huggingface.co/t/504-gateway-timeout-with-http-request/24018)
- [Real Time Polling in React Query 2025](https://samwithcode.in/tutorial/react-js/real-time-polling-in-react-query-2025)