
Bug 003: Gateway Timeout Risk for Long ML Inference

Status: FIXED
Date Found: 2025-12-12
Date Fixed: 2025-12-12
Severity: Medium (was causing intermittent failures)


Summary

HuggingFace Spaces enforces a proxy/gateway timeout of approximately 60 seconds. DeepISLES ML inference typically takes 30-60 seconds in fast mode, so runs near the upper end of that range exceeded the timeout and returned intermittent 504 Gateway Timeout errors.

Solution: Implemented an async job queue pattern with client-side polling.

Original Problem

HF Spaces Timeout Behavior

From HuggingFace community forums:

  • "When requests take longer than a minute, users get a 504 timeout error"
  • "After the POST request, the inference is run, but the API does not get the result since it's long timed out by then"

Symptoms

When this issue occurred:

  1. User clicks "Run Segmentation"
  2. UI shows "Processing..." for ~60 seconds
  3. Browser receives 504 Gateway Timeout
  4. Error displayed: "Segmentation failed: Gateway Timeout"
  5. Backend may still complete the inference (results exist but response lost)

Solution: Async Job Queue Pattern

Architecture

BEFORE (Synchronous - Timeout Risk):
  Frontend                    Backend
     |--POST /api/segment------->|
     |       (30-60s wait)       |
     |<--200 OK + results--------|  # lost if the ~60s gateway timeout fires first

AFTER (Async with Polling - No Timeout):
  Frontend                    Backend
     |--POST /api/segment------->|
     |<--202 + jobId (<1s)-------|
     |--GET /api/jobs/{id}------>|
     |<--200 {progress: 30%}-----|
     |--GET /api/jobs/{id}------>|
     |<--200 {progress: 70%}-----|
     |--GET /api/jobs/{id}------>|
     |<--200 {complete, result}--|
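
The AFTER flow can be illustrated with a minimal polling loop. This is only a sketch, written in Python for consistency with the backend even though the real client lives in the TypeScript frontend. The function name `poll_until_done`, the status strings `"completed"` and `"failed"`, and the `max_wait_s` cap are assumptions for illustration and are not taken from the project code.

```python
import time
from typing import Any, Callable, Dict


def poll_until_done(
    get_status: Callable[[], Dict[str, Any]],
    interval_s: float = 2.0,
    max_wait_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call get_status() until the job reports a terminal state.

    get_status stands in for GET /api/jobs/{id}. Each call is short,
    so no single request comes near the gateway timeout.
    """
    waited = 0.0
    while True:
        status = get_status()
        # Hypothetical terminal states; the real schema may differ.
        if status["status"] in ("completed", "failed"):
            return status
        if waited >= max_wait_s:
            raise TimeoutError(f"job still running after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s
```

Injecting `sleep` keeps the loop testable without real delays; a caller would normally leave it at the default.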

Implementation

Backend Changes

  1. Job Store (src/stroke_deepisles_demo/api/job_store.py)

    • In-memory job storage with thread-safe operations
    • Automatic cleanup of old jobs (1 hour TTL)
    • Progress tracking with status updates
  2. Routes (src/stroke_deepisles_demo/api/routes.py)

    • POST /api/segment returns 202 with job ID immediately
    • GET /api/jobs/{job_id} returns current status/progress/results
    • Background task executes inference
  3. Schemas (src/stroke_deepisles_demo/api/schemas.py)

    • CreateJobResponse for job creation
    • JobStatusResponse for polling
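
The backend pieces listed above could look roughly like the following. This is a sketch under stated assumptions, not the contents of job_store.py: the class and function names (`Job`, `JobStore`, `run_in_background`), the status strings, and the use of a plain thread for the background task are all hypothetical. Only the in-memory storage, thread safety, 1 hour TTL, and progress tracking come from the description above.

```python
import threading
import time
import uuid
from dataclasses import dataclass, field, replace
from typing import Any, Callable, Dict, Optional


@dataclass
class Job:
    id: str
    status: str = "pending"  # hypothetical: pending/running/completed/failed
    progress: int = 0
    result: Optional[Any] = None
    error: Optional[str] = None
    created_at: float = field(default_factory=time.time)


class JobStore:
    """In-memory job storage guarded by a single lock."""

    def __init__(self, ttl_seconds: float = 3600.0) -> None:
        self._jobs: Dict[str, Job] = {}
        self._lock = threading.Lock()
        self._ttl = ttl_seconds

    def create(self) -> Job:
        job = Job(id=uuid.uuid4().hex)
        with self._lock:
            self._jobs[job.id] = job
        return job

    def update(self, job_id: str, **changes: Any) -> None:
        with self._lock:
            job = self._jobs[job_id]
            for name, value in changes.items():
                setattr(job, name, value)

    def get(self, job_id: str) -> Optional[Job]:
        # Return a snapshot so callers never read a half-updated job.
        with self._lock:
            job = self._jobs.get(job_id)
            return replace(job) if job is not None else None

    def cleanup(self, now: Optional[float] = None) -> int:
        """Drop jobs older than the TTL; returns how many were removed."""
        now = time.time() if now is None else now
        with self._lock:
            expired = [j.id for j in self._jobs.values()
                       if now - j.created_at > self._ttl]
            for job_id in expired:
                del self._jobs[job_id]
        return len(expired)


def run_in_background(
    store: JobStore,
    job_id: str,
    work: Callable[[Callable[[int], None]], Any],
) -> threading.Thread:
    """Run work(report_progress) off the request thread, recording the outcome."""
    def target() -> None:
        store.update(job_id, status="running")
        try:
            result = work(lambda pct: store.update(job_id, progress=pct))
            store.update(job_id, status="completed", progress=100, result=result)
        except Exception as exc:  # surface any failure to the poller
            store.update(job_id, status="failed", error=str(exc))

    thread = threading.Thread(target=target, daemon=True)
    thread.start()
    return thread
```

In this sketch, the POST handler would call `store.create()` and `run_in_background(...)`, then return 202 with the job ID, while the GET handler would return `store.get(job_id)`. Because jobs live in process memory, they are lost on restart and are not shared across worker processes.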

Frontend Changes

  1. Types (frontend/src/types/index.ts)

    • JobStatus, CreateJobResponse, JobStatusResponse
  2. API Client (frontend/src/api/client.ts)

    • createSegmentJob() - creates job
    • getJobStatus() - polls for status
  3. Hook (frontend/src/hooks/useSegmentation.ts)

    • Polls every 2 seconds
    • Tracks progress, status, elapsed time
    • Handles completion and errors
  4. Components

    • ProgressIndicator - shows progress bar and status
    • App - integrates progress display and cancel button

Spec Document

Full specification: docs/specs/async-job-queue.md

Performance Impact

| Metric | Before (Sync) | After (Async) |
|---|---|---|
| Initial response time | 30-60s | <1s |
| Total request count | 1 | ~15-30 (polling) |
| Timeout risk | HIGH | NONE |
| User feedback | None during wait | Real-time progress |
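
The polling request count follows from the 2-second poll interval and the 30-60 second inference time. A small sketch of that arithmetic, where the helper name `expected_polls` is hypothetical:

```python
import math


def expected_polls(inference_s: float, interval_s: float = 2.0) -> int:
    """Approximate number of status polls issued while a job runs."""
    return math.ceil(inference_s / interval_s)


print(expected_polls(30))  # 15
print(expected_polls(60))  # 30
```

The initial POST adds one more request on top of these polls.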

Files Changed

Backend

  • src/stroke_deepisles_demo/api/job_store.py (NEW)
  • src/stroke_deepisles_demo/api/schemas.py
  • src/stroke_deepisles_demo/api/routes.py
  • src/stroke_deepisles_demo/api/main.py

Frontend

  • frontend/src/types/index.ts
  • frontend/src/api/client.ts
  • frontend/src/hooks/useSegmentation.ts
  • frontend/src/components/ProgressIndicator.tsx (NEW)
  • frontend/src/App.tsx
  • frontend/src/mocks/handlers.ts

Tests

  • frontend/src/api/__tests__/client.test.ts
  • frontend/src/hooks/__tests__/useSegmentation.test.tsx
  • frontend/src/App.test.tsx

Verification

After fix:

  1. Deploy backend to HF Spaces
  2. Refresh frontend
  3. Run segmentation on any case
  4. Observe progress bar updating in real-time
  5. Results display after completion - NO timeout errors

References