VibecoderMcSwaggins Claude committed on
Commit 722753e · unverified · 1 parent: 1efb3e0

feat(api): async job queue with comprehensive test coverage (#36)


* docs(bugs): add gateway timeout audit and deployment checklist

- Document Bug 003: HF Spaces ~60s proxy timeout risk for ML inference
- Add comprehensive deployment checklist with verification status
- Include E2E flow audit diagram for debugging reference
- Verify existing bug fixes (001, 002) are correct and complete

* feat(api): async job queue to eliminate gateway timeout

Implement async job queue pattern to handle HuggingFace Spaces' ~60s
gateway timeout for long-running ML inference (30-60s typical).

## Problem
- HF Spaces proxy has hard ~60s timeout
- DeepISLES inference takes 30-60s
- Intermittent 504 Gateway Timeout errors

## Solution
POST /api/segment now returns 202 Accepted immediately with job ID.
Frontend polls GET /api/jobs/{id} every 2s for status/progress/results.
No single request exceeds the timeout, which eliminates the issue entirely.
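The polling loop behind this flow can be sketched in a few lines (illustrative Python only; `poll_until_done` and the status-dict shape are assumptions mirroring GET /api/jobs/{id}, not the shipped client):

```python
import time


def poll_until_done(get_status, interval_s=2.0, timeout_s=300.0, sleep=time.sleep):
    """Poll a job-status callable until it reports completion or failure.

    get_status() returns a dict like {"status": "running", "progress": 30},
    mirroring the GET /api/jobs/{id} response shape described above.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status()
        if status["status"] in ("completed", "failed"):
            return status
        sleep(interval_s)  # injectable so tests can skip real waiting
    raise TimeoutError("job did not finish within timeout")


# Simulated backend: the job completes on the third poll.
responses = iter([
    {"status": "pending", "progress": 0},
    {"status": "running", "progress": 45},
    {"status": "completed", "progress": 100, "result": {"diceScore": 0.85}},
])
final = poll_until_done(lambda: next(responses), sleep=lambda _: None)
print(final["status"])  # completed
```

Each individual request here is sub-second, so the ~60s proxy limit is never hit no matter how long the job runs.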

### Backend Changes
- Add job_store.py: Thread-safe in-memory job storage with TTL cleanup
- Update routes.py: Async job creation + background task execution
- Update schemas.py: CreateJobResponse, JobStatusResponse types
- Update main.py: Lifespan handler for job store initialization

### Frontend Changes
- Add ProgressIndicator component with animated progress bar
- Update useSegmentation hook with polling logic
- Update api/client.ts with createSegmentJob/getJobStatus methods
- Update App.tsx to show progress and cancel button
- Update types/index.ts with JobStatus types
- Update mock handlers for testing

### Documentation
- Add docs/specs/async-job-queue.md (full spec)
- Update docs/bugs/003-gateway-timeout-long-inference.md (FIXED)
- Update docs/bugs/README.md (checklist, E2E flow v2.0)

Performance: Initial response <1s (was 30-60s), zero timeout risk

* fix(test): comprehensive test fixes for async job queue

Frontend Test Fixes:
- Remove fake timers from App.test.tsx (incompatible with MSW polling)
- Use real timers with configurable mock job duration (500ms)
- Add setMockJobDuration() to handlers.ts for test configuration
- Fix ambiguous element queries (multiple elements with same text)
- Add proper timeout for multi-run test (15s for two job cycles)
- Use behavior-based assertions instead of implementation details

Backend Tests:
- Add comprehensive unit tests for job_store.py (20 tests)
- Job dataclass tests: elapsed time, to_dict(), status handling
- JobStore tests: create, get, start, update, complete, fail
- Cleanup tests: TTL expiration, file removal
- Global store tests: init/get patterns
- Update test_endpoints.py for async API (11 tests)
- POST /api/segment returns 202 with job ID
- GET /api/jobs/{id} returns status/progress/result
- Error handling tests

All tests now pass:
- Frontend: 63 tests
- Backend: 31 tests (API only)

* fix(lint): remove unused imports and fix type annotation

- Remove unused `time` and `field` imports from job_store.py
- Import AsyncIterator from collections.abc instead of typing
- Prefix unused `app` parameter with underscore in lifespan function

* style: format schemas.py to pass CI format check

* fix(e2e): update fixtures for async job queue API pattern

The e2e fixtures were mocking the old sync API (POST /api/segment returns
result directly). Updated to match the new async job queue pattern:

- POST /api/segment returns 202 with jobId
- GET /api/jobs/:jobId returns job status with progress
- Jobs progress over ~1 second from pending → running → completed
- Fixed processingText locator to use button role (avoid strict mode violation)

* fix(security): apply CodeRabbit security and quality fixes

Security fixes (CRITICAL):
- Add path traversal protection in cleanup_old_jobs()
- Validate job_id format to allow only alphanumeric, hyphens, underscores
- Use full UUID hex instead of truncated 8-char (prevents collisions)
- Sanitize error messages - don't expose raw exceptions to clients
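A sketch of the kind of guard these fixes imply (names are illustrative; the shipped cleanup_old_jobs() may validate differently):

```python
import re
from pathlib import Path

# Only IDs the backend itself could have generated: alphanumeric, hyphen, underscore.
JOB_ID_RE = re.compile(r"^[A-Za-z0-9_-]+$")


def safe_result_dir(base: Path, job_id: str) -> Path:
    """Resolve a job's result directory, rejecting path-traversal attempts."""
    if not JOB_ID_RE.fullmatch(job_id):
        raise ValueError("invalid job id")
    result_dir = (base / job_id).resolve()
    # Belt and braces: even a regex-passing id must resolve under the base dir.
    if base.resolve() not in result_dir.parents:
        raise ValueError("job id escapes the results base directory")
    return result_dir
```

Combined with full-UUID-hex job IDs, this makes `rmtree` on a job directory safe even if an attacker controls the id string.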

Code quality improvements:
- Use 'is not None' checks in Job.to_dict() instead of truthiness
- Ensure started_at is set before computing elapsed time
- Prevent job_id overwrites with KeyError on duplicate
- Don't log case_id (potentially sensitive medical data)
- Fix misleading comment about lifespan vs import order

Documentation:
- Add 'text' language specifier to code fence blocks
- Note multi-worker would require shared store (Redis/DB)

---------

Co-authored-by: Claude <noreply@anthropic.com>

docs/bugs/003-gateway-timeout-long-inference.md ADDED
@@ -0,0 +1,143 @@
+ # Bug 003: Gateway Timeout Risk for Long ML Inference
+
+ **Status**: FIXED
+ **Date Found**: 2025-12-12
+ **Date Fixed**: 2025-12-12
+ **Severity**: Medium (was causing intermittent failures)
+
+ ---
+
+ ## Summary
+
+ HuggingFace Spaces has an approximately 60-second proxy/gateway timeout. The DeepISLES
+ ML inference typically takes 30-60 seconds in fast mode, which was causing intermittent
+ 504 Gateway Timeout errors.
+
+ **Solution**: Implemented async job queue pattern with client-side polling.
+
+ ## Original Problem
+
+ ### HF Spaces Timeout Behavior
+
+ From HuggingFace community forums:
+ - "When requests take longer than a minute, users get a 504 timeout error"
+ - "After the POST request, the inference is run, but the API does not get the result
+   since it's long timed out by then"
+
+ ### Symptoms
+
+ When this issue occurred:
+ 1. User clicks "Run Segmentation"
+ 2. UI shows "Processing..." for ~60 seconds
+ 3. Browser receives 504 Gateway Timeout
+ 4. Error displayed: "Segmentation failed: Gateway Timeout"
+ 5. Backend may still complete the inference (results exist but the response is lost)
+
+ ## Solution: Async Job Queue Pattern
+
+ ### Architecture
+
+ ```text
+ BEFORE (Synchronous - Timeout Risk):
+ Frontend                     Backend
+ |--POST /api/segment------->|
+ |      (30-60s wait)        |
+ |<--200 OK + results--------|   # TIMEOUT!
+
+ AFTER (Async with Polling - No Timeout):
+ Frontend                     Backend
+ |--POST /api/segment------->|
+ |<--202 + jobId (<1s)-------|
+ |--GET /api/jobs/{id}------>|
+ |<--200 {progress: 30%}-----|
+ |--GET /api/jobs/{id}------>|
+ |<--200 {progress: 70%}-----|
+ |--GET /api/jobs/{id}------>|
+ |<--200 {complete, result}--|
+ ```
+
+ ### Implementation
+
+ #### Backend Changes
+
+ 1. **Job Store** (`src/stroke_deepisles_demo/api/job_store.py`)
+    - In-memory job storage with thread-safe operations
+    - Automatic cleanup of old jobs (1 hour TTL)
+    - Progress tracking with status updates
+
+ 2. **Routes** (`src/stroke_deepisles_demo/api/routes.py`)
+    - `POST /api/segment` returns 202 with job ID immediately
+    - `GET /api/jobs/{job_id}` returns current status/progress/results
+    - Background task executes inference
+
+ 3. **Schemas** (`src/stroke_deepisles_demo/api/schemas.py`)
+    - `CreateJobResponse` for job creation
+    - `JobStatusResponse` for polling
+
+ #### Frontend Changes
+
+ 1. **Types** (`frontend/src/types/index.ts`)
+    - `JobStatus`, `CreateJobResponse`, `JobStatusResponse`
+
+ 2. **API Client** (`frontend/src/api/client.ts`)
+    - `createSegmentJob()` - creates job
+    - `getJobStatus()` - polls for status
+
+ 3. **Hook** (`frontend/src/hooks/useSegmentation.ts`)
+    - Polls every 2 seconds
+    - Tracks progress, status, elapsed time
+    - Handles completion and errors
+
+ 4. **Components**
+    - `ProgressIndicator` - shows progress bar and status
+    - `App` - integrates progress display and cancel button
+
+ ### Spec Document
+
+ Full specification: `docs/specs/async-job-queue.md`
+
+ ## Performance Impact
+
+ | Metric | Before (Sync) | After (Async) |
+ |--------|---------------|---------------|
+ | Initial response time | 30-60s | <1s |
+ | Total request count | 1 | ~15-30 (polling) |
+ | Timeout risk | HIGH | NONE |
+ | User feedback | None during wait | Real-time progress |
+
+ ## Files Changed
+
+ ### Backend
+ - `src/stroke_deepisles_demo/api/job_store.py` (NEW)
+ - `src/stroke_deepisles_demo/api/schemas.py`
+ - `src/stroke_deepisles_demo/api/routes.py`
+ - `src/stroke_deepisles_demo/api/main.py`
+
+ ### Frontend
+ - `frontend/src/types/index.ts`
+ - `frontend/src/api/client.ts`
+ - `frontend/src/hooks/useSegmentation.ts`
+ - `frontend/src/components/ProgressIndicator.tsx` (NEW)
+ - `frontend/src/App.tsx`
+ - `frontend/src/mocks/handlers.ts`
+
+ ### Tests
+ - `frontend/src/api/__tests__/client.test.ts`
+ - `frontend/src/hooks/__tests__/useSegmentation.test.tsx`
+ - `frontend/src/App.test.tsx`
+
+ ## Verification
+
+ After fix:
+ 1. Deploy backend to HF Spaces
+ 2. Refresh frontend
+ 3. Run segmentation on any case
+ 4. Observe progress bar updating in real time
+ 5. Results display after completion - NO timeout errors
+
+ ## References
+
+ - [FastAPI Background Tasks](https://fastapi.tiangolo.com/tutorial/background-tasks/)
+ - [FastAPI Polling Strategy](https://openillumi.com/en/en-fastapi-long-task-progress-polling/)
+ - [504 Gateway Timeout - HF Forums](https://discuss.huggingface.co/t/504-gateway-timeout-with-http-request/24018)
+ - [Real Time Polling in React Query 2025](https://samwithcode.in/tutorial/react-js/real-time-polling-in-react-query-2025)
docs/bugs/README.md CHANGED
@@ -12,6 +12,24 @@ None currently.
  |----|-------|----------|--------|
  | [001](./001-cors-static-files-hf-spaces.md) | CORS regex blocking static file requests | Critical | FIXED |
  | [002](./002-http-vs-https-proxy-headers.md) | HTTP vs HTTPS URL mismatch behind proxy | High | FIXED |
+ | [003](./003-gateway-timeout-long-inference.md) | Gateway timeout for long ML inference | Medium | FIXED |
+
+ ## HF Spaces Deployment Checklist
+
+ Last audit: 2025-12-12
+
+ | Check | Status | Notes |
+ |-------|--------|-------|
+ | CORS regex matches both URL formats | PASS | `r"https://.*stroke-viewer-frontend.*\.hf\.space"` |
+ | All URLs use HTTPS | PASS | `--proxy-headers` flag in Dockerfile |
+ | File outputs to /tmp/ | PASS | Uses `/tmp/stroke-results/` |
+ | Static files mounted after dir exists | PASS | `mkdir()` before `app.mount()` in main.py |
+ | HF_SPACES env var set | PASS | Set in Dockerfile |
+ | Using port 7860 | PASS | Configured in Dockerfile CMD |
+ | Inference timeout handled | PASS | Async job queue pattern (no timeout risk) |
+ | Error responses return JSON | PASS | HTTPException with detail |
+ | CORS preflight (OPTIONS) handled | PASS | CORSMiddleware handles automatically |
+ | Progress updates for long tasks | PASS | Polling with ProgressIndicator component |

  ## Common HuggingFace Spaces Pitfalls

@@ -43,9 +61,75 @@ Based on research and experience, here are common issues to watch for:
  - `SPACE_ID` contains the space identifier
  - Use these to detect production environment

+ ### 6. Gateway Timeouts (SOLVED)
+ - HF Spaces proxy has a ~60-second timeout
+ - Solution: async job queue pattern with polling
+ - POST returns immediately with a job ID
+ - Frontend polls GET /api/jobs/{id} for progress
+ - See [Bug 003](./003-gateway-timeout-long-inference.md) and [Spec](../specs/async-job-queue.md)
+
+ ## E2E Flow (v2.0 - Async Job Pattern)
+
+ The complete flow from frontend to backend and back:
+
+ ```text
+ 1. Frontend loads
+    ├── CaseSelector fetches GET /api/cases
+    ├── CORS: origin regex must match frontend URL
+    └── Response: JSON list of case IDs
+
+ 2. User runs segmentation
+    ├── App calls POST /api/segment {case_id, fast_mode}
+    ├── Backend creates job record
+    └── Response: 202 Accepted + {jobId, status: "pending"}
+
+ 3. Frontend polls for status
+    ├── GET /api/jobs/{jobId} every 2 seconds
+    ├── Response: {status, progress, progressMessage}
+    └── ProgressIndicator shows real-time updates
+
+ 4. Backend processes (in background thread)
+    ├── Job status: "running"
+    ├── Progress updates: 10% → 30% → 85% → 95%
+    ├── Runs DeepISLES inference
+    └── Writes results to /tmp/stroke-results/{jobId}/
+
+ 5. Job completes
+    ├── Status: "completed"
+    ├── Result includes file URLs
+    └── Frontend stops polling
+
+ 6. Frontend receives result
+    ├── Updates state with URLs
+    ├── Passes URLs to NiiVueViewer
+    └── Shows metrics in MetricsPanel
+
+ 7. NiiVue fetches static files
+    ├── Cross-origin fetch to backend /files/...
+    ├── CORS headers on static file response
+    └── Binary NIfTI files download
+
+ 8. Viewer displays
+    └── NIfTI volumes rendered in WebGL canvas
+ ```
+
+ ## API Endpoints (v2.0)
+
+ | Method | Endpoint | Description |
+ |--------|----------|-------------|
+ | GET | /api/cases | List available cases |
+ | POST | /api/segment | Create segmentation job (202 Accepted) |
+ | GET | /api/jobs/{id} | Get job status/progress/results |
+ | GET | /files/{jobId}/{caseId}/* | Static NIfTI files |
+ | GET | / | Health check |
+ | GET | /health | Detailed health with job count |
+
  ## Sources

  - [Deploying FastAPI on HuggingFace Spaces](https://huggingface.co/blog/HemanthSai7/deploy-applications-on-huggingface-spaces)
  - [HF Spaces Restrictions](https://medium.com/@na.mazaheri/deploying-a-fastapi-app-on-hugging-face-spaces-and-handling-all-its-restrictions-d494d97a78fa)
  - [FastAPI HTTPS Discussion](https://github.com/fastapi/fastapi/discussions/6670)
  - [HF Docker Spaces Docs](https://huggingface.co/docs/hub/en/spaces-sdks-docker)
+ - [504 Gateway Timeout - HF Forums](https://discuss.huggingface.co/t/504-gateway-timeout-with-http-request/24018)
+ - [FastAPI Background Tasks](https://fastapi.tiangolo.com/tutorial/background-tasks/)
+ - [FastAPI Polling Strategy](https://openillumi.com/en/en-fastapi-long-task-progress-polling/)
docs/specs/async-job-queue.md ADDED
@@ -0,0 +1,600 @@
1
+ # Async Job Queue for Long-Running ML Inference
2
+
3
+ **Status**: APPROVED
4
+ **Created**: 2025-12-12
5
+ **Author**: Claude Code Audit
6
+
7
+ ---
8
+
9
+ ## Executive Summary
10
+
11
+ HuggingFace Spaces has a ~60-second gateway timeout that cannot be bypassed through
12
+ configuration. DeepISLES ML inference typically takes 30-60 seconds, creating
13
+ intermittent 504 Gateway Timeout errors. This spec defines a robust async job queue
14
+ system that eliminates timeout issues by immediately returning a job ID and using
15
+ client-side polling for status/results.
16
+
17
+ ## Problem Statement
18
+
19
+ ### Current Architecture (Synchronous)
20
+
21
+ ```
22
+ Frontend Backend ML Inference
23
+ | | |
24
+ |--POST /api/segment------->| |
25
+ | |--run_pipeline_on_case()--->|
26
+ | | |
27
+ | (30-60s wait) | (processing) |
28
+ | | |
29
+ | |<---result------------------|
30
+ |<--200 OK + JSON-----------| |
31
+ ```
32
+
33
+ **Problem**: HF Spaces proxy times out at ~60s, killing the connection before
34
+ the ML inference completes. The response is lost even though processing succeeds.
35
+
36
+ ### Target Architecture (Async with Polling)
37
+
38
+ ```
39
+ Frontend Backend ML Inference
40
+ | | |
41
+ |--POST /api/segment------->| |
42
+ |<--202 Accepted + job_id---| |
43
+ | |--BackgroundTask----------->|
44
+ | | |
45
+ |--GET /api/jobs/{id}------>| (processing) |
46
+ |<--200 {status: running}---| |
47
+ | | |
48
+ |--GET /api/jobs/{id}------>| |
49
+ |<--200 {status: running}---| |
50
+ | |<---result------------------|
51
+ |--GET /api/jobs/{id}------>| |
52
+ |<--200 {status: completed, | |
53
+ | result: {...}}-----| |
54
+ ```
55
+
56
+ **Solution**: Initial request returns in <1s. Polling requests are fast (<100ms).
57
+ No single request exceeds the proxy timeout.
58
+
59
+ ## Technical Design
60
+
61
+ ### 1. Backend Job Store
62
+
63
+ In-memory dictionary storing job state. This is appropriate because:
64
+ - HF Spaces runs a single uvicorn worker (no multi-worker sync needed)
65
+ - Jobs are ephemeral (results cached, cleanup after 1 hour)
66
+ - No external dependencies (Redis, DB) required
67
+
68
+ ```python
69
+ from dataclasses import dataclass, field
70
+ from datetime import datetime
71
+ from enum import Enum
72
+ from typing import Any
73
+
74
+ class JobStatus(str, Enum):
75
+ PENDING = "pending" # Job created, not started
76
+ RUNNING = "running" # Inference in progress
77
+ COMPLETED = "completed" # Success, results available
78
+ FAILED = "failed" # Error occurred
79
+
80
+ @dataclass
81
+ class Job:
82
+ id: str
83
+ status: JobStatus
84
+ case_id: str
85
+ fast_mode: bool
86
+ created_at: datetime
87
+ started_at: datetime | None = None
88
+ completed_at: datetime | None = None
89
+ progress: int = 0 # 0-100 percentage
90
+ progress_message: str = ""
91
+ result: dict[str, Any] | None = None
92
+ error: str | None = None
93
+
94
+ # Thread-safe job store (single writer pattern)
95
+ jobs: dict[str, Job] = {}
96
+ ```
97
+
98
+ ### 2. API Endpoints
99
+
100
+ #### POST /api/segment (Modified)
101
+ Returns immediately with job ID.
102
+
103
+ **Request**: Same as before
104
+ ```json
105
+ {
106
+ "case_id": "sub-strokecase0001",
107
+ "fast_mode": true
108
+ }
109
+ ```
110
+
111
+ **Response**: 202 Accepted
112
+ ```json
113
+ {
114
+ "jobId": "a1b2c3d4",
115
+ "status": "pending",
116
+ "message": "Segmentation job queued"
117
+ }
118
+ ```
119
+
120
+ #### GET /api/jobs/{job_id}
121
+ Poll for job status and results.
122
+
123
+ **Response (Running)**:
124
+ ```json
125
+ {
126
+ "jobId": "a1b2c3d4",
127
+ "status": "running",
128
+ "progress": 45,
129
+ "progressMessage": "Running DeepISLES inference...",
130
+ "elapsedSeconds": 23.5
131
+ }
132
+ ```
133
+
134
+ **Response (Completed)**:
135
+ ```json
136
+ {
137
+ "jobId": "a1b2c3d4",
138
+ "status": "completed",
139
+ "progress": 100,
140
+ "progressMessage": "Segmentation complete",
141
+ "elapsedSeconds": 42.3,
142
+ "result": {
143
+ "caseId": "sub-strokecase0001",
144
+ "diceScore": 0.847,
145
+ "volumeMl": 12.34,
146
+ "dwiUrl": "https://...hf.space/files/a1b2c3d4/...",
147
+ "predictionUrl": "https://...hf.space/files/a1b2c3d4/..."
148
+ }
149
+ }
150
+ ```
151
+
152
+ **Response (Failed)**:
153
+ ```json
154
+ {
155
+ "jobId": "a1b2c3d4",
156
+ "status": "failed",
157
+ "progress": 0,
158
+ "progressMessage": "Error occurred",
159
+ "elapsedSeconds": 5.2,
160
+ "error": "Case not found: sub-invalid"
161
+ }
162
+ ```
163
+
164
+ **Response (Not Found)**: 404
165
+ ```json
166
+ {
167
+ "detail": "Job not found: xyz123"
168
+ }
169
+ ```
170
+
171
+ ### 3. Background Task Execution
172
+
173
+ ```python
174
+ from fastapi import BackgroundTasks
175
+
176
+ @router.post("/segment", response_model=SegmentJobResponse, status_code=202)
177
+ def create_segment_job(
178
+ request: Request,
179
+ body: SegmentRequest,
180
+ background_tasks: BackgroundTasks
181
+ ) -> SegmentJobResponse:
182
+ """Create a segmentation job and return immediately."""
183
+ job_id = str(uuid.uuid4())[:8]
184
+
185
+ # Create job record
186
+ job = Job(
187
+ id=job_id,
188
+ status=JobStatus.PENDING,
189
+ case_id=body.case_id,
190
+ fast_mode=body.fast_mode,
191
+ created_at=datetime.now(),
192
+ )
193
+ jobs[job_id] = job
194
+
195
+ # Queue background task
196
+ background_tasks.add_task(
197
+ run_segmentation_job,
198
+ job_id=job_id,
199
+ case_id=body.case_id,
200
+ fast_mode=body.fast_mode,
201
+ backend_url=get_backend_base_url(request),
202
+ )
203
+
204
+ return SegmentJobResponse(
205
+ jobId=job_id,
206
+ status=JobStatus.PENDING,
207
+ message="Segmentation job queued",
208
+ )
209
+ ```
210
+
211
+ ### 4. Job Execution with Progress Updates
212
+
213
+ ```python
214
+ def run_segmentation_job(
215
+ job_id: str,
216
+ case_id: str,
217
+ fast_mode: bool,
218
+ backend_url: str,
219
+ ) -> None:
220
+ """Execute segmentation in background thread."""
221
+ job = jobs.get(job_id)
222
+ if not job:
223
+ return
224
+
225
+ try:
226
+ # Mark as running
227
+ job.status = JobStatus.RUNNING
228
+ job.started_at = datetime.now()
229
+ job.progress = 10
230
+ job.progress_message = "Loading case data..."
231
+
232
+ # Run inference with progress callbacks
233
+ output_dir = RESULTS_BASE / job_id
234
+
235
+ job.progress = 20
236
+ job.progress_message = "Staging files for DeepISLES..."
237
+
238
+ result = run_pipeline_on_case(
239
+ case_id,
240
+ output_dir=output_dir,
241
+ fast=fast_mode,
242
+ compute_dice=True,
243
+ cleanup_staging=True,
244
+ # Future: pass progress_callback for finer updates
245
+ )
246
+
247
+ job.progress = 90
248
+ job.progress_message = "Computing metrics..."
249
+
250
+ # Compute volume
251
+ volume_ml = None
252
+ with contextlib.suppress(Exception):
253
+ volume_ml = round(compute_volume_ml(result.prediction_mask, threshold=0.5), 2)
254
+
255
+ # Build result
256
+ job.progress = 100
257
+ job.progress_message = "Segmentation complete"
258
+ job.status = JobStatus.COMPLETED
259
+ job.completed_at = datetime.now()
260
+ job.result = {
261
+ "caseId": result.case_id,
262
+ "diceScore": result.dice_score,
263
+ "volumeMl": volume_ml,
264
+ "elapsedSeconds": round(result.elapsed_seconds, 2),
265
+ "dwiUrl": f"{backend_url}/files/{job_id}/{result.case_id}/{result.input_files['dwi'].name}",
266
+ "predictionUrl": f"{backend_url}/files/{job_id}/{result.case_id}/{result.prediction_mask.name}",
267
+ }
268
+
269
+ except Exception as e:
270
+ job.status = JobStatus.FAILED
271
+ job.completed_at = datetime.now()
272
+ job.error = str(e)
273
+ job.progress_message = "Error occurred"
274
+ ```
275
+
276
+ ### 5. Job Cleanup (Memory Management)
277
+
278
+ ```python
279
+ import threading
280
+ from datetime import timedelta
281
+
282
+ JOB_TTL = timedelta(hours=1) # Keep completed jobs for 1 hour
283
+
284
+ def cleanup_old_jobs() -> None:
285
+ """Remove jobs older than TTL to prevent memory leaks."""
286
+ now = datetime.now()
287
+ expired = [
288
+ job_id for job_id, job in jobs.items()
289
+ if job.completed_at and (now - job.completed_at) > JOB_TTL
290
+ ]
291
+ for job_id in expired:
292
+ # Also cleanup result files
293
+ result_dir = RESULTS_BASE / job_id
294
+ if result_dir.exists():
295
+ shutil.rmtree(result_dir, ignore_errors=True)
296
+ del jobs[job_id]
297
+
298
+ # Run cleanup every 10 minutes
299
+ def start_cleanup_scheduler():
300
+ def run():
301
+ while True:
302
+ time.sleep(600) # 10 minutes
303
+ cleanup_old_jobs()
304
+
305
+ thread = threading.Thread(target=run, daemon=True)
306
+ thread.start()
307
+ ```
308
+
309
+ ### 6. Frontend Polling Hook
310
+
311
+ ```typescript
312
+ // hooks/useJobPolling.ts
313
+ import { useState, useEffect, useCallback, useRef } from 'react'
314
+ import { apiClient, JobStatus, JobStatusResponse } from '../api/client'
315
+
316
+ interface UseJobPollingOptions {
317
+ pollingInterval?: number // ms, default 2000
318
+ onComplete?: (result: SegmentationResult) => void
319
+ onError?: (error: string) => void
320
+ }
321
+
322
+ export function useJobPolling(options: UseJobPollingOptions = {}) {
323
+ const { pollingInterval = 2000, onComplete, onError } = options
324
+
325
+ const [jobId, setJobId] = useState<string | null>(null)
326
+ const [status, setStatus] = useState<JobStatus | null>(null)
327
+ const [progress, setProgress] = useState(0)
328
+ const [progressMessage, setProgressMessage] = useState('')
329
+ const [error, setError] = useState<string | null>(null)
330
+ const [isPolling, setIsPolling] = useState(false)
331
+
332
+ const intervalRef = useRef<number | null>(null)
333
+ const onCompleteRef = useRef(onComplete)
334
+ const onErrorRef = useRef(onError)
335
+
336
+ // Keep callbacks current
337
+ useEffect(() => {
338
+ onCompleteRef.current = onComplete
339
+ onErrorRef.current = onError
340
+ })
341
+
342
+ const stopPolling = useCallback(() => {
343
+ if (intervalRef.current) {
344
+ clearInterval(intervalRef.current)
345
+ intervalRef.current = null
346
+ }
347
+ setIsPolling(false)
348
+ }, [])
349
+
350
+ const pollJobStatus = useCallback(async (id: string) => {
351
+ try {
352
+ const response = await apiClient.getJobStatus(id)
353
+
354
+ setStatus(response.status)
355
+ setProgress(response.progress)
356
+ setProgressMessage(response.progressMessage)
357
+
358
+ if (response.status === 'completed' && response.result) {
359
+ stopPolling()
360
+ onCompleteRef.current?.(response.result)
361
+ } else if (response.status === 'failed') {
362
+ stopPolling()
363
+ setError(response.error || 'Job failed')
364
+ onErrorRef.current?.(response.error || 'Job failed')
365
+ }
366
+ } catch (err) {
367
+ // Don't stop polling on network errors - might be transient
368
+ console.warn('Polling error:', err)
369
+ }
370
+ }, [stopPolling])
371
+
372
+ const startJob = useCallback(async (caseId: string, fastMode = true) => {
373
+ // Reset state
374
+ setError(null)
375
+ setProgress(0)
376
+ setProgressMessage('Starting...')
377
+ setStatus('pending')
378
+
379
+ try {
380
+ // Create job
381
+ const response = await apiClient.createSegmentJob(caseId, fastMode)
382
+ setJobId(response.jobId)
383
+ setStatus(response.status)
384
+
385
+ // Start polling
386
+ setIsPolling(true)
387
+ intervalRef.current = window.setInterval(
388
+ () => pollJobStatus(response.jobId),
389
+ pollingInterval
390
+ )
391
+
392
+ // Initial poll
393
+ await pollJobStatus(response.jobId)
394
+
395
+ } catch (err) {
396
+ const message = err instanceof Error ? err.message : 'Failed to start job'
397
+ setError(message)
398
+ onErrorRef.current?.(message)
399
+ }
400
+ }, [pollingInterval, pollJobStatus])
401
+
402
+ // Cleanup on unmount
403
+ useEffect(() => {
404
+ return () => {
405
+ if (intervalRef.current) {
406
+ clearInterval(intervalRef.current)
407
+ }
408
+ }
409
+ }, [])
410
+
411
+ return {
412
+ jobId,
413
+ status,
414
+ progress,
415
+ progressMessage,
416
+ error,
417
+ isPolling,
418
+ startJob,
419
+ stopPolling,
420
+ }
421
+ }
422
+ ```
423
+
424
+ ### 7. Frontend API Client Extensions
425
+
426
+ ```typescript
427
+ // api/client.ts additions
428
+
429
+ export type JobStatus = 'pending' | 'running' | 'completed' | 'failed'
430
+
431
+ export interface CreateJobResponse {
432
+ jobId: string
433
+ status: JobStatus
434
+ message: string
435
+ }
436
+
437
+ export interface JobStatusResponse {
438
+ jobId: string
439
+ status: JobStatus
440
+ progress: number
441
+ progressMessage: string
442
+ elapsedSeconds?: number
443
+ result?: SegmentResponse
444
+ error?: string
445
+ }
446
+
447
+ class ApiClient {
448
+ // ... existing methods ...
449
+
450
+ async createSegmentJob(
451
+ caseId: string,
452
+ fastMode: boolean = true,
453
+ signal?: AbortSignal
454
+ ): Promise<CreateJobResponse> {
455
+ const response = await fetch(`${this.baseUrl}/api/segment`, {
456
+ method: 'POST',
457
+ headers: { 'Content-Type': 'application/json' },
458
+ body: JSON.stringify({ case_id: caseId, fast_mode: fastMode }),
459
+ signal,
460
+ })
461
+
462
+ if (!response.ok) {
463
+ const error = await response.json().catch(() => ({}))
464
+ throw new ApiError(
465
+ `Failed to create job: ${error.detail || response.statusText}`,
466
+ response.status,
467
+ error.detail
468
+ )
469
+ }
470
+
471
+ return response.json()
472
+ }
473
+
474
+ async getJobStatus(jobId: string, signal?: AbortSignal): Promise<JobStatusResponse> {
475
+ const response = await fetch(`${this.baseUrl}/api/jobs/${jobId}`, { signal })
476
+
477
+ if (response.status === 404) {
478
+ throw new ApiError('Job not found', 404)
479
+ }
480
+
481
+ if (!response.ok) {
482
+ const error = await response.json().catch(() => ({}))
483
+ throw new ApiError(
484
+ `Failed to get job status: ${error.detail || response.statusText}`,
485
+ response.status,
486
+ error.detail
487
+ )
488
+ }
489
+
490
+ return response.json()
491
+ }
492
+ }
493
+ ```
494
+
495
+ ### 8. UI Progress Display
496
+
497
+ ```tsx
498
+ // components/ProgressIndicator.tsx
499
+ interface ProgressIndicatorProps {
500
+ progress: number
501
+ message: string
502
+ status: JobStatus
503
+ }
504
+
505
+ export function ProgressIndicator({ progress, message, status }: ProgressIndicatorProps) {
506
+ return (
507
+ <div className="bg-gray-800 rounded-lg p-4 space-y-3">
508
+ <div className="flex justify-between text-sm">
509
+ <span className="text-gray-400">{message}</span>
510
+ <span className="text-gray-300">{progress}%</span>
511
+ </div>
512
+ <div className="w-full bg-gray-700 rounded-full h-2">
513
+ <div
514
+ className={`h-2 rounded-full transition-all duration-300 ${
515
+ status === 'failed' ? 'bg-red-500' : 'bg-blue-500'
516
+ }`}
517
+ style={{ width: `${progress}%` }}
518
+ />
519
+ </div>
520
+ </div>
521
+ )
522
+ }
523
+ ```
524
+
+ ## Implementation Checklist
+
+ ### Backend
+ - [ ] Create `job_store.py` with Job dataclass and jobs dict
+ - [ ] Create Pydantic schemas for job responses
+ - [ ] Modify POST /api/segment to return 202 with job ID
+ - [ ] Add GET /api/jobs/{job_id} endpoint
+ - [ ] Implement background task execution with progress updates
+ - [ ] Add job cleanup scheduler
+ - [ ] Update CORS if needed for new endpoint
+
+ ### Frontend
+ - [ ] Add job-related types to `types/index.ts`
+ - [ ] Add API client methods for job creation and polling
+ - [ ] Create `useJobPolling` hook
+ - [ ] Create `ProgressIndicator` component
+ - [ ] Update `useSegmentation` to use job polling
+ - [ ] Update `App.tsx` to show progress during processing
+
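The `useJobPolling` hook named in the checklist above is not spelled out in this spec. Its core loop can be sketched as a framework-agnostic helper — the `getStatus` callback and `StatusSnapshot` shape here are assumptions, mirroring the `JobStatusResponse` fields used elsewhere in this spec, not project code:

```typescript
// Minimal polling loop of the kind useJobPolling would wrap.
// Assumes a getStatus callback resolving to { status, progress, result?, error? }.
type Status = 'pending' | 'running' | 'completed' | 'failed'

interface StatusSnapshot<R> {
  status: Status
  progress: number
  result?: R
  error?: string
}

async function pollUntilDone<R>(
  getStatus: () => Promise<StatusSnapshot<R>>,
  intervalMs = 2000,
  onProgress?: (s: StatusSnapshot<R>) => void
): Promise<R> {
  for (;;) {
    const snapshot = await getStatus()
    onProgress?.(snapshot)
    // Terminal states: resolve with the result, or reject with the job error.
    if (snapshot.status === 'completed') return snapshot.result as R
    if (snapshot.status === 'failed') {
      throw new Error(snapshot.error ?? 'Job failed')
    }
    // Still pending/running: wait one interval before the next poll.
    await new Promise((r) => setTimeout(r, intervalMs))
  }
}
```

A React hook would layer cancellation (clearing the timer and aborting fetches on unmount) on top of this loop.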
+ ### Testing
+ - [ ] Unit tests for job store
+ - [ ] Unit tests for job endpoints
+ - [ ] Unit tests for useJobPolling hook
+ - [ ] E2E test for full job flow
+ - [ ] Manual test on HF Spaces deployment
+
+ ### Documentation
+ - [ ] Update API documentation
+ - [ ] Update bug tracker with resolution
+ - [ ] Add architecture diagram
+
+ ## Migration Strategy
+
+ 1. **Backend**: Add the new endpoints alongside the existing ones. Keep the old
+    synchronous `/api/segment` behavior temporarily for backwards compatibility
+    (marked deprecated).
+
+ 2. **Frontend**: Update to the new job polling system; the old synchronous
+    behavior is removed.
+
+ 3. **Testing**: Verify on HF Spaces before removing the deprecated endpoint.
+
+ 4. **Cleanup**: Remove the deprecated sync endpoint after validation.
+
+ ## Performance Considerations
+
+ | Metric | Before (Sync) | After (Async) |
+ |--------|---------------|---------------|
+ | Initial response time | 30-60s | <1s |
+ | Total request count | 1 | ~15-30 (polling) |
+ | Timeout risk | High | None |
+ | User feedback | None during wait | Progress updates |
+ | Network efficiency | 1 large response | Many small responses |
+
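The "~15-30" request count in the table follows directly from the 2s polling interval applied to a 30-60s job. As a sketch (not project code):

```typescript
// Rough status-poll count for a job of a given duration,
// assuming one poll every `intervalSeconds` (2s in this spec).
function expectedPolls(jobSeconds: number, intervalSeconds = 2): number {
  return Math.ceil(jobSeconds / intervalSeconds)
}
```

So a 30s job needs about 15 polls and a 60s job about 30, each an individually tiny request well under the proxy timeout.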
+ ## Alternatives Considered
+
+ ### 1. SSE (Server-Sent Events)
+ - **Pros**: Real-time updates, single connection
+ - **Cons**: The connection stays open, so it could still hit the proxy timeout; HF proxy issues possible
+ - **Decision**: Polling is more robust given HF Spaces constraints
+
+ ### 2. WebSockets
+ - **Pros**: Bi-directional, real-time
+ - **Cons**: Known 404 issues on HF Spaces; more complex
+ - **Decision**: Not viable on HF Spaces
+
+ ### 3. Redis/Celery
+ - **Pros**: Production-grade, multi-worker support
+ - **Cons**: Not available on HF Spaces Docker
+ - **Decision**: In-memory storage is sufficient for a single worker
+
+ ## References
+
+ - [FastAPI Background Tasks](https://fastapi.tiangolo.com/tutorial/background-tasks/)
+ - [FastAPI Polling Strategy for Long-Running Tasks](https://openillumi.com/en/en-fastapi-long-task-progress-polling/)
+ - [Managing Background Tasks in FastAPI](https://leapcell.io/blog/managing-background-tasks-and-long-running-operations-in-fastapi)
+ - [Real Time Polling in React Query 2025](https://samwithcode.in/tutorial/react-js/real-time-polling-in-react-query-2025)
+ - [504 Gateway Timeout - HF Forums](https://discuss.huggingface.co/t/504-gateway-timeout-with-http-request/24018)
frontend/e2e/fixtures.ts CHANGED
@@ -1,48 +1,147 @@
- import { test as base, expect } from '@playwright/test'
-
- // API response mocks matching MSW handlers
  const MOCK_CASES = ['sub-stroke0001', 'sub-stroke0002', 'sub-stroke0003']
- const MOCK_SEGMENT_RESPONSE = {
-   caseId: 'sub-stroke0001',
    diceScore: 0.847,
    volumeMl: 15.32,
    elapsedSeconds: 12.5,
    // Use real public NIfTI for visual testing (NiiVue demo image)
    dwiUrl: 'https://niivue.github.io/niivue-demo-images/mni152.nii.gz',
    predictionUrl: 'https://niivue.github.io/niivue-demo-images/mni152.nii.gz',
- }
-
- // Extend base test to include API mocking
- export const test = base.extend({
-   // Auto-mock API routes for every test
-   page: async ({ page }, use) => {
-     // Mock GET /api/cases
-     await page.route('**/api/cases', (route) => {
-       route.fulfill({
-         status: 200,
-         contentType: 'application/json',
-         body: JSON.stringify({ cases: MOCK_CASES }),
-       })
      })
-
-     // Mock POST /api/segment - return different caseId based on request
-     await page.route('**/api/segment', async (route) => {
-       const request = route.request()
-       const body = JSON.parse(request.postData() || '{}') as { case_id?: string }
-
-       // Simulate network delay
-       await new Promise((r) => setTimeout(r, 200))
        route.fulfill({
-         status: 200,
          contentType: 'application/json',
-         body: JSON.stringify({
-           ...MOCK_SEGMENT_RESPONSE,
-           caseId: body.case_id || 'sub-stroke0001',
-         }),
        })
      })
      await use(page)
    },
  })

+ import { test as base, expect, Page } from '@playwright/test'
+
+ // API response mocks matching the async job queue pattern
  const MOCK_CASES = ['sub-stroke0001', 'sub-stroke0002', 'sub-stroke0003']
+
+ // Track jobs for the async pattern
+ interface MockJob {
+   id: string
+   caseId: string
+   status: 'pending' | 'running' | 'completed' | 'failed'
+   progress: number
+   progressMessage: string
+   createdAt: number
+ }
+
+ // Job store per test (reset for each test)
+ const createJobStore = () => {
+   const jobs = new Map<string, MockJob>()
+   let jobCounter = 0
+
+   return {
+     createJob(caseId: string): MockJob {
+       const jobId = `e2e-job-${++jobCounter}`
+       const job: MockJob = {
+         id: jobId,
+         caseId,
+         status: 'pending',
+         progress: 0,
+         progressMessage: 'Job queued',
+         createdAt: Date.now(),
+       }
+       jobs.set(jobId, job)
+       return job
+     },
+     getJob(jobId: string): MockJob | undefined {
+       return jobs.get(jobId)
+     },
+     updateJobProgress(job: MockJob): MockJob {
+       // Simulate job progression over 1 second
+       const elapsed = Date.now() - job.createdAt
+       if (elapsed < 200) {
+         return { ...job, status: 'running', progress: 25, progressMessage: 'Loading case data...' }
+       } else if (elapsed < 500) {
+         return { ...job, status: 'running', progress: 50, progressMessage: 'Running inference...' }
+       } else if (elapsed < 800) {
+         return { ...job, status: 'running', progress: 75, progressMessage: 'Processing results...' }
+       } else {
+         return { ...job, status: 'completed', progress: 100, progressMessage: 'Segmentation complete' }
+       }
+     },
+   }
+ }
+
+ // Mock completed job result
+ const createMockResult = (caseId: string) => ({
+   caseId,
    diceScore: 0.847,
    volumeMl: 15.32,
    elapsedSeconds: 12.5,
    // Use real public NIfTI for visual testing (NiiVue demo image)
    dwiUrl: 'https://niivue.github.io/niivue-demo-images/mni152.nii.gz',
    predictionUrl: 'https://niivue.github.io/niivue-demo-images/mni152.nii.gz',
+ })
+
+ // Setup API mocking for async job queue pattern
+ async function setupApiMocks(page: Page) {
+   const jobStore = createJobStore()
+
+   // Mock GET /api/cases
+   await page.route('**/api/cases', (route) => {
+     route.fulfill({
+       status: 200,
+       contentType: 'application/json',
+       body: JSON.stringify({ cases: MOCK_CASES }),
      })
+   })
+
+   // Mock POST /api/segment - returns 202 with job ID (async pattern)
+   await page.route('**/api/segment', async (route) => {
+     const request = route.request()
+     const body = JSON.parse(request.postData() || '{}') as { case_id?: string }
+     const caseId = body.case_id || 'sub-stroke0001'
+
+     // Create a new job
+     const job = jobStore.createJob(caseId)
+
+     // Small delay to simulate network
+     await new Promise((r) => setTimeout(r, 50))
+
+     route.fulfill({
+       status: 202,
+       contentType: 'application/json',
+       body: JSON.stringify({
+         jobId: job.id,
+         status: 'pending',
+         message: `Segmentation job queued for ${caseId}`,
+       }),
+     })
+   })
+
+   // Mock GET /api/jobs/:jobId - returns job status (for polling)
+   await page.route('**/api/jobs/*', async (route) => {
+     const url = route.request().url()
+     const jobId = url.split('/api/jobs/')[1]
+
+     const job = jobStore.getJob(jobId)
+     if (!job) {
      route.fulfill({
+         status: 404,
        contentType: 'application/json',
+         body: JSON.stringify({ detail: `Job not found: ${jobId}` }),
        })
+       return
+     }
+
+     // Update job progress based on elapsed time
+     const updatedJob = jobStore.updateJobProgress(job)
+
+     const response: Record<string, unknown> = {
+       jobId: updatedJob.id,
+       status: updatedJob.status,
+       progress: updatedJob.progress,
+       progressMessage: updatedJob.progressMessage,
+       elapsedSeconds: (Date.now() - updatedJob.createdAt) / 1000,
+     }
+
+     // Include result when completed
+     if (updatedJob.status === 'completed') {
+       response.result = createMockResult(updatedJob.caseId)
+     }
+
+     route.fulfill({
+       status: 200,
+       contentType: 'application/json',
+       body: JSON.stringify(response),
      })
+   })
+ }
+
+ // Extend base test to include API mocking
+ export const test = base.extend({
+   // Auto-mock API routes for every test
+   page: async ({ page }, use) => {
+     await setupApiMocks(page)
      await use(page)
    },
  })
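The fixture's `updateJobProgress` maps elapsed wall-clock time to a simulated job phase. Extracted as a pure function (thresholds copied from the fixture above; the name `phaseForElapsed` is an illustration, not project code), the mapping can be unit-tested in isolation:

```typescript
// Pure version of the fixture's time-based progression:
// elapsed ms -> simulated job phase (thresholds match the mock above).
interface JobPhase {
  status: 'running' | 'completed'
  progress: number
  progressMessage: string
}

function phaseForElapsed(elapsedMs: number): JobPhase {
  if (elapsedMs < 200) {
    return { status: 'running', progress: 25, progressMessage: 'Loading case data...' }
  }
  if (elapsedMs < 500) {
    return { status: 'running', progress: 50, progressMessage: 'Running inference...' }
  }
  if (elapsedMs < 800) {
    return { status: 'running', progress: 75, progressMessage: 'Processing results...' }
  }
  return { status: 'completed', progress: 100, progressMessage: 'Segmentation complete' }
}
```

Keeping the progression pure like this avoids the `Date.now()` coupling that makes the fixture itself awkward to test with fake timers.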
frontend/e2e/pages/HomePage.ts CHANGED
@@ -19,7 +19,7 @@ export class HomePage {
    })
    this.caseSelector = page.getByRole('combobox')
    this.runButton = page.getByRole('button', { name: /run segmentation/i })
-   this.processingText = page.getByText(/processing/i)
+   this.processingText = page.getByRole('button', { name: /processing/i })
    this.metricsPanel = page.getByRole('heading', { name: /results/i })
    this.diceScore = page.getByText(/0\.\d{3}/)
    this.viewer = page.locator('canvas')
frontend/src/App.test.tsx CHANGED
@@ -1,8 +1,8 @@
- import { describe, it, expect, vi } from 'vitest'
  import { render, screen, waitFor } from '@testing-library/react'
  import userEvent from '@testing-library/user-event'
  import { server } from './mocks/server'
- import { errorHandlers } from './mocks/handlers'
  import App from './App'

  // Mock NiiVue to avoid WebGL in tests
@@ -19,6 +19,17 @@ vi.mock('@niivue/niivue', () => ({
  }))

  describe('App Integration', () => {
    describe('Initial Render', () => {
      it('renders main heading', () => {
        render(<App />)
@@ -94,10 +105,11 @@ describe('App Integration', () => {
      await user.selectOptions(screen.getByRole('combobox'), 'sub-stroke0001')
      await user.click(screen.getByRole('button', { name: /run segmentation/i }))

-     expect(screen.getByText(/processing/i)).toBeInTheDocument()
    })

-   it('displays metrics after successful segmentation', async () => {
      const user = userEvent.setup()
      render(<App />)
@@ -108,12 +120,33 @@ describe('App Integration', () => {
      await user.selectOptions(screen.getByRole('combobox'), 'sub-stroke0001')
      await user.click(screen.getByRole('button', { name: /run segmentation/i }))

      await waitFor(() => {
-       expect(screen.getByText('0.847')).toBeInTheDocument()
      })

      expect(screen.getByText('15.32 mL')).toBeInTheDocument()
-     expect(screen.getByText(/12\.5s/)).toBeInTheDocument()
    })

    it('displays viewer after successful segmentation', async () => {
@@ -127,9 +160,13 @@ describe('App Integration', () => {
      await user.selectOptions(screen.getByRole('combobox'), 'sub-stroke0001')
      await user.click(screen.getByRole('button', { name: /run segmentation/i }))

-     await waitFor(() => {
-       expect(document.querySelector('canvas')).toBeInTheDocument()
-     })
    })

    it('hides placeholder after successful segmentation', async () => {
@@ -143,19 +180,37 @@ describe('App Integration', () => {
      await user.selectOptions(screen.getByRole('combobox'), 'sub-stroke0001')
      await user.click(screen.getByRole('button', { name: /run segmentation/i }))

-     await waitFor(() => {
-       expect(screen.getByText('0.847')).toBeInTheDocument()
-     })

      expect(
        screen.queryByText(/select a case and run segmentation/i)
      ).not.toBeInTheDocument()
    })
  })

  describe('Error Handling', () => {
-   it('shows error when segmentation fails', async () => {
-     server.use(errorHandlers.segmentServerError)
      const user = userEvent.setup()

      render(<App />)
@@ -171,11 +226,11 @@ describe('App Integration', () => {
      expect(screen.getByRole('alert')).toBeInTheDocument()
    })

-   expect(screen.getByText(/segmentation failed/i)).toBeInTheDocument()
  })

  it('allows retry after error', async () => {
-   server.use(errorHandlers.segmentServerError)
    const user = userEvent.setup()

    render(<App />)
@@ -197,16 +252,20 @@ describe('App Integration', () => {
    // Retry
    await user.click(screen.getByRole('button', { name: /run segmentation/i }))

-   await waitFor(() => {
-     expect(screen.getByText('0.847')).toBeInTheDocument()
-   })

    expect(screen.queryByRole('alert')).not.toBeInTheDocument()
  })
  })

  describe('Multiple Runs', () => {
-   it('allows running segmentation on different cases', async () => {
      const user = userEvent.setup()
      render(<App />)
@@ -218,23 +277,34 @@ describe('App Integration', () => {
      await user.selectOptions(screen.getByRole('combobox'), 'sub-stroke0001')
      await user.click(screen.getByRole('button', { name: /run segmentation/i }))

-     // Wait for first segmentation to complete
-     await waitFor(() => {
-       expect(screen.getByText('sub-stroke0001')).toBeInTheDocument()
-     })
-
-     // Wait for button to be ready again (not "Processing...")
-     await waitFor(() => {
-       expect(screen.getByRole('button', { name: /run segmentation/i })).toBeInTheDocument()
-     })

      // Second case
      await user.selectOptions(screen.getByRole('combobox'), 'sub-stroke0002')
      await user.click(screen.getByRole('button', { name: /run segmentation/i }))

-     await waitFor(() => {
-       expect(screen.getByText('sub-stroke0002')).toBeInTheDocument()
-     })
    })
  })
  })

+ import { describe, it, expect, vi, beforeEach, afterEach } from 'vitest'
  import { render, screen, waitFor } from '@testing-library/react'
  import userEvent from '@testing-library/user-event'
  import { server } from './mocks/server'
+ import { errorHandlers, setMockJobDuration } from './mocks/handlers'
  import App from './App'

  // Mock NiiVue to avoid WebGL in tests

  }))

  describe('App Integration', () => {
+   // Use real timers for integration tests - fake timers don't sync well
+   // with MSW's async handlers and polling intervals
+   beforeEach(() => {
+     // Reset mock job duration to fast for tests
+     setMockJobDuration(500) // Jobs complete in 500ms
+   })
+
+   afterEach(() => {
+     setMockJobDuration(500) // Reset to default
+   })
+
    describe('Initial Render', () => {
      it('renders main heading', () => {
        render(<App />)

      await user.selectOptions(screen.getByRole('combobox'), 'sub-stroke0001')
      await user.click(screen.getByRole('button', { name: /run segmentation/i }))

+     // Button should show "Processing..." while job is running
+     expect(screen.getByRole('button', { name: /processing/i })).toBeInTheDocument()
    })

+   it('shows progress indicator during job execution', async () => {
      const user = userEvent.setup()
      render(<App />)

      await user.selectOptions(screen.getByRole('combobox'), 'sub-stroke0001')
      await user.click(screen.getByRole('button', { name: /run segmentation/i }))

+     // Progress indicator should appear during processing
+     await waitFor(() => {
+       expect(screen.getByRole('progressbar')).toBeInTheDocument()
+     })
+   })
+
+   it('displays metrics after successful segmentation', async () => {
+     const user = userEvent.setup()
+     render(<App />)
+
      await waitFor(() => {
+       expect(screen.getByRole('combobox')).toBeInTheDocument()
      })

+     await user.selectOptions(screen.getByRole('combobox'), 'sub-stroke0001')
+     await user.click(screen.getByRole('button', { name: /run segmentation/i }))
+
+     // Wait for job to complete (mock duration is 500ms, polling is 2s)
+     // Use 5s timeout to account for polling interval
+     await waitFor(
+       () => {
+         expect(screen.getByText('0.847')).toBeInTheDocument()
+       },
+       { timeout: 5000 }
+     )
+
      expect(screen.getByText('15.32 mL')).toBeInTheDocument()
    })

    it('displays viewer after successful segmentation', async () => {

      await user.selectOptions(screen.getByRole('combobox'), 'sub-stroke0001')
      await user.click(screen.getByRole('button', { name: /run segmentation/i }))

+     // Wait for job to complete and canvas to render
+     await waitFor(
+       () => {
+         expect(document.querySelector('canvas')).toBeInTheDocument()
+       },
+       { timeout: 5000 }
+     )
    })

    it('hides placeholder after successful segmentation', async () => {

      await user.selectOptions(screen.getByRole('combobox'), 'sub-stroke0001')
      await user.click(screen.getByRole('button', { name: /run segmentation/i }))

+     // Wait for job to complete
+     await waitFor(
+       () => {
+         expect(screen.getByText('0.847')).toBeInTheDocument()
+       },
+       { timeout: 5000 }
+     )

      expect(
        screen.queryByText(/select a case and run segmentation/i)
      ).not.toBeInTheDocument()
    })
+
+   it('shows cancel button during processing', async () => {
+     const user = userEvent.setup()
+     render(<App />)
+
+     await waitFor(() => {
+       expect(screen.getByRole('combobox')).toBeInTheDocument()
+     })
+
+     await user.selectOptions(screen.getByRole('combobox'), 'sub-stroke0001')
+     await user.click(screen.getByRole('button', { name: /run segmentation/i }))
+
+     expect(screen.getByRole('button', { name: /cancel/i })).toBeInTheDocument()
+   })
  })

  describe('Error Handling', () => {
+   it('shows error when job creation fails', async () => {
+     server.use(errorHandlers.segmentCreateError)
      const user = userEvent.setup()

      render(<App />)

      expect(screen.getByRole('alert')).toBeInTheDocument()
    })

+   expect(screen.getByText(/failed to create job/i)).toBeInTheDocument()
  })

  it('allows retry after error', async () => {
+   server.use(errorHandlers.segmentCreateError)
    const user = userEvent.setup()

    render(<App />)

    // Retry
    await user.click(screen.getByRole('button', { name: /run segmentation/i }))

+   // Wait for job to complete (real timer now)
+   await waitFor(
+     () => {
+       expect(screen.getByText('0.847')).toBeInTheDocument()
+     },
+     { timeout: 5000 }
+   )

    expect(screen.queryByRole('alert')).not.toBeInTheDocument()
  })
  })

  describe('Multiple Runs', () => {
+   it('allows running segmentation on different cases', { timeout: 15000 }, async () => {
      const user = userEvent.setup()
      render(<App />)

      await user.selectOptions(screen.getByRole('combobox'), 'sub-stroke0001')
      await user.click(screen.getByRole('button', { name: /run segmentation/i }))

+     // Wait for first segmentation to complete - check metrics (Dice Score proves completion)
+     await waitFor(
+       () => {
+         expect(screen.getByText('0.847')).toBeInTheDocument()
+         // Button should no longer say "Processing..." after completion
+         expect(screen.queryByRole('button', { name: /processing/i })).not.toBeInTheDocument()
+       },
+       { timeout: 5000 }
+     )

      // Second case
      await user.selectOptions(screen.getByRole('combobox'), 'sub-stroke0002')
      await user.click(screen.getByRole('button', { name: /run segmentation/i }))

+     // Wait for second job to complete - check that case ID changed in metrics
+     // Note: We look within the metrics container for the case ID to avoid matching dropdown
+     await waitFor(
+       () => {
+         // The metrics panel shows case ID in a span with class "ml-2 font-mono"
+         // after the "Case:" label
+         const caseLabels = screen.getAllByText(/Case:/i)
+         expect(caseLabels.length).toBeGreaterThan(0)
+         // The second run should show sub-stroke0002 in the metrics
+         const metricsContainer = screen.getByText('Results').closest('div')
+         expect(metricsContainer).toHaveTextContent('sub-stroke0002')
+       },
+       { timeout: 5000 }
+     )
    })
  })
  })
frontend/src/App.tsx CHANGED
@@ -3,11 +3,22 @@ import { Layout } from './components/Layout'
  import { CaseSelector } from './components/CaseSelector'
  import { NiiVueViewer } from './components/NiiVueViewer'
  import { MetricsPanel } from './components/MetricsPanel'
  import { useSegmentation } from './hooks/useSegmentation'

  export default function App() {
    const [selectedCase, setSelectedCase] = useState<string | null>(null)
-   const { result, isLoading, error, runSegmentation } = useSegmentation()

    const handleRunSegmentation = async () => {
      if (selectedCase) {
@@ -15,6 +26,9 @@ export default function App() {
      }
    }

    return (
      <Layout>
        <div className="grid grid-cols-1 lg:grid-cols-3 gap-6">
@@ -35,12 +49,39 @@ export default function App() {
          {isLoading ? 'Processing...' : 'Run Segmentation'}
        </button>

-       {error && (
-         <div role="alert" className="bg-red-900/50 text-red-300 p-3 rounded-lg">
-           {error}
          </div>
        )}

        {result && <MetricsPanel metrics={result.metrics} />}
      </div>

@@ -54,7 +95,9 @@ export default function App() {
        ) : (
          <div className="bg-gray-900 rounded-lg h-[500px] flex items-center justify-center">
            <p className="text-gray-400">
-             Select a case and run segmentation to view results
            </p>
          </div>
        )}

  import { CaseSelector } from './components/CaseSelector'
  import { NiiVueViewer } from './components/NiiVueViewer'
  import { MetricsPanel } from './components/MetricsPanel'
+ import { ProgressIndicator } from './components/ProgressIndicator'
  import { useSegmentation } from './hooks/useSegmentation'

  export default function App() {
    const [selectedCase, setSelectedCase] = useState<string | null>(null)
+   const {
+     result,
+     isLoading,
+     error,
+     jobStatus,
+     progress,
+     progressMessage,
+     elapsedSeconds,
+     runSegmentation,
+     cancelJob,
+   } = useSegmentation()

    const handleRunSegmentation = async () => {
      if (selectedCase) {
      }
    }

+   // Show progress indicator when job is active
+   const showProgress = isLoading && jobStatus && jobStatus !== 'completed'
+
    return (
      <Layout>
        <div className="grid grid-cols-1 lg:grid-cols-3 gap-6">

          {isLoading ? 'Processing...' : 'Run Segmentation'}
        </button>

+       {/* Cancel button when processing */}
+       {isLoading && (
+         <button
+           onClick={cancelJob}
+           className="w-full bg-gray-700 hover:bg-gray-600 text-gray-300
+             font-medium py-2 px-4 rounded-lg transition-colors text-sm"
+         >
+           Cancel
+         </button>
+       )}
+
+       {/* Progress indicator */}
+       {showProgress && (
+         <ProgressIndicator
+           progress={progress}
+           message={progressMessage}
+           status={jobStatus}
+           elapsedSeconds={elapsedSeconds}
+         />
+       )}
+
+       {/* Error display */}
+       {error && !isLoading && (
+         <div
+           role="alert"
+           className="bg-red-900/50 text-red-300 p-3 rounded-lg text-sm"
+         >
+           <p className="font-medium">Error</p>
+           <p className="mt-1">{error}</p>
          </div>
        )}

+       {/* Results metrics */}
        {result && <MetricsPanel metrics={result.metrics} />}
      </div>

        ) : (
          <div className="bg-gray-900 rounded-lg h-[500px] flex items-center justify-center">
            <p className="text-gray-400">
+             {isLoading
+               ? 'Processing segmentation...'
+               : 'Select a case and run segmentation to view results'}
            </p>
          </div>
        )}
frontend/src/api/__tests__/client.test.ts CHANGED
@@ -25,37 +25,52 @@ describe('apiClient', () => {
    })
  })

- describe('runSegmentation', () => {
-   it('returns segmentation result', async () => {
-     const result = await apiClient.runSegmentation('sub-stroke0001')

-     expect(result.caseId).toBe('sub-stroke0001')
-     expect(result.diceScore).toBe(0.847)
-     expect(result.volumeMl).toBe(15.32)
-     expect(result.dwiUrl).toContain('dwi.nii.gz')
-     expect(result.predictionUrl).toContain('prediction.nii.gz')
    })

-   it('sends fast_mode=false parameter (slower processing)', async () => {
-     const result = await apiClient.runSegmentation('sub-stroke0001', false)

-     // Mock returns 45.0s when fast_mode=false
-     expect(result.elapsedSeconds).toBe(45.0)
    })

-   it('defaults fast_mode to true (faster processing)', async () => {
-     const result = await apiClient.runSegmentation('sub-stroke0001')

-     // Mock returns 12.5s when fast_mode=true (the default)
-     expect(result.elapsedSeconds).toBe(12.5)
    })

-   it('throws ApiError on server error', async () => {
-     server.use(errorHandlers.segmentServerError)

      await expect(
-       apiClient.runSegmentation('sub-stroke0001')
-     ).rejects.toThrow(/segmentation failed/i)
    })
  })
  })

    })
  })

+ describe('createSegmentJob', () => {
+   it('returns job ID and pending status', async () => {
+     const result = await apiClient.createSegmentJob('sub-stroke0001')

+     expect(result.jobId).toBeDefined()
+     expect(result.status).toBe('pending')
+     expect(result.message).toContain('sub-stroke0001')
    })

+   it('sends fast_mode parameter', async () => {
+     const result = await apiClient.createSegmentJob('sub-stroke0001', false)

+     expect(result.jobId).toBeDefined()
+     expect(result.status).toBe('pending')
    })

+   it('throws ApiError on server error', async () => {
+     server.use(errorHandlers.segmentCreateError)

+     await expect(
+       apiClient.createSegmentJob('sub-stroke0001')
+     ).rejects.toThrow(/failed to create job/i)
    })
+ })

+ describe('getJobStatus', () => {
+   it('returns job status with progress', async () => {
+     // First create a job
+     const createResult = await apiClient.createSegmentJob('sub-stroke0001')
+
+     // Then get its status
+     const status = await apiClient.getJobStatus(createResult.jobId)
+
+     expect(status.jobId).toBe(createResult.jobId)
+     expect(['pending', 'running', 'completed']).toContain(status.status)
+     expect(status.progress).toBeGreaterThanOrEqual(0)
+     expect(status.progress).toBeLessThanOrEqual(100)
+     expect(status.progressMessage).toBeDefined()
+   })
+
+   it('throws ApiError when job not found', async () => {
+     server.use(errorHandlers.jobNotFound)

      await expect(
+       apiClient.getJobStatus('nonexistent-job')
+     ).rejects.toThrow(/not found/i)
    })
  })
  })
frontend/src/api/client.ts CHANGED
@@ -1,4 +1,8 @@
- import type { CasesResponse, SegmentResponse } from '../types'

  function getApiBase(): string {
    const url = import.meta.env.VITE_API_URL
@@ -36,6 +40,9 @@ class ApiClient {
    this.baseUrl = baseUrl
  }

  async getCases(signal?: AbortSignal): Promise<CasesResponse> {
    const response = await fetch(`${this.baseUrl}/api/cases`, { signal })

@@ -51,11 +58,17 @@ class ApiClient {
    return response.json()
  }

- async runSegmentation(
    caseId: string,
    fastMode: boolean = true,
    signal?: AbortSignal
- ): Promise<SegmentResponse> {
    const response = await fetch(`${this.baseUrl}/api/segment`, {
      method: 'POST',
      headers: {
@@ -71,7 +84,42 @@ class ApiClient {
    if (!response.ok) {
      const error = await response.json().catch(() => ({}))
      throw new ApiError(
-       `Segmentation failed: ${error.detail || response.statusText}`,
        response.status,
        error.detail
      )

+ import type {
+   CasesResponse,
+   CreateJobResponse,
+   JobStatusResponse,
+ } from '../types'

  function getApiBase(): string {
    const url = import.meta.env.VITE_API_URL

    this.baseUrl = baseUrl
  }

+ /**
+  * Get list of available cases
+  */
  async getCases(signal?: AbortSignal): Promise<CasesResponse> {
    const response = await fetch(`${this.baseUrl}/api/cases`, { signal })

    return response.json()
  }

+ /**
+  * Create a segmentation job (async - returns immediately with job ID)
+  *
+  * The actual ML inference runs in the background. Poll getJobStatus()
+  * to track progress and retrieve results when complete.
+  */
+ async createSegmentJob(
    caseId: string,
    fastMode: boolean = true,
    signal?: AbortSignal
+ ): Promise<CreateJobResponse> {
    const response = await fetch(`${this.baseUrl}/api/segment`, {
      method: 'POST',
      headers: {

    if (!response.ok) {
      const error = await response.json().catch(() => ({}))
      throw new ApiError(
+       `Failed to create job: ${error.detail || response.statusText}`,
+       response.status,
+       error.detail
+     )
+   }
+
+   return response.json()
+ }
+
+ /**
+  * Get the status of a segmentation job
+  *
+  * Poll this endpoint to track progress and retrieve results.
+  * When status is 'completed', the result field contains segmentation data.
+  * When status is 'failed', the error field contains the error message.
+  */
+ async getJobStatus(
+   jobId: string,
+   signal?: AbortSignal
+ ): Promise<JobStatusResponse> {
+   const response = await fetch(`${this.baseUrl}/api/jobs/${jobId}`, {
+     signal,
+   })
+
+   if (response.status === 404) {
+     throw new ApiError(
+       'Job not found or expired',
+       404,
+       'Jobs expire after 1 hour'
+     )
+   }
+
+   if (!response.ok) {
+     const error = await response.json().catch(() => ({}))
+     throw new ApiError(
+       `Failed to get job status: ${error.detail || response.statusText}`,
        response.status,
        error.detail
      )
frontend/src/components/ProgressIndicator.tsx ADDED
@@ -0,0 +1,84 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
+ import type { JobStatus } from '../types'
+
+ interface ProgressIndicatorProps {
+   progress: number
+   message: string
+   status: JobStatus
+   elapsedSeconds?: number
+ }
+
+ /**
+  * Visual progress indicator for long-running ML inference jobs.
+  *
+  * Shows:
+  * - Progress bar with percentage
+  * - Current operation message
+  * - Elapsed time
+  * - Status-appropriate coloring (blue for running, red for failed)
+  */
+ export function ProgressIndicator({
+   progress,
+   message,
+   status,
+   elapsedSeconds,
+ }: ProgressIndicatorProps) {
+   const isError = status === 'failed'
+   const isComplete = status === 'completed'
+
+   // Determine bar color based on status
+   const barColorClass = isError
+     ? 'bg-red-500'
+     : isComplete
+       ? 'bg-green-500'
+       : 'bg-blue-500'
+
+   // Animate the bar while running
+   const animationClass =
+     status === 'running' || status === 'pending' ? 'animate-pulse' : ''
+
+   return (
+     <div className="bg-gray-800 rounded-lg p-4 space-y-3">
+       {/* Header with message and percentage */}
+       <div className="flex justify-between items-center text-sm">
+         <span className="text-gray-300 font-medium">{message}</span>
+         <span className="text-gray-400 tabular-nums">{progress}%</span>
+       </div>
+
+       {/* Progress bar */}
+       <div className="w-full bg-gray-700 rounded-full h-2.5 overflow-hidden">
+         <div
+           className={`h-full rounded-full transition-all duration-500 ease-out ${barColorClass} ${animationClass}`}
+           style={{ width: `${progress}%` }}
+           role="progressbar"
+           aria-valuenow={progress}
+           aria-valuemin={0}
+           aria-valuemax={100}
+           aria-label={message}
+         />
+       </div>
+
+       {/* Footer with elapsed time and status */}
+       <div className="flex justify-between items-center text-xs text-gray-500">
+         {elapsedSeconds !== undefined ? (
+           <span className="tabular-nums">
+             Elapsed: {elapsedSeconds.toFixed(1)}s
+           </span>
+         ) : (
+           <span>Starting...</span>
+         )}
+
+         <span
+           className={`capitalize ${
+             isError
+               ? 'text-red-400'
+               : isComplete
+                 ? 'text-green-400'
+                 : 'text-blue-400'
+           }`}
+         >
+           {status}
+         </span>
+       </div>
+     </div>
+   )
+ }
frontend/src/components/index.ts CHANGED
@@ -2,3 +2,4 @@ export { Layout } from './Layout'
  export { MetricsPanel } from './MetricsPanel'
  export { CaseSelector } from './CaseSelector'
  export { NiiVueViewer } from './NiiVueViewer'
+ export { ProgressIndicator } from './ProgressIndicator'
frontend/src/hooks/__tests__/useSegmentation.test.tsx CHANGED
@@ -1,19 +1,28 @@
- import { describe, it, expect } from 'vitest'
  import { renderHook, waitFor, act } from '@testing-library/react'
  import { server } from '../../mocks/server'
  import { errorHandlers } from '../../mocks/handlers'
  import { useSegmentation } from '../useSegmentation'

  describe('useSegmentation', () => {
    it('starts with null result and not loading', () => {
      const { result } = renderHook(() => useSegmentation())

      expect(result.current.result).toBeNull()
      expect(result.current.isLoading).toBe(false)
      expect(result.current.error).toBeNull()
    })

-   it('sets loading state during segmentation', async () => {
      const { result } = renderHook(() => useSegmentation())

      act(() => {
@@ -22,75 +31,132 @@ describe('useSegmentation', () => {

      expect(result.current.isLoading).toBe(true)

      await waitFor(() => {
-       expect(result.current.isLoading).toBe(false)
      })
    })

-   it('returns result on success', async () => {
      const { result } = renderHook(() => useSegmentation())

      await act(async () => {
-       await result.current.runSegmentation('sub-stroke0001')
      })

-     expect(result.current.result).not.toBeNull()
      expect(result.current.result?.metrics.caseId).toBe('sub-stroke0001')
      expect(result.current.result?.metrics.diceScore).toBe(0.847)
      expect(result.current.result?.dwiUrl).toContain('dwi.nii.gz')
    })

-   it('sets error on failure', async () => {
-     server.use(errorHandlers.segmentServerError)

      const { result } = renderHook(() => useSegmentation())

-     await act(async () => {
-       await result.current.runSegmentation('sub-stroke0001')
      })

-     expect(result.current.error).toMatch(/segmentation failed/i)
      expect(result.current.result).toBeNull()
    })

    it('clears previous error on new request', async () => {
-     server.use(errorHandlers.segmentServerError)
      const { result } = renderHook(() => useSegmentation())

      // First request fails
-     await act(async () => {
-       await result.current.runSegmentation('sub-stroke0001')
      })
-     expect(result.current.error).not.toBeNull()

      // Reset to success handler
      server.resetHandlers()

-     // Second request succeeds
-     await act(async () => {
-       await result.current.runSegmentation('sub-stroke0001')
      })

      expect(result.current.error).toBeNull()
-     expect(result.current.result).not.toBeNull()
    })

-   it('clears previous result on new request', async () => {
      const { result } = renderHook(() => useSegmentation())

-     // First request
-     await act(async () => {
-       await result.current.runSegmentation('sub-stroke0001')
      })
-     expect(result.current.result).not.toBeNull()

-     // Start second request - result should clear while loading
      act(() => {
-       result.current.runSegmentation('sub-stroke0002')
      })

-     // While loading, previous result is still available
-     // (or you could clear it - depends on UX preference)
-     expect(result.current.isLoading).toBe(true)
    })
  })
+ import { describe, it, expect, vi, beforeEach, afterEach } from 'vitest'
  import { renderHook, waitFor, act } from '@testing-library/react'
  import { server } from '../../mocks/server'
  import { errorHandlers } from '../../mocks/handlers'
  import { useSegmentation } from '../useSegmentation'

  describe('useSegmentation', () => {
+   beforeEach(() => {
+     vi.useFakeTimers({ shouldAdvanceTime: true })
+   })
+
+   afterEach(() => {
+     vi.useRealTimers()
+   })
+
    it('starts with null result and not loading', () => {
      const { result } = renderHook(() => useSegmentation())

      expect(result.current.result).toBeNull()
      expect(result.current.isLoading).toBe(false)
      expect(result.current.error).toBeNull()
+     expect(result.current.jobStatus).toBeNull()
    })

+   it('sets loading state and job status during segmentation', async () => {
      const { result } = renderHook(() => useSegmentation())

      act(() => {

      expect(result.current.isLoading).toBe(true)

+     // Wait for job to be created
      await waitFor(() => {
+       expect(result.current.jobId).toBeDefined()
      })
+
+     expect(result.current.jobStatus).toBeDefined()
    })

+   it('returns result on job completion', async () => {
      const { result } = renderHook(() => useSegmentation())

+     act(() => {
+       result.current.runSegmentation('sub-stroke0001')
+     })
+
+     // Wait for job creation
+     await waitFor(() => {
+       expect(result.current.jobId).toBeDefined()
+     })
+
+     // Advance time to allow job to complete (mock jobs complete in ~3s)
      await act(async () => {
+       await vi.advanceTimersByTimeAsync(5000)
+     })
+
+     await waitFor(() => {
+       expect(result.current.isLoading).toBe(false)
+       expect(result.current.result).not.toBeNull()
      })

      expect(result.current.result?.metrics.caseId).toBe('sub-stroke0001')
      expect(result.current.result?.metrics.diceScore).toBe(0.847)
      expect(result.current.result?.dwiUrl).toContain('dwi.nii.gz')
    })

+   it('shows progress updates during job execution', async () => {
+     const { result } = renderHook(() => useSegmentation())
+
+     act(() => {
+       result.current.runSegmentation('sub-stroke0001')
+     })
+
+     // Wait for job to start
+     await waitFor(() => {
+       expect(result.current.jobId).toBeDefined()
+     })
+
+     // Progress should be tracked
+     expect(result.current.progress).toBeGreaterThanOrEqual(0)
+     expect(result.current.progressMessage).toBeDefined()
+   })
+
+   it('sets error on job creation failure', async () => {
+     server.use(errorHandlers.segmentCreateError)

      const { result } = renderHook(() => useSegmentation())

+     act(() => {
+       result.current.runSegmentation('sub-stroke0001')
      })

+     await waitFor(() => {
+       expect(result.current.isLoading).toBe(false)
+     })
+
+     expect(result.current.error).toMatch(/failed to create job/i)
      expect(result.current.result).toBeNull()
    })

    it('clears previous error on new request', async () => {
+     server.use(errorHandlers.segmentCreateError)
      const { result } = renderHook(() => useSegmentation())

      // First request fails
+     act(() => {
+       result.current.runSegmentation('sub-stroke0001')
+     })
+
+     await waitFor(() => {
+       expect(result.current.error).not.toBeNull()
      })

      // Reset to success handler
      server.resetHandlers()

+     // Second request should clear error
+     act(() => {
+       result.current.runSegmentation('sub-stroke0001')
      })

      expect(result.current.error).toBeNull()
+     expect(result.current.isLoading).toBe(true)
    })

+   it('can cancel a running job', async () => {
      const { result } = renderHook(() => useSegmentation())

+     act(() => {
+       result.current.runSegmentation('sub-stroke0001')
      })

+     await waitFor(() => {
+       expect(result.current.isLoading).toBe(true)
+     })
+
+     // Cancel the job
      act(() => {
+       result.current.cancelJob()
      })

+     expect(result.current.isLoading).toBe(false)
+     expect(result.current.jobStatus).toBeNull()
+   })
+
+   it('cleans up polling on unmount', async () => {
+     const { result, unmount } = renderHook(() => useSegmentation())
+
+     act(() => {
+       result.current.runSegmentation('sub-stroke0001')
+     })
+
+     await waitFor(() => {
+       expect(result.current.isLoading).toBe(true)
+     })
+
+     // Unmount should not throw
+     unmount()
    })
  })
frontend/src/hooks/useSegmentation.ts CHANGED
@@ -1,63 +1,205 @@
- import { useState, useCallback, useRef } from 'react'
  import { apiClient } from '../api/client'
- import type { SegmentationResult } from '../types'

  export function useSegmentation() {
    const [result, setResult] = useState<SegmentationResult | null>(null)
-   const [isLoading, setIsLoading] = useState(false)
    const [error, setError] = useState<string | null>(null)

-   // Track the current request to prevent race conditions
-   // Each new request gets a unique token; only the latest request's results are applied
-   const currentRequestRef = useRef<number>(0)
    const abortControllerRef = useRef<AbortController | null>(null)

-   const runSegmentation = useCallback(async (caseId: string, fastMode = true) => {
-     // Cancel any in-flight request
-     abortControllerRef.current?.abort()
-     const abortController = new AbortController()
-     abortControllerRef.current = abortController
-
-     // Increment request token to track this request
-     const requestToken = ++currentRequestRef.current
-
-     setIsLoading(true)
-     setError(null)
-
-     try {
-       const data = await apiClient.runSegmentation(caseId, fastMode, abortController.signal)
-
-       // Only apply results if this is still the current request
-       // Prevents stale responses from overwriting newer results
-       if (requestToken !== currentRequestRef.current) return
-
-       setResult({
-         dwiUrl: data.dwiUrl,
-         predictionUrl: data.predictionUrl,
-         metrics: {
-           caseId: data.caseId,
-           diceScore: data.diceScore,
-           volumeMl: data.volumeMl,
-           elapsedSeconds: data.elapsedSeconds,
-         },
-       })
-     } catch (err) {
-       // Ignore abort errors - user intentionally cancelled
-       if (err instanceof Error && err.name === 'AbortError') return
-
-       // Only apply error if this is still the current request
-       if (requestToken !== currentRequestRef.current) return
-
-       const message = err instanceof Error ? err.message : 'Unknown error'
-       setError(message)
        setResult(null)
-     } finally {
-       // Only clear loading if this is still the current request
-       if (requestToken === currentRequestRef.current) {
          setIsLoading(false)
        }
      }
-   }, [])

-   return { result, isLoading, error, runSegmentation }
  }
+ import { useState, useCallback, useRef, useEffect } from 'react'
  import { apiClient } from '../api/client'
+ import type { SegmentationResult, JobStatus } from '../types'

+ // Polling interval in milliseconds
+ const POLLING_INTERVAL = 2000
+
+ /**
+  * Hook for running segmentation with async job polling.
+  *
+  * Instead of waiting for the full inference to complete (which can time out
+  * on HuggingFace Spaces), this hook:
+  * 1. Creates a job that returns immediately with a job ID
+  * 2. Polls for job status/progress every 2 seconds
+  * 3. Returns results when the job completes
+  *
+  * This avoids the ~60s gateway timeout on HF Spaces while providing
+  * real-time progress updates to the user.
+  */
  export function useSegmentation() {
+   // Result state
    const [result, setResult] = useState<SegmentationResult | null>(null)
    const [error, setError] = useState<string | null>(null)

+   // Job tracking state
+   const [jobId, setJobId] = useState<string | null>(null)
+   const [jobStatus, setJobStatus] = useState<JobStatus | null>(null)
+   const [progress, setProgress] = useState(0)
+   const [progressMessage, setProgressMessage] = useState('')
+   const [elapsedSeconds, setElapsedSeconds] = useState<number | undefined>(
+     undefined
+   )
+
+   // Loading state - true from job creation until completion/failure
+   const [isLoading, setIsLoading] = useState(false)
+
+   // Refs for managing async operations
+   const currentJobRef = useRef<string | null>(null)
+   const pollingIntervalRef = useRef<number | null>(null)
    const abortControllerRef = useRef<AbortController | null>(null)

+   /**
+    * Stop polling for job status
+    */
+   const stopPolling = useCallback(() => {
+     if (pollingIntervalRef.current) {
+       clearInterval(pollingIntervalRef.current)
+       pollingIntervalRef.current = null
+     }
+   }, [])
+
+   /**
+    * Poll for job status and update state
+    */
+   const pollJobStatus = useCallback(
+     async (id: string, signal: AbortSignal) => {
+       // Don't poll if this isn't the current job
+       if (id !== currentJobRef.current) {
+         stopPolling()
+         return
+       }
+
+       try {
+         const response = await apiClient.getJobStatus(id, signal)
+
+         // Ignore results if job changed
+         if (id !== currentJobRef.current) return
+
+         // Update progress state
+         setJobStatus(response.status)
+         setProgress(response.progress)
+         setProgressMessage(response.progressMessage)
+         setElapsedSeconds(response.elapsedSeconds)
+
+         // Handle completion
+         if (response.status === 'completed' && response.result) {
+           stopPolling()
+           setIsLoading(false)
+           setResult({
+             dwiUrl: response.result.dwiUrl,
+             predictionUrl: response.result.predictionUrl,
+             metrics: {
+               caseId: response.result.caseId,
+               diceScore: response.result.diceScore,
+               volumeMl: response.result.volumeMl,
+               elapsedSeconds: response.result.elapsedSeconds,
+             },
+           })
+         }
+
+         // Handle failure
+         if (response.status === 'failed') {
+           stopPolling()
+           setIsLoading(false)
+           setError(response.error || 'Job failed')
+           setResult(null)
+         }
+       } catch (err) {
+         // Ignore abort errors
+         if (err instanceof Error && err.name === 'AbortError') return
+
+         // Don't stop polling on transient network errors - retry next interval
+         console.warn('Polling error (will retry):', err)
+       }
+     },
+     [stopPolling]
+   )
+
+   /**
+    * Start segmentation job and begin polling
+    */
+   const runSegmentation = useCallback(
+     async (caseId: string, fastMode = true) => {
+       // Cancel any existing job/polling
+       stopPolling()
+       abortControllerRef.current?.abort()
+
+       const abortController = new AbortController()
+       abortControllerRef.current = abortController
+
+       // Reset state
+       setError(null)
        setResult(null)
+       setProgress(0)
+       setProgressMessage('Creating job...')
+       setJobStatus('pending')
+       setElapsedSeconds(undefined)
+       setIsLoading(true)
+
+       try {
+         // Create the job
+         const response = await apiClient.createSegmentJob(
+           caseId,
+           fastMode,
+           abortController.signal
+         )
+
+         // Store job reference
+         const newJobId = response.jobId
+         setJobId(newJobId)
+         currentJobRef.current = newJobId
+         setJobStatus(response.status)
+         setProgressMessage(response.message)
+
+         // Start polling
+         pollingIntervalRef.current = window.setInterval(() => {
+           pollJobStatus(newJobId, abortController.signal)
+         }, POLLING_INTERVAL)
+
+         // Do an initial poll immediately
+         await pollJobStatus(newJobId, abortController.signal)
+       } catch (err) {
+         // Ignore abort errors
+         if (err instanceof Error && err.name === 'AbortError') return
+
+         const message = err instanceof Error ? err.message : 'Failed to start job'
+         setError(message)
          setIsLoading(false)
+         setJobStatus('failed')
        }
+     },
+     [pollJobStatus, stopPolling]
+   )
+
+   /**
+    * Cancel the current job (stops polling, clears loading state)
+    */
+   const cancelJob = useCallback(() => {
+     stopPolling()
+     abortControllerRef.current?.abort()
+     currentJobRef.current = null
+     setIsLoading(false)
+     setJobStatus(null)
+     setProgress(0)
+     setProgressMessage('')
+   }, [stopPolling])
+
+   // Cleanup on unmount
+   useEffect(() => {
+     return () => {
+       stopPolling()
+       abortControllerRef.current?.abort()
      }
+   }, [stopPolling])
+
+   return {
+     // Result data
+     result,
+     error,
+
+     // Job status
+     jobId,
+     jobStatus,
+     progress,
+     progressMessage,
+     elapsedSeconds,
+
+     // Loading state
+     isLoading,

+     // Actions
+     runSegmentation,
+     cancelJob,
+   }
  }
frontend/src/mocks/handlers.ts CHANGED
@@ -1,8 +1,66 @@
  import { http, HttpResponse, delay } from 'msw'

  const API_BASE = import.meta.env.VITE_API_URL || 'http://localhost:7860'

  export const handlers = [
    http.get(`${API_BASE}/api/cases`, async () => {
      await delay(100)
      return HttpResponse.json({
@@ -10,18 +68,75 @@ export const handlers = [
      })
    }),

    http.post(`${API_BASE}/api/segment`, async ({ request }) => {
      const body = (await request.json()) as { case_id: string; fast_mode?: boolean }
-     await delay(200)
-     return HttpResponse.json({
        caseId: body.case_id,
-       diceScore: 0.847,
-       volumeMl: 15.32,
-       // Reflect fast_mode in response - slower when fast_mode=false
-       elapsedSeconds: body.fast_mode === false ? 45.0 : 12.5,
-       dwiUrl: `${API_BASE}/files/dwi.nii.gz`,
-       predictionUrl: `${API_BASE}/files/prediction.nii.gz`,
-     })
    }),
  ]

@@ -38,15 +153,54 @@ export const errorHandlers = {
    return HttpResponse.error()
  }),

- segmentServerError: http.post(`${API_BASE}/api/segment`, () => {
    return HttpResponse.json(
-     { detail: 'Segmentation failed: out of memory' },
-     { status: 500 }
    )
  }),

- segmentTimeout: http.post(`${API_BASE}/api/segment`, async () => {
-   await delay(30000)
-   return HttpResponse.json({ detail: 'Timeout' }, { status: 504 })
  }),
  }
  import { http, HttpResponse, delay } from 'msw'
+ import type { JobStatus } from '../types'

  const API_BASE = import.meta.env.VITE_API_URL || 'http://localhost:7860'

+ // In-memory job store for mocking
+ interface MockJob {
+   id: string
+   caseId: string
+   status: JobStatus
+   progress: number
+   progressMessage: string
+   elapsedSeconds: number
+   fastMode: boolean
+   createdAt: number
+ }
+
+ const mockJobs = new Map<string, MockJob>()
+ let jobCounter = 0
+
+ // Configurable job duration for tests (ms)
+ // Default: 500ms for fast tests
+ let mockJobDurationMs = 500
+
+ /**
+  * Set the mock job duration for tests.
+  * Jobs will complete after this many milliseconds.
+  */
+ export function setMockJobDuration(durationMs: number): void {
+   mockJobDurationMs = durationMs
+ }
+
+ // Simulate job progression over time
+ function getJobProgress(job: MockJob): MockJob {
+   const elapsed = (Date.now() - job.createdAt) / 1000
+   const duration = mockJobDurationMs / 1000 // Convert to seconds
+
+   if (job.status === 'completed' || job.status === 'failed') {
+     return job
+   }
+
+   // Progress through stages based on elapsed time relative to configured duration
+   // Stages: 20% loading, 40% inference, 30% processing, 10% finalizing
+   const progress20 = duration * 0.2
+   const progress60 = duration * 0.6
+   const progress90 = duration * 0.9
+
+   if (elapsed < progress20) {
+     return { ...job, status: 'running', progress: 10, progressMessage: 'Loading case data...', elapsedSeconds: elapsed }
+   } else if (elapsed < progress60) {
+     return { ...job, status: 'running', progress: 30, progressMessage: 'Running DeepISLES inference...', elapsedSeconds: elapsed }
+   } else if (elapsed < progress90) {
+     return { ...job, status: 'running', progress: 70, progressMessage: 'Processing results...', elapsedSeconds: elapsed }
+   } else if (elapsed < duration) {
+     return { ...job, status: 'running', progress: 90, progressMessage: 'Computing metrics...', elapsedSeconds: elapsed }
+   } else {
+     // Job complete
+     return { ...job, status: 'completed', progress: 100, progressMessage: 'Segmentation complete', elapsedSeconds: elapsed }
+   }
+ }
+
  export const handlers = [
+   // GET /api/cases - List available cases
    http.get(`${API_BASE}/api/cases`, async () => {
      await delay(100)
      return HttpResponse.json({

      })
    }),

+   // POST /api/segment - Create segmentation job (returns immediately)
    http.post(`${API_BASE}/api/segment`, async ({ request }) => {
      const body = (await request.json()) as { case_id: string; fast_mode?: boolean }
+     await delay(50) // Small delay to simulate network
+
+     // Create a new job
+     const jobId = `mock-${++jobCounter}`
+     const job: MockJob = {
+       id: jobId,
        caseId: body.case_id,
+       status: 'pending',
+       progress: 0,
+       progressMessage: 'Job queued',
+       elapsedSeconds: 0,
+       fastMode: body.fast_mode !== false,
+       createdAt: Date.now(),
+     }
+     mockJobs.set(jobId, job)
+
+     // Return 202 Accepted with job ID
+     return HttpResponse.json(
+       {
+         jobId: jobId,
+         status: 'pending',
+         message: `Segmentation job queued for ${body.case_id}`,
+       },
+       { status: 202 }
+     )
+   }),
+
+   // GET /api/jobs/:jobId - Get job status
+   http.get(`${API_BASE}/api/jobs/:jobId`, async ({ params }) => {
+     const jobId = params.jobId as string
+     await delay(50) // Small delay to simulate network
+
+     const job = mockJobs.get(jobId)
+     if (!job) {
+       return HttpResponse.json(
+         { detail: `Job not found: ${jobId}. Jobs expire after 1 hour.` },
+         { status: 404 }
+       )
+     }
+
+     // Update job progress based on elapsed time
+     const updatedJob = getJobProgress(job)
+     mockJobs.set(jobId, updatedJob)
+
+     // Build response
+     const response: Record<string, unknown> = {
+       jobId: updatedJob.id,
+       status: updatedJob.status,
+       progress: updatedJob.progress,
+       progressMessage: updatedJob.progressMessage,
+       elapsedSeconds: Math.round(updatedJob.elapsedSeconds * 100) / 100,
+     }
+
+     // Include result if completed
+     if (updatedJob.status === 'completed') {
+       response.result = {
+         caseId: updatedJob.caseId,
+         diceScore: 0.847,
+         volumeMl: 15.32,
+         elapsedSeconds: updatedJob.fastMode ? 12.5 : 45.0,
+         dwiUrl: `${API_BASE}/files/${jobId}/${updatedJob.caseId}/dwi.nii.gz`,
+         predictionUrl: `${API_BASE}/files/${jobId}/${updatedJob.caseId}/prediction.nii.gz`,
+       }
+     }
+
+     return HttpResponse.json(response)
    }),
  ]

    return HttpResponse.error()
  }),

+ segmentCreateError: http.post(`${API_BASE}/api/segment`, () => {
    return HttpResponse.json(
+     { detail: 'Failed to create job: case not found' },
+     { status: 400 }
    )
  }),

+ jobNotFound: http.get(`${API_BASE}/api/jobs/:jobId`, () => {
+   return HttpResponse.json(
+     { detail: 'Job not found or expired' },
+     { status: 404 }
+   )
  }),
+
+ // Simulate a job that fails during processing
+ jobFailed: [
+   http.post(`${API_BASE}/api/segment`, async ({ request }) => {
+     const body = (await request.json()) as { case_id: string }
+     const jobId = `fail-${++jobCounter}`
+     mockJobs.set(jobId, {
+       id: jobId,
+       caseId: body.case_id,
+       status: 'failed',
+       progress: 30,
+       progressMessage: 'Error occurred',
+       elapsedSeconds: 5.2,
+       fastMode: true,
+       createdAt: Date.now(),
+     })
+     return HttpResponse.json(
+       { jobId, status: 'pending', message: 'Job queued' },
+       { status: 202 }
+     )
+   }),
+   http.get(`${API_BASE}/api/jobs/:jobId`, ({ params }) => {
+     const jobId = params.jobId as string
+     const job = mockJobs.get(jobId)
+     if (!job) {
+       return HttpResponse.json({ detail: 'Not found' }, { status: 404 })
+     }
+     return HttpResponse.json({
+       jobId: job.id,
+       status: 'failed',
+       progress: 30,
+       progressMessage: 'Error occurred',
+       elapsedSeconds: 5.2,
+       error: 'Segmentation failed: out of memory',
+     })
+   }),
+ ],
  }
frontend/src/types/index.ts CHANGED
@@ -1,3 +1,4 @@
  export interface Metrics {
    caseId: string
    diceScore: number | null
@@ -5,16 +6,19 @@ export interface Metrics {
    elapsedSeconds: number
  }

  export interface SegmentationResult {
    dwiUrl: string
    predictionUrl: string
    metrics: Metrics
  }

  export interface CasesResponse {
    cases: string[]
  }

  export interface SegmentResponse {
    caseId: string
    diceScore: number | null
@@ -23,3 +27,24 @@ export interface SegmentResponse {
    dwiUrl: string
    predictionUrl: string
  }
+ // Segmentation metrics
  export interface Metrics {
    caseId: string
    diceScore: number | null

    elapsedSeconds: number
  }

+ // Final segmentation result with URLs and metrics
  export interface SegmentationResult {
    dwiUrl: string
    predictionUrl: string
    metrics: Metrics
  }

+ // API Response Types
  export interface CasesResponse {
    cases: string[]
  }

+ // Segmentation result data (embedded in job response)
  export interface SegmentResponse {
    caseId: string
    diceScore: number | null

    dwiUrl: string
    predictionUrl: string
  }
+
+ // Job Status Types
+ export type JobStatus = 'pending' | 'running' | 'completed' | 'failed'
+
+ // Response from POST /api/segment (job creation)
+ export interface CreateJobResponse {
+   jobId: string
+   status: JobStatus
+   message: string
+ }
+
+ // Response from GET /api/jobs/{jobId} (status polling)
+ export interface JobStatusResponse {
+   jobId: string
+   status: JobStatus
+   progress: number
+   progressMessage: string
+   elapsedSeconds?: number
+   result?: SegmentResponse
+   error?: string
+ }
src/stroke_deepisles_demo/api/job_store.py ADDED
@@ -0,0 +1,380 @@
+ """In-memory job store for async ML inference tasks.
+
+ This module provides a thread-safe job store for tracking long-running ML inference
+ jobs. Jobs are stored in memory, which is appropriate for HuggingFace Spaces since:
+ 1. HF Spaces runs a single uvicorn worker (no multi-worker sync needed)
+ 2. Jobs are ephemeral (results cached, cleaned up after TTL)
+ 3. No external dependencies (Redis, DB) required
+
+ Note: Multi-worker deployments would require a shared store (Redis/DB).
+
+ Architecture:
+ - Jobs are created with PENDING status
+ - Background tasks update status to RUNNING, then COMPLETED/FAILED
+ - Frontend polls GET /api/jobs/{id} for status updates
+ - Cleanup thread removes old jobs to prevent memory leaks
+ """
+
+ from __future__ import annotations
+
+ import re
+ import shutil
+ import threading
+ from dataclasses import dataclass
+ from datetime import datetime, timedelta
+ from enum import Enum
+ from pathlib import Path
+ from typing import Any
+
+ from stroke_deepisles_demo.core.logging import get_logger
+
+ logger = get_logger(__name__)
+
+ # Regex for safe job IDs (alphanumeric, hyphens, underscores only)
+ _SAFE_JOB_ID_PATTERN = re.compile(r"^[a-zA-Z0-9_-]+$")
+
+
+ class JobStatus(str, Enum):
+     """Status of an async job."""
+
+     PENDING = "pending"  # Job created, not yet started
+     RUNNING = "running"  # Inference in progress
+     COMPLETED = "completed"  # Success, results available
+     FAILED = "failed"  # Error occurred
+
+
+ @dataclass
+ class Job:
+     """Represents an async segmentation job.
+
+     Attributes:
+         id: Unique job identifier (full UUID hex)
+         status: Current job status
+         case_id: The case being processed
+         fast_mode: Whether to use fast inference mode
+         created_at: When the job was created
+         started_at: When processing began (None if pending)
+         completed_at: When processing finished (None if not done)
+         progress: Progress percentage (0-100)
+         progress_message: Human-readable progress status
+         result: Segmentation results (None until completed)
+         error: Error message (None unless failed)
+     """
+
+     id: str
+     status: JobStatus
+     case_id: str
+     fast_mode: bool
+     created_at: datetime
+     started_at: datetime | None = None
+     completed_at: datetime | None = None
+     progress: int = 0
+     progress_message: str = "Queued"
+     result: dict[str, Any] | None = None
+     error: str | None = None
+
+     @property
+     def elapsed_seconds(self) -> float:
+         """Calculate elapsed time since the job started."""
+         if self.started_at is None:
+             return 0.0
+         end_time = self.completed_at or datetime.now()
+         return (end_time - self.started_at).total_seconds()
+
+     def to_dict(self) -> dict[str, Any]:
+         """Convert job to dictionary for API response."""
+         data: dict[str, Any] = {
+             "jobId": self.id,
+             "status": self.status.value,
+             "progress": self.progress,
+             "progressMessage": self.progress_message,
+         }
+
+         if self.started_at is not None:
+             data["elapsedSeconds"] = round(self.elapsed_seconds, 2)
+
+         if self.status == JobStatus.COMPLETED and self.result is not None:
+             data["result"] = self.result
+
+         if self.status == JobStatus.FAILED and self.error is not None:
+             data["error"] = self.error
+
+         return data
+
+
+ class JobStore:
+     """Thread-safe in-memory job store.
+
+     Provides CRUD operations for jobs with automatic cleanup of old entries.
+     Uses a simple dict with a lock for thread safety.
+
+     Usage:
+         store = JobStore()
+         job = store.create_job("case123", fast_mode=True)
+         store.update_progress(job.id, 50, "Processing...")
+         store.complete_job(job.id, {"result": "data"})
+     """
+
+     # Default time-to-live for completed jobs
+     DEFAULT_TTL = timedelta(hours=1)
+
+     # Cleanup interval (how often to check for expired jobs)
+     CLEANUP_INTERVAL_SECONDS = 600  # 10 minutes
+
+     def __init__(
125
+ self,
126
+ ttl: timedelta = DEFAULT_TTL,
127
+ results_dir: Path | None = None,
128
+ ) -> None:
129
+ """Initialize the job store.
130
+
131
+ Args:
132
+ ttl: How long to keep completed jobs before cleanup
133
+ results_dir: Directory where job results are stored (for cleanup)
134
+ """
135
+ self._jobs: dict[str, Job] = {}
136
+ self._lock = threading.RLock()
137
+ self._ttl = ttl
138
+ self._results_dir = results_dir or Path("/tmp/stroke-results")
139
+ self._cleanup_thread: threading.Thread | None = None
140
+ self._shutdown = threading.Event()
141
+
142
+ @staticmethod
143
+ def _is_safe_job_id(job_id: str) -> bool:
144
+ """Validate job ID to prevent path traversal attacks.
145
+
146
+ Only allows alphanumeric characters, hyphens, and underscores.
147
+ """
148
+ return bool(job_id) and _SAFE_JOB_ID_PATTERN.match(job_id) is not None
149
+
150
+ def create_job(self, job_id: str, case_id: str, fast_mode: bool) -> Job:
151
+ """Create a new job in PENDING status.
152
+
153
+ Args:
154
+ job_id: Unique identifier for the job
155
+ case_id: Case to process
156
+ fast_mode: Whether to use fast inference
157
+
158
+ Returns:
159
+ The created Job object
160
+
161
+ Raises:
162
+ ValueError: If job_id is invalid (contains unsafe characters)
163
+ KeyError: If job_id already exists
164
+ """
165
+ if not self._is_safe_job_id(job_id):
166
+ raise ValueError(f"Invalid job_id: {job_id!r}")
167
+
168
+ job = Job(
169
+ id=job_id,
170
+ status=JobStatus.PENDING,
171
+ case_id=case_id,
172
+ fast_mode=fast_mode,
173
+ created_at=datetime.now(),
174
+ )
175
+ with self._lock:
176
+ if job_id in self._jobs:
177
+ raise KeyError(f"Job already exists: {job_id}")
178
+ self._jobs[job_id] = job
179
+ # Note: Don't log case_id as it may be sensitive (medical domain)
180
+ logger.info("Created job %s", job_id)
181
+ return job
182
+
183
+ def get_job(self, job_id: str) -> Job | None:
184
+ """Get a job by ID.
185
+
186
+ Args:
187
+ job_id: The job identifier
188
+
189
+ Returns:
190
+ The Job object, or None if not found
191
+ """
192
+ with self._lock:
193
+ return self._jobs.get(job_id)
194
+
195
+ def start_job(self, job_id: str) -> None:
196
+ """Mark a job as started (RUNNING status).
197
+
198
+ Args:
199
+ job_id: The job identifier
200
+ """
201
+ with self._lock:
202
+ job = self._jobs.get(job_id)
203
+ if job:
204
+ job.status = JobStatus.RUNNING
205
+ job.started_at = datetime.now()
206
+ job.progress = 5
207
+ job.progress_message = "Starting inference..."
208
+ logger.info("Started job %s", job_id)
209
+
210
+ def update_progress(
211
+ self,
212
+ job_id: str,
213
+ progress: int,
214
+ message: str,
215
+ ) -> None:
216
+ """Update job progress.
217
+
218
+ Args:
219
+ job_id: The job identifier
220
+ progress: Progress percentage (0-100)
221
+ message: Human-readable progress message
222
+ """
223
+ with self._lock:
224
+ job = self._jobs.get(job_id)
225
+ if job and job.status == JobStatus.RUNNING:
226
+ job.progress = min(max(progress, 0), 100)
227
+ job.progress_message = message
228
+
229
+ def complete_job(self, job_id: str, result: dict[str, Any]) -> None:
230
+ """Mark a job as successfully completed.
231
+
232
+ Args:
233
+ job_id: The job identifier
234
+ result: The segmentation results
235
+ """
236
+ with self._lock:
237
+ job = self._jobs.get(job_id)
238
+ if job:
239
+ # Ensure started_at is set for elapsed time calculation
240
+ if job.started_at is None:
241
+ job.started_at = datetime.now()
242
+ job.status = JobStatus.COMPLETED
243
+ job.completed_at = datetime.now()
244
+ job.progress = 100
245
+ job.progress_message = "Segmentation complete"
246
+ job.result = result
247
+ logger.info(
248
+ "Completed job %s in %.2fs",
249
+ job_id,
250
+ job.elapsed_seconds,
251
+ )
252
+
253
+ def fail_job(self, job_id: str, error: str) -> None:
254
+ """Mark a job as failed.
255
+
256
+ Args:
257
+ job_id: The job identifier
258
+ error: Error message describing the failure
259
+ """
260
+ with self._lock:
261
+ job = self._jobs.get(job_id)
262
+ if job:
263
+ # Ensure started_at is set for elapsed time calculation
264
+ if job.started_at is None:
265
+ job.started_at = datetime.now()
266
+ job.status = JobStatus.FAILED
267
+ job.completed_at = datetime.now()
268
+ job.progress_message = "Error occurred"
269
+ job.error = error
270
+ logger.error("Failed job %s: %s", job_id, error)
271
+
272
+ def cleanup_old_jobs(self) -> int:
273
+ """Remove jobs older than TTL to prevent memory leaks.
274
+
275
+ Also cleans up associated result files on disk.
276
+
277
+ Returns:
278
+ Number of jobs cleaned up
279
+ """
280
+ now = datetime.now()
281
+ expired_ids: list[str] = []
282
+
283
+ with self._lock:
284
+ for job_id, job in self._jobs.items():
285
+ if job.completed_at and (now - job.completed_at) > self._ttl:
286
+ expired_ids.append(job_id)
287
+
288
+ for job_id in expired_ids:
289
+ del self._jobs[job_id]
290
+
291
+ # Clean up result files outside the lock
292
+ # Use path validation to prevent path traversal attacks
293
+ base_dir = self._results_dir.resolve()
294
+ for job_id in expired_ids:
295
+ # Skip cleanup for unsafe job IDs (shouldn't happen, but defense in depth)
296
+ if not self._is_safe_job_id(job_id):
297
+ logger.warning("Skipping cleanup for unsafe job id %r", job_id)
298
+ continue
299
+
300
+ result_dir = (self._results_dir / job_id).resolve()
301
+ # Verify path is within results directory (prevent traversal)
302
+ if not result_dir.is_relative_to(base_dir):
303
+ logger.warning("Path traversal attempt blocked for job %s", job_id)
304
+ continue
305
+
306
+ if result_dir.exists():
307
+ try:
308
+ shutil.rmtree(result_dir)
309
+ logger.debug("Cleaned up result files for job %s", job_id)
310
+ except OSError as e:
311
+ logger.warning("Failed to cleanup %s: %s", result_dir, e)
312
+
313
+ if expired_ids:
314
+ logger.info("Cleaned up %d expired jobs", len(expired_ids))
315
+
316
+ return len(expired_ids)
317
+
318
+ def start_cleanup_scheduler(self) -> None:
319
+ """Start background thread for periodic job cleanup."""
320
+ if self._cleanup_thread is not None:
321
+ return # Already running
322
+
323
+ def cleanup_loop() -> None:
324
+ while not self._shutdown.wait(self.CLEANUP_INTERVAL_SECONDS):
325
+ try:
326
+ self.cleanup_old_jobs()
327
+ except Exception:
328
+ logger.exception("Error during job cleanup")
329
+
330
+ self._cleanup_thread = threading.Thread(
331
+ target=cleanup_loop,
332
+ daemon=True,
333
+ name="job-cleanup",
334
+ )
335
+ self._cleanup_thread.start()
336
+ logger.info("Started job cleanup scheduler (interval=%ds)", self.CLEANUP_INTERVAL_SECONDS)
337
+
338
+ def stop_cleanup_scheduler(self) -> None:
339
+ """Stop the cleanup scheduler thread."""
340
+ self._shutdown.set()
341
+ if self._cleanup_thread:
342
+ self._cleanup_thread.join(timeout=5)
343
+ self._cleanup_thread = None
344
+ logger.info("Stopped job cleanup scheduler")
345
+
346
+ def __len__(self) -> int:
347
+ """Return number of jobs in store."""
348
+ with self._lock:
349
+ return len(self._jobs)
350
+
351
+
352
+ # Global job store instance
353
+ # Initialized in main.py on app startup
354
+ job_store: JobStore | None = None
355
+
356
+
357
+ def get_job_store() -> JobStore:
358
+ """Get the global job store instance.
359
+
360
+ Raises:
361
+ RuntimeError: If job store not initialized
362
+ """
363
+ if job_store is None:
364
+ raise RuntimeError("Job store not initialized. Call init_job_store() first.")
365
+ return job_store
366
+
367
+
368
+ def init_job_store(results_dir: Path | None = None) -> JobStore:
369
+ """Initialize the global job store.
370
+
371
+ Args:
372
+ results_dir: Directory for job results
373
+
374
+ Returns:
375
+ The initialized JobStore
376
+ """
377
+ global job_store
378
+ job_store = JobStore(results_dir=results_dir)
379
+ job_store.start_cleanup_scheduler()
380
+ return job_store
src/stroke_deepisles_demo/api/main.py CHANGED
@@ -1,6 +1,21 @@
- """FastAPI application for stroke segmentation API."""
+ """FastAPI application for stroke segmentation API.
+
+ This API provides async ML inference for stroke lesion segmentation using DeepISLES.
+ It implements a job queue pattern to handle long-running inference without timeouts:
+
+ 1. POST /api/segment - Creates job, returns immediately (202)
+ 2. GET /api/jobs/{id} - Poll for status/progress/results
+ 3. GET /files/{job_id}/... - Download result NIfTI files
+
+ Architecture designed to work within HuggingFace Spaces constraints:
+ - ~60s gateway timeout (avoided via async job pattern)
+ - Single worker (in-memory job store is sufficient)
+ - /tmp writable only (results stored there)
+ """

  import os
+ from collections.abc import AsyncIterator
+ from contextlib import asynccontextmanager
  from pathlib import Path
  from typing import Any

@@ -8,12 +23,49 @@ from fastapi import FastAPI
  from fastapi.middleware.cors import CORSMiddleware
  from fastapi.staticfiles import StaticFiles

+ from stroke_deepisles_demo.api.job_store import init_job_store
  from stroke_deepisles_demo.api.routes import router
+ from stroke_deepisles_demo.core.logging import get_logger
+
+ logger = get_logger(__name__)
+
+ # Results directory (must be in /tmp for HF Spaces)
+ RESULTS_DIR = Path("/tmp/stroke-results")
+
+
+ @asynccontextmanager
+ async def lifespan(_app: FastAPI) -> AsyncIterator[None]:
+     """Application lifespan handler for startup/shutdown tasks.
+
+     Startup:
+     - Initialize job store with cleanup scheduler
+     - Create results directory
+
+     Shutdown:
+     - Stop cleanup scheduler
+     """
+     # Startup
+     logger.info("Starting stroke segmentation API...")
+
+     # Create results directory
+     RESULTS_DIR.mkdir(parents=True, exist_ok=True)
+
+     # Initialize job store with cleanup scheduler
+     job_store = init_job_store(results_dir=RESULTS_DIR)
+     logger.info("Job store initialized with %d jobs", len(job_store))
+
+     yield
+
+     # Shutdown
+     logger.info("Shutting down stroke segmentation API...")
+     job_store.stop_cleanup_scheduler()
+

  app = FastAPI(
      title="Stroke Segmentation API",
-     description="DeepISLES stroke lesion segmentation",
-     version="1.0.0",
+     description="DeepISLES stroke lesion segmentation with async job queue",
+     version="2.0.0",
+     lifespan=lifespan,
  )

  # CORS configuration
@@ -41,8 +93,7 @@ app.add_middleware(
  app.include_router(router, prefix="/api")

  # Static files for NIfTI results
- # Create directory if it doesn't exist (ensures mount works on first run)
- RESULTS_DIR = Path("/tmp/stroke-results")
+ # Note: Mount happens at import time; ensure directory exists here as well.
  RESULTS_DIR.mkdir(parents=True, exist_ok=True)
  app.mount("/files", StaticFiles(directory=str(RESULTS_DIR)), name="files")

@@ -50,4 +101,23 @@ app.mount("/files", StaticFiles(directory=str(RESULTS_DIR)), name="files")
  @app.get("/")
  async def root() -> dict[str, Any]:
      """Health check endpoint."""
-     return {"status": "healthy", "service": "stroke-segmentation-api"}
+     return {
+         "status": "healthy",
+         "service": "stroke-segmentation-api",
+         "version": "2.0.0",
+         "features": ["async-jobs", "progress-tracking"],
+     }
+
+
+ @app.get("/health")
+ async def health() -> dict[str, Any]:
+     """Detailed health check endpoint."""
+     from stroke_deepisles_demo.api.job_store import get_job_store
+
+     store = get_job_store()
+     return {
+         "status": "healthy",
+         "jobs_in_memory": len(store),
+         "results_dir": str(RESULTS_DIR),
+         "results_dir_exists": RESULTS_DIR.exists(),
+     }
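FastAPI's `lifespan` parameter takes an async context manager: everything before the `yield` runs at startup, everything after it at shutdown. The contract can be shown with only the standard library (the `events` list here is a stand-in for the job store init/teardown calls, purely for illustration):

```python
import asyncio
from collections.abc import AsyncIterator
from contextlib import asynccontextmanager

events: list[str] = []


@asynccontextmanager
async def lifespan() -> AsyncIterator[None]:
    events.append("startup")   # e.g. init_job_store()
    yield                      # the application serves requests here
    events.append("shutdown")  # e.g. stop_cleanup_scheduler()


async def main() -> None:
    # FastAPI enters the context manager when the server starts
    # and exits it when the server stops.
    async with lifespan():
        events.append("serving")


asyncio.run(main())
# events == ["startup", "serving", "shutdown"]
```

This ordering is why `job_store` created before the `yield` is still in scope for the shutdown code after it.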
src/stroke_deepisles_demo/api/routes.py CHANGED
@@ -1,17 +1,37 @@
- """API route handlers."""
+ """API route handlers for stroke segmentation.
+
+ This module implements an async job queue pattern to handle long-running ML inference:
+ 1. POST /api/segment creates a job and returns immediately (202 Accepted)
+ 2. Background task runs the inference
+ 3. Frontend polls GET /api/jobs/{job_id} for status/results
+
+ This pattern avoids HuggingFace Spaces' ~60s gateway timeout.
+ """
+
+ from __future__ import annotations

  import contextlib
  import os
  import uuid
  from pathlib import Path

- from fastapi import APIRouter, HTTPException, Request
+ from fastapi import APIRouter, BackgroundTasks, HTTPException, Request

- from stroke_deepisles_demo.api.schemas import CasesResponse, SegmentRequest, SegmentResponse
+ from stroke_deepisles_demo.api.job_store import JobStatus, get_job_store
+ from stroke_deepisles_demo.api.schemas import (
+     CasesResponse,
+     CreateJobResponse,
+     JobStatusResponse,
+     SegmentRequest,
+     SegmentResponse,
+ )
+ from stroke_deepisles_demo.core.logging import get_logger
  from stroke_deepisles_demo.data import list_case_ids
  from stroke_deepisles_demo.metrics import compute_volume_ml
  from stroke_deepisles_demo.pipeline import run_pipeline_on_case

+ logger = get_logger(__name__)
+
  router = APIRouter()

  # Base directory for results
@@ -43,52 +63,201 @@ def get_cases() -> CasesResponse:
          return CasesResponse(cases=cases)
      except HTTPException:
          raise
-     except Exception as e:
-         raise HTTPException(status_code=500, detail=str(e)) from None
+     except Exception:
+         logger.exception("Failed to list cases")
+         raise HTTPException(status_code=500, detail="Failed to retrieve cases") from None


- @router.post("/segment", response_model=SegmentResponse)
- def run_segmentation(request: Request, body: SegmentRequest) -> SegmentResponse:
-     """Run DeepISLES segmentation on a case.
-
-     Note: This is a sync def (not async) because run_pipeline_on_case() is synchronous
-     and CPU/GPU-bound. FastAPI automatically runs sync endpoints in a threadpool,
-     which prevents blocking the event loop during inference.
-     """
-     try:
-         # Generate unique run ID to avoid conflicts
-         run_id = str(uuid.uuid4())[:8]
-         output_dir = RESULTS_BASE / run_id
-
-         result = run_pipeline_on_case(
-             body.case_id,
-             output_dir=output_dir,
-             fast=body.fast_mode,
-             compute_dice=True,
-             cleanup_staging=True,
-         )
-
-         # Compute volume (may fail for edge cases)
-         volume_ml = None
-         with contextlib.suppress(Exception):
-             volume_ml = round(compute_volume_ml(result.prediction_mask, threshold=0.5), 2)
-
-         # Build absolute file URLs
-         backend_url = get_backend_base_url(request)
-         dwi_filename = result.input_files["dwi"].name
-         pred_filename = result.prediction_mask.name
-
-         file_path_prefix = f"/files/{run_id}/{result.case_id}"
-
-         return SegmentResponse(
-             caseId=result.case_id,
-             diceScore=result.dice_score,
-             volumeMl=volume_ml,
-             elapsedSeconds=round(result.elapsed_seconds, 2),
-             dwiUrl=f"{backend_url}{file_path_prefix}/{dwi_filename}",
-             predictionUrl=f"{backend_url}{file_path_prefix}/{pred_filename}",
-         )
-     except HTTPException:
-         raise
-     except Exception as e:
-         raise HTTPException(status_code=500, detail=str(e)) from None
+ @router.post(
+     "/segment",
+     response_model=CreateJobResponse,
+     status_code=202,
+     responses={
+         202: {"description": "Job created successfully"},
+         400: {"description": "Invalid request"},
+         500: {"description": "Internal server error"},
+     },
+ )
+ def create_segment_job(
+     request: Request,
+     body: SegmentRequest,
+     background_tasks: BackgroundTasks,
+ ) -> CreateJobResponse:
+     """Create an async segmentation job.
+
+     Returns immediately with a job ID. The actual ML inference runs in the background.
+     Poll GET /api/jobs/{jobId} for status updates and results.
+
+     This async pattern is required because:
+     - DeepISLES inference takes 30-60 seconds
+     - HuggingFace Spaces has a ~60s gateway timeout
+     - Returning immediately avoids timeout errors
+     """
+     try:
+         # Use full UUID hex for uniqueness (no truncation)
+         job_id = uuid.uuid4().hex
+         store = get_job_store()
+         backend_url = get_backend_base_url(request)
+
+         # Create job record
+         store.create_job(job_id, body.case_id, body.fast_mode)
+
+         # Queue background task
+         background_tasks.add_task(
+             run_segmentation_job,
+             job_id=job_id,
+             case_id=body.case_id,
+             fast_mode=body.fast_mode,
+             backend_url=backend_url,
+         )
+
+         # Note: Don't log case_id as it may be sensitive (medical domain)
+         logger.info("Created segmentation job %s", job_id)
+
+         return CreateJobResponse(
+             jobId=job_id,
+             status="pending",
+             message=f"Segmentation job queued for {body.case_id}",
+         )
+
+     except Exception:
+         logger.exception("Failed to create segmentation job")
+         raise HTTPException(status_code=500, detail="Failed to create segmentation job") from None
+
+
+ @router.get(
+     "/jobs/{job_id}",
+     response_model=JobStatusResponse,
+     responses={
+         200: {"description": "Job status retrieved"},
+         404: {"description": "Job not found"},
+     },
+ )
+ def get_job_status(job_id: str) -> JobStatusResponse:
+     """Get the status of a segmentation job.
+
+     Poll this endpoint to track job progress and retrieve results.
+
+     Returns:
+         Job status including progress percentage and results when completed.
+
+     Raises:
+         404: Job not found (may have expired or never existed)
+     """
+     store = get_job_store()
+     job = store.get_job(job_id)
+
+     if job is None:
+         raise HTTPException(
+             status_code=404,
+             detail=f"Job not found: {job_id}. Jobs expire after 1 hour.",
+         )
+
+     # Build response from job data
+     response = JobStatusResponse(
+         jobId=job.id,
+         status=job.status.value,
+         progress=job.progress,
+         progressMessage=job.progress_message,
+         elapsedSeconds=round(job.elapsed_seconds, 2) if job.started_at else None,
+         result=None,
+         error=None,
+     )
+
+     # Include result if completed
+     if job.status == JobStatus.COMPLETED and job.result:
+         response.result = SegmentResponse(**job.result)
+
+     # Include error if failed
+     if job.status == JobStatus.FAILED and job.error:
+         response.error = job.error
+
+     return response
+
+
+ def run_segmentation_job(
+     job_id: str,
+     case_id: str,
+     fast_mode: bool,
+     backend_url: str,
+ ) -> None:
+     """Execute segmentation in background thread.
+
+     This function runs in a threadpool (not the main event loop) because
+     the ML inference is CPU/GPU-bound and blocking.
+
+     Updates job status and progress throughout execution, allowing the
+     frontend to show meaningful progress updates.
+
+     Args:
+         job_id: Unique job identifier
+         case_id: Case to process
+         fast_mode: Whether to use fast inference mode
+         backend_url: Base URL for constructing result file URLs
+     """
+     store = get_job_store()
+     job = store.get_job(job_id)
+
+     if job is None:
+         logger.error("Job %s not found when starting execution", job_id)
+         return
+
+     try:
+         # Mark as running
+         store.start_job(job_id)
+         store.update_progress(job_id, 10, "Loading case data...")
+
+         # Set up output directory
+         output_dir = RESULTS_BASE / job_id
+
+         store.update_progress(job_id, 20, "Staging files for DeepISLES...")
+
+         # Run the pipeline
+         store.update_progress(job_id, 30, "Running DeepISLES inference...")
+
+         result = run_pipeline_on_case(
+             case_id,
+             output_dir=output_dir,
+             fast=fast_mode,
+             compute_dice=True,
+             cleanup_staging=True,
+         )
+
+         store.update_progress(job_id, 85, "Computing metrics...")
+
+         # Compute volume (may fail for edge cases)
+         volume_ml = None
+         with contextlib.suppress(Exception):
+             volume_ml = round(compute_volume_ml(result.prediction_mask, threshold=0.5), 2)
+
+         store.update_progress(job_id, 95, "Preparing results...")
+
+         # Build result data
+         dwi_filename = result.input_files["dwi"].name
+         pred_filename = result.prediction_mask.name
+         file_path_prefix = f"/files/{job_id}/{result.case_id}"
+
+         result_data = {
+             "caseId": result.case_id,
+             "diceScore": result.dice_score,
+             "volumeMl": volume_ml,
+             "elapsedSeconds": round(result.elapsed_seconds, 2),
+             "dwiUrl": f"{backend_url}{file_path_prefix}/{dwi_filename}",
+             "predictionUrl": f"{backend_url}{file_path_prefix}/{pred_filename}",
+         }
+
+         # Mark as completed
+         store.complete_job(job_id, result_data)
+
+         logger.info(
+             "Job %s completed: case=%s, dice=%.3f, time=%.1fs",
+             job_id,
+             case_id,
+             result.dice_score or 0,
+             result.elapsed_seconds,
+         )
+
+     except Exception:
+         logger.exception("Job %s failed", job_id)
+         # Sanitize error message - don't expose internal details to clients
+         store.fail_job(job_id, "Segmentation failed")
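On the client side these routes are consumed by a poll loop (the real frontend hits GET /api/jobs/{id} every 2 seconds). A sketch of that loop against a stand-in `get_status` callable, so it runs without a server (`poll_job` and the canned responses are illustrative, not project code):

```python
import time
from collections.abc import Callable


def poll_job(
    get_status: Callable[[], dict],
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
) -> dict:
    """Poll until the job reaches a terminal state or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status()
        if status["status"] in ("completed", "failed"):
            return status
        time.sleep(interval_s)
    raise TimeoutError("job did not finish in time")


# Stand-in for GET /api/jobs/{id}: completes on the third poll.
_responses = iter([
    {"status": "pending", "progress": 0},
    {"status": "running", "progress": 30},
    {"status": "completed", "progress": 100, "result": {"caseId": "sub-stroke0001"}},
])
final = poll_job(lambda: next(_responses), interval_s=0.0)
```

Because each poll is a fast request, no single HTTP round trip ever approaches the ~60s proxy timeout.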
src/stroke_deepisles_demo/api/schemas.py CHANGED
@@ -1,6 +1,8 @@
  """Pydantic schemas for API requests and responses."""

- from pydantic import BaseModel
+ from typing import Literal
+
+ from pydantic import BaseModel, Field


  class CasesResponse(BaseModel):
@@ -17,7 +19,7 @@ class SegmentRequest(BaseModel):


  class SegmentResponse(BaseModel):
-     """Response for POST /api/segment."""
+     """Segmentation result data (embedded in job response when completed)."""

      caseId: str
      diceScore: float | None
@@ -25,3 +27,44 @@ class SegmentResponse(BaseModel):
      elapsedSeconds: float
      dwiUrl: str
      predictionUrl: str
+
+
+ # Job status type for strong typing
+ JobStatusType = Literal["pending", "running", "completed", "failed"]
+
+
+ class CreateJobResponse(BaseModel):
+     """Response for POST /api/segment (async job creation).
+
+     Returns immediately with job ID. Client should poll GET /api/jobs/{jobId}
+     for status updates and results.
+     """
+
+     jobId: str = Field(..., description="Unique job identifier for polling")
+     status: JobStatusType = Field(..., description="Initial job status (always 'pending')")
+     message: str = Field(..., description="Human-readable status message")
+
+
+ class JobStatusResponse(BaseModel):
+     """Response for GET /api/jobs/{job_id}.
+
+     Provides current job status, progress, and results when completed.
+     """
+
+     jobId: str = Field(..., description="Unique job identifier")
+     status: JobStatusType = Field(..., description="Current job status")
+     progress: int = Field(..., ge=0, le=100, description="Progress percentage (0-100)")
+     progressMessage: str = Field(..., description="Human-readable progress status")
+     elapsedSeconds: float | None = Field(
+         None, description="Time elapsed since job started (seconds)"
+     )
+     result: SegmentResponse | None = Field(
+         None, description="Segmentation results (only present when status='completed')"
+     )
+     error: str | None = Field(None, description="Error message (only present when status='failed')")
+
+
+ class ErrorResponse(BaseModel):
+     """Standard error response body."""
+
+     detail: str = Field(..., description="Error description")
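For reference, plausible wire payloads matching the schemas above; all field values here are made up for illustration (the job ID and URLs are not real outputs):

```python
import json

# 202 response from POST /api/segment
create_job_response = {
    "jobId": "abc123",
    "status": "pending",
    "message": "Segmentation job queued for sub-stroke0001",
}

# 200 response from GET /api/jobs/{id} once the job completes
job_status_response = {
    "jobId": "abc123",
    "status": "completed",
    "progress": 100,
    "progressMessage": "Segmentation complete",
    "elapsedSeconds": 42.17,
    "result": {
        "caseId": "sub-stroke0001",
        "diceScore": 0.847,
        "volumeMl": 15.32,
        "elapsedSeconds": 42.17,
        "dwiUrl": "https://example.org/files/abc123/sub-stroke0001/dwi.nii.gz",
        "predictionUrl": "https://example.org/files/abc123/sub-stroke0001/prediction.nii.gz",
    },
    "error": None,
}

# Both payloads are plain JSON-serializable dicts.
roundtrip = json.loads(json.dumps(job_status_response))
```

Note that `result` and `error` are mutually exclusive: `result` is populated only for `status="completed"`, `error` only for `status="failed"`.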
tests/api/test_endpoints.py CHANGED
@@ -1,20 +1,32 @@
1
  """TDD tests for API endpoints.
2
 
3
- RED-GREEN-REFACTOR: Tests written FIRST, before implementation.
 
 
4
  """
5
 
6
- from unittest.mock import MagicMock, patch
 
 
 
7
 
8
  import pytest
9
  from fastapi.testclient import TestClient
10
 
11
  from stroke_deepisles_demo.api import app
 
12
 
13
 
14
  @pytest.fixture
15
- def client() -> TestClient:
16
- """Create test client for FastAPI app."""
17
- return TestClient(app)
 
 
 
 
 
 
18
 
19
 
20
  class TestHealthCheck:
@@ -63,67 +75,40 @@ class TestGetCases:
63
  response = client.get("/api/cases")
64
 
65
  assert response.status_code == 500
66
- assert "Dataset not found" in response.json()["detail"]
 
67
 
68
 
69
  class TestPostSegment:
70
- """Tests for POST /api/segment endpoint."""
71
-
72
- def test_runs_segmentation_and_returns_result(self, client: TestClient) -> None:
73
- """POST /api/segment runs pipeline and returns metrics + URLs."""
74
- mock_result = MagicMock()
75
- mock_result.case_id = "sub-stroke0001"
76
- mock_result.dice_score = 0.847
77
- mock_result.elapsed_seconds = 12.5
78
- mock_result.prediction_mask.name = "prediction.nii.gz"
79
- mock_result.input_files = {"dwi": MagicMock(name="dwi.nii.gz")}
80
- mock_result.input_files["dwi"].name = "dwi.nii.gz"
81
-
82
- with (
83
- patch("stroke_deepisles_demo.api.routes.run_pipeline_on_case") as mock_pipeline,
84
- patch("stroke_deepisles_demo.api.routes.compute_volume_ml") as mock_volume,
85
- ):
86
- mock_pipeline.return_value = mock_result
87
- mock_volume.return_value = 15.32
88
-
89
- response = client.post(
90
- "/api/segment",
91
- json={"case_id": "sub-stroke0001", "fast_mode": True},
92
- )
93
 
94
- assert response.status_code == 200
95
- data = response.json()
96
- assert data["caseId"] == "sub-stroke0001"
97
- assert data["diceScore"] == 0.847
98
- assert data["volumeMl"] == 15.32
99
- assert data["elapsedSeconds"] == 12.5
100
- assert "dwi.nii.gz" in data["dwiUrl"]
101
- assert "prediction.nii.gz" in data["predictionUrl"]
102
-
103
- def test_passes_fast_mode_to_pipeline(self, client: TestClient) -> None:
104
- """POST /api/segment passes fast_mode parameter to pipeline."""
105
- mock_result = MagicMock()
106
- mock_result.case_id = "sub-stroke0001"
107
- mock_result.dice_score = None
108
- mock_result.elapsed_seconds = 45.0
109
- mock_result.prediction_mask.name = "pred.nii.gz"
110
- mock_result.input_files = {"dwi": MagicMock()}
111
- mock_result.input_files["dwi"].name = "dwi.nii.gz"
112
-
113
- with (
114
- patch("stroke_deepisles_demo.api.routes.run_pipeline_on_case") as mock_pipeline,
115
- patch("stroke_deepisles_demo.api.routes.compute_volume_ml"),
116
- ):
117
- mock_pipeline.return_value = mock_result
118
-
119
- client.post(
120
- "/api/segment",
121
- json={"case_id": "sub-stroke0001", "fast_mode": False},
122
- )
123
-
124
- mock_pipeline.assert_called_once()
125
- call_kwargs = mock_pipeline.call_args[1]
126
- assert call_kwargs["fast"] is False
127
 
128
  def test_returns_422_on_missing_case_id(self, client: TestClient) -> None:
129
  """POST /api/segment returns 422 when case_id is missing."""
@@ -131,15 +116,78 @@ class TestPostSegment:
131
 
132
  assert response.status_code == 422
133
 
134
- def test_returns_500_on_pipeline_error(self, client: TestClient) -> None:
135
- """POST /api/segment returns 500 when pipeline raises exception."""
136
- with patch("stroke_deepisles_demo.api.routes.run_pipeline_on_case") as mock_pipeline:
137
- mock_pipeline.side_effect = RuntimeError("GPU out of memory")
138
 
139
- response = client.post(
140
- "/api/segment",
141
- json={"case_id": "sub-stroke0001"},
142
- )
143
 
144
- assert response.status_code == 500
145
- assert "GPU out of memory" in response.json()["detail"]
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 """TDD tests for API endpoints.
 
+Tests the FastAPI REST API with async job queue pattern.
+POST /api/segment returns 202 Accepted with job ID.
+GET /api/jobs/{id} returns job status/progress/results.
 """
 
+from collections.abc import Generator
+from pathlib import Path
+from tempfile import TemporaryDirectory
+from unittest.mock import patch
 
 import pytest
 from fastapi.testclient import TestClient
 
 from stroke_deepisles_demo.api import app
+from stroke_deepisles_demo.api.job_store import init_job_store
 
 
 @pytest.fixture
+def client() -> Generator[TestClient, None, None]:
+    """Create test client for FastAPI app with fresh job store."""
+    with TemporaryDirectory() as tmpdir:
+        # Initialize a fresh job store for each test
+        store = init_job_store(results_dir=Path(tmpdir))
+        try:
+            yield TestClient(app)
+        finally:
+            store.stop_cleanup_scheduler()
 
 
 class TestHealthCheck:
 
         response = client.get("/api/cases")
 
         assert response.status_code == 500
+        # Note: Error message is sanitized (doesn't expose internal details)
+        assert "Failed to retrieve cases" in response.json()["detail"]
 
 
 class TestPostSegment:
+    """Tests for POST /api/segment endpoint (async job creation)."""
 
+    def test_creates_job_and_returns_202(self, client: TestClient) -> None:
+        """POST /api/segment creates a job and returns 202 Accepted."""
+        response = client.post(
+            "/api/segment",
+            json={"case_id": "sub-stroke0001", "fast_mode": True},
+        )
+
+        assert response.status_code == 202
+        data = response.json()
+        assert "jobId" in data
+        assert data["status"] == "pending"
+        assert "message" in data
+
+    def test_returns_job_id_for_polling(self, client: TestClient) -> None:
+        """POST /api/segment returns a job ID that can be used for polling."""
+        response = client.post(
+            "/api/segment",
+            json={"case_id": "sub-stroke0001", "fast_mode": True},
+        )
+
+        job_id = response.json()["jobId"]
+        assert job_id is not None
+        assert len(job_id) > 0
+
+        # Job should be retrievable via GET /api/jobs/{id}
+        status_response = client.get(f"/api/jobs/{job_id}")
+        assert status_response.status_code == 200
 
     def test_returns_422_on_missing_case_id(self, client: TestClient) -> None:
         """POST /api/segment returns 422 when case_id is missing."""
 
         assert response.status_code == 422
 
 
+class TestGetJobStatus:
+    """Tests for GET /api/jobs/{job_id} endpoint."""
+
+    def test_returns_pending_job_status(self, client: TestClient) -> None:
+        """GET /api/jobs/{id} returns status for a job in the store."""
+        from stroke_deepisles_demo.api.job_store import get_job_store
+
+        # Create a job directly in the store (without running inference)
+        store = get_job_store()
+        store.create_job("pending-job", "sub-stroke0001", fast_mode=True)
+
+        # Get status
+        response = client.get("/api/jobs/pending-job")
+
+        assert response.status_code == 200
+        data = response.json()
+        assert data["jobId"] == "pending-job"
+        assert data["status"] == "pending"
+        assert "progress" in data
+        assert "progressMessage" in data
+
+    def test_returns_404_for_unknown_job(self, client: TestClient) -> None:
+        """GET /api/jobs/{id} returns 404 for unknown job ID."""
+        response = client.get("/api/jobs/nonexistent-job-id")
+
+        assert response.status_code == 404
+        assert "not found" in response.json()["detail"].lower()
+
+    def test_completed_job_includes_result(self, client: TestClient) -> None:
+        """GET /api/jobs/{id} includes result data when job is completed."""
+        from stroke_deepisles_demo.api.job_store import get_job_store
+
+        # Create and manually complete a job (to avoid waiting for real inference)
+        store = get_job_store()
+        store.create_job("test-job", "sub-stroke0001", fast_mode=True)
+        store.start_job("test-job")
+        store.complete_job(
+            "test-job",
+            {
+                "caseId": "sub-stroke0001",
+                "diceScore": 0.847,
+                "volumeMl": 15.32,
+                "elapsedSeconds": 12.5,
+                "dwiUrl": "http://localhost/files/test-job/sub-stroke0001/dwi.nii.gz",
+                "predictionUrl": "http://localhost/files/test-job/sub-stroke0001/pred.nii.gz",
+            },
+        )
+
+        response = client.get("/api/jobs/test-job")
+
+        assert response.status_code == 200
+        data = response.json()
+        assert data["status"] == "completed"
+        assert data["progress"] == 100
+        assert data["result"] is not None
+        assert data["result"]["caseId"] == "sub-stroke0001"
+        assert data["result"]["diceScore"] == 0.847
+
+    def test_failed_job_includes_error(self, client: TestClient) -> None:
+        """GET /api/jobs/{id} includes error message when job failed."""
+        from stroke_deepisles_demo.api.job_store import get_job_store
+
+        # Create and manually fail a job
+        store = get_job_store()
+        store.create_job("test-job", "sub-stroke0001", fast_mode=True)
+        store.start_job("test-job")
+        store.fail_job("test-job", "GPU out of memory")
+
+        response = client.get("/api/jobs/test-job")
+
+        assert response.status_code == 200
+        data = response.json()
+        assert data["status"] == "failed"
+        assert data["error"] == "GPU out of memory"
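The 202-then-poll flow these endpoint tests exercise can be sketched from the client's side as a small polling loop (the frontend's `useSegmentation` hook implements the same idea in TypeScript). This is an illustrative sketch, not project code: `create_job` and `get_status` are hypothetical stand-ins for the real HTTP calls to POST /api/segment and GET /api/jobs/{id}.

```python
# Sketch of the client-side polling loop: create a job, then poll every
# `poll_interval` seconds until it completes or fails. Because each request
# is short, no single request approaches the ~60 s gateway timeout.
import time
from typing import Any, Callable


def run_segmentation(
    create_job: Callable[[], dict[str, Any]],
    get_status: Callable[[str], dict[str, Any]],
    poll_interval: float = 2.0,
    timeout: float = 300.0,
) -> dict[str, Any]:
    """Create a segmentation job, then poll until it finishes."""
    job_id = create_job()["jobId"]  # POST /api/segment -> 202 + job ID
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status(job_id)  # GET /api/jobs/{id}
        if status["status"] == "completed":
            return status["result"]
        if status["status"] == "failed":
            raise RuntimeError(status["error"])
        time.sleep(poll_interval)
    raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
```

The callables are injected so the loop can be unit-tested without a network, mirroring how the MSW mock handlers stand in for the API in the frontend tests.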
tests/api/test_job_store.py ADDED
@@ -0,0 +1,314 @@
+"""Unit tests for the async job store.
+
+Tests the JobStore class that manages background ML inference jobs.
+Follows Uncle Bob's testing principles:
+- Test behavior, not implementation
+- Each test verifies one thing
+- Tests are independent and repeatable
+"""
+
+from collections.abc import Generator
+from datetime import datetime, timedelta
+from pathlib import Path
+from tempfile import TemporaryDirectory
+from unittest.mock import patch
+
+import pytest
+
+from stroke_deepisles_demo.api.job_store import (
+    Job,
+    JobStatus,
+    JobStore,
+    get_job_store,
+    init_job_store,
+)
+
+
+class TestJob:
+    """Tests for the Job dataclass."""
+
+    def test_new_job_has_zero_elapsed_seconds(self) -> None:
+        """A job that hasn't started should report 0 elapsed seconds."""
+        job = Job(
+            id="abc123",
+            status=JobStatus.PENDING,
+            case_id="sub-stroke0001",
+            fast_mode=True,
+            created_at=datetime.now(),
+        )
+
+        assert job.elapsed_seconds == 0.0
+
+    def test_running_job_tracks_elapsed_time(self) -> None:
+        """A running job should report elapsed time since start."""
+        start = datetime.now() - timedelta(seconds=10)
+        job = Job(
+            id="abc123",
+            status=JobStatus.RUNNING,
+            case_id="sub-stroke0001",
+            fast_mode=True,
+            created_at=start - timedelta(seconds=1),
+            started_at=start,
+        )
+
+        # Should be approximately 10 seconds (with some tolerance)
+        assert 9.5 <= job.elapsed_seconds <= 11.0
+
+    def test_completed_job_has_fixed_elapsed_time(self) -> None:
+        """A completed job should report time from start to completion."""
+        start = datetime.now() - timedelta(seconds=30)
+        end = start + timedelta(seconds=15)
+        job = Job(
+            id="abc123",
+            status=JobStatus.COMPLETED,
+            case_id="sub-stroke0001",
+            fast_mode=True,
+            created_at=start - timedelta(seconds=1),
+            started_at=start,
+            completed_at=end,
+        )
+
+        # Should be exactly 15 seconds (completed job doesn't change)
+        assert job.elapsed_seconds == 15.0
+
+    def test_to_dict_includes_required_fields(self) -> None:
+        """Job.to_dict() should include all fields needed by the API."""
+        job = Job(
+            id="abc123",
+            status=JobStatus.RUNNING,
+            case_id="sub-stroke0001",
+            fast_mode=True,
+            created_at=datetime.now(),
+            started_at=datetime.now(),
+            progress=50,
+            progress_message="Processing...",
+        )
+
+        data = job.to_dict()
+
+        assert data["jobId"] == "abc123"
+        assert data["status"] == "running"
+        assert data["progress"] == 50
+        assert data["progressMessage"] == "Processing..."
+        assert "elapsedSeconds" in data
+
+    def test_to_dict_includes_result_when_completed(self) -> None:
+        """Completed jobs should include result data in to_dict()."""
+        job = Job(
+            id="abc123",
+            status=JobStatus.COMPLETED,
+            case_id="sub-stroke0001",
+            fast_mode=True,
+            created_at=datetime.now(),
+            started_at=datetime.now(),
+            completed_at=datetime.now(),
+            result={"caseId": "sub-stroke0001", "diceScore": 0.847},
+        )
+
+        data = job.to_dict()
+
+        assert "result" in data
+        assert data["result"]["diceScore"] == 0.847
+
+    def test_to_dict_includes_error_when_failed(self) -> None:
+        """Failed jobs should include error message in to_dict()."""
+        job = Job(
+            id="abc123",
+            status=JobStatus.FAILED,
+            case_id="sub-stroke0001",
+            fast_mode=True,
+            created_at=datetime.now(),
+            error="GPU out of memory",
+        )
+
+        data = job.to_dict()
+
+        assert "error" in data
+        assert data["error"] == "GPU out of memory"
+
+
+class TestJobStore:
+    """Tests for the JobStore class."""
+
+    @pytest.fixture
+    def store(self) -> Generator[JobStore, None, None]:
+        """Create a fresh JobStore for each test."""
+        with TemporaryDirectory() as tmpdir:
+            yield JobStore(results_dir=Path(tmpdir))
+
+    def test_create_job_returns_pending_job(self, store: JobStore) -> None:
+        """Creating a job should return a job in PENDING status."""
+        job = store.create_job("job-1", "sub-stroke0001", fast_mode=True)
+
+        assert job.id == "job-1"
+        assert job.status == JobStatus.PENDING
+        assert job.case_id == "sub-stroke0001"
+        assert job.fast_mode is True
+
+    def test_get_job_returns_created_job(self, store: JobStore) -> None:
+        """get_job() should return a previously created job."""
+        store.create_job("job-1", "sub-stroke0001", fast_mode=True)
+
+        job = store.get_job("job-1")
+
+        assert job is not None
+        assert job.id == "job-1"
+
+    def test_get_job_returns_none_for_unknown_id(self, store: JobStore) -> None:
+        """get_job() should return None for unknown job IDs."""
+        job = store.get_job("nonexistent")
+
+        assert job is None
+
+    def test_start_job_changes_status_to_running(self, store: JobStore) -> None:
+        """start_job() should update job status to RUNNING."""
+        store.create_job("job-1", "sub-stroke0001", fast_mode=True)
+
+        store.start_job("job-1")
+
+        job = store.get_job("job-1")
+        assert job is not None
+        assert job.status == JobStatus.RUNNING
+        assert job.started_at is not None
+
+    def test_update_progress_changes_progress_fields(self, store: JobStore) -> None:
+        """update_progress() should update progress and message."""
+        store.create_job("job-1", "sub-stroke0001", fast_mode=True)
+        store.start_job("job-1")
+
+        store.update_progress("job-1", 75, "Computing metrics...")
+
+        job = store.get_job("job-1")
+        assert job is not None
+        assert job.progress == 75
+        assert job.progress_message == "Computing metrics..."
+
+    def test_update_progress_clamps_to_valid_range(self, store: JobStore) -> None:
+        """update_progress() should clamp progress to 0-100."""
+        store.create_job("job-1", "sub-stroke0001", fast_mode=True)
+        store.start_job("job-1")
+
+        store.update_progress("job-1", 150, "Over 100")
+        job = store.get_job("job-1")
+        assert job is not None
+        assert job.progress == 100
+
+        store.update_progress("job-1", -10, "Negative")
+        job = store.get_job("job-1")
+        assert job is not None
+        assert job.progress == 0
+
+    def test_complete_job_sets_status_and_result(self, store: JobStore) -> None:
+        """complete_job() should mark job as completed with result."""
+        store.create_job("job-1", "sub-stroke0001", fast_mode=True)
+        store.start_job("job-1")
+
+        result = {"caseId": "sub-stroke0001", "diceScore": 0.847}
+        store.complete_job("job-1", result)
+
+        job = store.get_job("job-1")
+        assert job is not None
+        assert job.status == JobStatus.COMPLETED
+        assert job.progress == 100
+        assert job.result == result
+        assert job.completed_at is not None
+
+    def test_fail_job_sets_status_and_error(self, store: JobStore) -> None:
+        """fail_job() should mark job as failed with error message."""
+        store.create_job("job-1", "sub-stroke0001", fast_mode=True)
+        store.start_job("job-1")
+
+        store.fail_job("job-1", "GPU out of memory")
+
+        job = store.get_job("job-1")
+        assert job is not None
+        assert job.status == JobStatus.FAILED
+        assert job.error == "GPU out of memory"
+        assert job.completed_at is not None
+
+    def test_len_returns_number_of_jobs(self, store: JobStore) -> None:
+        """len(store) should return the number of jobs."""
+        assert len(store) == 0
+
+        store.create_job("job-1", "case1", fast_mode=True)
+        assert len(store) == 1
+
+        store.create_job("job-2", "case2", fast_mode=True)
+        assert len(store) == 2
+
+
+class TestJobStoreCleanup:
+    """Tests for job cleanup functionality."""
+
+    def test_cleanup_removes_old_completed_jobs(self) -> None:
+        """cleanup_old_jobs() should remove jobs older than TTL."""
+        with TemporaryDirectory() as tmpdir:
+            # Use a very short TTL for testing
+            store = JobStore(ttl=timedelta(seconds=0), results_dir=Path(tmpdir))
+
+            store.create_job("job-1", "case1", fast_mode=True)
+            store.start_job("job-1")
+            store.complete_job("job-1", {"result": "data"})
+
+            # Job is "old" immediately (TTL=0)
+            cleaned = store.cleanup_old_jobs()
+
+            assert cleaned == 1
+            assert store.get_job("job-1") is None
+
+    def test_cleanup_keeps_running_jobs(self) -> None:
+        """cleanup_old_jobs() should not remove running jobs."""
+        with TemporaryDirectory() as tmpdir:
+            store = JobStore(ttl=timedelta(seconds=0), results_dir=Path(tmpdir))
+
+            store.create_job("job-1", "case1", fast_mode=True)
+            store.start_job("job-1")
+            # Job is running, not completed
+
+            cleaned = store.cleanup_old_jobs()
+
+            assert cleaned == 0
+            assert store.get_job("job-1") is not None
+
+    def test_cleanup_removes_result_files(self) -> None:
+        """cleanup_old_jobs() should also remove result files on disk."""
+        with TemporaryDirectory() as tmpdir:
+            results_dir = Path(tmpdir)
+            store = JobStore(ttl=timedelta(seconds=0), results_dir=results_dir)
+
+            # Create job and its result directory
+            store.create_job("job-1", "case1", fast_mode=True)
+            store.start_job("job-1")
+            job_results = results_dir / "job-1"
+            job_results.mkdir()
+            (job_results / "prediction.nii.gz").touch()
+            store.complete_job("job-1", {"result": "data"})
+
+            # Cleanup should remove both job record and files
+            store.cleanup_old_jobs()
+
+            assert not job_results.exists()
+
+
+class TestGlobalJobStore:
+    """Tests for the global job store singleton."""
+
+    def test_get_job_store_raises_before_init(self) -> None:
+        """get_job_store() should raise if not initialized."""
+        # Patch the global to simulate uninitialized state
+        with (
+            patch("stroke_deepisles_demo.api.job_store.job_store", None),
+            pytest.raises(RuntimeError, match="not initialized"),
+        ):
+            get_job_store()
+
+    def test_init_job_store_creates_global_instance(self) -> None:
+        """init_job_store() should create and return a JobStore."""
+        with TemporaryDirectory() as tmpdir:
+            store = init_job_store(results_dir=Path(tmpdir))
+
+            assert store is not None
+            assert isinstance(store, JobStore)
+
+            # Clean up the scheduler
+            store.stop_cleanup_scheduler()