Spaces:

VibecoderMcSwaggins
/

stroke-viewer-frontend

Running

App Files Files Community

stroke-viewer-frontend / docs /specs /async-job-queue.md

VibecoderMcSwaggins

feat(api): async job queue with comprehensive test coverage (#36)

722753e unverified 6 days ago

preview code

raw

history blame

17.2 kB

	# Async Job Queue for Long-Running ML Inference

	Status: APPROVED
	Created: 2025-12-12
	Author: Claude Code Audit

	---

	## Executive Summary

	HuggingFace Spaces has a ~60-second gateway timeout that cannot be bypassed through
	configuration. DeepISLES ML inference typically takes 30-60 seconds, creating
	intermittent 504 Gateway Timeout errors. This spec defines a robust async job queue
	system that eliminates timeout issues by immediately returning a job ID and using
	client-side polling for status/results.

	## Problem Statement

	### Current Architecture (Synchronous)

	```
	Frontend Backend ML Inference
	\| \| \|
	\|--POST /api/segment------->\| \|
	\| \|--run_pipeline_on_case()--->\|
	\| \| \|
	\| (30-60s wait) \| (processing) \|
	\| \| \|
	\| \|<---result------------------\|
	\|<--200 OK + JSON-----------\| \|
	```

	Problem: HF Spaces proxy times out at ~60s, killing the connection before
	the ML inference completes. The response is lost even though processing succeeds.

	### Target Architecture (Async with Polling)

	```
	Frontend Backend ML Inference
	\| \| \|
	\|--POST /api/segment------->\| \|
	\|<--202 Accepted + job_id---\| \|
	\| \|--BackgroundTask----------->\|
	\| \| \|
	\|--GET /api/jobs/{id}------>\| (processing) \|
	\|<--200 {status: running}---\| \|
	\| \| \|
	\|--GET /api/jobs/{id}------>\| \|
	\|<--200 {status: running}---\| \|
	\| \|<---result------------------\|
	\|--GET /api/jobs/{id}------>\| \|
	\|<--200 {status: completed, \| \|
	\| result: {...}}-----\| \|
	```

	Solution: Initial request returns in <1s. Polling requests are fast (<100ms).
	No single request exceeds the proxy timeout.

	## Technical Design

	### 1. Backend Job Store

	In-memory dictionary storing job state. This is appropriate because:
	- HF Spaces runs a single uvicorn worker (no multi-worker sync needed)
	- Jobs are ephemeral (results cached, cleanup after 1 hour)
	- No external dependencies (Redis, DB) required

	```python
	from dataclasses import dataclass, field
	from datetime import datetime
	from enum import Enum
	from typing import Any

	class JobStatus(str, Enum):
	PENDING = "pending" # Job created, not started
	RUNNING = "running" # Inference in progress
	COMPLETED = "completed" # Success, results available
	FAILED = "failed" # Error occurred

	@dataclass
	class Job:
	id: str
	status: JobStatus
	case_id: str
	fast_mode: bool
	created_at: datetime
	started_at: datetime \| None = None
	completed_at: datetime \| None = None
	progress: int = 0 # 0-100 percentage
	progress_message: str = ""
	result: dict[str, Any] \| None = None
	error: str \| None = None

	# Thread-safe job store (single writer pattern)
	jobs: dict[str, Job] = {}
	```

	### 2. API Endpoints

	#### POST /api/segment (Modified)
	Returns immediately with job ID.

	Request: Same as before
	```json
	{
	"case_id": "sub-strokecase0001",
	"fast_mode": true
	}
	```

	Response: 202 Accepted
	```json
	{
	"jobId": "a1b2c3d4",
	"status": "pending",
	"message": "Segmentation job queued"
	}
	```

	#### GET /api/jobs/{job_id}
	Poll for job status and results.

	Response (Running):
	```json
	{
	"jobId": "a1b2c3d4",
	"status": "running",
	"progress": 45,
	"progressMessage": "Running DeepISLES inference...",
	"elapsedSeconds": 23.5
	}
	```

	Response (Completed):
	```json
	{
	"jobId": "a1b2c3d4",
	"status": "completed",
	"progress": 100,
	"progressMessage": "Segmentation complete",
	"elapsedSeconds": 42.3,
	"result": {
	"caseId": "sub-strokecase0001",
	"diceScore": 0.847,
	"volumeMl": 12.34,
	"dwiUrl": "https://...hf.space/files/a1b2c3d4/...",
	"predictionUrl": "https://...hf.space/files/a1b2c3d4/..."
	}
	}
	```

	Response (Failed):
	```json
	{
	"jobId": "a1b2c3d4",
	"status": "failed",
	"progress": 0,
	"progressMessage": "Error occurred",
	"elapsedSeconds": 5.2,
	"error": "Case not found: sub-invalid"
	}
	```

	Response (Not Found): 404
	```json
	{
	"detail": "Job not found: xyz123"
	}
	```

	### 3. Background Task Execution

	```python
	from fastapi import BackgroundTasks

	@router.post("/segment", response_model=SegmentJobResponse, status_code=202)
	def create_segment_job(
	request: Request,
	body: SegmentRequest,
	background_tasks: BackgroundTasks
	) -> SegmentJobResponse:
	"""Create a segmentation job and return immediately."""
	job_id = str(uuid.uuid4())[:8]

	# Create job record
	job = Job(
	id=job_id,
	status=JobStatus.PENDING,
	case_id=body.case_id,
	fast_mode=body.fast_mode,
	created_at=datetime.now(),
	)
	jobs[job_id] = job

	# Queue background task
	background_tasks.add_task(
	run_segmentation_job,
	job_id=job_id,
	case_id=body.case_id,
	fast_mode=body.fast_mode,
	backend_url=get_backend_base_url(request),
	)

	return SegmentJobResponse(
	jobId=job_id,
	status=JobStatus.PENDING,
	message="Segmentation job queued",
	)
	```

	### 4. Job Execution with Progress Updates

	```python
	def run_segmentation_job(
	job_id: str,
	case_id: str,
	fast_mode: bool,
	backend_url: str,
	) -> None:
	"""Execute segmentation in background thread."""
	job = jobs.get(job_id)
	if not job:
	return

	try:
	# Mark as running
	job.status = JobStatus.RUNNING
	job.started_at = datetime.now()
	job.progress = 10
	job.progress_message = "Loading case data..."

	# Run inference with progress callbacks
	output_dir = RESULTS_BASE / job_id

	job.progress = 20
	job.progress_message = "Staging files for DeepISLES..."

	result = run_pipeline_on_case(
	case_id,
	output_dir=output_dir,
	fast=fast_mode,
	compute_dice=True,
	cleanup_staging=True,
	# Future: pass progress_callback for finer updates
	)

	job.progress = 90
	job.progress_message = "Computing metrics..."

	# Compute volume
	volume_ml = None
	with contextlib.suppress(Exception):
	volume_ml = round(compute_volume_ml(result.prediction_mask, threshold=0.5), 2)

	# Build result
	job.progress = 100
	job.progress_message = "Segmentation complete"
	job.status = JobStatus.COMPLETED
	job.completed_at = datetime.now()
	job.result = {
	"caseId": result.case_id,
	"diceScore": result.dice_score,
	"volumeMl": volume_ml,
	"elapsedSeconds": round(result.elapsed_seconds, 2),
	"dwiUrl": f"{backend_url}/files/{job_id}/{result.case_id}/{result.input_files['dwi'].name}",
	"predictionUrl": f"{backend_url}/files/{job_id}/{result.case_id}/{result.prediction_mask.name}",
	}

	except Exception as e:
	job.status = JobStatus.FAILED
	job.completed_at = datetime.now()
	job.error = str(e)
	job.progress_message = "Error occurred"
	```

	### 5. Job Cleanup (Memory Management)

	```python
	import threading
	from datetime import timedelta

	JOB_TTL = timedelta(hours=1) # Keep completed jobs for 1 hour

	def cleanup_old_jobs() -> None:
	"""Remove jobs older than TTL to prevent memory leaks."""
	now = datetime.now()
	expired = [
	job_id for job_id, job in jobs.items()
	if job.completed_at and (now - job.completed_at) > JOB_TTL
	]
	for job_id in expired:
	# Also cleanup result files
	result_dir = RESULTS_BASE / job_id
	if result_dir.exists():
	shutil.rmtree(result_dir, ignore_errors=True)
	del jobs[job_id]

	# Run cleanup every 10 minutes
	def start_cleanup_scheduler():
	def run():
	while True:
	time.sleep(600) # 10 minutes
	cleanup_old_jobs()

	thread = threading.Thread(target=run, daemon=True)
	thread.start()
	```

	### 6. Frontend Polling Hook

	```typescript
	// hooks/useJobPolling.ts
	import { useState, useEffect, useCallback, useRef } from 'react'
	import { apiClient, JobStatus, JobStatusResponse } from '../api/client'

	interface UseJobPollingOptions {
	pollingInterval?: number // ms, default 2000
	onComplete?: (result: SegmentationResult) => void
	onError?: (error: string) => void
	}

	export function useJobPolling(options: UseJobPollingOptions = {}) {
	const { pollingInterval = 2000, onComplete, onError } = options

	const [jobId, setJobId] = useState<string \| null>(null)
	const [status, setStatus] = useState<JobStatus \| null>(null)
	const [progress, setProgress] = useState(0)
	const [progressMessage, setProgressMessage] = useState('')
	const [error, setError] = useState<string \| null>(null)
	const [isPolling, setIsPolling] = useState(false)

	const intervalRef = useRef<number \| null>(null)
	const onCompleteRef = useRef(onComplete)
	const onErrorRef = useRef(onError)

	// Keep callbacks current
	useEffect(() => {
	onCompleteRef.current = onComplete
	onErrorRef.current = onError
	})

	const stopPolling = useCallback(() => {
	if (intervalRef.current) {
	clearInterval(intervalRef.current)
	intervalRef.current = null
	}
	setIsPolling(false)
	}, [])

	const pollJobStatus = useCallback(async (id: string) => {
	try {
	const response = await apiClient.getJobStatus(id)

	setStatus(response.status)
	setProgress(response.progress)
	setProgressMessage(response.progressMessage)

	if (response.status === 'completed' && response.result) {
	stopPolling()
	onCompleteRef.current?.(response.result)
	} else if (response.status === 'failed') {
	stopPolling()
	setError(response.error \|\| 'Job failed')
	onErrorRef.current?.(response.error \|\| 'Job failed')
	}
	} catch (err) {
	// Don't stop polling on network errors - might be transient
	console.warn('Polling error:', err)
	}
	}, [stopPolling])

	const startJob = useCallback(async (caseId: string, fastMode = true) => {
	// Reset state
	setError(null)
	setProgress(0)
	setProgressMessage('Starting...')
	setStatus('pending')

	try {
	// Create job
	const response = await apiClient.createSegmentJob(caseId, fastMode)
	setJobId(response.jobId)
	setStatus(response.status)

	// Start polling
	setIsPolling(true)
	intervalRef.current = window.setInterval(
	() => pollJobStatus(response.jobId),
	pollingInterval
	)

	// Initial poll
	await pollJobStatus(response.jobId)

	} catch (err) {
	const message = err instanceof Error ? err.message : 'Failed to start job'
	setError(message)
	onErrorRef.current?.(message)
	}
	}, [pollingInterval, pollJobStatus])

	// Cleanup on unmount
	useEffect(() => {
	return () => {
	if (intervalRef.current) {
	clearInterval(intervalRef.current)
	}
	}
	}, [])

	return {
	jobId,
	status,
	progress,
	progressMessage,
	error,
	isPolling,
	startJob,
	stopPolling,
	}
	}
	```

	### 7. Frontend API Client Extensions

	```typescript
	// api/client.ts additions

	export type JobStatus = 'pending' \| 'running' \| 'completed' \| 'failed'

	export interface CreateJobResponse {
	jobId: string
	status: JobStatus
	message: string
	}

	export interface JobStatusResponse {
	jobId: string
	status: JobStatus
	progress: number
	progressMessage: string
	elapsedSeconds?: number
	result?: SegmentResponse
	error?: string
	}

	class ApiClient {
	// ... existing methods ...

	async createSegmentJob(
	caseId: string,
	fastMode: boolean = true,
	signal?: AbortSignal
	): Promise<CreateJobResponse> {
	const response = await fetch(`${this.baseUrl}/api/segment`, {
	method: 'POST',
	headers: { 'Content-Type': 'application/json' },
	body: JSON.stringify({ case_id: caseId, fast_mode: fastMode }),
	signal,
	})

	if (!response.ok) {
	const error = await response.json().catch(() => ({}))
	throw new ApiError(
	`Failed to create job: ${error.detail \|\| response.statusText}`,
	response.status,
	error.detail
	)
	}

	return response.json()
	}

	async getJobStatus(jobId: string, signal?: AbortSignal): Promise<JobStatusResponse> {
	const response = await fetch(`${this.baseUrl}/api/jobs/${jobId}`, { signal })

	if (response.status === 404) {
	throw new ApiError('Job not found', 404)
	}

	if (!response.ok) {
	const error = await response.json().catch(() => ({}))
	throw new ApiError(
	`Failed to get job status: ${error.detail \|\| response.statusText}`,
	response.status,
	error.detail
	)
	}

	return response.json()
	}
	}
	```

	### 8. UI Progress Display

	```tsx
	// components/ProgressIndicator.tsx
	interface ProgressIndicatorProps {
	progress: number
	message: string
	status: JobStatus
	}

	export function ProgressIndicator({ progress, message, status }: ProgressIndicatorProps) {
	return (
	<div className="bg-gray-800 rounded-lg p-4 space-y-3">
	<div className="flex justify-between text-sm">
	<span className="text-gray-400">{message}</span>
	<span className="text-gray-300">{progress}%</span>
	</div>
	<div className="w-full bg-gray-700 rounded-full h-2">
	<div
	className={`h-2 rounded-full transition-all duration-300 ${
	status === 'failed' ? 'bg-red-500' : 'bg-blue-500'
	}`}
	style={{ width: `${progress}%` }}
	/>
	</div>
	</div>
	)
	}
	```

	## Implementation Checklist

	### Backend
	- [ ] Create `job_store.py` with Job dataclass and jobs dict
	- [ ] Create Pydantic schemas for job responses
	- [ ] Modify POST /api/segment to return 202 with job ID
	- [ ] Add GET /api/jobs/{job_id} endpoint
	- [ ] Implement background task execution with progress updates
	- [ ] Add job cleanup scheduler
	- [ ] Update CORS if needed for new endpoint

	### Frontend
	- [ ] Add job-related types to `types/index.ts`
	- [ ] Add API client methods for job creation and polling
	- [ ] Create `useJobPolling` hook
	- [ ] Create `ProgressIndicator` component
	- [ ] Update `useSegmentation` to use job polling
	- [ ] Update `App.tsx` to show progress during processing

	### Testing
	- [ ] Unit tests for job store
	- [ ] Unit tests for job endpoints
	- [ ] Unit tests for useJobPolling hook
	- [ ] E2E test for full job flow
	- [ ] Manual test on HF Spaces deployment

	### Documentation
	- [ ] Update API documentation
	- [ ] Update bug tracker with resolution
	- [ ] Add architecture diagram

	## Migration Strategy

	1. Backend: Add new endpoints alongside existing. Keep old `/api/segment`
	temporarily for backwards compatibility (marked deprecated).

	2. Frontend: Update to use new job polling system. Old sync behavior removed.

	3. Testing: Verify on HF Spaces before removing deprecated endpoint.

	4. Cleanup: Remove deprecated sync endpoint after validation.

	## Performance Considerations

	\| Metric \| Before (Sync) \| After (Async) \|
	\|--------\|--------------\|---------------\|
	\| Initial response time \| 30-60s \| <1s \|
	\| Total request count \| 1 \| ~15-30 (polling) \|
	\| Timeout risk \| HIGH \| NONE \|
	\| User feedback \| None during wait \| Progress updates \|
	\| Network efficiency \| 1 large response \| Many small responses \|

	## Alternatives Considered

	### 1. SSE (Server-Sent Events)
	- Pros: Real-time updates, single connection
	- Cons: Connection stays open (could still timeout), HF proxy issues possible
	- Decision: Polling is more robust for HF Spaces constraints

	### 2. WebSockets
	- Pros: Bi-directional, real-time
	- Cons: Known 404 issues on HF Spaces, complex
	- Decision: Not viable on HF Spaces

	### 3. Redis/Celery
	- Pros: Production-grade, multi-worker support
	- Cons: Not available on HF Spaces Docker
	- Decision: In-memory sufficient for single-worker

	## References

	- [FastAPI Background Tasks](https://fastapi.tiangolo.com/tutorial/background-tasks/)
	- [FastAPI Polling Strategy for Long-Running Tasks](https://openillumi.com/en/en-fastapi-long-task-progress-polling/)
	- [Managing Background Tasks in FastAPI](https://leapcell.io/blog/managing-background-tasks-and-long-running-operations-in-fastapi)
	- [Real Time Polling in React Query 2025](https://samwithcode.in/tutorial/react-js/real-time-polling-in-react-query-2025)
	- [504 Gateway Timeout - HF Forums](https://discuss.huggingface.co/t/504-gateway-timeout-with-http-request/24018)