# Async Job Queue for Long-Running ML Inference

**Status**: APPROVED
**Created**: 2025-12-12
**Author**: Claude Code Audit

---

## Executive Summary

HuggingFace Spaces enforces a gateway timeout of roughly 60 seconds that cannot be
changed through configuration. DeepISLES ML inference typically takes 30-60 seconds,
so requests that run near the upper end of that range intermittently fail with
504 Gateway Timeout. This spec defines an async job queue that avoids the timeout:
the server returns a job ID immediately and the client polls for status and results.

## Problem Statement

### Current Architecture (Synchronous)

```
Frontend                    Backend                     ML Inference
   |                           |                            |
   |--POST /api/segment------->|                            |
   |                           |--run_pipeline_on_case()--->|
   |                           |                            |
   |      (30-60s wait)        |       (processing)         |
   |                           |                            |
   |                           |<---result------------------|
   |<--200 OK + JSON-----------|                            |
```

**Problem**: HF Spaces proxy times out at ~60s, killing the connection before
the ML inference completes. The response is lost even though processing succeeds.

### Target Architecture (Async with Polling)

```
Frontend                    Backend                     ML Inference
   |                           |                            |
   |--POST /api/segment------->|                            |
   |<--202 Accepted + job_id---|                            |
   |                           |--BackgroundTask----------->|
   |                           |                            |
   |--GET /api/jobs/{id}------>|       (processing)         |
   |<--200 {status: running}---|                            |
   |                           |                            |
   |--GET /api/jobs/{id}------>|                            |
   |<--200 {status: running}---|                            |
   |                           |<---result------------------|
   |--GET /api/jobs/{id}------>|                            |
   |<--200 {status: completed, |                            |
   |        result: {...}}-----|                            |
```

**Solution**: Initial request returns in <1s. Polling requests are fast (<100ms).
No single request exceeds the proxy timeout.

## Technical Design

### 1. Backend Job Store

Job state is stored in an in-memory dictionary. This is appropriate because:
- HF Spaces runs a single uvicorn worker (no multi-worker sync needed)
- Jobs are ephemeral (results are cached and cleaned up after 1 hour)
- No external dependencies (Redis, DB) are required

The trade-off is that all job state is lost when the process restarts.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Any

class JobStatus(str, Enum):
    PENDING = "pending"      # Job created, not started
    RUNNING = "running"      # Inference in progress
    COMPLETED = "completed"  # Success, results available
    FAILED = "failed"        # Error occurred

@dataclass
class Job:
    id: str
    status: JobStatus
    case_id: str
    fast_mode: bool
    created_at: datetime
    started_at: datetime | None = None
    completed_at: datetime | None = None
    progress: int = 0  # 0-100 percentage
    progress_message: str = ""
    result: dict[str, Any] | None = None
    error: str | None = None

# Module-level job store. Each job is mutated only by its own background task
# (single-writer pattern); the dict itself is not guarded by a lock.
jobs: dict[str, Job] = {}
```
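
FastAPI runs synchronous handlers and synchronous background tasks in a thread pool, so the store can be touched from several threads at once. If the single-writer assumption ever stops holding, a lock-guarded wrapper is one option. The sketch below is illustrative only: `JobStore`, `JobSnapshot`, and their methods are hypothetical names, and the job record is trimmed to three fields.

```python
import threading
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class JobSnapshot:
    """Trimmed, immutable stand-in for the Job dataclass (illustrative only)."""
    id: str
    status: str = "pending"
    progress: int = 0


class JobStore:
    """Hypothetical lock-guarded wrapper around the jobs dict."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._jobs: dict[str, JobSnapshot] = {}

    def add(self, job: JobSnapshot) -> None:
        with self._lock:
            self._jobs[job.id] = job

    def get(self, job_id: str) -> Optional[JobSnapshot]:
        with self._lock:
            return self._jobs.get(job_id)

    def update(self, job_id: str, **changes) -> Optional[JobSnapshot]:
        """Swap in a modified copy so readers never see a half-updated job."""
        with self._lock:
            current = self._jobs.get(job_id)
            if current is None:
                return None
            updated = replace(current, **changes)
            self._jobs[job_id] = updated
            return updated

    def remove(self, job_id: str) -> None:
        with self._lock:
            self._jobs.pop(job_id, None)


store = JobStore()
store.add(JobSnapshot(id="a1b2c3d4"))
store.update("a1b2c3d4", status="running", progress=10)
```

Replacing the whole record on each update means a polling request reads either the old state or the new one, never a mix of the two.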

### 2. API Endpoints

#### POST /api/segment (Modified)
Returns immediately with job ID.

**Request**: Same as before
```json
{
  "case_id": "sub-strokecase0001",
  "fast_mode": true
}
```

**Response**: 202 Accepted
```json
{
  "jobId": "a1b2c3d4",
  "status": "pending",
  "message": "Segmentation job queued"
}
```

#### GET /api/jobs/{job_id}
Poll for job status and results.

**Response (Running)**:
```json
{
  "jobId": "a1b2c3d4",
  "status": "running",
  "progress": 45,
  "progressMessage": "Running DeepISLES inference...",
  "elapsedSeconds": 23.5
}
```

**Response (Completed)**:
```json
{
  "jobId": "a1b2c3d4",
  "status": "completed",
  "progress": 100,
  "progressMessage": "Segmentation complete",
  "elapsedSeconds": 42.3,
  "result": {
    "caseId": "sub-strokecase0001",
    "diceScore": 0.847,
    "volumeMl": 12.34,
    "dwiUrl": "https://...hf.space/files/a1b2c3d4/...",
    "predictionUrl": "https://...hf.space/files/a1b2c3d4/..."
  }
}
```

**Response (Failed)**:
```json
{
  "jobId": "a1b2c3d4",
  "status": "failed",
  "progress": 0,
  "progressMessage": "Error occurred",
  "elapsedSeconds": 5.2,
  "error": "Case not found: sub-invalid"
}
```

**Response (Not Found)**: 404
```json
{
  "detail": "Job not found: xyz123"
}
```
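
The response shapes above can be produced by one small mapping function. The following is a minimal sketch, not the backend's actual code: `job_status_payload` is a hypothetical helper, the `Job` here carries only the fields it needs, and measuring `elapsedSeconds` from `created_at` is an assumption (the spec does not say whether the clock starts at creation or at `started_at`).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Any, Optional


@dataclass
class Job:
    """Subset of the Job fields needed to build a status response."""
    id: str
    status: str
    created_at: datetime
    completed_at: Optional[datetime] = None
    progress: int = 0
    progress_message: str = ""
    result: Optional[dict[str, Any]] = None
    error: Optional[str] = None


def job_status_payload(job: Job, now: datetime) -> dict[str, Any]:
    """Build the camelCase body for GET /api/jobs/{job_id}."""
    # Finished jobs stop the clock at completion; running jobs use `now`.
    end = job.completed_at or now
    payload: dict[str, Any] = {
        "jobId": job.id,
        "status": job.status,
        "progress": job.progress,
        "progressMessage": job.progress_message,
        "elapsedSeconds": round((end - job.created_at).total_seconds(), 1),
    }
    # `result` and `error` appear only in the states that define them.
    if job.status == "completed" and job.result is not None:
        payload["result"] = job.result
    if job.status == "failed" and job.error is not None:
        payload["error"] = job.error
    return payload
```

Passing `now` in as a parameter keeps the function deterministic, which makes the job endpoint unit tests straightforward.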

### 3. Background Task Execution

```python
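import uuid
from datetime import datetime

from fastapi import BackgroundTasks, Request

# router, SegmentRequest, SegmentJobResponse, Job, JobStatus, jobs,
# run_segmentation_job and get_backend_base_url are assumed to be defined
# elsewhere in the backend.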

@router.post("/segment", response_model=SegmentJobResponse, status_code=202)
def create_segment_job(
    request: Request,
    body: SegmentRequest,
    background_tasks: BackgroundTasks
) -> SegmentJobResponse:
    """Create a segmentation job and return immediately."""
    job_id = str(uuid.uuid4())[:8]

    # Create job record
    job = Job(
        id=job_id,
        status=JobStatus.PENDING,
        case_id=body.case_id,
        fast_mode=body.fast_mode,
        created_at=datetime.now(),
    )
    jobs[job_id] = job

    # Queue background task
    background_tasks.add_task(
        run_segmentation_job,
        job_id=job_id,
        case_id=body.case_id,
        fast_mode=body.fast_mode,
        backend_url=get_backend_base_url(request),
    )

    return SegmentJobResponse(
        jobId=job_id,
        status=JobStatus.PENDING,
        message="Segmentation job queued",
    )
```
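
Truncating a UUID to 8 hex characters leaves 32 bits of randomness, so a collision with a job still in the store is unlikely but not impossible. One way to rule it out is to check against the existing IDs before accepting a candidate. This is a sketch under that assumption; `new_job_id` is a hypothetical helper, not part of the current code.

```python
import uuid


def new_job_id(existing: set[str], length: int = 8, max_attempts: int = 10) -> str:
    """Return a short hex ID that is not already in `existing`."""
    for _ in range(max_attempts):
        # uuid4().hex has no dashes, so any prefix is pure hex.
        candidate = uuid.uuid4().hex[:length]
        if candidate not in existing:
            return candidate
    raise RuntimeError("could not allocate a unique job ID")
```

In the endpoint above this would be called as `new_job_id(set(jobs))` in place of `str(uuid.uuid4())[:8]`.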

### 4. Job Execution with Progress Updates

```python
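import contextlib
from datetime import datetime

# RESULTS_BASE, run_pipeline_on_case and compute_volume_ml are assumed to be
# defined elsewhere in the backend.
def run_segmentation_job(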
    job_id: str,
    case_id: str,
    fast_mode: bool,
    backend_url: str,
) -> None:
    """Execute segmentation in background thread."""
    job = jobs.get(job_id)
    if not job:
        return

    try:
        # Mark as running
        job.status = JobStatus.RUNNING
        job.started_at = datetime.now()
        job.progress = 10
        job.progress_message = "Loading case data..."

        # Run inference with progress callbacks
        output_dir = RESULTS_BASE / job_id

        job.progress = 20
        job.progress_message = "Staging files for DeepISLES..."

        result = run_pipeline_on_case(
            case_id,
            output_dir=output_dir,
            fast=fast_mode,
            compute_dice=True,
            cleanup_staging=True,
            # Future: pass progress_callback for finer updates
        )

        job.progress = 90
        job.progress_message = "Computing metrics..."

        # Compute volume
        volume_ml = None
        with contextlib.suppress(Exception):
            volume_ml = round(compute_volume_ml(result.prediction_mask, threshold=0.5), 2)

        # Build result
        job.progress = 100
        job.progress_message = "Segmentation complete"
        job.status = JobStatus.COMPLETED
        job.completed_at = datetime.now()
        job.result = {
            "caseId": result.case_id,
            "diceScore": result.dice_score,
            "volumeMl": volume_ml,
            "elapsedSeconds": round(result.elapsed_seconds, 2),
            "dwiUrl": f"{backend_url}/files/{job_id}/{result.case_id}/{result.input_files['dwi'].name}",
            "predictionUrl": f"{backend_url}/files/{job_id}/{result.case_id}/{result.prediction_mask.name}",
        }

    except Exception as e:
        job.status = JobStatus.FAILED
        job.completed_at = datetime.now()
        job.error = str(e)
        job.progress_message = "Error occurred"
```
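
The code above jumps from 20% to 90% because the pipeline call is opaque. If `run_pipeline_on_case` were extended to report progress (the "Future" comment above), its 0.0-1.0 fraction could be mapped onto the 20-90 band. The sketch below assumes such a callback exists; `scaled_progress` and the callback signature are hypothetical.

```python
from typing import Callable


def scaled_progress(
    set_progress: Callable[[int, str], None],
    start: int = 20,
    end: int = 90,
) -> Callable[[float, str], None]:
    """Map a pipeline fraction in [0.0, 1.0] onto the job's start..end percent range."""

    def callback(fraction: float, message: str) -> None:
        # Clamp so a misbehaving pipeline cannot push progress out of range.
        clamped = min(max(fraction, 0.0), 1.0)
        set_progress(start + round(clamped * (end - start)), message)

    return callback


# Example: record what a job would see as the pipeline reports progress.
updates: list[tuple[int, str]] = []
report = scaled_progress(lambda pct, msg: updates.append((pct, msg)))
report(0.0, "Staging files for DeepISLES...")
report(0.5, "Running DeepISLES inference...")
report(1.0, "Computing metrics...")
```

In the job function, `set_progress` would be a small closure that writes `job.progress` and `job.progress_message`.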

### 5. Job Cleanup (Memory Management)

```python
import shutil
import threading
import time
from datetime import datetime, timedelta

JOB_TTL = timedelta(hours=1)  # Keep completed jobs for 1 hour

def cleanup_old_jobs() -> None:
    """Remove jobs older than TTL to prevent memory leaks."""
    now = datetime.now()
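    # Snapshot the items first: request threads may add jobs while this runs,
    # and mutating a dict during iteration raises RuntimeError.
    expired = [
        job_id for job_id, job in list(jobs.items())
        if job.completed_at and (now - job.completed_at) > JOB_TTL
    ]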
    for job_id in expired:
        # Also cleanup result files
        result_dir = RESULTS_BASE / job_id
        if result_dir.exists():
            shutil.rmtree(result_dir, ignore_errors=True)
        del jobs[job_id]

# Run cleanup every 10 minutes
def start_cleanup_scheduler():
    def run():
        while True:
            time.sleep(600)  # 10 minutes
            cleanup_old_jobs()

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
```
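
The expiry rule is easier to unit-test when it is separated from the clock and the filesystem. A minimal sketch of that split follows; `expired_job_ids` is a hypothetical helper that takes the current time as an argument and only decides which IDs to drop.

```python
from datetime import datetime, timedelta
from typing import Mapping, Optional


def expired_job_ids(
    completed_at_by_id: Mapping[str, Optional[datetime]],
    now: datetime,
    ttl: timedelta = timedelta(hours=1),
) -> list[str]:
    """Return IDs whose completion time is more than `ttl` before `now`.

    Jobs that have not finished (completed_at is None) are never expired.
    """
    # Iterate over a snapshot so concurrent inserts cannot break the loop.
    return [
        job_id
        for job_id, completed_at in list(completed_at_by_id.items())
        if completed_at is not None and now - completed_at > ttl
    ]
```

Note that under this rule, as in `cleanup_old_jobs` above, a job that never reaches a terminal state stays in memory indefinitely.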

### 6. Frontend Polling Hook

```typescript
// hooks/useJobPolling.ts
import { useState, useEffect, useCallback, useRef } from 'react'
// Assumes SegmentResponse is exported from api/client alongside the job types
import { apiClient, JobStatus, SegmentResponse } from '../api/client'

interface UseJobPollingOptions {
  pollingInterval?: number  // ms, default 2000
  onComplete?: (result: SegmentResponse) => void
  onError?: (error: string) => void
}

export function useJobPolling(options: UseJobPollingOptions = {}) {
  const { pollingInterval = 2000, onComplete, onError } = options

  const [jobId, setJobId] = useState<string | null>(null)
  const [status, setStatus] = useState<JobStatus | null>(null)
  const [progress, setProgress] = useState(0)
  const [progressMessage, setProgressMessage] = useState('')
  const [error, setError] = useState<string | null>(null)
  const [isPolling, setIsPolling] = useState(false)

  const intervalRef = useRef<number | null>(null)
  const onCompleteRef = useRef(onComplete)
  const onErrorRef = useRef(onError)

  // Keep callbacks current
  useEffect(() => {
    onCompleteRef.current = onComplete
    onErrorRef.current = onError
  })

  const stopPolling = useCallback(() => {
    if (intervalRef.current) {
      clearInterval(intervalRef.current)
      intervalRef.current = null
    }
    setIsPolling(false)
  }, [])

  const pollJobStatus = useCallback(async (id: string) => {
    try {
      const response = await apiClient.getJobStatus(id)

      setStatus(response.status)
      setProgress(response.progress)
      setProgressMessage(response.progressMessage)

      if (response.status === 'completed' && response.result) {
        stopPolling()
        onCompleteRef.current?.(response.result)
      } else if (response.status === 'failed') {
        stopPolling()
        setError(response.error || 'Job failed')
        onErrorRef.current?.(response.error || 'Job failed')
      }
    } catch (err) {
      // Don't stop polling on network errors - might be transient
      console.warn('Polling error:', err)
    }
  }, [stopPolling])

  const startJob = useCallback(async (caseId: string, fastMode = true) => {
    // Reset state
    setError(null)
    setProgress(0)
    setProgressMessage('Starting...')
    setStatus('pending')

    try {
      // Create job
      const response = await apiClient.createSegmentJob(caseId, fastMode)
      setJobId(response.jobId)
      setStatus(response.status)

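      // Start polling (clear any interval left over from a previous job)
      if (intervalRef.current) {
        clearInterval(intervalRef.current)
      }
      setIsPolling(true)
      intervalRef.current = window.setInterval(
        () => pollJobStatus(response.jobId),
        pollingInterval
      )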

      // Initial poll
      await pollJobStatus(response.jobId)

    } catch (err) {
      const message = err instanceof Error ? err.message : 'Failed to start job'
      setError(message)
      onErrorRef.current?.(message)
    }
  }, [pollingInterval, pollJobStatus])

  // Cleanup on unmount
  useEffect(() => {
    return () => {
      if (intervalRef.current) {
        clearInterval(intervalRef.current)
      }
    }
  }, [])

  return {
    jobId,
    status,
    progress,
    progressMessage,
    error,
    isPolling,
    startJob,
    stopPolling,
  }
}
```

### 7. Frontend API Client Extensions

```typescript
// api/client.ts additions

export type JobStatus = 'pending' | 'running' | 'completed' | 'failed'

export interface CreateJobResponse {
  jobId: string
  status: JobStatus
  message: string
}

export interface JobStatusResponse {
  jobId: string
  status: JobStatus
  progress: number
  progressMessage: string
  elapsedSeconds?: number
  result?: SegmentResponse
  error?: string
}

class ApiClient {
  // ... existing methods ...

  async createSegmentJob(
    caseId: string,
    fastMode: boolean = true,
    signal?: AbortSignal
  ): Promise<CreateJobResponse> {
    const response = await fetch(`${this.baseUrl}/api/segment`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ case_id: caseId, fast_mode: fastMode }),
      signal,
    })

    if (!response.ok) {
      const error = await response.json().catch(() => ({}))
      throw new ApiError(
        `Failed to create job: ${error.detail || response.statusText}`,
        response.status,
        error.detail
      )
    }

    return response.json()
  }

  async getJobStatus(jobId: string, signal?: AbortSignal): Promise<JobStatusResponse> {
    const response = await fetch(`${this.baseUrl}/api/jobs/${jobId}`, { signal })

    if (response.status === 404) {
      throw new ApiError('Job not found', 404)
    }

    if (!response.ok) {
      const error = await response.json().catch(() => ({}))
      throw new ApiError(
        `Failed to get job status: ${error.detail || response.statusText}`,
        response.status,
        error.detail
      )
    }

    return response.json()
  }
}
```

### 8. UI Progress Display

```tsx
// components/ProgressIndicator.tsx
interface ProgressIndicatorProps {
  progress: number
  message: string
  status: JobStatus
}

export function ProgressIndicator({ progress, message, status }: ProgressIndicatorProps) {
  return (
    <div className="bg-gray-800 rounded-lg p-4 space-y-3">
      <div className="flex justify-between text-sm">
        <span className="text-gray-400">{message}</span>
        <span className="text-gray-300">{progress}%</span>
      </div>
      <div className="w-full bg-gray-700 rounded-full h-2">
        <div
          className={`h-2 rounded-full transition-all duration-300 ${
            status === 'failed' ? 'bg-red-500' : 'bg-blue-500'
          }`}
          style={{ width: `${progress}%` }}
        />
      </div>
    </div>
  )
}
```

## Implementation Checklist

### Backend
- [ ] Create `job_store.py` with Job dataclass and jobs dict
- [ ] Create Pydantic schemas for job responses
- [ ] Modify POST /api/segment to return 202 with job ID
- [ ] Add GET /api/jobs/{job_id} endpoint
- [ ] Implement background task execution with progress updates
- [ ] Add job cleanup scheduler
- [ ] Update CORS if needed for new endpoint

### Frontend
- [ ] Add job-related types to `types/index.ts`
- [ ] Add API client methods for job creation and polling
- [ ] Create `useJobPolling` hook
- [ ] Create `ProgressIndicator` component
- [ ] Update `useSegmentation` to use job polling
- [ ] Update `App.tsx` to show progress during processing

### Testing
- [ ] Unit tests for job store
- [ ] Unit tests for job endpoints
- [ ] Unit tests for useJobPolling hook
- [ ] E2E test for full job flow
- [ ] Manual test on HF Spaces deployment
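
For the E2E and manual tests, a script-side loop can mirror what the `useJobPolling` hook does. The sketch below is one possible shape: `poll_until_done` is a hypothetical helper, and `fetch_status` stands in for whatever HTTP call the test uses to hit `GET /api/jobs/{job_id}`. The sleep and clock functions are injectable so the loop can be tested without waiting.

```python
import time
from typing import Any, Callable


def poll_until_done(
    fetch_status: Callable[[str], dict[str, Any]],
    job_id: str,
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict[str, Any]:
    """Poll until the job is completed or failed, or raise TimeoutError."""
    deadline = clock() + timeout_s
    while True:
        body = fetch_status(job_id)
        if body["status"] in ("completed", "failed"):
            return body
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} still {body['status']} after {timeout_s}s")
        sleep(interval_s)
```

Unlike the hook, this loop gives up after `timeout_s`, which keeps a stuck job from hanging a test run.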

### Documentation
- [ ] Update API documentation
- [ ] Update bug tracker with resolution
- [ ] Add architecture diagram

## Migration Strategy

1. **Backend**: Add new endpoints alongside existing. Keep old `/api/segment`
   temporarily for backwards compatibility (marked deprecated).

2. **Frontend**: Update to use new job polling system. Old sync behavior removed.

3. **Testing**: Verify on HF Spaces before removing deprecated endpoint.

4. **Cleanup**: Remove deprecated sync endpoint after validation.

## Performance Considerations

| Metric | Before (Sync) | After (Async) |
|--------|---------------|---------------|
| Initial response time | 30-60s | <1s |
| Total request count | 1 | ~15-30 (polling every 2s) |
| Timeout risk | High (inference runs close to the ~60s limit) | Low (no single request approaches the limit) |
| User feedback | None during wait | Progress updates |
| Network efficiency | 1 large response | Many small responses |
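
The request count in the table follows from the polling interval. As a rough worked estimate, assuming the client polls once immediately and then once per interval until the first poll at or after completion (`request_count` is an illustrative helper, not part of the codebase):

```python
import math


def request_count(inference_s: float, poll_interval_s: float = 2.0) -> int:
    """Requests per job: one POST, one immediate poll, then one poll per
    interval until the first poll at or after completion."""
    polls = 1 + math.ceil(inference_s / poll_interval_s)
    return 1 + polls
```

With the default 2-second interval this gives 17 requests for a 30-second job and 32 for a 60-second job, in line with the table's estimate.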

## Alternatives Considered

### 1. SSE (Server-Sent Events)
- **Pros**: Real-time updates, single connection
- **Cons**: Connection stays open (could still timeout), HF proxy issues possible
- **Decision**: Polling is more robust for HF Spaces constraints

### 2. WebSockets
- **Pros**: Bi-directional, real-time
- **Cons**: Known 404 issues on HF Spaces, complex
- **Decision**: Not viable on HF Spaces

### 3. Redis/Celery
- **Pros**: Production-grade, multi-worker support
- **Cons**: Not available on HF Spaces Docker
- **Decision**: In-memory sufficient for single-worker

## References

- [FastAPI Background Tasks](https://fastapi.tiangolo.com/tutorial/background-tasks/)
- [FastAPI Polling Strategy for Long-Running Tasks](https://openillumi.com/en/en-fastapi-long-task-progress-polling/)
- [Managing Background Tasks in FastAPI](https://leapcell.io/blog/managing-background-tasks-and-long-running-operations-in-fastapi)
- [Real Time Polling in React Query 2025](https://samwithcode.in/tutorial/react-js/real-time-polling-in-react-query-2025)
- [504 Gateway Timeout - HF Forums](https://discuss.huggingface.co/t/504-gateway-timeout-with-http-request/24018)