
CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

Multi-GPU video analysis platform with three fully functional modes:

  • Object Detection: Bounding boxes via YOLO11, DETR, or Grounding DINO
  • Segmentation: Mask overlays via Grounded SAM2 (GSAM2) or YOLO+SAM2 (YSAM2)
  • Drone Detection: Aerial object detection via YOLOv8 fine-tuned on VisDrone

Deployed as a HuggingFace Space (Docker SDK) at https://biaslab2025-isr.hf.space.

Development Commands

# Setup
uv sync

# Run dev server
uvicorn app:app --host 0.0.0.0 --port 7860 --reload

# Verify imports (quick smoke test - no tests exist yet)
python -c "from app import app"

# Docker
docker build -t isr . && docker run -p 7860:7860 isr

# Test async detection
curl -X POST http://localhost:7860/detect/async \
  -F "video=@sample.mp4" \
  -F "mode=object_detection" \
  -F "queries=person,car" \
  -F "detector=yolo11"

Core Architecture

Async Detection Flow (primary path)

Frontend (index.html) → POST /detect/async → background task → MJPEG stream + polling
  1. Frontend uploads video + mode + queries to /detect/async
  2. Backend creates a JobInfo, spawns process_video_async() as an asyncio.Task
  3. inference.py runs multi-GPU parallel inference, publishing frames to an MJPEG stream
  4. Frontend consumes /detect/stream/{job_id} for live video, polls /detect/status/{job_id}
  5. On completion, frontend fetches final video from /detect/video/{job_id}
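The five steps above can be sketched as a toy asyncio job table (a hedged sketch: the real JobInfo lives in jobs/models.py and the endpoints are FastAPI routes; here both are reduced to a dict and coroutines):

```python
import asyncio
import uuid

# Simplified stand-in for the jobs system; the "inference" is a no-op.
jobs = {}

async def process_video_async(job_id):
    jobs[job_id]["status"] = "processing"
    await asyncio.sleep(0)                    # stand-in for multi-GPU inference
    jobs[job_id]["status"] = "completed"

async def start_job():
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"status": "queued"}
    # Fire-and-forget, like POST /detect/async returning a job_id immediately
    asyncio.create_task(process_video_async(job_id))
    return job_id

async def main():
    job_id = await start_job()
    # Poll until done, like GET /detect/status/{job_id}
    while jobs[job_id]["status"] != "completed":
        await asyncio.sleep(0.01)
    return jobs[job_id]["status"]

print(asyncio.run(main()))  # completed
```

The key property is that start_job returns before inference finishes, so the HTTP response carries only the job_id and URLs.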

API Endpoints (app.py)

| Method | Path | Purpose |
| --- | --- | --- |
| POST | /detect/async | Start async job (returns job_id + stream/status URLs) |
| GET | /detect/status/{job_id} | Poll job status |
| GET | /detect/stream/{job_id} | MJPEG live stream (event-driven, 640px wide) |
| GET | /detect/video/{job_id} | Download processed MP4 |
| GET | /detect/depth-video/{job_id} | Download depth video |
| GET | /detect/tracks/{job_id}/summary | Per-frame detection counts (timeline heatmap) |
| GET | /detect/tracks/{job_id}/{frame_idx} | Per-frame track data |
| DELETE | /detect/job/{job_id} | Cancel running job |
| POST | /detect | Synchronous detection (returns MP4 directly) |
| POST | /benchmark | GSAM2 latency breakdown |
| POST | /benchmark/profile | Per-frame timing breakdown |
| POST | /benchmark/analysis | Full roofline analysis |

/detect/async params: video, mode (object_detection/segmentation/drone_detection), queries, detector (default: yolo11), segmenter (default: GSAM2-L), enable_depth (default: false), step (default: 7, segmentation keyframe interval).

Multi-GPU Inference Pipeline (inference.py)

run_inference() - Detection and drone modes:

  • AsyncVideoReader prefetches frames into a queue (up to 32 frames)
  • Models loaded in parallel via ThreadPoolExecutor (one detector per GPU)
  • Queue-based producer/consumer: main thread feeds queue_in, N GPU workers drain it
  • Workers batch frames (up to max_batch_size=32 for YOLO) under per-model RLock
  • Writer thread reorders frames, runs ByteTracker + SpeedEstimator, writes via StreamingVideoWriter, publishes to MJPEG stream
  • Cancellation: workers poll _check_cancellation(job_id) each cycle
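The queue_in producer/consumer pattern above can be sketched as follows (the detector here is a stand-in; real workers run one model per GPU under its RLock):

```python
import queue
import threading

def run(frames, n_workers=2):
    """Toy producer/consumer: main thread feeds queue_in, N workers drain it."""
    queue_in, results = queue.Queue(), {}
    lock = threading.RLock()                  # per-model RLock in the real pipeline

    def worker():
        while True:
            item = queue_in.get()
            if item is None:                  # sentinel: no more frames
                return
            idx, frame = item
            with lock:
                results[idx] = frame * 2      # stand-in for detector.predict()

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for i, frame in enumerate(frames):        # producer: main thread feeds queue_in
        queue_in.put((i, frame))
    for _ in threads:
        queue_in.put(None)                    # one sentinel per worker
    for t in threads:
        t.join()
    return [results[i] for i in range(len(frames))]

print(run([1, 2, 3]))  # [2, 4, 6]
```

Note that results come back keyed by index because workers finish out of order; the real pipeline handles this with the writer thread's reorder buffer.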

run_grounded_sam2_tracking() - Segmentation mode:

  • Extracts all frames to JPEG files on disk
  • Runs detection on keyframes (every step frames) to seed SAM2
  • SAM2 video predictor propagates masks between keyframes
  • ID reconciliation via IoU matching in MaskDictionary
  • Renders colored semi-transparent mask overlays with contours
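The IoU criterion behind the ID reconciliation step can be illustrated on boxes (a hedged sketch: MaskDictionary matches masks, and reconcile here is a hypothetical helper, not the real API):

```python
def iou(a, b):
    """Intersection-over-union for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def reconcile(prev, new, thresh=0.5):
    """Keep an existing track ID when a new detection overlaps it enough;
    otherwise assign a fresh ID. `prev` maps track_id -> box."""
    assigned, next_id = {}, max(prev, default=-1) + 1
    for box in new:
        best = max(prev.items(), key=lambda kv: iou(kv[1], box), default=None)
        if best and iou(best[1], box) >= thresh:
            assigned[best[0]] = box
        else:
            assigned[next_id] = box
            next_id += 1
    return assigned

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 0.142857... (1 / 7)
```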

Jobs System (jobs/)

  • models.py - JobInfo dataclass + JobStatus enum (PROCESSING/COMPLETED/FAILED/CANCELLED)
  • storage.py - In-memory JobStorage (singleton, RLock-protected) + disk at /tmp/detection_jobs/{job_id}/. Per-frame track data is stored here. Auto-cleanup runs every 10 min (1-hour expiry).
  • background.py - process_video_async() coroutine dispatches to the right inference function
  • streaming.py - MJPEG frame queue + asyncio.Event publisher; publish_frame() resizes frames to 640px wide
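A hedged sketch of the jobs data model (JobStatus values follow the doc; any field beyond job_id/status is an assumption):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class JobStatus(Enum):
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"

@dataclass
class JobInfo:
    job_id: str
    status: JobStatus = JobStatus.PROCESSING
    error: Optional[str] = None         # assumed field, populated for FAILED jobs

job = JobInfo(job_id="abc123")
job.status = JobStatus.COMPLETED
print(job.status.value)  # completed
```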

Frontend (demo/)

Single-page command center UI served at / (mounted at /demo). No build step. Uses window.ISR global namespace.

Key scripts:

  • init.js → bootstraps window.ISR, wires UI, initializes the state machine
  • state-machine.js → explicit FSM for the UI flow (idle → detecting → playing → inspect)
  • api.js → all backend API calls (startDetection, fetchTracks, fetchPointCloud, etc.)
  • real-backend.js → streaming + polling + prefetch logic for live detection jobs
  • inspect.js → 4-quadrant inspection panel (seg, edge, depth, 3D) with Tripo3D support
  • render.js → canvas overlays for bounding boxes and tracks
  • ui.js → panel layout, drawer tabs, command bar
  • analysis.js → track analysis and timeline rendering
  • helpers.js → viridis colormap, Sobel filter, RLE decode, utility functions

The frontend infers mode from the detector select element's data-kind attribute.

Models

Detectors (models/detectors/)

| Key | Class | Type | Batch | Notes |
| --- | --- | --- | --- | --- |
| yolo11 | Yolo11Detector | COCO closed-set | Yes (32) | Default. Tiling for large frames. |
| detr_resnet50 | DetrDetector | COCO closed-set | No | HF transformers pipeline |
| grounding_dino | GroundingDinoDetector | Open-vocabulary | No | Text-query grounded detection |
| yolov8_visdrone | YoloV8VisDroneDetector | VisDrone aerial | Yes (32) | ensure_weights() for safe parallel init |

All implement ObjectDetector.predict(frame, queries) → DetectionResult(boxes, scores, labels, label_names).

Registered in models/model_loader.py. Cached via @lru_cache for single-GPU; load_detector_on_device(name, device) for multi-GPU (uncached). Call prefetch_weights(name) before parallel GPU init to avoid download race conditions.
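The cached vs. uncached split can be sketched as follows (load_detector here is a stand-in for the real model constructors, not the actual API):

```python
from functools import lru_cache

def load_detector(name, device="cuda:0"):
    # Stand-in for building and moving a model to a device
    return f"{name}@{device}"

@lru_cache(maxsize=None)
def get_detector(name):
    """Single-GPU path: one shared, cached instance per detector name."""
    return load_detector(name)

def load_detector_on_device(name, device):
    """Multi-GPU path: a fresh, uncached instance per (name, device)."""
    return load_detector(name, device)

assert get_detector("yolo11") is get_detector("yolo11")   # cache hit
print(load_detector_on_device("yolo11", "cuda:1"))        # yolo11@cuda:1
```

The uncached path matters because each GPU worker needs its own model instance; caching would hand every worker the same object on the same device.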

Segmenters (models/segmenters/)

| Key | Detector | SAM2 Size |
| --- | --- | --- |
| GSAM2-S/B/L | Grounding DINO | small/base/large |
| YSAM2-S/B/L | YOLO11 | small/base/large |

Default: GSAM2-L. Registered in models/segmenters/model_loader.py.

Depth Estimators (models/depth_estimators/)

Single entry: key depth → DepthAnythingV2Estimator. Optional; enabled via enable_depth=True.

Adding New Detectors

  1. Create a class in models/detectors/ implementing ObjectDetector.predict() → DetectionResult
  2. If weights need downloading, add ensure_weights() classmethod for thread-safe prefetch
  3. Register in models/model_loader.py _REGISTRY
  4. Add <option> to demo/index.html #detectorSelect with appropriate data-kind
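An illustrative skeleton for steps 1-3 (the exact ObjectDetector base-class signature is an assumption; DetectionResult fields follow the interface described above):

```python
from dataclasses import dataclass

@dataclass
class DetectionResult:
    # Fields per the ObjectDetector interface above
    boxes: list
    scores: list
    labels: list
    label_names: list

class MyDetector:
    """Hypothetical detector skeleton; a real one implements ObjectDetector."""
    _weights_ready = False

    @classmethod
    def ensure_weights(cls):
        # Real implementation: thread-safe weight download (step 2)
        cls._weights_ready = True

    def predict(self, frame, queries):
        # Real implementation: run the model on `frame`, filter by `queries`
        return DetectionResult([], [], [], [])

MyDetector.ensure_weights()
print(MyDetector().predict(None, ["person"]).label_names)  # []
```

Step 3 is then a one-line entry mapping a key (e.g. "my_detector") to this class in the _REGISTRY dict of models/model_loader.py.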

Key Patterns

  • Weight downloads: Use ensure_weights() classmethod + prefetch_weights() in inference.py before ThreadPoolExecutor to avoid race conditions (see yolov8_visdrone.py)
  • Per-model locking: Each detector/depth instance gets a threading.RLock for thread-safe predict() calls in multi-GPU workers
  • Frame reordering: Writer thread uses a reorder buffer (128 frames) since GPU workers finish out-of-order
  • MJPEG streaming: publish_frame() drops frames if queue full (backpressure), consumer is event-driven at ~30fps
  • Job file layout: /tmp/detection_jobs/{job_id}/ → input.mp4, output.mp4, depth.mp4
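The frame-reordering pattern in the list above can be sketched with a heap (simplified: no 128-frame cap, no tracker or video writer):

```python
import heapq

def reorder(results):
    """Emit frames in index order as out-of-order (idx, frame) results arrive,
    like the writer thread's reorder buffer."""
    buf, next_idx, out = [], 0, []
    for idx, frame in results:
        heapq.heappush(buf, (idx, frame))
        # Flush every frame that is now contiguous with the output
        while buf and buf[0][0] == next_idx:
            out.append(heapq.heappop(buf)[1])
            next_idx += 1
    return out

print(reorder([(2, "c"), (0, "a"), (1, "b")]))  # ['a', 'b', 'c']
```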

Parallel Execution with Team Mode

When implementing features that touch independent subsystems, use team mode (parallel agents with worktree isolation) for maximum efficiency.

When to Parallelize

  • Backend (Python) + Frontend (JS) changes - always parallelizable
  • Independent API endpoints or UI components
  • Any 2+ tasks that don't modify the same files

How to Parallelize

  1. Identify independent task domains (e.g., backend vs frontend)
  2. Dispatch one agent per domain using isolation: "worktree"
  3. Each agent works in its own git worktree - no conflicts
  4. Merge results back: git checkout <worktree-branch> -- <files>
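Steps 2-4 can be demonstrated end to end in a throwaway repo (branch, directory, and file names here are illustrative; the team-mode dispatch itself is not shown):

```shell
# Self-contained demo: set up a scratch repo, then run the worktree flow
demo=$(mktemp -d) && cd "$demo"
git init -q repo && cd repo
git config user.email agent@example.com && git config user.name agent
echo "print('v1')" > inference.py && git add . && git commit -qm init

git worktree add ../backend -b agent/backend   # isolated tree for one agent
echo "print('v2')" > ../backend/inference.py   # agent edits in its own worktree
git -C ../backend commit -qam "backend change"

git checkout agent/backend -- inference.py     # merge results back (step 4)
cat inference.py                               # print('v2')
git worktree remove --force ../backend         # clean up the worktree
```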

Default to parallel when tasks are independent. Sequential only when one task's output is the other's input.

Planning & Design Documents

  • Plan and design docs (docs/plans/) are temporary working artifacts only
  • Do NOT commit them to git
  • Delete them after implementation is complete
  • Use them during planning/brainstorming, then discard