# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
Reusable video analysis base combining object detection, segmentation, depth estimation, and multi-object tracking. Deployed as a Hugging Face Space (Docker SDK). Designed for multi-GPU inference with async job processing and live MJPEG streaming.
## Development Commands

```bash
# Setup
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# Run dev server
uvicorn app:app --host 0.0.0.0 --port 7860 --reload

# Docker (production / HF Spaces)
docker build -t detection_base . && docker run -p 7860:7860 detection_base

# Test async detection
curl -X POST http://localhost:7860/detect/async \
  -F "video=@sample.mp4" \
  -F "mode=object_detection" \
  -F "queries=person,car" \
  -F "detector=yolo11"
```
No test suite exists. Verify changes by running the server and testing through the UI at http://localhost:7860.
## Architecture

### Request Flow

```
index.html → POST /detect/async → app.py
  ├─ process_first_frame()        # Fast preview (~1-2s)
  ├─ Return job_id + URLs immediately
  └─ Background: process_video_async()
       ├─ run_inference()               # Detection mode
       └─ run_grounded_sam2_tracking()  # Segmentation mode
```
The async pipeline returns instantly with a `job_id`. The frontend polls `/detect/status/{job_id}` and streams live frames via `/detect/stream/{job_id}` (MJPEG).
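A minimal client-side polling loop for this flow might look like the sketch below. The terminal state names come from the `JobStatus` enum described later in this file; the exact shape of the status payload (a JSON object with a `status` field) is an assumption.

```python
# Hedged sketch of a client for the async detection API.
# Assumption: /detect/status/{job_id} returns JSON with a "status" field.
import json
import time
import urllib.request

TERMINAL_STATES = {"completed", "failed", "cancelled"}

def is_terminal(status: str) -> bool:
    """A job is finished once it leaves the PROCESSING state."""
    return status.lower() in TERMINAL_STATES

def poll_job(base_url: str, job_id: str, interval: float = 1.0) -> dict:
    """Poll /detect/status/{job_id} until the job reaches a terminal state."""
    while True:
        with urllib.request.urlopen(f"{base_url}/detect/status/{job_id}") as resp:
            info = json.load(resp)
        if is_terminal(info["status"]):
            return info
        time.sleep(interval)
```

A real client would also open `/detect/stream/{job_id}` (e.g. in an `<img>` tag) while polling, since the MJPEG stream is independent of the status endpoint.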
### API Endpoints (app.py)

- **Core:** `POST /detect` (sync), `POST /detect/async` (async with streaming)
- **Job management:** `GET /detect/status/{job_id}`, `DELETE /detect/job/{job_id}`, `GET /detect/video/{job_id}`, `GET /detect/stream/{job_id}`
- **Per-frame data:** `GET /detect/tracks/{job_id}/{frame_idx}`, `GET /detect/first-frame/{job_id}`, `GET /detect/first-frame-depth/{job_id}`, `GET /detect/depth-video/{job_id}`
- **Benchmarking:** `POST /benchmark`, `POST /benchmark/profile`, `POST /benchmark/analysis`, `GET /gpu-monitor`, `GET /benchmark/hardware`
### Model Registries

All models use a registry + factory pattern with `@lru_cache` for singleton loading. Use `load_*_on_device(name, device)` for multi-GPU (no cache).
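The registry + factory pattern with cached vs. uncached loaders can be sketched roughly as below. The names `register`, `ToyDetector`, `load_model`, and `load_model_on_device` are illustrative, not the actual contents of `models/model_loader.py`.

```python
# Illustrative registry + factory pattern (names are assumptions).
from functools import lru_cache

_REGISTRY: dict[str, type] = {}

def register(key: str):
    """Decorator that maps a string key to a model class."""
    def deco(cls):
        _REGISTRY[key] = cls
        return cls
    return deco

@register("toy")
class ToyDetector:
    def __init__(self, device: str = "cpu"):
        self.device = device

@lru_cache(maxsize=None)
def load_model(name: str):
    """Singleton load: repeated calls return the same cached instance."""
    return _REGISTRY[name]()

def load_model_on_device(name: str, device: str):
    """Uncached variant for multi-GPU: a fresh instance per device."""
    return _REGISTRY[name](device=device)
```

The cached loader is safe for single-GPU serving, while the uncached per-device loader avoids sharing one model instance (and its CUDA context) across GPUs.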
**Detectors** (`models/model_loader.py`):

| Key | Model | Vocabulary |
|---|---|---|
| `yolo11` (default) | YOLO11m | COCO classes only |
| `detr_resnet50` | DETR | COCO classes only |
| `grounding_dino` | Grounding DINO | Open-vocabulary (arbitrary text) |
| `drone_yolo` | Drone YOLO | Specialized UAV detection |
All implement `ObjectDetector.predict(frame, queries)` → `DetectionResult(boxes, scores, labels, label_names)` from `models/detectors/base.py`.
**Segmenters** (`models/segmenters/model_loader.py`):

- `GSAM2-S/B/L` → Grounded SAM2 (small/base/large) backed by `grounding_dino`
- `YSAM2-S/B/L` → YOLO-SAM2 (small/base/large) backed by `yolo11`
**Depth** (`models/depth_estimators/model_loader.py`):

- `depth` → DepthAnythingV2
### Inference Pipeline (inference.py)

Three public entry points:

- `process_first_frame()` → Extract + detect on frame 0 only. Returns the processed frame + detections.
- `run_inference()` → Full detection pipeline. Multi-GPU data parallelism with worker threads per GPU, a reorder buffer for out-of-order completion, ByteTracker for object tracking, optional depth.
- `run_grounded_sam2_tracking()` → SAM2 segmentation with temporal coherence. Uses `SharedFrameStore` (in-memory decoded frames, 12 GiB budget) or falls back to JPEG extraction. The `step` parameter controls the keyframe interval.
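The reorder buffer mentioned for `run_inference()` can be illustrated with a minimal sketch (not the actual `inference.py` implementation): per-GPU workers complete frames out of order, but results must be emitted strictly in frame order for tracking and video encoding.

```python
# Minimal reorder-buffer sketch (illustrative, single-threaded).
class ReorderBuffer:
    def __init__(self):
        self._pending = {}   # frame_idx -> result, for frames arrived early
        self._next = 0       # next frame index to emit

    def push(self, idx, result):
        """Accept a completed frame; yield any now-contiguous results in order."""
        self._pending[idx] = result
        while self._next in self._pending:
            yield self._pending.pop(self._next)
            self._next += 1
```

Frame 1 finishing before frame 0 is simply held until frame 0 arrives, at which point both are released in order.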
### Async Job System (jobs/)

- `jobs/models.py` → `JobInfo` dataclass, `JobStatus` enum (PROCESSING/COMPLETED/FAILED/CANCELLED)
- `jobs/storage.py` → Thread-safe in-memory storage at `/tmp/detection_jobs/{job_id}/`. Auto-cleanup every 10 minutes.
- `jobs/background.py` → `process_video_async()` dispatches to the correct inference function and updates job status.
- `jobs/streaming.py` → Event-driven MJPEG frame publishing. Non-blocking (drops frames if the consumer is slow). Frames are pre-resized to 640px width.
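A hedged sketch of the job data model, inferring field names from the description above (field names beyond `status` are assumptions, not copied from `jobs/models.py`):

```python
# Sketch of the job data model; only JobStatus members are from the doc,
# the JobInfo fields beyond status are assumptions.
import enum
from dataclasses import dataclass
from typing import Optional

class JobStatus(enum.Enum):
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"

@dataclass
class JobInfo:
    job_id: str
    status: JobStatus = JobStatus.PROCESSING
    progress: float = 0.0          # assumed field, 0.0-1.0
    error: Optional[str] = None    # assumed field, set on FAILED
```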
### Concurrency Model

- Per-model `RLock` for GPU serialization (`inference.py:_get_model_lock`)
- Multi-GPU workers use separate model instances per device
- `AsyncVideoReader` prefetches frames in a background thread to prevent GPU starvation
### Frontend (index.html)
Single HTML page with vanilla JS. Upload video, pick mode/model, view first frame, live MJPEG stream, download processed video, inspect detection JSON.
## Adding a New Detector

1. Create a class in `models/detectors/` implementing `ObjectDetector` from `base.py`
2. Register it in the `_REGISTRY` in `models/model_loader.py`
3. Add an option to the detector dropdown in `index.html`
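Step 1 might look like the stub below. The `DetectionResult` field names come from this file's description of `models/detectors/base.py`; the exact base-class signatures are assumptions, and `MyDetector` is a hypothetical name.

```python
# Hypothetical new detector stub. In the real repo this would subclass
# ObjectDetector from models/detectors/base.py; a local stand-in for
# DetectionResult is defined here so the sketch is self-contained.
from dataclasses import dataclass, field

@dataclass
class DetectionResult:  # fields as named in this doc
    boxes: list = field(default_factory=list)
    scores: list = field(default_factory=list)
    labels: list = field(default_factory=list)
    label_names: list = field(default_factory=list)

class MyDetector:
    """Skeleton for the ObjectDetector interface described above."""
    def predict(self, frame, queries):
        # Run the underlying model here; this stub returns no detections.
        return DetectionResult()
```

After implementing `predict`, register the class under a new key in `_REGISTRY` so the frontend dropdown value maps to it.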
## Dual Remotes

- `hf` → Hugging Face Space (deployment)
- `github` → GitHub (version control)