ISR

Zhen Ye and Claude Opus 4.6 committed 27 days ago

Commit: bfc6bae

Parent: 562781d

fix: weight download race condition + rewrite CLAUDE.md

Add ensure_weights() classmethod and prefetch_weights() to download
model weights once before parallel multi-GPU init, fixing FileNotFoundError
when 4 GPUs race to download visDrone.pt simultaneously.

Rewrite CLAUDE.md to reflect current architecture: async job pipeline,
multi-GPU inference, GSAM2 segmentation, frontend SPA modules.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Files changed (4)

CLAUDE.md +106 -214
inference.py +4 -1
models/detectors/yolov8_visdrone.py +13 -10
models/model_loader.py +7 -0

CLAUDE.md CHANGED

@@ -4,221 +4,155 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 ## Project Overview
-Simple video object detection system with three modes:
-- **Object Detection**: Detect custom objects using text queries (fully functional)
-- **Segmentation**: Mask overlays using SAM3
-- **Drone Detection**: (Coming Soon) Specialized UAV detection
-## Core Architecture
-### Simple Detection Flow
-```
-User → demo.html → POST /detect → inference.py → detector → processed video
-```
-1. User selects mode and uploads video via web interface
-2. Frontend sends video + mode + queries to `/detect` endpoint
-3. Backend runs detection inference with selected model
-4. Returns processed video with bounding boxes
-### Available Detectors
-The system includes 4 pre-trained object detection models:
-| Detector | Key | Type | Best For |
-|----------|-----|------|----------|
-| **OWLv2** | `owlv2_base` | Open-vocabulary | Custom text queries (default) |
-| **YOLOv8** | `hf_yolov8` | COCO classes | Fast real-time detection |
-| **DETR** | `detr_resnet50` | COCO classes | Transformer-based detection |
-| **Grounding DINO** | `grounding_dino` | Open-vocabulary | Text-grounded detection |
-All detectors implement the `ObjectDetector` interface in `models/detectors/base.py` with a single `predict()` method.
 ## Development Commands
-### Setup
 ```bash
-python -m venv .venv
-source .venv/bin/activate  # or `.venv/bin/activate` on macOS/Linux
 pip install -r requirements.txt
-```
-### Running the Server
-```bash
-# Development
 uvicorn app:app --host 0.0.0.0 --port 7860 --reload
-# Production (Docker)
-docker build -t object_detectors .
-docker run -p 7860:7860 object_detectors
-```
-### Testing the API
-```bash
-# Test object detection
-curl -X POST http://localhost:7860/detect \
-  -F "video=@sample.mp4" \
-  -F "mode=object_detection" \
-  -F "queries=person,car,dog" \
-  -F "detector=owlv2_base" \
-  --output processed.mp4
-# Test placeholder modes (returns JSON)
-curl -X POST http://localhost:7860/detect \
   -F "video=@sample.mp4" \
-  -F "mode=segmentation"
 ```
-## Key Implementation Details
-### API Endpoint: `/detect`
-**Parameters:**
-- `video` (file): Video file to process
-- `mode` (string): Detection mode - `object_detection`, `segmentation`, or `drone_detection`
-- `queries` (string): Comma-separated object classes (for object_detection mode)
-- `detector` (string): Model key (default: `owlv2_base`)
-**Returns:**
-- For `object_detection`: MP4 video with bounding boxes
-- For `segmentation`: MP4 video with mask overlays
-- For `drone_detection`: JSON with `{"status": "coming_soon", "message": "..."}`
-### Inference Pipeline
-The `run_inference()` function in `inference.py` follows these steps:
-1. **Extract Frames**: Decode video using OpenCV
-2. **Parse Queries**: Split comma-separated text into list (defaults to common objects if empty)
-3. **Select Detector**: Load detector by key (cached via `@lru_cache`)
-4. **Process Frames**: Run detection on each frame
-   - Call `detector.predict(frame, queries)`
-   - Draw green bounding boxes on detections
-5. **Write Video**: Encode processed frames back to MP4
-Default queries (if none provided): `["person", "car", "truck", "motorcycle", "bicycle", "bus", "train", "airplane"]`
-### Detector Loading
-Detectors are registered in `models/model_loader.py`:
-```python
-_REGISTRY: Dict[str, Callable[[], ObjectDetector]] = {
-    "owlv2_base": Owlv2Detector,
-    "hf_yolov8": HuggingFaceYoloV8Detector,
-    "detr_resnet50": DetrDetector,
-    "grounding_dino": GroundingDinoDetector,
-}
-```
-Loaded via `load_detector(name)` which caches instances for performance.
-### Detection Result Format
-All detectors return a `DetectionResult` namedtuple:
-```python
-DetectionResult(
-    boxes: np.ndarray,        # Nx4 array [x1, y1, x2, y2]
-    scores: Sequence[float],  # Confidence scores
-    labels: Sequence[int],    # Class indices
-    label_names: Optional[Sequence[str]]  # Human-readable names
-)
-```
-## File Structure
-```
-.
-├── app.py                    # FastAPI server with /detect endpoint
-├── inference.py              # Video processing and detection pipeline
-├── demo.html                 # Web interface with mode selector
-├── requirements.txt          # Python dependencies
-├── models/
-│   ├── model_loader.py      # Detector registry and loading
-│   └── detectors/
-│       ├── base.py          # ObjectDetector interface
-│       ├── owlv2.py         # OWLv2 implementation
-│       ├── yolov8.py        # YOLOv8 implementation
-│       ├── detr.py          # DETR implementation
-│       └── grounding_dino.py # Grounding DINO implementation
-├── utils/
-│   └── video.py             # Video encoding/decoding utilities
-└── coco_classes.py          # COCO dataset class definitions
-```
-## Adding New Detectors
-To add a new detector:
-1. **Create detector class** in `models/detectors/`:
-   ```python
-   from .base import ObjectDetector, DetectionResult
-   class MyDetector(ObjectDetector):
-       name = "my_detector"
-       def predict(self, frame, queries):
-           # Your detection logic
-           return DetectionResult(boxes, scores, labels, label_names)
-   ```
-2. **Register in model_loader.py**:
-   ```python
-   _REGISTRY = {
-       ...
-       "my_detector": MyDetector,
-   }
-   ```
-3. **Update frontend** `demo.html` detector dropdown:
-   ```html
-   <option value="my_detector">My Detector</option>
-   ```
-## Adding New Detection Modes
-To implement additional modes such as drone detection:
-1. **Create specialized detector** (if needed):
-   - For segmentation: Extend `SegmentationResult` to include masks
-   - For drone detection: Create `DroneDetector` with specialized filtering
-2. **Update `/detect` endpoint** in `app.py`:
-   ```python
-   if mode == "segmentation":
-       # Run segmentation inference
-       # Return video with masks rendered
-   ```
-3. **Update frontend** to remove "disabled" class from mode card
-4. **Update inference.py** if needed to handle new output types
-## Common Patterns
-### Query Processing
-Queries are parsed from comma-separated strings:
-```python
-queries = [q.strip() for q in "person, car, dog".split(",") if q.strip()]
-# Result: ["person", "car", "dog"]
-```
-### Frame Processing Loop
-Standard pattern for processing video frames:
-```python
-processed_frames = []
-for idx, frame in enumerate(frames):
-    processed_frame, detections = infer_frame(frame, queries, detector_name)
-    processed_frames.append(processed_frame)
-```
-### Temporary File Management
-FastAPI's `BackgroundTasks` cleans up temp files after response:
-```python
-_schedule_cleanup(background_tasks, input_path)
-_schedule_cleanup(background_tasks, output_path)
-```
 ## Parallel Execution with Team Mode
@@ -227,7 +161,6 @@ When implementing features that touch independent subsystems, **use team mode (p
 ### When to Parallelize
 - Backend (Python) + Frontend (JS) changes — always parallelizable
 - Independent API endpoints or UI components
-- Test writing + implementation when in different files
 - Any 2+ tasks that don't modify the same files
 ### How to Parallelize
@@ -235,46 +168,5 @@ When implementing features that touch independent subsystems, **use team mode (p
 2. Dispatch one agent per domain using `isolation: "worktree"`
 3. Each agent works in its own git worktree — no conflicts
 4. Merge results back: `git checkout <worktree-branch> -- <files>`
-5. Clean up worktrees after merge
-### Example
-```
-Agent 1 (worktree): Backend — app.py, jobs/storage.py
-Agent 2 (worktree): Frontend — timeline.js, client.js
-→ Both run simultaneously, merge when done
-```
 **Default to parallel** when tasks are independent. Sequential only when one task's output is the other's input.
-## Performance Notes
-- **Detector Caching**: Models are loaded once and cached via `@lru_cache`
-- **Default Resolution**: Videos processed at original resolution
-- **Frame Limit**: Use `max_frames` parameter in `run_inference()` for testing
-- **Memory Usage**: Entire video is loaded into memory (frames list)
-## Troubleshooting
-### "No module named 'fastapi'"
-Install dependencies: `pip install -r requirements.txt`
-### "Video decoding failed"
-Check video codec compatibility. System expects MP4/H.264.
-### "Detector not found"
-Verify detector key exists in `model_loader._REGISTRY`
-### Slow processing
-- Try faster detector: YOLOv8 (`hf_yolov8`)
-- Reduce video resolution before uploading
-- Use `max_frames` parameter for testing
-## Dependencies
-Core packages:
-- `fastapi` + `uvicorn`: Web server
-- `torch` + `transformers`: Deep learning models
-- `opencv-python-headless`: Video processing
-- `ultralytics`: YOLOv8 implementation
-- `huggingface-hub`: Model downloading
-- `pillow`, `scipy`, `accelerate`, `timm`: Supporting libraries

 ## Project Overview
+Multi-GPU video analysis platform with three fully functional modes:
+- **Object Detection**: Bounding boxes via YOLO11, DETR, or Grounding DINO
+- **Segmentation**: Mask overlays via Grounded SAM2 (GSAM2) or YOLO+SAM2 (YSAM2)
+- **Drone Detection**: Aerial object detection via YOLOv8 fine-tuned on VisDrone
+Deployed as a HuggingFace Space (Docker SDK) at `https://biaslab2025-isr.hf.space`.
 ## Development Commands
 ```bash
+# Setup
+python -m venv .venv && source .venv/bin/activate
 pip install -r requirements.txt
+# Run dev server
 uvicorn app:app --host 0.0.0.0 --port 7860 --reload
+# Verify imports (quick smoke test — no tests exist yet)
+python -c "from app import app"
+# Docker
+docker build -t isr . && docker run -p 7860:7860 isr
+# Test async detection
+curl -X POST http://localhost:7860/detect/async \
   -F "video=@sample.mp4" \
+  -F "mode=object_detection" \
+  -F "queries=person,car" \
+  -F "detector=yolo11"
 ```
+## Core Architecture
+### Async Detection Flow (primary path)
+```
+Frontend (index.html) → POST /detect/async → background task → MJPEG stream + polling
+```
+1. Frontend uploads video + mode + queries to `/detect/async`
+2. Backend creates a `JobInfo`, spawns `process_video_async()` as an `asyncio.Task`
+3. `inference.py` runs multi-GPU parallel inference, publishing frames to an MJPEG stream
+4. Frontend consumes `/detect/stream/{job_id}` for live video, polls `/detect/status/{job_id}`
+5. On completion, frontend fetches final video from `/detect/video/{job_id}`
+### API Endpoints (app.py)
+| Method | Path | Purpose |
+|--------|------|---------|
+| POST | `/detect/async` | Start async job (returns `job_id` + stream/status URLs) |
+| GET | `/detect/status/{job_id}` | Poll job status |
+| GET | `/detect/stream/{job_id}` | MJPEG live stream (event-driven, 640px wide) |
+| GET | `/detect/video/{job_id}` | Download processed MP4 |
+| GET | `/detect/depth-video/{job_id}` | Download depth video |
+| GET | `/detect/tracks/{job_id}/summary` | Per-frame detection counts (timeline heatmap) |
+| GET | `/detect/tracks/{job_id}/{frame_idx}` | Per-frame track data |
+| DELETE | `/detect/job/{job_id}` | Cancel running job |
+| POST | `/detect` | Synchronous detection (returns MP4 directly) |
+| POST | `/benchmark` | GSAM2 latency breakdown |
+| POST | `/benchmark/profile` | Per-frame timing breakdown |
+| POST | `/benchmark/analysis` | Full roofline analysis |
+**`/detect/async` params:** `video`, `mode` (object_detection/segmentation/drone_detection), `queries`, `detector` (default: yolo11), `segmenter` (default: GSAM2-L), `enable_depth` (default: false), `step` (default: 7, segmentation keyframe interval).
+### Multi-GPU Inference Pipeline (inference.py)
+**`run_inference()`** — Detection and drone modes:
+- `AsyncVideoReader` prefetches frames into a queue (up to 32 frames)
+- Models loaded in parallel via `ThreadPoolExecutor` (one detector per GPU)
+- Queue-based producer/consumer: main thread feeds `queue_in`, N GPU workers drain it
+- Workers batch frames (up to `max_batch_size=32` for YOLO) under per-model `RLock`
+- Writer thread reorders frames, runs `ByteTracker` + `SpeedEstimator`, writes via `StreamingVideoWriter`, publishes to MJPEG stream
+- Cancellation: workers poll `_check_cancellation(job_id)` each cycle
+**`run_grounded_sam2_tracking()`** — Segmentation mode:
+- Extracts all frames to JPEG files on disk
+- Runs detection on keyframes (every `step` frames) to seed SAM2
+- SAM2 video predictor propagates masks between keyframes
+- ID reconciliation via IoU matching in `MaskDictionary`
+- Renders colored semi-transparent mask overlays with contours
+### Jobs System (jobs/)
+- **`models.py`** — `JobInfo` dataclass + `JobStatus` enum (PROCESSING/COMPLETED/FAILED/CANCELLED)
+- **`storage.py`** — In-memory `JobStorage` (singleton, `RLock`-protected) + disk at `/tmp/detection_jobs/{job_id}/`. Per-frame track data stored here. Auto-cleanup every 10 min (1hr expiry).
+- **`background.py`** — `process_video_async()` coroutine dispatches to the right inference function
+- **`streaming.py`** — MJPEG frame queue + `asyncio.Event` publisher; `publish_frame()` resizes to 640px
+### Frontend (frontend/)
+Single-page app served at `/app`. No build step. Uses `window.APP` global namespace.
+**Script modules (load order matters):**
+- `init.js` → bootstraps `window.APP` namespace
+- `core/config.js` → backend URL, tracking constants
+- `core/state.js` → all client state (video, job, tracks, UI)
+- `core/video.js` → video load/unload, blob lifecycle, depth toggle
+- `core/tracker.js` → client-side IoU + velocity tracker
+- `core/timeline.js` → canvas heatmap timeline bar
+- `api/client.js` → `hfDetectAsync()`, `pollAsyncJob()`, `cancelBackendJob()`
+- `ui/overlays.js` → canvas bounding box rendering
+- `ui/cards.js` → live track card panel
+- `ui/logging.js` → system log, status indicators
+- `main.js` → event wiring, app entry point
+The frontend infers `mode` from `data-kind` attribute on the `<select id="detectorSelect">` options.
+## Models
+### Detectors (models/detectors/)
+| Key | Class | Type | Batch | Notes |
+|-----|-------|------|-------|-------|
+| `yolo11` | `Yolo11Detector` | COCO closed-set | Yes (32) | Default. Tiling for large frames. |
+| `detr_resnet50` | `DetrDetector` | COCO closed-set | No | HF transformers pipeline |
+| `grounding_dino` | `GroundingDinoDetector` | Open-vocabulary | No | Text-query grounded detection |
+| `yolov8_visdrone` | `YoloV8VisDroneDetector` | VisDrone aerial | Yes (32) | `ensure_weights()` for safe parallel init |
+All implement `ObjectDetector.predict(frame, queries) → DetectionResult(boxes, scores, labels, label_names)`.
+Registered in `models/model_loader.py`. Cached via `@lru_cache` for single-GPU; `load_detector_on_device(name, device)` for multi-GPU (uncached). Call `prefetch_weights(name)` before parallel GPU init to avoid download race conditions.
+### Segmenters (models/segmenters/)
+| Key | Detector | SAM2 Size |
+|-----|----------|-----------|
+| `GSAM2-S/B/L` | Grounding DINO | small/base/large |
+| `YSAM2-S/B/L` | YOLO11 | small/base/large |
+Default: `GSAM2-L`. Registered in `models/segmenters/model_loader.py`.
+### Depth Estimators (models/depth_estimators/)
+Single entry: key `depth` → `DepthAnythingV2Estimator`. Optional, enabled via `enable_depth=True`.
+## Adding New Detectors
+1. Create class in `models/detectors/` implementing `ObjectDetector.predict()` → `DetectionResult`
+2. If weights need downloading, add `ensure_weights()` classmethod for thread-safe prefetch
+3. Register in `models/model_loader.py` `_REGISTRY`
+4. Add `<option>` to `frontend/index.html` `#detectorSelect` with appropriate `data-kind`
+## Key Patterns
+- **Weight downloads**: Use `ensure_weights()` classmethod + `prefetch_weights()` in inference.py before `ThreadPoolExecutor` to avoid race conditions (see `yolov8_visdrone.py`)
+- **Per-model locking**: Each detector/depth instance gets a `threading.RLock` for thread-safe `predict()` calls in multi-GPU workers
+- **Frame reordering**: Writer thread uses a reorder buffer (128 frames) since GPU workers finish out-of-order
+- **MJPEG streaming**: `publish_frame()` drops frames if queue full (backpressure), consumer is event-driven at ~30fps
+- **Job file layout**: `/tmp/detection_jobs/{job_id}/` → `input.mp4`, `output.mp4`, `depth.mp4`
 ## Parallel Execution with Team Mode
 ### When to Parallelize
 - Backend (Python) + Frontend (JS) changes — always parallelizable
 - Independent API endpoints or UI components
 - Any 2+ tasks that don't modify the same files
 ### How to Parallelize
 2. Dispatch one agent per domain using `isolation: "worktree"`
 3. Each agent works in its own git worktree — no conflicts
 4. Merge results back: `git checkout <worktree-branch> -- <files>`
 **Default to parallel** when tasks are independent. Sequential only when one task's output is the other's input.
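
The Key Patterns section of the new CLAUDE.md says the writer thread uses a reorder buffer (128 frames) because GPU workers finish out of order. A minimal sketch of that idea, assuming results arrive as `(frame_idx, frame)` pairs. `ReorderBuffer`, its method names, and the skip-ahead policy on overflow are hypothetical and not taken from inference.py:

```python
from typing import Any, Dict, List, Tuple


class ReorderBuffer:
    """Hypothetical helper: releases frames in index order."""

    def __init__(self, capacity: int = 128) -> None:
        self.capacity = capacity
        self.pending: Dict[int, Any] = {}
        self.next_idx = 0

    def push(self, idx: int, frame: Any) -> List[Tuple[int, Any]]:
        """Store a frame and return every frame that is now in order."""
        self.pending[idx] = frame
        # Assumed policy: if the buffer overflows while waiting for a
        # missing frame, skip ahead to the oldest frame we do have.
        if len(self.pending) > self.capacity and self.next_idx not in self.pending:
            self.next_idx = min(self.pending)
        ready: List[Tuple[int, Any]] = []
        while self.next_idx in self.pending:
            ready.append((self.next_idx, self.pending.pop(self.next_idx)))
            self.next_idx += 1
        return ready


# Frames arrive out of order, as they might from several GPU workers
buf = ReorderBuffer()
ordered: List[int] = []
for idx in [2, 0, 1, 4, 3]:
    ordered.extend(i for i, _ in buf.push(idx, f"frame-{idx}"))
```

After the loop, `ordered` holds the indices in ascending order even though they were pushed out of order. What the real writer does when the buffer fills is not stated in the diff, so the overflow branch is only one possible choice.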

inference.py CHANGED

@@ -646,7 +646,10 @@ def run_inference(
     if num_gpus > 0:
         logging.info("Detected %d GPUs. Loading models in parallel...", num_gpus)
+        # Pre-download weights before parallel GPU init to avoid race conditions
+        from models.model_loader import prefetch_weights
+        prefetch_weights(active_detector)
         def load_models_on_gpu(gpu_id: int):
             device_str = f"cuda:{gpu_id}"
             try:
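
The ordering is the point of this hunk: the download runs once on the calling thread, and only then do the per-GPU loaders start. A small simulation of that ordering, using a temp file in place of visDrone.pt. `fake_download` and `load_on_gpu` are hypothetical stand-ins, not functions from this repo:

```python
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

weights_path = Path(tempfile.mkdtemp()) / "weights.pt"
download_count = 0
count_lock = threading.Lock()


def fake_download() -> None:
    """Stand-in for the real download: writes the file and counts calls."""
    global download_count
    with count_lock:
        download_count += 1
    weights_path.write_bytes(b"weights")


def ensure_weights() -> None:
    """Download only if the file is not already on disk."""
    if not weights_path.exists():
        fake_download()


def load_on_gpu(gpu_id: int) -> str:
    ensure_weights()  # no-op here because the file already exists
    return f"cuda:{gpu_id}:{weights_path.read_bytes().decode()}"


ensure_weights()  # prefetch once, before any worker starts
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(load_on_gpu, range(4)))
```

Because the prefetch completes before the pool is created, every worker sees the file and `download_count` stays at 1.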

models/detectors/yolov8_visdrone.py CHANGED

@@ -23,6 +23,17 @@ class YoloV8VisDroneDetector(ObjectDetector):
     supports_batch = True
     max_batch_size = 32
+    @classmethod
+    def ensure_weights(cls):
+        """Download weights once (call before parallel GPU init)."""
+        if not _VISDRONE_PATH.exists():
+            logging.info("Downloading visDrone.pt to %s ...", _VISDRONE_PATH)
+            hf_hub_download(
+                repo_id=cls.REPO_ID,
+                filename="visDrone.pt",
+                local_dir=str(_WEIGHTS_CACHE),
+            )
     def __init__(self, score_threshold: float = 0.3, device: str = None) -> None:
         self.name = "yolov8_visdrone"
         self.score_threshold = score_threshold
@@ -31,17 +42,9 @@ class YoloV8VisDroneDetector(ObjectDetector):
         else:
             self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
         logging.info(
-            "Loading YOLOv8-VisDrone from HuggingFace Hub: %s onto %s",
-            self.REPO_ID,
-            self.device,
+            "Loading YOLOv8-VisDrone onto %s", self.device,
         )
-        if not _VISDRONE_PATH.exists():
-            logging.info("Downloading visDrone.pt to %s ...", _VISDRONE_PATH)
-            hf_hub_download(
-                repo_id=self.REPO_ID,
-                filename="visDrone.pt",
-                local_dir=str(_WEIGHTS_CACHE),
-            )
+        self.ensure_weights()
         self.model = YOLO(str(_VISDRONE_PATH))
         self.model.to(self.device)
         self.class_names = self.model.names
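
`ensure_weights()` here is still a check-then-download, so it relies on being called once before the worker threads start. If it could also be reached concurrently (for instance from `__init__` when the prefetch was skipped), one option is to guard it with a class-level lock. A sketch under that assumption. `DemoDetector`, `_download_lock`, and `_fetch` are hypothetical and not part of this commit:

```python
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


class DemoDetector:
    """Hypothetical detector showing a lock-guarded ensure_weights()."""

    _weights_path = Path(tempfile.mkdtemp()) / "demo.pt"
    _download_lock = threading.Lock()
    fetch_calls = 0

    @classmethod
    def _fetch(cls) -> None:
        # Stand-in for the real download call
        cls.fetch_calls += 1
        cls._weights_path.write_bytes(b"demo")

    @classmethod
    def ensure_weights(cls) -> None:
        with cls._download_lock:
            # Check inside the lock so only the first caller downloads
            if not cls._weights_path.exists():
                cls._fetch()

    def __init__(self) -> None:
        self.ensure_weights()
        self.weights = self._weights_path.read_bytes()


# Four threads construct detectors at once, with no prefetch beforehand
with ThreadPoolExecutor(max_workers=4) as pool:
    detectors = list(pool.map(lambda _: DemoDetector(), range(4)))
```

A `threading.Lock` only serializes threads within one process. Separate processes sharing the same cache directory would need a file lock or similar instead.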

models/model_loader.py CHANGED

@@ -39,6 +39,13 @@ def load_detector(name: Optional[str] = None) -> ObjectDetector:
     return _get_cached_detector(detector_name)
+def prefetch_weights(name: str) -> None:
+    """Pre-download model weights (call before parallel GPU init)."""
+    factory = _REGISTRY.get(name)
+    if factory and hasattr(factory, "ensure_weights"):
+        factory.ensure_weights()
 def load_detector_on_device(name: str, device: str) -> ObjectDetector:
     """Create a new detector instance on the specified device (no caching)."""
     return _create_detector(name, device=device)
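
`prefetch_weights()` does nothing for registry entries that lack `ensure_weights()`, and nothing for unknown names. A self-contained sketch of that dispatch against a toy registry. The two classes are hypothetical stand-ins for real detectors, and the toy `_REGISTRY` only mirrors the shape implied by the diff:

```python
from typing import Callable, Dict, List

calls: List[str] = []


class WithWeights:
    """Hypothetical detector that needs a download step."""

    @classmethod
    def ensure_weights(cls) -> None:
        calls.append("WithWeights")


class WithoutWeights:
    """Hypothetical detector with nothing to prefetch."""


_REGISTRY: Dict[str, Callable[[], object]] = {
    "with_weights": WithWeights,
    "without_weights": WithoutWeights,
}


def prefetch_weights(name: str) -> None:
    """Same shape as the function added in this commit."""
    factory = _REGISTRY.get(name)
    if factory and hasattr(factory, "ensure_weights"):
        factory.ensure_weights()


# Only the first key triggers a download hook; the others are skipped
for key in ("with_weights", "without_weights", "missing"):
    prefetch_weights(key)
```

The `hasattr` check only finds the hook when the registry value is the class itself. If an entry were a lambda or `functools.partial` wrapping the class, the prefetch would be skipped silently.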