# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Simple video object detection system with three modes:

- **Object Detection**: Detect custom objects using text queries (fully functional)
- **Segmentation**: Mask overlays using SAM3
- **Drone Detection**: (Coming Soon) Specialized UAV detection

## Core Architecture

### Simple Detection Flow

```
User → demo.html → POST /detect → inference.py → detector → processed video
```

1. User selects a mode and uploads a video via the web interface
2. Frontend sends video + mode + queries to the `/detect` endpoint
3. Backend runs detection inference with the selected model
4. Returns the processed video with bounding boxes

### Available Detectors

The system includes four pre-trained object detection models:

| Detector | Key | Type | Best For |
|----------|-----|------|----------|
| **OWLv2** | `owlv2_base` | Open-vocabulary | Custom text queries (default) |
| **YOLOv8** | `hf_yolov8` | COCO classes | Fast real-time detection |
| **DETR** | `detr_resnet50` | COCO classes | Transformer-based detection |
| **Grounding DINO** | `grounding_dino` | Open-vocabulary | Text-grounded detection |

All detectors implement the `ObjectDetector` interface in `models/detectors/base.py` with a single `predict()` method.

## Development Commands

### Setup

```bash
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements.txt
```

### Running the Server

```bash
# Development
uvicorn app:app --host 0.0.0.0 --port 7860 --reload

# Production (Docker)
docker build -t object_detectors .
docker run -p 7860:7860 object_detectors
```

### Testing the API

```bash
# Test object detection
curl -X POST http://localhost:7860/detect \
  -F "video=@sample.mp4" \
  -F "mode=object_detection" \
  -F "queries=person,car,dog" \
  -F "detector=owlv2_base" \
  --output processed.mp4

# Test the placeholder mode (returns JSON)
curl -X POST http://localhost:7860/detect \
  -F "video=@sample.mp4" \
  -F "mode=drone_detection"
```

## Key Implementation Details

### API Endpoint: `/detect`

**Parameters:**

- `video` (file): Video file to process
- `mode` (string): Detection mode, one of `object_detection`, `segmentation`, or `drone_detection`
- `queries` (string): Comma-separated object classes (for `object_detection` mode)
- `detector` (string): Model key (default: `owlv2_base`)

**Returns:**

- For `object_detection`: MP4 video with bounding boxes
- For `segmentation`: MP4 video with mask overlays
- For `drone_detection`: JSON with `{"status": "coming_soon", "message": "..."}`

### Inference Pipeline

The `run_inference()` function in `inference.py` follows these steps:

1. **Extract Frames**: Decode the video using OpenCV
2. **Parse Queries**: Split comma-separated text into a list (defaults to common objects if empty)
3. **Select Detector**: Load the detector by key (cached via `@lru_cache`)
4. **Process Frames**: Run detection on each frame
   - Call `detector.predict(frame, queries)`
   - Draw green bounding boxes on detections
5. **Write Video**: Encode the processed frames back to MP4

Default queries (if none provided): `["person", "car", "truck", "motorcycle", "bicycle", "bus", "train", "airplane"]`

### Detector Loading

Detectors are registered in `models/model_loader.py`:

```python
_REGISTRY: Dict[str, Callable[[], ObjectDetector]] = {
    "owlv2_base": Owlv2Detector,
    "hf_yolov8": HuggingFaceYoloV8Detector,
    "detr_resnet50": DetrDetector,
    "grounding_dino": GroundingDinoDetector,
}
```

Loaded via `load_detector(name)`, which caches instances for performance.
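A minimal sketch of how `load_detector()` might wrap the registry with caching. Only the registry shape and the `@lru_cache` caching are from this document; the error handling and the `_DummyDetector` stand-in class are assumptions for illustration:

```python
from functools import lru_cache
from typing import Callable, Dict

# Stand-in for the real detector classes; in the repo these come from
# models/detectors/ and implement the ObjectDetector interface.
class _DummyDetector:
    def predict(self, frame, queries):
        return []

_REGISTRY: Dict[str, Callable[[], _DummyDetector]] = {
    "owlv2_base": _DummyDetector,
}

@lru_cache(maxsize=None)
def load_detector(name: str) -> _DummyDetector:
    """Instantiate a detector once and reuse the same instance afterwards."""
    try:
        factory = _REGISTRY[name]
    except KeyError:
        raise ValueError(f"Unknown detector: {name!r}. Known keys: {sorted(_REGISTRY)}")
    return factory()
```

Because of `@lru_cache`, repeated calls with the same key return the same instance, so model weights are loaded only once per process.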
### Detection Result Format

All detectors return a `DetectionResult` namedtuple:

```python
DetectionResult(
    boxes: np.ndarray,                    # Nx4 array [x1, y1, x2, y2]
    scores: Sequence[float],              # Confidence scores
    labels: Sequence[int],                # Class indices
    label_names: Optional[Sequence[str]]  # Human-readable names
)
```

## File Structure

```
.
├── app.py                    # FastAPI server with /detect endpoint
├── inference.py              # Video processing and detection pipeline
├── demo.html                 # Web interface with mode selector
├── requirements.txt          # Python dependencies
├── models/
│   ├── model_loader.py       # Detector registry and loading
│   └── detectors/
│       ├── base.py           # ObjectDetector interface
│       ├── owlv2.py          # OWLv2 implementation
│       ├── yolov8.py         # YOLOv8 implementation
│       ├── detr.py           # DETR implementation
│       └── grounding_dino.py # Grounding DINO implementation
├── utils/
│   └── video.py              # Video encoding/decoding utilities
└── coco_classes.py           # COCO dataset class definitions
```

## Adding New Detectors

To add a new detector:

1. **Create a detector class** in `models/detectors/`:

   ```python
   from .base import ObjectDetector, DetectionResult

   class MyDetector(ObjectDetector):
       name = "my_detector"

       def predict(self, frame, queries):
           # Your detection logic
           return DetectionResult(boxes, scores, labels, label_names)
   ```

2. **Register it in `model_loader.py`**:

   ```python
   _REGISTRY = {
       ...
       "my_detector": MyDetector,
   }
   ```

3. **Update the frontend** detector dropdown in `demo.html`:

   ```html
   <!-- Add alongside the existing detector options -->
   <option value="my_detector">My Detector</option>
   ```

## Adding New Detection Modes

To implement additional modes such as drone detection:

1. **Create a specialized detector** (if needed):
   - For segmentation: Extend `SegmentationResult` to include masks
   - For drone detection: Create `DroneDetector` with specialized filtering

2. **Update the `/detect` endpoint** in `app.py`:

   ```python
   if mode == "segmentation":
       # Run segmentation inference
       # Return video with masks rendered
   ```

3. **Update the frontend** to remove the "disabled" class from the mode card

4. **Update `inference.py`** if needed to handle new output types

## Common Patterns

### Query Processing

Queries are parsed from comma-separated strings:

```python
queries = [q.strip() for q in "person, car, dog".split(",") if q.strip()]
# Result: ["person", "car", "dog"]
```

### Frame Processing Loop

Standard pattern for processing video frames:

```python
processed_frames = []
for idx, frame in enumerate(frames):
    processed_frame, detections = infer_frame(frame, queries, detector_name)
    processed_frames.append(processed_frame)
```

### Temporary File Management

FastAPI's `BackgroundTasks` cleans up temp files after the response:

```python
_schedule_cleanup(background_tasks, input_path)
_schedule_cleanup(background_tasks, output_path)
```

## Performance Notes

- **Detector Caching**: Models are loaded once and cached via `@lru_cache`
- **Default Resolution**: Videos are processed at their original resolution
- **Frame Limit**: Use the `max_frames` parameter in `run_inference()` for testing
- **Memory Usage**: The entire video is loaded into memory (as a list of frames)

## Troubleshooting

### "No module named 'fastapi'"

Install dependencies: `pip install -r requirements.txt`

### "Video decoding failed"

Check video codec compatibility. The system expects MP4/H.264.

### "Detector not found"

Verify the detector key exists in `model_loader._REGISTRY`

### Slow processing

- Try a faster detector: YOLOv8 (`hf_yolov8`)
- Reduce video resolution before uploading
- Use the `max_frames` parameter for testing

## Dependencies

Core packages:

- `fastapi` + `uvicorn`: Web server
- `torch` + `transformers`: Deep learning models
- `opencv-python-headless`: Video processing
- `ultralytics`: YOLOv8 implementation
- `huggingface-hub`: Model downloading
- `pillow`, `scipy`, `accelerate`, `timm`: Supporting libraries
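The query-parsing rule described under Common Patterns (split on commas, strip whitespace, fall back to the default query list when the input is empty) can be sketched as follows. The function name `parse_queries` is an assumption, not necessarily the repository's actual helper; the default list is the one given in the Inference Pipeline section:

```python
DEFAULT_QUERIES = [
    "person", "car", "truck", "motorcycle",
    "bicycle", "bus", "train", "airplane",
]

def parse_queries(raw):
    """Split a comma-separated query string; fall back to DEFAULT_QUERIES.

    Empty input, None, or a string containing only commas/whitespace all
    yield the defaults, matching step 2 of the inference pipeline.
    """
    if not raw:
        return list(DEFAULT_QUERIES)
    queries = [q.strip() for q in raw.split(",") if q.strip()]
    return queries or list(DEFAULT_QUERIES)
```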