# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Simple video object detection system with three modes:

- **Object Detection**: Detect custom objects using text queries (fully functional)
- **Segmentation**: Mask overlays using SAM3
- **Drone Detection**: (Coming Soon) Specialized UAV detection
## Core Architecture

### Simple Detection Flow

```
User → demo.html → POST /detect → inference.py → detector → processed video
```

1. User selects mode and uploads video via web interface
2. Frontend sends video + mode + queries to `/detect` endpoint
3. Backend runs detection inference with selected model
4. Returns processed video with bounding boxes
### Available Detectors

The system includes 4 pre-trained object detection models:

| Detector | Key | Type | Best For |
|----------|-----|------|----------|
| **OWLv2** | `owlv2_base` | Open-vocabulary | Custom text queries (default) |
| **YOLOv8** | `hf_yolov8` | COCO classes | Fast real-time detection |
| **DETR** | `detr_resnet50` | COCO classes | Transformer-based detection |
| **Grounding DINO** | `grounding_dino` | Open-vocabulary | Text-grounded detection |

All detectors implement the `ObjectDetector` interface in `models/detectors/base.py` with a single `predict()` method.
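A minimal sketch of what that interface could look like, based only on the names and fields described in this document (the exact signatures in `models/detectors/base.py` may differ):

```python
from abc import ABC, abstractmethod
from typing import NamedTuple, Optional, Sequence

import numpy as np

class DetectionResult(NamedTuple):
    boxes: np.ndarray                            # Nx4 array [x1, y1, x2, y2]
    scores: Sequence[float]                      # Confidence scores
    labels: Sequence[int]                        # Class indices
    label_names: Optional[Sequence[str]] = None  # Human-readable names

class ObjectDetector(ABC):
    """Minimal contract every detector implements."""
    name: str = "base"

    @abstractmethod
    def predict(self, frame: np.ndarray, queries: Sequence[str]) -> DetectionResult:
        """Run detection on a single frame and return boxes/scores/labels."""
```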
## Development Commands

### Setup

```bash
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements.txt
```
### Running the Server

```bash
# Development
uvicorn app:app --host 0.0.0.0 --port 7860 --reload

# Production (Docker)
docker build -t object_detectors .
docker run -p 7860:7860 object_detectors
```
### Testing the API

```bash
# Test object detection
curl -X POST http://localhost:7860/detect \
  -F "video=@sample.mp4" \
  -F "mode=object_detection" \
  -F "queries=person,car,dog" \
  -F "detector=owlv2_base" \
  --output processed.mp4

# Test the placeholder mode (returns JSON)
curl -X POST http://localhost:7860/detect \
  -F "video=@sample.mp4" \
  -F "mode=drone_detection"
```
## Key Implementation Details

### API Endpoint: `/detect`

**Parameters:**

- `video` (file): Video file to process
- `mode` (string): Detection mode - `object_detection`, `segmentation`, or `drone_detection`
- `queries` (string): Comma-separated object classes (for `object_detection` mode)
- `detector` (string): Model key (default: `owlv2_base`)

**Returns:**

- For `object_detection`: MP4 video with bounding boxes
- For `segmentation`: MP4 video with mask overlays
- For `drone_detection`: JSON with `{"status": "coming_soon", "message": "..."}`
### Inference Pipeline

The `run_inference()` function in `inference.py` follows these steps:

1. **Extract Frames**: Decode video using OpenCV
2. **Parse Queries**: Split comma-separated text into a list (defaults to common objects if empty)
3. **Select Detector**: Load detector by key (cached via `@lru_cache`)
4. **Process Frames**: Run detection on each frame
   - Call `detector.predict(frame, queries)`
   - Draw green bounding boxes on detections
5. **Write Video**: Encode processed frames back to MP4

Default queries (if none provided): `["person", "car", "truck", "motorcycle", "bicycle", "bus", "train", "airplane"]`
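The query-parsing step can be sketched as follows; the helper name `parse_queries` is hypothetical, but the splitting rule and the default list come straight from the pipeline description above:

```python
DEFAULT_QUERIES = ["person", "car", "truck", "motorcycle",
                   "bicycle", "bus", "train", "airplane"]

def parse_queries(raw: str) -> list:
    """Split comma-separated text into query strings, dropping blanks."""
    parsed = [q.strip() for q in raw.split(",") if q.strip()]
    # Fall back to the common-object defaults when nothing usable was provided.
    return parsed or list(DEFAULT_QUERIES)
```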
### Detector Loading

Detectors are registered in `models/model_loader.py`:

```python
_REGISTRY: Dict[str, Callable[[], ObjectDetector]] = {
    "owlv2_base": Owlv2Detector,
    "hf_yolov8": HuggingFaceYoloV8Detector,
    "detr_resnet50": DetrDetector,
    "grounding_dino": GroundingDinoDetector,
}
```

Detectors are loaded via `load_detector(name)`, which caches instances for performance.
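A runnable sketch of how `load_detector()` could combine the registry with `@lru_cache`; the stub class is a stand-in for the real detector classes, which load model weights on construction:

```python
from functools import lru_cache

class _StubDetector:  # hypothetical stand-in for a real detector class
    def __init__(self, name: str):
        self.name = name

_REGISTRY = {
    "owlv2_base": lambda: _StubDetector("owlv2_base"),
    "hf_yolov8": lambda: _StubDetector("hf_yolov8"),
}

@lru_cache(maxsize=None)
def load_detector(name: str):
    # Instantiation is expensive (model weights), so reuse one instance per key.
    if name not in _REGISTRY:
        raise ValueError(f"Unknown detector: {name!r}")
    return _REGISTRY[name]()
```

Because of the cache, repeated requests with the same detector key reuse the already-loaded model instead of reloading weights.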
### Detection Result Format

All detectors return a `DetectionResult` namedtuple:

```python
DetectionResult(
    boxes: np.ndarray,                   # Nx4 array [x1, y1, x2, y2]
    scores: Sequence[float],             # Confidence scores
    labels: Sequence[int],               # Class indices
    label_names: Optional[Sequence[str]] # Human-readable names
)
```
## File Structure

```
.
├── app.py                    # FastAPI server with /detect endpoint
├── inference.py              # Video processing and detection pipeline
├── demo.html                 # Web interface with mode selector
├── requirements.txt          # Python dependencies
├── models/
│   ├── model_loader.py       # Detector registry and loading
│   └── detectors/
│       ├── base.py           # ObjectDetector interface
│       ├── owlv2.py          # OWLv2 implementation
│       ├── yolov8.py         # YOLOv8 implementation
│       ├── detr.py           # DETR implementation
│       └── grounding_dino.py # Grounding DINO implementation
├── utils/
│   └── video.py              # Video encoding/decoding utilities
└── coco_classes.py           # COCO dataset class definitions
```
## Adding New Detectors

To add a new detector:

1. **Create detector class** in `models/detectors/`:

   ```python
   from .base import ObjectDetector, DetectionResult

   class MyDetector(ObjectDetector):
       name = "my_detector"

       def predict(self, frame, queries):
           # Your detection logic
           return DetectionResult(boxes, scores, labels, label_names)
   ```

2. **Register it in `model_loader.py`**:

   ```python
   _REGISTRY = {
       ...
       "my_detector": MyDetector,
   }
   ```

3. **Update the frontend** detector dropdown in `demo.html`:

   ```html
   <option value="my_detector">My Detector</option>
   ```
## Adding New Detection Modes

To implement additional modes such as drone detection:

1. **Create a specialized detector** (if needed):
   - For segmentation: Extend `SegmentationResult` to include masks
   - For drone detection: Create a `DroneDetector` with specialized filtering
2. **Update the `/detect` endpoint** in `app.py`:

   ```python
   if mode == "segmentation":
       # Run segmentation inference
       # Return video with masks rendered
   ```

3. **Update the frontend** to remove the "disabled" class from the mode card
4. **Update `inference.py`** if needed to handle new output types
## Common Patterns

### Query Processing

Queries are parsed from comma-separated strings:

```python
queries = [q.strip() for q in "person, car, dog".split(",") if q.strip()]
# Result: ["person", "car", "dog"]
```

### Frame Processing Loop

Standard pattern for processing video frames:

```python
processed_frames = []
for idx, frame in enumerate(frames):
    processed_frame, detections = infer_frame(frame, queries, detector_name)
    processed_frames.append(processed_frame)
```

### Temporary File Management

FastAPI's `BackgroundTasks` cleans up temp files after the response is sent:

```python
_schedule_cleanup(background_tasks, input_path)
_schedule_cleanup(background_tasks, output_path)
```
## Performance Notes

- **Detector Caching**: Models are loaded once and cached via `@lru_cache`
- **Default Resolution**: Videos are processed at their original resolution
- **Frame Limit**: Use the `max_frames` parameter of `run_inference()` for testing
- **Memory Usage**: The entire video is loaded into memory (frames list)
## Troubleshooting

### "No module named 'fastapi'"

Install dependencies: `pip install -r requirements.txt`

### "Video decoding failed"

Check video codec compatibility. The system expects MP4/H.264.

### "Detector not found"

Verify that the detector key exists in `model_loader._REGISTRY`.

### Slow processing

- Try a faster detector: YOLOv8 (`hf_yolov8`)
- Reduce video resolution before uploading
- Use the `max_frames` parameter for testing
## Dependencies

Core packages:

- `fastapi` + `uvicorn`: Web server
- `torch` + `transformers`: Deep learning models
- `opencv-python-headless`: Video processing
- `ultralytics`: YOLOv8 implementation
- `huggingface-hub`: Model downloading
- `pillow`, `scipy`, `accelerate`, `timm`: Supporting libraries