# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview
Simple video object detection system with three modes:
- Object Detection: Detect custom objects using text queries (fully functional)
- Segmentation: Mask overlays using SAM3
- Drone Detection: (Coming Soon) Specialized UAV detection
## Core Architecture

### Simple Detection Flow

```
User → demo.html → POST /detect → inference.py → detector → processed video
```
- User selects a mode and uploads a video via the web interface
- Frontend sends video + mode + queries to the `/detect` endpoint
- Backend runs detection inference with the selected model
- Returns the processed video with bounding boxes
### Available Detectors

The system includes 4 pre-trained object detection models:

| Detector | Key | Type | Best For |
|---|---|---|---|
| OWLv2 | `owlv2_base` | Open-vocabulary | Custom text queries (default) |
| YOLOv8 | `hf_yolov8` | COCO classes | Fast real-time detection |
| DETR | `detr_resnet50` | COCO classes | Transformer-based detection |
| Grounding DINO | `grounding_dino` | Open-vocabulary | Text-grounded detection |
All detectors implement the `ObjectDetector` interface in `models/detectors/base.py` with a single `predict()` method.
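The interface is deliberately small. A sketch of what `base.py` plausibly defines, matching the `DetectionResult` format documented below (exact signatures in the repository may differ):

```python
from abc import ABC, abstractmethod
from typing import NamedTuple, Optional, Sequence

import numpy as np


class DetectionResult(NamedTuple):
    boxes: np.ndarray                            # Nx4 array [x1, y1, x2, y2]
    scores: Sequence[float]                      # confidence per box
    labels: Sequence[int]                        # class indices
    label_names: Optional[Sequence[str]] = None  # human-readable names


class ObjectDetector(ABC):
    name: str = "base"  # registry key, e.g. "owlv2_base"

    @abstractmethod
    def predict(self, frame: np.ndarray,
                queries: Sequence[str]) -> DetectionResult:
        """Run detection on a single frame and return boxes."""
```

Keeping the interface to one method is what lets `inference.py` treat all four detectors interchangeably.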
## Development Commands

### Setup

```bash
python -m venv .venv
source .venv/bin/activate  # on Windows: .venv\Scripts\activate
pip install -r requirements.txt
```
### Running the Server

```bash
# Development
uvicorn app:app --host 0.0.0.0 --port 7860 --reload

# Production (Docker)
docker build -t object_detectors .
docker run -p 7860:7860 object_detectors
```
### Testing the API

```bash
# Test object detection
curl -X POST http://localhost:7860/detect \
  -F "video=@sample.mp4" \
  -F "mode=object_detection" \
  -F "queries=person,car,dog" \
  -F "detector=owlv2_base" \
  --output processed.mp4

# Test placeholder modes (returns JSON)
curl -X POST http://localhost:7860/detect \
  -F "video=@sample.mp4" \
  -F "mode=segmentation"
```
## Key Implementation Details

### API Endpoint: `/detect`

Parameters:

- `video` (file): Video file to process
- `mode` (string): Detection mode - `object_detection`, `segmentation`, or `drone_detection`
- `queries` (string): Comma-separated object classes (for object_detection mode)
- `detector` (string): Model key (default: `owlv2_base`)
Returns:

- For `object_detection`: MP4 video with bounding boxes
- For `segmentation`: MP4 video with mask overlays
- For `drone_detection`: JSON with `{"status": "coming_soon", "message": "..."}`
### Inference Pipeline

The `run_inference()` function in `inference.py` follows these steps:

1. Extract Frames: Decode the video using OpenCV
2. Parse Queries: Split comma-separated text into a list (defaults to common objects if empty)
3. Select Detector: Load the detector by key (cached via `@lru_cache`)
4. Process Frames: Run detection on each frame
   - Call `detector.predict(frame, queries)`
   - Draw green bounding boxes on detections
5. Write Video: Encode the processed frames back to MP4

Default queries (if none provided): `["person", "car", "truck", "motorcycle", "bicycle", "bus", "train", "airplane"]`
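The middle of that pipeline can be sketched as a pure function over an in-memory frames list (which matches how the system holds video in memory). This is an illustration, not the actual `run_inference()`: decoding/encoding (steps 1 and 5) are left to the OpenCV helpers, and the per-frame function is passed in to stay detector-agnostic:

```python
import numpy as np

DEFAULT_QUERIES = ["person", "car", "truck", "motorcycle",
                   "bicycle", "bus", "train", "airplane"]


def run_inference_sketch(frames, raw_queries, infer_frame,
                         detector_name="owlv2_base", max_frames=None):
    """Steps 2 and 4 of the pipeline on a list of decoded frames.

    `infer_frame(frame, queries, detector_name)` returns
    (drawn_frame, detections), as in inference.py.
    """
    # 2. Parse queries, falling back to the common-object defaults
    queries = [q.strip() for q in (raw_queries or "").split(",") if q.strip()]
    queries = queries or list(DEFAULT_QUERIES)

    if max_frames is not None:
        frames = frames[:max_frames]          # frame limit for quick tests

    processed = []
    for frame in frames:                      # 4. Per-frame detection
        drawn, _detections = infer_frame(frame, queries, detector_name)
        processed.append(drawn)
    return processed, queries
```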
### Detector Loading

Detectors are registered in `models/model_loader.py`:

```python
_REGISTRY: Dict[str, Callable[[], ObjectDetector]] = {
    "owlv2_base": Owlv2Detector,
    "hf_yolov8": HuggingFaceYoloV8Detector,
    "detr_resnet50": DetrDetector,
    "grounding_dino": GroundingDinoDetector,
}
```

Detectors are loaded via `load_detector(name)`, which caches instances for performance.
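The caching pattern is small enough to show in full. A minimal sketch of how `load_detector()` likely works (the stand-in detector class and the error message are mine; the real loader may differ):

```python
from functools import lru_cache
from typing import Callable, Dict


class DummyDetector:
    """Stand-in for a real ObjectDetector implementation."""
    def __init__(self):
        self.loaded = True


_REGISTRY: Dict[str, Callable[[], DummyDetector]] = {
    "owlv2_base": DummyDetector,
}


@lru_cache(maxsize=None)
def load_detector(name: str):
    try:
        factory = _REGISTRY[name]
    except KeyError:
        raise ValueError(f"Detector not found: {name!r}; "
                         f"known keys: {sorted(_REGISTRY)}") from None
    return factory()  # instantiated once, then served from the cache
```

Because `@lru_cache` keys on the name, each model's weights are loaded into memory only on the first request that uses it.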
### Detection Result Format

All detectors return a `DetectionResult` namedtuple:

```python
DetectionResult(
    boxes: np.ndarray,                    # Nx4 array [x1, y1, x2, y2]
    scores: Sequence[float],              # Confidence scores
    labels: Sequence[int],                # Class indices
    label_names: Optional[Sequence[str]]  # Human-readable names
)
```
## File Structure

```
.
├── app.py               # FastAPI server with /detect endpoint
├── inference.py         # Video processing and detection pipeline
├── demo.html            # Web interface with mode selector
├── requirements.txt     # Python dependencies
├── models/
│   ├── model_loader.py  # Detector registry and loading
│   └── detectors/
│       ├── base.py      # ObjectDetector interface
│       ├── owlv2.py     # OWLv2 implementation
│       ├── yolov8.py    # YOLOv8 implementation
│       ├── detr.py      # DETR implementation
│       └── grounding_dino.py  # Grounding DINO implementation
├── utils/
│   └── video.py         # Video encoding/decoding utilities
└── coco_classes.py      # COCO dataset class definitions
```
## Adding New Detectors

To add a new detector:

1. Create a detector class in `models/detectors/`:

   ```python
   from .base import ObjectDetector, DetectionResult

   class MyDetector(ObjectDetector):
       name = "my_detector"

       def predict(self, frame, queries):
           # Your detection logic
           return DetectionResult(boxes, scores, labels, label_names)
   ```

2. Register it in `model_loader.py`:

   ```python
   _REGISTRY = {
       ...
       "my_detector": MyDetector,
   }
   ```

3. Update the detector dropdown in `demo.html`:

   ```html
   <option value="my_detector">My Detector</option>
   ```
## Adding New Detection Modes

To implement additional modes such as drone detection:

1. Create a specialized detector (if needed):
   - For segmentation: Extend `SegmentationResult` to include masks
   - For drone detection: Create a `DroneDetector` with specialized filtering
2. Update the `/detect` endpoint in `app.py`:

   ```python
   if mode == "segmentation":
       # Run segmentation inference
       # Return video with masks rendered
   ```

3. Update the frontend to remove the "disabled" class from the mode card
4. Update `inference.py` if needed to handle new output types
## Common Patterns

### Query Processing

Queries are parsed from comma-separated strings:

```python
queries = [q.strip() for q in "person, car, dog".split(",") if q.strip()]
# Result: ["person", "car", "dog"]
```
### Frame Processing Loop

Standard pattern for processing video frames:

```python
processed_frames = []
for idx, frame in enumerate(frames):
    processed_frame, detections = infer_frame(frame, queries, detector_name)
    processed_frames.append(processed_frame)
```
### Temporary File Management

FastAPI's `BackgroundTasks` cleans up temp files after the response is sent:

```python
_schedule_cleanup(background_tasks, input_path)
_schedule_cleanup(background_tasks, output_path)
```
## Performance Notes

- Detector Caching: Models are loaded once and cached via `@lru_cache`
- Default Resolution: Videos are processed at their original resolution
- Frame Limit: Use the `max_frames` parameter in `run_inference()` for testing
- Memory Usage: The entire video is loaded into memory (frames list)
## Troubleshooting

### "No module named 'fastapi'"

Install dependencies: `pip install -r requirements.txt`

### "Video decoding failed"

Check video codec compatibility. The system expects MP4/H.264.

### "Detector not found"

Verify the detector key exists in `model_loader._REGISTRY`.

### Slow processing

- Try a faster detector: YOLOv8 (`hf_yolov8`)
- Reduce video resolution before uploading
- Use the `max_frames` parameter for testing
## Dependencies

Core packages:

- `fastapi` + `uvicorn`: Web server
- `torch` + `transformers`: Deep learning models
- `opencv-python-headless`: Video processing
- `ultralytics`: YOLOv8 implementation
- `huggingface-hub`: Model downloading
- `pillow`, `scipy`, `accelerate`, `timm`: Supporting libraries