# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
Simple video object detection system with three modes:
- **Object Detection**: Detect custom objects using text queries (fully functional)
- **Segmentation**: Mask overlays using SAM3
- **Drone Detection**: (Coming Soon) Specialized UAV detection
## Core Architecture
### Simple Detection Flow
```
User β†’ demo.html β†’ POST /detect β†’ inference.py β†’ detector β†’ processed video
```
1. User selects mode and uploads video via web interface
2. Frontend sends video + mode + queries to `/detect` endpoint
3. Backend runs detection inference with selected model
4. Returns processed video with bounding boxes
### Available Detectors
The system includes 4 pre-trained object detection models:
| Detector | Key | Type | Best For |
|----------|-----|------|----------|
| **OWLv2** | `owlv2_base` | Open-vocabulary | Custom text queries (default) |
| **YOLOv8** | `hf_yolov8` | COCO classes | Fast real-time detection |
| **DETR** | `detr_resnet50` | COCO classes | Transformer-based detection |
| **Grounding DINO** | `grounding_dino` | Open-vocabulary | Text-grounded detection |
All detectors implement the `ObjectDetector` interface in `models/detectors/base.py` with a single `predict()` method.
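A minimal sketch of that interface, assuming only what this document states (an `ObjectDetector` base class with a single `predict()` method returning a `DetectionResult`); the dummy subclass and all field details beyond the documented ones are illustrative, and the real definitions live in `models/detectors/base.py`:

```python
from abc import ABC, abstractmethod
from typing import NamedTuple, Optional, Sequence

import numpy as np


class DetectionResult(NamedTuple):
    boxes: np.ndarray                     # Nx4 array [x1, y1, x2, y2]
    scores: Sequence[float]               # confidence per box
    labels: Sequence[int]                 # class indices
    label_names: Optional[Sequence[str]]  # human-readable names


class ObjectDetector(ABC):
    name: str = "base"

    @abstractmethod
    def predict(self, frame: np.ndarray, queries: Sequence[str]) -> DetectionResult:
        """Run detection on a single frame for the given text queries."""


class DummyDetector(ObjectDetector):
    """Illustrative implementation: one full-frame box per query."""
    name = "dummy"

    def predict(self, frame, queries):
        h, w = frame.shape[:2]
        boxes = np.array([[0, 0, w, h]] * len(queries), dtype=float)
        return DetectionResult(boxes, [0.5] * len(queries),
                               list(range(len(queries))), list(queries))


result = DummyDetector().predict(np.zeros((480, 640, 3), dtype=np.uint8),
                                 ["person", "car"])
```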
## Development Commands
### Setup
```bash
python -m venv .venv
source .venv/bin/activate  # on Windows: .venv\Scripts\activate
pip install -r requirements.txt
```
### Running the Server
```bash
# Development
uvicorn app:app --host 0.0.0.0 --port 7860 --reload
# Production (Docker)
docker build -t object_detectors .
docker run -p 7860:7860 object_detectors
```
### Testing the API
```bash
# Test object detection
curl -X POST http://localhost:7860/detect \
-F "video=@sample.mp4" \
-F "mode=object_detection" \
-F "queries=person,car,dog" \
-F "detector=owlv2_base" \
--output processed.mp4
# Test the drone detection placeholder (returns JSON)
curl -X POST http://localhost:7860/detect \
-F "video=@sample.mp4" \
-F "mode=drone_detection"
```
## Key Implementation Details
### API Endpoint: `/detect`
**Parameters:**
- `video` (file): Video file to process
- `mode` (string): Detection mode - `object_detection`, `segmentation`, or `drone_detection`
- `queries` (string): Comma-separated object classes (for object_detection mode)
- `detector` (string): Model key (default: `owlv2_base`)
**Returns:**
- For `object_detection`: MP4 video with bounding boxes
- For `segmentation`: MP4 video with mask overlays
- For `drone_detection`: JSON with `{"status": "coming_soon", "message": "..."}`
### Inference Pipeline
The `run_inference()` function in `inference.py` follows these steps:
1. **Extract Frames**: Decode video using OpenCV
2. **Parse Queries**: Split comma-separated text into list (defaults to common objects if empty)
3. **Select Detector**: Load detector by key (cached via `@lru_cache`)
4. **Process Frames**: Run detection on each frame
- Call `detector.predict(frame, queries)`
- Draw green bounding boxes on detections
5. **Write Video**: Encode processed frames back to MP4
Default queries (if none provided): `["person", "car", "truck", "motorcycle", "bicycle", "bus", "train", "airplane"]`
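The steps above can be sketched as follows. This is a hedged stand-in, not the real `inference.py`: the detector is a stub, frame extraction and MP4 encoding are omitted, and box drawing uses plain NumPy in place of `cv2.rectangle`:

```python
import numpy as np

DEFAULT_QUERIES = ["person", "car", "truck", "motorcycle",
                   "bicycle", "bus", "train", "airplane"]


def parse_queries(raw: str) -> list[str]:
    # Step 2: split comma-separated text; fall back to defaults if empty
    parsed = [q.strip() for q in raw.split(",") if q.strip()]
    return parsed or list(DEFAULT_QUERIES)


def draw_box(frame: np.ndarray, box, color=(0, 255, 0)) -> None:
    # Step 4b: draw a 2px green rectangle in place (stand-in for cv2.rectangle)
    x1, y1, x2, y2 = (int(v) for v in box)
    frame[y1:y1 + 2, x1:x2] = color
    frame[y2 - 2:y2, x1:x2] = color
    frame[y1:y2, x1:x1 + 2] = color
    frame[y1:y2, x2 - 2:x2] = color


def run_inference(frames, detector, raw_queries="", max_frames=None):
    queries = parse_queries(raw_queries)
    processed = []
    for frame in frames[:max_frames]:
        boxes = detector(frame, queries)  # Step 4a: detector.predict(...)
        out = frame.copy()
        for box in boxes:
            draw_box(out, box)
        processed.append(out)
    return processed  # Step 5 would encode these frames back to MP4


# Stub detector returning one fixed box regardless of input
stub = lambda frame, queries: [(10, 10, 100, 100)]
frames = [np.zeros((240, 320, 3), dtype=np.uint8) for _ in range(3)]
out = run_inference(frames, stub, "person, car", max_frames=2)
```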
### Detector Loading
Detectors are registered in `models/model_loader.py`:
```python
_REGISTRY: Dict[str, Callable[[], ObjectDetector]] = {
"owlv2_base": Owlv2Detector,
"hf_yolov8": HuggingFaceYoloV8Detector,
"detr_resnet50": DetrDetector,
"grounding_dino": GroundingDinoDetector,
}
```
Loaded via `load_detector(name)`, which caches detector instances so each model is loaded only once.
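A runnable sketch of this registry-plus-cache pattern, assuming only the names stated here (`_REGISTRY`, `load_detector`); the fake detector class and the error message wording are illustrative:

```python
from functools import lru_cache
from typing import Callable, Dict


class FakeDetector:
    """Stand-in for a real detector; expensive setup happens in __init__."""
    def __init__(self):
        self.ready = True  # in the real code: load model weights here


_REGISTRY: Dict[str, Callable[[], FakeDetector]] = {
    "fake": FakeDetector,
}


@lru_cache(maxsize=None)
def load_detector(name: str) -> FakeDetector:
    # Cached: a second call with the same key returns the same instance
    try:
        factory = _REGISTRY[name]
    except KeyError:
        raise ValueError(f"Unknown detector: {name!r}. "
                         f"Available: {sorted(_REGISTRY)}") from None
    return factory()


a = load_detector("fake")
b = load_detector("fake")
```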
### Detection Result Format
All detectors return a `DetectionResult` namedtuple:
```python
DetectionResult(
boxes: np.ndarray, # Nx4 array [x1, y1, x2, y2]
scores: Sequence[float], # Confidence scores
labels: Sequence[int], # Class indices
label_names: Optional[Sequence[str]] # Human-readable names
)
```
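A hedged example of consuming a `DetectionResult`: filtering detections by confidence before drawing. The namedtuple fields follow the document; the threshold helper and its default value are illustrative, not part of the codebase:

```python
from collections import namedtuple

import numpy as np

DetectionResult = namedtuple("DetectionResult",
                             ["boxes", "scores", "labels", "label_names"])


def filter_by_score(result: DetectionResult, min_score: float = 0.3) -> DetectionResult:
    # Keep only detections at or above the confidence threshold
    keep = [i for i, s in enumerate(result.scores) if s >= min_score]
    return DetectionResult(
        result.boxes[keep],
        [result.scores[i] for i in keep],
        [result.labels[i] for i in keep],
        [result.label_names[i] for i in keep] if result.label_names else None,
    )


result = DetectionResult(
    boxes=np.array([[0, 0, 10, 10], [5, 5, 20, 20]], dtype=float),
    scores=[0.9, 0.1],
    labels=[0, 1],
    label_names=["person", "car"],
)
kept = filter_by_score(result, min_score=0.3)
```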
## File Structure
```
.
β”œβ”€β”€ app.py # FastAPI server with /detect endpoint
β”œβ”€β”€ inference.py # Video processing and detection pipeline
β”œβ”€β”€ demo.html # Web interface with mode selector
β”œβ”€β”€ requirements.txt # Python dependencies
β”œβ”€β”€ models/
β”‚ β”œβ”€β”€ model_loader.py # Detector registry and loading
β”‚ └── detectors/
β”‚ β”œβ”€β”€ base.py # ObjectDetector interface
β”‚ β”œβ”€β”€ owlv2.py # OWLv2 implementation
β”‚ β”œβ”€β”€ yolov8.py # YOLOv8 implementation
β”‚ β”œβ”€β”€ detr.py # DETR implementation
β”‚ └── grounding_dino.py # Grounding DINO implementation
β”œβ”€β”€ utils/
β”‚ └── video.py # Video encoding/decoding utilities
└── coco_classes.py # COCO dataset class definitions
```
## Adding New Detectors
To add a new detector:
1. **Create detector class** in `models/detectors/`:
```python
from .base import ObjectDetector, DetectionResult
class MyDetector(ObjectDetector):
name = "my_detector"
def predict(self, frame, queries):
# Your detection logic
return DetectionResult(boxes, scores, labels, label_names)
```
2. **Register in model_loader.py**:
```python
_REGISTRY = {
...
"my_detector": MyDetector,
}
```
3. **Update frontend** `demo.html` detector dropdown:
```html
<option value="my_detector">My Detector</option>
```
## Adding New Detection Modes
To implement additional modes such as drone detection:
1. **Create specialized detector** (if needed):
- For segmentation: Extend `SegmentationResult` to include masks
- For drone detection: Create `DroneDetector` with specialized filtering
2. **Update `/detect` endpoint** in `app.py`:
```python
if mode == "segmentation":
# Run segmentation inference
# Return video with masks rendered
```
3. **Update frontend** to remove "disabled" class from mode card
4. **Update inference.py** if needed to handle new output types
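The mode dispatch in step 2 can be sketched as a plain function so it runs without FastAPI. Mode names come from this document; the returned dicts are illustrative stand-ins for the real video/JSON responses in `app.py`:

```python
def dispatch(mode: str) -> dict:
    # Illustrative: the real handler returns a video file or JSONResponse
    if mode == "object_detection":
        return {"kind": "video", "renderer": "boxes"}
    if mode == "segmentation":
        return {"kind": "video", "renderer": "masks"}
    if mode == "drone_detection":
        return {"status": "coming_soon",
                "message": "Drone detection is not implemented yet"}
    raise ValueError(f"Unknown mode: {mode!r}")
```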
## Common Patterns
### Query Processing
Queries are parsed from comma-separated strings:
```python
queries = [q.strip() for q in "person, car, dog".split(",") if q.strip()]
# Result: ["person", "car", "dog"]
```
### Frame Processing Loop
Standard pattern for processing video frames:
```python
processed_frames = []
for idx, frame in enumerate(frames):
processed_frame, detections = infer_frame(frame, queries, detector_name)
processed_frames.append(processed_frame)
```
### Temporary File Management
FastAPI's `BackgroundTasks` cleans up temp files after response:
```python
_schedule_cleanup(background_tasks, input_path)
_schedule_cleanup(background_tasks, output_path)
```
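The pattern can be illustrated without a running server. This sketch replaces FastAPI's `BackgroundTasks` with a plain task list (the real `_schedule_cleanup` would call `background_tasks.add_task(...)` instead); the helper body here is an assumption, not the actual implementation:

```python
import os
import tempfile


def _schedule_cleanup(tasks: list, path: str) -> None:
    # Queue a delete-after-response callback; in the real app this is
    # background_tasks.add_task(os.remove, path)
    tasks.append(lambda: os.path.exists(path) and os.remove(path))


tasks = []
fd, path = tempfile.mkstemp(suffix=".mp4")  # stand-in for the uploaded video
os.close(fd)
_schedule_cleanup(tasks, path)

# ...response is sent, then the framework runs the queued tasks:
for task in tasks:
    task()
```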
## Performance Notes
- **Detector Caching**: Models are loaded once and cached via `@lru_cache`
- **Default Resolution**: Videos processed at original resolution
- **Frame Limit**: Use `max_frames` parameter in `run_inference()` for testing
- **Memory Usage**: Entire video is loaded into memory (frames list)
## Troubleshooting
### "No module named 'fastapi'"
Install dependencies: `pip install -r requirements.txt`
### "Video decoding failed"
Check video codec compatibility. System expects MP4/H.264.
### "Detector not found"
Verify detector key exists in `model_loader._REGISTRY`
### Slow processing
- Try faster detector: YOLOv8 (`hf_yolov8`)
- Reduce video resolution before uploading
- Use `max_frames` parameter for testing
## Dependencies
Core packages:
- `fastapi` + `uvicorn`: Web server
- `torch` + `transformers`: Deep learning models
- `opencv-python-headless`: Video processing
- `ultralytics`: YOLOv8 implementation
- `huggingface-hub`: Model downloading
- `pillow`, `scipy`, `accelerate`, `timm`: Supporting libraries