# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Simple video object detection system with three modes:
- **Object Detection**: Detect custom objects using text queries (fully functional)
- **Segmentation**: Mask overlays using SAM3
- **Drone Detection**: (Coming Soon) Specialized UAV detection

## Core Architecture

### Simple Detection Flow

```
User β†’ demo.html β†’ POST /detect β†’ inference.py β†’ detector β†’ processed video
```

1. User selects mode and uploads video via web interface
2. Frontend sends video + mode + queries to `/detect` endpoint
3. Backend runs detection inference with selected model
4. Returns processed video with bounding boxes

### Available Detectors

The system includes 4 pre-trained object detection models:

| Detector | Key | Type | Best For |
|----------|-----|------|----------|
| **OWLv2** | `owlv2_base` | Open-vocabulary | Custom text queries (default) |
| **YOLOv8** | `hf_yolov8` | COCO classes | Fast real-time detection |
| **DETR** | `detr_resnet50` | COCO classes | Transformer-based detection |
| **Grounding DINO** | `grounding_dino` | Open-vocabulary | Text-grounded detection |

All detectors implement the `ObjectDetector` interface in `models/detectors/base.py` with a single `predict()` method.
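
The interface is small enough to sketch. The stand-in below mirrors the shape described (the real base class lives in `models/detectors/base.py` and uses `np.ndarray` boxes); `DummyDetector` is a hypothetical example, not one of the four registered models:

```python
from typing import NamedTuple, Optional, Sequence

class DetectionResult(NamedTuple):
    boxes: Sequence[Sequence[float]]   # Nx4 [x1, y1, x2, y2]
    scores: Sequence[float]            # confidence scores
    labels: Sequence[int]              # class indices
    label_names: Optional[Sequence[str]] = None

class ObjectDetector:
    name: str = "base"

    def predict(self, frame, queries) -> DetectionResult:
        raise NotImplementedError

class DummyDetector(ObjectDetector):
    """Returns one fixed box per query; for illustration only."""
    name = "dummy"

    def predict(self, frame, queries) -> DetectionResult:
        n = len(queries)
        return DetectionResult(
            boxes=[[0.0, 0.0, 10.0, 10.0]] * n,
            scores=[0.9] * n,
            labels=list(range(n)),
            label_names=list(queries),
        )

result = DummyDetector().predict(None, ["person", "car"])
```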

## Development Commands

### Setup
```bash
python -m venv .venv
source .venv/bin/activate  # or `.venv\Scripts\activate` on Windows
pip install -r requirements.txt
```

### Running the Server
```bash
# Development
uvicorn app:app --host 0.0.0.0 --port 7860 --reload

# Production (Docker)
docker build -t object_detectors .
docker run -p 7860:7860 object_detectors
```

### Testing the API
```bash
# Test object detection
curl -X POST http://localhost:7860/detect \
  -F "video=@sample.mp4" \
  -F "mode=object_detection" \
  -F "queries=person,car,dog" \
  -F "detector=owlv2_base" \
  --output processed.mp4

# Test the placeholder mode (returns JSON)
curl -X POST http://localhost:7860/detect \
  -F "video=@sample.mp4" \
  -F "mode=drone_detection"
```

## Key Implementation Details

### API Endpoint: `/detect`

**Parameters:**
- `video` (file): Video file to process
- `mode` (string): Detection mode - `object_detection`, `segmentation`, or `drone_detection`
- `queries` (string): Comma-separated object classes (for object_detection mode)
- `detector` (string): Model key (default: `owlv2_base`)

**Returns:**
- For `object_detection`: MP4 video with bounding boxes
- For `segmentation`: MP4 video with mask overlays
- For `drone_detection`: JSON with `{"status": "coming_soon", "message": "..."}` 

### Inference Pipeline

The `run_inference()` function in `inference.py` follows these steps:

1. **Extract Frames**: Decode video using OpenCV
2. **Parse Queries**: Split comma-separated text into list (defaults to common objects if empty)
3. **Select Detector**: Load detector by key (cached via `@lru_cache`)
4. **Process Frames**: Run detection on each frame
   - Call `detector.predict(frame, queries)`
   - Draw green bounding boxes on detections
5. **Write Video**: Encode processed frames back to MP4

Default queries (if none provided): `["person", "car", "truck", "motorcycle", "bicycle", "bus", "train", "airplane"]`
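
The steps above can be sketched as follows. This is a simplified stand-in for the real `run_inference()` (it omits the OpenCV decode/encode steps), and `infer_frame` here is a placeholder rather than the repository's implementation:

```python
DEFAULT_QUERIES = ["person", "car", "truck", "motorcycle",
                   "bicycle", "bus", "train", "airplane"]

def parse_queries(raw: str) -> list:
    """Split comma-separated text, falling back to the defaults."""
    queries = [q.strip() for q in raw.split(",") if q.strip()]
    return queries or list(DEFAULT_QUERIES)

def infer_frame(frame, queries, detector_name):
    # Stand-in for detector.predict() plus green-box drawing.
    return frame, []

def run_pipeline(frames, raw_queries="", detector_name="owlv2_base",
                 max_frames=None):
    queries = parse_queries(raw_queries)
    processed = []
    for frame in frames[:max_frames]:          # max_frames=None keeps all
        out_frame, _detections = infer_frame(frame, queries, detector_name)
        processed.append(out_frame)
    return processed
```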

### Detector Loading

Detectors are registered in `models/model_loader.py`:

```python
_REGISTRY: Dict[str, Callable[[], ObjectDetector]] = {
    "owlv2_base": Owlv2Detector,
    "hf_yolov8": HuggingFaceYoloV8Detector,
    "detr_resnet50": DetrDetector,
    "grounding_dino": GroundingDinoDetector,
}
```

Loaded via `load_detector(name)` which caches instances for performance.
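
A minimal sketch of that caching behaviour, assuming zero-argument factories as in the registry above (`object` stands in for a detector class here):

```python
from functools import lru_cache

# Stand-in registry; the real one maps keys to detector classes.
_REGISTRY = {"dummy": object}

@lru_cache(maxsize=None)
def load_detector(name: str):
    try:
        factory = _REGISTRY[name]
    except KeyError:
        raise ValueError(f"Unknown detector: {name!r}") from None
    return factory()

first = load_detector("dummy")
second = load_detector("dummy")
# first is second: repeat calls return the same cached instance
```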

### Detection Result Format

All detectors return a `DetectionResult` namedtuple:
```python
DetectionResult(
    boxes: np.ndarray,        # Nx4 array [x1, y1, x2, y2]
    scores: Sequence[float],  # Confidence scores
    labels: Sequence[int],    # Class indices
    label_names: Optional[Sequence[str]]  # Human-readable names
)
```
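
A hedged example of consuming a `DetectionResult`, as the pipeline does when labelling each green box; the values are hand-built for illustration:

```python
from collections import namedtuple

DetectionResult = namedtuple(
    "DetectionResult", ["boxes", "scores", "labels", "label_names"])

result = DetectionResult(
    boxes=[[10, 20, 110, 220], [50, 60, 90, 120]],
    scores=[0.87, 0.55],
    labels=[0, 1],
    label_names=["person", "car"],
)

# Pair each box with its score and human-readable name.
captions = []
for (x1, y1, x2, y2), score, name in zip(result.boxes, result.scores,
                                         result.label_names):
    captions.append(f"{name} {score:.2f} @ ({x1},{y1})-({x2},{y2})")
```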

## File Structure

```
.
β”œβ”€β”€ app.py                    # FastAPI server with /detect endpoint
β”œβ”€β”€ inference.py              # Video processing and detection pipeline
β”œβ”€β”€ demo.html                 # Web interface with mode selector
β”œβ”€β”€ requirements.txt          # Python dependencies
β”œβ”€β”€ models/
β”‚   β”œβ”€β”€ model_loader.py      # Detector registry and loading
β”‚   └── detectors/
β”‚       β”œβ”€β”€ base.py          # ObjectDetector interface
β”‚       β”œβ”€β”€ owlv2.py         # OWLv2 implementation
β”‚       β”œβ”€β”€ yolov8.py        # YOLOv8 implementation
β”‚       β”œβ”€β”€ detr.py          # DETR implementation
β”‚       └── grounding_dino.py # Grounding DINO implementation
β”œβ”€β”€ utils/
β”‚   └── video.py             # Video encoding/decoding utilities
└── coco_classes.py          # COCO dataset class definitions
```

## Adding New Detectors

To add a new detector:

1. **Create detector class** in `models/detectors/`:
   ```python
   from .base import ObjectDetector, DetectionResult

   class MyDetector(ObjectDetector):
       name = "my_detector"

       def predict(self, frame, queries):
           # Your detection logic
           return DetectionResult(boxes, scores, labels, label_names)
   ```

2. **Register in model_loader.py**:
   ```python
   _REGISTRY = {
       ...
       "my_detector": MyDetector,
   }
   ```

3. **Update frontend** `demo.html` detector dropdown:
   ```html
   <option value="my_detector">My Detector</option>
   ```

## Adding New Detection Modes

To implement additional modes such as drone detection:

1. **Create specialized detector** (if needed):
   - For segmentation: Extend `SegmentationResult` to include masks
   - For drone detection: Create `DroneDetector` with specialized filtering

2. **Update `/detect` endpoint** in `app.py`:
   ```python
   if mode == "segmentation":
       # Run segmentation inference
       # Return video with masks rendered
   ```

3. **Update frontend** to remove "disabled" class from mode card

4. **Update inference.py** if needed to handle new output types

## Common Patterns

### Query Processing
Queries are parsed from comma-separated strings:
```python
queries = [q.strip() for q in "person, car, dog".split(",") if q.strip()]
# Result: ["person", "car", "dog"]
```

### Frame Processing Loop
Standard pattern for processing video frames:
```python
processed_frames = []
for idx, frame in enumerate(frames):
    processed_frame, detections = infer_frame(frame, queries, detector_name)
    processed_frames.append(processed_frame)
```

### Temporary File Management
FastAPI's `BackgroundTasks` cleans up temp files after response:
```python
_schedule_cleanup(background_tasks, input_path)
_schedule_cleanup(background_tasks, output_path)
```
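
A minimal sketch of the pattern, assuming `_schedule_cleanup` simply registers `os.remove` on the temp path; a plain list of callbacks stands in for `fastapi.BackgroundTasks` here:

```python
import os
import tempfile

def _schedule_cleanup(tasks, path):
    # With FastAPI this would be tasks.add_task(os.remove, path).
    tasks.append(lambda p=path: os.path.exists(p) and os.remove(p))

tasks = []
fd, tmp_path = tempfile.mkstemp(suffix=".mp4")
os.close(fd)
_schedule_cleanup(tasks, tmp_path)

# FastAPI runs scheduled tasks after the response has been sent:
for task in tasks:
    task()
```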

## Performance Notes

- **Detector Caching**: Models are loaded once and cached via `@lru_cache`
- **Default Resolution**: Videos processed at original resolution
- **Frame Limit**: Use `max_frames` parameter in `run_inference()` for testing
- **Memory Usage**: Entire video is loaded into memory (frames list)

## Troubleshooting

### "No module named 'fastapi'"
Install dependencies: `pip install -r requirements.txt`

### "Video decoding failed"
Check video codec compatibility. System expects MP4/H.264.

### "Detector not found"
Verify detector key exists in `model_loader._REGISTRY`

### Slow processing
- Try faster detector: YOLOv8 (`hf_yolov8`)
- Reduce video resolution before uploading
- Use `max_frames` parameter for testing

## Dependencies

Core packages:
- `fastapi` + `uvicorn`: Web server
- `torch` + `transformers`: Deep learning models
- `opencv-python-headless`: Video processing
- `ultralytics`: YOLOv8 implementation
- `huggingface-hub`: Model downloading
- `pillow`, `scipy`, `accelerate`, `timm`: Supporting libraries