# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Simple video object detection system with three modes:
- **Object Detection**: Detect custom objects using text queries (fully functional)
- **Segmentation**: Mask overlays using SAM3
- **Drone Detection**: (Coming Soon) Specialized UAV detection

## Core Architecture

### Simple Detection Flow

```
User β†’ demo.html β†’ POST /detect β†’ inference.py β†’ detector β†’ processed video
```

1. User selects mode and uploads video via web interface
2. Frontend sends video + mode + queries to `/detect` endpoint
3. Backend runs detection inference with selected model
4. Returns processed video with bounding boxes

### Available Detectors

The system includes 4 pre-trained object detection models:

| Detector | Key | Type | Best For |
|----------|-----|------|----------|
| **OWLv2** | `owlv2_base` | Open-vocabulary | Custom text queries (default) |
| **YOLOv8** | `hf_yolov8` | COCO classes | Fast real-time detection |
| **DETR** | `detr_resnet50` | COCO classes | Transformer-based detection |
| **Grounding DINO** | `grounding_dino` | Open-vocabulary | Text-grounded detection |

All detectors implement the `ObjectDetector` interface in `models/detectors/base.py` with a single `predict()` method.
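
The interface is small enough to sketch. The stand-in below mirrors the shape described (the real base class lives in `models/detectors/base.py` and uses `np.ndarray` boxes); `DummyDetector` is a hypothetical example, not one of the four registered models:

```python
from typing import NamedTuple, Optional, Sequence

class DetectionResult(NamedTuple):
    boxes: Sequence[Sequence[float]]   # Nx4 [x1, y1, x2, y2]
    scores: Sequence[float]            # confidence scores
    labels: Sequence[int]              # class indices
    label_names: Optional[Sequence[str]] = None

class ObjectDetector:
    name: str = "base"

    def predict(self, frame, queries) -> DetectionResult:
        raise NotImplementedError

class DummyDetector(ObjectDetector):
    """Returns one fixed box per query; for illustration only."""
    name = "dummy"

    def predict(self, frame, queries) -> DetectionResult:
        n = len(queries)
        return DetectionResult(
            boxes=[[0.0, 0.0, 10.0, 10.0]] * n,
            scores=[0.9] * n,
            labels=list(range(n)),
            label_names=list(queries),
        )

result = DummyDetector().predict(None, ["person", "car"])
```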

## Development Commands

### Setup
```bash
python -m venv .venv
source .venv/bin/activate  # or `.venv\Scripts\activate` on Windows
pip install -r requirements.txt
```

### Running the Server
```bash
# Development
uvicorn app:app --host 0.0.0.0 --port 7860 --reload

# Production (Docker)
docker build -t object_detectors .
docker run -p 7860:7860 object_detectors
```

### Testing the API
```bash
# Test object detection
curl -X POST http://localhost:7860/detect \
  -F "video=@sample.mp4" \
  -F "mode=object_detection" \
  -F "queries=person,car,dog" \
  -F "detector=owlv2_base" \
  --output processed.mp4

# Test the placeholder mode (returns JSON)
curl -X POST http://localhost:7860/detect \
  -F "video=@sample.mp4" \
  -F "mode=drone_detection"
```

## Key Implementation Details

### API Endpoint: `/detect`

**Parameters:**
- `video` (file): Video file to process
- `mode` (string): Detection mode - `object_detection`, `segmentation`, or `drone_detection`
- `queries` (string): Comma-separated object classes (for object_detection mode)
- `detector` (string): Model key (default: `owlv2_base`)

**Returns:**
- For `object_detection`: MP4 video with bounding boxes
- For `segmentation`: MP4 video with mask overlays
- For `drone_detection`: JSON with `{"status": "coming_soon", "message": "..."}` 

### Inference Pipeline

The `run_inference()` function in `inference.py` follows these steps:

1. **Extract Frames**: Decode video using OpenCV
2. **Parse Queries**: Split comma-separated text into list (defaults to common objects if empty)
3. **Select Detector**: Load detector by key (cached via `@lru_cache`)
4. **Process Frames**: Run detection on each frame
   - Call `detector.predict(frame, queries)`
   - Draw green bounding boxes on detections
5. **Write Video**: Encode processed frames back to MP4

Default queries (if none provided): `["person", "car", "truck", "motorcycle", "bicycle", "bus", "train", "airplane"]`
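
The steps above can be sketched as follows. This is a simplified stand-in for the real `run_inference()` (it omits the OpenCV decode/encode steps), and `infer_frame` here is a placeholder rather than the repository's implementation:

```python
DEFAULT_QUERIES = ["person", "car", "truck", "motorcycle",
                   "bicycle", "bus", "train", "airplane"]

def parse_queries(raw: str) -> list:
    """Split comma-separated text, falling back to the defaults."""
    queries = [q.strip() for q in raw.split(",") if q.strip()]
    return queries or list(DEFAULT_QUERIES)

def infer_frame(frame, queries, detector_name):
    # Stand-in for detector.predict() plus green-box drawing.
    return frame, []

def run_pipeline(frames, raw_queries="", detector_name="owlv2_base",
                 max_frames=None):
    queries = parse_queries(raw_queries)
    processed = []
    for frame in frames[:max_frames]:          # max_frames=None keeps all
        out_frame, _detections = infer_frame(frame, queries, detector_name)
        processed.append(out_frame)
    return processed
```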

### Detector Loading

Detectors are registered in `models/model_loader.py`:

```python
_REGISTRY: Dict[str, Callable[[], ObjectDetector]] = {
    "owlv2_base": Owlv2Detector,
    "hf_yolov8": HuggingFaceYoloV8Detector,
    "detr_resnet50": DetrDetector,
    "grounding_dino": GroundingDinoDetector,
}
```

Loaded via `load_detector(name)` which caches instances for performance.
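
A minimal sketch of that caching behaviour, assuming zero-argument factories as in the registry above (`object` stands in for a detector class here):

```python
from functools import lru_cache

# Stand-in registry; the real one maps keys to detector classes.
_REGISTRY = {"dummy": object}

@lru_cache(maxsize=None)
def load_detector(name: str):
    try:
        factory = _REGISTRY[name]
    except KeyError:
        raise ValueError(f"Unknown detector: {name!r}") from None
    return factory()

first = load_detector("dummy")
second = load_detector("dummy")
# first is second: repeat calls return the same cached instance
```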

### Detection Result Format

All detectors return a `DetectionResult` namedtuple:
```python
DetectionResult(
    boxes: np.ndarray,        # Nx4 array [x1, y1, x2, y2]
    scores: Sequence[float],  # Confidence scores
    labels: Sequence[int],    # Class indices
    label_names: Optional[Sequence[str]]  # Human-readable names
)
```
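
A hedged example of consuming a `DetectionResult`, as the pipeline does when labelling each green box; the values are hand-built for illustration:

```python
from collections import namedtuple

DetectionResult = namedtuple(
    "DetectionResult", ["boxes", "scores", "labels", "label_names"])

result = DetectionResult(
    boxes=[[10, 20, 110, 220], [50, 60, 90, 120]],
    scores=[0.87, 0.55],
    labels=[0, 1],
    label_names=["person", "car"],
)

# Pair each box with its score and human-readable name.
captions = []
for (x1, y1, x2, y2), score, name in zip(result.boxes, result.scores,
                                         result.label_names):
    captions.append(f"{name} {score:.2f} @ ({x1},{y1})-({x2},{y2})")
```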

## File Structure

```
.
β”œβ”€β”€ app.py                    # FastAPI server with /detect endpoint
β”œβ”€β”€ inference.py              # Video processing and detection pipeline
β”œβ”€β”€ demo.html                 # Web interface with mode selector
β”œβ”€β”€ requirements.txt          # Python dependencies
β”œβ”€β”€ models/
β”‚   β”œβ”€β”€ model_loader.py      # Detector registry and loading
β”‚   └── detectors/
β”‚       β”œβ”€β”€ base.py          # ObjectDetector interface
β”‚       β”œβ”€β”€ owlv2.py         # OWLv2 implementation
β”‚       β”œβ”€β”€ yolov8.py        # YOLOv8 implementation
β”‚       β”œβ”€β”€ detr.py          # DETR implementation
β”‚       └── grounding_dino.py # Grounding DINO implementation
β”œβ”€β”€ utils/
β”‚   └── video.py             # Video encoding/decoding utilities
└── coco_classes.py          # COCO dataset class definitions
```

## Adding New Detectors

To add a new detector:

1. **Create detector class** in `models/detectors/`:
   ```python
   from .base import ObjectDetector, DetectionResult

   class MyDetector(ObjectDetector):
       name = "my_detector"

       def predict(self, frame, queries):
           # Your detection logic
           return DetectionResult(boxes, scores, labels, label_names)
   ```

2. **Register in model_loader.py**:
   ```python
   _REGISTRY = {
       ...
       "my_detector": MyDetector,
   }
   ```

3. **Update frontend** `demo.html` detector dropdown:
   ```html
   <option value="my_detector">My Detector</option>
   ```

## Adding New Detection Modes

To implement additional modes such as drone detection:

1. **Create specialized detector** (if needed):
   - For segmentation: Extend `SegmentationResult` to include masks
   - For drone detection: Create `DroneDetector` with specialized filtering

2. **Update `/detect` endpoint** in `app.py`:
   ```python
   if mode == "segmentation":
       # Run segmentation inference
       # Return video with masks rendered
   ```

3. **Update frontend** to remove "disabled" class from mode card

4. **Update inference.py** if needed to handle new output types

## Common Patterns

### Query Processing
Queries are parsed from comma-separated strings:
```python
queries = [q.strip() for q in "person, car, dog".split(",") if q.strip()]
# Result: ["person", "car", "dog"]
```

### Frame Processing Loop
Standard pattern for processing video frames:
```python
processed_frames = []
for idx, frame in enumerate(frames):
    processed_frame, detections = infer_frame(frame, queries, detector_name)
    processed_frames.append(processed_frame)
```

### Temporary File Management
FastAPI's `BackgroundTasks` cleans up temp files after response:
```python
_schedule_cleanup(background_tasks, input_path)
_schedule_cleanup(background_tasks, output_path)
```
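
A minimal sketch of the pattern, assuming `_schedule_cleanup` simply registers `os.remove` on the temp path; a plain list of callbacks stands in for `fastapi.BackgroundTasks` here:

```python
import os
import tempfile

def _schedule_cleanup(tasks, path):
    # With FastAPI this would be tasks.add_task(os.remove, path).
    tasks.append(lambda p=path: os.path.exists(p) and os.remove(p))

tasks = []
fd, tmp_path = tempfile.mkstemp(suffix=".mp4")
os.close(fd)
_schedule_cleanup(tasks, tmp_path)

# FastAPI runs scheduled tasks after the response has been sent:
for task in tasks:
    task()
```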

## Performance Notes

- **Detector Caching**: Models are loaded once and cached via `@lru_cache`
- **Default Resolution**: Videos processed at original resolution
- **Frame Limit**: Use `max_frames` parameter in `run_inference()` for testing
- **Memory Usage**: Entire video is loaded into memory (frames list)

## Troubleshooting

### "No module named 'fastapi'"
Install dependencies: `pip install -r requirements.txt`

### "Video decoding failed"
Check video codec compatibility. System expects MP4/H.264.

### "Detector not found"
Verify detector key exists in `model_loader._REGISTRY`

### Slow processing
- Try faster detector: YOLOv8 (`hf_yolov8`)
- Reduce video resolution before uploading
- Use `max_frames` parameter for testing

## Dependencies

Core packages:
- `fastapi` + `uvicorn`: Web server
- `torch` + `transformers`: Deep learning models
- `opencv-python-headless`: Video processing
- `ultralytics`: YOLOv8 implementation
- `huggingface-hub`: Model downloading
- `pillow`, `scipy`, `accelerate`, `timm`: Supporting libraries