# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
Simple video object detection system with three modes:
- **Object Detection**: Detect custom objects using text queries (fully functional)
- **Segmentation**: Mask overlays using SAM3
- **Drone Detection**: (Coming Soon) Specialized UAV detection
## Core Architecture
### Simple Detection Flow
```
User β†’ demo.html β†’ POST /detect β†’ inference.py β†’ detector β†’ processed video
```
1. User selects mode and uploads video via web interface
2. Frontend sends video + mode + queries to `/detect` endpoint
3. Backend runs detection inference with selected model
4. Returns processed video with bounding boxes
### Available Detectors
The system includes 4 pre-trained object detection models:
| Detector | Key | Type | Best For |
|----------|-----|------|----------|
| **OWLv2** | `owlv2_base` | Open-vocabulary | Custom text queries (default) |
| **YOLOv8** | `hf_yolov8` | COCO classes | Fast real-time detection |
| **DETR** | `detr_resnet50` | COCO classes | Transformer-based detection |
| **Grounding DINO** | `grounding_dino` | Open-vocabulary | Text-grounded detection |
All detectors implement the `ObjectDetector` interface in `models/detectors/base.py` with a single `predict()` method.
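A minimal sketch of that interface, assuming only what this document states (an `ObjectDetector` base class with a single `predict()` method returning a `DetectionResult`); the dummy subclass and all field details beyond the documented ones are illustrative, and the real definitions live in `models/detectors/base.py`:

```python
from abc import ABC, abstractmethod
from typing import NamedTuple, Optional, Sequence

import numpy as np


class DetectionResult(NamedTuple):
    boxes: np.ndarray                     # Nx4 array [x1, y1, x2, y2]
    scores: Sequence[float]               # confidence per box
    labels: Sequence[int]                 # class indices
    label_names: Optional[Sequence[str]]  # human-readable names


class ObjectDetector(ABC):
    name: str = "base"

    @abstractmethod
    def predict(self, frame: np.ndarray, queries: Sequence[str]) -> DetectionResult:
        """Run detection on a single frame for the given text queries."""


class DummyDetector(ObjectDetector):
    """Illustrative implementation: one full-frame box per query."""
    name = "dummy"

    def predict(self, frame, queries):
        h, w = frame.shape[:2]
        boxes = np.array([[0, 0, w, h]] * len(queries), dtype=float)
        return DetectionResult(boxes, [0.5] * len(queries),
                               list(range(len(queries))), list(queries))


result = DummyDetector().predict(np.zeros((480, 640, 3), dtype=np.uint8),
                                 ["person", "car"])
```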
## Development Commands
### Setup
```bash
python -m venv .venv
source .venv/bin/activate  # on Windows: .venv\Scripts\activate
pip install -r requirements.txt
```
### Running the Server
```bash
# Development
uvicorn app:app --host 0.0.0.0 --port 7860 --reload
# Production (Docker)
docker build -t object_detectors .
docker run -p 7860:7860 object_detectors
```
### Testing the API
```bash
# Test object detection
curl -X POST http://localhost:7860/detect \
-F "video=@sample.mp4" \
-F "mode=object_detection" \
-F "queries=person,car,dog" \
-F "detector=owlv2_base" \
--output processed.mp4
# Test the drone detection placeholder (returns JSON)
curl -X POST http://localhost:7860/detect \
-F "video=@sample.mp4" \
-F "mode=drone_detection"
```
## Key Implementation Details
### API Endpoint: `/detect`
**Parameters:**
- `video` (file): Video file to process
- `mode` (string): Detection mode - `object_detection`, `segmentation`, or `drone_detection`
- `queries` (string): Comma-separated object classes (for object_detection mode)
- `detector` (string): Model key (default: `owlv2_base`)
**Returns:**
- For `object_detection`: MP4 video with bounding boxes
- For `segmentation`: MP4 video with mask overlays
- For `drone_detection`: JSON with `{"status": "coming_soon", "message": "..."}`
### Inference Pipeline
The `run_inference()` function in `inference.py` follows these steps:
1. **Extract Frames**: Decode video using OpenCV
2. **Parse Queries**: Split comma-separated text into list (defaults to common objects if empty)
3. **Select Detector**: Load detector by key (cached via `@lru_cache`)
4. **Process Frames**: Run detection on each frame
- Call `detector.predict(frame, queries)`
- Draw green bounding boxes on detections
5. **Write Video**: Encode processed frames back to MP4
Default queries (if none provided): `["person", "car", "truck", "motorcycle", "bicycle", "bus", "train", "airplane"]`
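The steps above can be sketched as follows. This is a hedged stand-in, not the real `inference.py`: the detector is a stub, frame extraction and MP4 encoding are omitted, and box drawing uses plain NumPy in place of `cv2.rectangle`:

```python
import numpy as np

DEFAULT_QUERIES = ["person", "car", "truck", "motorcycle",
                   "bicycle", "bus", "train", "airplane"]


def parse_queries(raw: str) -> list[str]:
    # Step 2: split comma-separated text; fall back to defaults if empty
    parsed = [q.strip() for q in raw.split(",") if q.strip()]
    return parsed or list(DEFAULT_QUERIES)


def draw_box(frame: np.ndarray, box, color=(0, 255, 0)) -> None:
    # Step 4b: draw a 2px green rectangle in place (stand-in for cv2.rectangle)
    x1, y1, x2, y2 = (int(v) for v in box)
    frame[y1:y1 + 2, x1:x2] = color
    frame[y2 - 2:y2, x1:x2] = color
    frame[y1:y2, x1:x1 + 2] = color
    frame[y1:y2, x2 - 2:x2] = color


def run_inference(frames, detector, raw_queries="", max_frames=None):
    queries = parse_queries(raw_queries)
    processed = []
    for frame in frames[:max_frames]:
        boxes = detector(frame, queries)  # Step 4a: detector.predict(...)
        out = frame.copy()
        for box in boxes:
            draw_box(out, box)
        processed.append(out)
    return processed  # Step 5 would encode these frames back to MP4


# Stub detector returning one fixed box regardless of input
stub = lambda frame, queries: [(10, 10, 100, 100)]
frames = [np.zeros((240, 320, 3), dtype=np.uint8) for _ in range(3)]
out = run_inference(frames, stub, "person, car", max_frames=2)
```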
### Detector Loading
Detectors are registered in `models/model_loader.py`:
```python
_REGISTRY: Dict[str, Callable[[], ObjectDetector]] = {
"owlv2_base": Owlv2Detector,
"hf_yolov8": HuggingFaceYoloV8Detector,
"detr_resnet50": DetrDetector,
"grounding_dino": GroundingDinoDetector,
}
```
Loaded via `load_detector(name)`, which caches detector instances so each model is loaded only once.
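A runnable sketch of this registry-plus-cache pattern, assuming only the names stated here (`_REGISTRY`, `load_detector`); the fake detector class and the error message wording are illustrative:

```python
from functools import lru_cache
from typing import Callable, Dict


class FakeDetector:
    """Stand-in for a real detector; expensive setup happens in __init__."""
    def __init__(self):
        self.ready = True  # in the real code: load model weights here


_REGISTRY: Dict[str, Callable[[], FakeDetector]] = {
    "fake": FakeDetector,
}


@lru_cache(maxsize=None)
def load_detector(name: str) -> FakeDetector:
    # Cached: a second call with the same key returns the same instance
    try:
        factory = _REGISTRY[name]
    except KeyError:
        raise ValueError(f"Unknown detector: {name!r}. "
                         f"Available: {sorted(_REGISTRY)}") from None
    return factory()


a = load_detector("fake")
b = load_detector("fake")
```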
### Detection Result Format
All detectors return a `DetectionResult` namedtuple:
```python
DetectionResult(
boxes: np.ndarray, # Nx4 array [x1, y1, x2, y2]
scores: Sequence[float], # Confidence scores
labels: Sequence[int], # Class indices
label_names: Optional[Sequence[str]] # Human-readable names
)
```
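A hedged example of consuming a `DetectionResult`: filtering detections by confidence before drawing. The namedtuple fields follow the document; the threshold helper and its default value are illustrative, not part of the codebase:

```python
from collections import namedtuple

import numpy as np

DetectionResult = namedtuple("DetectionResult",
                             ["boxes", "scores", "labels", "label_names"])


def filter_by_score(result: DetectionResult, min_score: float = 0.3) -> DetectionResult:
    # Keep only detections at or above the confidence threshold
    keep = [i for i, s in enumerate(result.scores) if s >= min_score]
    return DetectionResult(
        result.boxes[keep],
        [result.scores[i] for i in keep],
        [result.labels[i] for i in keep],
        [result.label_names[i] for i in keep] if result.label_names else None,
    )


result = DetectionResult(
    boxes=np.array([[0, 0, 10, 10], [5, 5, 20, 20]], dtype=float),
    scores=[0.9, 0.1],
    labels=[0, 1],
    label_names=["person", "car"],
)
kept = filter_by_score(result, min_score=0.3)
```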
## File Structure
```
.
β”œβ”€β”€ app.py # FastAPI server with /detect endpoint
β”œβ”€β”€ inference.py # Video processing and detection pipeline
β”œβ”€β”€ demo.html # Web interface with mode selector
β”œβ”€β”€ requirements.txt # Python dependencies
β”œβ”€β”€ models/
β”‚ β”œβ”€β”€ model_loader.py # Detector registry and loading
β”‚ └── detectors/
β”‚ β”œβ”€β”€ base.py # ObjectDetector interface
β”‚ β”œβ”€β”€ owlv2.py # OWLv2 implementation
β”‚ β”œβ”€β”€ yolov8.py # YOLOv8 implementation
β”‚ β”œβ”€β”€ detr.py # DETR implementation
β”‚ └── grounding_dino.py # Grounding DINO implementation
β”œβ”€β”€ utils/
β”‚ └── video.py # Video encoding/decoding utilities
└── coco_classes.py # COCO dataset class definitions
```
## Adding New Detectors
To add a new detector:
1. **Create detector class** in `models/detectors/`:
```python
from .base import ObjectDetector, DetectionResult
class MyDetector(ObjectDetector):
name = "my_detector"
def predict(self, frame, queries):
# Your detection logic
return DetectionResult(boxes, scores, labels, label_names)
```
2. **Register in model_loader.py**:
```python
_REGISTRY = {
...
"my_detector": MyDetector,
}
```
3. **Update frontend** `demo.html` detector dropdown:
```html
<option value="my_detector">My Detector</option>
```
## Adding New Detection Modes
To implement additional modes such as drone detection:
1. **Create specialized detector** (if needed):
- For segmentation: Extend `SegmentationResult` to include masks
- For drone detection: Create `DroneDetector` with specialized filtering
2. **Update `/detect` endpoint** in `app.py`:
```python
if mode == "segmentation":
# Run segmentation inference
# Return video with masks rendered
```
3. **Update frontend** to remove "disabled" class from mode card
4. **Update inference.py** if needed to handle new output types
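The mode dispatch in step 2 can be sketched as a plain function so it runs without FastAPI. Mode names come from this document; the returned dicts are illustrative stand-ins for the real video/JSON responses in `app.py`:

```python
def dispatch(mode: str) -> dict:
    # Illustrative: the real handler returns a video file or JSONResponse
    if mode == "object_detection":
        return {"kind": "video", "renderer": "boxes"}
    if mode == "segmentation":
        return {"kind": "video", "renderer": "masks"}
    if mode == "drone_detection":
        return {"status": "coming_soon",
                "message": "Drone detection is not implemented yet"}
    raise ValueError(f"Unknown mode: {mode!r}")
```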
## Common Patterns
### Query Processing
Queries are parsed from comma-separated strings:
```python
queries = [q.strip() for q in "person, car, dog".split(",") if q.strip()]
# Result: ["person", "car", "dog"]
```
### Frame Processing Loop
Standard pattern for processing video frames:
```python
processed_frames = []
for idx, frame in enumerate(frames):
processed_frame, detections = infer_frame(frame, queries, detector_name)
processed_frames.append(processed_frame)
```
### Temporary File Management
FastAPI's `BackgroundTasks` cleans up temp files after response:
```python
_schedule_cleanup(background_tasks, input_path)
_schedule_cleanup(background_tasks, output_path)
```
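The pattern can be illustrated without a running server. This sketch replaces FastAPI's `BackgroundTasks` with a plain task list (the real `_schedule_cleanup` would call `background_tasks.add_task(...)` instead); the helper body here is an assumption, not the actual implementation:

```python
import os
import tempfile


def _schedule_cleanup(tasks: list, path: str) -> None:
    # Queue a delete-after-response callback; in the real app this is
    # background_tasks.add_task(os.remove, path)
    tasks.append(lambda: os.path.exists(path) and os.remove(path))


tasks = []
fd, path = tempfile.mkstemp(suffix=".mp4")  # stand-in for the uploaded video
os.close(fd)
_schedule_cleanup(tasks, path)

# ...response is sent, then the framework runs the queued tasks:
for task in tasks:
    task()
```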
## Performance Notes
- **Detector Caching**: Models are loaded once and cached via `@lru_cache`
- **Default Resolution**: Videos processed at original resolution
- **Frame Limit**: Use `max_frames` parameter in `run_inference()` for testing
- **Memory Usage**: Entire video is loaded into memory (frames list)
## Troubleshooting
### "No module named 'fastapi'"
Install dependencies: `pip install -r requirements.txt`
### "Video decoding failed"
Check video codec compatibility. System expects MP4/H.264.
### "Detector not found"
Verify detector key exists in `model_loader._REGISTRY`
### Slow processing
- Try faster detector: YOLOv8 (`hf_yolov8`)
- Reduce video resolution before uploading
- Use `max_frames` parameter for testing
## Dependencies
Core packages:
- `fastapi` + `uvicorn`: Web server
- `torch` + `transformers`: Deep learning models
- `opencv-python-headless`: Video processing
- `ultralytics`: YOLOv8 implementation
- `huggingface-hub`: Model downloading
- `pillow`, `scipy`, `accelerate`, `timm`: Supporting libraries