
CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

Simple video object detection system with three modes:

  • Object Detection: Detect custom objects using text queries (fully functional)
  • Segmentation: Mask overlays using SAM3
  • Drone Detection: (Coming Soon) Specialized UAV detection

Core Architecture

Simple Detection Flow

User → demo.html → POST /detect → inference.py → detector → processed video
  1. User selects mode and uploads video via web interface
  2. Frontend sends video + mode + queries to /detect endpoint
  3. Backend runs detection inference with selected model
  4. Returns processed video with bounding boxes

Available Detectors

The system includes 4 pre-trained object detection models:

| Detector       | Key             | Type            | Best For                     |
|----------------|-----------------|-----------------|------------------------------|
| OWLv2          | owlv2_base      | Open-vocabulary | Custom text queries (default)|
| YOLOv8         | hf_yolov8       | COCO classes    | Fast real-time detection     |
| DETR           | detr_resnet50   | COCO classes    | Transformer-based detection  |
| Grounding DINO | grounding_dino  | Open-vocabulary | Text-grounded detection      |

All detectors implement the ObjectDetector interface in models/detectors/base.py with a single predict() method.
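For reference, the interface can be pictured as a small abstract base class. This is an illustrative sketch consistent with the fields documented later in this file, not the literal contents of base.py:

```python
from abc import ABC, abstractmethod
from typing import NamedTuple, Optional, Sequence

import numpy as np


class DetectionResult(NamedTuple):
    boxes: np.ndarray                      # Nx4 array [x1, y1, x2, y2]
    scores: Sequence[float]                # confidence per detection
    labels: Sequence[int]                  # class indices
    label_names: Optional[Sequence[str]]   # human-readable names, if available


class ObjectDetector(ABC):
    name: str = "base"

    @abstractmethod
    def predict(self, frame: np.ndarray, queries: Sequence[str]) -> DetectionResult:
        """Detect the objects named in `queries` within one video frame."""
```

Subclasses only need to supply `predict()`; everything else (caching, drawing, video I/O) lives outside the detector.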

Development Commands

Setup

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements.txt

Running the Server

# Development
uvicorn app:app --host 0.0.0.0 --port 7860 --reload

# Production (Docker)
docker build -t object_detectors .
docker run -p 7860:7860 object_detectors

Testing the API

# Test object detection
curl -X POST http://localhost:7860/detect \
  -F "video=@sample.mp4" \
  -F "mode=object_detection" \
  -F "queries=person,car,dog" \
  -F "detector=owlv2_base" \
  --output processed.mp4

# Test the placeholder mode (returns JSON)
curl -X POST http://localhost:7860/detect \
  -F "video=@sample.mp4" \
  -F "mode=drone_detection"

Key Implementation Details

API Endpoint: /detect

Parameters:

  • video (file): Video file to process
  • mode (string): Detection mode - object_detection, segmentation, or drone_detection
  • queries (string): Comma-separated object classes (for object_detection mode)
  • detector (string): Model key (default: owlv2_base)

Returns:

  • For object_detection: MP4 video with bounding boxes
  • For segmentation: MP4 video with mask overlays
  • For drone_detection: JSON with {"status": "coming_soon", "message": "..."}

Inference Pipeline

The run_inference() function in inference.py follows these steps:

  1. Extract Frames: Decode video using OpenCV
  2. Parse Queries: Split comma-separated text into list (defaults to common objects if empty)
  3. Select Detector: Load detector by key (cached via @lru_cache)
  4. Process Frames: Run detection on each frame
    • Call detector.predict(frame, queries)
    • Draw green bounding boxes on detections
  5. Write Video: Encode processed frames back to MP4

Default queries (if none provided): ["person", "car", "truck", "motorcycle", "bicycle", "bus", "train", "airplane"]
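Steps 2 and 4 can be sketched in isolation. This is a self-contained illustration: the real pipeline presumably draws with OpenCV, while this version uses plain NumPy slicing so it runs without cv2; the helper names are illustrative:

```python
import numpy as np

DEFAULT_QUERIES = ["person", "car", "truck", "motorcycle",
                   "bicycle", "bus", "train", "airplane"]


def parse_queries(raw: str) -> list:
    """Step 2: split comma-separated text, falling back to common objects."""
    parts = [q.strip() for q in raw.split(",") if q.strip()]
    return parts or DEFAULT_QUERIES


def draw_green_boxes(frame: np.ndarray, boxes: np.ndarray,
                     thickness: int = 2) -> np.ndarray:
    """Step 4: paint green box edges onto a copy of the frame (BGR layout)."""
    out = frame.copy()
    green = np.array([0, 255, 0], dtype=frame.dtype)
    for x1, y1, x2, y2 in np.asarray(boxes, dtype=int):
        out[y1:y1 + thickness, x1:x2] = green  # top edge
        out[y2 - thickness:y2, x1:x2] = green  # bottom edge
        out[y1:y2, x1:x1 + thickness] = green  # left edge
        out[y1:y2, x2 - thickness:x2] = green  # right edge
    return out
```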

Detector Loading

Detectors are registered in models/model_loader.py:

_REGISTRY: Dict[str, Callable[[], ObjectDetector]] = {
    "owlv2_base": Owlv2Detector,
    "hf_yolov8": HuggingFaceYoloV8Detector,
    "detr_resnet50": DetrDetector,
    "grounding_dino": GroundingDinoDetector,
}

Loaded via load_detector(name) which caches instances for performance.
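The caching behavior can be demonstrated with functools.lru_cache and a stand-in registry (the dummy class below replaces the real detector classes purely so the sketch is runnable):

```python
from functools import lru_cache
from typing import Callable, Dict


class _DummyDetector:
    """Stand-in for a real detector class such as Owlv2Detector."""


_REGISTRY: Dict[str, Callable[[], object]] = {
    "owlv2_base": _DummyDetector,
}


@lru_cache(maxsize=None)
def load_detector(name: str) -> object:
    """Instantiate a detector once; later calls return the cached instance."""
    try:
        factory = _REGISTRY[name]
    except KeyError:
        raise ValueError(f"Unknown detector {name!r}; available: {sorted(_REGISTRY)}")
    return factory()
```

Because the cache key is the detector name, model weights are loaded at most once per process.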

Detection Result Format

All detectors return a DetectionResult namedtuple:

DetectionResult(
    boxes: np.ndarray,        # Nx4 array [x1, y1, x2, y2]
    scores: Sequence[float],  # Confidence scores
    labels: Sequence[int],    # Class indices
    label_names: Optional[Sequence[str]]  # Human-readable names
)
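A typical consumer filters the result by confidence before drawing. Here is a minimal sketch; the 0.3 threshold and the helper name are illustrative, not part of the codebase:

```python
from typing import NamedTuple, Optional, Sequence

import numpy as np


class DetectionResult(NamedTuple):
    boxes: np.ndarray
    scores: Sequence[float]
    labels: Sequence[int]
    label_names: Optional[Sequence[str]]


def filter_by_score(result: DetectionResult, threshold: float = 0.3) -> DetectionResult:
    """Keep only detections whose confidence meets the threshold."""
    keep = [i for i, s in enumerate(result.scores) if s >= threshold]
    names = ([result.label_names[i] for i in keep]
             if result.label_names is not None else None)
    return DetectionResult(
        boxes=result.boxes[keep],
        scores=[result.scores[i] for i in keep],
        labels=[result.labels[i] for i in keep],
        label_names=names,
    )
```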

File Structure

.
├── app.py                    # FastAPI server with /detect endpoint
├── inference.py              # Video processing and detection pipeline
├── demo.html                 # Web interface with mode selector
├── requirements.txt          # Python dependencies
├── models/
│   ├── model_loader.py      # Detector registry and loading
│   └── detectors/
│       ├── base.py          # ObjectDetector interface
│       ├── owlv2.py         # OWLv2 implementation
│       ├── yolov8.py        # YOLOv8 implementation
│       ├── detr.py          # DETR implementation
│       └── grounding_dino.py # Grounding DINO implementation
├── utils/
│   └── video.py             # Video encoding/decoding utilities
└── coco_classes.py          # COCO dataset class definitions

Adding New Detectors

To add a new detector:

  1. Create detector class in models/detectors/:

    from .base import ObjectDetector, DetectionResult
    
    class MyDetector(ObjectDetector):
        name = "my_detector"
    
        def predict(self, frame, queries):
            # Your detection logic
            return DetectionResult(boxes, scores, labels, label_names)
    
  2. Register in model_loader.py:

    _REGISTRY = {
        ...
        "my_detector": MyDetector,
    }
    
  3. Update frontend demo.html detector dropdown:

    <option value="my_detector">My Detector</option>
    

Adding New Detection Modes

To implement additional modes such as drone detection:

  1. Create specialized detector (if needed):

    • For segmentation: extend the detection result type (e.g. a SegmentationResult) to carry per-object masks
    • For drone detection: Create DroneDetector with specialized filtering
  2. Update /detect endpoint in app.py:

    if mode == "segmentation":
        # Run segmentation inference
        # Return video with masks rendered
    
  3. Update frontend to remove "disabled" class from mode card

  4. Update inference.py if needed to handle new output types

Common Patterns

Query Processing

Queries are parsed from comma-separated strings:

queries = [q.strip() for q in "person, car, dog".split(",") if q.strip()]
# Result: ["person", "car", "dog"]

Frame Processing Loop

Standard pattern for processing video frames:

processed_frames = []
for idx, frame in enumerate(frames):
    processed_frame, detections = infer_frame(frame, queries, detector_name)
    processed_frames.append(processed_frame)

Temporary File Management

FastAPI's BackgroundTasks cleans up temp files after response:

_schedule_cleanup(background_tasks, input_path)
_schedule_cleanup(background_tasks, output_path)

Performance Notes

  • Detector Caching: Models are loaded once and cached via @lru_cache
  • Default Resolution: Videos processed at original resolution
  • Frame Limit: Use max_frames parameter in run_inference() for testing
  • Memory Usage: Entire video is loaded into memory (frames list)
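One way to address the memory note would be to process frames lazily instead of accumulating a list. The sketch below uses a generic frame iterator and a max_frames cap; the names are illustrative, not the current implementation:

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, Optional


def process_stream(
    frames: Iterable,                  # any frame source, e.g. a decoder generator
    infer: Callable,                   # per-frame inference returning a processed frame
    max_frames: Optional[int] = None,  # cap for quick testing runs
) -> Iterator:
    """Yield processed frames one at a time instead of building a full list."""
    for frame in islice(frames, max_frames):
        yield infer(frame)
```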

Troubleshooting

"No module named 'fastapi'"

Install dependencies: pip install -r requirements.txt

"Video decoding failed"

Check video codec compatibility. System expects MP4/H.264.

"Detector not found"

Verify detector key exists in model_loader._REGISTRY

Slow processing

  • Try faster detector: YOLOv8 (hf_yolov8)
  • Reduce video resolution before uploading
  • Use max_frames parameter for testing

Dependencies

Core packages:

  • fastapi + uvicorn: Web server
  • torch + transformers: Deep learning models
  • opencv-python-headless: Video processing
  • ultralytics: YOLOv8 implementation
  • huggingface-hub: Model downloading
  • pillow, scipy, accelerate, timm: Supporting libraries