
CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

Simple video object detection system with three modes:

  • Object Detection: Detect custom objects using text queries (fully functional)
  • Segmentation: Mask overlays using SAM3
  • Drone Detection: (Coming Soon) Specialized UAV detection

Core Architecture

Simple Detection Flow

User → demo.html → POST /detect → inference.py → detector → processed video
  1. User selects mode and uploads video via web interface
  2. Frontend sends video + mode + queries to /detect endpoint
  3. Backend runs detection inference with selected model
  4. Returns processed video with bounding boxes

Available Detectors

The system includes 4 pre-trained object detection models:

| Detector       | Key             | Type            | Best For                     |
|----------------|-----------------|-----------------|------------------------------|
| OWLv2          | owlv2_base      | Open-vocabulary | Custom text queries (default)|
| YOLOv8         | hf_yolov8       | COCO classes    | Fast real-time detection     |
| DETR           | detr_resnet50   | COCO classes    | Transformer-based detection  |
| Grounding DINO | grounding_dino  | Open-vocabulary | Text-grounded detection      |

All detectors implement the ObjectDetector interface in models/detectors/base.py with a single predict() method.
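For reference, the interface can be pictured as a small abstract base class. This is an illustrative sketch consistent with the fields documented later in this file, not the literal contents of base.py:

```python
from abc import ABC, abstractmethod
from typing import NamedTuple, Optional, Sequence

import numpy as np


class DetectionResult(NamedTuple):
    boxes: np.ndarray                      # Nx4 array [x1, y1, x2, y2]
    scores: Sequence[float]                # confidence per detection
    labels: Sequence[int]                  # class indices
    label_names: Optional[Sequence[str]]   # human-readable names, if available


class ObjectDetector(ABC):
    name: str = "base"

    @abstractmethod
    def predict(self, frame: np.ndarray, queries: Sequence[str]) -> DetectionResult:
        """Detect the objects named in `queries` within one video frame."""
```

Subclasses only need to supply `predict()`; everything else (caching, drawing, video I/O) lives outside the detector.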

Development Commands

Setup

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements.txt

Running the Server

# Development
uvicorn app:app --host 0.0.0.0 --port 7860 --reload

# Production (Docker)
docker build -t object_detectors .
docker run -p 7860:7860 object_detectors

Testing the API

# Test object detection
curl -X POST http://localhost:7860/detect \
  -F "video=@sample.mp4" \
  -F "mode=object_detection" \
  -F "queries=person,car,dog" \
  -F "detector=owlv2_base" \
  --output processed.mp4

# Test the placeholder mode (returns JSON)
curl -X POST http://localhost:7860/detect \
  -F "video=@sample.mp4" \
  -F "mode=drone_detection"

Key Implementation Details

API Endpoint: /detect

Parameters:

  • video (file): Video file to process
  • mode (string): Detection mode - object_detection, segmentation, or drone_detection
  • queries (string): Comma-separated object classes (for object_detection mode)
  • detector (string): Model key (default: owlv2_base)

Returns:

  • For object_detection: MP4 video with bounding boxes
  • For segmentation: MP4 video with mask overlays
  • For drone_detection: JSON with {"status": "coming_soon", "message": "..."}

Inference Pipeline

The run_inference() function in inference.py follows these steps:

  1. Extract Frames: Decode video using OpenCV
  2. Parse Queries: Split comma-separated text into list (defaults to common objects if empty)
  3. Select Detector: Load detector by key (cached via @lru_cache)
  4. Process Frames: Run detection on each frame
    • Call detector.predict(frame, queries)
    • Draw green bounding boxes on detections
  5. Write Video: Encode processed frames back to MP4

Default queries (if none provided): ["person", "car", "truck", "motorcycle", "bicycle", "bus", "train", "airplane"]
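Steps 2 and 4 can be sketched in isolation. This is a self-contained illustration: the real pipeline presumably draws with OpenCV, while this version uses plain NumPy slicing so it runs without cv2; the helper names are illustrative:

```python
import numpy as np

DEFAULT_QUERIES = ["person", "car", "truck", "motorcycle",
                   "bicycle", "bus", "train", "airplane"]


def parse_queries(raw: str) -> list:
    """Step 2: split comma-separated text, falling back to common objects."""
    parts = [q.strip() for q in raw.split(",") if q.strip()]
    return parts or DEFAULT_QUERIES


def draw_green_boxes(frame: np.ndarray, boxes: np.ndarray,
                     thickness: int = 2) -> np.ndarray:
    """Step 4: paint green box edges onto a copy of the frame (BGR layout)."""
    out = frame.copy()
    green = np.array([0, 255, 0], dtype=frame.dtype)
    for x1, y1, x2, y2 in np.asarray(boxes, dtype=int):
        out[y1:y1 + thickness, x1:x2] = green  # top edge
        out[y2 - thickness:y2, x1:x2] = green  # bottom edge
        out[y1:y2, x1:x1 + thickness] = green  # left edge
        out[y1:y2, x2 - thickness:x2] = green  # right edge
    return out
```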

Detector Loading

Detectors are registered in models/model_loader.py:

_REGISTRY: Dict[str, Callable[[], ObjectDetector]] = {
    "owlv2_base": Owlv2Detector,
    "hf_yolov8": HuggingFaceYoloV8Detector,
    "detr_resnet50": DetrDetector,
    "grounding_dino": GroundingDinoDetector,
}

Loaded via load_detector(name) which caches instances for performance.
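The caching behavior can be demonstrated with functools.lru_cache and a stand-in registry (the dummy class below replaces the real detector classes purely so the sketch is runnable):

```python
from functools import lru_cache
from typing import Callable, Dict


class _DummyDetector:
    """Stand-in for a real detector class such as Owlv2Detector."""


_REGISTRY: Dict[str, Callable[[], object]] = {
    "owlv2_base": _DummyDetector,
}


@lru_cache(maxsize=None)
def load_detector(name: str) -> object:
    """Instantiate a detector once; later calls return the cached instance."""
    try:
        factory = _REGISTRY[name]
    except KeyError:
        raise ValueError(f"Unknown detector {name!r}; available: {sorted(_REGISTRY)}")
    return factory()
```

Because the cache key is the detector name, model weights are loaded at most once per process.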

Detection Result Format

All detectors return a DetectionResult namedtuple:

DetectionResult(
    boxes: np.ndarray,        # Nx4 array [x1, y1, x2, y2]
    scores: Sequence[float],  # Confidence scores
    labels: Sequence[int],    # Class indices
    label_names: Optional[Sequence[str]]  # Human-readable names
)
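A typical consumer filters the result by confidence before drawing. Here is a minimal sketch; the 0.3 threshold and the helper name are illustrative, not part of the codebase:

```python
from typing import NamedTuple, Optional, Sequence

import numpy as np


class DetectionResult(NamedTuple):
    boxes: np.ndarray
    scores: Sequence[float]
    labels: Sequence[int]
    label_names: Optional[Sequence[str]]


def filter_by_score(result: DetectionResult, threshold: float = 0.3) -> DetectionResult:
    """Keep only detections whose confidence meets the threshold."""
    keep = [i for i, s in enumerate(result.scores) if s >= threshold]
    names = ([result.label_names[i] for i in keep]
             if result.label_names is not None else None)
    return DetectionResult(
        boxes=result.boxes[keep],
        scores=[result.scores[i] for i in keep],
        labels=[result.labels[i] for i in keep],
        label_names=names,
    )
```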

File Structure

.
├── app.py                    # FastAPI server with /detect endpoint
├── inference.py              # Video processing and detection pipeline
├── demo.html                 # Web interface with mode selector
├── requirements.txt          # Python dependencies
├── models/
│   ├── model_loader.py      # Detector registry and loading
│   └── detectors/
│       ├── base.py          # ObjectDetector interface
│       ├── owlv2.py         # OWLv2 implementation
│       ├── yolov8.py        # YOLOv8 implementation
│       ├── detr.py          # DETR implementation
│       └── grounding_dino.py # Grounding DINO implementation
├── utils/
│   └── video.py             # Video encoding/decoding utilities
└── coco_classes.py          # COCO dataset class definitions

Adding New Detectors

To add a new detector:

  1. Create detector class in models/detectors/:

    from .base import ObjectDetector, DetectionResult
    
    class MyDetector(ObjectDetector):
        name = "my_detector"
    
        def predict(self, frame, queries):
            # Your detection logic
            return DetectionResult(boxes, scores, labels, label_names)
    
  2. Register in model_loader.py:

    _REGISTRY = {
        ...
        "my_detector": MyDetector,
    }
    
  3. Update frontend demo.html detector dropdown:

    <option value="my_detector">My Detector</option>
    

Adding New Detection Modes

To implement additional modes such as drone detection:

  1. Create specialized detector (if needed):

    • For segmentation: extend the detection result type (e.g. a SegmentationResult) to carry per-object masks
    • For drone detection: Create DroneDetector with specialized filtering
  2. Update /detect endpoint in app.py:

    if mode == "segmentation":
        # Run segmentation inference
        # Return video with masks rendered
    
  3. Update frontend to remove "disabled" class from mode card

  4. Update inference.py if needed to handle new output types

Common Patterns

Query Processing

Queries are parsed from comma-separated strings:

queries = [q.strip() for q in "person, car, dog".split(",") if q.strip()]
# Result: ["person", "car", "dog"]

Frame Processing Loop

Standard pattern for processing video frames:

processed_frames = []
for idx, frame in enumerate(frames):
    processed_frame, detections = infer_frame(frame, queries, detector_name)
    processed_frames.append(processed_frame)

Temporary File Management

FastAPI's BackgroundTasks cleans up temp files after response:

_schedule_cleanup(background_tasks, input_path)
_schedule_cleanup(background_tasks, output_path)

Performance Notes

  • Detector Caching: Models are loaded once and cached via @lru_cache
  • Default Resolution: Videos processed at original resolution
  • Frame Limit: Use max_frames parameter in run_inference() for testing
  • Memory Usage: Entire video is loaded into memory (frames list)
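One way to address the memory note would be to process frames lazily instead of accumulating a list. The sketch below uses a generic frame iterator and a max_frames cap; the names are illustrative, not the current implementation:

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, Optional


def process_stream(
    frames: Iterable,                  # any frame source, e.g. a decoder generator
    infer: Callable,                   # per-frame inference returning a processed frame
    max_frames: Optional[int] = None,  # cap for quick testing runs
) -> Iterator:
    """Yield processed frames one at a time instead of building a full list."""
    for frame in islice(frames, max_frames):
        yield infer(frame)
```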

Troubleshooting

"No module named 'fastapi'"

Install dependencies: pip install -r requirements.txt

"Video decoding failed"

Check video codec compatibility. System expects MP4/H.264.

"Detector not found"

Verify detector key exists in model_loader._REGISTRY

Slow processing

  • Try faster detector: YOLOv8 (hf_yolov8)
  • Reduce video resolution before uploading
  • Use max_frames parameter for testing

Dependencies

Core packages:

  • fastapi + uvicorn: Web server
  • torch + transformers: Deep learning models
  • opencv-python-headless: Video processing
  • ultralytics: YOLOv8 implementation
  • huggingface-hub: Model downloading
  • pillow, scipy, accelerate, timm: Supporting libraries