Zhen Ye committed
Commit 94c85d4 · Parent: 469102e

using apple depth pro hf

Files changed (4)
  1. CLAUDE.md +254 -0
  2. demo.html +114 -10
  3. models/depth_estimators/depth_pro.py +34 -25
  4. requirements.txt +0 -1
CLAUDE.md ADDED
@@ -0,0 +1,254 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Simple video object detection system with three modes:
- **Object Detection**: Detect custom objects using text queries (fully functional)
- **Segmentation**: Mask overlays using SAM3
- **Drone Detection**: (Coming Soon) Specialized UAV detection

## Core Architecture

### Simple Detection Flow

```
User → demo.html → POST /detect → inference.py → detector → processed video
```

1. User selects a mode and uploads a video via the web interface
2. Frontend sends video + mode + queries to the `/detect` endpoint
3. Backend runs detection inference with the selected model
4. Returns the processed video with bounding boxes

### Available Detectors

The system includes 4 pre-trained object detection models:

| Detector | Key | Type | Best For |
|----------|-----|------|----------|
| **OWLv2** | `owlv2_base` | Open-vocabulary | Custom text queries (default) |
| **YOLOv8** | `hf_yolov8` | COCO classes | Fast real-time detection |
| **DETR** | `detr_resnet50` | COCO classes | Transformer-based detection |
| **Grounding DINO** | `grounding_dino` | Open-vocabulary | Text-grounded detection |

All detectors implement the `ObjectDetector` interface in `models/detectors/base.py` with a single `predict()` method.

## Development Commands

### Setup
```bash
python -m venv .venv
source .venv/bin/activate  # on Windows: .venv\Scripts\activate
pip install -r requirements.txt
```

### Running the Server
```bash
# Development
uvicorn app:app --host 0.0.0.0 --port 7860 --reload

# Production (Docker)
docker build -t object_detectors .
docker run -p 7860:7860 object_detectors
```

### Testing the API
```bash
# Test object detection
curl -X POST http://localhost:7860/detect \
  -F "video=@sample.mp4" \
  -F "mode=object_detection" \
  -F "queries=person,car,dog" \
  -F "detector=owlv2_base" \
  --output processed.mp4

# Test a placeholder mode (returns JSON)
curl -X POST http://localhost:7860/detect \
  -F "video=@sample.mp4" \
  -F "mode=drone_detection"
```

## Key Implementation Details

### API Endpoint: `/detect`

**Parameters:**
- `video` (file): Video file to process
- `mode` (string): Detection mode: `object_detection`, `segmentation`, or `drone_detection`
- `queries` (string): Comma-separated object classes (for object_detection mode)
- `detector` (string): Model key (default: `owlv2_base`)

**Returns:**
- For `object_detection`: MP4 video with bounding boxes
- For `segmentation`: MP4 video with mask overlays
- For `drone_detection`: JSON with `{"status": "coming_soon", "message": "..."}`

### Inference Pipeline

The `run_inference()` function in `inference.py` follows these steps:

1. **Extract Frames**: Decode the video using OpenCV
2. **Parse Queries**: Split comma-separated text into a list (defaults to common objects if empty)
3. **Select Detector**: Load the detector by key (cached via `@lru_cache`)
4. **Process Frames**: Run detection on each frame
   - Call `detector.predict(frame, queries)`
   - Draw green bounding boxes on detections
5. **Write Video**: Encode processed frames back to MP4

Default queries (if none provided): `["person", "car", "truck", "motorcycle", "bicycle", "bus", "train", "airplane"]`

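The steps above can be sketched end-to-end. This is a minimal illustration only: video decoding/encoding and box drawing are elided, and `detector` is any object with the `predict(frame, queries)` interface described above.

```python
# Sketch of the run_inference() flow; frame extraction (step 1) and video
# writing (step 5) are elided, and no boxes are drawn here.
DEFAULT_QUERIES = ["person", "car", "truck", "motorcycle",
                   "bicycle", "bus", "train", "airplane"]

def parse_queries(raw: str) -> list[str]:
    # Step 2: split comma-separated text, fall back to common objects
    queries = [q.strip() for q in raw.split(",") if q.strip()]
    return queries or DEFAULT_QUERIES

def run_inference(frames, raw_queries, detector, max_frames=None):
    queries = parse_queries(raw_queries)
    processed = []
    for idx, frame in enumerate(frames):
        if max_frames is not None and idx >= max_frames:
            break  # frame limit used for quick testing
        detections = detector.predict(frame, queries)  # step 4
        processed.append((frame, detections))
    return processed
```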
### Detector Loading

Detectors are registered in `models/model_loader.py`:

```python
_REGISTRY: Dict[str, Callable[[], ObjectDetector]] = {
    "owlv2_base": Owlv2Detector,
    "hf_yolov8": HuggingFaceYoloV8Detector,
    "detr_resnet50": DetrDetector,
    "grounding_dino": GroundingDinoDetector,
}
```

Loaded via `load_detector(name)`, which caches instances for performance.

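A plausible shape for `load_detector` is sketched below. The stand-in detector class and the `ValueError` handling are illustrative, not the actual implementation in `models/model_loader.py`.

```python
from functools import lru_cache

# Stand-in for a real detector class; the actual registry maps keys to
# the detector classes imported in models/model_loader.py.
class Owlv2Detector:
    name = "owlv2_base"

_REGISTRY = {"owlv2_base": Owlv2Detector}

@lru_cache(maxsize=None)
def load_detector(name: str):
    try:
        factory = _REGISTRY[name]
    except KeyError:
        raise ValueError(f"Unknown detector: {name!r}") from None
    return factory()  # instantiated once, then served from the cache
```

Because of `@lru_cache`, repeated calls with the same key return the same instance, so model weights load only once per process.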
### Detection Result Format

All detectors return a `DetectionResult` namedtuple:
```python
DetectionResult(
    boxes: np.ndarray,                    # Nx4 array [x1, y1, x2, y2]
    scores: Sequence[float],              # Confidence scores
    labels: Sequence[int],                # Class indices
    label_names: Optional[Sequence[str]]  # Human-readable names
)
```

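Consuming a `DetectionResult` looks like this. A sketch: plain lists stand in for the numpy arrays the real detectors return, and the confidence threshold is illustrative.

```python
from collections import namedtuple

DetectionResult = namedtuple(
    "DetectionResult", ["boxes", "scores", "labels", "label_names"]
)

result = DetectionResult(
    boxes=[[10, 20, 110, 220]],  # [x1, y1, x2, y2]
    scores=[0.91],
    labels=[0],
    label_names=["person"],
)

# Typical per-frame loop: filter by confidence, then draw/report each box
for box, score, name in zip(result.boxes, result.scores, result.label_names):
    if score >= 0.5:
        x1, y1, x2, y2 = box
        print(f"{name}: {score:.2f} at ({x1}, {y1})-({x2}, {y2})")
```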
## File Structure

```
.
├── app.py                     # FastAPI server with /detect endpoint
├── inference.py               # Video processing and detection pipeline
├── demo.html                  # Web interface with mode selector
├── requirements.txt           # Python dependencies
├── models/
│   ├── model_loader.py        # Detector registry and loading
│   └── detectors/
│       ├── base.py            # ObjectDetector interface
│       ├── owlv2.py           # OWLv2 implementation
│       ├── yolov8.py          # YOLOv8 implementation
│       ├── detr.py            # DETR implementation
│       └── grounding_dino.py  # Grounding DINO implementation
├── utils/
│   └── video.py               # Video encoding/decoding utilities
└── coco_classes.py            # COCO dataset class definitions
```

## Adding New Detectors

To add a new detector:

1. **Create detector class** in `models/detectors/`:
```python
from .base import ObjectDetector, DetectionResult

class MyDetector(ObjectDetector):
    name = "my_detector"

    def predict(self, frame, queries):
        # Your detection logic
        return DetectionResult(boxes, scores, labels, label_names)
```

2. **Register in model_loader.py**:
```python
_REGISTRY = {
    ...
    "my_detector": MyDetector,
}
```

3. **Update frontend** `demo.html` detector dropdown:
```html
<option value="my_detector">My Detector</option>
```

## Adding New Detection Modes

To implement additional modes such as drone detection:

1. **Create specialized detector** (if needed):
   - For segmentation: Extend `SegmentationResult` to include masks
   - For drone detection: Create `DroneDetector` with specialized filtering

2. **Update `/detect` endpoint** in `app.py`:
```python
if mode == "segmentation":
    # Run segmentation inference
    # Return video with masks rendered
```

3. **Update frontend** to remove the "disabled" class from the mode card

4. **Update inference.py** if needed to handle new output types

## Common Patterns

### Query Processing
Queries are parsed from comma-separated strings:
```python
queries = [q.strip() for q in "person, car, dog".split(",") if q.strip()]
# Result: ["person", "car", "dog"]
```

### Frame Processing Loop
Standard pattern for processing video frames:
```python
processed_frames = []
for idx, frame in enumerate(frames):
    processed_frame, detections = infer_frame(frame, queries, detector_name)
    processed_frames.append(processed_frame)
```

### Temporary File Management
FastAPI's `BackgroundTasks` cleans up temp files after the response:
```python
_schedule_cleanup(background_tasks, input_path)
_schedule_cleanup(background_tasks, output_path)
```

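A sketch of how `_schedule_cleanup` might work. The `BackgroundTasks` class below is a stand-in that only mimics FastAPI's `add_task` interface so the idea runs standalone; in the app, the real `fastapi.BackgroundTasks` executes the queued tasks after the response is sent.

```python
import os
import tempfile

class BackgroundTasks:
    """Stand-in mimicking fastapi.BackgroundTasks, for illustration only."""
    def __init__(self):
        self._tasks = []

    def add_task(self, func, *args):
        self._tasks.append((func, args))

    def run_all(self):  # FastAPI does this after sending the response
        for func, args in self._tasks:
            func(*args)

def _schedule_cleanup(background_tasks, path):
    # Queue the temp file for deletion once the response has gone out
    background_tasks.add_task(os.remove, path)

# Usage: the temp file survives until the "response" completes, then vanishes
fd, path = tempfile.mkstemp(suffix=".mp4")
os.close(fd)
tasks = BackgroundTasks()
_schedule_cleanup(tasks, path)
tasks.run_all()
```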
## Performance Notes

- **Detector Caching**: Models are loaded once and cached via `@lru_cache`
- **Default Resolution**: Videos are processed at their original resolution
- **Frame Limit**: Use the `max_frames` parameter in `run_inference()` for testing
- **Memory Usage**: The entire video is loaded into memory (frames list)

## Troubleshooting

### "No module named 'fastapi'"
Install dependencies: `pip install -r requirements.txt`

### "Video decoding failed"
Check video codec compatibility. The system expects MP4/H.264.

### "Detector not found"
Verify the detector key exists in `model_loader._REGISTRY`

### Slow processing
- Try a faster detector: YOLOv8 (`hf_yolov8`)
- Reduce video resolution before uploading
- Use the `max_frames` parameter for testing

## Dependencies

Core packages:
- `fastapi` + `uvicorn`: Web server
- `torch` + `transformers`: Deep learning models
- `opencv-python-headless`: Video processing
- `ultralytics`: YOLOv8 implementation
- `huggingface-hub`: Model downloading
- `pillow`, `scipy`, `accelerate`, `timm`: Supporting libraries
demo.html CHANGED
@@ -306,6 +306,31 @@
     100% { transform: rotate(360deg); }
 }
 
+/* View toggle buttons */
+.view-toggle-btn {
+    padding: 12px 28px;
+    margin: 0 10px;
+    background: #e5e7eb;
+    color: #374151;
+    border: 2px solid #d1d5db;
+    border-radius: 8px;
+    cursor: pointer;
+    font-weight: 600;
+    font-size: 14px;
+    transition: all 0.3s;
+}
+
+.view-toggle-btn.active {
+    background: #1f2933;
+    color: #f9fafb;
+    border-color: #1f2933;
+}
+
+.view-toggle-btn:hover:not(.active) {
+    background: #d1d5db;
+    transform: translateY(-1px);
+}
+
 .hidden {
     display: none;
 }
@@ -415,6 +440,13 @@
 <!-- Results -->
 <div class="section hidden" id="resultsSection">
     <div class="section-title">Results</div>
+
+    <!-- View Toggle Buttons -->
+    <div id="viewToggleContainer" class="hidden" style="text-align: center; margin-bottom: 20px;">
+        <button class="view-toggle-btn active" id="detectionViewBtn">Detection View</button>
+        <button class="view-toggle-btn" id="depthViewBtn">Depth View</button>
+    </div>
+
     <div class="results-grid">
         <div class="video-card">
             <div class="video-card-header">First Frame</div>
@@ -466,6 +498,11 @@
 // State
 let selectedMode = 'object_detection';
 let videoFile = null;
+let currentView = 'detection'; // 'detection' or 'depth'
+let detectionVideoUrl = null;
+let depthVideoUrl = null;
+let detectionFirstFrameUrl = null;
+let depthFirstFrameUrl = null;
 
 // Elements
 const modeCards = document.querySelectorAll('.mode-card');
@@ -490,8 +527,56 @@
 const depthVideo = document.getElementById('depthVideo');
 const depthDownloadBtn = document.getElementById('depthDownloadBtn');
 const depthVideoStatus = document.getElementById('depthVideoStatus');
+const viewToggleContainer = document.getElementById('viewToggleContainer');
+const detectionViewBtn = document.getElementById('detectionViewBtn');
+const depthViewBtn = document.getElementById('depthViewBtn');
 let statusPoller = null;
 const statusLine = document.getElementById('statusLine');
+
+// View switching function
+function switchToView(view) {
+    currentView = view;
+
+    if (view === 'detection') {
+        detectionViewBtn.classList.add('active');
+        depthViewBtn.classList.remove('active');
+
+        if (detectionFirstFrameUrl) {
+            firstFrameImage.src = detectionFirstFrameUrl;
+            depthFrameImage.classList.add('hidden');
+            depthFramePlaceholder.classList.remove('hidden');
+        }
+        if (detectionVideoUrl) {
+            processedVideo.src = detectionVideoUrl;
+            downloadBtn.href = detectionVideoUrl;
+            downloadBtn.download = 'processed_detection.mp4';
+            processedVideo.load();
+        }
+    } else {
+        depthViewBtn.classList.add('active');
+        detectionViewBtn.classList.remove('active');
+
+        if (depthFirstFrameUrl) {
+            firstFrameImage.src = depthFirstFrameUrl;
+            depthFrameImage.classList.add('hidden');
+            depthFramePlaceholder.classList.add('hidden');
+        }
+        if (depthVideoUrl) {
+            processedVideo.src = depthVideoUrl;
+            downloadBtn.href = depthVideoUrl;
+            downloadBtn.download = 'depth_map.mp4';
+            processedVideo.load();
+        }
+    }
+}
+
+// Toggle button event listeners
+if (detectionViewBtn) {
+    detectionViewBtn.addEventListener('click', () => switchToView('detection'));
+}
+if (depthViewBtn) {
+    depthViewBtn.addEventListener('click', () => switchToView('depth'));
+}
 // Mode selection handler
 modeCards.forEach(card => {
     card.addEventListener('click', (e) => {
@@ -571,6 +656,12 @@
 depthDownloadBtn.removeAttribute('href');
 depthDownloadBtn.classList.add('hidden');
 depthVideoStatus.textContent = '';
+viewToggleContainer.classList.add('hidden');
+currentView = 'detection';
+detectionVideoUrl = null;
+depthVideoUrl = null;
+detectionFirstFrameUrl = null;
+depthFirstFrameUrl = null;
 statusLine.classList.add('hidden');
 statusLine.textContent = '';
 
@@ -615,16 +706,22 @@
 clearInterval(statusPoller);
 statusPoller = null;
 statusLine.textContent = 'Status: completed';
+
+// Fetch detection video
 const videoResponse = await fetch(data.video_url);
 if (!videoResponse.ok) {
     alert('Failed to fetch processed video.');
     return;
 }
 const blob = await videoResponse.blob();
-const videoUrl = URL.createObjectURL(blob);
-processedVideo.src = videoUrl;
-downloadBtn.href = videoUrl;
+detectionVideoUrl = URL.createObjectURL(blob);
+detectionFirstFrameUrl = `${data.first_frame_url}?t=${Date.now()}`;
 
+// Set initial detection view
+processedVideo.src = detectionVideoUrl;
+downloadBtn.href = detectionVideoUrl;
+
+// Load depth assets
 await loadDepthAssets(data);
 } else if (statusData.status === 'failed') {
 clearInterval(statusPoller);
@@ -662,8 +759,8 @@
 const frameResponse = await fetch(jobData.first_frame_depth_url);
 if (frameResponse.ok) {
     const frameBlob = await frameResponse.blob();
-    const frameUrl = URL.createObjectURL(frameBlob);
-    depthFrameImage.src = frameUrl;
+    depthFirstFrameUrl = URL.createObjectURL(frameBlob);
+    depthFrameImage.src = depthFirstFrameUrl;
     depthFrameImage.classList.remove('hidden');
     depthFramePlaceholder.classList.add('hidden');
 } else {
@@ -678,11 +775,18 @@
 const depthResponse = await fetch(jobData.depth_video_url);
 if (depthResponse.ok) {
     const depthBlob = await depthResponse.blob();
-    const depthUrl = URL.createObjectURL(depthBlob);
-    depthVideo.src = depthUrl;
-    depthVideo.classList.remove('hidden');
-    depthDownloadBtn.href = depthUrl;
-    depthDownloadBtn.classList.remove('hidden');
+    depthVideoUrl = URL.createObjectURL(depthBlob);
+
+    // Keep depth video card hidden - using toggle instead
+    depthVideo.src = depthVideoUrl;
+    depthVideo.classList.add('hidden');
+    depthDownloadBtn.classList.add('hidden');
+
+    // Show toggle buttons now that we have both videos
+    viewToggleContainer.classList.remove('hidden');
+
+    // Start with detection view
+    switchToView('detection');
 } else {
     const error = await depthResponse.json();
     depthVideoStatus.textContent = error.detail || 'Depth video unavailable.';
models/depth_estimators/depth_pro.py CHANGED
@@ -8,28 +8,32 @@ from .base import DepthEstimator, DepthResult
 
 
 class DepthProEstimator(DepthEstimator):
-    """Apple Depth Pro depth estimator."""
+    """Apple Depth Pro depth estimator using Hugging Face transformers."""
 
     name = "depth_pro"
 
     def __init__(self):
-        """Initialize Depth Pro model."""
+        """Initialize Depth Pro model from Hugging Face."""
         try:
-            import depth_pro
+            from transformers import DepthProImageProcessorFast, DepthProForDepthEstimation
         except ImportError as exc:
             raise ImportError(
-                "depth_pro package not installed. "
-                "Install with: pip install git+https://github.com/apple/ml-depth-pro.git"
+                "transformers package not installed or doesn't include DepthPro. "
+                "Update with: pip install transformers --upgrade"
             ) from exc
 
-        logging.info("Loading Depth Pro model...")
-        self.model, self.transform = depth_pro.create_model_and_transforms()
-        self.model.eval()
+        logging.info("Loading Depth Pro model from Hugging Face...")
 
-        # Move model to GPU if available
+        # Set device
         self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+
+        # Load model and processor
+        model_id = "apple/DepthPro-hf"
+        self.image_processor = DepthProImageProcessorFast.from_pretrained(model_id)
+        self.model = DepthProForDepthEstimation.from_pretrained(model_id).to(self.device)
+        self.model.eval()
+
         if torch.cuda.is_available():
-            self.model = self.model.cuda()
             logging.info("Depth Pro model loaded on GPU")
         else:
             logging.warning("Depth Pro model loaded on CPU (no CUDA available)")
@@ -47,29 +51,34 @@ class DepthProEstimator(DepthEstimator):
         # Convert BGR to RGB
         rgb_frame = frame[:, :, ::-1]  # BGR → RGB
 
-        # Convert to PIL Image for transform
+        # Convert to PIL Image
         pil_image = Image.fromarray(rgb_frame)
+        height, width = pil_image.height, pil_image.width
 
-        # Apply transform and move to device
-        image_tensor = self.transform(pil_image)
-        image_tensor = image_tensor.to(self.device)
+        # Preprocess image
+        inputs = self.image_processor(images=pil_image, return_tensors="pt").to(self.device)
 
         # Run inference (no gradient needed)
         with torch.no_grad():
-            prediction = self.model.infer(image_tensor, f_px=None)
+            outputs = self.model(**inputs)
+
+        # Post-process to get depth and focal length
+        post_processed = self.image_processor.post_process_depth_estimation(
+            outputs,
+            target_sizes=[(height, width)],
+        )
 
-        # Extract depth map and move to CPU/numpy
-        # prediction is a dict: {"depth": tensor, "focallength_px": tensor}
-        depth_tensor = prediction["depth"]
-        focal_length_tensor = prediction.get("focallength_px")
+        # Extract depth map and focal length
+        depth_tensor = post_processed[0]["predicted_depth"]  # Already at target size
+        focal_length_value = post_processed[0].get("focal_length", 1.0)
 
-        # Convert to numpy, remove batch dimension if present
-        depth_map = depth_tensor.cpu().numpy().squeeze()
+        # Convert to numpy
+        depth_map = depth_tensor.cpu().numpy()
 
-        # Extract focal length
-        if focal_length_tensor is not None:
-            focal_length = float(focal_length_tensor.cpu().item())
+        # focal_length might be a tensor, convert to float
+        if isinstance(focal_length_value, torch.Tensor):
+            focal_length = float(focal_length_value.item())
         else:
-            focal_length = 1.0
+            focal_length = float(focal_length_value)
 
         return DepthResult(depth_map=depth_map, focal_length=focal_length)
requirements.txt CHANGED
@@ -11,4 +11,3 @@ huggingface-hub
 ultralytics
 timm
 ffmpeg-python
-depth-pro @ git+https://github.com/apple/ml-depth-pro.git