---
library_name: diffusers
license: apache-2.0
tags:
- modular-diffusers
- diffusers
- depth-estimation
---
# Depth Pro Estimator Block

A custom [Modular Diffusers](https://huggingface.co/docs/diffusers/modular_diffusers/overview) block for monocular depth estimation using Apple's [Depth Pro](https://huggingface.co/apple/DepthPro-hf) model. Supports both images and videos.

## Features

- **Metric depth estimation** in real-world meters using Depth Pro
- **Image and video** input support
- **Grayscale or turbo colormap** visualization
- **Inverse depth normalization** (following Apple's reference implementation) for robust handling of outdoor/sky scenes

## Installation

```bash
# Using uv
uv sync

# Using pip
pip install -r requirements.txt
```

## Quick Start

### Load the block

```python
from diffusers import ModularPipelineBlocks
import torch

blocks = ModularPipelineBlocks.from_pretrained(
    "your-username/depth-pro-estimator",  # or local path "."
    trust_remote_code=True,
)
pipeline = blocks.init_pipeline()
pipeline.load_components(torch_dtype=torch.float16)
pipeline.to("cuda")
```

### Single image - grayscale depth

```python
from PIL import Image

image = Image.open("photo.jpg")
output = pipeline(image=image)

# Save depth map
output.depth_image.save("photo_depth.png")

# Access raw metric depth tensor (in meters)
print(output.predicted_depth.shape)  # (H, W)
print(output.field_of_view)          # estimated FOV
print(output.focal_length)           # estimated focal length
```

### Single image - turbo colormap

```python
output = pipeline(image=image, colormap="turbo")
output.depth_image.save("photo_depth_turbo.png")
```

### Video - grayscale depth

```python
from block import save_video

output = pipeline(video_path="input.mp4", colormap="grayscale")
save_video(output.depth_frames, output.fps, "output_depth.mp4")
```

### Video - turbo colormap

```python
output = pipeline(video_path="input.mp4", colormap="turbo")
save_video(output.depth_frames, output.fps, "output_depth_turbo.mp4")
```

## Inputs

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `image` | `PIL.Image` | - | Image to estimate depth for |
| `video_path` | `str` | - | Path to input video. When provided, `image` is ignored |
| `colormap` | `str` | `"grayscale"` | `"grayscale"` or `"turbo"` (colormapped) |

## Outputs

### Image mode

| Output | Type | Description |
|--------|------|-------------|
| `depth_image` | `PIL.Image` | Normalized depth visualization |
| `predicted_depth` | `torch.Tensor` | Raw metric depth in meters (H x W) |
| `field_of_view` | `float` | Estimated horizontal FOV |
| `focal_length` | `float` | Estimated focal length |
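
The two camera outputs are related through the pinhole camera model. The sketch below shows that relationship. It assumes `field_of_view` is a horizontal angle in degrees and `focal_length` is expressed in pixels; this README does not state either unit, so check the values your pipeline returns. The function names are hypothetical helpers and are not part of the block.

```python
import math


def focal_length_from_fov(fov_degrees, image_width_px):
    """Pinhole model: focal length in pixels from a horizontal FOV in degrees."""
    return 0.5 * image_width_px / math.tan(0.5 * math.radians(fov_degrees))


def fov_from_focal_length(focal_length_px, image_width_px):
    """Inverse relation: horizontal FOV in degrees from a focal length in pixels."""
    return math.degrees(2.0 * math.atan(0.5 * image_width_px / focal_length_px))
```

Under these assumptions, a wider field of view corresponds to a shorter focal length for the same image width.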

### Video mode

| Output | Type | Description |
|--------|------|-------------|
| `depth_frames` | `List[PIL.Image]` | Per-frame depth visualizations |
| `fps` | `float` | Source video frame rate |

## Depth Normalization

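Depth visualization uses inverse depth (1/depth), with the visualized range limited to depths between 0.1 m and 250 m, following [Apple's reference implementation](https://github.com/apple/ml-depth-pro). Because the model clamps sky/infinity values at 10,000 m, normalizing raw depth linearly would let those values dominate the range and crush near-field detail into what looks like a binary mask. Normalizing inverse depth avoids this.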

- **Bright = close**, **dark = far** (grayscale)
- **Warm (red/yellow) = close**, **cool (blue) = far** (turbo)
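
The normalization described above can be sketched in NumPy as follows. This is a minimal illustration modeled on the approach in Apple's reference code, not the block's actual implementation. The function names, the `near_m`/`far_m` parameters, and the explicit final clip to [0, 1] are assumptions made for this example.

```python
import numpy as np


def normalize_inverse_depth(depth_m, near_m=0.1, far_m=250.0):
    """Map metric depth (meters) to [0, 1], where 1 is close and 0 is far."""
    depth = np.asarray(depth_m, dtype=np.float32)
    # Guard against division by zero for degenerate depth values.
    inverse = 1.0 / np.clip(depth, 1e-6, None)
    # Bound the visualized range by the scene's extremes and the near/far limits.
    max_inv = min(float(inverse.max()), 1.0 / near_m)
    min_inv = max(float(inverse.min()), 1.0 / far_m)
    span = max(max_inv - min_inv, 1e-8)  # avoid dividing by zero on flat scenes
    # Values beyond the limits (for example sky at 10,000 m) saturate at 0 or 1.
    return np.clip((inverse - min_inv) / span, 0.0, 1.0)


def to_grayscale_uint8(depth_m):
    """Grayscale visualization: bright pixels are close, dark pixels are far."""
    return (normalize_inverse_depth(depth_m) * 255.0).round().astype(np.uint8)
```

With this mapping, a pixel at 10,000 m lands at 0 while the remaining range is spent on depths inside the 0.1 m to 250 m window, which is what keeps near-field detail visible.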