# DepthPro Wrapper — Image to Point Cloud

[![Paper](https://img.shields.io/badge/Paper-arXiv%3A2410.02073-blue)](https://arxiv.org/abs/2410.02073)
[![Model](https://img.shields.io/badge/🤗%20Model-apple%2FDepthPro--hf-yellow)](https://huggingface.co/apple/DepthPro-hf)
[![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)

A **clean, drop-in Python wrapper** around Apple's **DepthPro** (arXiv:2410.02073) that turns a single RGB image into a **metric depth map** and, if you want, a **3D point cloud** — with **zero calibration**.

DepthPro is a 952M-parameter ViT-L model that predicts **absolute metric depth** (in meters, not relative depth) and automatically estimates the **camera focal length** and **field of view**.  No camera intrinsics, per-scene training, or LiDAR are required.

---

## 🚀 Quick start (10 lines)

```python
from depthpro_wrapper import DepthProEstimator, rgbd_to_point_cloud, save_point_cloud

# 1. Load model (~2 GB download on first run)
estimator = DepthProEstimator(device="cuda:0")

# 2. Drop an image in
result = estimator.estimate("photo.jpg")
print(f"Focal length: {result.focal_length:.1f} px")
print(f"Depth range: {result.depth.min():.2f} – {result.depth.max():.2f} m")

# 3. Get a coloured point cloud out
points, colors = rgbd_to_point_cloud(
    result.depth, result.image, result.focal_length
)

# 4. Save as PLY
save_point_cloud("scene.ply", points, colors=colors)
```

Or use the CLI:

```bash
python scripts/image_to_pointcloud.py photo.jpg scene.ply --colored --normals
```

---

## 📦 Installation

```bash
# 1. Core dependencies
pip install torch torchvision transformers pillow numpy

# 2. Install this wrapper
pip install -e .
```

DepthPro is a large ViT-L model (~2 GB).  The weights download automatically from HuggingFace the first time you instantiate `DepthProEstimator`.

> **GPU strongly recommended.**  The model runs in ~0.3 s on a modern GPU; CPU inference is possible but extremely slow.

---

## 🔬 How it works (the full pipeline)

Here is what happens under the hood when you call `estimate()` and then back-project the result into a point cloud.

### Step 1 — Preprocessing (handled automatically)
Your input image is resized to **1536 × 1536** (the fixed operating resolution DepthPro was trained on), rescaled by `1/255`, and normalised to `[-1, 1]` with `mean=0.5, std=0.5`.  This is done by `DepthProImageProcessorFast`.
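
The rescale and normalise arithmetic can be sketched in NumPy as follows. This only illustrates the numbers quoted above and is not the processor's actual code: the helper name `normalise_rgb` is hypothetical, and the resize to 1536 × 1536 is left out.

```python
import numpy as np

def normalise_rgb(image_u8: np.ndarray) -> np.ndarray:
    """Map a uint8 RGB image to float32 values in [-1, 1]."""
    x = image_u8.astype(np.float32) / 255.0   # rescale by 1/255 -> [0, 1]
    return (x - 0.5) / 0.5                    # mean=0.5, std=0.5 -> [-1, 1]

pixels = np.array([[[0, 128, 255]]], dtype=np.uint8)  # one pixel, shape (1, 1, 3)
out = normalise_rgb(pixels)
print(out.min(), out.max())   # black maps to -1, white maps to +1
```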

### Step 2 — Feature extraction (DINOv2 ViT-L + multi-scale patches)
DepthPro's backbone is a **DINOv2 ViT-L/16** (24 layers, 1024 hidden dim, 16×16 patch size).  It processes the image at three scales simultaneously:

| Scale | Resolution | Purpose |
|-------|-----------|---------|
| 0.25× | 384×384 | Global context, far-away geometry |
| 0.5×  | 768×768  | Mid-range structure |
| 1.0×  | 1536×1536 | Fine detail, edges, thin structures |

The three-scale features are fused with a **DPT-style decoder** (hidden size 256) into a dense feature map.

### Step 3 — Depth prediction (canonical inverse depth)
The decoder outputs **canonical inverse depth** `C`.  This is not the final metric depth yet — it is a scale-invariant representation that the network learns to predict robustly across scenes.  The actual metric depth is recovered in the post-processing step.

### Step 4 — FOV / focal-length estimation (no calibration needed)
DepthPro has a **dedicated FOV head**.  It ingests frozen features from the depth network plus task-specific features from a separate ViT encoder to predict the **horizontal field of view (FOV)** in degrees.

From the FOV, the focal length in pixels is derived:

```
focal_length = (image_width / 2) / tan(FOV / 2)
```

This is the critical piece that makes the depth **metric** (in meters) rather than just up-to-scale.
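
As a worked example of the relation above, here is a minimal sketch using only the standard library. The function names are illustrative and not part of the wrapper. Note that the FOV is given in degrees and has to be converted to radians before taking the tangent.

```python
import math

def focal_length_px(image_width: int, fov_deg: float) -> float:
    """Pinhole relation: f = (W / 2) / tan(FOV / 2), with FOV in degrees."""
    return (image_width / 2.0) / math.tan(math.radians(fov_deg) / 2.0)

def fov_deg_from_focal(image_width: int, focal_px: float) -> float:
    """Inverse of the relation above."""
    return math.degrees(2.0 * math.atan((image_width / 2.0) / focal_px))

f = focal_length_px(1536, 90.0)   # tan(45 deg) = 1, so f is half the width
print(round(f, 3))                # 768.0
```

A wider field of view gives a shorter focal length in pixels, and the reverse holds for a narrow, telephoto-like view.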

### Step 5 — Post-processing (metric depth + back-projection)
The processor converts canonical inverse depth `C` to metric depth `D_m` using the estimated focal length:

```
D_m = focal_length / (image_width × C)
```

Then, using the **pinhole camera model**, every pixel `(u, v)` is back-projected to a 3D point:

```
X = (u - cx) * Z / focal_length
Y = (v - cy) * Z / focal_length
Z = D_m[v, u]
```

where `(cx, cy)` is the principal point (image centre by default).  DepthPro assumes **square pixels** (`fx == fy`), which is standard for most modern cameras.
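
The back-projection equations can be sketched in NumPy like this. It is a minimal illustration, not the wrapper's implementation: the name `backproject` is hypothetical, and placing the principal point at `(W / 2, H / 2)` is an assumption about what "image centre" means.

```python
import numpy as np

def backproject(depth, focal_length, principal_point=None):
    """Back-project an (H, W) metric depth map to (H*W, 3) camera-space points."""
    h, w = depth.shape
    if principal_point is None:
        cx, cy = w / 2.0, h / 2.0          # assumed image-centre convention
    else:
        cx, cy = principal_point
    v, u = np.mgrid[0:h, 0:w]              # v = row index, u = column index
    z = depth.astype(np.float32)
    x = (u - cx) * z / focal_length
    y = (v - cy) * z / focal_length
    return np.stack([x, y, z], axis=-1).reshape(-1, 3).astype(np.float32)

depth = np.full((2, 4), 2.0, dtype=np.float32)  # flat wall 2 m away
pts = backproject(depth, focal_length=2.0)
print(pts.shape)   # (8, 3)
```

Because `X` and `Y` scale with `Z / focal_length`, an error in the estimated focal length stretches or squeezes the cloud sideways even when the depth values themselves are good.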

### Result
`estimate()` gives you:
* `depth`: (H, W) metric depth map in **meters**
* `focal_length`: estimated focal length in **pixels**
* `field_of_view`: estimated horizontal FOV in **degrees**

Back-projecting with `depth_to_point_cloud()` or `rgbd_to_point_cloud()` then gives:
* `points`: (N, 3) 3D point cloud in camera coordinates (`+Z` forward)

---

## 🧰 API Reference

### `DepthProEstimator`

```python
class DepthProEstimator(
    model_name="apple/DepthPro-hf",
    device="cuda:0",
    dtype=torch.float16,
)
```

* `model_name` — HuggingFace model ID or local path.
* `device` — PyTorch device.  CUDA strongly recommended.
* `dtype` — `torch.float16` (default, fast) or `torch.float32` (slightly higher precision).

#### `.estimate(image)`

```python
result = estimator.estimate(
    image,                    # str, Path, PIL.Image, or np.ndarray
    return_confidence=False,
)
```

Returns a `DepthResult` dataclass:

| Attribute | Shape | Description |
|-----------|-------|-------------|
| `depth` | (H, W) | Metric depth in meters (float32) |
| `focal_length` | scalar | Estimated focal length in pixels |
| `field_of_view` | scalar | Estimated horizontal FOV in degrees |
| `image` | (H, W, 3) | Original RGB image (uint8) |
| `confidence` | (H, W) or None | Per-pixel confidence (if requested) |
| `height`, `width` | scalars | Convenience properties |

#### `.estimate_batch(images)`

Process multiple images in a single forward pass for efficiency:

```python
results = estimator.estimate_batch(["a.jpg", "b.jpg", "c.jpg"])
for r in results:
    print(r.depth.shape)
```

### `depth_to_point_cloud(depth, focal_length, ...)`

```python
from depthpro_wrapper import depth_to_point_cloud

points = depth_to_point_cloud(
    depth=result.depth,           # (H, W) metric depth
    focal_length=result.focal_length,
    principal_point=None,         # default = image centre
    mask=None,                    # optional boolean mask
    sample_step=1,                # 2 = 1/4 points, 4 = 1/16
)
```

Returns `(N, 3)` float32 array of 3D points in camera coordinates.
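
A common use of the `mask` argument is to drop pixels that are implausibly near or far. The sketch below builds such a mask with NumPy. The helper name and the thresholds are illustrative, and it assumes that `True` means "keep this pixel", which you should check against the wrapper's docstring.

```python
import numpy as np

def depth_range_mask(depth: np.ndarray, near: float, far: float) -> np.ndarray:
    """Boolean (H, W) mask: True where depth is finite and within [near, far]."""
    return np.isfinite(depth) & (depth >= near) & (depth <= far)

depth = np.array([[0.05, 1.0, 3.0],
                  [np.inf, 12.0, 50.0]], dtype=np.float32)
mask = depth_range_mask(depth, near=0.1, far=20.0)
print(int(mask.sum()))   # 3 pixels kept

# If sample_step is plain strided subsampling (an assumption), a step of 2
# keeps one pixel in four, consistent with the "2 = 1/4 points" comment.
kept_fraction = np.zeros((480, 640))[::2, ::2].size / (480 * 640)
print(kept_fraction)     # 0.25

# The mask would then be passed along, for example:
# points = depth_to_point_cloud(result.depth, result.focal_length, mask=mask)
```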

### `rgbd_to_point_cloud(depth, rgb, focal_length, ...)`

Same as above but also returns per-point RGB colours:

```python
points, colors = rgbd_to_point_cloud(
    result.depth, result.image, result.focal_length,
    sample_step=2,
)
```

Returns `(N, 3)` points and `(N, 3)` uint8 colours.

### `normals_from_depth(depth, focal_length)`

Compute surface normals directly from the depth map (useful for feeding into surface-reconstruction pipelines like Poisson or NKSR):

```python
from depthpro_wrapper import normals_from_depth
normals = normals_from_depth(result.depth, result.focal_length)
```

Returns `(H, W, 3)` float32 unit normals (unoriented).
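
To give an idea of how normals can be derived from a depth map, here is a minimal sketch of one standard approach: back-project every pixel, take finite differences along image rows and columns, and cross the two tangent vectors. It is not necessarily what `normals_from_depth` does internally; the function name and the image-centre principal point are assumptions.

```python
import numpy as np

def normals_sketch(depth: np.ndarray, focal_length: float) -> np.ndarray:
    """Estimate (H, W, 3) unit normals from finite differences of 3D points."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    z = depth.astype(np.float64)
    pts = np.stack([(u - w / 2.0) * z / focal_length,
                    (v - h / 2.0) * z / focal_length,
                    z], axis=-1)
    d_dv, d_du = np.gradient(pts, axis=(0, 1))   # derivatives along rows, then columns
    n = np.cross(d_du, d_dv)                     # perpendicular to both tangents
    norm = np.linalg.norm(n, axis=-1, keepdims=True)
    return (n / np.maximum(norm, 1e-12)).astype(np.float32)

flat = np.full((4, 5), 3.0)            # fronto-parallel plane 3 m away
n = normals_sketch(flat, focal_length=100.0)
print(n[2, 2])                         # points along the Z axis for a flat wall
```

The sign of the result depends on the order of the cross product, which is why normals computed this way are described as unoriented.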

### `save_point_cloud(path, points, colors=None, normals=None)`

Save a point cloud to an ASCII PLY file (readable by Open3D, MeshLab, CloudCompare, Blender, etc.):

```python
from depthpro_wrapper import save_point_cloud
save_point_cloud("cloud.ply", points, colors=colors, normals=normals)
```
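
For reference, an ASCII PLY file is just a short text header followed by one line per vertex. The sketch below writes points and optional colours using only the standard library. It illustrates the file format and is not the wrapper's `save_point_cloud`; the name `write_ascii_ply` is hypothetical and normals are omitted for brevity.

```python
import os
import tempfile

def write_ascii_ply(path, points, colors=None):
    """Write points (and optional 0-255 RGB colours) as an ASCII PLY file."""
    header = ["ply", "format ascii 1.0", f"element vertex {len(points)}",
              "property float x", "property float y", "property float z"]
    if colors is not None:
        header += ["property uchar red", "property uchar green", "property uchar blue"]
    header.append("end_header")
    with open(path, "w", encoding="ascii") as fh:
        fh.write("\n".join(header) + "\n")
        for i, (x, y, z) in enumerate(points):
            line = f"{x:.6f} {y:.6f} {z:.6f}"
            if colors is not None:
                r, g, b = colors[i]
                line += f" {int(r)} {int(g)} {int(b)}"
            fh.write(line + "\n")

demo_path = os.path.join(tempfile.gettempdir(), "demo_cloud.ply")
write_ascii_ply(demo_path, [(0.0, 0.0, 1.0), (0.5, -0.5, 2.0)],
                colors=[(255, 0, 0), (0, 255, 0)])
with open(demo_path, encoding="ascii") as fh:
    ply_lines = fh.read().splitlines()
print(ply_lines[2])   # element vertex 2
```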

---

## 🖥️ CLI Usage

```bash
# Basic: image → point cloud
python scripts/image_to_pointcloud.py photo.jpg cloud.ply

# With colours and normals
python scripts/image_to_pointcloud.py photo.jpg cloud.ply --colored --normals

# Down-sample for faster processing / smaller files
python scripts/image_to_pointcloud.py photo.jpg cloud.ply --sample-step 2

# Save intermediate depth & confidence maps
python scripts/image_to_pointcloud.py photo.jpg cloud.ply \
    --save-depth depth.npy --save-confidence conf.npy

# CPU fallback (very slow)
python scripts/image_to_pointcloud.py photo.jpg cloud.ply --device cpu --dtype float32
```
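
If you want to adapt the CLI, the flags shown above map naturally onto `argparse`. The parser below is a sketch reconstructed from those examples, not a copy of `scripts/image_to_pointcloud.py`. The defaults are assumptions borrowed from the Python API defaults.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser mirroring the flags used in the examples above."""
    p = argparse.ArgumentParser(description="Image to point cloud (illustrative)")
    p.add_argument("image")                                   # input image path
    p.add_argument("output")                                  # output PLY path
    p.add_argument("--colored", action="store_true")
    p.add_argument("--normals", action="store_true")
    p.add_argument("--sample-step", type=int, default=1)      # assumed default
    p.add_argument("--save-depth", default=None)
    p.add_argument("--save-confidence", default=None)
    p.add_argument("--device", default="cuda:0")              # assumed default
    p.add_argument("--dtype", choices=["float16", "float32"], default="float16")
    return p

args = build_parser().parse_args(
    ["photo.jpg", "cloud.ply", "--colored", "--sample-step", "2"]
)
print(args.sample_step, args.colored, args.device)   # 2 True cuda:0
```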

---

## 📂 Repository layout

```
depthpro-wrapper/
├── depthpro_wrapper/
│   ├── __init__.py              # public API
│   ├── depth_estimator.py      # DepthProEstimator + DepthResult
│   ├── point_cloud.py          # back-projection + normal estimation
│   └── io.py                   # image / PLY I/O helpers
├── scripts/
│   └── image_to_pointcloud.py  # CLI entry point
├── examples/
│   ├── quickstart.py           # 10-line minimal example
│   └── batch_processing.py     # folder-of-images batch script
├── setup.py
├── requirements.txt
└── README.md
```

---

## 🎯 Tips & Troubleshooting

| Problem | Solution |
|---------|----------|
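| Out of memory | Use `dtype=torch.float16` (default) and pass fewer images per `estimate_batch()` call.  Every input is resized to 1536 × 1536, so shrinking the image does not reduce model memory; `--sample-step 2` reduces only the memory used by the point cloud. |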
| Depth looks wrong / flat | DepthPro works best on images with perspective (indoor rooms, outdoor scenes).  Very flat macro shots may under-estimate depth. |
| Point cloud is noisy at edges | Depth has uncertainty at object boundaries.  Use `sample_step=2` or filter by `confidence` if you saved it. |
| Focal length seems off | DepthPro estimates FOV from image content.  Very unusual aspect ratios or heavy cropping can confuse it.  You can override with your own `focal_length` in `depth_to_point_cloud()`. |
| Want a mesh, not a point cloud | Feed the point cloud into a surface-reconstruction method: Poisson (Open3D), Alpha shapes, or better yet **[NKSR](https://huggingface.co/bdck/nksr-wrapper)** for neural surface reconstruction. |
| Batch processing is slow | Use `estimate_batch()` with batch size 4–8 instead of looping over `estimate()`. |
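
If you know the lens and sensor that took the photo, you can compute your own focal length in pixels for the override mentioned in the table. The sketch below uses the standard conversion from millimetres to pixels. The helper name is hypothetical, and the result is only valid when the image has not been cropped.

```python
def focal_px_from_mm(focal_mm: float, sensor_width_mm: float, image_width_px: int) -> float:
    """Convert a lens focal length in mm to pixels for a given sensor and image width."""
    return focal_mm * image_width_px / sensor_width_mm

# Example: 24 mm lens on a 36 mm wide (full-frame) sensor, 6000 px wide image.
f_px = focal_px_from_mm(24.0, 36.0, 6000)
print(round(f_px, 1))   # 4000.0

# The value would then replace the estimated focal length, for example:
# points = depth_to_point_cloud(result.depth, focal_length=f_px)
```

Passing a different focal length to the back-projection changes only the `X` and `Y` coordinates; the depth values in `result.depth` stay as they were estimated.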

---

## 🔗 Citation

If you use DepthPro in your research, please cite the original paper:

```bibtex
@article{depthpro2024,
  title={Depth Pro: Sharp Monocular Metric Depth in Less Than a Second},
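  author={Bochkovskii, Aleksei and Delaunoy, Ama{\"e}l and Germain, Hugo and Santos, Marcel and Zhou, Yichao and Richter, Stephan R. and Koltun, Vladlen},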
  journal={arXiv preprint arXiv:2410.02073},
  year={2024}
}
```

Original code: [https://github.com/apple/ml-depth-pro](https://github.com/apple/ml-depth-pro)  
HuggingFace model: [https://huggingface.co/apple/DepthPro-hf](https://huggingface.co/apple/DepthPro-hf)

---

## 📄 License

This wrapper is released under the MIT License.  DepthPro itself is under Apple's own license (see the original repository).

---

*Built with ❤️ on top of Apple's DepthPro.*