# DepthPro Wrapper: Image to Point Cloud
[![Paper](https://img.shields.io/badge/Paper-arXiv%3A2410.02073-blue)](https://arxiv.org/abs/2410.02073)
[![Model](https://img.shields.io/badge/🤗%20Model-apple%2FDepthPro--hf-yellow)](https://huggingface.co/apple/DepthPro-hf)
[![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
A **clean, drop-in Python wrapper** around Apple's **DepthPro** (arXiv:2410.02073) that turns a single RGB image into a **metric depth map** and, optionally, a **3D point cloud**, with **zero calibration**.
DepthPro is a 952 M-parameter ViT-L model that predicts **absolute metric depth** (meters, not relative) and estimates the **camera focal length** and **field of view** automatically. No camera intrinsics, no per-scene training, no LiDAR required.
---
## 🚀 Quick start (10 lines)
```python
from depthpro_wrapper import DepthProEstimator, rgbd_to_point_cloud, save_point_cloud
# 1. Load model (~2 GB download on first run)
estimator = DepthProEstimator(device="cuda:0")
# 2. Drop an image in
result = estimator.estimate("photo.jpg")
print(f"Focal length: {result.focal_length:.1f} px")
print(f"Depth range: {result.depth.min():.2f} – {result.depth.max():.2f} m")
# 3. Get a coloured point cloud out
points, colors = rgbd_to_point_cloud(
    result.depth, result.image, result.focal_length
)
# 4. Save as PLY
save_point_cloud("scene.ply", points, colors=colors)
```
Or use the CLI:
```bash
python scripts/image_to_pointcloud.py photo.jpg scene.ply --colored --normals
```
---
## 📦 Installation
```bash
# 1. Core dependencies
pip install torch torchvision transformers pillow numpy
# 2. Install this wrapper
pip install -e .
```
DepthPro is a large ViT-L model (~2 GB). The weights download automatically from HuggingFace the first time you instantiate `DepthProEstimator`.
> **GPU strongly recommended.** The model runs in ~0.3 s on a modern GPU; CPU inference is possible but extremely slow.
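The CPU fallback in the CLI section pairs `--device cpu` with `--dtype float32`. A minimal sketch of making the same choice in code follows. `pick_runtime` is a hypothetical helper, not part of the wrapper, and the caller is assumed to pass in the result of `torch.cuda.is_available()`.

```python
def pick_runtime(cuda_available: bool) -> dict:
    """Return device/dtype names suited to the available hardware.

    Hypothetical helper (not part of depthpro_wrapper). Half precision is
    only chosen on GPU; on CPU we fall back to float32.
    """
    if cuda_available:
        return {"device": "cuda:0", "dtype": "float16"}
    return {"device": "cpu", "dtype": "float32"}


# Usage (assumes torch and the wrapper are installed):
#   import torch
#   cfg = pick_runtime(torch.cuda.is_available())
#   estimator = DepthProEstimator(device=cfg["device"],
#                                 dtype=getattr(torch, cfg["dtype"]))
print(pick_runtime(False))
```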
---
## 🔬 How it works (the full pipeline)
Here is exactly what happens under the hood when you call `estimate()`.
### Step 1: Preprocessing (handled automatically)
Your input image is resized to **1536 × 1536** (the fixed operating resolution DepthPro was trained on), rescaled by `1/255`, and normalised to `[-1, 1]` with `mean=0.5, std=0.5`. This is done by `DepthProImageProcessorFast`.
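The arithmetic of the rescale-and-normalise step can be sketched in plain Python for a single 8-bit channel value. This only illustrates the numbers quoted above; the real work is done by the image processor, and `normalise_pixel` is a hypothetical name.

```python
def normalise_pixel(value: int, mean: float = 0.5, std: float = 0.5) -> float:
    """Map an 8-bit channel value (0..255) into [-1, 1]."""
    rescaled = value / 255.0          # rescale by 1/255 -> [0, 1]
    return (rescaled - mean) / std    # normalise -> [-1, 1]


for v in (0, 128, 255):
    print(v, round(normalise_pixel(v), 4))
```

Black (0) maps to -1 and white (255) maps to +1, which is the input range the network expects.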
### Step 2: Feature extraction (DINOv2 ViT-L + multi-scale patches)
DepthPro's backbone is a **DINOv2 ViT-L/16** (24 layers, 1024 hidden dim, 16×16 patch size). It processes the image at three scales simultaneously:
| Scale | Resolution | Purpose |
|-------|-----------|---------|
| 0.25× | 384×384 | Global context, far-away geometry |
| 0.5× | 768×768 | Mid-range structure |
| 1.0× | 1536×1536 | Fine detail, edges, thin structures |
The three-scale features are fused with a **DPT-style decoder** (hidden size 256) into a dense feature map.
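As a quick check of the numbers in the table, the sketch below derives each scale's resolution from the 1536-pixel operating resolution and counts the 16×16 tokens along one side of a 384×384 input. How the larger scales are split up for the encoder is not described here, so the code deliberately stops at these two quantities. The function names are hypothetical.

```python
BASE_RESOLUTION = 1536   # operating resolution from Step 1
PATCH_SIZE = 16          # ViT patch size quoted above


def scale_resolution(scale: float, base: int = BASE_RESOLUTION) -> int:
    """Side length (in pixels) of the image at a given scale factor."""
    return int(base * scale)


def tokens_per_side(side: int, patch: int = PATCH_SIZE) -> int:
    """Number of non-overlapping patches along one side of a square input."""
    return side // patch


for s in (0.25, 0.5, 1.0):
    print(f"{s}x -> {scale_resolution(s)} px")

side = tokens_per_side(scale_resolution(0.25))
print(f"384x384 input -> {side} x {side} = {side * side} tokens")
```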
### Step 3: Depth prediction (canonical inverse depth)
The decoder outputs **canonical inverse depth** `C`. This is not the final metric depth yet: it is a scale-invariant representation that the network learns to predict robustly across scenes. The actual metric depth is recovered in the post-processing step.
### Step 4: FOV / focal-length estimation (no calibration needed)
DepthPro has a **dedicated FOV head**. It ingests frozen features from the depth network plus task-specific features from a separate ViT encoder to predict the **horizontal field of view (FOV)** in degrees.
From the FOV, the focal length in pixels is derived:
```
focal_length = (image_width / 2) / tan(FOV / 2)
```
This is the critical piece that makes the depth **metric** (in meters) rather than just up-to-scale.
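The formula expects the angle in radians inside `tan`, while the model reports the FOV in degrees. A minimal sketch of the conversion in both directions, using only the standard library (the function names are hypothetical, not wrapper API):

```python
import math


def fov_to_focal_length(fov_deg: float, image_width: int) -> float:
    """Focal length in pixels from a horizontal field of view in degrees."""
    return (image_width / 2.0) / math.tan(math.radians(fov_deg) / 2.0)


def focal_length_to_fov(focal_px: float, image_width: int) -> float:
    """Inverse mapping: horizontal field of view in degrees."""
    return math.degrees(2.0 * math.atan((image_width / 2.0) / focal_px))


f_px = fov_to_focal_length(90.0, 1536)
print(f"{f_px:.1f} px")
```

For example, a 90° horizontal FOV on a 1536-pixel-wide image corresponds to a focal length of 768 px, since tan(45°) = 1.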
### Step 5: Post-processing (metric depth + back-projection)
The processor converts canonical inverse depth `C` to metric depth `D_m` using the estimated focal length:
```
D_m = focal_length / (image_width × C)
```
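A minimal sketch of this conversion for a single pixel, following the relation `D_m = f_px / (w · C)` given in the Depth Pro paper. The clamp bounds are an assumption added here to avoid division by zero, and `canonical_to_metric` is a hypothetical name, not a wrapper function.

```python
def canonical_to_metric(c: float, focal_px: float, image_width: int,
                        min_inv: float = 1e-4, max_inv: float = 1e4) -> float:
    """Convert canonical inverse depth to metric depth in meters."""
    inverse_depth = c * image_width / focal_px                 # metric inverse depth (1/m)
    inverse_depth = min(max(inverse_depth, min_inv), max_inv)  # assumed safety clamp
    return 1.0 / inverse_depth


print(canonical_to_metric(c=0.25, focal_px=768.0, image_width=1536))
```

Note the direction of the dependence: for a fixed `C`, a longer focal length (narrower FOV) yields a larger metric depth.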
Then, using the **pinhole camera model**, every pixel `(u, v)` is back-projected to a 3D point:
```
X = (u - cx) * Z / focal_length
Y = (v - cy) * Z / focal_length
Z = D_m[v, u]
```
where `(cx, cy)` is the principal point (image centre by default). DepthPro assumes **square pixels** (`fx == fy`), which is standard for most modern cameras.
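The back-projection equations above can be sketched with NumPy as follows. This is an illustration of the pinhole model, not the wrapper's own `depth_to_point_cloud` source; `backproject` is a hypothetical name, and placing the principal point at `(w / 2, h / 2)` is an assumption about what "image centre" means.

```python
from typing import Optional, Tuple

import numpy as np


def backproject(depth: np.ndarray, focal_px: float,
                principal_point: Optional[Tuple[float, float]] = None) -> np.ndarray:
    """Back-project an (H, W) metric depth map to (H*W, 3) camera-space points."""
    h, w = depth.shape
    cx, cy = principal_point if principal_point is not None else (w / 2.0, h / 2.0)
    # Pixel grids: u indexes columns (x), v indexes rows (y).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.astype(np.float32)
    x = (u - cx) * z / focal_px
    y = (v - cy) * z / focal_px
    return np.stack([x, y, z], axis=-1).reshape(-1, 3).astype(np.float32)


depth = np.full((4, 6), 2.0, dtype=np.float32)   # toy scene: flat wall 2 m away
points = backproject(depth, focal_px=3.0)
print(points.shape)
```

Points are returned in row-major pixel order, so point `v * W + u` corresponds to pixel `(u, v)`.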
### Result
You get:
* `depth`: (H, W) metric depth map in **meters**
* `focal_length`: estimated focal length in **pixels**
* `field_of_view`: estimated horizontal FOV in **degrees**
* `points`: (N, 3) 3D point cloud in camera coordinates (`+Z` forward)
---
## 🧰 API Reference
### `DepthProEstimator`
```python
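import torch
from depthpro_wrapper import DepthProEstimator

# Constructor arguments, shown with their default values
estimator = DepthProEstimator(
    model_name="apple/DepthPro-hf",
    device="cuda:0",
    dtype=torch.float16,
)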
```
* `model_name`: HuggingFace model ID or local path.
* `device`: PyTorch device. CUDA strongly recommended.
* `dtype`: `torch.float16` (default, fast) or `torch.float32` (slightly higher precision).
#### `.estimate(image)`
```python
result = estimator.estimate(
    image,  # str, Path, PIL.Image, or np.ndarray
    return_confidence=False,
)
```
Returns a `DepthResult` dataclass:
| Attribute | Shape | Description |
|-----------|-------|-------------|
| `depth` | (H, W) | Metric depth in meters (float32) |
| `focal_length` | scalar | Estimated focal length in pixels |
| `field_of_view` | scalar | Estimated horizontal FOV in degrees |
| `image` | (H, W, 3) | Original RGB image (uint8) |
| `confidence` | (H, W) or None | Per-pixel confidence (if requested) |
| `height`, `width` | scalars | Convenience properties |
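
For orientation, a container with these fields might look like the sketch below. It is a guess at the shape of the dataclass based only on the table, not the wrapper's actual source, and it is named `DepthResultSketch` to keep the two apart. The sample values are placeholders.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class DepthResultSketch:
    depth: np.ndarray                        # (H, W) float32, meters
    focal_length: float                      # pixels
    field_of_view: float                     # degrees, horizontal
    image: np.ndarray                        # (H, W, 3) uint8 RGB
    confidence: Optional[np.ndarray] = None  # (H, W), only if requested

    @property
    def height(self) -> int:
        return int(self.depth.shape[0])

    @property
    def width(self) -> int:
        return int(self.depth.shape[1])


r = DepthResultSketch(
    depth=np.ones((480, 640), dtype=np.float32),
    focal_length=500.0,
    field_of_view=65.0,
    image=np.zeros((480, 640, 3), dtype=np.uint8),
)
print(r.height, r.width, r.confidence is None)
```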
#### `.estimate_batch(images)`
Process multiple images in a single forward pass for efficiency:
```python
results = estimator.estimate_batch(["a.jpg", "b.jpg", "c.jpg"])
for r in results:
    print(r.depth.shape)
```
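For folders larger than one batch, the list of paths can be split into fixed-size chunks before each call to `estimate_batch()`. A minimal sketch follows; `chunked` is a hypothetical helper and the file names are placeholders.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


paths = [f"frame_{i:03d}.jpg" for i in range(10)]
for batch in chunked(paths, size=4):
    # With a loaded estimator: results = estimator.estimate_batch(batch)
    print(len(batch), batch[0])
```

The last chunk may be shorter than `size`; nothing is padded or dropped.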
### `depth_to_point_cloud(depth, focal_length, ...)`
```python
from depthpro_wrapper import depth_to_point_cloud
points = depth_to_point_cloud(
    depth=result.depth,        # (H, W) metric depth
    focal_length=result.focal_length,
    principal_point=None,      # default = image centre
    mask=None,                 # optional boolean mask
    sample_step=1,             # 2 = 1/4 points, 4 = 1/16
)
```
Returns `(N, 3)` float32 array of 3D points in camera coordinates.
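How `sample_step` and `mask` interact is easiest to see on a toy array. The sketch below assumes that `sample_step` is plain strided slicing, that `mask` keeps pixels where it is `True`, and that non-positive or non-finite depths are dropped; the wrapper's actual implementation may differ. `select_pixels` is a hypothetical name.

```python
import numpy as np


def select_pixels(depth: np.ndarray, mask=None, sample_step: int = 1) -> np.ndarray:
    """Return the depth values that would survive sub-sampling and masking."""
    sub = depth[::sample_step, ::sample_step]      # keep every n-th row and column
    keep = np.isfinite(sub) & (sub > 0)            # drop invalid depths
    if mask is not None:
        # Sub-sample the mask the same way so shapes stay aligned.
        keep = keep & mask[::sample_step, ::sample_step].astype(bool)
    return sub[keep]


depth = np.ones((8, 8), dtype=np.float32)
print(select_pixels(depth).size)                  # all 64 pixels
print(select_pixels(depth, sample_step=2).size)   # every 2nd row and column
```

Because the stride applies to both axes, the point count falls with the square of `sample_step`.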
### `rgbd_to_point_cloud(depth, rgb, focal_length, ...)`
Same as above but also returns per-point RGB colours:
```python
points, colors = rgbd_to_point_cloud(
    result.depth, result.image, result.focal_length,
    sample_step=2,
)
```
Returns `(N, 3)` points and `(N, 3)` uint8 colours.
### `normals_from_depth(depth, focal_length)`
Compute surface normals directly from the depth map (useful for feeding into surface-reconstruction pipelines like Poisson or NKSR):
```python
from depthpro_wrapper import normals_from_depth
normals = normals_from_depth(result.depth, result.focal_length)
```
Returns `(H, W, 3)` float32 unit normals (unoriented).
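One common way to obtain normals from a depth map is to back-project the pixels and take the cross product of the finite-difference tangents along the two image axes. The sketch below shows that technique under the assumption of a centred principal point; it is not necessarily how `normals_from_depth` is implemented, and `normals_sketch` is a hypothetical name.

```python
import numpy as np


def normals_sketch(depth: np.ndarray, focal_px: float) -> np.ndarray:
    """Estimate (H, W, 3) unit normals from an (H, W) metric depth map."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.astype(np.float64)
    # Back-project every pixel to a camera-space point (principal point = centre).
    pts = np.stack([(u - w / 2.0) * z / focal_px,
                    (v - h / 2.0) * z / focal_px,
                    z], axis=-1)
    # Finite-difference tangents along image columns (u) and rows (v).
    d_du = np.gradient(pts, axis=1)
    d_dv = np.gradient(pts, axis=0)
    n = np.cross(d_du, d_dv)
    norm = np.linalg.norm(n, axis=-1, keepdims=True)
    return (n / np.maximum(norm, 1e-12)).astype(np.float32)


flat_wall = np.full((5, 5), 3.0)
print(normals_sketch(flat_wall, focal_px=100.0)[2, 2])
```

The sign of each normal here follows from the cross-product order and carries no inside/outside meaning, so a reconstruction pipeline that needs oriented normals would still have to flip them consistently (for example, toward the camera).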
### `save_point_cloud(path, points, colors=None, normals=None)`
Save a point cloud to an ASCII PLY file (readable by Open3D, MeshLab, CloudCompare, Blender, etc.):
```python
from depthpro_wrapper import save_point_cloud
save_point_cloud("cloud.ply", points, colors=colors, normals=normals)
```
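The ASCII PLY layout is simple enough to write by hand, which helps when debugging a file that a viewer refuses to open. The sketch below writes positions and optional colours only (normals are omitted for brevity); `write_ascii_ply` is a hypothetical stand-in, not the wrapper's `save_point_cloud`.

```python
import os
import tempfile


def write_ascii_ply(path, points, colors=None):
    """Write points (and optional uint8 RGB colours) as an ASCII PLY file."""
    header = ["ply", "format ascii 1.0", f"element vertex {len(points)}",
              "property float x", "property float y", "property float z"]
    if colors is not None:
        header += ["property uchar red", "property uchar green",
                   "property uchar blue"]
    header.append("end_header")
    with open(path, "w", encoding="ascii") as fh:
        fh.write("\n".join(header) + "\n")
        for i, (x, y, z) in enumerate(points):
            line = f"{x:.6f} {y:.6f} {z:.6f}"
            if colors is not None:
                r, g, b = colors[i]
                line += f" {int(r)} {int(g)} {int(b)}"
            fh.write(line + "\n")


out = os.path.join(tempfile.gettempdir(), "sketch_cloud.ply")
write_ascii_ply(out, [(0.0, 0.0, 1.0), (0.5, -0.2, 2.0)],
                colors=[(255, 0, 0), (0, 255, 0)])
```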
---
## 🖥️ CLI Usage
```bash
# Basic: image → point cloud
python scripts/image_to_pointcloud.py photo.jpg cloud.ply
# With colours and normals
python scripts/image_to_pointcloud.py photo.jpg cloud.ply --colored --normals
# Down-sample for faster processing / smaller files
python scripts/image_to_pointcloud.py photo.jpg cloud.ply --sample-step 2
# Save intermediate depth & confidence maps
python scripts/image_to_pointcloud.py photo.jpg cloud.ply \
    --save-depth depth.npy --save-confidence conf.npy
# CPU fallback (very slow)
python scripts/image_to_pointcloud.py photo.jpg cloud.ply --device cpu --dtype float32
```
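The script's source is not reproduced in this README. The sketch below only shows how flags with these names could be declared with `argparse`; the defaults, choices, and help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the flags shown in the commands above (assumed defaults)."""
    p = argparse.ArgumentParser(description="Image to point cloud (sketch)")
    p.add_argument("image", help="input image path")
    p.add_argument("output", help="output PLY path")
    p.add_argument("--colored", action="store_true", help="attach RGB colours")
    p.add_argument("--normals", action="store_true", help="attach normals")
    p.add_argument("--sample-step", type=int, default=1)
    p.add_argument("--save-depth", default=None, help="path for depth .npy")
    p.add_argument("--save-confidence", default=None, help="path for confidence .npy")
    p.add_argument("--device", default="cuda:0")
    p.add_argument("--dtype", choices=["float16", "float32"], default="float16")
    return p


args = build_parser().parse_args(
    ["photo.jpg", "cloud.ply", "--colored", "--sample-step", "2"]
)
print(args.image, args.output, args.colored, args.sample_step)
```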
---
## 📂 Repository layout
```
depthpro-wrapper/
├── depthpro_wrapper/
│   ├── __init__.py            # public API
│   ├── depth_estimator.py     # DepthProEstimator + DepthResult
│   ├── point_cloud.py         # back-projection + normal estimation
│   └── io.py                  # image / PLY I/O helpers
├── scripts/
│   └── image_to_pointcloud.py # CLI entry point
├── examples/
│   ├── quickstart.py          # 10-line minimal example
│   └── batch_processing.py    # folder-of-images batch script
├── setup.py
├── requirements.txt
└── README.md
```
---
## 🎯 Tips & Troubleshooting
| Problem | Solution |
|---------|----------|
| Out of memory | Use `dtype=torch.float16` (default). If still OOM, use `--sample-step 2` or smaller images. |
| Depth looks wrong / flat | DepthPro works best on images with perspective (indoor rooms, outdoor scenes). Very flat macro shots may under-estimate depth. |
| Point cloud is noisy at edges | Depth has uncertainty at object boundaries. Use `sample_step=2` or filter by `confidence` if you saved it. |
| Focal length seems off | DepthPro estimates FOV from image content. Very unusual aspect ratios or heavy cropping can confuse it. You can override with your own `focal_length` in `depth_to_point_cloud()`. |
| Want a mesh, not a point cloud | Feed the point cloud into a surface-reconstruction method: Poisson (Open3D), Alpha shapes, or better yet **[NKSR](https://huggingface.co/bdck/nksr-wrapper)** for neural surface reconstruction. |
| Batch processing is slow | Use `estimate_batch()` with batch size 4–8 instead of looping over `estimate()`. |
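
For the noisy-edges case, a boolean mask built from the confidence map can be passed as `mask=` to `depth_to_point_cloud()`. The sketch below assumes that higher confidence means more reliable and that values lie roughly in `[0, 1]`; both thresholds are arbitrary placeholders to tune per scene, and `confidence_mask` is a hypothetical helper.

```python
import numpy as np


def confidence_mask(confidence: np.ndarray, depth: np.ndarray,
                    min_conf: float = 0.5, max_depth: float = 50.0) -> np.ndarray:
    """Boolean (H, W) mask of pixels considered reliable."""
    valid = np.isfinite(depth) & (depth > 0) & (depth <= max_depth)
    return valid & (confidence >= min_conf)


conf = np.array([[0.9, 0.2], [0.7, 0.95]])
depth = np.array([[1.5, 2.0], [80.0, 3.0]])
mask = confidence_mask(conf, depth)
print(mask)
# With the wrapper: depth_to_point_cloud(depth, focal_length, mask=mask)
```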
---
## 🔗 Citation
If you use DepthPro in your research, please cite the original paper:
```bibtex
@article{depthpro2024,
title={Depth Pro: Sharp Monocular Metric Depth in Less Than a Second},
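author={Bochkovskii, Aleksei and Delaunoy, Ama{\"e}l and Germain, Hugo and Santos, Marcel and Zhou, Yichao and Richter, Stephan R. and Koltun, Vladlen},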
journal={arXiv preprint arXiv:2410.02073},
year={2024}
}
```
Original code: [https://github.com/apple/ml-depth-pro](https://github.com/apple/ml-depth-pro)
HuggingFace model: [https://huggingface.co/apple/DepthPro-hf](https://huggingface.co/apple/DepthPro-hf)
---
## 📄 License
This wrapper is released under the MIT License. DepthPro itself is under Apple's own license (see the original repository).
---
*Built with ❤️ on top of Apple's DepthPro.*