# DepthPro Wrapper: Image to Point Cloud

[Paper (arXiv:2410.02073)](https://arxiv.org/abs/2410.02073)
[Model (HuggingFace)](https://huggingface.co/apple/DepthPro-hf)
[License](LICENSE)

A **clean, drop-in Python wrapper** around Apple's **DepthPro** (arXiv:2410.02073) that turns a single RGB image into a **metric depth map** and, if you want, a **3D point cloud**, with **zero calibration**.

DepthPro is a 952 M-parameter ViT-L model that predicts **absolute metric depth** (meters, not relative) and estimates the **camera focal length** and **field of view** automatically. No camera intrinsics, no per-scene training, no LiDAR required.

---

## Quick start (10 lines)

```python
from depthpro_wrapper import DepthProEstimator, rgbd_to_point_cloud, save_point_cloud

# 1. Load model (~2 GB download on first run)
estimator = DepthProEstimator(device="cuda:0")

# 2. Drop an image in
result = estimator.estimate("photo.jpg")
print(f"Focal length: {result.focal_length:.1f} px")
print(f"Depth range: {result.depth.min():.2f} – {result.depth.max():.2f} m")

# 3. Get a coloured point cloud out
points, colors = rgbd_to_point_cloud(
    result.depth, result.image, result.focal_length
)

# 4. Save as PLY
save_point_cloud("scene.ply", points, colors=colors)
```

Or use the CLI:

```bash
python scripts/image_to_pointcloud.py photo.jpg scene.ply --colored --normals
```

---

## Installation

```bash
# 1. Core dependencies
pip install torch torchvision transformers pillow numpy

# 2. Install this wrapper
pip install -e .
```

DepthPro is a large ViT-L model (~2 GB). The weights download automatically from HuggingFace the first time you instantiate `DepthProEstimator`.

> **GPU strongly recommended.** The model runs in ~0.3 s on a modern GPU; CPU inference is possible but extremely slow.

---

## How it works (the full pipeline)

Here is exactly what happens under the hood when you call `estimate()`.

### Step 1: Preprocessing (handled automatically)
Your input image is resized to **1536 × 1536** (the fixed operating resolution DepthPro was trained on), rescaled by `1/255`, and normalised to `[-1, 1]` with `mean=0.5, std=0.5`. This is done by `DepthProImageProcessorFast`.

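To make the three preprocessing operations concrete, here is a minimal NumPy/Pillow sketch of the same arithmetic. It is not the code path the wrapper uses (that is `DepthProImageProcessorFast`); the function name `preprocess` and the choice of bilinear resampling are assumptions made for illustration.

```python
import numpy as np
from PIL import Image

TARGET_SIZE = 1536  # operating resolution described above


def preprocess(image: Image.Image) -> np.ndarray:
    """Resize, rescale by 1/255, normalise to [-1, 1]; returns (3, H, W) float32."""
    # Bilinear resampling is an assumption; the real processor may differ.
    resized = image.convert("RGB").resize((TARGET_SIZE, TARGET_SIZE), Image.BILINEAR)
    pixels = np.asarray(resized, dtype=np.float32) / 255.0  # now in [0, 1]
    normalised = (pixels - 0.5) / 0.5                        # mean=0.5, std=0.5
    return normalised.transpose(2, 0, 1)                     # channels first


demo = Image.new("RGB", (640, 480), color=(255, 0, 128))
tensor = preprocess(demo)
print(tensor.shape, tensor.min(), tensor.max())
```
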
### Step 2: Feature extraction (DINOv2 ViT-L + multi-scale patches)
DepthPro's backbone is a **DINOv2 ViT-L/16** (24 layers, 1024 hidden dim, 16×16 patch size). It processes the image at three scales simultaneously:

| Scale | Resolution | Purpose |
|-------|-----------|---------|
| 0.25× | 384×384 | Global context, far-away geometry |
| 0.5× | 768×768 | Mid-range structure |
| 1.0× | 1536×1536 | Fine detail, edges, thin structures |

The three-scale features are fused with a **DPT-style decoder** (hidden size 256) into a dense feature map.

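The three input resolutions in the table can be reproduced with a simple image pyramid. The sketch below only illustrates the scale arithmetic; how DepthPro downsamples internally is not described here, so average pooling and the helper names `average_pool` and `build_pyramid` are assumptions.

```python
import numpy as np


def average_pool(image: np.ndarray, factor: int) -> np.ndarray:
    """Downsample an (H, W, C) image by an integer factor using block averaging."""
    h, w, c = image.shape
    blocks = image.reshape(h // factor, factor, w // factor, factor, c)
    return blocks.mean(axis=(1, 3))


def build_pyramid(image: np.ndarray) -> dict:
    """Return the 1.0x, 0.5x and 0.25x versions of a 1536 x 1536 input."""
    return {1.0: image, 0.5: average_pool(image, 2), 0.25: average_pool(image, 4)}


image = np.random.rand(1536, 1536, 3).astype(np.float32)
pyramid = build_pyramid(image)
for scale, level in pyramid.items():
    print(f"{scale}x -> {level.shape[1]}x{level.shape[0]}")
```
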
### Step 3: Depth prediction (canonical inverse depth)
The decoder outputs **canonical inverse depth** `C`. This is not the final metric depth yet; it is a scale-invariant representation that the network learns to predict robustly across scenes. The actual metric depth is recovered in the post-processing step.

### Step 4: FOV / focal-length estimation (no calibration needed)
DepthPro has a **dedicated FOV head**. It ingests frozen features from the depth network plus task-specific features from a separate ViT encoder to predict the **horizontal field of view (FOV)** in degrees.

From the FOV, the focal length in pixels is derived:

```
focal_length = (image_width / 2) / tan(FOV / 2)
```

This is the critical piece that makes the depth **metric** (in meters) rather than just up-to-scale.

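As a quick check of the FOV-to-focal-length formula, here is a small standalone function. The name `focal_length_from_fov` is hypothetical and not part of the wrapper; note the conversion from degrees to radians before calling `tan`.

```python
import math


def focal_length_from_fov(fov_degrees: float, image_width: int) -> float:
    """Focal length in pixels from a horizontal FOV given in degrees."""
    half_fov = math.radians(fov_degrees) / 2.0
    return (image_width / 2.0) / math.tan(half_fov)


# A 90-degree horizontal FOV gives a focal length of about half the image width.
wide = focal_length_from_fov(90.0, 1536)
narrow = focal_length_from_fov(60.0, 1536)
print(f"{wide:.1f} px, {narrow:.1f} px")
```

A narrower field of view yields a longer focal length for the same image width, which is why cropped or zoomed-in photos end up with larger `focal_length` values.
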
### Step 5: Post-processing (metric depth + back-projection)
The processor converts canonical inverse depth `C` to metric depth `D_m` using the estimated focal length:

```
D_m = focal_length / (image_width × C)
```

Then, using the **pinhole camera model**, every pixel `(u, v)` is back-projected to a 3D point:

```
Z = D_m[v, u]
X = (u - cx) * Z / focal_length
Y = (v - cy) * Z / focal_length
```

where `(cx, cy)` is the principal point (image centre by default). DepthPro assumes **square pixels** (`fx == fy`), which is standard for most modern cameras.

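The pinhole back-projection can be written in a few vectorised NumPy lines. This is a sketch, not the wrapper's implementation: the function name `back_project` is hypothetical, and taking the image centre as `(width / 2, height / 2)` is an assumption about the default principal point.

```python
import numpy as np


def back_project(depth, focal_length, principal_point=None):
    """Back-project an (H, W) metric depth map to (H*W, 3) camera-space points."""
    h, w = depth.shape
    # Assumed default: principal point at the image centre.
    cx, cy = principal_point if principal_point is not None else (w / 2.0, h / 2.0)
    v, u = np.mgrid[0:h, 0:w].astype(np.float32)  # v = row index, u = column index
    z = depth
    x = (u - cx) * z / focal_length
    y = (v - cy) * z / focal_length
    return np.stack([x, y, z], axis=-1).reshape(-1, 3).astype(np.float32)


depth = np.full((4, 6), 2.0, dtype=np.float32)  # a flat wall 2 m away
points = back_project(depth, focal_length=3.0)
print(points.shape)
```
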
### Result
You get:
* `depth`: (H, W) metric depth map in **meters**
* `focal_length`: estimated focal length in **pixels**
* `field_of_view`: estimated horizontal FOV in **degrees**
* `points`: (N, 3) 3D point cloud in camera coordinates (`+Z` forward), once you back-project with `depth_to_point_cloud()` or `rgbd_to_point_cloud()`

---

## API Reference

### `DepthProEstimator`

```python
class DepthProEstimator(
    model_name="apple/DepthPro-hf",
    device="cuda:0",
    dtype=torch.float16,
)
```

* `model_name`: HuggingFace model ID or local path.
* `device`: PyTorch device. CUDA strongly recommended.
* `dtype`: `torch.float16` (default, fast) or `torch.float32` (slightly higher precision).

#### `.estimate(image)`

```python
result = estimator.estimate(
    image,                    # str, Path, PIL.Image, or np.ndarray
    return_confidence=False,
)
```

Returns a `DepthResult` dataclass:

| Attribute | Shape | Description |
|-----------|-------|-------------|
| `depth` | (H, W) | Metric depth in meters (float32) |
| `focal_length` | scalar | Estimated focal length in pixels |
| `field_of_view` | scalar | Estimated horizontal FOV in degrees |
| `image` | (H, W, 3) | Original RGB image (uint8) |
| `confidence` | (H, W) or None | Per-pixel confidence (if requested) |
| `height`, `width` | scalars | Convenience properties |

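For orientation, the table maps naturally onto a small dataclass. The sketch below is a hypothetical stand-in named `DepthResultSketch`, not the wrapper's actual `DepthResult` definition; it assumes `height` and `width` are read from the depth map's shape.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class DepthResultSketch:
    """Illustrative container mirroring the attributes listed above."""
    depth: np.ndarray                         # (H, W) float32, meters
    focal_length: float                       # pixels
    field_of_view: float                      # degrees, horizontal
    image: np.ndarray                         # (H, W, 3) uint8
    confidence: Optional[np.ndarray] = None   # (H, W) or None

    @property
    def height(self) -> int:
        return int(self.depth.shape[0])

    @property
    def width(self) -> int:
        return int(self.depth.shape[1])


example = DepthResultSketch(
    depth=np.ones((480, 640), dtype=np.float32),
    focal_length=500.0,
    field_of_view=65.0,
    image=np.zeros((480, 640, 3), dtype=np.uint8),
)
print(example.height, example.width, example.confidence)
```
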
#### `.estimate_batch(images)`

Process multiple images in a single forward pass for efficiency:

```python
results = estimator.estimate_batch(["a.jpg", "b.jpg", "c.jpg"])
for r in results:
    print(r.depth.shape)
```

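For a long list of images you would typically split the paths into fixed-size groups and pass each group to `estimate_batch()`. The helper below is a hypothetical utility, not part of the wrapper, and the file names are placeholders.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive groups of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


paths = [f"frame_{i:03d}.jpg" for i in range(10)]  # placeholder file names
batches = list(chunked(paths, 4))
for batch in batches:
    # In real use: results = estimator.estimate_batch(batch)
    print(len(batch), batch[0])
```
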
### `depth_to_point_cloud(depth, focal_length, ...)`

```python
from depthpro_wrapper import depth_to_point_cloud

points = depth_to_point_cloud(
    depth=result.depth,                # (H, W) metric depth
    focal_length=result.focal_length,
    principal_point=None,              # default = image centre
    mask=None,                         # optional boolean mask
    sample_step=1,                     # 2 = 1/4 points, 4 = 1/16
)
```

Returns `(N, 3)` float32 array of 3D points in camera coordinates.

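To see why `sample_step=2` keeps a quarter of the points, consider which pixels survive subsampling and masking. This is a sketch under assumptions: `select_pixels` is a hypothetical helper, and the wrapper may combine `mask` and `sample_step` differently. The returned coordinates stay in full-resolution pixel units, so the focal length does not need rescaling.

```python
import numpy as np


def select_pixels(depth, mask=None, sample_step=1):
    """Return (u, v) pixel coordinates kept after striding and optional masking."""
    keep = np.zeros(depth.shape, dtype=bool)
    keep[::sample_step, ::sample_step] = True  # every sample_step-th row and column
    if mask is not None:
        keep &= mask
    v, u = np.nonzero(keep)
    return u, v


depth = np.ones((8, 8), dtype=np.float32)
u_all, _ = select_pixels(depth)
u_half, _ = select_pixels(depth, sample_step=2)
u_quarter, _ = select_pixels(depth, sample_step=4)

left_half = np.zeros((8, 8), dtype=bool)
left_half[:, :4] = True
u_masked, _ = select_pixels(depth, mask=left_half, sample_step=2)
print(len(u_all), len(u_half), len(u_quarter), len(u_masked))
```
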
### `rgbd_to_point_cloud(depth, rgb, focal_length, ...)`

Same as above but also returns per-point RGB colours:

```python
points, colors = rgbd_to_point_cloud(
    result.depth, result.image, result.focal_length,
    sample_step=2,
)
```

Returns `(N, 3)` points and `(N, 3)` uint8 colours.

### `normals_from_depth(depth, focal_length)`

Compute surface normals directly from the depth map (useful for feeding into surface-reconstruction pipelines like Poisson or NKSR):

```python
from depthpro_wrapper import normals_from_depth
normals = normals_from_depth(result.depth, result.focal_length)
```

Returns `(H, W, 3)` float32 unit normals (unoriented).

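One common way to get normals from a depth map is to back-project every pixel and take the cross product of the horizontal and vertical finite differences. The sketch below shows that technique; it is not necessarily how `normals_from_depth` is implemented, and the name `normals_from_depth_sketch` and the centred principal point are assumptions.

```python
import numpy as np


def normals_from_depth_sketch(depth, focal_length):
    """Estimate (H, W, 3) unit normals from an (H, W) metric depth map."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w].astype(np.float64)
    x = (u - w / 2.0) * depth / focal_length
    y = (v - h / 2.0) * depth / focal_length
    points = np.stack([x, y, depth], axis=-1)  # (H, W, 3) camera-space points

    d_du = np.gradient(points, axis=1)         # change along image columns
    d_dv = np.gradient(points, axis=0)         # change along image rows
    normals = np.cross(d_du, d_dv)
    length = np.linalg.norm(normals, axis=-1, keepdims=True)
    return (normals / np.maximum(length, 1e-12)).astype(np.float32)


flat_wall = np.full((16, 16), 2.0)
normals = normals_from_depth_sketch(flat_wall, focal_length=100.0)
print(normals[8, 8])
```

For a wall facing the camera the normals come out parallel to the Z axis. Their sign depends only on the cross-product order, which is why such normals are described as unoriented.
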
### `save_point_cloud(path, points, colors=None, normals=None)`

Save a point cloud to an ASCII PLY file (readable by Open3D, MeshLab, CloudCompare, Blender, etc.):

```python
from depthpro_wrapper import save_point_cloud
save_point_cloud("cloud.ply", points, colors=colors, normals=normals)
```

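If you are curious what ends up in the file, an ASCII PLY is just a short text header followed by one line per vertex. The writer below is a minimal sketch with a hypothetical name (`write_ascii_ply`); the property order (position, normals, colours) is a common convention and an assumption about what `save_point_cloud` emits.

```python
import os
import tempfile

import numpy as np


def write_ascii_ply(path, points, colors=None, normals=None):
    """Write an ASCII PLY file with optional per-point normals and colours."""
    header = [
        "ply",
        "format ascii 1.0",
        f"element vertex {len(points)}",
        "property float x",
        "property float y",
        "property float z",
    ]
    if normals is not None:
        header += ["property float nx", "property float ny", "property float nz"]
    if colors is not None:
        header += ["property uchar red", "property uchar green", "property uchar blue"]
    header.append("end_header")

    with open(path, "w", encoding="ascii") as handle:
        handle.write("\n".join(header) + "\n")
        for i in range(len(points)):
            fields = [f"{value:.6f}" for value in points[i]]
            if normals is not None:
                fields += [f"{value:.6f}" for value in normals[i]]
            if colors is not None:
                fields += [str(int(value)) for value in colors[i]]
            handle.write(" ".join(fields) + "\n")


points = np.array([[0.0, 0.0, 1.0], [0.5, -0.5, 2.0]], dtype=np.float32)
colors = np.array([[255, 0, 0], [0, 128, 255]], dtype=np.uint8)
ply_path = os.path.join(tempfile.mkdtemp(), "demo.ply")
write_ascii_ply(ply_path, points, colors=colors)
with open(ply_path, encoding="ascii") as handle:
    lines = handle.read().splitlines()
print("\n".join(lines))
```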
---

## CLI Usage

```bash
# Basic: image to point cloud
python scripts/image_to_pointcloud.py photo.jpg cloud.ply

# With colours and normals
python scripts/image_to_pointcloud.py photo.jpg cloud.ply --colored --normals

# Down-sample for faster processing / smaller files
python scripts/image_to_pointcloud.py photo.jpg cloud.ply --sample-step 2

# Save intermediate depth & confidence maps
python scripts/image_to_pointcloud.py photo.jpg cloud.ply \
    --save-depth depth.npy --save-confidence conf.npy

# CPU fallback (very slow)
python scripts/image_to_pointcloud.py photo.jpg cloud.ply --device cpu --dtype float32
```

---

## Repository layout

```
depthpro-wrapper/
├── depthpro_wrapper/
│   ├── __init__.py              # public API
│   ├── depth_estimator.py       # DepthProEstimator + DepthResult
│   ├── point_cloud.py           # back-projection + normal estimation
│   └── io.py                    # image / PLY I/O helpers
├── scripts/
│   └── image_to_pointcloud.py   # CLI entry point
├── examples/
│   ├── quickstart.py            # 10-line minimal example
│   └── batch_processing.py      # folder-of-images batch script
├── setup.py
├── requirements.txt
└── README.md
```

---

## Tips & Troubleshooting

| Problem | Solution |
|---------|----------|
| Out of memory | Use `dtype=torch.float16` (default). If still OOM, use `--sample-step 2` or smaller images. |
| Depth looks wrong / flat | DepthPro works best on images with perspective (indoor rooms, outdoor scenes). Very flat macro shots may under-estimate depth. |
| Point cloud is noisy at edges | Depth has uncertainty at object boundaries. Use `sample_step=2` or filter by `confidence` if you saved it. |
| Focal length seems off | DepthPro estimates FOV from image content. Very unusual aspect ratios or heavy cropping can confuse it. You can override with your own `focal_length` in `depth_to_point_cloud()`. |
| Want a mesh, not a point cloud | Feed the point cloud into a surface-reconstruction method: Poisson (Open3D), Alpha shapes, or better yet **[NKSR](https://huggingface.co/bdck/nksr-wrapper)** for neural surface reconstruction. |
| Batch processing is slow | Use `estimate_batch()` with batch size 4–8 instead of looping over `estimate()`. |

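For the noisy-edges case, one way to filter by confidence is to build a boolean mask and pass it as the `mask` argument of `depth_to_point_cloud()`. The value range of the confidence map is not documented here, so this sketch keeps a fixed fraction of the most confident pixels instead of using an absolute threshold; `keep_most_confident` is a hypothetical helper and the synthetic confidence map is only for demonstration.

```python
import numpy as np


def keep_most_confident(confidence, keep_fraction=0.8):
    """Boolean (H, W) mask selecting the most confident fraction of pixels."""
    cutoff = np.quantile(confidence, 1.0 - keep_fraction)
    return confidence >= cutoff


# Synthetic confidence map with 100 distinct values, for demonstration only.
confidence = np.arange(100, dtype=np.float64).reshape(10, 10) / 99.0
mask = keep_most_confident(confidence, keep_fraction=0.8)
print(mask.sum(), "of", mask.size, "pixels kept")
# In real use: points = depth_to_point_cloud(result.depth, result.focal_length, mask=mask)
```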
---

## Citation

If you use DepthPro in your research, please cite the original paper:

```bibtex
@article{depthpro2024,
  title={Depth Pro: Sharp Monocular Metric Depth in Less Than a Second},
  author={Bochkovskii, Aleksei and Delaunoy, Ama{\"e}l and Germain, Hugo and Santos, Marcel and Zhou, Yichao and Richter, Stephan R. and Koltun, Vladlen},
  journal={arXiv preprint arXiv:2410.02073},
  year={2024}
}
```

Original code: [https://github.com/apple/ml-depth-pro](https://github.com/apple/ml-depth-pro)
HuggingFace model: [https://huggingface.co/apple/DepthPro-hf](https://huggingface.co/apple/DepthPro-hf)

---

## License

This wrapper is released under the MIT License. DepthPro itself is under Apple's own license (see the original repository).

---

*Built with ❤️ on top of Apple's DepthPro.*