# DepthPro Wrapper: Image to Point Cloud

[![Paper](https://img.shields.io/badge/Paper-arXiv%3A2410.02073-blue)](https://arxiv.org/abs/2410.02073) [![Model](https://img.shields.io/badge/🤗%20Model-apple%2FDepthPro--hf-yellow)](https://huggingface.co/apple/DepthPro-hf) [![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)

A **clean, drop-in Python wrapper** around Apple's **DepthPro** (arXiv:2410.02073) that turns a single RGB image into a **metric depth map** and, if you want, a **3D point cloud**, with **zero calibration**.

DepthPro is a 952 M-parameter ViT-L model that predicts **absolute metric depth** (meters, not relative) and estimates the **camera focal length** and **field of view** automatically. No camera intrinsics, no per-scene training, no LiDAR required.

---

## 🚀 Quick start (10 lines)

```python
from depthpro_wrapper import DepthProEstimator, rgbd_to_point_cloud, save_point_cloud

# 1. Load model (~2 GB download on first run)
estimator = DepthProEstimator(device="cuda:0")

# 2. Drop an image in
result = estimator.estimate("photo.jpg")
print(f"Focal length: {result.focal_length:.1f} px")
print(f"Depth range: {result.depth.min():.2f} – {result.depth.max():.2f} m")

# 3. Get a coloured point cloud out
points, colors = rgbd_to_point_cloud(
    result.depth, result.image, result.focal_length
)

# 4. Save as PLY
save_point_cloud("scene.ply", points, colors=colors)
```

Or use the CLI:

```bash
python scripts/image_to_pointcloud.py photo.jpg scene.ply --colored --normals
```

---

## 📦 Installation

```bash
# 1. Core dependencies
pip install torch torchvision transformers pillow numpy

# 2. Install this wrapper
pip install -e .
```

DepthPro is a large ViT-L model (~2 GB). The weights download automatically from HuggingFace the first time you instantiate `DepthProEstimator`.

> **GPU strongly recommended.** The model runs in ~0.3 s on a modern GPU; CPU inference is possible but extremely slow.

---

## 🔬 How it works (the full pipeline)

Here is what happens under the hood when you call `estimate()`.

### Step 1: Preprocessing (handled automatically)

Your input image is resized to **1536 × 1536** (the fixed operating resolution DepthPro was trained on), rescaled by `1/255`, and normalised to `[-1, 1]` with `mean=0.5, std=0.5`. This is done by `DepthProImageProcessorFast`.

### Step 2: Feature extraction (DINOv2 ViT-L + multi-scale patches)

DepthPro's backbone is a **DINOv2 ViT-L/16** (24 layers, 1024 hidden dim, 16×16 patch size). It processes the image at three scales simultaneously:

| Scale | Resolution | Purpose |
|-------|------------|---------|
| 0.25× | 384×384 | Global context, far-away geometry |
| 0.5× | 768×768 | Mid-range structure |
| 1.0× | 1536×1536 | Fine detail, edges, thin structures |

The three-scale features are fused with a **DPT-style decoder** (hidden size 256) into a dense feature map.

### Step 3: Depth prediction (canonical inverse depth)

The decoder outputs **canonical inverse depth** `C`. This is not the final metric depth yet: it is inverse depth normalised by the camera's field of view (focal length relative to image width), which the network learns to predict robustly across scenes and cameras. The actual metric depth is recovered in the post-processing step.

### Step 4: FOV / focal-length estimation (no calibration needed)

DepthPro has a **dedicated FOV head**. It ingests frozen features from the depth network plus task-specific features from a separate ViT encoder to predict the **horizontal field of view (FOV)** in degrees. From the FOV, the focal length in pixels is derived:

```
focal_length = (image_width / 2) / tan(FOV / 2)
```

This is the critical piece that makes the depth **metric** (in meters) rather than just up-to-scale.
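
As a quick numeric check of the FOV-to-focal-length relation above, here is a minimal sketch in plain Python. The function names `focal_length_from_fov` and `fov_from_focal_length` are illustrative and not part of the wrapper; the sketch assumes a horizontal FOV given in degrees and an ideal pinhole camera.

```python
import math


def focal_length_from_fov(fov_deg: float, image_width: int) -> float:
    """Focal length in pixels from a horizontal FOV in degrees (pinhole model)."""
    if not 0.0 < fov_deg < 180.0:
        raise ValueError("fov_deg must lie strictly between 0 and 180 degrees")
    half_fov_rad = math.radians(fov_deg) / 2.0
    return (image_width / 2.0) / math.tan(half_fov_rad)


def fov_from_focal_length(focal_length_px: float, image_width: int) -> float:
    """Inverse relation: horizontal FOV in degrees from a focal length in pixels."""
    return math.degrees(2.0 * math.atan((image_width / 2.0) / focal_length_px))


# A 90-degree FOV on a 1536-pixel-wide image gives a focal length of
# about half the image width, because tan(45 degrees) is 1.
f_px = focal_length_from_fov(90.0, 1536)
print(f"focal length: {f_px:.1f} px")
```

A wider FOV gives a shorter focal length in pixels, so an over-estimated FOV makes the recovered scene smaller in metric terms.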

### Step 5: Post-processing (metric depth + back-projection)

The processor converts canonical inverse depth `C` to metric depth `D_m` using the estimated focal length and the image width:

```
D_m = focal_length / (image_width × C)
```

Then, using the **pinhole camera model**, every pixel `(u, v)` is back-projected to a 3D point:

```
Z = D_m[v, u]
X = (u - cx) * Z / focal_length
Y = (v - cy) * Z / focal_length
```

where `(cx, cy)` is the principal point (image centre by default). DepthPro assumes **square pixels** (`fx == fy`), which is standard for most modern cameras.

### Result

You get:

* `depth`: (H, W) metric depth map in **meters**
* `focal_length`: estimated focal length in **pixels**
* `field_of_view`: estimated horizontal FOV in **degrees**
* `points`: (N, 3) 3D point cloud in camera coordinates (`+Z` forward), produced by `depth_to_point_cloud()` or `rgbd_to_point_cloud()`

---

## 🧰 API Reference

### `DepthProEstimator`

```python
import torch
from depthpro_wrapper import DepthProEstimator

# Constructor arguments shown with their default values
estimator = DepthProEstimator(
    model_name="apple/DepthPro-hf",
    device="cuda:0",
    dtype=torch.float16,
)
```

* `model_name`: HuggingFace model ID or local path.
* `device`: PyTorch device. CUDA strongly recommended.
* `dtype`: `torch.float16` (default, fast) or `torch.float32` (slightly higher precision).

#### `.estimate(image)`

```python
result = estimator.estimate(
    image,                    # str, Path, PIL.Image, or np.ndarray
    return_confidence=False,
)
```

Returns a `DepthResult` dataclass:

| Attribute | Shape | Description |
|-----------|-------|-------------|
| `depth` | (H, W) | Metric depth in meters (float32) |
| `focal_length` | scalar | Estimated focal length in pixels |
| `field_of_view` | scalar | Estimated horizontal FOV in degrees |
| `image` | (H, W, 3) | Original RGB image (uint8) |
| `confidence` | (H, W) or None | Per-pixel confidence (if requested) |
| `height`, `width` | scalars | Convenience properties |

#### `.estimate_batch(images)`

Process multiple images in a single forward pass for efficiency:

```python
results = estimator.estimate_batch(["a.jpg", "b.jpg", "c.jpg"])
for r in results:
    print(r.depth.shape)
```

### `depth_to_point_cloud(depth, focal_length, ...)`

```python
from depthpro_wrapper import depth_to_point_cloud

points = depth_to_point_cloud(
    depth=result.depth,          # (H, W) metric depth
    focal_length=result.focal_length,
    principal_point=None,        # default = image centre
    mask=None,                   # optional boolean mask
    sample_step=1,               # 2 = 1/4 points, 4 = 1/16
)
```

Returns `(N, 3)` float32 array of 3D points in camera coordinates.

### `rgbd_to_point_cloud(depth, rgb, focal_length, ...)`

Same as above but also returns per-point RGB colours:

```python
points, colors = rgbd_to_point_cloud(
    result.depth,
    result.image,
    result.focal_length,
    sample_step=2,
)
```

Returns `(N, 3)` points and `(N, 3)` uint8 colours.

### `normals_from_depth(depth, focal_length)`

Compute surface normals directly from the depth map (useful for feeding into surface-reconstruction pipelines like Poisson or NKSR):

```python
from depthpro_wrapper import normals_from_depth

normals = normals_from_depth(result.depth, result.focal_length)
```

Returns `(H, W, 3)` float32 unit normals (unoriented).
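
To make the geometry behind these helpers concrete, here is a minimal NumPy sketch of pinhole back-projection and depth-based normals. It is not the wrapper's actual implementation: `backproject_depth` and `normals_from_depth_sketch` are hypothetical names, the principal point is assumed to be `(W / 2, H / 2)`, and the normals come from a simple finite-difference cross product.

```python
import numpy as np


def backproject_depth(depth, focal_length, principal_point=None, sample_step=1):
    """Back-project an (H, W) metric depth map to (N, 3) camera-space points."""
    h, w = depth.shape
    # Assumption: the principal point defaults to the image centre.
    cx, cy = principal_point if principal_point is not None else (w / 2.0, h / 2.0)
    # Pixel grid, optionally sub-sampled (step 2 keeps roughly 1/4 of the pixels).
    v, u = np.mgrid[0:h:sample_step, 0:w:sample_step]
    z = depth[v, u].astype(np.float64)
    x = (u - cx) * z / focal_length
    y = (v - cy) * z / focal_length
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    # Drop pixels whose depth is non-finite or non-positive.
    valid = np.isfinite(points[:, 2]) & (points[:, 2] > 0)
    return points[valid].astype(np.float32)


def normals_from_depth_sketch(depth, focal_length):
    """Estimate (H, W, 3) unit normals from finite differences of the 3D points."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    z = depth.astype(np.float64)
    xyz = np.stack(
        [(u - w / 2.0) * z / focal_length, (v - h / 2.0) * z / focal_length, z],
        axis=-1,
    )
    # Tangent vectors along image columns (u) and rows (v).
    du = np.gradient(xyz, axis=1)
    dv = np.gradient(xyz, axis=0)
    n = np.cross(du, dv)
    length = np.linalg.norm(n, axis=-1, keepdims=True)
    return (n / np.maximum(length, 1e-12)).astype(np.float32)


# Usage on a synthetic fronto-parallel plane 2 m from the camera.
plane = np.full((4, 6), 2.0, dtype=np.float32)
pts = backproject_depth(plane, focal_length=100.0)
nrm = normals_from_depth_sketch(plane, focal_length=100.0)
print(pts.shape, nrm.shape)
```

The sign of the cross product is a convention, so the normals here may need flipping toward the camera before being used for surface reconstruction.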

### `save_point_cloud(path, points, colors=None, normals=None)`

Save a point cloud to an ASCII PLY file (readable by Open3D, MeshLab, CloudCompare, Blender, etc.):

```python
from depthpro_wrapper import save_point_cloud

save_point_cloud("cloud.ply", points, colors=colors, normals=normals)
```

---

## 🖥️ CLI Usage

```bash
# Basic: image → point cloud
python scripts/image_to_pointcloud.py photo.jpg cloud.ply

# With colours and normals
python scripts/image_to_pointcloud.py photo.jpg cloud.ply --colored --normals

# Down-sample for faster processing / smaller files
python scripts/image_to_pointcloud.py photo.jpg cloud.ply --sample-step 2

# Save intermediate depth & confidence maps
python scripts/image_to_pointcloud.py photo.jpg cloud.ply \
    --save-depth depth.npy --save-confidence conf.npy

# CPU fallback (very slow)
python scripts/image_to_pointcloud.py photo.jpg cloud.ply --device cpu --dtype float32
```

---

## 📂 Repository layout

```
depthpro-wrapper/
├── depthpro_wrapper/
│   ├── __init__.py              # public API
│   ├── depth_estimator.py       # DepthProEstimator + DepthResult
│   ├── point_cloud.py           # back-projection + normal estimation
│   └── io.py                    # image / PLY I/O helpers
├── scripts/
│   └── image_to_pointcloud.py   # CLI entry point
├── examples/
│   ├── quickstart.py            # 10-line minimal example
│   └── batch_processing.py      # folder-of-images batch script
├── setup.py
├── requirements.txt
└── README.md
```

---

## 🎯 Tips & Troubleshooting

| Problem | Solution |
|---------|----------|
| Out of memory | Use `dtype=torch.float16` (default). If still OOM, use `--sample-step 2` or smaller images. |
| Depth looks wrong / flat | DepthPro works best on images with perspective (indoor rooms, outdoor scenes). Very flat macro shots may under-estimate depth. |
| Point cloud is noisy at edges | Depth has uncertainty at object boundaries. Use `sample_step=2` or filter by `confidence` if you saved it. |
| Focal length seems off | DepthPro estimates FOV from image content. Very unusual aspect ratios or heavy cropping can confuse it. You can override with your own `focal_length` in `depth_to_point_cloud()`. |
| Want a mesh, not a point cloud | Feed the point cloud into a surface-reconstruction method: Poisson (Open3D), Alpha shapes, or better yet **[NKSR](https://huggingface.co/bdck/nksr-wrapper)** for neural surface reconstruction. |
| Batch processing is slow | Use `estimate_batch()` with batch size 4–8 instead of looping over `estimate()`. |

---

## 🔗 Citation

If you use DepthPro in your research, please cite the original paper:

```bibtex
@article{depthpro2024,
  title={Depth Pro: Sharp Monocular Metric Depth in Less Than a Second},
  author={Bochkovskii, Aleksei and Delaunoy, Ama{\"e}l and Germain, Hugo and Santos, Marcel and Zhou, Yichao and Richter, Stephan R. and Koltun, Vladlen},
  journal={arXiv preprint arXiv:2410.02073},
  year={2024}
}
```

Original code: [https://github.com/apple/ml-depth-pro](https://github.com/apple/ml-depth-pro)

HuggingFace model: [https://huggingface.co/apple/DepthPro-hf](https://huggingface.co/apple/DepthPro-hf)

---

## 📄 License

This wrapper is released under the MIT License. DepthPro itself is under Apple's own license (see the original repository).

---

*Built with ❤️ on top of Apple's DepthPro.*