UniDepth Inference: Image → Metric Depth → Point Cloud
A Python wrapper around UniDepth (CVPR 2024 / V2 2025) that converts a single RGB image into a metric 3D point cloud, with no camera intrinsics required.
Paper: UniDepth: Universal Monocular Metric Depth Estimation (CVPR 2024)
V2 Paper: UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler
Original Code: github.com/lpiccinelli-eth/UniDepth
Pretrained Weights: lpiccinelli/unidepth-v2-vits14 · lpiccinelli/unidepth-v2-vitl14
Installation
You need the original unidepth package (which provides the model definitions) plus this wrapper:
# 1. Install the official UniDepth package from GitHub
pip install git+https://github.com/lpiccinelli-eth/UniDepth.git
# 2. Install this wrapper (or just copy the files into your project)
pip install .
Core dependencies: torch, torchvision, Pillow (PIL), numpy, huggingface_hub, safetensors
Quick Start
from PIL import Image
from unidepth.inference import UniDepth
# Load image
image = Image.open("room.jpg").convert("RGB")
# Load model (auto-downloads weights from HuggingFace)
model = UniDepth.from_pretrained("lpiccinelli/unidepth-v2-vits14", device="cuda")
# Run inference → get depth + point cloud
results = model(image)
# Metric depth map [H, W] in meters
depth = results["depth"]
# 3D point cloud [N, 3] in meters (only valid depth pixels)
points = results["points"]
# Predicted camera intrinsics K [3, 3]
intrinsics = results["intrinsics"]
# Save as PLY
from unidepth.inference import save_pointcloud_ply
save_pointcloud_ply("room.ply", points, colors=results["colors"])
Command-line:
python examples/image_to_pointcloud.py room.jpg --output room.ply --checkpoint lpiccinelli/unidepth-v2-vits14
How It Works Internally
UniDepth is a universal monocular metric depth estimator. Unlike most depth models that only output relative depth or require known camera intrinsics, UniDepth simultaneously predicts:
- Camera intrinsics (self-promptable camera module)
- Metric depth (in meters)
- 3D ray directions for every pixel
1. Pseudo-Spherical Output Representation
The key innovation is the pseudo-spherical representation (θ, φ, z_log) instead of the standard Cartesian (x, y, z):
- θ (azimuth) — horizontal angle of the camera ray
- φ (elevation) — vertical angle of the camera ray
- z_log — log-depth (metric)
Why? Because (x, y) in Cartesian backprojection entangles camera rays with depth:
x = (u - cx) / fx * z ← both camera AND depth
y = (v - cy) / fy * z ← both camera AND depth
By predicting angles (θ, φ) separately from z_log, the model naturally disentangles camera calibration from depth estimation. The two tasks don't interfere during training.
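The entanglement described above can be seen in a few lines of NumPy. This is a minimal sketch of standard pinhole backprojection, not code taken from the wrapper; the helper name `backproject_pinhole` and the toy values are made up for illustration.

```python
import numpy as np

def backproject_pinhole(depth, fx, fy, cx, cy):
    """Back-project an [H, W] depth map to an [H, W, 3] array of (x, y, z)."""
    h, w = depth.shape
    # u indexes columns, v indexes rows; both grids have shape [H, W].
    u, v = np.meshgrid(np.arange(w, dtype=np.float64),
                       np.arange(h, dtype=np.float64))
    x = (u - cx) / fx * depth  # depends on the camera AND the depth
    y = (v - cy) / fy * depth  # depends on the camera AND the depth
    return np.stack([x, y, depth], axis=-1)

# Same depth map, two different focal lengths (toy values).
depth = np.full((4, 6), 2.0)
pts_a = backproject_pinhole(depth, fx=100.0, fy=100.0, cx=3.0, cy=2.0)
pts_b = backproject_pinhole(depth, fx=200.0, fy=200.0, cx=3.0, cy=2.0)
```

With identical depth, doubling the focal length halves every x and y, so an error in the camera estimate directly distorts the Cartesian output even when depth is perfect.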
2. Self-Promptable Camera Module
The camera module bootstraps intrinsics from the image itself:
- Takes the ViT class tokens as initialization
- Runs 2 self-attention layers → predicts 4 scalars (Δfx, Δfy, Δcx, Δcy)
- Converts to absolute intrinsics (invariant to image size):
  fx = Δfx * W / 2
  fy = Δfy * H / 2
  cx = Δcx * W / 2
  cy = Δcy * H / 2
- Backprojects every pixel through K⁻¹ to get rays on the unit sphere
- Extracts azimuth/elevation from those rays → dense camera representation C
This means you don't need to know your camera's focal length — the model guesses it from visual cues (perspective lines, known object sizes, etc.).
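The scaling and backprojection steps listed in this section can be sketched in NumPy as follows. The function names `intrinsics_from_normalized` and `unit_rays` are hypothetical, the four input scalars are toy values rather than real network outputs, and using integer pixel coordinates (no half-pixel offset) is an assumption. The final azimuth/elevation extraction is left out because it depends on the axis convention the wrapper uses.

```python
import numpy as np

def intrinsics_from_normalized(dfx, dfy, dcx, dcy, width, height):
    """Scale the four size-invariant scalars to a 3x3 pixel-unit matrix K."""
    return np.array([
        [dfx * width / 2, 0.0, dcx * width / 2],
        [0.0, dfy * height / 2, dcy * height / 2],
        [0.0, 0.0, 1.0],
    ])

def unit_rays(K, width, height):
    """Back-project every pixel through K^-1 and normalize to unit length."""
    u, v = np.meshgrid(np.arange(width, dtype=np.float64),
                       np.arange(height, dtype=np.float64))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)  # [H, W, 3] homogeneous
    rays = pixels @ np.linalg.inv(K).T                   # row-wise K^-1 @ p
    return rays / np.linalg.norm(rays, axis=-1, keepdims=True)

# Toy scalars standing in for the camera module's predictions.
K = intrinsics_from_normalized(1.2, 1.6, 1.0, 1.0, width=640, height=480)
rays = unit_rays(K, 640, 480)  # [480, 640, 3]
```

In this toy case the principal point lands at the image center, so the ray through pixel (u=320, v=240) points straight along the optical axis.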
3. Depth Module
- Encoder features from DINOv2 ViT at 4 scales (H/14 × W/14 resolution)
- Each scale is cross-attention conditioned on the camera embeddings E = SHE(C), where SHE = Spherical Harmonic Encoding (128 channels from 64 harmonics per angle)
- FPN-style decoder with transposed-convolution upsampling
- Final output: Z_log upsampled to full (H, W) + 2 conv layers
4. Converting to Cartesian Point Cloud
The model outputs O = [θ, φ, Z] where Z = exp(Z_log). To get standard Cartesian coordinates (x, y, z):
x = Z * cos(φ) * cos(θ)
y = Z * cos(φ) * sin(θ)
z = Z * sin(φ)
This is what generate_pointcloud() does for you automatically.
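For reference, the three formulas above can be transcribed literally into NumPy. This is only an illustration of the math as written in this README; the helper name `spherical_to_cartesian` is hypothetical, and the axis convention inside `generate_pointcloud()` may differ, so check the wrapper source before relying on it.

```python
import numpy as np

def spherical_to_cartesian(theta, phi, z_log):
    """Apply the README formulas element-wise; inputs share one shape."""
    scale = np.exp(z_log)  # undo the log
    x = scale * np.cos(phi) * np.cos(theta)
    y = scale * np.cos(phi) * np.sin(theta)
    z = scale * np.sin(phi)
    return np.stack([x, y, z], axis=-1)

# Toy inputs: a 2x3 grid of angles and log-values.
rng = np.random.default_rng(0)
theta = rng.uniform(-0.5, 0.5, size=(2, 3))
phi = rng.uniform(-0.5, 0.5, size=(2, 3))
z_log = rng.uniform(0.0, 1.0, size=(2, 3))
points = spherical_to_cartesian(theta, phi, z_log)  # [2, 3, 3]
```

One consequence of these formulas is that each point's Euclidean norm equals exp(z_log), so the third component of O acts as a distance along the ray here.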
5. Handling Known Intrinsics (Optional)
If you already know your camera matrix K, you can bypass the predicted camera and use your own:
results = model(image, intrinsics=your_K_matrix)
This gives more accurate 3D reconstruction when intrinsics are reliable.
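If you only know the lens focal length and sensor width, a 3×3 matrix K can be assembled with the usual pinhole relations. The sketch below assumes square pixels and a principal point at the image center; `make_intrinsics` is a hypothetical helper, not part of the wrapper, and this README does not state whether `intrinsics=` expects a NumPy array or a torch tensor.

```python
import numpy as np

def make_intrinsics(focal_mm, sensor_width_mm, width, height):
    """Build a pinhole K in pixel units from physical lens/sensor data."""
    fx = focal_mm / sensor_width_mm * width  # focal length in pixels
    fy = fx                                  # assumption: square pixels
    cx, cy = width / 2.0, height / 2.0       # assumption: centered principal point
    return np.array([
        [fx, 0.0, cx],
        [0.0, fy, cy],
        [0.0, 0.0, 1.0],
    ])

# Example: 24 mm lens on a 36 mm wide sensor, 1920x1280 image.
K = make_intrinsics(24.0, 36.0, width=1920, height=1280)
```

The resulting K would then be passed as `model(image, intrinsics=K)`, converted to whatever array type the wrapper accepts.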
API Reference
UniDepth.from_pretrained(checkpoint, device="cuda")
Load a pretrained model from HuggingFace Hub.
| Checkpoint | Size | Speed | Accuracy |
|---|---|---|---|
| lpiccinelli/unidepth-v2-vits14 | 261 MB | Fast | Very Good |
| lpiccinelli/unidepth-v2-vitl14 | 2.6 GB | Slower | Best |
model(image, intrinsics=None) → dict
Run inference on a PIL Image or tensor.
Returns a dictionary with keys:
- "depth": [H, W] metric depth (meters)
- "confidence": [H, W] uncertainty (lower = more confident)
- "points": [N, 3] Cartesian point cloud (valid pixels only)
- "colors": [N, 3] RGB colors for each point
- "intrinsics": [3, 3] predicted camera matrix K
- "camera": [H, W, 2] predicted azimuth/elevation
generate_pointcloud(depth, camera, colors=None, mask=None, intrinsics=None)
Convert raw model outputs to a filtered point cloud.
save_pointcloud_ply(path, points, colors=None)
Save points (and optional colors) as an ASCII PLY file.
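To show what an ASCII PLY file contains, here is a minimal standalone writer. It is a sketch of the file format, not the wrapper's actual `save_pointcloud_ply` implementation; the name `write_ascii_ply` is hypothetical, and it assumes colors are integers in the 0–255 range.

```python
import numpy as np

def write_ascii_ply(path, points, colors=None):
    """Write [N, 3] points (and optional [N, 3] 0-255 colors) as ASCII PLY."""
    points = np.asarray(points, dtype=np.float32)
    header = [
        "ply",
        "format ascii 1.0",
        f"element vertex {len(points)}",
        "property float x",
        "property float y",
        "property float z",
    ]
    if colors is not None:
        colors = np.asarray(colors)
        header += [
            "property uchar red",
            "property uchar green",
            "property uchar blue",
        ]
    header.append("end_header")
    with open(path, "w") as f:
        f.write("\n".join(header) + "\n")
        # One vertex per line: x y z [r g b]
        for i, (x, y, z) in enumerate(points):
            line = f"{x:.6f} {y:.6f} {z:.6f}"
            if colors is not None:
                r, g, b = (int(c) for c in colors[i])
                line += f" {r} {g} {b}"
            f.write(line + "\n")
```

The resulting file opens in common viewers such as MeshLab or CloudCompare. For large clouds a binary PLY is much smaller, but ASCII is easier to inspect by hand.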
Model Variants
| | UniDepth V1 (CVPR 2024) | UniDepth V2 (2025) |
|---|---|---|
| Backbone | DINOv2 ViT-L | DINOv2 ViT-S / B / L |
| Camera encoding | Spherical Harmonics (81 coefficients) | Sine encoding (64 harmonics) |
| Output | (θ, φ, z_log) | (θ, φ, z_log) |
| Training data | 8 datasets | 23 datasets (16M images) |
| Losses | λ-MSE + Geometric Invariance | + Edge-Guided SSI + Confidence |
This wrapper works with both V1 and V2 checkpoints.
Notes
- Dynamic resolution: The model is trained on variable resolutions (0.2–0.6 MP). You can feed any image size; larger images give finer detail but cost more VRAM.
- Normalization: ImageNet normalization with mean (0.485, 0.456, 0.406) and std (0.229, 0.224, 0.225) is applied automatically.
- Zero-shot: The model generalizes across indoor, outdoor, and challenging domains without fine-tuning.
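You do not need to normalize inputs yourself, but for reference the ImageNet normalization mentioned in the notes amounts to the following. This is a NumPy sketch with a hypothetical `normalize_image` helper; the wrapper's own preprocessing may differ in details such as channel layout and dtype.

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def normalize_image(rgb_uint8):
    """Map an [H, W, 3] uint8 RGB image to a normalized [3, H, W] float array."""
    x = rgb_uint8.astype(np.float32) / 255.0  # scale to [0, 1]
    x = (x - IMAGENET_MEAN) / IMAGENET_STD    # per-channel standardization
    return np.transpose(x, (2, 0, 1))         # HWC -> CHW (assumed layout)

# Toy input: an all-black 4x5 image.
black = np.zeros((4, 5, 3), dtype=np.uint8)
normalized = normalize_image(black)
```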
Citation
@inproceedings{piccinelli2024unidepth,
title={UniDepth: Universal Monocular Metric Depth Estimation},
author={Piccinelli, Luigi and Yang, Yuedong and Sakaridis, Christos and Segu, Mattia and Li, Siyuan and Van Gool, Luc and Yu, Fisher},
booktitle={CVPR},
year={2024}
}
@article{piccinelli2025unidepthv2,
title={UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler},
author={Piccinelli, Luigi and Sakaridis, Christos and Yang, Yuedong and Segu, Mattia and Li, Siyuan and Abbeloos, Marc and Van Gool, Luc},
journal={arXiv:2502.20110},
year={2025}
}
License
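MIT for this wrapper. The original UniDepth repository states its own license terms (CC BY-NC 4.0 at the time of writing), which apply to the upstream code and pretrained weights.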
Generated by ML Intern
This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.
- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern