UniDepth Inference: Image → Metric Depth → Point Cloud

A clean, self-contained Python wrapper around UniDepth (CVPR 2024 / V2 2025) for converting a single RGB image into a metric 3D point cloud — with no camera intrinsics required.

Paper: UniDepth: Universal Monocular Metric Depth Estimation (CVPR 2024)
V2 Paper: UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler
Original Code: github.com/lpiccinelli-eth/UniDepth
Pretrained Weights: lpiccinelli/unidepth-v2-vits14 · lpiccinelli/unidepth-v2-vitl14

Installation

You need the original unidepth package (which provides the model definitions) plus this wrapper:

# 1. Install the official UniDepth package from GitHub
pip install git+https://github.com/lpiccinelli-eth/UniDepth.git

# 2. Install this wrapper (or just copy the files into your project)
pip install .

Core dependencies: torch, torchvision, Pillow (imported as PIL), numpy, huggingface_hub, safetensors

Quick Start

from PIL import Image
from unidepth.inference import UniDepth, save_pointcloud_ply

# Load image
image = Image.open("room.jpg").convert("RGB")

# Load model (auto-downloads weights from HuggingFace)
model = UniDepth.from_pretrained("lpiccinelli/unidepth-v2-vits14", device="cuda")

# Run inference → get depth + point cloud
results = model(image)

# Metric depth map [H, W] in meters
depth = results["depth"]

# 3D point cloud [N, 3] in meters (only valid depth pixels)
points = results["points"]

# Predicted camera intrinsics K [3, 3]
intrinsics = results["intrinsics"]

# Save as PLY
save_pointcloud_ply("room.ply", points, colors=results["colors"])

Command-line:

python examples/image_to_pointcloud.py room.jpg --output room.ply --checkpoint lpiccinelli/unidepth-v2-vits14

How It Works Internally

UniDepth is a universal monocular metric depth estimator. Unlike most depth models that only output relative depth or require known camera intrinsics, UniDepth simultaneously predicts:

Camera intrinsics (self-promptable camera module)
Metric depth (in meters)
3D ray directions for every pixel

1. Pseudo-Spherical Output Representation

The key innovation is the pseudo-spherical representation (θ, φ, z_log) instead of the standard Cartesian (x, y, z):

θ (azimuth) — horizontal angle of the camera ray
φ (elevation) — vertical angle of the camera ray
z_log — log-depth (metric)

Why? Because (x, y) in Cartesian backprojection entangles camera rays with depth:

x = (u - cx) / fx * z   ← both camera AND depth
y = (v - cy) / fy * z   ← both camera AND depth

By predicting angles (θ, φ) separately from z_log, the model naturally disentangles camera calibration from depth estimation. The two tasks don't interfere during training.
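
To make the entanglement concrete, the sketch below backprojects one pixel at two depths with the pinhole equations above and then computes the ray angles. The intrinsics values and the angle convention (azimuth in the x-z plane, elevation toward +y) are illustrative assumptions, not values or conventions taken from UniDepth.

```python
import math

def backproject(u, v, z, fx, fy, cx, cy):
    """Pinhole backprojection of pixel (u, v) at depth z (meters)."""
    x = (u - cx) / fx * z
    y = (v - cy) / fy * z
    return x, y, z

def ray_angles(x, y, z):
    """Azimuth and elevation of the ray through (x, y, z).

    Assumed convention: azimuth in the x-z plane, elevation toward +y.
    """
    azimuth = math.atan2(x, z)
    elevation = math.atan2(y, math.hypot(x, z))
    return azimuth, elevation

# Hypothetical intrinsics for a 640x480 image.
fx = fy = 500.0
cx, cy = 320.0, 240.0

# Same pixel, two different depths.
near = backproject(400, 300, 1.0, fx, fy, cx, cy)
far = backproject(400, 300, 4.0, fx, fy, cx, cy)

# x and y scale with depth, but the ray angles stay the same.
print(near, ray_angles(*near))
print(far, ray_angles(*far))
```

The Cartesian x and y change by a factor of four between the two points, while the angles depend only on the pixel location and the intrinsics.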

2. Self-Promptable Camera Module

The camera module bootstraps intrinsics from the image itself:

Takes the ViT class tokens as initialization
Runs 2 self-attention layers → predicts 4 scalars (Δfx, Δfy, Δcx, Δcy)

Converts to absolute intrinsics (invariant to image size):

fx = Δfx * W / 2
fy = Δfy * H / 2
cx = Δcx * W / 2
cy = Δcy * H / 2

Backprojects every pixel through K⁻¹ to get rays on the unit sphere
Extracts azimuth/elevation from those rays → dense camera representation C

This means you don't need to know your camera's focal length — the model guesses it from visual cues (perspective lines, known object sizes, etc.).
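
The geometric part of these steps (everything after the four scalars are predicted) can be sketched in NumPy. The function names are hypothetical, and the half-pixel offset and the angle convention are assumptions made for illustration; the actual module may differ in both.

```python
import numpy as np

def intrinsics_from_deltas(d_fx, d_fy, d_cx, d_cy, height, width):
    """Build K from the four size-relative scalars, as in the formulas above."""
    fx = d_fx * width / 2
    fy = d_fy * height / 2
    cx = d_cx * width / 2
    cy = d_cy * height / 2
    return np.array([[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]])

def dense_camera_angles(K, height, width):
    """Backproject every pixel through K^-1 and return [H, W, 2] angles."""
    # Pixel centres (the +0.5 offset is an assumption of this sketch).
    v, u = np.meshgrid(np.arange(height) + 0.5, np.arange(width) + 0.5, indexing="ij")
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)   # [H, W, 3] homogeneous
    rays = pixels @ np.linalg.inv(K).T                    # apply K^-1 to each pixel
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)  # project onto unit sphere
    azimuth = np.arctan2(rays[..., 0], rays[..., 2])
    elevation = np.arcsin(rays[..., 1])
    return np.stack([azimuth, elevation], axis=-1)

# Hypothetical predicted scalars for a 480x640 image.
K = intrinsics_from_deltas(1.5, 2.0, 1.0, 1.0, height=480, width=640)
C = dense_camera_angles(K, height=480, width=640)
print(K)
print(C.shape)
```

Because the scalars are relative to the image size, the same prediction maps to consistent intrinsics when the image is resized.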

3. Depth Module

Encoder features from DINOv2 ViT at 4 scales (H/14 × W/14 resolution)
Each scale is cross-attention conditioned on the camera embeddings E = SHE(C)
- SHE = Spherical Harmonic Encoding (128 channels from 64 harmonics per angle)
FPN-style decoder with transposed-convolution upsampling
Final output: Z_log upsampled to full (H, W) + 2 conv layers

4. Converting to Cartesian Point Cloud

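The model outputs O = [θ, φ, Z_log]. Writing d = exp(Z_log) for the decoded metric value, the standard Cartesian coordinates (X, Y, Z) are:

X = d * cos(φ) * cos(θ)
Y = d * cos(φ) * sin(θ)
Z = d * sin(φ)

These formulas are the standard spherical-to-Cartesian conversion, in which d acts as the distance along the ray.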

This is what generate_pointcloud() does for you automatically.
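
As a rough illustration, the conversion above can be written in a few lines of NumPy. The function name is hypothetical and the code follows the formulas exactly as written in this section; it is not the body of generate_pointcloud(), whose axis convention may differ.

```python
import numpy as np

def spherical_to_cartesian(theta, phi, log_value):
    """Apply the three conversion formulas elementwise.

    theta, phi: angles in radians; log_value: log of the metric value.
    Returns an array of shape [..., 3].
    """
    d = np.exp(log_value)
    x = d * np.cos(phi) * np.cos(theta)
    y = d * np.cos(phi) * np.sin(theta)
    z = d * np.sin(phi)
    return np.stack([x, y, z], axis=-1)

# Two example rays with decoded values of 2 m and 5 m.
theta = np.array([0.0, 0.3])
phi = np.array([0.0, -0.2])
log_value = np.log(np.array([2.0, 5.0]))

points = spherical_to_cartesian(theta, phi, log_value)
print(points)
```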

5. Handling Known Intrinsics (Optional)

If you already know your camera matrix K, you can bypass the predicted camera and use your own:

results = model(image, intrinsics=your_K_matrix)

When your intrinsics are reliable (for example, from a calibrated camera), this usually gives a more accurate 3D reconstruction than relying on the predicted K.
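
If you only know the field of view rather than a full calibration, a pinhole K can be built by hand. The helper below is a hypothetical sketch that assumes square pixels and a centered principal point; whether the wrapper expects a NumPy array or a tensor for `intrinsics` is also an assumption to check against its source.

```python
import math
import numpy as np

def intrinsics_from_fov(width, height, horizontal_fov_deg):
    """Pinhole K for square pixels and a centered principal point."""
    # Focal length in pixels from the horizontal field of view.
    fx = (width / 2) / math.tan(math.radians(horizontal_fov_deg) / 2)
    return np.array(
        [[fx, 0.0, width / 2],
         [0.0, fx, height / 2],
         [0.0, 0.0, 1.0]],
        dtype=np.float32,
    )

# Example: a 640x480 image from a camera with a 90 degree horizontal FOV.
your_K_matrix = intrinsics_from_fov(640, 480, 90.0)
print(your_K_matrix)
```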

API Reference

`UniDepth.from_pretrained(checkpoint, device="cuda")`

Load a pretrained model from HuggingFace Hub.

Checkpoint	Size	Speed	Accuracy
`lpiccinelli/unidepth-v2-vits14`	261 MB	Fast	Very Good
`lpiccinelli/unidepth-v2-vitl14`	2.6 GB	Slower	Best

`model(image, intrinsics=None) → dict`

Run inference on a PIL Image or tensor.

Returns a dictionary with keys:

"depth" — [H, W] metric depth (meters)
"confidence" — [H, W] uncertainty (lower = more confident)
"points" — [N, 3] Cartesian point cloud (valid pixels only)
"colors" — [N, 3] RGB colors for each point
"intrinsics" — [3, 3] predicted camera matrix K
"camera" — [H, W, 2] predicted azimuth/elevation

`generate_pointcloud(depth, camera, colors=None, mask=None, intrinsics=None)`

Convert raw model outputs to a filtered point cloud.
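
The filtering step might look roughly like the sketch below. It is not the wrapper's implementation: the function name is hypothetical, and the validity rule (finite, strictly positive depth, optionally combined with a user mask) is an assumption about what "filtered" means here.

```python
import numpy as np

def filter_valid_points(points, depth, colors=None, mask=None):
    """Keep only pixels with usable depth.

    points: [H, W, 3], depth: [H, W], colors: [H, W, 3] or None,
    mask: [H, W] boolean or None. Returns ([N, 3] points, [N, 3] colors or None).
    """
    valid = np.isfinite(depth) & (depth > 0)
    if mask is not None:
        valid &= mask.astype(bool)
    kept_points = points[valid]
    kept_colors = colors[valid] if colors is not None else None
    return kept_points, kept_colors

# Tiny 2x2 example with one zero and one NaN depth.
depth = np.array([[1.0, 0.0], [np.nan, 2.0]])
points = np.zeros((2, 2, 3))
points[..., 2] = depth
kept, _ = filter_valid_points(points, depth)
print(kept.shape)
```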

`save_pointcloud_ply(path, points, colors=None)`

Save points (and optional colors) as an ASCII PLY file.
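
For readers unfamiliar with the format, a minimal ASCII PLY writer is sketched below. It is an independent illustration rather than the wrapper's code; the function name is hypothetical, and it assumes colors are RGB values in the 0-255 range.

```python
def write_ascii_ply(path, points, colors=None):
    """Write [N, 3] points (and optional 0-255 RGB colors) as ASCII PLY."""
    header = [
        "ply",
        "format ascii 1.0",
        f"element vertex {len(points)}",
        "property float x",
        "property float y",
        "property float z",
    ]
    if colors is not None:
        header += [
            "property uchar red",
            "property uchar green",
            "property uchar blue",
        ]
    header.append("end_header")

    with open(path, "w", encoding="ascii") as f:
        f.write("\n".join(header) + "\n")
        for i, (x, y, z) in enumerate(points):
            line = f"{float(x)} {float(y)} {float(z)}"
            if colors is not None:
                r, g, b = (int(c) for c in colors[i])
                line += f" {r} {g} {b}"
            f.write(line + "\n")
```

The function accepts plain lists as well as NumPy arrays, since it only iterates over rows. ASCII PLY is easy to inspect but grows quickly for dense clouds.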

Model Variants

	UniDepth V1 (CVPR 2024)	UniDepth V2 (2025)
Backbone	DINOv2 ViT-L	DINOv2 ViT-S / B / L
Camera encoding	Spherical Harmonics (81 coefficients)	Sine encoding (64 harmonics)
Output	(θ, φ, z_log)	(θ, φ, z_log)
Training data	8 datasets	23 datasets (16M images)
Losses	λ-MSE + Geometric Invariance	+ Edge-Guided SSI + Confidence

This wrapper works with both V1 and V2 checkpoints.

Notes

Dynamic resolution: The model is trained on variable resolutions (0.2–0.6 MP). You can feed any image size; larger images give finer detail but cost more VRAM.
Normalization: ImageNet normalization (mean (0.485, 0.456, 0.406), std (0.229, 0.224, 0.225)) is applied automatically.
Zero-shot: The model generalizes across indoor, outdoor, and challenging domains without fine-tuning.
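
The wrapper handles preprocessing itself, but if VRAM is tight you may want to downscale large images beforehand. The sketch below picks a size within a pixel budget and shows what the normalization amounts to. The function names, the 0.6 MP default (taken from the training range above), and rounding to multiples of the 14-pixel ViT patch are assumptions for illustration.

```python
import math
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def target_size(width, height, max_megapixels=0.6, patch=14):
    """Shrink (never enlarge) to a pixel budget, rounding down to multiples of `patch`."""
    scale = min(1.0, math.sqrt(max_megapixels * 1e6 / (width * height)))
    new_w = max(patch, int(width * scale) // patch * patch)
    new_h = max(patch, int(height * scale) // patch * patch)
    return new_w, new_h

def normalize(image_uint8):
    """[H, W, 3] uint8 RGB -> float32 array normalized with ImageNet statistics."""
    return (image_uint8.astype(np.float32) / 255.0 - IMAGENET_MEAN) / IMAGENET_STD

# A 12 MP photo is reduced to fit the budget; a small image is only snapped to the patch grid.
print(target_size(4000, 3000))
print(target_size(640, 480))
```

With Pillow, the returned size can be passed to `image.resize((new_w, new_h))` before calling the model.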

Citation

@inproceedings{piccinelli2024unidepth,
  title={UniDepth: Universal Monocular Metric Depth Estimation},
  author={Piccinelli, Luigi and Yang, Yuedong and Sakaridis, Christos and Segu, Mattia and Li, Siyuan and Van Gool, Luc and Yu, Fisher},
  booktitle={CVPR},
  year={2024}
}

@article{piccinelli2025unidepthv2,
  title={UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler},
  author={Piccinelli, Luigi and Sakaridis, Christos and Yang, Yuedong and Segu, Mattia and Li, Siyuan and Abbeloos, Marc and Van Gool, Luc},
  journal={arXiv:2502.20110},
  year={2025}
}

License

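MIT for this wrapper. Note that the original UniDepth repository states a CC BY-NC 4.0 license, so check the upstream terms for the model code and weights before commercial use.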

Generated by ML Intern

This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.

Try ML Intern: https://smolagents-ml-intern.hf.space
Source code: https://github.com/huggingface/ml-intern
