	# DepthPro Wrapper — Image to Point Cloud

	[![Paper](https://img.shields.io/badge/Paper-arXiv%3A2410.02073-blue)](https://arxiv.org/abs/2410.02073)
	[![Model](https://img.shields.io/badge/🤗%20Model-apple%2FDepthPro--hf-yellow)](https://huggingface.co/apple/DepthPro-hf)
	[![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)

	A clean, drop-in Python wrapper around Apple's DepthPro (arXiv:2410.02073) that turns a single RGB image into a metric depth map and, if you want, a 3D point cloud — with zero calibration.

	DepthPro is a 952 M-parameter ViT-L model that predicts absolute metric depth (meters, not relative) and estimates the camera focal length and field of view automatically. No camera intrinsics, no per-scene training, no LiDAR required.

	---

	## 🚀 Quick start (10 lines)

	```python
	from depthpro_wrapper import DepthProEstimator, rgbd_to_point_cloud, save_point_cloud

	# 1. Load model (~2 GB download on first run)
	estimator = DepthProEstimator(device="cuda:0")

	# 2. Drop an image in
	result = estimator.estimate("photo.jpg")
	print(f"Focal length: {result.focal_length:.1f} px")
	print(f"Depth range: {result.depth.min():.2f} – {result.depth.max():.2f} m")

	# 3. Get a coloured point cloud out
	points, colors = rgbd_to_point_cloud(
	    result.depth, result.image, result.focal_length
	)

	# 4. Save as PLY
	save_point_cloud("scene.ply", points, colors=colors)
	```

	Or use the CLI:

	```bash
	python scripts/image_to_pointcloud.py photo.jpg scene.ply --colored --normals
	```

	---

	## 📦 Installation

	```bash
	# 1. Core dependencies
	pip install torch torchvision transformers pillow numpy

	# 2. Install this wrapper
	pip install -e .
	```

	DepthPro is a large ViT-L model (~2 GB). The weights download automatically from HuggingFace the first time you instantiate `DepthProEstimator`.

	> GPU strongly recommended. The model runs in ~0.3 s on a modern GPU; CPU inference is possible but extremely slow.

	---

	## 🔬 How it works (the full pipeline)

	Here is exactly what happens under the hood when you call `estimate()`.

	### Step 1 — Preprocessing (handled automatically)
	Your input image is resized to 1536 × 1536 (the fixed operating resolution DepthPro was trained on), rescaled by `1/255`, and normalised to `[-1, 1]` with `mean=0.5, std=0.5`. This is done by `DepthProImageProcessorFast`.
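
As a rough illustration of the arithmetic in this step, the sketch below applies the rescale and normalisation to a single 8-bit channel value. The name `normalise_pixel` is hypothetical; in practice `DepthProImageProcessorFast` does this on whole tensors, and the resize is not shown here.

```python
def normalise_pixel(value: int, mean: float = 0.5, std: float = 0.5) -> float:
    """Map an 8-bit channel value (0..255) into the [-1, 1] range."""
    rescaled = value / 255.0        # rescale by 1/255, giving [0, 1]
    return (rescaled - mean) / std  # normalise with mean=0.5, std=0.5


# Black, mid-grey and white channel values.
for v in (0, 128, 255):
    print(v, normalise_pixel(v))
```

Black maps to -1 and white to +1, so the network always sees inputs centred on zero.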

	### Step 2 — Feature extraction (DINOv2 ViT-L + multi-scale patches)
	DepthPro's backbone is a DINOv2 ViT-L/16 (24 layers, 1024 hidden dim, 16×16 patch size). It processes the image at three scales simultaneously:

	| Scale | Resolution | Purpose |
	|-------|------------|---------|
	| 0.25× | 384×384 | Global context, far-away geometry |
	| 0.5× | 768×768 | Mid-range structure |
	| 1.0× | 1536×1536 | Fine detail, edges, thin structures |

	The three-scale features are fused with a DPT-style decoder (hidden size 256) into a dense feature map.
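
The resolutions in the table follow directly from the 1536-pixel operating resolution. The small sketch below recomputes them and, for the coarsest scale, the size of the ViT token grid. The helper names are hypothetical, and how the larger scales are split up before reaching the encoder is not covered here.

```python
BASE_RESOLUTION = 1536  # fixed operating resolution from Step 1
PATCH_SIZE = 16         # ViT-L/16 patch size


def scale_resolution(scale: float, base: int = BASE_RESOLUTION) -> int:
    """Side length of the square image at a given pyramid scale."""
    return round(base * scale)


def tokens_per_side(resolution: int, patch_size: int = PATCH_SIZE) -> int:
    """Number of ViT tokens along one side of a square input."""
    return resolution // patch_size


for scale in (0.25, 0.5, 1.0):
    side = scale_resolution(scale)
    print(f"{scale}x: {side}x{side}")

# A 384x384 input with 16x16 patches gives a 24x24 token grid.
print(tokens_per_side(scale_resolution(0.25)))
```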

	### Step 3 — Depth prediction (canonical inverse depth)
	The decoder outputs canonical inverse depth `C`. This is not the final metric depth yet — it is a scale-invariant representation that the network learns to predict robustly across scenes. The actual metric depth is recovered in the post-processing step.

	### Step 4 — FOV / focal-length estimation (no calibration needed)
	DepthPro has a dedicated FOV head. It ingests frozen features from the depth network plus task-specific features from a separate ViT encoder to predict the horizontal field of view (FOV) in degrees.

	From the FOV, the focal length in pixels is derived:

	```
	focal_length = (image_width / 2) / tan(FOV / 2)
	```

	This is the critical piece that makes the depth metric (in meters) rather than just up-to-scale.
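
The FOV-to-focal-length relation is plain pinhole geometry, so it can be checked in a few lines. This is a minimal sketch; the function names are illustrative and not part of the wrapper's API.

```python
import math


def fov_to_focal_length(fov_degrees: float, image_width: int) -> float:
    """Focal length in pixels from a horizontal FOV given in degrees."""
    half_fov = math.radians(fov_degrees) / 2.0
    return (image_width / 2.0) / math.tan(half_fov)


def focal_length_to_fov(focal_length: float, image_width: int) -> float:
    """Inverse relation: horizontal FOV in degrees from a focal length in pixels."""
    return math.degrees(2.0 * math.atan((image_width / 2.0) / focal_length))


print(fov_to_focal_length(90.0, 1000))
print(focal_length_to_fov(fov_to_focal_length(60.0, 1920), 1920))
```

A 90° FOV on a 1000-pixel-wide image gives a focal length of about 500 px. A narrower FOV gives a longer focal length for the same width.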

	### Step 5 — Post-processing (metric depth + back-projection)
	The processor converts canonical inverse depth `C` to metric depth `D_m` by scaling with the estimated focal length and the image width:

	```
	D_m = focal_length / (image_width × C)
	```
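
For reference, the Depth Pro paper states the conversion as `D_m = f_px / (w · C)`, where `f_px` is the focal length in pixels and `w` the image width. Below is a minimal scalar sketch under that assumption. The name `canonical_to_metric` and the clamp value `eps` are illustrative choices, not the processor's actual code.

```python
def canonical_to_metric(canonical_inverse_depth: float,
                        focal_length: float,
                        image_width: int,
                        eps: float = 1e-4) -> float:
    """Convert canonical inverse depth C to metric depth in meters."""
    # Scale C by width / focal length to get inverse metric depth (1/m).
    inverse_depth = canonical_inverse_depth * image_width / focal_length
    # Clamp away from zero so very distant pixels do not divide by zero.
    return 1.0 / max(inverse_depth, eps)


print(canonical_to_metric(0.5, focal_length=1000.0, image_width=1000))
print(canonical_to_metric(0.5, focal_length=2000.0, image_width=1000))
```

For a fixed `C`, doubling the focal length doubles the metric depth, which is why an accurate FOV estimate matters for absolute scale.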

	Then, using the pinhole camera model, every pixel `(u, v)` is back-projected to a 3D point:

	```
	X = (u - cx) * Z / focal_length
	Y = (v - cy) * Z / focal_length
	Z = D_m[v, u]
	```

	where `(cx, cy)` is the principal point (image centre by default). DepthPro assumes square pixels (`fx == fy`), which is standard for most modern cameras.
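
The back-projection can be sketched in pure Python for a small depth grid. This is an illustration of the pinhole equations only, not the wrapper's implementation. In particular, placing the default principal point at `((W - 1) / 2, (H - 1) / 2)` is an assumption; `(W / 2, H / 2)` is an equally common convention.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]


def backproject(depth: List[List[float]],
                focal_length: float,
                principal_point: Optional[Tuple[float, float]] = None) -> List[Point]:
    """Back-project an (H, W) depth grid to camera-space points (+Z forward)."""
    height, width = len(depth), len(depth[0])
    if principal_point is None:
        # Assumed default: the centre of the pixel grid.
        cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    else:
        cx, cy = principal_point
    points: List[Point] = []
    for v, row in enumerate(depth):          # v indexes rows (image y)
        for u, z in enumerate(row):          # u indexes columns (image x)
            x = (u - cx) * z / focal_length
            y = (v - cy) * z / focal_length
            points.append((x, y, z))
    return points


# A flat 3x3 wall, 2 m away, seen with a 1-pixel focal length.
cloud = backproject([[2.0] * 3 for _ in range(3)], focal_length=1.0)
print(cloud[4])  # centre pixel
```

The centre pixel lands on the optical axis, and points spread out in X and Y in proportion to their depth.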

	### Result
	You get:
	* `depth` — (H, W) metric depth map in meters
	* `focal_length` — estimated focal length in pixels
	* `field_of_view` — estimated horizontal FOV in degrees
	* `points` — (N, 3) 3D point cloud in camera coordinates (`+Z` forward)

	The first three are attributes of the returned `DepthResult`; `points` comes from a separate call to `depth_to_point_cloud()` or `rgbd_to_point_cloud()`.

	---

	## 🧰 API Reference

	### `DepthProEstimator`

	```python
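	import torch
	from depthpro_wrapper import DepthProEstimator

	# Constructor arguments, shown with their default values
	estimator = DepthProEstimator(
	    model_name="apple/DepthPro-hf",
	    device="cuda:0",
	    dtype=torch.float16,
	)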
	```

	* `model_name` — HuggingFace model ID or local path.
	* `device` — PyTorch device. CUDA strongly recommended.
	* `dtype` — `torch.float16` (default, fast) or `torch.float32` (slightly higher precision).

	#### `.estimate(image)`

	```python
	result = estimator.estimate(
	    image,  # str, Path, PIL.Image, or np.ndarray
	    return_confidence=False,
	)
	```

	Returns a `DepthResult` dataclass:

	| Attribute | Shape | Description |
	|-----------|-------|-------------|
	| `depth` | (H, W) | Metric depth in meters (float32) |
	| `focal_length` | scalar | Estimated focal length in pixels |
	| `field_of_view` | scalar | Estimated horizontal FOV in degrees |
	| `image` | (H, W, 3) | Original RGB image (uint8) |
	| `confidence` | (H, W) or None | Per-pixel confidence (if requested) |
	| `height`, `width` | scalars | Convenience properties |

	#### `.estimate_batch(images)`

	Process multiple images in a single forward pass for efficiency:

	```python
	results = estimator.estimate_batch(["a.jpg", "b.jpg", "c.jpg"])
	for r in results:
	    print(r.depth.shape)
	```
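
For a folder with more images than fit in one forward pass, the list of paths can be split into fixed-size groups first. The helper below is a hypothetical sketch and not part of the wrapper. The batch size of 4 is taken from the 4–8 range suggested under Tips & Troubleshooting.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


paths = [f"frame_{i:03d}.jpg" for i in range(10)]
batches = list(chunked(paths, batch_size=4))
# Each batch could then be passed to estimator.estimate_batch(batch).
print([len(batch) for batch in batches])
```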

	### `depth_to_point_cloud(depth, focal_length, ...)`

	```python
	from depthpro_wrapper import depth_to_point_cloud

	points = depth_to_point_cloud(
	    depth=result.depth,  # (H, W) metric depth
	    focal_length=result.focal_length,
	    principal_point=None,  # default = image centre
	    mask=None,  # optional boolean mask
	    sample_step=1,  # 2 = 1/4 points, 4 = 1/16
	)
	```

	Returns `(N, 3)` float32 array of 3D points in camera coordinates.
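
To estimate how many points a given `sample_step` leaves, the sketch below assumes the wrapper keeps every `sample_step`-th pixel along each axis (equivalent to `depth[::step, ::step]`) and that no `mask` is applied. Both are assumptions about the implementation, and `sampled_point_count` is a hypothetical helper.

```python
import math


def sampled_point_count(height: int, width: int, sample_step: int = 1) -> int:
    """Points kept when taking every sample_step-th pixel along each axis."""
    if sample_step < 1:
        raise ValueError("sample_step must be at least 1")
    rows = math.ceil(height / sample_step)
    cols = math.ceil(width / sample_step)
    return rows * cols


for step in (1, 2, 4):
    print(step, sampled_point_count(480, 640, step))
```

For a 640×480 image this gives 307,200 points at step 1, a quarter of that at step 2, and a sixteenth at step 4.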

	### `rgbd_to_point_cloud(depth, rgb, focal_length, ...)`

	Same as above but also returns per-point RGB colours:

	```python
	points, colors = rgbd_to_point_cloud(
	    result.depth, result.image, result.focal_length,
	    sample_step=2,
	)
	```

	Returns `(N, 3)` points and `(N, 3)` uint8 colours.

	### `normals_from_depth(depth, focal_length)`

	Compute surface normals directly from the depth map (useful for feeding into surface-reconstruction pipelines like Poisson or NKSR):

	```python
	from depthpro_wrapper import normals_from_depth
	normals = normals_from_depth(result.depth, result.focal_length)
	```

	Returns `(H, W, 3)` float32 unit normals (unoriented).
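
One common way to get normals from a depth map is to back-project each pixel and take the cross product of the horizontal and vertical differences between neighbouring 3D points. The pure-Python sketch below illustrates that idea; it is not necessarily how `normals_from_depth` is implemented. Border pixels are left as zero vectors, and the principal point is assumed to be the grid centre.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def _sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normals_from_depth_sketch(depth: List[List[float]],
                              focal_length: float) -> List[List[Vec3]]:
    """Unit normals for interior pixels via central differences."""
    height, width = len(depth), len(depth[0])
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0  # assumed principal point
    # Back-project every pixel to a camera-space point.
    pts = [[((u - cx) * z / focal_length, (v - cy) * z / focal_length, z)
            for u, z in enumerate(row)] for v, row in enumerate(depth)]
    normals: List[List[Vec3]] = [[(0.0, 0.0, 0.0)] * width for _ in range(height)]
    for v in range(1, height - 1):
        for u in range(1, width - 1):
            du = _sub(pts[v][u + 1], pts[v][u - 1])  # tangent along image x
            dv = _sub(pts[v + 1][u], pts[v - 1][u])  # tangent along image y
            n = _cross(du, dv)
            length = math.sqrt(n[0] ** 2 + n[1] ** 2 + n[2] ** 2)
            if length > 0.0:
                normals[v][u] = (n[0] / length, n[1] / length, n[2] / length)
    return normals


# A fronto-parallel wall 2 m away: the normal lies along the Z axis.
flat = normals_from_depth_sketch([[2.0] * 3 for _ in range(3)], focal_length=100.0)
print(flat[1][1])
```

The sign of the result depends on the order of the cross product, which is why such normals are described as unoriented.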

	### `save_point_cloud(path, points, colors=None, normals=None)`

	Save a point cloud to an ASCII PLY file (readable by Open3D, MeshLab, CloudCompare, Blender, etc.):

	```python
	from depthpro_wrapper import save_point_cloud
	save_point_cloud("cloud.ply", points, colors=colors, normals=normals)
	```
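
For readers unfamiliar with the format, an ASCII PLY file is a short text header followed by one line per vertex. The sketch below builds such a file as a string using standard PLY property names. `ply_ascii` is a hypothetical helper; the wrapper's own writer may format numbers differently and also supports normals.

```python
from typing import Optional, Sequence, Tuple


def ply_ascii(points: Sequence[Tuple[float, float, float]],
              colors: Optional[Sequence[Tuple[int, int, int]]] = None) -> str:
    """Return the text of an ASCII PLY file for the given vertices."""
    if colors is not None and len(colors) != len(points):
        raise ValueError("colors must have one entry per point")
    header = ["ply", "format ascii 1.0", f"element vertex {len(points)}",
              "property float x", "property float y", "property float z"]
    if colors is not None:
        header += ["property uchar red", "property uchar green", "property uchar blue"]
    header.append("end_header")
    rows = []
    for i, (x, y, z) in enumerate(points):
        row = f"{x:.6f} {y:.6f} {z:.6f}"
        if colors is not None:
            r, g, b = colors[i]
            row += f" {r} {g} {b}"
        rows.append(row)
    return "\n".join(header + rows) + "\n"


text = ply_ascii([(0.0, 0.0, 1.0)], colors=[(255, 0, 0)])
print(text)
```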

	---

	## 🖥️ CLI Usage

	```bash
	# Basic: image → point cloud
	python scripts/image_to_pointcloud.py photo.jpg cloud.ply

	# With colours and normals
	python scripts/image_to_pointcloud.py photo.jpg cloud.ply --colored --normals

	# Down-sample for faster processing / smaller files
	python scripts/image_to_pointcloud.py photo.jpg cloud.ply --sample-step 2

	# Save intermediate depth & confidence maps
	python scripts/image_to_pointcloud.py photo.jpg cloud.ply \
	    --save-depth depth.npy --save-confidence conf.npy

	# CPU fallback (very slow)
	python scripts/image_to_pointcloud.py photo.jpg cloud.ply --device cpu --dtype float32
	```
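
The flags used above map naturally onto an `argparse` parser. The following is a hypothetical reconstruction from the examples only; the real `scripts/image_to_pointcloud.py` may define its options, defaults, and help text differently. The defaults for `--device` and `--dtype` are assumed to mirror the `DepthProEstimator` defaults.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the flags shown in the CLI examples (assumed layout)."""
    parser = argparse.ArgumentParser(description="Image to point cloud with DepthPro")
    parser.add_argument("image", help="input image path")
    parser.add_argument("output", help="output PLY path")
    parser.add_argument("--colored", action="store_true", help="attach RGB colours")
    parser.add_argument("--normals", action="store_true", help="attach surface normals")
    parser.add_argument("--sample-step", type=int, default=1, help="keep every N-th pixel")
    parser.add_argument("--save-depth", default=None, help="write the depth map to a .npy file")
    parser.add_argument("--save-confidence", default=None, help="write the confidence map to a .npy file")
    parser.add_argument("--device", default="cuda:0", help="PyTorch device (assumed default)")
    parser.add_argument("--dtype", default="float16", choices=["float16", "float32"],
                        help="model precision (assumed default)")
    return parser


args = build_parser().parse_args(
    ["photo.jpg", "cloud.ply", "--colored", "--sample-step", "2"]
)
print(args)
```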

	---

	## 📂 Repository layout

	```
	depthpro-wrapper/
	├── depthpro_wrapper/
	│   ├── __init__.py              # public API
	│   ├── depth_estimator.py       # DepthProEstimator + DepthResult
	│   ├── point_cloud.py           # back-projection + normal estimation
	│   └── io.py                    # image / PLY I/O helpers
	├── scripts/
	│   └── image_to_pointcloud.py   # CLI entry point
	├── examples/
	│   ├── quickstart.py            # 10-line minimal example
	│   └── batch_processing.py      # folder-of-images batch script
	├── setup.py
	├── requirements.txt
	└── README.md
	```

	---

	## 🎯 Tips & Troubleshooting

	| Problem | Solution |
	|---------|----------|
	| Out of memory | Use `dtype=torch.float16` (default). If still OOM, use `--sample-step 2` or smaller images. |
	| Depth looks wrong / flat | DepthPro works best on images with perspective (indoor rooms, outdoor scenes). Very flat macro shots may under-estimate depth. |
	| Point cloud is noisy at edges | Depth has uncertainty at object boundaries. Use `sample_step=2` or filter by `confidence` if you saved it. |
	| Focal length seems off | DepthPro estimates FOV from image content. Very unusual aspect ratios or heavy cropping can confuse it. You can override with your own `focal_length` in `depth_to_point_cloud()`. |
	| Want a mesh, not a point cloud | Feed the point cloud into a surface-reconstruction method: Poisson (Open3D), Alpha shapes, or better yet [NKSR](https://huggingface.co/bdck/nksr-wrapper) for neural surface reconstruction. |
	| Batch processing is slow | Use `estimate_batch()` with batch size 4–8 instead of looping over `estimate()`. |
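
If the camera's physical focal length and sensor width are known (for example from EXIF data or a spec sheet), a focal length in pixels for the override can be derived with the standard pinhole relation. `focal_length_pixels` is a hypothetical helper, and the sketch assumes the image has not been cropped.

```python
def focal_length_pixels(focal_length_mm: float,
                        sensor_width_mm: float,
                        image_width_px: int) -> float:
    """Focal length in pixels from physical focal length and sensor width."""
    if sensor_width_mm <= 0:
        raise ValueError("sensor_width_mm must be positive")
    return focal_length_mm * image_width_px / sensor_width_mm


# An 18 mm lens on a 36 mm-wide sensor, image 4000 px wide.
print(focal_length_pixels(18.0, 36.0, 4000))
```

The resulting value can be passed as `focal_length` to `depth_to_point_cloud()` in place of `result.focal_length`.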

	---

	## 🔗 Citation

	If you use DepthPro in your research, please cite the original paper:

	```bibtex
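	@article{depthpro2024,
	  title={Depth Pro: Sharp Monocular Metric Depth in Less Than a Second},
	  author={Bochkovskii, Aleksei and Delaunoy, Ama{\"e}l and Germain, Hugo and Santos, Marcel and Zhou, Yichao and Richter, Stephan R. and Koltun, Vladlen},
	  journal={arXiv preprint arXiv:2410.02073},
	  year={2024}
	}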
	```

	Original code: [https://github.com/apple/ml-depth-pro](https://github.com/apple/ml-depth-pro)
	HuggingFace model: [https://huggingface.co/apple/DepthPro-hf](https://huggingface.co/apple/DepthPro-hf)

	---

	## 📄 License

	This wrapper is released under the MIT License. DepthPro itself is under Apple's own license (see the original repository).

	---

	Built with ❤️ on top of Apple's DepthPro.