<div align="center">
<h1>Pixel-Perfect Depth</h1>

[**Gangwei Xu**](https://gangweix.github.io/)<sup>1,2,*</sup> · [**Haotong Lin**](https://haotongl.github.io/)<sup>3,*</sup> · Hongcheng Luo<sup>2</sup> · [**Xianqi Wang**](https://scholar.google.com/citations?user=1GCLBNAAAAAJ&hl=zh-CN&oi=ao)<sup>1</sup> · [**Jingfeng Yao**](https://jingfengyao.github.io/)<sup>1</sup>
<br>
[**Lianghui Zhu**](https://scholar.google.com/citations?user=NvMHcs0AAAAJ&hl=zh-CN&oi=ao)<sup>1</sup> · Yuechuan Pu<sup>2</sup> · Cheng Chi<sup>2</sup> · Haiyang Sun<sup>2,†</sup> · Bing Wang<sup>2</sup>
<br>
Guang Chen<sup>2</sup> · Hangjun Ye<sup>2</sup> · [**Sida Peng**](https://pengsida.net/)<sup>3</sup> · [**Xin Yang**](https://sites.google.com/view/xinyang/home)<sup>1,†,✉️</sup>

<sup>1</sup>HUST  <sup>2</sup>Xiaomi EV  <sup>3</sup>Zhejiang University
<br>
*co-first author   †project leader   ✉️ corresponding author

<a href="https://arxiv.org/pdf/2510.07316"><img src='https://img.shields.io/badge/arXiv-Pixel Perfect Depth-red' alt='Paper PDF'></a>
<a href='https://pixel-perfect-depth.github.io/'><img src='https://img.shields.io/badge/Project_Page-Pixel Perfect Depth-green' alt='Project Page'></a>
<a href='https://huggingface.co/spaces/gangweix/Pixel-Perfect-Depth'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Demo-blue'></a>
</div>

This work presents Pixel-Perfect Depth, a monocular depth estimation model built on pixel-space diffusion transformers. Unlike existing discriminative and generative models, it produces depth maps that yield high-quality, flying-pixel-free point clouds.



*Overview of Pixel-Perfect Depth. We perform diffusion generation directly in pixel space, without any VAE.*

## 🌟 Features

* Pixel-space diffusion generation: the model operates directly in image space, without a VAE or latent representations, and produces flying-pixel-free point clouds from its estimated depth maps.
* Discriminative representations (ViT) are integrated into generative modeling (DiT), leveraging the strengths of both paradigms.
* The architecture is purely transformer-based and contains no convolutional layers.
* Although trained at a fixed resolution of 1024×768, the model flexibly supports various input resolutions and aspect ratios at inference (see the sketch below).

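As a rough illustration of the last point: pure-ViT pipelines usually handle arbitrary resolutions by resizing inputs so that both sides are multiples of the patch size. The sketch below shows the idea only; the patch size of 16 and the helper itself are assumptions for illustration, not this repo's actual preprocessing.

```python
from PIL import Image

PATCH = 16  # assumed patch size, for illustration only

def resize_to_patch_multiple(img: Image.Image, max_side: int = 1024) -> Image.Image:
    """Hypothetical helper: resize so width and height are multiples of PATCH,
    approximately preserving the aspect ratio."""
    scale = min(1.0, max_side / max(img.size))
    w = max(PATCH, round(img.width * scale / PATCH) * PATCH)
    h = max(PATCH, round(img.height * scale / PATCH) * PATCH)
    return img.resize((w, h), Image.BICUBIC)
```
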
## News
- **2026-02-12:** We release the evaluation code for 5 benchmarks.
- **2026-01-09:** We release the PPVD model together with its weights.
- **2025-12-20:** We release the training code for PPD, featuring a two-stage pipeline: 512×512 pre-training followed by 1024×768 fine-tuning.
- **2025-12-01:** We release a new PPD model together with its weights; it leverages MoGe2 to provide semantics and delivers a 20–30% improvement on zero-shot benchmarks.
- **2025-10-01:** Paper, project page, code, models, and demo are all released.

## Benchmarks


## Pre-trained Models

Our pre-trained models are available on the Hugging Face Hub and Google Drive:

| Model | Semantics | Params | Checkpoint | Training Resolution |
|:-|:-|-:|:-:|:-:|
| PPD | DA2 | 500M | [Download](https://huggingface.co/gangweix/Pixel-Perfect-Depth/resolve/main/ppd.pth) | 1024×768 |
| PPD | MoGe2 | 500M | [Download](https://drive.google.com/file/d/1tabmcsbRVDKDfmO4KU1vOjurzN-wp0HV/view?usp=sharing) | 1024×768 |

## Usage

### Preparation

```bash
git clone https://github.com/gangweix/pixel-perfect-depth
cd pixel-perfect-depth
pip install -r requirements.txt
```

Download our pre-trained model [ppd.pth](https://huggingface.co/gangweix/Pixel-Perfect-Depth/resolve/main/ppd.pth) and put it under the `checkpoints/` directory.
You also need to download the pre-trained semantics model [depth_anything_v2_vitl.pth](https://huggingface.co/depth-anything/Depth-Anything-V2-Large/resolve/main/depth_anything_v2_vitl.pth?download=true) (or [moge2.pt](https://huggingface.co/Ruicheng/moge-2-vitl-normal/resolve/main/model.pt?download=true)) and put it under the same directory.

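If you prefer to script these downloads, here is a minimal sketch using the `huggingface_hub` package (the repo and file names are taken from the links above; fetching the files manually works just as well):

```python
from huggingface_hub import hf_hub_download

# Fetch the PPD checkpoint and the Depth Anything V2 semantics model
# into checkpoints/ (same layout as the manual download above).
hf_hub_download(repo_id="gangweix/Pixel-Perfect-Depth",
                filename="ppd.pth", local_dir="checkpoints")
hf_hub_download(repo_id="depth-anything/Depth-Anything-V2-Large",
                filename="depth_anything_v2_vitl.pth", local_dir="checkpoints")
```
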
### Running depth on *images*

```bash
python run.py
```

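To inspect a predicted depth map, you can colorize it with a perceptual colormap. This is a generic sketch, not part of the repo; the `.npy` output path is an assumption, so adapt it to wherever `run.py` actually writes its results.

```python
import numpy as np
from matplotlib import colormaps
from PIL import Image

# Assumed output location; adjust to run.py's actual save path.
depth = np.load("outputs/example_depth.npy")

# Normalize to [0, 1] and apply a colormap for visualization.
norm = (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)
rgb = (colormaps["Spectral"](norm)[..., :3] * 255).astype(np.uint8)
Image.fromarray(rgb).save("depth_vis.png")
```
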
### Running point cloud on *images*

Generating point clouds requires metric depth and camera intrinsics, which are obtained from MoGe.
Please download the pre-trained model [moge2.pt](https://huggingface.co/Ruicheng/moge-2-vitl-normal/resolve/main/model.pt?download=true) and place it under the `checkpoints/` folder.

```bash
python run_point_cloud.py --save_pcd
```

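For reference, once you have a metric depth map and pinhole intrinsics, a point cloud follows from the standard back-projection x = (u - cx) * z / fx, y = (v - cy) * z / fy. A minimal NumPy sketch, independent of this repo's internals:

```python
import numpy as np

def depth_to_point_cloud(depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Unproject an HxW metric depth map through 3x3 pinhole intrinsics K
    into an Nx3 point cloud in camera coordinates."""
    h, w = depth.shape
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)
```
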
### Running depth on *video*

Download our pre-trained model [ppvd.pth](https://drive.google.com/file/d/1IbMxrljpqkw92Z0G3CVEIrf-JffbI8sN/view?usp=drive_link) and put it under the `checkpoints/` directory. You also need to download the pre-trained model [pi3.safetensors](https://huggingface.co/yyfz233/Pi3/resolve/main/model.safetensors) and place it under the same directory.

```bash
python run_video.py
```

### Training

Our training strategy follows a two-stage curriculum:

* **Stage 1: Pre-training.** Conducted at 512×512 resolution on the Hypersim dataset.
```bash
python main.py --cfg_file ppd/configs/train_pretrain.yaml pl_trainer.devices=8
```
* **Stage 2: Fine-tuning.** Conducted at 1024×768 resolution on a mixture of five datasets.
```bash
python main.py --cfg_file ppd/configs/train_finetune.yaml pl_trainer.devices=8
```

### Evaluation

Before running the evaluation, please first set the `data_root` field in `ppd/configs/eval.yaml` to your local evaluation dataset path.

```bash
bash eval.sh
```

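The script reports the metrics computed by the repo's own protocol. For intuition, affine-invariant depth benchmarks typically report AbsRel and δ1 after a least-squares scale-and-shift alignment of the prediction to the ground truth; the sketch below illustrates that standard computation and is not the repo's exact code.

```python
import numpy as np

def align_scale_shift(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Least-squares fit of scale s and shift t so that s * pred + t ≈ gt."""
    A = np.stack([pred, np.ones_like(pred)], axis=1)
    s, t = np.linalg.lstsq(A, gt, rcond=None)[0]
    return s * pred + t

def abs_rel_and_delta1(pred: np.ndarray, gt: np.ndarray):
    """AbsRel and δ1 on valid (positive-depth) pixels after alignment."""
    valid = gt > 0
    pred = np.clip(align_scale_shift(pred[valid], gt[valid]), 1e-6, None)
    abs_rel = float(np.mean(np.abs(pred - gt[valid]) / gt[valid]))
    delta1 = float(np.mean(np.maximum(pred / gt[valid], gt[valid] / pred) < 1.25))
    return abs_rel, delta1
```
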
## Qualitative Comparisons with Previous Methods

Our model preserves finer details than Depth Anything V2 and MoGe 2, while being significantly more robust than Depth Pro.


## Acknowledgement

We are grateful to the [Depth Anything V2](https://github.com/DepthAnything/Depth-Anything-V2), [MoGe](https://github.com/microsoft/MoGe), and [DiT](https://github.com/facebookresearch/DiT) teams for releasing their code and models. We would also like to sincerely thank the NeurIPS reviewers for their appreciation of this work (ratings: 5, 5, 5, 5).

## Citation

If you find this project useful, please consider citing:

```bibtex
@article{xu2025pixel,
  title={Pixel-perfect depth with semantics-prompted diffusion transformers},
  author={Xu, Gangwei and Lin, Haotong and Luo, Hongcheng and Wang, Xianqi and Yao, Jingfeng and Zhu, Lianghui and Pu, Yuechuan and Chi, Cheng and Sun, Haiyang and Wang, Bing and others},
  journal={arXiv preprint arXiv:2510.07316},
  year={2025}
}
```