	---
	license: apache-2.0
	language:
	- en
	metrics:
	- accuracy
	base_model:
	- Qwen/Qwen3-VL-4B-Instruct
	library_name: transformers
	tags:
	- vision-language-model
	- depth-estimation
	- 3d-vision
	- multimodal
	pipeline_tag: depth-estimation
	---

	<p align="center">
	<img src="https://cdn-uploads.huggingface.co/production/uploads/652d3023cdb2a91205709b6a/G_n7TbVjIv8PYUjq0kxPP.png" width="160" />
	</p>
	<h2 align="center">
	Unlocking Dense Metric Depth Estimation in VLMs
	</h2>

	<h4 align="center">
	<b>Project Page:</b> <a href="https://depthvlm.github.io/">depthvlm.github.io</a> |
	<b>GitHub:</b> <a href="https://github.com/hanxunyu/DepthVLM">hanxunyu/DepthVLM</a> |
	<b>arXiv:</b> <a href="https://arxiv.org/abs/2605.15876">2605.15876</a>
	<br><br>
	<a href="https://depthvlm.github.io/"><img src="https://img.shields.io/badge/Project-Home Page-green?logo=safari&logoColor=white" alt="Project Home Page"></a>
	<a href="https://github.com/hanxunyu/DepthVLM"><img src="https://img.shields.io/badge/GitHub-Repository-blue?logo=github" alt="GitHub Badge"></a>
	<a href="https://huggingface.co/datasets/JonnyYu828/DepthVLM-Bench"><img src="https://img.shields.io/badge/HuggingFace-Benchmark-yellow?logo=huggingface" alt="Hugging Face Benchmark"></a>
	<a href="https://arxiv.org/abs/2605.15876"><img src="https://img.shields.io/badge/arXiv-2605.15876-b31b1b.svg?logo=arxiv&logoColor=red" alt="arXiv"></a>
	</h4>


	---

	## 📰 News

	* 2026.05 — Released [DepthVLM-Bench](https://huggingface.co/datasets/JonnyYu828/DepthVLM-Bench).
	* 2026.05 — Released [DepthVLM-4B](https://huggingface.co/JonnyYu828/DepthVLM-4B).

	---

	## 🌟 Model Overview

	DepthVLM is a unified foundation model for both low-level dense geometry prediction and high-level multimodal understanding, with substantially faster inference than existing VLM-based approaches such as DepthLM and Youtu-VL.

	By attaching a lightweight depth head to the LLM backbone and adopting a two-stage supervision paradigm, DepthVLM transforms a single VLM into a native dense geometry predictor, while preserving its multimodal capabilities and enhancing its spatial reasoning.

	## 🧠 Key Characteristics

	- Native Dense Metric Depth Estimation in VLMs: Directly predicts geometry within the VLM framework.

	- Unified Multimodal Understanding and Geometry Prediction: Generates full-resolution depth maps alongside language outputs in a single forward pass.

	- Efficient Inference: Runs more efficiently than approaches that rely on per-pixel queries or coarse token-level outputs.

	- Versatile Application: Supports both indoor and outdoor metric depth estimation.

	- Improved 3D Spatial Reasoning: Enhances the VLM's spatial reasoning, a step toward a truly unified foundation model.
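To make the efficiency point above concrete, here is a rough back-of-the-envelope sketch in Python. It is not taken from the DepthVLM codebase. It assumes that a per-pixel query approach needs one model call per pixel, and that a coarse token-level approach emits one depth value per `patch × patch` block. The names `DepthOutputCost`, `per_pixel_query`, `coarse_token_output`, `dense_head`, and the `patch` parameter are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DepthOutputCost:
    """Model calls needed and depth values obtained for one image."""
    forward_passes: int
    depth_values: int


def per_pixel_query(height: int, width: int) -> DepthOutputCost:
    # Assumption: one model call per pixel to cover the whole image.
    return DepthOutputCost(forward_passes=height * width,
                           depth_values=height * width)


def coarse_token_output(height: int, width: int, patch: int) -> DepthOutputCost:
    # Assumption: a single pass that yields one value per patch x patch block.
    return DepthOutputCost(forward_passes=1,
                           depth_values=(height // patch) * (width // patch))


def dense_head(height: int, width: int) -> DepthOutputCost:
    # A dense head yields a full-resolution map in a single pass.
    return DepthOutputCost(forward_passes=1, depth_values=height * width)


if __name__ == "__main__":
    h, w = 480, 640  # illustrative resolution, not from the model card
    for name, cost in [
        ("per-pixel query", per_pixel_query(h, w)),
        ("coarse tokens (patch=16)", coarse_token_output(h, w, 16)),
        ("dense head", dense_head(h, w)),
    ]:
        print(f"{name}: {cost.forward_passes} passes, {cost.depth_values} values")
```

Under these assumptions, only the dense-head case combines a single forward pass with one depth value per pixel. Actual latency depends on details the model card does not give.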

	---

	## 🚀 Main Results

	### Comparison with VLMs

	| Benchmark | Ours-8B | Ours-4B | DepthLM-12B | Youtu-VL-4B |
	| ---------- | -----------: | -----------: | ----------: | --------------: |
	| Argoverse2 | <u>0.798</u> | **0.810** | 0.761 | 0.663 |
	| Waymo | <u>0.865</u> | **0.879** | 0.588 | 0.473 |
	| DDAD | <u>0.813</u> | **0.818** | 0.654 | 0.342 |
	| NuScenes | **0.831** | <u>0.821</u> | 0.736 | 0.698 |
	| ETH3D | **0.928** | <u>0.924</u> | 0.666 | 0.286 |
	| ScanNet++ | **0.901** | <u>0.861</u> | 0.756 | 0.522 |
	| SUN RGB-D | **0.889** | <u>0.882</u> | 0.785 | 0.734 |
	| IBims-1 | **0.936** | <u>0.912</u> | 0.754 | 0.856 |
	| NYUv2 | **0.920** | <u>0.908</u> | 0.866 | 0.849 |

	> **Bold** and <u>underlined</u> indicate the best and second-best results among the compared models.
	> More details can be found in our paper.
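The card lists `accuracy` as its metric but does not define it. One widely used accuracy measure for metric depth is the threshold accuracy δ1: the fraction of valid pixels where `max(pred/gt, gt/pred) < 1.25`. That the table reports exactly this quantity is an assumption here; the paper is the authoritative source. The Python sketch below is illustrative only, and the function name `delta1_accuracy` is hypothetical.

```python
from typing import Iterable


def delta1_accuracy(pred: Iterable[float], gt: Iterable[float],
                    threshold: float = 1.25) -> float:
    """Fraction of valid pixels with max(pred/gt, gt/pred) < threshold.

    Pixels whose ground-truth depth is not positive are treated as invalid
    and skipped. A non-positive prediction on a valid pixel counts as a miss.
    """
    valid = 0
    hits = 0
    for p, g in zip(pred, gt):
        if g <= 0:
            continue  # no usable ground truth at this pixel
        valid += 1
        if p > 0 and max(p / g, g / p) < threshold:
            hits += 1
    if valid == 0:
        raise ValueError("no pixels with valid ground-truth depth")
    return hits / valid


if __name__ == "__main__":
    # Toy depths in metres, flattened to 1-D lists (made-up values).
    predicted = [1.0, 2.0, 5.0, 3.0]
    ground_truth = [1.1, 2.0, 3.0, 0.0]  # last pixel has no ground truth
    print(f"delta1 = {delta1_accuracy(predicted, ground_truth):.3f}")
```

With this definition, higher is better and 1.0 means every valid pixel is within 25% of the ground truth in ratio terms.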


	## Citation

	If you find DepthVLM useful for your research or applications, please consider citing our work using the following BibTeX:

	```bibtex
	@article{yu2026unlocking,
	title={Unlocking Dense Metric Depth Estimation in VLMs},
	author={Hanxun Yu and Xuan Qu and Yuxin Wang and Jianke Zhu and Lei Ke},
	journal={arXiv preprint arXiv:2605.15876},
	year={2026}
	}
	```