---
license: apache-2.0
language:
- en
metrics:
- accuracy
base_model:
- Qwen/Qwen3-VL-4B-Instruct
library_name: transformers
tags:
- vision-language-model
- depth-estimation
- 3d-vision
- multimodal
pipeline_tag: depth-estimation
---

<p align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/652d3023cdb2a91205709b6a/G_n7TbVjIv8PYUjq0kxPP.png" width="160" />
</p>
<h2 align="center">
Unlocking Dense Metric Depth Estimation in VLMs
</h2>

<h4 align="center">
  <b>Project Page:</b> <a href="https://depthvlm.github.io/">depthvlm.github.io</a> | 
  <b>GitHub:</b> <a href="https://github.com/hanxunyu/DepthVLM">hanxunyu/DepthVLM</a> | 
  <b>arXiv:</b> <a href="https://arxiv.org/abs/2605.15876">2605.15876</a>
  <br><br>
  <a href="https://depthvlm.github.io/"><img src="https://img.shields.io/badge/Project-Home Page-green?logo=safari&logoColor=white" alt="Project Home Page"></a>
  <a href="https://github.com/hanxunyu/DepthVLM"><img src="https://img.shields.io/badge/GitHub-Repository-blue?logo=github" alt="GitHub Badge"></a>
  <a href="https://huggingface.co/datasets/JonnyYu828/DepthVLM-Bench"><img src="https://img.shields.io/badge/HuggingFace-Benchmark-yellow?logo=huggingface" alt="Hugging Face Benchmark"></a>
  <a href="https://arxiv.org/abs/2605.15876"><img src="https://img.shields.io/badge/arXiv-2605.15876-b31b1b.svg?logo=arxiv&logoColor=red" alt="arXiv"></a>
</h4>


---

## 📰 News

* **2026.05**: Released [DepthVLM-Bench](https://huggingface.co/datasets/JonnyYu828/DepthVLM-Bench).
* **2026.05**: Released [DepthVLM-4B](https://huggingface.co/JonnyYu828/DepthVLM-4B).

---

## 🌟 Model Overview

DepthVLM is **a unified foundation model for both low-level dense geometry prediction and high-level multimodal understanding**, with substantially faster inference than existing VLM-based approaches such as DepthLM and Youtu-VL.

By attaching a lightweight depth head to the LLM backbone and adopting a two-stage supervision paradigm, DepthVLM transforms a single VLM into a native dense geometry predictor, while preserving its multimodal capabilities and enhancing its spatial reasoning.

## 🧠 Key Characteristics

- **Native Dense Metric Depth Estimation in VLMs**: Directly predicts geometry within the VLM framework.

- **Unified Multimodal Understanding and Geometry Prediction**: Generates full-resolution depth maps alongside language outputs in a single forward pass.

- **Efficient Inference**: Runs more efficiently than approaches that rely on per-pixel queries or coarse token-level outputs.

- **Versatile Application**: Supports both indoor and outdoor metric depth estimation.

- **Improved 3D Spatial Reasoning**: Strengthens the VLM's spatial reasoning, a step toward a truly unified foundation model.
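
Because the model predicts metric depth (absolute distances rather than relative ordering), its output can be compared with ground truth directly, without scale alignment. The sketch below shows one common way to score such predictions: threshold accuracy, often written δ1, which is the fraction of valid pixels where `max(pred/gt, gt/pred)` falls below 1.25. This card lists `accuracy` as its metric but does not define it, so treating it as δ1 is an assumption, and the function name `threshold_accuracy` is hypothetical rather than part of the DepthVLM codebase.

```python
import math


def threshold_accuracy(pred, gt, threshold=1.25):
    """Fraction of valid pixels with max(pred/gt, gt/pred) < threshold.

    `pred` and `gt` are flat sequences of depths in meters.
    Pixels with non-positive ground truth are treated as missing and skipped.
    NOTE: illustrative only; the 1.25 threshold is an assumed convention.
    """
    hits = 0
    valid = 0
    for p, g in zip(pred, gt):
        if g <= 0:
            continue  # no ground-truth depth for this pixel
        valid += 1
        # A non-positive prediction can never match a valid depth.
        if p > 0 and max(p / g, g / p) < threshold:
            hits += 1
    return hits / valid if valid else math.nan


# Toy example: two close predictions, one far off, one pixel without ground truth.
pred = [1.0, 2.0, 10.0, 5.0]
gt = [1.1, 2.0, 5.0, 0.0]
print(round(threshold_accuracy(pred, gt), 3))  # 0.667
```

Under this definition a score of 1.0 means every valid pixel is within 25% of the true depth in ratio terms, and higher is better.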

---

## 🚀 Main Results

### Comparison with VLMs
| Benchmark  |      Ours-8B |      Ours-4B | DepthLM-12B |     Youtu-VL-4B |
| ---------- | -----------: | -----------: | ----------: | --------------: |
| Argoverse2 | <u>0.798</u> |    **0.810** |       0.761 |           0.663 |
| Waymo      | <u>0.865</u> |    **0.879** |       0.588 |           0.473 |
| DDAD       | <u>0.813</u> |    **0.818** |       0.654 |           0.342 |
| NuScenes   |    **0.831** | <u>0.821</u> |       0.736 |           0.698 |
| ETH3D      |    **0.928** | <u>0.924</u> |       0.666 |           0.286 |
| ScanNet++  |    **0.901** | <u>0.861</u> |       0.756 |           0.522 |
| SUN RGB-D  |    **0.889** | <u>0.882</u> |       0.785 |           0.734 |
| IBims-1    |    **0.936** | <u>0.912</u> |       0.754 |           0.856 |
| NYUv2      |    **0.920** | <u>0.908</u> |       0.866 |           0.849 |
> **Bold** and <u>underlined</u> indicate the best and second-best results among the compared models.
> More details can be found in our paper.
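
As a quick sanity check on the table, the following sketch re-derives the best and second-best model per benchmark from the numbers above and counts how often each model ranks first. The scores are copied from the table; the variable and function names are illustrative.

```python
from collections import Counter

MODELS = ("Ours-8B", "Ours-4B", "DepthLM-12B", "Youtu-VL-4B")

# Scores copied from the comparison table (higher is better).
SCORES = {
    "Argoverse2": (0.798, 0.810, 0.761, 0.663),
    "Waymo":      (0.865, 0.879, 0.588, 0.473),
    "DDAD":       (0.813, 0.818, 0.654, 0.342),
    "NuScenes":   (0.831, 0.821, 0.736, 0.698),
    "ETH3D":      (0.928, 0.924, 0.666, 0.286),
    "ScanNet++":  (0.901, 0.861, 0.756, 0.522),
    "SUN RGB-D":  (0.889, 0.882, 0.785, 0.734),
    "IBims-1":    (0.936, 0.912, 0.754, 0.856),
    "NYUv2":      (0.920, 0.908, 0.866, 0.849),
}


def top_two(row):
    """Return the names of the best and second-best models for one benchmark."""
    ranked = sorted(zip(MODELS, row), key=lambda pair: pair[1], reverse=True)
    return ranked[0][0], ranked[1][0]


rankings = {bench: top_two(row) for bench, row in SCORES.items()}
wins = Counter(best for best, _ in rankings.values())

for bench, (best, second) in rankings.items():
    print(f"{bench:<11} best={best:<8} second={second}")
print(wins["Ours-8B"], wins["Ours-4B"])  # 6 3
```

On these numbers the two DepthVLM variants take first and second place on every benchmark, with the 8B model leading on six of the nine and the 4B model on the three driving datasets listed first.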


## Citation

If you find DepthVLM useful for your research or applications, please consider citing our work using the following BibTeX:

```bibtex
@article{yu2026unlocking,
    title={Unlocking Dense Metric Depth Estimation in VLMs},
    author={Hanxun Yu and Xuan Qu and Yuxin Wang and Jianke Zhu and Lei Ke},
    journal={arXiv preprint arXiv:2605.15876},
    year={2026}
}
```