---
license: apache-2.0
language:
- en
metrics:
- accuracy
base_model:
- Qwen/Qwen3-VL-4B-Instruct
library_name: transformers
tags:
- vision-language-model
- depth-estimation
- 3d-vision
- multimodal
pipeline_tag: depth-estimation
---
# Unlocking Dense Metric Depth Estimation in VLMs

---
## 📰 News
* **2026.05** — Released [DepthVLM-Bench](https://huggingface.co/datasets/JonnyYu828/DepthVLM-Bench).
* **2026.05** — Released [DepthVLM-4B](https://huggingface.co/JonnyYu828/DepthVLM-4B).
---
## 🌟 Model Overview
DepthVLM is **a unified foundation model for both low-level dense geometry prediction and high-level multimodal understanding**, with substantially faster inference than existing VLM-based approaches such as DepthLM and Youtu-VL.
By attaching a lightweight depth head to the LLM backbone and adopting a two-stage supervision paradigm, DepthVLM transforms a single VLM into a native dense geometry predictor, while preserving its multimodal capabilities and enhancing its spatial reasoning.
## 🧠 Key Characteristics
- **Native Dense Metric Depth Estimation in VLMs**: Directly predicts geometry within the VLM framework.
- **Unified Multimodal Understanding and Geometry Prediction**: Generates full-resolution depth maps alongside language outputs in a single forward pass.
- **Efficient Inference**: Runs more efficiently than approaches that query the model per pixel or emit coarse token-level outputs.
- **Versatile Application**: Supports both indoor and outdoor metric depth estimation.
- **Improved 3D Spatial Reasoning**: Enhances spatial reasoning, a step toward a truly unified foundation model.
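
To give a rough feel for the efficiency point above, here is a back-of-the-envelope count of model calls. It assumes, purely for illustration, that a per-pixel query approach costs one forward pass per queried pixel and that a dense head returns the whole map in one pass. The function names and the `stride` parameter are hypothetical, and the numbers are not measurements of DepthLM, Youtu-VL, or DepthVLM.

```python
def calls_per_pixel_query(height, width, stride=1):
    """Forward passes needed if every queried pixel costs one model call.

    `stride` subsamples the pixel grid (hypothetical knob for this sketch).
    """
    rows = len(range(0, height, stride))
    cols = len(range(0, width, stride))
    return rows * cols


def calls_dense_head():
    """Forward passes needed if one call yields the full depth map."""
    return 1


if __name__ == "__main__":
    h, w = 480, 640
    print(calls_per_pixel_query(h, w))             # 307200
    print(calls_per_pixel_query(h, w, stride=16))  # 1200
    print(calls_dense_head())                      # 1
```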
---
## 🚀 Main Results
### Comparison with VLMs
| Benchmark | Ours-8B | Ours-4B | DepthLM-12B | Youtu-VL-4B |
| ---------- | -----------: | -----------: | ----------: | --------------: |
| Argoverse2 | <u>0.798</u> | **0.810** | 0.761 | 0.663 |
| Waymo | <u>0.865</u> | **0.879** | 0.588 | 0.473 |
| DDAD | <u>0.813</u> | **0.818** | 0.654 | 0.342 |
| NuScenes | **0.831** | <u>0.821</u> | 0.736 | 0.698 |
| ETH3D | **0.928** | <u>0.924</u> | 0.666 | 0.286 |
| ScanNet++ | **0.901** | <u>0.861</u> | 0.756 | 0.522 |
| SUN RGB-D | **0.889** | <u>0.882</u> | 0.785 | 0.734 |
| IBims-1 | **0.936** | <u>0.912</u> | 0.754 | 0.856 |
| NYUv2 | **0.920** | <u>0.908</u> | 0.866 | 0.849 |
> **Bold** and underlined indicate the best and second-best results among the compared models.
> More details can be found in our paper.
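
This card lists `accuracy` as its metric but does not spell out how the scores are computed. Depth benchmarks commonly report threshold accuracy (often written δ1): the fraction of valid pixels where `max(pred / gt, gt / pred) < 1.25`. The sketch below shows that metric under the assumption that it is the relevant one. It is not taken from the DepthVLM evaluation code, and the function name and the flat-list input format are illustrative choices.

```python
import math


def threshold_accuracy(pred, target, threshold=1.25):
    """Fraction of valid pixels with max(pred/gt, gt/pred) below `threshold`.

    `pred` and `target` are flat sequences of depths in the same unit.
    Pixels with non-positive ground truth are treated as invalid and skipped.
    Non-positive predictions on valid pixels count as misses.
    """
    hits = 0
    valid = 0
    for p, t in zip(pred, target):
        if t <= 0:
            continue  # no usable ground truth at this pixel
        valid += 1
        if p > 0 and max(p / t, t / p) < threshold:
            hits += 1
    return hits / valid if valid else math.nan


if __name__ == "__main__":
    pred = [1.0, 2.0, 4.0, 10.0]
    gt = [1.1, 2.0, 8.0, 0.0]  # last pixel has no valid ground truth
    print(threshold_accuracy(pred, gt))  # two of three valid pixels pass
```

Under this definition a score of 1.0 means every valid pixel is within 25% of the ground truth in ratio terms, so higher is better.
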
## Citation
If you find DepthVLM useful for your research or applications, please consider citing our work using the following BibTeX:
```bibtex
@article{yu2026unlocking,
title={Unlocking Dense Metric Depth Estimation in VLMs},
author={Hanxun Yu and Xuan Qu and Yuxin Wang and Jianke Zhu and Lei Ke},
journal={arXiv preprint arXiv:2605.15876},
year={2026}
}
```