---
license: apache-2.0
language:
- en
metrics:
- accuracy
base_model:
- Qwen/Qwen3-VL-4B-Instruct
library_name: transformers
tags:
- vision-language-model
- depth-estimation
- 3d-vision
- multimodal
pipeline_tag: depth-estimation
---
<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/652d3023cdb2a91205709b6a/G_n7TbVjIv8PYUjq0kxPP.png" width="160" />
</p>
<h2 align="center">
Unlocking Dense Metric Depth Estimation in VLMs
</h2>
<h4 align="center">
<b>Project Page:</b> <a href="https://depthvlm.github.io/">depthvlm.github.io</a> |
<b>GitHub:</b> <a href="https://github.com/hanxunyu/DepthVLM">hanxunyu/DepthVLM</a> |
<b>arXiv:</b> <a href="https://arxiv.org/abs/2605.15876">2605.15876</a>
<br><br>
<a href="https://depthvlm.github.io/"><img src="https://img.shields.io/badge/Project-Home Page-green?logo=safari&logoColor=white" alt="Project Home Page"></a>
<a href="https://github.com/hanxunyu/DepthVLM"><img src="https://img.shields.io/badge/GitHub-Repository-blue?logo=github" alt="GitHub Badge"></a>
<a href="https://huggingface.co/datasets/JonnyYu828/DepthVLM-Bench"><img src="https://img.shields.io/badge/HuggingFace-Benchmark-yellow?logo=huggingface" alt="Hugging Face Benchmark"></a>
<a href="https://arxiv.org/abs/2605.15876"><img src="https://img.shields.io/badge/arXiv-2605.15876-b31b1b.svg?logo=arxiv&logoColor=red" alt="arXiv"></a>
</h4>
---
## 📰 News
* **2026.05**: Released [DepthVLM-Bench](https://huggingface.co/datasets/JonnyYu828/DepthVLM-Bench).
* **2026.05**: Released [DepthVLM-4B](https://huggingface.co/JonnyYu828/DepthVLM-4B).
---
## 🌟 Model Overview
DepthVLM is **a unified foundation model for both low-level dense geometry prediction and high-level multimodal understanding**, with substantially faster inference than existing VLM-based approaches such as DepthLM and Youtu-VL.
By attaching a lightweight depth head to the LLM backbone and adopting a two-stage supervision paradigm, DepthVLM transforms a single VLM into a native dense geometry predictor, while preserving its multimodal capabilities and enhancing its spatial reasoning.
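
To make the depth-head idea concrete, here is a minimal PyTorch sketch of how hidden states for the visual tokens could be turned into a dense depth map. It is not the DepthVLM implementation: the class name `ToyDepthHead`, the layer sizes, the row-major token grid, and the softplus output are all assumptions chosen for illustration. See the paper and the GitHub repository for the actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyDepthHead(nn.Module):
    """Hypothetical lightweight head: visual-token hidden states -> dense depth map."""

    def __init__(self, hidden_dim: int, mid_channels: int = 64):
        super().__init__()
        self.proj = nn.Conv2d(hidden_dim, mid_channels, kernel_size=1)
        self.refine = nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1)
        self.out = nn.Conv2d(mid_channels, 1, kernel_size=1)

    def forward(self, tokens, grid_hw, image_hw):
        # tokens: (B, N, C) hidden states of the N = h * w visual tokens,
        # assumed to be laid out in row-major order over the image grid.
        b, n, c = tokens.shape
        h, w = grid_hw
        assert n == h * w, "token count must match the visual grid"
        x = tokens.transpose(1, 2).reshape(b, c, h, w)
        x = F.gelu(self.proj(x))
        # Upsample the coarse token grid to the requested image resolution.
        x = F.interpolate(x, size=image_hw, mode="bilinear", align_corners=False)
        x = F.gelu(self.refine(x))
        # Softplus keeps the predicted depth positive (units assumed to be metres).
        return F.softplus(self.out(x))


head = ToyDepthHead(hidden_dim=32)
tokens = torch.randn(2, 4 * 6, 32)  # stand-in for backbone hidden states
depth = head(tokens, grid_hw=(4, 6), image_hw=(56, 84))
print(depth.shape)  # torch.Size([2, 1, 56, 84])
```

In this sketch the head runs once per image and produces every pixel in the same pass, which is the property that separates a dense head from per-pixel querying.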
## 🧠 Key Characteristics
- **Native Dense Metric Depth Estimation in VLMs**: Directly predicts geometry within the VLM framework.
- **Unified Multimodal Understanding and Geometry Prediction**: Generates full-resolution depth maps alongside language outputs in a single forward pass.
- **Efficient Inference**: Runs more efficiently than approaches that query depth pixel by pixel or emit only coarse token-level outputs.
- **Versatile Application**: Supports both indoor and outdoor metric depth estimation.
- **Improved 3D Spatial Reasoning**: Enhances the VLM's spatial reasoning, a step toward a truly unified foundation model.
---
## 🚀 Main Results
### Comparison with VLMs
| Benchmark  |      Ours-8B |      Ours-4B | DepthLM-12B |     Youtu-VL-4B |
| ---------- | -----------: | -----------: | ----------: | --------------: |
| Argoverse2 | <u>0.798</u> | **0.810** | 0.761 | 0.663 |
| Waymo | <u>0.865</u> | **0.879** | 0.588 | 0.473 |
| DDAD | <u>0.813</u> | **0.818** | 0.654 | 0.342 |
| NuScenes | **0.831** | <u>0.821</u> | 0.736 | 0.698 |
| ETH3D | **0.928** | <u>0.924</u> | 0.666 | 0.286 |
| ScanNet++ | **0.901** | <u>0.861</u> | 0.756 | 0.522 |
| SUN RGB-D | **0.889** | <u>0.882</u> | 0.785 | 0.734 |
| IBims-1 | **0.936** | <u>0.912</u> | 0.754 | 0.856 |
| NYUv2 | **0.920** | <u>0.908</u> | 0.866 | 0.849 |
> **Bold** and <u>underlined</u> indicate the best and second-best results among the compared models.
> More details can be found in our paper.
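
This page does not name the metric behind the table; the model card metadata lists only `accuracy`. A common accuracy measure for metric depth is the threshold accuracy δ1, the fraction of valid pixels where `max(pred/gt, gt/pred) < 1.25`. The sketch below assumes a metric of that kind purely for illustration. The function name `delta1_accuracy` and the rule for skipping invalid pixels are hypothetical, so check the paper for the exact evaluation protocol.

```python
def delta1_accuracy(pred, gt, threshold=1.25):
    """Fraction of valid pixels with max(pred/gt, gt/pred) < threshold.

    pred and gt are flat sequences of depths in the same unit.
    Pixels with gt <= 0 are treated as missing ground truth and skipped.
    """
    valid = hits = 0
    for p, g in zip(pred, gt):
        if g <= 0:
            continue  # no ground truth at this pixel
        valid += 1
        # A non-positive prediction counts as a miss.
        if p > 0 and max(p / g, g / p) < threshold:
            hits += 1
    if valid == 0:
        raise ValueError("no valid ground-truth pixels")
    return hits / valid


pred = [2.0, 4.0, 10.0, 1.0]
gt = [2.2, 3.0, 9.0, 0.0]  # the last pixel has no ground truth
print(round(delta1_accuracy(pred, gt), 3))  # 0.667
```

Under this definition, higher is better and 1.0 means every valid pixel falls within the ratio threshold.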
## Citation
If you find DepthVLM useful for your research or applications, please consider citing our work using the following BibTeX:
```bibtex
@article{yu2026unlocking,
title={Unlocking Dense Metric Depth Estimation in VLMs},
author={Hanxun Yu and Xuan Qu and Yuxin Wang and Jianke Zhu and Lei Ke},
journal={arXiv preprint arXiv:2605.15876},
year={2026}
}
```