---
license: apache-2.0
language:
- en
metrics:
- accuracy
base_model:
- Qwen/Qwen3-VL-4B-Instruct
library_name: transformers
tags:
- vision-language-model
- depth-estimation
- 3d-vision
- multimodal
pipeline_tag: depth-estimation
---

# Unlocking Dense Metric Depth Estimation in VLMs

Project Page: depthvlm.github.io | GitHub: hanxunyu/DepthVLM | arXiv: 2605.15876

---

## 📰 News

* **2026.05**: Released [DepthVLM-Bench](https://huggingface.co/datasets/JonnyYu828/DepthVLM-Bench).
* **2026.05**: Released [DepthVLM-4B](https://huggingface.co/JonnyYu828/DepthVLM-4B).

---

## 🌟 Model Overview

DepthVLM serves as **a unified foundation model for both low-level dense geometry prediction and high-level multimodal understanding**, while achieving substantially faster inference than existing VLM-based approaches such as DepthLM and Youtu-VL.

By attaching a lightweight depth head to the LLM backbone and adopting a two-stage supervision paradigm, DepthVLM transforms a single VLM into a native dense geometry predictor, while preserving its multimodal capabilities and enhancing its spatial reasoning.

## 🧠 Key Characteristics

- **Native Dense Metric Depth Estimation in VLMs**: Directly predicts geometry within the VLM framework.
- **Unified Multimodal Understanding and Geometry Prediction**: Generates full-resolution depth maps alongside language outputs in a single forward pass.
- **Efficient Inference**: More efficient than approaches that rely on per-pixel queries or coarse token-level outputs.
- **Versatile Application**: Supports both indoor and outdoor metric depth estimation.
- **Improved 3D Spatial Reasoning**: Moving toward a truly unified foundation model.

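
This card does not define the accuracy scores reported under Main Results; the paper is the reference for that. A widely used accuracy measure for metric depth is the threshold accuracy δ1, the fraction of valid pixels where `max(pred/gt, gt/pred) < 1.25`. The sketch below assumes that definition. The function name `delta1_accuracy`, the validity rule (ground truth greater than zero), and the sample depths are illustrative and are not taken from DepthVLM's evaluation code.

```python
def delta1_accuracy(pred_m, gt_m, threshold=1.25):
    """Threshold accuracy over flattened depth values given in meters.

    Assumption: a pixel is valid when its ground-truth depth is > 0.
    Non-positive predictions on valid pixels are counted as misses.
    """
    valid = 0
    hits = 0
    for p, g in zip(pred_m, gt_m):
        if g <= 0:
            continue  # no ground truth for this pixel
        valid += 1
        if p > 0 and max(p / g, g / p) < threshold:
            hits += 1
    if valid == 0:
        raise ValueError("no valid ground-truth pixels")
    return hits / valid


# Hypothetical depths in meters; the last pixel has no ground truth.
pred = [2.0, 4.9, 10.0, 3.0]
gt = [2.1, 5.0, 7.0, 0.0]
print(round(delta1_accuracy(pred, gt), 3))  # 0.667
```

Because the ratio is symmetric, over- and under-estimation by the same factor are penalized equally, which is why this measure is usually computed on metric depth without any scale alignment.
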

---

## 🚀 Main Results

### Comparison with VLMs

| Benchmark  |      Ours-8B |      Ours-4B | DepthLM-12B | Youtu-VL-4B |
| ---------- | -----------: | -----------: | ----------: | ----------: |
| Argoverse2 | <u>0.798</u> |    **0.810** |       0.761 |       0.663 |
| Waymo      | <u>0.865</u> |    **0.879** |       0.588 |       0.473 |
| DDAD       | <u>0.813</u> |    **0.818** |       0.654 |       0.342 |
| NuScenes   |    **0.831** | <u>0.821</u> |       0.736 |       0.698 |
| ETH3D      |    **0.928** | <u>0.924</u> |       0.666 |       0.286 |
| ScanNet++  |    **0.901** | <u>0.861</u> |       0.756 |       0.522 |
| SUN RGB-D  |    **0.889** | <u>0.882</u> |       0.785 |       0.734 |
| IBims-1    |    **0.936** | <u>0.912</u> |       0.754 |       0.856 |
| NYUv2      |    **0.920** | <u>0.908</u> |       0.866 |       0.849 |

> **Bold** and <u>underlined</u> values indicate the best and second-best results among the compared models.
> More details can be found in our paper.

## Citation

If you find DepthVLM useful for your research or applications, please consider citing our work using the following BibTeX:

```bibtex
@article{yu2026unlocking,
  title={Unlocking Dense Metric Depth Estimation in VLMs},
  author={Hanxun Yu and Xuan Qu and Yuxin Wang and Jianke Zhu and Lei Ke},
  journal={arXiv preprint arXiv:2605.15876},
  year={2026}
}
```