How to use JonnyYu828/DepthVLM-4B with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("depth-estimation", model="JonnyYu828/DepthVLM-4B")
```

```python
# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("JonnyYu828/DepthVLM-4B")
model = AutoModelForImageTextToText.from_pretrained("JonnyYu828/DepthVLM-4B")
```
```yaml
license: apache-2.0
language:
- en
metrics:
- accuracy
base_model:
- Qwen/Qwen3-VL-4B-Instruct
library_name: transformers
tags:
- vision-language-model
- depth-estimation
- 3d-vision
- multimodal
pipeline_tag: depth-estimation
```
<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/652d3023cdb2a91205709b6a/G_n7TbVjIv8PYUjq0kxPP.png" width="160" />
</p>
<h2 align="center">
Unlocking Dense Metric Depth Estimation in VLMs
</h2>
<h4 align="center">
<b>Project Page:</b> <a href="https://depthvlm.github.io/">depthvlm.github.io</a> |
<b>GitHub:</b> <a href="https://github.com/hanxunyu/DepthVLM">hanxunyu/DepthVLM</a> |
<b>arXiv:</b> <a href="https://arxiv.org/abs/2605.15876">2605.15876</a>
<br><br>
<a href="https://depthvlm.github.io/"><img src="https://img.shields.io/badge/Project-Home Page-green?logo=safari&logoColor=white" alt="Project Home Page"></a>
<a href="https://github.com/hanxunyu/DepthVLM"><img src="https://img.shields.io/badge/GitHub-Repository-blue?logo=github" alt="GitHub Badge"></a>
<a href="https://huggingface.co/datasets/JonnyYu828/DepthVLM-Bench"><img src="https://img.shields.io/badge/HuggingFace-Benchmark-yellow?logo=huggingface" alt="Hugging Face Benchmark"></a>
<a href="https://arxiv.org/abs/2605.15876"><img src="https://img.shields.io/badge/arXiv-2605.15876-b31b1b.svg?logo=arxiv&logoColor=red" alt="arXiv"></a>
</h4>

---

## News

* **2026.05**: Released [DepthVLM-Bench](https://huggingface.co/datasets/JonnyYu828/DepthVLM-Bench).
* **2026.05**: Released [DepthVLM-4B](https://huggingface.co/JonnyYu828/DepthVLM-4B).

---

## Model Overview

DepthVLM serves as **a unified foundation model for both low-level dense geometry prediction and high-level multimodal understanding**, while running substantially faster at inference than existing VLM-based approaches such as DepthLM and Youtu-VL.

By attaching a lightweight depth head to the LLM backbone and adopting a two-stage supervision paradigm, DepthVLM turns a single VLM into a native dense geometry predictor while preserving its multimodal capabilities and enhancing its spatial reasoning.

## Key Characteristics

- **Native Dense Metric Depth Estimation in VLMs**: Predicts geometry directly within the VLM framework.
- **Unified Multimodal Understanding and Geometry Prediction**: Generates full-resolution depth maps alongside language outputs in a single forward pass.
- **Efficient Inference**: More efficient than approaches that rely on per-pixel queries or coarse token-level outputs.
- **Versatile Application**: Supports both indoor and outdoor metric depth estimation.
- **Improved 3D Spatial Reasoning**: Strengthens spatial reasoning, a step toward a truly unified foundation model.
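
To make the efficiency point above concrete, the following sketch counts what each output strategy implies for one image: the number of model queries for a per-pixel query scheme, the number of depth values for a coarse token-level scheme, and the single forward pass of a dense full-resolution prediction. The 480×640 resolution, the 16-pixel patch size, and all function names are illustrative assumptions, not figures or APIs from the paper or the model. Under these assumptions, a per-pixel scheme needs one query for each of the 307,200 pixels, a token-level scheme yields only 1,200 depth values, and a dense head yields all 307,200 values from one pass.

```python
def per_pixel_queries(height: int, width: int) -> int:
    """Queries needed when the model is asked about one pixel at a time."""
    return height * width


def coarse_token_outputs(height: int, width: int, patch: int) -> int:
    """Depth values produced when each patch-sized token yields one value."""
    rows = -(-height // patch)  # ceiling division
    cols = -(-width // patch)
    return rows * cols


def dense_single_pass(height: int, width: int) -> tuple:
    """(forward passes, depth values) for a full-resolution map from one pass."""
    return 1, height * width


if __name__ == "__main__":
    h, w, patch = 480, 640, 16  # assumed image size and patch size
    print("per-pixel queries:   ", per_pixel_queries(h, w))
    print("coarse token outputs:", coarse_token_outputs(h, w, patch))
    print("dense single pass:   ", dense_single_pass(h, w))
```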

---

## Main Results

### Comparison with VLMs

| Benchmark  |      Ours-8B |      Ours-4B | DepthLM-12B | Youtu-VL-4B |
| ---------- | -----------: | -----------: | ----------: | ----------: |
| Argoverse2 | <u>0.798</u> |    **0.810** |       0.761 |       0.663 |
| Waymo      | <u>0.865</u> |    **0.879** |       0.588 |       0.473 |
| DDAD       | <u>0.813</u> |    **0.818** |       0.654 |       0.342 |
| NuScenes   |    **0.831** | <u>0.821</u> |       0.736 |       0.698 |
| ETH3D      |    **0.928** | <u>0.924</u> |       0.666 |       0.286 |
| ScanNet++  |    **0.901** | <u>0.861</u> |       0.756 |       0.522 |
| SUN RGB-D  |    **0.889** | <u>0.882</u> |       0.785 |       0.734 |
| IBims-1    |    **0.936** | <u>0.912</u> |       0.754 |       0.856 |
| NYUv2      |    **0.920** | <u>0.908</u> |       0.866 |       0.849 |

> **Bold** and <u>underlined</u> indicate the best and second-best results among the compared models.
>
> More details can be found in our paper.
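
The card lists `accuracy` as its metric but does not define it here. A common accuracy measure in metric depth estimation is the threshold accuracy δ1: the fraction of valid pixels whose predicted-to-true depth ratio, taken in whichever direction is larger, falls below 1.25. Whether the table reports exactly this quantity is an assumption; the paper is the authority. A minimal sketch, with a hypothetical function name and made-up sample depths:

```python
def delta1_accuracy(pred, gt, threshold=1.25):
    """Fraction of valid pixels with max(pred/gt, gt/pred) < threshold.

    `pred` and `gt` are flat sequences of depths in metres. Pixels with
    non-positive ground truth are skipped; a non-positive prediction on a
    valid pixel counts as a miss.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")
    hits = 0
    valid = 0
    for p, g in zip(pred, gt):
        if g <= 0:  # no ground truth available for this pixel
            continue
        valid += 1
        if p > 0 and max(p / g, g / p) < threshold:
            hits += 1
    if valid == 0:
        raise ValueError("no valid ground-truth pixels")
    return hits / valid


if __name__ == "__main__":
    gt = [2.0, 4.0, 10.0, 0.0]    # last pixel has no ground truth
    pred = [2.2, 5.5, 9.0, 3.0]   # ratios: 1.10, 1.375, 1.11, skipped
    print(f"delta1 = {delta1_accuracy(pred, gt):.3f}")
```

In this toy input, two of the three valid pixels fall within the 1.25 ratio, so the score is 2/3.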

## Citation

If you find DepthVLM useful for your research or applications, please consider citing our work using the following BibTeX:

```bibtex
@article{yu2026unlocking,
  title={Unlocking Dense Metric Depth Estimation in VLMs},
  author={Hanxun Yu and Xuan Qu and Yuxin Wang and Jianke Zhu and Lei Ke},
  journal={arXiv preprint arXiv:2605.15876},
  year={2026}
}
```