Depth Estimation
Transformers
Safetensors
English
qwen3_vl
image-text-to-text
vision-language-model
3d-vision
multimodal
Instructions to use JonnyYu828/DepthVLM-4B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use JonnyYu828/DepthVLM-4B with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("depth-estimation", model="JonnyYu828/DepthVLM-4B")# Load model directly from transformers import AutoProcessor, AutoModelForImageTextToText processor = AutoProcessor.from_pretrained("JonnyYu828/DepthVLM-4B") model = AutoModelForImageTextToText.from_pretrained("JonnyYu828/DepthVLM-4B") - Notebooks
- Google Colab
- Kaggle
File size: 4,176 Bytes
627d308 5a5e54f c9e5ec0 585453e bd84f4b 585453e 5a5e54f 51bea97 5a5e54f b3b0bda 5a5e54f b3b0bda 5a5e54f cd6abb1 5a5e54f 13ae84c 5a5e54f 13ae84c cd6abb1 13ae84c 5a5e54f 13ae84c cd6abb1 585453e 1b7c7cf 13ae84c be7985a 5a5e54f be7985a 5a5e54f 585453e 5a5e54f be7985a 13ae84c 5a5e54f 13ae84c 5a5e54f 3306f04 5a5e54f b8ee814 2b2d02f b8ee814 585453e b8ee814 5a5e54f 585453e | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 | ---
license: apache-2.0
language:
- en
metrics:
- accuracy
base_model:
- Qwen/Qwen3-VL-4B-Instruct
library_name: transformers
tags:
- vision-language-model
- depth-estimation
- 3d-vision
- multimodal
pipeline_tag: depth-estimation
---
<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/652d3023cdb2a91205709b6a/G_n7TbVjIv8PYUjq0kxPP.png" width="160" />
</p>
<h2 align="center">
Unlocking Dense Metric Depth Estimation in VLMs
</h2>
<h4 align="center">
<b>Project Page:</b> <a href="https://depthvlm.github.io/">depthvlm.github.io</a> |
<b>GitHub:</b> <a href="https://github.com/hanxunyu/DepthVLM">hanxunyu/DepthVLM</a> |
<b>arXiv:</b> <a href="https://arxiv.org/abs/2605.15876">2605.15876</a>
<br><br>
<a href="https://depthvlm.github.io/"><img src="https://img.shields.io/badge/Project-Home Page-green?logo=safari&logoColor=white" alt="Project Home Page"></a>
<a href="https://github.com/hanxunyu/DepthVLM"><img src="https://img.shields.io/badge/GitHub-Repository-blue?logo=github" alt="GitHub Badge"></a>
<a href="https://huggingface.co/datasets/JonnyYu828/DepthVLM-Bench"><img src="https://img.shields.io/badge/HuggingFace-Benchmark-yellow?logo=huggingface" alt="Hugging Face Benchmark"></a>
<a href="https://arxiv.org/abs/2605.15876"><img src="https://img.shields.io/badge/arXiv-2605.15876-b31b1b.svg?logo=arxiv&logoColor=red" alt="arXiv"></a>
</h4>
---
## ๐ฐ News
* **2026.05** โ Released [DepthVLM-Bench](https://huggingface.co/datasets/JonnyYu828/DepthVLM-Bench).
* **2026.05** โ Released [DepthVLM-4B](https://huggingface.co/JonnyYu828/DepthVLM-4B).
---
## ๐ Model Overview
DepthVLM serves as **a unified foundation model for both low-level dense geometry prediction and high-level multimodal understanding**, while achieving substantially faster inference compared with existing VLM-based approaches such as DepthLM and Youtu-VL.
By attaching a lightweight depth head to the LLM backbone and adopting a two-stage supervision paradigm, DepthVLM transforms a single VLM into a native dense geometry predictor, while preserving its multimodal capabilities and enhancing its spatial reasoning.
## ๐ง Key Characteristics
- **Native Dense Metric Depth Estimation in VLMs**: Directly predicts geometry within the VLM framework.
- **Unified Multimodal Understanding and Geometry Prediction**: Generates full-resolution depth maps alongside language outputs in a single forward pass.
- **Efficient Inference**: Achieves higher efficiency compared to per-pixel query or coarse token-level outputs.
- **Versatile Application**: Supports both indoor and outdoor metric depth estimation.
- **Improved 3D Spatial Reasoning**: Moving toward a truly unified foundation model.
---
## ๐ Main Results
### Comparison with VLMs
| Benchmark | Ours-8B | Ours-4B | DepthLM-12B | **Youtu-VL-4B** |
| ---------- | -----------: | -----------: | ----------: | --------------: |
| Argoverse2 | <u>0.798</u> | **0.810** | 0.761 | 0.663 |
| Waymo | <u>0.865</u> | **0.879** | 0.588 | 0.473 |
| DDAD | <u>0.813</u> | **0.818** | 0.654 | 0.342 |
| NuScenes | **0.831** | <u>0.821</u> | 0.736 | 0.698 |
| ETH3D | **0.928** | <u>0.924</u> | 0.666 | 0.286 |
| ScanNet++ | **0.901** | <u>0.861</u> | 0.756 | 0.522 |
| SUN RGB-D | **0.889** | <u>0.882</u> | 0.785 | 0.734 |
| IBims-1 | **0.936** | <u>0.912</u> | 0.754 | 0.856 |
| NYUv2 | **0.920** | <u>0.908</u> | 0.866 | 0.849 |
> **Bold** and <u>underlined</u> indicate the best and second-best results among the compared models.
> More details can be found in our paper.
## Citation
If you find DepthVLM useful for your research or applications, please consider citing our work using the following BibTeX:
```bibtex
@article{yu2026unlocking,
title={Unlocking Dense Metric Depth Estimation in VLMs},
author={Hanxun Yu and Xuan Qu and Yuxin Wang and Jianke Zhu and Lei Ke},
journal={arXiv preprint arXiv:2605.15876},
year={2026}
}
``` |