Depth Estimation
Transformers
Safetensors
English
qwen3_vl
image-text-to-text
vision-language-model
3d-vision
multimodal
Instructions to use JonnyYu828/DepthVLM-4B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers

How to use JonnyYu828/DepthVLM-4B with Transformers:

    # Use a pipeline as a high-level helper
    from transformers import pipeline

    pipe = pipeline("depth-estimation", model="JonnyYu828/DepthVLM-4B")

    # Load model directly
    from transformers import AutoProcessor, AutoModelForImageTextToText

    processor = AutoProcessor.from_pretrained("JonnyYu828/DepthVLM-4B")
    model = AutoModelForImageTextToText.from_pretrained("JonnyYu828/DepthVLM-4B")

- Notebooks
- Google Colab
- Kaggle
Update README.md
README.md
CHANGED
@@ -27,9 +27,9 @@ Unlocking Dense Metric Depth Estimation in VLMs
 <b>GitHub:</b> <a href="https://github.com/hanxunyu/DepthVLM">hanxunyu/DepthVLM</a> |
 <b>arXiv:</b> <a href="https://arxiv.org/abs/2605.15876">2605.15876</a>
 <br><br>
-<a href="https://depthvlm.github.io/"><img src="https://img.shields.io/badge/Project-Page-green?logo=safari&logoColor=white" alt="Project Page"></a>
-<a href="https://github.com/hanxunyu/DepthVLM"><img src="https://img.shields.io/badge/GitHub-
-<a href="https://huggingface.co/JonnyYu828/DepthVLM-
+<a href="https://depthvlm.github.io/"><img src="https://img.shields.io/badge/Project-Home Page-green?logo=safari&logoColor=white" alt="Project Home Page"></a>
+<a href="https://github.com/hanxunyu/DepthVLM"><img src="https://img.shields.io/badge/GitHub-Repository-blue?logo=github" alt="GitHub Badge"></a>
+<a href="https://huggingface.co/datasets/JonnyYu828/DepthVLM-Bench"><img src="https://img.shields.io/badge/HuggingFace-Benchmark-yellow?logo=huggingface" alt="Hugging Face Benchmark"></a>
 <a href="https://arxiv.org/abs/2605.15876"><img src="https://img.shields.io/badge/arXiv-2605.15876-b31b1b.svg?logo=arxiv&logoColor=red" alt="arXiv"></a>
 </h4>
 
@@ -38,8 +38,8 @@ Unlocking Dense Metric Depth Estimation in VLMs
 
 ## 📰 News
 
-* **2026.05** — Released DepthVLM-Bench.
-* **2026.05** — Released DepthVLM-4B.
+* **2026.05** — Released [DepthVLM-Bench](https://huggingface.co/datasets/JonnyYu828/DepthVLM-Bench).
+* **2026.05** — Released [DepthVLM-4B](https://huggingface.co/JonnyYu828/DepthVLM-4B).
 
 ---
 
@@ -47,7 +47,7 @@ Unlocking Dense Metric Depth Estimation in VLMs
 
 DepthVLM serves as **a unified foundation model for both low-level dense geometry prediction and high-level multimodal understanding**, while achieving substantially faster inference compared with existing VLM-based approaches such as DepthLM and Youtu-VL.
 
-By attaching a lightweight depth head to the LLM backbone and
+By attaching a lightweight depth head to the LLM backbone and adopting a two-stage supervision paradigm, DepthVLM transforms a single VLM into a native dense geometry predictor, while preserving its multimodal capabilities and enhancing its spatial reasoning.
 
 ### Key Characteristics
 
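The README text in the last hunk describes attaching a lightweight depth head to the LLM backbone. The sketch below is only an illustration of what such a head could look like, not the DepthVLM implementation: it assumes the head is a single linear projection applied to the hidden state of each image-patch token, followed by a softplus so that predicted depth stays positive, and that the tokens are laid out row by row on a patch grid. The names `depth_head`, `weight`, `bias`, `grid_h`, and `grid_w` are hypothetical, and the README does not specify the real head design or the two-stage supervision details.

```python
import math
from typing import List


def softplus(x: float) -> float:
    """Numerically stable softplus: maps any real number to a positive value."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def depth_head(
    hidden: List[List[float]],  # one hidden-state vector per image-patch token
    weight: List[float],        # hypothetical projection weights, one per hidden dim
    bias: float,                # hypothetical projection bias
    grid_h: int,                # assumed patch-grid height
    grid_w: int,                # assumed patch-grid width
) -> List[List[float]]:
    """Project each patch token to one positive depth value and reshape to a grid."""
    if len(hidden) != grid_h * grid_w:
        raise ValueError("number of tokens must equal grid_h * grid_w")
    depths = []
    for token in hidden:
        if len(token) != len(weight):
            raise ValueError("token dimension must match weight dimension")
        # Linear projection of the token's hidden state to a scalar.
        score = sum(h * w for h, w in zip(token, weight)) + bias
        # Softplus keeps the predicted depth strictly positive.
        depths.append(softplus(score))
    # Tokens are assumed to be in row-major order over the patch grid.
    return [depths[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]


# Toy input: four patch tokens with two hidden dimensions, on a 2x2 grid.
hidden_states = [[0.5, -1.0], [1.0, 0.0], [0.0, 2.0], [-0.5, 0.5]]
depth_map = depth_head(hidden_states, weight=[1.0, 0.5], bias=0.1, grid_h=2, grid_w=2)
```

In this toy setup the output is a coarse, patch-resolution depth map; a real dense predictor would also need to upsample to pixel resolution, which the sketch leaves out.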