Improve model card metadata and content

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +22 -24
README.md CHANGED
@@ -1,20 +1,15 @@
  ---
- license: apache-2.0
-
  base_model:
- - Qwen/Qwen3-VL-4B-Instruct
-
+ - Qwen/Qwen3-VL-4B-Instruct
+ license: apache-2.0
  pipeline_tag: depth-estimation
-
+ library_name: transformers
  tags:
- - vision-language-model
- - depth-estimation
- - 3d-vision
- - multimodal
- - qwen3-vl
-
- paper:
- - arxiv: 2605.15876
+ - vision-language-model
+ - depth-estimation
+ - 3d-vision
+ - multimodal
+ - qwen3-vl
  ---

  Update 2026-05-18 (v1.0): Initial release
@@ -23,17 +18,21 @@ Update 2026-05-18 (v1.0): Initial release

  DepthVLM serves as a unified foundation model for both low-level dense geometry prediction and high-level multimodal understanding, while achieving substantially faster inference compared with existing VLM-based approaches such as DepthLM and Youtu-VL.

+ By attaching a lightweight depth head to the LLM backbone and training under a unified vision-text supervision paradigm, DepthVLM transforms a single VLM into a native dense geometry predictor while preserving its multimodal capability.
+
  ## Highlights

- - Native dense metric depth estimation in VLMs
- - Unified multimodal understanding and geometry prediction
- - Full-resolution depth prediction with efficient inference
- - Supports both indoor and outdoor metric depth estimation
- - Improved 3D spatial reasoning capability
+ - **Native dense metric depth estimation in VLMs**: Directly predicts geometry within the VLM framework.
+ - **Unified multimodal understanding and geometry prediction**: Generates full-resolution depth maps alongside language outputs in a single forward pass.
+ - **Efficient Inference**: Achieves higher efficiency compared to per-pixel query or coarse token-level outputs.
+ - **Versatile Application**: Supports both indoor and outdoor metric depth estimation.
+ - **Improved 3D spatial reasoning**: Moving toward a truly unified foundation model.

- ## Paper
+ ## Resources

- [Unlocking Dense Metric Depth Estimation in VLMs](https://arxiv.org/abs/2605.15876)
+ - **Paper:** [Unlocking Dense Metric Depth Estimation in VLMs](https://arxiv.org/abs/2605.15876)
+ - **Project Page:** [https://depthvlm.github.io/](https://depthvlm.github.io/)
+ - **Repository:** [https://github.com/hanxunyu/DepthVLM](https://github.com/hanxunyu/DepthVLM)

  ## Usage

@@ -44,16 +43,15 @@ Please refer to the official repository for detailed instructions on:
  - Evaluation
  - Inference and visualization

- Repository: https://github.com/hanxunyu/DepthVLM
-
  ## Citation

  If you find this work useful, please cite:

- ```bibtex id="k2m9wq"
+ ```bibtex
  @article{yu2026unlocking,
  title={Unlocking Dense Metric Depth Estimation in VLMs},
  author={Hanxun Yu and Xuan Qu and Yuxin Wang and Jianke Zhu and Lei Ke},
  journal={arXiv preprint arXiv:2605.15876},
  year={2026}
- }
+ }
+ ```
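
One way to sanity-check the metadata block as it stands after this change is to parse the front matter and inspect its keys. The following is a minimal sketch, not part of the PR: `parse_front_matter` is a hypothetical helper that handles only the flat subset of YAML used in this card (scalar values and simple lists), and the `CARD` string is copied from the added side of the diff. A real check would more likely use a full YAML parser.

```python
# Hypothetical helper for illustration only: it understands just the flat
# "key: value" and "key:" + "- item" forms that appear in this model card.

CARD = """---
base_model:
- Qwen/Qwen3-VL-4B-Instruct
license: apache-2.0
pipeline_tag: depth-estimation
library_name: transformers
tags:
- vision-language-model
- depth-estimation
- 3d-vision
- multimodal
- qwen3-vl
---

Update 2026-05-18 (v1.0): Initial release
"""


def parse_front_matter(text):
    """Return the metadata dict from a '---' delimited front matter block."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # no front matter at the top of the file

    meta = {}
    list_key = None  # key currently collecting "- item" entries, if any
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":
            break  # closing delimiter ends the metadata block
        if not stripped:
            continue  # blank lines carry no metadata
        if stripped.startswith("- "):
            if list_key is None:
                raise ValueError("list item without a preceding 'key:' line")
            meta[list_key].append(stripped[2:].strip())
            continue
        key, _, value = stripped.partition(":")
        key, value = key.strip(), value.strip()
        if value:
            meta[key] = value  # scalar entry such as "license: apache-2.0"
            list_key = None
        else:
            meta[key] = []  # "key:" with no value starts a list
            list_key = key
    return meta


meta = parse_front_matter(CARD)
for key, value in meta.items():
    print(f"{key}: {value}")
```

With the card text above, the parsed result contains `library_name` and no `paper` key, which matches the intent of the diff: the paper link now lives in the Resources section of the card body rather than in the metadata.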