JonnyYu828/DepthVLM-4B

@@ -1,20 +1,15 @@
 ---
-license: apache-2.0
 base_model:
-  - Qwen/Qwen3-VL-4B-Instruct
 pipeline_tag: depth-estimation
 tags:
-  - vision-language-model
-  - depth-estimation
-  - 3d-vision
-  - multimodal
-  - qwen3-vl
-paper:
-  - arxiv: 2605.15876
 ---
 Update 2026-05-18 (v1.0): Initial release
@@ -23,17 +18,21 @@ Update 2026-05-18 (v1.0): Initial release
 DepthVLM serves as a unified foundation model for both low-level dense geometry prediction and high-level multimodal understanding, while achieving substantially faster inference compared with existing VLM-based approaches such as DepthLM and Youtu-VL.
 ## Highlights
-- Native dense metric depth estimation in VLMs
-- Unified multimodal understanding and geometry prediction
-- Full-resolution depth prediction with efficient inference
-- Supports both indoor and outdoor metric depth estimation
-- Improved 3D spatial reasoning capability
-## Paper
-[Unlocking Dense Metric Depth Estimation in VLMs](https://arxiv.org/abs/2605.15876)
 ## Usage
@@ -44,16 +43,15 @@ Please refer to the official repository for detailed instructions on:
 - Evaluation
 - Inference and visualization
-Repository: https://github.com/hanxunyu/DepthVLM
 ## Citation
 If you find this work useful, please cite:
-```bibtex id="k2m9wq"
 @article{yu2026unlocking,
   title={Unlocking Dense Metric Depth Estimation in VLMs},
   author={Hanxun Yu and Xuan Qu and Yuxin Wang and Jianke Zhu and Lei Ke},
   journal={arXiv preprint arXiv:2605.15876},
   year={2026}
-}

 ---
 base_model:
+- Qwen/Qwen3-VL-4B-Instruct
+license: apache-2.0
 pipeline_tag: depth-estimation
+library_name: transformers
 tags:
+- vision-language-model
+- depth-estimation
+- 3d-vision
+- multimodal
+- qwen3-vl
 ---
 Update 2026-05-18 (v1.0): Initial release
 DepthVLM serves as a unified foundation model for both low-level dense geometry prediction and high-level multimodal understanding, while achieving substantially faster inference compared with existing VLM-based approaches such as DepthLM and Youtu-VL.
+By attaching a lightweight depth head to the LLM backbone and training under a unified vision-text supervision paradigm, DepthVLM transforms a single VLM into a native dense geometry predictor while preserving its multimodal capability.
 ## Highlights
+- **Native dense metric depth estimation in VLMs**: Directly predicts geometry within the VLM framework.
+- **Unified multimodal understanding and geometry prediction**: Generates full-resolution depth maps alongside language outputs in a single forward pass.
+- **Efficient inference**: More efficient than approaches that rely on per-pixel queries or coarse token-level outputs.
+- **Versatile Application**: Supports both indoor and outdoor metric depth estimation.
+- **Improved 3D spatial reasoning**: A step toward a truly unified foundation model.
+## Resources
+- **Paper:** [Unlocking Dense Metric Depth Estimation in VLMs](https://arxiv.org/abs/2605.15876)
+- **Project Page:** [https://depthvlm.github.io/](https://depthvlm.github.io/)
+- **Repository:** [https://github.com/hanxunyu/DepthVLM](https://github.com/hanxunyu/DepthVLM)
 ## Usage
 - Evaluation
 - Inference and visualization
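
The official repository documents the actual inference and visualization scripts. As a generic illustration only, and not DepthVLM's own API, the minimal sketch below assumes a predicted metric depth map is already available as a 2D list of depths in meters and converts it to 8-bit gray levels for viewing. The function names `depth_to_gray` and `write_pgm` are hypothetical and do not come from the repository.

```python
import math


def depth_to_gray(depth, d_min=None, d_max=None):
    """Map a 2D metric depth map (meters) to 8-bit gray levels.

    Assumption: invalid pixels are encoded as non-finite or non-positive
    values; they are rendered as 0. Valid depths are scaled linearly so
    that d_min maps to 0 and d_max maps to 255.
    """
    valid = [d for row in depth for d in row if math.isfinite(d) and d > 0]
    if not valid:
        return [[0 for _ in row] for row in depth]

    lo = min(valid) if d_min is None else d_min
    hi = max(valid) if d_max is None else d_max
    span = hi - lo

    out = []
    for row in depth:
        out_row = []
        for d in row:
            if span <= 0 or not (math.isfinite(d) and d > 0):
                out_row.append(0)
            else:
                # Clamp to [0, 1] in case d_min/d_max were given explicitly.
                t = min(max((d - lo) / span, 0.0), 1.0)
                out_row.append(round(255 * t))
        out.append(out_row)
    return out


def write_pgm(path, gray):
    """Write gray levels as a plain-text PGM (P2) image."""
    height, width = len(gray), len(gray[0])
    with open(path, "w") as f:
        f.write(f"P2\n{width} {height}\n255\n")
        for row in gray:
            f.write(" ".join(str(v) for v in row) + "\n")
```

For example, `write_pgm("depth.pgm", depth_to_gray(depth))` produces a file that most image viewers can open. Passing fixed `d_min` and `d_max` values keeps the gray scale comparable across images, which matters for metric depth where absolute distances are meaningful.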
 ## Citation
 If you find this work useful, please cite:
+```bibtex
 @article{yu2026unlocking,
   title={Unlocking Dense Metric Depth Estimation in VLMs},
   author={Hanxun Yu and Xuan Qu and Yuxin Wang and Jianke Zhu and Lei Ke},
   journal={arXiv preprint arXiv:2605.15876},
   year={2026}
+}
+```