Improve model card metadata and discoverability
#1
by nielsr (HF Staff), opened
README.md
CHANGED
@@ -1,24 +1,36 @@
 ---
 license: bsd-3-clause
 pipeline_tag: video-text-to-text
+base_model: Qwen/Qwen2-VL-2B-Instruct
+datasets:
+- yeliudev/VideoMind-Dataset
+tags:
+- video-reasoning
+- temporal-grounding
+- chain-of-lora
+- multimodal
+- agent
 ---
 
 # VideoMind-2B
 
 <div style="display: flex; gap: 5px;">
+  <a href="https://huggingface.co/papers/2503.13444" target="_blank"><img src="https://img.shields.io/badge/Paper-huggingface-red"></a>
   <a href="https://arxiv.org/abs/2503.13444" target="_blank"><img src="https://img.shields.io/badge/arXiv-2503.13444-red"></a>
   <a href="https://videomind.github.io/" target="_blank"><img src="https://img.shields.io/badge/Project-Page-brightgreen"></a>
   <a href="https://github.com/yeliudev/VideoMind/blob/main/LICENSE" target="_blank"><img src="https://img.shields.io/badge/License-BSD--3--Clause-purple"></a>
   <a href="https://github.com/yeliudev/VideoMind" target="_blank"><img src="https://img.shields.io/github/stars/yeliudev/VideoMind"></a>
 </div>
 
-VideoMind is a multi-modal agent framework that enhances video reasoning by emulating *human-like* processes, such as *breaking down tasks*, *localizing and verifying moments*, and *synthesizing answers*.
+VideoMind is a multi-modal agent framework that enhances video reasoning by emulating *human-like* processes, such as *breaking down tasks*, *localizing and verifying moments*, and *synthesizing answers*. It was introduced in the paper [VideoMind: A Chain-of-LoRA Agent for Temporal-Grounded Video Reasoning](https://huggingface.co/papers/2503.13444).
 
 ## Model Details
 
 - **Model type:** Multi-modal Large Language Model
+- **Base model:** [Qwen2-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct)
 - **Language(s):** English
 - **License:** BSD-3-Clause
+- **Authors:** [Ye Liu](https://huggingface.co/yeliudev), [Kevin Qinghong Lin](https://huggingface.co/KevinQHLin), Chang Wen Chen, and [Mike Zheng Shou](https://huggingface.co/AnalMom).
 
 ## Quick Start
 
@@ -286,11 +298,11 @@ print(f'Answerer Response: {response}')
 
 Please kindly cite our paper if you find this project helpful.
 
-```
+```bibtex
 @inproceedings{liu2026videomind,
   title={VideoMind: A Chain-of-LoRA Agent for Temporal-Grounded Video Reasoning},
   author={Liu, Ye and Lin, Kevin Qinghong and Chen, Chang Wen and Shou, Mike Zheng},
   booktitle={International Conference on Learning Representations (ICLR)},
   year={2026}
 }
-```
+```