yeliudev/VideoMind-7B

@@ -1,11 +1,20 @@
 ---
 license: bsd-3-clause
 pipeline_tag: video-text-to-text
 ---
 # VideoMind-7B
 <div style="display: flex; gap: 5px;">
   <a href="https://arxiv.org/abs/2503.13444" target="_blank"><img src="https://img.shields.io/badge/arXiv-2503.13444-red"></a>
   <a href="https://videomind.github.io/" target="_blank"><img src="https://img.shields.io/badge/Project-Page-brightgreen"></a>
   <a href="https://github.com/yeliudev/VideoMind/blob/main/LICENSE" target="_blank"><img src="https://img.shields.io/badge/License-BSD--3--Clause-purple"></a>
@@ -14,11 +23,15 @@ pipeline_tag: video-text-to-text
 VideoMind is a multi-modal agent framework that enhances video reasoning by emulating *human-like* processes, such as *breaking down tasks*, *localizing and verifying moments*, and *synthesizing answers*.
 ## 🔖 Model Details
 - **Model type:** Multi-modal Large Language Model
 - **Language(s):** English
 - **License:** BSD-3-Clause
 ## 🚀 Quick Start
@@ -289,8 +302,8 @@ Please kindly cite our paper if you find this project helpful.
 ```
 @inproceedings{liu2026videomind,
   title={VideoMind: A Chain-of-LoRA Agent for Temporal-Grounded Video Reasoning},
-  author={Liu, Ye and Lin, Kevin Qinghong and Chen, Chang Wen and Shou, Mike Zheng},
   booktitle={International Conference on Learning Representations (ICLR)},
   year={2026}
 }
-```

 ---
 license: bsd-3-clause
 pipeline_tag: video-text-to-text
+base_model: Qwen/Qwen2-VL-7B-Instruct
+datasets:
+- yeliudev/VideoMind-Dataset
+tags:
+- video-grounding
+- video-qa
+- agents
+- chain-of-lora
 ---
 # VideoMind-7B
 <div style="display: flex; gap: 5px;">
+  <a href="https://huggingface.co/papers/2503.13444" target="_blank"><img src="https://img.shields.io/badge/%F0%9F%A4%97-Paper-blue"></a>
   <a href="https://arxiv.org/abs/2503.13444" target="_blank"><img src="https://img.shields.io/badge/arXiv-2503.13444-red"></a>
   <a href="https://videomind.github.io/" target="_blank"><img src="https://img.shields.io/badge/Project-Page-brightgreen"></a>
   <a href="https://github.com/yeliudev/VideoMind/blob/main/LICENSE" target="_blank"><img src="https://img.shields.io/badge/License-BSD--3--Clause-purple"></a>
 VideoMind is a multi-modal agent framework that enhances video reasoning by emulating *human-like* processes, such as *breaking down tasks*, *localizing and verifying moments*, and *synthesizing answers*.
+The model is presented in the paper [VideoMind: A Chain-of-LoRA Agent for Temporal-Grounded Video Reasoning](https://huggingface.co/papers/2503.13444).
 ## 🔖 Model Details
 - **Model type:** Multi-modal Large Language Model
+- **Base Model:** [Qwen/Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct)
 - **Language(s):** English
 - **License:** BSD-3-Clause
+- **Architecture:** Chain-of-LoRA mechanism that uses multiple role-specific LoRA adapters (Planner, Grounder, Verifier) on top of the base model.
 ## 🚀 Quick Start
 ```
 @inproceedings{liu2026videomind,
   title={VideoMind: A Chain-of-LoRA Agent for Temporal-Grounded Video Reasoning},
+  author={Liu, Ye and Lin, Kevin Qinghong and Chen, Chang Wen and Shou, Mike Zheng},
   booktitle={International Conference on Learning Representations (ICLR)},
   year={2026}
 }
+```