---
license: bsd-3-clause
pipeline_tag: video-text-to-text
---

# VideoMind-2B-FT-QVHighlights

<div style="display: flex; gap: 5px;">
  <a href="https://arxiv.org/abs/2503.13444" target="_blank"><img src="https://img.shields.io/badge/arXiv-2503.13444-red"></a>
  <a href="https://videomind.github.io/" target="_blank"><img src="https://img.shields.io/badge/Project-Page-brightgreen"></a>
  <a href="https://github.com/yeliudev/VideoMind/blob/main/README.md" target="_blank"><img src="https://img.shields.io/badge/License-BSD--3--Clause-purple"></a>
  <a href="https://github.com/yeliudev/VideoMind" target="_blank"><img src="https://img.shields.io/github/stars/yeliudev/VideoMind"></a>
</div>

VideoMind is a multi-modal agent framework that enhances video reasoning by emulating *human-like* processes, such as *breaking down tasks*, *localizing and verifying moments*, and *synthesizing answers*.
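
The stages named above can be pictured as a simple control flow. The following is only an illustrative sketch, not VideoMind's actual API: the `Roles` container, the `run_pipeline` function, the stage names, and the toy callables are all hypothetical placeholders, and real stages would be backed by the model rather than lambdas.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Moment = Tuple[float, float]  # (start_sec, end_sec), an assumed representation


@dataclass
class Roles:
    """Hypothetical callables standing in for each stage (not VideoMind's API)."""
    plan: Callable[[str], List[str]]                 # break the task into stages
    ground: Callable[[str], List[Moment]]            # propose candidate moments
    verify: Callable[[str, Moment], float]           # score one candidate moment
    answer: Callable[[str, Optional[Moment]], str]   # synthesize the final answer


def run_pipeline(question: str, roles: Roles) -> Tuple[Optional[Moment], str]:
    """Run plan -> ground -> verify -> answer, skipping stages the plan omits."""
    stages = roles.plan(question)
    moment: Optional[Moment] = None
    if "ground" in stages:
        candidates = roles.ground(question)
        if candidates:
            if "verify" in stages:
                # Keep the candidate with the highest verification score.
                moment = max(candidates, key=lambda m: roles.verify(question, m))
            else:
                moment = candidates[0]
    return moment, roles.answer(question, moment)


# Toy stand-ins so the sketch runs without any model or video.
toy = Roles(
    plan=lambda q: ["ground", "verify", "answer"],
    ground=lambda q: [(0.0, 5.0), (12.0, 18.0)],
    verify=lambda q, m: 0.9 if m == (12.0, 18.0) else 0.2,
    answer=lambda q, m: f"answer based on {m}",
)
moment, text = run_pipeline("When does the dog jump?", toy)
print(moment, text)
```

Passing the stages in as callables keeps the control flow separate from whatever model implements each role, which is the only point this sketch is meant to convey.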

## 🔖 Model Details

### Model Description

- **Model type:** Multi-modal Large Language Model
- **Language(s):** English
- **License:** BSD-3-Clause

### More Details

Please refer to our [GitHub Repository](https://github.com/yeliudev/VideoMind) for more details about this model.

## 📖 Citation

Please cite our paper if you find this project helpful.

```bibtex
@inproceedings{liu2026videomind,
  title={VideoMind: A Chain-of-LoRA Agent for Temporal-Grounded Video Reasoning},
  author={Liu, Ye and Lin, Kevin Qinghong and Chen, Chang Wen and Shou, Mike Zheng},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}
```