---
license: bsd-3-clause
pipeline_tag: video-text-to-text
---
# VideoMind-2B-FT-QVHighlights
<div style="display: flex; gap: 5px;">
<a href="https://arxiv.org/abs/2503.13444" target="_blank"><img src="https://img.shields.io/badge/arXiv-2503.13444-red"></a>
<a href="https://videomind.github.io/" target="_blank"><img src="https://img.shields.io/badge/Project-Page-brightgreen"></a>
<a href="https://github.com/yeliudev/VideoMind/blob/main/README.md" target="_blank"><img src="https://img.shields.io/badge/License-BSD--3--Clause-purple"></a>
<a href="https://github.com/yeliudev/VideoMind" target="_blank"><img src="https://img.shields.io/github/stars/yeliudev/VideoMind"></a>
</div>
VideoMind is a multi-modal agent framework that enhances video reasoning by emulating *human-like* processes, such as *breaking down tasks*, *localizing and verifying moments*, and *synthesizing answers*.
## 🔖 Model Details
### Model Description
- **Model type:** Multi-modal Large Language Model
- **Language(s):** English
- **License:** BSD-3-Clause
### More Details
Please refer to our [GitHub Repository](https://github.com/yeliudev/VideoMind) for more details about this model.
## 📖 Citation
Please cite our paper if you find this project helpful.
```
@inproceedings{liu2026videomind,
title={VideoMind: A Chain-of-LoRA Agent for Temporal-Grounded Video Reasoning},
author={Liu, Ye and Lin, Kevin Qinghong and Chen, Chang Wen and Shou, Mike Zheng},
booktitle={International Conference on Learning Representations (ICLR)},
year={2026}
}
```