---
license: bsd-3-clause
pipeline_tag: video-text-to-text
---
# VideoMind-2B-FT-QVHighlights

VideoMind is a multi-modal agent framework that enhances video reasoning by emulating human-like processes, such as breaking down tasks, localizing and verifying moments, and synthesizing answers.
## Model Details

### Model Description
- Model type: Multi-modal Large Language Model
- Language(s): English
- License: BSD-3-Clause
### More Details

Please refer to our GitHub Repository for more details about this model.
## Citation

Please cite our paper if you find this project helpful.
```bibtex
@article{liu2025videomind,
  title={VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning},
  author={Liu, Ye and Lin, Kevin Qinghong and Chen, Chang Wen and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2503.13444},
  year={2025}
}
```