---
license: bsd-3-clause
pipeline_tag: video-text-to-text
---
|
|
|
|
|
# VideoMind-2B-FT-QVHighlights
|
|
|
|
|
<div style="display: flex; gap: 5px;">
  <a href="https://arxiv.org/abs/2503.13444" target="_blank"><img src="https://img.shields.io/badge/arXiv-2503.13444-red"></a>
  <a href="https://videomind.github.io/" target="_blank"><img src="https://img.shields.io/badge/Project-Page-brightgreen"></a>
  <a href="https://github.com/yeliudev/VideoMind/blob/main/README.md" target="_blank"><img src="https://img.shields.io/badge/License-BSD--3--Clause-purple"></a>
  <a href="https://github.com/yeliudev/VideoMind" target="_blank"><img src="https://img.shields.io/github/stars/yeliudev/VideoMind"></a>
</div>
|
|
|
|
|
VideoMind is a multi-modal agent framework that enhances video reasoning by emulating *human-like* processes, such as *breaking down tasks*, *localizing and verifying moments*, and *synthesizing answers*.
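To make that decomposition concrete, here is a minimal, purely illustrative sketch of the workflow in Python. Every function and value below is a hypothetical stand-in for the corresponding role (planning, grounding, verifying, answering), not the released API; see the GitHub repository for the actual implementation.

```python
# Illustrative sketch of the role-based workflow described above.
# Every function below is a hypothetical stand-in, NOT the released API.
from dataclasses import dataclass

@dataclass
class Moment:
    start: float   # segment start (seconds)
    end: float     # segment end (seconds)
    score: float   # verification confidence

def plan(question: str) -> list[str]:
    """Break the task down into sub-steps."""
    return ["ground", "verify", "answer"]

def ground(video: str, question: str) -> list[Moment]:
    """Localize candidate moments relevant to the question."""
    return [Moment(12.0, 18.5, 0.0), Moment(41.0, 47.5, 0.0)]

def verify(video: str, question: str, m: Moment) -> Moment:
    """Score how well a candidate moment supports the question."""
    return Moment(m.start, m.end, score=0.9 if m.start < 20.0 else 0.4)

def answer(video: str, question: str, m: Moment) -> str:
    """Synthesize the final answer from the verified moment."""
    return f"Answer grounded in segment [{m.start:.1f}s, {m.end:.1f}s]."

def run(video: str, question: str) -> str:
    steps = plan(question)
    candidates = ground(video, question) if "ground" in steps else []
    best = max((verify(video, question, m) for m in candidates), key=lambda m: m.score)
    return answer(video, question, best)

print(run("demo.mp4", "When does the highlight occur?"))
```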
|
|
|
|
|
## Model Details
|
|
|
|
|
### Model Description
|
|
|
|
|
- **Model type:** Multi-modal Large Language Model
- **Language(s):** English
- **License:** BSD-3-Clause
|
|
|
|
|
### More Details
|
|
|
|
|
Please refer to our [GitHub Repository](https://github.com/yeliudev/VideoMind) for more details about this model.
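For example, the checkpoint can be fetched locally with `huggingface_hub` before following the repository's setup instructions (a minimal sketch; the repo id `yeliudev/VideoMind-2B-FT-QVHighlights` is assumed from this card's title and should be verified on the Hub):

```python
# Minimal sketch: download this checkpoint with huggingface_hub.
# The repo id below is assumed from this card's title; verify it on the Hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="yeliudev/VideoMind-2B-FT-QVHighlights")
print(f"Checkpoint downloaded to: {local_dir}")
```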
|
|
|
|
|
## Citation
|
|
|
|
|
Please cite our paper if you find this project helpful.
|
|
|
|
|
```bibtex
@article{liu2025videomind,
  title={VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning},
  author={Liu, Ye and Lin, Kevin Qinghong and Chen, Chang Wen and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2503.13444},
  year={2025}
}
```
|
|