---
license: bsd-3-clause
pipeline_tag: video-text-to-text
---
|
|
|
|
|
# VideoMind-2B-FT-QVHighlights
|
|
|
|
|
<div style="display: flex; gap: 5px;">
  <a href="https://arxiv.org/abs/2503.13444" target="_blank"><img src="https://img.shields.io/badge/arXiv-2503.13444-red"></a>
  <a href="https://videomind.github.io/" target="_blank"><img src="https://img.shields.io/badge/Project-Page-brightgreen"></a>
  <a href="https://github.com/yeliudev/VideoMind/blob/main/README.md" target="_blank"><img src="https://img.shields.io/badge/License-BSD--3--Clause-purple"></a>
  <a href="https://github.com/yeliudev/VideoMind" target="_blank"><img src="https://img.shields.io/github/stars/yeliudev/VideoMind"></a>
</div>
|
|
|
|
|
VideoMind is a multi-modal agent framework that enhances video reasoning by emulating *human-like* processes, such as *breaking down tasks*, *localizing and verifying moments*, and *synthesizing answers*.
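To make that decomposition concrete, here is a minimal, purely illustrative sketch of the workflow in Python. Every function and value below is a hypothetical stand-in for the corresponding role (planning, grounding, verifying, answering), not the released API; see the GitHub repository for the actual implementation.

```python
# Illustrative sketch of the role-based workflow described above.
# Every function below is a hypothetical stand-in, NOT the released API.
from dataclasses import dataclass

@dataclass
class Moment:
    start: float   # segment start (seconds)
    end: float     # segment end (seconds)
    score: float   # verification confidence

def plan(question: str) -> list[str]:
    """Break the task down into sub-steps."""
    return ["ground", "verify", "answer"]

def ground(video: str, question: str) -> list[Moment]:
    """Localize candidate moments relevant to the question."""
    return [Moment(12.0, 18.5, 0.0), Moment(41.0, 47.5, 0.0)]

def verify(video: str, question: str, m: Moment) -> Moment:
    """Score how well a candidate moment supports the question."""
    return Moment(m.start, m.end, score=0.9 if m.start < 20.0 else 0.4)

def answer(video: str, question: str, m: Moment) -> str:
    """Synthesize the final answer from the verified moment."""
    return f"Answer grounded in segment [{m.start:.1f}s, {m.end:.1f}s]."

def run(video: str, question: str) -> str:
    steps = plan(question)
    candidates = ground(video, question) if "ground" in steps else []
    best = max((verify(video, question, m) for m in candidates), key=lambda m: m.score)
    return answer(video, question, best)

print(run("demo.mp4", "When does the highlight occur?"))
```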
|
|
|
|
|
## Model Details
|
|
|
|
|
### Model Description
|
|
|
|
|
- **Model type:** Multi-modal Large Language Model
- **Language(s):** English
- **License:** BSD-3-Clause
|
|
|
|
|
### More Details
|
|
|
|
|
Please refer to our [GitHub Repository](https://github.com/yeliudev/VideoMind) for more details about this model.
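For example, the checkpoint can be fetched locally with `huggingface_hub` before following the repository's setup instructions (a minimal sketch; the repo id `yeliudev/VideoMind-2B-FT-QVHighlights` is assumed from this card's title and should be verified on the Hub):

```python
# Minimal sketch: download this checkpoint with huggingface_hub.
# The repo id below is assumed from this card's title; verify it on the Hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="yeliudev/VideoMind-2B-FT-QVHighlights")
print(f"Checkpoint downloaded to: {local_dir}")
```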
|
|
|
|
|
## Citation
|
|
|
|
|
Please cite our paper if you find this project helpful.
|
|
|
|
|
```bibtex
@article{liu2025videomind,
  title={VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning},
  author={Liu, Ye and Lin, Kevin Qinghong and Chen, Chang Wen and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2503.13444},
  year={2025}
}
```
|
|