---
license: bsd-3-clause
pipeline_tag: video-text-to-text
---

# VideoMind-7B

<div style="display: flex; gap: 5px;">
<a href="https://arxiv.org/abs/2503.13444" target="_blank"><img src="https://img.shields.io/badge/arXiv-2503.13444-red"></a>
<a href="https://videomind.github.io/" target="_blank"><img src="https://img.shields.io/badge/Project-Page-brightgreen"></a>
<a href="https://github.com/yeliudev/VideoMind/blob/main/README.md" target="_blank"><img src="https://img.shields.io/badge/License-BSD--3--Clause-purple"></a>
<a href="https://huggingface.co/spaces/yeliudev/VideoMind-2B" target="_blank" style="margin: 0;"><img src="https://huggingface.co/datasets/huggingface/badges/resolve/main/open-in-hf-spaces-sm-dark.svg"></a>
<a href="https://github.com/yeliudev/VideoMind" target="_blank"><img src="https://img.shields.io/github/stars/yeliudev/VideoMind"></a>
</div>

VideoMind is a multi-modal agent framework that enhances video reasoning by emulating *human-like* processes, such as *breaking down tasks*, *localizing and verifying moments*, and *synthesizing answers*.
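
## 🚀 Quick Start

The official inference pipeline, including the Chain-of-LoRA role switching behind the process described above, is provided in the [GitHub Repository](https://github.com/yeliudev/VideoMind). As a quick sanity check only, the sketch below *assumes* the checkpoint can be loaded through the standard Qwen2-VL interface in Hugging Face Transformers (with the `qwen-vl-utils` helper); the repo id and video path are placeholders, and real use should follow the repository's instructions.

```python
# Minimal sketch, NOT the official pipeline: it assumes the checkpoint loads
# through the standard Qwen2-VL interface in Hugging Face Transformers.
# Dependencies: torch, transformers>=4.45, qwen-vl-utils.
import torch
from qwen_vl_utils import process_vision_info  # packs vision inputs for Qwen2-VL
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "yeliudev/VideoMind-7B"  # assumed repo id for this model card

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# A single-turn video QA prompt in the Qwen2-VL chat format.
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/video.mp4"},  # placeholder path
        {"type": "text", "text": "When does the person open the door?"},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
_, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:],  # strip the prompt tokens
    skip_special_tokens=True,
)[0]
print(answer)
```

Note that this bypasses the agentic steps above (task decomposition, moment localization and verification), so treat it as a smoke test and use the repository's scripts for faithful results.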

## 🔖 Model Details

### Model Description

- **Model type:** Multi-modal Large Language Model
- **Language(s):** English
- **License:** BSD-3-Clause

### More Details

Please refer to our [GitHub Repository](https://github.com/yeliudev/VideoMind) for more details about this model.

## 📖 Citation

Please cite our paper if you find this project helpful.

```
@article{liu2025videomind,
  title={VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning},
  author={Liu, Ye and Lin, Kevin Qinghong and Chen, Chang Wen and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2503.13444},
  year={2025}
}
```