Improve metadata and add base model information
Hi! I'm Niels from the Hugging Face community science team. I've noticed this repository has a very thorough README, but it's missing some structured metadata that could help with discoverability.
This PR adds:
- `base_model` reference to Qwen2-VL-7B-Instruct.
- `datasets` reference to the VideoMind-Dataset.
- Descriptive tags for video grounding, video QA, and the agentic workflow.
- A direct link to the Hugging Face paper page in the badges section.
These changes make it easier for users to find and understand the context of your model on the Hub. Great work on this framework!
README.md (CHANGED):

````diff
@@ -1,11 +1,20 @@
 ---
 license: bsd-3-clause
 pipeline_tag: video-text-to-text
+base_model: Qwen/Qwen2-VL-7B-Instruct
+datasets:
+- yeliudev/VideoMind-Dataset
+tags:
+- video-grounding
+- video-qa
+- agents
+- chain-of-lora
 ---
 
 # VideoMind-7B
 
 <div style="display: flex; gap: 5px;">
+<a href="https://huggingface.co/papers/2503.13444" target="_blank"><img src="https://img.shields.io/badge/%F0%9F%A4%97-Paper-blue"></a>
 <a href="https://arxiv.org/abs/2503.13444" target="_blank"><img src="https://img.shields.io/badge/arXiv-2503.13444-red"></a>
 <a href="https://videomind.github.io/" target="_blank"><img src="https://img.shields.io/badge/Project-Page-brightgreen"></a>
 <a href="https://github.com/yeliudev/VideoMind/blob/main/LICENSE" target="_blank"><img src="https://img.shields.io/badge/License-BSD--3--Clause-purple"></a>
@@ -14,11 +23,15 @@ pipeline_tag: video-text-to-text
 
 VideoMind is a multi-modal agent framework that enhances video reasoning by emulating *human-like* processes, such as *breaking down tasks*, *localizing and verifying moments*, and *synthesizing answers*.
 
+The model is presented in the paper [VideoMind: A Chain-of-LoRA Agent for Temporal-Grounded Video Reasoning](https://huggingface.co/papers/2503.13444).
+
 ## 🔖 Model Details
 
 - **Model type:** Multi-modal Large Language Model
+- **Base Model:** [Qwen/Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct)
 - **Language(s):** English
 - **License:** BSD-3-Clause
+- **Architecture:** Chain-of-LoRA mechanism using multiple specialized adapters (Planner, Grounder, Verifier) on top of a base model.
 
 ## 🚀 Quick Start
 
@@ -289,8 +302,8 @@ Please kindly cite our paper if you find this project helpful.
 ```
 @inproceedings{liu2026videomind,
 title={VideoMind: A Chain-of-LoRA Agent for Temporal-Grounded Video Reasoning},
-author={Liu, Ye and Lin, Kevin Qinghong and Chen, Chang Wen and Shou, Mike Zheng},
+author={Liu, Ye and Lin, Kevin Qinghong, and Chen, Chang Wen and Shou, Mike Zheng},
 booktitle={International Conference on Learning Representations (ICLR)},
 year={2026}
 }
-```
+```
````
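As a quick sanity check, the front matter this PR adds can be validated with a few lines of stdlib Python. This is a minimal sketch: `README_HEAD` reproduces just the metadata block from the diff, and `front_matter_keys` is a hypothetical helper (it only handles simple top-level `key:` lines), not part of any Hub tooling.

```python
# Metadata block as it appears after this PR (body truncated to the title).
README_HEAD = """---
license: bsd-3-clause
pipeline_tag: video-text-to-text
base_model: Qwen/Qwen2-VL-7B-Instruct
datasets:
- yeliudev/VideoMind-Dataset
tags:
- video-grounding
- video-qa
- agents
- chain-of-lora
---
# VideoMind-7B
"""

def front_matter_keys(readme):
    """Return the top-level keys of the YAML front matter.

    Only handles the simple case used here: the block is delimited by two
    '---' lines, and top-level keys are unindented 'key:' lines.
    """
    block = readme.split("---\n")[1]  # text between the two '---' fences
    return [
        line.split(":")[0]
        for line in block.splitlines()
        if ":" in line and not line.startswith(("-", " "))
    ]

keys = front_matter_keys(README_HEAD)
print(keys)  # ['license', 'pipeline_tag', 'base_model', 'datasets', 'tags']
```

Running this confirms the three new top-level fields (`base_model`, `datasets`, `tags`) sit alongside the existing `license` and `pipeline_tag` entries.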