nielsr (HF Staff) committed
Commit 1a658a0 · verified · 1 Parent(s): cb57340

Improve metadata and add base model information


Hi! I'm Niels from the Hugging Face community science team. I've noticed this repository has a very thorough README, but it's missing some structured metadata that could help with discoverability.

This PR adds:
- `base_model` reference to Qwen2-VL-7B-Instruct.
- `datasets` reference to the VideoMind-Dataset.
- Descriptive tags for video grounding, video QA, and the agentic workflow.
- A direct link to the Hugging Face paper page in the badges section.

These changes make it easier for users to find and understand the context of your model on the Hub. Great work on this framework!
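The metadata this PR adds lives in the YAML front matter between the first two `---` lines of `README.md`, which Hub tooling reads for discoverability. A minimal sketch of pulling that block out (the `front_matter` helper is hypothetical, not part of any Hub library):

```python
# Hypothetical helper: extract the YAML front-matter block that the Hub
# reads from a model card. The README text below mirrors this PR's change.
README = """\
---
license: bsd-3-clause
pipeline_tag: video-text-to-text
base_model: Qwen/Qwen2-VL-7B-Instruct
datasets:
- yeliudev/VideoMind-Dataset
tags:
- video-grounding
- video-qa
- agents
- chain-of-lora
---

# VideoMind-7B
"""

def front_matter(text: str) -> str:
    # The metadata block sits between the first two '---' delimiter lines.
    _, block, _ = text.split("---\n", 2)
    return block

print(front_matter(README))
```

Structured fields like `base_model` and `tags` are what let Hub search and the "Finetunes of" listings pick the model up, which is the point of this PR.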

Files changed (1): README.md (+15 −2)
README.md CHANGED
````diff
@@ -1,11 +1,20 @@
 ---
 license: bsd-3-clause
 pipeline_tag: video-text-to-text
+base_model: Qwen/Qwen2-VL-7B-Instruct
+datasets:
+- yeliudev/VideoMind-Dataset
+tags:
+- video-grounding
+- video-qa
+- agents
+- chain-of-lora
 ---
 
 # VideoMind-7B
 
 <div style="display: flex; gap: 5px;">
+<a href="https://huggingface.co/papers/2503.13444" target="_blank"><img src="https://img.shields.io/badge/%F0%9F%A4%97-Paper-blue"></a>
 <a href="https://arxiv.org/abs/2503.13444" target="_blank"><img src="https://img.shields.io/badge/arXiv-2503.13444-red"></a>
 <a href="https://videomind.github.io/" target="_blank"><img src="https://img.shields.io/badge/Project-Page-brightgreen"></a>
 <a href="https://github.com/yeliudev/VideoMind/blob/main/LICENSE" target="_blank"><img src="https://img.shields.io/badge/License-BSD--3--Clause-purple"></a>
@@ -14,11 +23,15 @@ pipeline_tag: video-text-to-text
 
 VideoMind is a multi-modal agent framework that enhances video reasoning by emulating *human-like* processes, such as *breaking down tasks*, *localizing and verifying moments*, and *synthesizing answers*.
 
+The model is presented in the paper [VideoMind: A Chain-of-LoRA Agent for Temporal-Grounded Video Reasoning](https://huggingface.co/papers/2503.13444).
+
 ## 🔖 Model Details
 
 - **Model type:** Multi-modal Large Language Model
+- **Base Model:** [Qwen/Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct)
 - **Language(s):** English
 - **License:** BSD-3-Clause
+- **Architecture:** Chain-of-LoRA mechanism using multiple specialized adapters (Planner, Grounder, Verifier) on top of a base model.
 
 ## 🚀 Quick Start
 
@@ -289,8 +302,8 @@ Please kindly cite our paper if you find this project helpful.
 ```
 @inproceedings{liu2026videomind,
 title={VideoMind: A Chain-of-LoRA Agent for Temporal-Grounded Video Reasoning},
-author={Liu, Ye and Lin, Kevin Qinghong and Chen, Chang Wen and Shou, Mike Zheng},
+author={Liu, Ye and Lin, Kevin Qinghong, and Chen, Chang Wen and Shou, Mike Zheng},
 booktitle={International Conference on Learning Representations (ICLR)},
 year={2026}
 }
-```
+```
````
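The new **Architecture** bullet mentions the Chain-of-LoRA mechanism: role-specific low-rank adapters swapped on one frozen base model instead of loading three full models. A toy illustration of that idea (made-up numbers and names, not VideoMind's actual weights or API):

```python
# Toy sketch of the Chain-of-LoRA idea: one shared frozen base weight,
# with a small low-rank delta (here rank 1: the outer product a * b^T)
# swapped in for each role in the chain. All values are illustrative.
base = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for a frozen base weight matrix

adapters = {  # role-specific rank-1 updates, stored as vector pairs (a, b)
    "planner":  ([0.1, 0.0], [1.0, 1.0]),
    "grounder": ([0.0, 0.2], [1.0, 0.0]),
    "verifier": ([0.05, 0.05], [0.0, 1.0]),
}

def effective_weight(role):
    # W_eff = W_base + a b^T (the LoRA update; scaling factor omitted)
    a, b = adapters[role]
    return [[base[i][j] + a[i] * b[j] for j in range(2)] for i in range(2)]

# The "chain": each reasoning step activates a different adapter on the
# same base, so only the tiny (a, b) pairs differ between roles.
for role in ("planner", "grounder", "verifier"):
    print(role, effective_weight(role))
```

The practical payoff, as the bullet suggests, is that switching roles only swaps lightweight adapter weights rather than reloading a 7B-parameter model.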