Improve model card for Video-Thinker-7B: Add pipeline tag, library, and correct license
This PR enhances the model card for the **Video-Thinker-7B** model by:
- **Updating the license** to `mit`, aligning with the explicit statement in the GitHub repository.
- **Adding `pipeline_tag: video-text-to-text`** to accurately reflect the model's functionality (video input to text output for reasoning) and improve discoverability on the Hugging Face Hub.
- **Adding `library_name: transformers`**, as evidenced by the `config.json` and `tokenizer_config.json` files, which specify `transformers_version` and `Qwen2_5_VLProcessor`, enabling the automated "how to use" widget on the model page.
- **Adding a direct link to the Hugging Face paper page**, complementing the existing arXiv link and providing more context for users.
- **Enriching the model card content** by incorporating key sections from the project's GitHub README, including the overview, performance details, framework description, installation instructions, data preparation, training, evaluation, acknowledgement, citation, license, and contact information. All image paths have been updated to absolute GitHub URLs.
- **Removing irrelevant "File information"** and the `TODO` section for a cleaner presentation.
This update provides a more complete, accurate, and user-friendly resource for the community.
---
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
language:
- en
license: mit
pipeline_tag: video-text-to-text
library_name: transformers
---

# Video-Thinker-7B

- **Repository:** https://github.com/shijian2001/Video-Thinker
- **Paper:** [Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning](https://www.arxiv.org/abs/2510.23473)
- **Hugging Face Paper:** [Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning](https://huggingface.co/papers/2510.23473)
<h1 align="center"> <img src="https://github.com/shijian2001/Video-Thinker/blob/main/figures/video-thinker-logo.jpg" width="270" style="vertical-align:middle;"/><br>Sparking "Thinking with Videos" via Reinforcement Learning</h1>

<div align="center">

[arXiv](https://www.arxiv.org/abs/2510.23473)
[Hugging Face Paper](https://huggingface.co/papers/2510.23473)
[Model](https://huggingface.co/ShijianW01/Video-Thinker-7B)
[MIT License](https://opensource.org/licenses/MIT)
[Python 3.9+](https://www.python.org/downloads/release/python-390/)

</div>

<h5 align="center"> If you like our project, please give us a star on GitHub for the latest updates.</h5>

<div align="center">
<img src="https://readme-typing-svg.herokuapp.com?font=Orbitron&size=20&duration=3000&pause=1000&color=005DE3&center=true&vCenter=true&width=800&lines=Welcome+to+Video-Thinker;Sparking+%22Thinking+with+Videos%22+via+Reinforcement+Learning;Powered+by+SEU+x+Monash+x+Xiaohongshu+Inc." alt="Typing Animation" />
</div>

## Latest News

- **[October 30, 2025]**: Our paper is now available on **[arXiv](https://www.arxiv.org/abs/2510.23473)** and the **[HF Paper page](https://huggingface.co/papers/2510.23473)**.
- **[October 28, 2025]**: Our codebase and model are released. You can now use Video-Thinker-7B via the **[Hugging Face model](https://huggingface.co/ShijianW01/Video-Thinker-7B)**.
## Overview

**Video-Thinker** is an end-to-end video reasoning framework that empowers MLLMs to autonomously leverage their intrinsic "grounding" and "captioning" capabilities during inference. This paradigm extends "Thinking with Images" to video understanding, enabling dynamic temporal navigation and visual cue extraction without relying on external tools or pre-designed prompts.

To spark this capability, we construct **Video-Thinker-10K**, a curated dataset with structured reasoning traces synthesized through **hindsight-curation reasoning**, ensuring that temporal localizations and visual descriptions genuinely contribute to correct answers.

Furthermore, we propose a **two-stage training strategy** combining SFT for **format learning** and GRPO with a **pure outcome reward** for reinforcement learning, enabling Video-Thinker to achieve state-of-the-art performance on challenging video reasoning benchmarks with remarkable data efficiency.

<div align="center">
<img src="https://github.com/shijian2001/Video-Thinker/blob/main/figures/banner.jpg" width="90%" />
</div>

### Overall Performance

<div align="center">
<img src="https://github.com/shijian2001/Video-Thinker/blob/main/figures/results_fig.jpg" width="80%" />
</div>

**Video-Thinker-7B** achieves **state-of-the-art performance** among 7B-sized MLLMs across multiple challenging video reasoning benchmarks, on both **in-domain** and **out-of-domain** tasks:

- **Out-of-Domain Benchmarks**:
  - **Video-Holmes**: **43.22%** (+4.68% over the best baseline)
  - **CG-Bench-Reasoning**: **33.25%** (+3.81% over the best baseline)
  - **VRBench**: **80.69%** (+11.44% over the best baseline)

- **In-Domain Benchmarks**:
  - **ActivityNet**: **78.72%** | **STAR**: **70.66%** | **ScaleLong**: **49.53%**
  - **YouCook2**: **73.66%** | **LVBench**: **37.04%**

Our approach enables MLLMs to **"Think with Videos"** by autonomously leveraging intrinsic **grounding** and **captioning** capabilities, achieving superior reasoning performance with only **10K training samples**.

<div align="center">
<img src="https://github.com/shijian2001/Video-Thinker/blob/main/figures/results_table.jpg" width="100%" />
</div>

### The Video-Thinker Framework

#### Data Synthesis Pipeline

<div align="center">
<img src="https://github.com/shijian2001/Video-Thinker/blob/main/figures/data_pipe.jpg" width="100%" />
</div>

We construct **Video-Thinker-10K** through a systematic pipeline that transforms diverse video data into structured reasoning samples:

- **Data Sources**: We curate from six datasets spanning multiple domains:
  - **Caption-labeled** (ActivityNet, TutorialVQA, YouCook2): rich temporal annotations, but no complex reasoning questions
  - **QA-labeled** (STAR, ScaleLong, LVBench): challenging QA pairs, but no granular visual descriptions

- **Complementary Generation**:
  - For caption-labeled data, we generate complex multi-segment reasoning questions
  - For QA-labeled data, we generate answer-conditioned visual descriptions for key segments

- **Hindsight-Curation Reasoning**: A quality-assurance step validates generated `<time>` and `<caption>` contents by testing whether they enable models to derive correct answers, with up to 3 regeneration attempts to ensure high-quality supervision.

<div align="center">
<img src="https://github.com/shijian2001/Video-Thinker/blob/main/figures/data_stat.jpg" width="100%" />
</div>

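In code, hindsight curation amounts to a validate-or-regenerate loop. The sketch below uses hypothetical `generate_trace` and `answers_correctly` callables as stand-ins for the synthesis model and the answer check; the actual pipeline lives in the GitHub repository:

```python
# Hindsight-curation sketch: keep a synthesized <time>/<caption> trace only
# if it actually lets a model reach the correct answer, regenerating up to
# 3 times. `generate_trace` and `answers_correctly` are hypothetical stand-ins.

MAX_ATTEMPTS = 3

def curate_sample(question, answer, generate_trace, answers_correctly):
    """Return a validated reasoning trace, or None if every attempt fails."""
    for _ in range(MAX_ATTEMPTS):
        # Synthesize <time>/<caption> reasoning content for this question.
        trace = generate_trace(question)
        # Hindsight check: does this trace genuinely support the correct answer?
        if answers_correctly(question, trace, answer):
            return trace
    # All attempts failed: discard this sample.
    return None
```

Samples whose traces never pass the check are dropped, so only supervision that demonstrably leads to the right answer enters Video-Thinker-10K.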
#### Training Strategy of Video-Thinker

We adopt a **two-stage training approach** to progressively build video reasoning capabilities:

**Stage 1: SFT for Format-Following**
- Initializes the model to generate structured reasoning traces with `<time>`, `<caption>`, and `<think>` tags
- Provides an essential cold start by teaching the specialized reasoning format
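A trace in this format can be inspected with a few lines of Python. This is an illustrative sketch of extracting the `<time>`/`<caption>`/`<think>` tags; the exact serialization in the released data may differ:

```python
import re

# Tags used in Video-Thinker-style reasoning traces (per the description above).
TAGS = ("time", "caption", "think")

def parse_trace(trace: str) -> dict:
    """Return {tag: [contents, ...]} for each structured reasoning tag."""
    return {
        tag: re.findall(rf"<{tag}>(.*?)</{tag}>", trace, flags=re.DOTALL)
        for tag in TAGS
    }

def follows_format(trace: str) -> bool:
    """Minimal format check: every tag appears at least once."""
    parsed = parse_trace(trace)
    return all(parsed[tag] for tag in TAGS)
```

A check like `follows_format` is also the natural building block for the format component of the Stage 2 reward.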

**Stage 2: GRPO for Autonomous Navigation**
- Strengthens intrinsic grounding and captioning capabilities through reinforcement learning
- Uses **outcome-based rewards** (correctness + format adherence) without requiring step-wise annotations
- Enables the model to autonomously discover effective temporal reasoning strategies
- Demonstrates remarkable data efficiency (10K samples)

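An outcome-based reward of this shape can be sketched as answer correctness plus a small format bonus. The `<answer>` tag and the 0.1 weight below are illustrative assumptions, not the paper's exact reward definition:

```python
import re

def format_reward(response: str) -> float:
    """1.0 if the response uses the structured reasoning tags, else 0.0."""
    required = ("<think>", "</think>", "<time>", "</time>", "<caption>", "</caption>")
    return 1.0 if all(tag in response for tag in required) else 0.0

def correctness_reward(response: str, gold: str) -> float:
    """1.0 if the final answer matches the ground truth (exact-match sketch)."""
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)  # assumed answer tag
    return 1.0 if m and m.group(1).strip().lower() == gold.strip().lower() else 0.0

def outcome_reward(response: str, gold: str, w_format: float = 0.1) -> float:
    # Pure outcome supervision: only the final answer and format adherence
    # are rewarded; no step-wise annotations are needed.
    return correctness_reward(response, gold) + w_format * format_reward(response)
```

Because the reward depends only on the final outcome, the model is free to discover its own grounding and captioning strategies inside the trace.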
## Installation

```bash
# Create conda environment
conda create -n videothinker python=3.10
conda activate videothinker

# Install requirements
cd Video-Thinker
pip install -r requirements.txt
```

## Data Preparation

Training and evaluation data are available in [data](https://github.com/shijian2001/Video-Thinker/tree/main/data):
- `data/train/` - training data
- `data/eval/id/` - in-domain evaluation data
- `data/eval/ood/` - out-of-domain evaluation data

**Note:** Video files will be released soon. The current data files contain video IDs and annotations.

### Benchmark Datasets

We evaluate on both **in-domain** and **out-of-domain** benchmarks:

**Out-of-Domain:**
- Video-Holmes, CG-Bench-Reasoning, VRBench

**In-Domain:**
- ActivityNet, STAR, ScaleLong, YouCook2, LVBench

### Training Data

**Video-Thinker-10K** is curated from diverse video reasoning tasks:
- **Caption-labeled**: ActivityNet, TutorialVQA, YouCook2
- **QA-labeled**: STAR, ScaleLong, LVBench

## Base Model

We build upon **[Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)** as our foundation model, which provides strong multimodal understanding capabilities.

## Training

### Step 1: Supervised Fine-Tuning (SFT)

Configure your training parameters and run:
```bash
bash scripts/run_sft_video.sh
```

### Step 2: Group Relative Policy Optimization (GRPO)

After SFT completes, run GRPO training:
```bash
bash scripts/run_grpo_video.sh
```

## Evaluation

Our trained model **[Video-Thinker-7B](https://huggingface.co/ShijianW01/Video-Thinker-7B)** is available on Hugging Face. You can use it directly to evaluate on your custom video reasoning tasks.
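For single-video inference, the model should be usable through the standard Qwen2.5-VL interface in `transformers` (with `qwen-vl-utils` for video loading), since it is fine-tuned from Qwen2.5-VL-7B-Instruct. The snippet below is a sketch along those lines; adjust dtype, device, and sampling to your setup:

```python
# Inference sketch for Video-Thinker-7B via the Qwen2.5-VL stack.
# Assumes `transformers` and `qwen-vl-utils` are installed.

MODEL_ID = "ShijianW01/Video-Thinker-7B"

def build_messages(video_path: str, question: str) -> list:
    """Pack one video and a question into the Qwen2.5-VL chat format."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "video", "video": video_path},
                {"type": "text", "text": question},
            ],
        }
    ]

def generate(video_path: str, question: str) -> str:
    # Heavy imports stay inside the function so the helper above is
    # importable without a GPU environment.
    import torch
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
    from qwen_vl_utils import process_vision_info

    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(MODEL_ID)

    messages = build_messages(video_path, question)
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text], images=image_inputs, videos=video_inputs,
        padding=True, return_tensors="pt",
    ).to(model.device)

    out = model.generate(**inputs, max_new_tokens=1024)
    trimmed = out[:, inputs.input_ids.shape[1]:]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]
```

If the model follows its training format, the decoded output should contain a structured `<think>`/`<time>`/`<caption>` reasoning trace before the final answer.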

To run batch evaluation on trained models:
```bash
python scripts/run_eval_batch.py
```

## Acknowledgement

We sincerely appreciate the contributions of the open-source community:
- [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL)
- [DeepSeek-R1](https://github.com/deepseek-ai/DeepSeek-R1)
- [Video-R1](https://github.com/tulerfeng/Video-R1)

## Citation

If you find Video-Thinker useful in your research, please consider citing:

```bibtex
@article{wang2025video,
  title={Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning},
  author={Wang, Shijian and Jin, Jiarui and Wang, Xingjian and Song, Linxin and Fu, Runhao and Wang, Hecheng and Ge, Zongyuan and Lu, Yuan and Cheng, Xuelian},
  journal={arXiv preprint arXiv:2510.23473},
  year={2025}
}
```

## License

This project is released under the [MIT License](https://opensource.org/licenses/MIT).

## Contact

For any questions or feedback, please reach out to us at [shijian@seu.edu.cn](mailto:shijian@seu.edu.cn).

## Star History

[Star History Chart](https://www.star-history.com/#shijian2001/Video-Thinker&Date)