nielsr (HF Staff) committed
Commit bb3e1d1 · verified · Parent: e404ca7

Improve model card for Video-Thinker-7B: Add pipeline tag, library, and correct license


This PR significantly enhances the model card for the **Video-Thinker-7B** model by:

- **Updating the license** to `mit`, aligning with the explicit statement in the GitHub repository.
- **Adding `pipeline_tag: video-text-to-text`** to accurately reflect the model's functionality (video input to text output for reasoning) and improve discoverability on the Hugging Face Hub.
- **Adding `library_name: transformers`**, as evidenced by the `config.json` and `tokenizer_config.json` files, which specify `transformers_version` and `Qwen2_5_VLProcessor`, enabling the automated "how to use" widget on the model page.
- **Adding a direct link to the Hugging Face paper page**, complementing the existing arXiv link and providing more context for users.
- **Enriching the model card content** by incorporating key sections from the project's GitHub README, including the overview, performance details, framework description, installation instructions, data preparation, training, evaluation, acknowledgement, citation, license, and contact information. All image paths have been updated to absolute GitHub URLs.
- **Removing irrelevant "File information"** and the `TODO` section for a cleaner presentation.

This update provides a more complete, accurate, and user-friendly resource for the community.
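As context for the `library_name: transformers` addition: with that tag set, the Hub's usage widget suggests loading the checkpoint roughly along these lines. This is a minimal sketch, assuming a recent `transformers` release with the Qwen2.5-VL integration; `AutoProcessor` resolves to the `Qwen2_5_VLProcessor` named in the configs.

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "ShijianW01/Video-Thinker-7B"
processor = AutoProcessor.from_pretrained(model_id)  # resolves to Qwen2_5_VLProcessor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
```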

Files changed (1)
  1. README.md +198 -5
README.md CHANGED
@@ -1,12 +1,205 @@
  ---
- license: apache-2.0
- language:
- - en
  base_model:
  - Qwen/Qwen2.5-VL-7B-Instruct
  ---
  # Video-Thinker-7B

  - **Repository:** https://github.com/shijian2001/Video-Thinker
- - **Paper:** [Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning
- ](https://www.arxiv.org/abs/2510.23473)
  ---
  base_model:
  - Qwen/Qwen2.5-VL-7B-Instruct
+ language:
+ - en
+ license: mit
+ pipeline_tag: video-text-to-text
+ library_name: transformers
  ---
+
  # Video-Thinker-7B

  - **Repository:** https://github.com/shijian2001/Video-Thinker
+ - **Paper:** [Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning](https://www.arxiv.org/abs/2510.23473)
+ - **Hugging Face Paper:** [Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning](https://huggingface.co/papers/2510.23473)
+
+ <h1 align="center"> <img src="https://github.com/shijian2001/Video-Thinker/blob/main/figures/video-thinker-logo.jpg" width="270" style="vertical-align:middle;"/><br>Sparking "Thinking with Videos" via Reinforcement Learning</h1>
+
+ <div align="center">
+
+ [![Paper](https://img.shields.io/badge/Paper-arXiv-b31b1b.svg?logo=arxiv)](https://www.arxiv.org/abs/2510.23473)
+ [![Paper](https://img.shields.io/badge/Paper-Hugging%20Face-yellow?logo=huggingface)](https://huggingface.co/papers/2510.23473)
+ [![Model](https://img.shields.io/badge/Model-Hugging%20Face-yellow?logo=huggingface)](https://huggingface.co/ShijianW01/Video-Thinker-7B)
+ [![License](https://img.shields.io/badge/LICENSE-MIT-green.svg)](https://opensource.org/licenses/MIT)
+ [![Python 3.9+](https://img.shields.io/badge/Python-3.9+-blue.svg)](https://www.python.org/downloads/release/python-390/)
+ </div>
+
+ <h5 align="center"> If you like our project, please give us a star ⭐ on GitHub for the latest updates.</h5>
+
+ <div align="center">
+ <img src="https://readme-typing-svg.herokuapp.com?font=Orbitron&size=20&duration=3000&pause=1000&color=005DE3&center=true&vCenter=true&width=800&lines=Welcome+to+Video-Thinker;Sparking+%22Thinking+with+Videos%22+via+Reinforcement+Learning;Powered+by+SEU+x+Monash+x+Xiaohongshu+Inc." alt="Typing Animation" />
+ </div>
+
+ ## 📣 Latest News
+
+ - **[October 30, 2025]**: 📄 Our paper is now available on **[arXiv](https://www.arxiv.org/abs/2510.23473)** and **[HF Paper](https://huggingface.co/papers/2510.23473)**.
+ - **[October 28, 2025]**: 🚀 Our codebase and model have been released. You can now use Video-Thinker-7B via the **[Hugging Face model](https://huggingface.co/ShijianW01/Video-Thinker-7B)**.
+
+ ## 💡 Overview
+
+ **Video-Thinker** is an end-to-end video reasoning framework that empowers MLLMs to autonomously leverage intrinsic "grounding" and "captioning" capabilities during inference. This paradigm extends "Thinking with Images" to video understanding, enabling dynamic temporal navigation and visual cue extraction without relying on external tools or pre-designed prompts.
+ To spark this capability, we construct **Video-Thinker-10K**, a curated dataset with structured reasoning traces synthesized through **hindsight-curation reasoning**, ensuring that temporal localizations and visual descriptions genuinely contribute to correct answers.
+ Furthermore, we propose a **two-stage training strategy** combining SFT for **format learning** and GRPO with **pure outcome reward** for reinforcement learning, enabling Video-Thinker to achieve state-of-the-art performance on challenging video reasoning benchmarks with remarkable data efficiency.
+
+ <div align="center">
+ <img src="https://github.com/shijian2001/Video-Thinker/blob/main/figures/banner.jpg" width="90%" />
+ </div>
+
+ ### 📊 Overall Performance
+
+ <div align="center">
+ <img src="https://github.com/shijian2001/Video-Thinker/blob/main/figures/results_fig.jpg" width="80%" />
+ </div>
+
+ **Video-Thinker-7B** achieves **state-of-the-art performance** among 7B-sized MLLMs across multiple challenging video reasoning benchmarks. Our model demonstrates exceptional capabilities on both **in-domain** and **out-of-domain** tasks:
+
+ - **Out-of-Domain Benchmarks**:
+   - **Video-Holmes**: **43.22%** (↑4.68% over best baseline)
+   - **CG-Bench-Reasoning**: **33.25%** (↑3.81% over best baseline)
+   - **VRBench**: **80.69%** (↑11.44% over best baseline)
+
+ - **In-Domain Benchmarks**:
+   - **ActivityNet**: **78.72%** | **Star**: **70.66%** | **ScaleLong**: **49.53%**
+   - **YouCook2**: **73.66%** | **LVBench**: **37.04%**
+
+ Our approach enables MLLMs to **"Think with Videos"** by autonomously leveraging intrinsic **grounding** and **captioning** capabilities, achieving superior reasoning performance with only **10K training samples**.
+
+ <div align="center">
+ <img src="https://github.com/shijian2001/Video-Thinker/blob/main/figures/results_table.jpg" width="100%" />
+ </div>
+
+ ### ✨ The Video-Thinker Framework
+
+ #### 🔄 Data Synthesis Pipeline
+
+ <div align="center">
+ <img src="https://github.com/shijian2001/Video-Thinker/blob/main/figures/data_pipe.jpg" width="100%" />
+ </div>
+
+ We construct **Video-Thinker-10K** through a systematic pipeline that transforms diverse video data into structured reasoning samples:
+
+ - **Data Sources**: We curate from 6 datasets spanning multiple domains:
+   - **Caption-labeled** (ActivityNet, TutorialVQA, YouCook2): rich temporal annotations, but no complex reasoning questions
+   - **QA-labeled** (STAR, ScaleLong, LVBench): challenging QA pairs, but no granular visual descriptions
+
+ - **Complementary Generation**:
+   - For caption-labeled data → generate complex multi-segment reasoning questions
+   - For QA-labeled data → generate answer-conditioned visual descriptions for key segments
+
+ - **Hindsight-Curation Reasoning**: We employ a novel quality assurance process in which the generated `<time>` and `<caption>` contents are validated by testing whether they enable models to derive correct answers, with up to 3 regeneration attempts to ensure high-quality supervision (see the sketch after this list).
+
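+ To make the curation step concrete, here is a minimal sketch of the regeneration loop. The helpers `generate_trace` and `answer_with_trace` are hypothetical names introduced for illustration; this is not the repository's implementation.
+
+ ```python
+ # Hindsight-curation sketch: keep a synthesized <time>/<caption> trace only if
+ # it lets a model recover the correct answer; otherwise regenerate.
+ MAX_ATTEMPTS = 3  # up to 3 regeneration attempts, per the pipeline description
+
+ def curate_sample(video, question, gold_answer, generate_trace, answer_with_trace):
+     """Return a validated reasoning trace, or None if no attempt validates."""
+     for _ in range(MAX_ATTEMPTS):
+         trace = generate_trace(video, question)         # synthesize <time>/<caption> contents
+         predicted = answer_with_trace(question, trace)  # answer using only the trace
+         if predicted == gold_answer:                    # hindsight check
+             return trace                                # trace genuinely supports the answer
+     return None  # discard the sample
+ ```
+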
+ <div align="center">
+ <img src="https://github.com/shijian2001/Video-Thinker/blob/main/figures/data_stat.jpg" width="100%" />
+ </div>
+
+ #### 🎯 Training Strategy of Video-Thinker
+
+ We adopt a **two-stage training approach** to progressively build video reasoning capabilities:
+
+ **Stage 1: SFT for Format-Following**
+ - Initializes the model to generate structured reasoning traces with `<time>`, `<caption>`, and `<think>` tags
+ - Provides an essential cold start by teaching the specialized reasoning format
+
+ **Stage 2: GRPO for Autonomous Navigation**
+ - Strengthens intrinsic grounding and captioning capabilities through reinforcement learning
+ - Uses **outcome-based rewards** (correctness + format adherence) without requiring step-wise annotations (see the sketch after this list)
+ - Enables the model to autonomously discover effective temporal reasoning strategies
+ - Demonstrates remarkable data efficiency (10K samples)
+
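+ As a concrete illustration of the pure outcome reward, here is a minimal sketch combining answer correctness with format adherence. The tag names follow the paper; the closing-tag layout, checks, and weights are assumptions for illustration, not the repository's implementation:
+
+ ```python
+ import re
+
+ # Assumed trace layout: <time>...</time>, <caption>...</caption>, <think>...</think>.
+ FORMAT_PATTERN = re.compile(
+     r"<time>.*?</time>.*?<caption>.*?</caption>.*?<think>.*?</think>",
+     re.DOTALL,
+ )
+
+ def outcome_reward(response: str, predicted: str, gold: str) -> float:
+     """Outcome-only reward: no step-wise supervision is required."""
+     correctness = 1.0 if predicted.strip() == gold.strip() else 0.0  # assumed weight
+     format_bonus = 0.5 if FORMAT_PATTERN.search(response) else 0.0   # assumed weight
+     return correctness + format_bonus
+ ```
+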
+ ## 🔧 Installation
+
+ ```bash
+ # Create conda environment
+ conda create -n videothinker python=3.10
+ conda activate videothinker
+
+ # Clone the repository and install requirements
+ git clone https://github.com/shijian2001/Video-Thinker.git
+ cd Video-Thinker
+ pip install -r requirements.txt
+ ```
+
+ ## 📦 Data Preparation
+
+ 📂 Training and evaluation data are available in [data](https://github.com/shijian2001/Video-Thinker/tree/main/data):
+ - `data/train/` - Training data
+ - `data/eval/id/` - In-domain evaluation data
+ - `data/eval/ood/` - Out-of-domain evaluation data
+
+ **Note:** Video files will be released soon. Current data files contain video IDs and annotations.
+
+ ### 📊 Benchmark Datasets
+
+ We evaluate on both **in-domain** and **out-of-domain** benchmarks:
+
+ **Out-of-Domain:**
+ - Video-Holmes, CG-Bench-Reasoning, VRBench
+
+ **In-Domain:**
+ - ActivityNet, STAR, ScaleLong, YouCook2, LVBench
+
+ ### 🎯 Training Data
+
+ **Video-Thinker-10K** is curated from diverse video reasoning tasks:
+ - **Caption-labeled**: ActivityNet, TutorialVQA, YouCook2
+ - **QA-labeled**: STAR, ScaleLong, LVBench
+
+ ## 🎨 Base Model
+
+ We build upon **[Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)** as our foundation model, which provides strong multimodal understanding capabilities.
+
+ ## 🚀 Training
+
+ ### Step 1: Supervised Fine-Tuning (SFT)
+
+ Configure your training parameters and run:
+ ```bash
+ bash scripts/run_sft_video.sh
+ ```
+
+ ### Step 2: Group Relative Policy Optimization (GRPO)
+
+ After SFT completion, run GRPO training:
+ ```bash
+ bash scripts/run_grpo_video.sh
+ ```
+
+ ## 📈 Evaluation
+
+ Our trained model **[Video-Thinker-7B](https://huggingface.co/ShijianW01/Video-Thinker-7B)** is available on Hugging Face. You can use it directly to evaluate on your custom video reasoning tasks (a minimal inference sketch follows below).
+
+ To run batch evaluation on trained models:
+ ```bash
+ python scripts/run_eval_batch.py
+ ```
+
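+ For custom single-video inference, the checkpoint can be loaded through `transformers` like any Qwen2.5-VL model. The following is a minimal sketch, assuming a recent `transformers` release with Qwen2.5-VL support and the `qwen-vl-utils` helper package; the video path and question are placeholders:
+
+ ```python
+ from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
+ from qwen_vl_utils import process_vision_info
+
+ model_id = "ShijianW01/Video-Thinker-7B"
+ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+     model_id, torch_dtype="auto", device_map="auto"
+ )
+ processor = AutoProcessor.from_pretrained(model_id)
+
+ # One user turn containing a video plus a question about it.
+ messages = [{
+     "role": "user",
+     "content": [
+         {"type": "video", "video": "file:///path/to/video.mp4", "fps": 1.0},
+         {"type": "text", "text": "What happens after the person opens the door?"},
+     ],
+ }]
+
+ text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ image_inputs, video_inputs = process_vision_info(messages)  # decode video frames
+ inputs = processor(
+     text=[text], images=image_inputs, videos=video_inputs,
+     padding=True, return_tensors="pt",
+ ).to(model.device)
+
+ output_ids = model.generate(**inputs, max_new_tokens=512)
+ # Strip the prompt tokens, keep only the newly generated reasoning trace.
+ print(processor.batch_decode(
+     output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
+ )[0])
+ ```
+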
+ ## 🙏 Acknowledgement
+
+ We sincerely appreciate the contributions of the open-source community:
+ - [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL)
+ - [DeepSeek-R1](https://github.com/deepseek-ai/DeepSeek-R1)
+ - [Video-R1](https://github.com/tulerfeng/Video-R1)
+
+ ## 📝 Citation
+
+ If you find Video-Thinker useful in your research, please consider citing:
+
+ ```bibtex
+ @article{wang2025video,
+   title={Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning},
+   author={Wang, Shijian and Jin, Jiarui and Wang, Xingjian and Song, Linxin and Fu, Runhao and Wang, Hecheng and Ge, Zongyuan and Lu, Yuan and Cheng, Xuelian},
+   journal={arXiv preprint arXiv:2510.23473},
+   year={2025}
+ }
+ ```
+
+ ## 📄 License
+
+ This project is released under the [MIT License](https://opensource.org/licenses/MIT).
+
+ ## 📞 Contact
+
+ For any questions or feedback, please reach out to us at [shijian@seu.edu.cn](mailto:shijian@seu.edu.cn).
+
+ ## Star History
+
+ [![Star History Chart](https://api.star-history.com/svg?repos=shijian2001/Video-Thinker&type=Date)](https://www.star-history.com/#shijian2001/Video-Thinker&Date)