nielsr HF Staff committed on
Commit
272c8ba
·
verified ·
1 Parent(s): b2a10e2

Improve model card for TSPO: Add metadata, paper link, and project page


This PR significantly enhances the model card for TSPO by:

* Activating the existing content, which was previously commented out.
* Adding `license: apache-2.0`, `pipeline_tag: video-text-to-text`, and `library_name: transformers` to the YAML metadata, which improves discoverability and provides crucial information at a glance.
* Including descriptive tags: `video-understanding`, `reinforcement-learning`, and `long-video`.
* Updating the content to match the more comprehensive GitHub README.
* Replacing the arXiv paper link with the official Hugging Face paper page: [TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding](https://huggingface.co/papers/2508.04369).
* Adding a link to the project page: [https://vision-cair.github.io/LongVU](https://vision-cair.github.io/LongVU).
* Correcting image paths to ensure they render correctly on the Hugging Face Hub.
* Adding a proper BibTeX citation for the paper.

This makes the model more accessible and informative for researchers and practitioners on the Hugging Face Hub.
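For reference, the metadata this PR adds lives in the YAML front matter between two `---` lines at the top of README.md. The sketch below shows how such a block can be split out of a model card; the hand-rolled parsing is purely illustrative (the Hub itself parses front matter with a real YAML parser via `huggingface_hub`):

```python
# Illustrative front-matter splitter for Hub-style model cards.
# Hand-rolled parsing for demonstration only; real cards should be read
# with huggingface_hub's ModelCard utilities.

def split_front_matter(readme: str) -> tuple:
    """Split a README into (metadata dict, body). Front matter sits
    between two `---` lines at the very top of the file."""
    lines = readme.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, readme
    end = lines[1:].index("---") + 1          # index of closing `---`
    meta = {}
    key = None
    for line in lines[1:end]:
        if line.startswith("- ") and key:     # list item under the last key
            meta.setdefault(key, []).append(line[2:].strip())
        elif ":" in line:
            key, _, value = line.partition(":")
            key = key.strip()
            meta[key] = value.strip() or []   # bare `key:` starts a list
    return meta, "\n".join(lines[end + 1:])

card = """---
license: apache-2.0
pipeline_tag: video-text-to-text
tags:
- video-understanding
- long-video
---
# TSPO"""
meta, body = split_front_matter(card)
```

On the Hub, these fields drive search filters and the inference-widget selection, which is why adding them improves discoverability.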

Files changed (1)
  1. README.md +96 -43
README.md CHANGED
@@ -1,18 +1,23 @@
- <!-- ---
  license: apache-2.0
  ---

  # TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding

- [[📖 Paper](https://arxiv.org/pdf/2508.04369)] [[🤗 TSPO-model](https://huggingface.co/hzf666/TSPO-0.4B)] [[🤗 TSPO-train-data](https://huggingface.co/datasets/hzf666/TSPO-10K)]
-

  ## 👀 Overview

- Inspired by Deepseek-R1's GRPO algorithm, We propose **Temporal Sampling Policy Optimization (TSPO)**, a reinforcement learning framework that advances long-form video understanding by addressing the challenges of unsupervised and non-differentiable sparse frame sampling.

  <div align="center">
- <img src="./assets/main_fig.png" width="800" height="400" style="object-fit: contain;">
  </div>

@@ -20,10 +25,10 @@ Inspired by Deepseek-R1's GRPO algorithm, We propose **Temporal Sampling Policy

  - Our method achieves **63.9%** accuracy on LongVideoBench and **76.3%** on MLVU, setting a new state of the art among 7B video-MLLMs.

- - Our trained temporal agent (TSPO-0.4B) demonstrates strong generalizability. When applied to LLaVA-Video-7B, it achieves an average improvement of **4.3%** across four benchmarks; with Qwen2.5VL-7B, the gain reaches **5.3%**. Transferability to other backbones is further analyzed in Table 2 of our paper.

  <div align="center">
- <img src="./assets/main_results.png" width="700" height="350" style="object-fit: contain;">
  </div>

@@ -33,16 +38,17 @@ Inspired by Deepseek-R1's GRPO algorithm, We propose **Temporal Sampling Policy

  ## 🧸 Toy example

- We present a toy example to show how TSPO works. We follow an intuition that Video-MLLMs can only give correct answers if the temporal agent samples the correct keyframes. Thanks to our joint modeling of keyframe sampling and language generation, we can use the language accuracy reward $R_A$ derived from multiple-choice training data to supervise the temporal agent.

  - As shown in the GIF, through TSPO training, the temporal agent learns to select frames that lead to the correct answer for the question *"What is the scene at the beginning of the video?"*. As a result, $R_A$ increases, the predicted score peaks converge at the video's beginning, and the sampled frames converge to this time segment.

- - **For reproduce this example**, first download [LLaVA-Video-Qwen](https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2), [CLIP-Large](https://huggingface.co/openai/clip-vit-large-patch14), and [208.mp4](https://drive.google.com/file/d/1FDIxxIcyjL0v2O6sGc0KljXi8MRLDFJk/view?usp=sharing), and modify the ``model_name_or_path`` and ``clip_path`` in the ``toy_example.sh``. The script can be run on a single GPU with at least 28GB.

  <div align="center">
- <img src="./assets/gif.gif" width="800" height="400" style="object-fit: contain;">
  </div>

  ## 📝 Set up

  ```
@@ -51,6 +57,8 @@ conda activate TSPO

  pip install -r requirement.txt
  pip install flash-attn==2.5.9.post1 --no-build-isolation

  cd lmms-eval
  pip install -e .
@@ -58,12 +66,11 @@ cd ../
  ```

-
  ## 🎥 Demo

- - Download [LLaVA-Video-Qwen-7B](https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2) or [Qwen2.5vl-7B](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct), and our [TSPO-0.4B](). Then, you can try the ``demo/llava_video_tspo.py`` or ``demo/qwen25vl_tspo.py`` .

- - We provide example long videos: [208.mp4](https://drive.google.com/file/d/1FDIxxIcyjL0v2O6sGc0KljXi8MRLDFJk/view?usp=sharing), [7XWqI121-Q4.mp4](https://drive.google.com/file/d/1qh-8I1DsgH5TbqEbr05PPO5hdGtvUK23/view?usp=sharing), [5dJUUQufzw4.mp4](https://drive.google.com/file/d/1lBf6Oo7jkhi7-fSvrc_U7SqvqET3vhrh/view?usp=sharing). you can feel free to edit the "video_path" and "question". Our model will output the responses and the sampled frames will be saved under the demo directory.

  ```
  # using llava_video as backbone
@@ -74,66 +81,112 @@ CUDA_VISIBLE_DEVICES=0 python demo/qwen25vl_tspo.py
  ```

  <div align="center">
- <img src="./assets/demo2.png" width="700" height="350" style="object-fit: contain;">
  </div>

- ## 💾 Dataset

- We provide TSPO-10K train dataset, which is available at [[🤗 TSPO-train-data]()].

- ## 🚀 Training

- First download [LLaVA-Video-Qwen](https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2) and [CLIP-Large](https://huggingface.co/openai/clip-vit-large-patch14) and modify the ``model_name_or_path`` and ``clip_path ``in the ``train_deepspeed.sh``

- ```
- bash train_deepspeed.sh
- ```

- To get your trained TSPO-0.4B weights, you should run the merge_weights.py

- ```
- bash scripts/merge_weights.py
- ```

- ## 🔮 Evaluation

- For LongVideoBench, VideoMME, and MLVU, we use lmms-eval, which we have adapted to the current project by adding files such as `llava_vid_tspo.py` and `qwen_2_5_vl_tspo.py`.

  ```
- # For Qwen2.5-VL+TSPO
- bash eval_scripts/TSPO_qwen25_vl.sh
-
- # For LLaVA-Video
- bash eval_scripts/TSPO_llava_video.sh
  ```

- You can evaluate original model without our TSPO by:

  ```
- # For Original Qwen2.5-VL
- bash eval_scripts/original_qwen25_vl.sh
-
- # For Original LLaVA-Video
- bash eval_scripts/original_llava_video.sh
  ```

- For [LVBench](https://github.com/zai-org/LVBench), we use its own evaluation protocol. You can combine our demo.py and LVBench's official github to evaluate it.

  ## Acknowledgements

  ## Citations

  If you find our work helpful for your research, please consider citing our work.

- ```
-
- ```
- -->

+ ---
  license: apache-2.0
+ pipeline_tag: video-text-to-text
+ library_name: transformers
+ tags:
+ - video-understanding
+ - reinforcement-learning
+ - long-video
  ---

  # TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding

+ [[📖 Paper](https://huggingface.co/papers/2508.04369)] [[🌐 Project Page](https://vision-cair.github.io/LongVU)] [[🤗 TSPO-model](https://huggingface.co/hzf666/TSPO-0.4B)] [[🤗 TSPO-train-data](https://huggingface.co/datasets/Canhui99/TSPO-10K)]

  ## 👀 Overview

+ To address the challenges of unsupervised and non-differentiable sparse frame sampling in Video-MLLMs, we propose **Temporal Sampling Policy Optimization (TSPO)**, a reinforcement learning framework that advances long-form video understanding.

  <div align="center">
+ <img src="https://github.com/Hui-design/TSPO/raw/main/assets/main_fig.png" width="800" height="400" style="object-fit: contain;">
  </div>


  - Our method achieves **63.9%** accuracy on LongVideoBench and **76.3%** on MLVU, setting a new state of the art among 7B video-MLLMs.

+ - Our trained temporal agent (TSPO-0.4B) demonstrates strong generalizability. When applied to LLaVA-Video-7B, it achieves an average improvement of **4.3%** across four benchmarks; with Qwen2.5-VL-7B, the gain reaches **6.1%**. Transferability to other backbones is further analyzed in Table 2 of our paper.

  <div align="center">
+ <img src="https://github.com/Hui-design/TSPO/raw/main/assets/main_results.png" width="650" height="325" style="object-fit: contain;">
  </div>

  ## 🧸 Toy example

+ We present a toy example to show how TSPO works. We follow the intuition that Video-MLLMs can only give correct answers if the temporal agent samples the correct keyframes. Thanks to our joint modeling of keyframe sampling and language generation, we can use the language response accuracy reward $R_A$ derived from multiple-choice QA to supervise the temporal agent (without frame-level annotations).

  - As shown in the GIF, through TSPO training, the temporal agent learns to select frames that lead to the correct answer for the question *"What is the scene at the beginning of the video?"*. As a result, $R_A$ increases, the predicted score peaks converge at the video's beginning, and the sampled frames converge to this time segment.

+ - **To reproduce this example**, first download [LLaVA-Video-Qwen](https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2), [CLIP-Large](https://huggingface.co/openai/clip-vit-large-patch14), and [208.mp4](https://drive.google.com/file/d/1FDIxxIcyjL0v2O6sGc0KljXi8MRLDFJk/view?usp=sharing), then modify ``model_name_or_path`` and ``clip_path`` in ``toy_example.sh``. The script can be run on a single GPU with at least 28 GB of memory.

  <div align="center">
+ <img src="https://github.com/Hui-design/TSPO/raw/main/assets/gif_short.gif" width="800" height="400" style="object-fit: contain;">
  </div>

+
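The GRPO-inspired supervision in this toy example can be sketched in a few lines. This is a toy illustration only, with hypothetical helper names (`accuracy_reward`, `group_advantages`), not the repository's implementation:

```python
# Toy illustration of TSPO's accuracy-reward supervision: candidate frame
# samplings are rewarded by answer correctness, then given GRPO-style
# group-normalized advantages. Hypothetical names, not the repo's code.

def accuracy_reward(pred: str, gt: str) -> float:
    """R_A: 1.0 when the multiple-choice prediction matches ground truth."""
    return 1.0 if pred.strip().upper() == gt.strip().upper() else 0.0

def group_advantages(rewards: list) -> list:
    """GRPO-style advantage: normalize rewards within a sampled group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-6) for r in rewards]

# Four candidate frame samplings; only those whose frames let the
# Video-MLLM answer "B" correctly receive positive advantage.
preds = ["B", "A", "B", "C"]
advs = group_advantages([accuracy_reward(p, "B") for p in preds])
```

Because the reward comes only from answer correctness, no frame-level annotation is needed, which is the point of the toy example above.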
  ## 📝 Set up

  ```

  pip install -r requirement.txt
  pip install flash-attn==2.5.9.post1 --no-build-isolation
+ pip install qwen-vl-utils
+ pip install math_verify

  cd lmms-eval
  pip install -e .
  ```


  ## 🎥 Demo

+ - Download [LLaVA-Video-Qwen-7B](https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2) or [Qwen2.5-VL-7B](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct), and our 🤗 [TSPO-0.4B](https://huggingface.co/hzf666/TSPO-0.4B). Then you can try ``demo/llava_video_tspo.py`` or ``demo/qwen25vl_tspo.py``.

+ - We provide example long videos: [208.mp4](https://drive.google.com/file/d/1FDIxxIcyjL0v2O6sGc0KljXi8MRLDFJk/view?usp=sharing), [7XWqI121-Q4.mp4](https://drive.google.com/file/d/1qh-8I1DsgH5TbqEbr05PPO5hdGtvUK23/view?usp=sharing), [5dJUUQufzw4.mp4](https://drive.google.com/file/d/1lBf6Oo7jkhi7-fSvrc_U7SqvqET3vhrh/view?usp=sharing). Feel free to edit the "video_path" and "question". Our model will output the responses, and the sampled frames will be saved under the demo directory.

  ```
  # using llava_video as backbone
  ```

  <div align="center">
+ <img src="https://github.com/Hui-design/TSPO/raw/main/assets/demo2.png" width="700" height="350" style="object-fit: contain;">
  </div>


+ ## 💾 Dataset

+ - Training
+   - Download [LLaVA-Video-178K](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K). You do not need to download the llava_hound videos inside it.
+   - Download our TSPO-10K training dataset, which is available at [🤗 TSPO-train-data](https://huggingface.co/datasets/Canhui99/TSPO-10K).

+ - Evaluation
+   - Download [LongVideoBench](https://huggingface.co/datasets/longvideobench/LongVideoBench), [MLVU](https://huggingface.co/datasets/sy1998/MLVU_dev), [VideoMME](https://huggingface.co/datasets/lmms-lab/Video-MME), and [LVBench](https://huggingface.co/datasets/DongfuJiang/LVBench).
+   - For LongVideoBench and LVBench, we use the original JSON files. For MLVU and VideoMME, we convert their Parquet files into JSON format. These JSON files are stored in `script/jsons`.
+   - To adapt the data to our common evaluation pipeline, we further organize them into TSV format and place them under `evaluation/data`.
+   - The final directory structure is as follows:

+ ```
+ - evaluation
+   - data
+     - *.tsv
+   - videos
+     - LongVideoBench
+       - video
+         - data
+           - *.mp4
+     - MLVU
+ ```

+
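The JSON-to-TSV reorganization step above can be sketched as follows. The column names (`video`, `question`, `answer`) are assumptions for illustration, not the repository's actual TSV schema under `evaluation/data`:

```python
# Illustrative sketch of flattening benchmark QA records into TSV rows
# (hypothetical column names; the repo's actual schema may differ).
import csv
import io
import json

def records_to_tsv(json_text: str) -> str:
    """Convert a JSON list of QA records into TSV with a fixed header."""
    records = json.loads(json_text)
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=["video", "question", "answer"],
        delimiter="\t",
        extrasaction="ignore",  # drop fields outside the fixed schema
    )
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

sample = '[{"video": "a.mp4", "question": "What happens first?", "answer": "B"}]'
tsv = records_to_tsv(sample)
```

TSV is convenient here because questions rarely contain tab characters, so rows stay one-line and easy to stream.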

+ ## 🚀 Training

+ First download [LLaVA-Video-Qwen](https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2) and [CLIP-Large](https://huggingface.co/openai/clip-vit-large-patch14), then modify ``model_name_or_path`` and ``clip_path`` in ``train_deepspeed.sh``. For the data paths, set ``video_folder`` to the LLaVA-Video-178K directory and ``jsonl_path`` to the path of TSPO-10K.jsonl.

+ Then, you can run the following command:

  ```
+ bash train_deepspeed.sh
  ```

+ To obtain your trained TSPO-0.4B weights, run ``scripts/merge_weights.py``:

  ```
+ python scripts/merge_weights.py
  ```

+ ## 🔮 Evaluation
+
+ - Extract CLIP features and select frame indices
+   - You need to edit the `model_path`, `root`, and `save_root` in `mp_tools/vlmeval/config.py`.
+   - The first run will save the features locally; subsequent runs will directly load the saved features, making the process much faster.
+
+ ```
+ cd mp_tools
+ bash get_frame_idx.sh LongVideoBench TSPO # dataset_name method_name
+ cd ../
+ ```
+
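The save-on-first-run, load-thereafter behavior described above is a standard feature-cache pattern; here is a sketch under assumed names (`get_features`, pickle files under `save_root`), not the repository's actual code:

```python
# Sketch of the first-run-saves / later-runs-load feature caching
# described above (illustrative names and paths, not the repo's API).
import os
import pickle
import tempfile

def get_features(video_id: str, save_root: str, compute_fn):
    """Return cached features if present; otherwise compute and save them."""
    path = os.path.join(save_root, f"{video_id}.pkl")
    if os.path.exists(path):                  # later runs: fast path
        with open(path, "rb") as f:
            return pickle.load(f)
    feats = compute_fn(video_id)              # first run: expensive path
    with open(path, "wb") as f:
        pickle.dump(feats, f)
    return feats

calls = []
def fake_clip(video_id):
    """Stand-in for the expensive CLIP feature extraction."""
    calls.append(video_id)
    return [0.1, 0.2, 0.3]

with tempfile.TemporaryDirectory() as root:
    a = get_features("208", root, fake_clip)  # computes and caches
    b = get_features("208", root, fake_clip)  # loads from cache
```

The expensive extraction runs exactly once per video, which is why repeated evaluation passes are much faster.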
+ - Run lmms-eval
+   - For LongVideoBench, VideoMME, and MLVU, we use lmms-eval, which we have adapted to the current project by adding files such as `llava_vid_tspo.py` and `qwen_2_5_vl_tspo.py`.
+   - Run:
+
+ ```
+ # For LLaVA-Video
+ bash eval_scripts/TSPO_llava_video.sh LongVideoBench TSPO # dataset_name method_name
+
+ # For Qwen2.5-VL+TSPO
+ bash eval_scripts/TSPO_qwen25_vl.sh LongVideoBench TSPO
+ ```
+
+ - You can evaluate the original models without TSPO by:
+
+ ```
+ # For Original Qwen2.5-VL
+ bash eval_scripts/original_qwen25_vl.sh
+
+ # For Original LLaVA-Video
+ bash eval_scripts/original_llava_video.sh
+ ```
+
+ - For [LVBench](https://github.com/zai-org/LVBench), we use its own evaluation protocol. The detailed code is to be released soon.

  ## Acknowledgements

+ [Open-LLaVA-Video-R1](https://github.com/Hui-design/Open-LLaVA-Video-R1), [Lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval), [VLMEvalKit](https://github.com/open-compass/VLMEvalKit), [AKS](https://github.com/ncTimTang/AKS)

  ## Citations

  If you find our work helpful for your research, please consider citing our work.

+ ```bibtex
+ @article{hu2025tspo,
+   title={TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding},
+   author={Hu, Zifei and Shen, Xiaoqian and Li, Tianchi and Wu, Lemeng and Long, Yang and Li, Hongsheng},
+   journal={arXiv preprint arXiv:2508.04369},
+   year={2025}
+ }
+ ```