Improve model card for TSPO: Add metadata, paper link, and project page
This PR significantly enhances the model card for TSPO by:
* Activating the existing content, which was previously commented out.
* Adding `license: apache-2.0`, `pipeline_tag: video-text-to-text`, and `library_name: transformers` to the YAML metadata, which improves discoverability and provides crucial information at a glance.
* Including descriptive tags: `video-understanding`, `reinforcement-learning`, and `long-video`.
* Updating the content to match the more comprehensive GitHub README.
* Replacing the arXiv paper link with the official Hugging Face paper page: [TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding](https://huggingface.co/papers/2508.04369).
* Adding a link to the project page: [https://vision-cair.github.io/LongVU](https://vision-cair.github.io/LongVU).
* Correcting image paths to ensure they render correctly on the Hugging Face Hub.
* Adding a proper BibTeX citation for the paper.
This makes the model more accessible and informative for researchers and practitioners on the Hugging Face Hub.
---
license: apache-2.0
pipeline_tag: video-text-to-text
library_name: transformers
tags:
- video-understanding
- reinforcement-learning
- long-video
---

# TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding

[[📄 Paper](https://huggingface.co/papers/2508.04369)] [[🌐 Project Page](https://vision-cair.github.io/LongVU)] [[🤗 TSPO-model](https://huggingface.co/hzf666/TSPO-0.4B)] [[🤗 TSPO-train-data](https://huggingface.co/datasets/Canhui99/TSPO-10K)]

## 🔍 Overview

To address the challenges of unsupervised and non-differentiable sparse frame sampling in Video-MLLMs, we propose **Temporal Sampling Policy Optimization (TSPO)**, a reinforcement-learning framework that advances long-form video understanding.

<div align="center">
<img src="https://github.com/Hui-design/TSPO/raw/main/assets/main_fig.png" width="800" height="400" style="object-fit: contain;">
</div>
Inspired by DeepSeek-R1's GRPO algorithm, we propose **Temporal Sampling Policy Optimization (TSPO)** for long-form video understanding:

- Our method achieves **63.9%** accuracy on LongVideoBench and **76.3%** on MLVU, setting a new state of the art among 7B video-MLLMs.

- Our trained temporal agent (TSPO-0.4B) demonstrates strong generalizability. When applied to LLaVA-Video-7B, it achieves an average improvement of **4.3%** across four benchmarks; with Qwen2.5VL-7B, the gain reaches **6.1%**. Transferability to other backbones is further analyzed in Table 2 of our paper.

<div align="center">
<img src="https://github.com/Hui-design/TSPO/raw/main/assets/main_results.png" width="650" height="325" style="object-fit: contain;">
</div>
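The GRPO-style training signal described above can be illustrated with a minimal sketch (our own illustration, not the released TSPO code): each group of sampling rollouts for a question receives a binary accuracy reward, and advantages are normalized within the group.

```python
# Illustrative sketch of GRPO-style group-relative advantages
# (not the TSPO implementation): rollouts that lead to a correct
# answer get a positive advantage, incorrect ones a negative one.
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward by the group mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 4 sampling rollouts for one question; two answered correctly.
rewards = [1.0, 0.0, 1.0, 0.0]  # binary accuracy reward per rollout
advs = group_relative_advantages(rewards)
```

The temporal agent's sampling policy is then updated to favor the frame selections of positive-advantage rollouts.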
## 🧸 Toy example

We present a toy example to show how TSPO works. We follow the intuition that Video-MLLMs can only give correct answers if the temporal agent samples the correct keyframes. Thanks to our joint modeling of keyframe sampling and language generation, we can use the language response accuracy reward $R_A$, derived from multiple-choice QA, to supervise the temporal agent (without frame-level annotation).

- As shown in the GIF, through TSPO training, the temporal agent learns to select frames that lead to the correct answer for the question *"What is the scene at the beginning of the video?"*. As a result, $R_A$ increases, the predicted score peaks converge at the video's beginning, and the sampled frames converge to this time segment.

- **To reproduce this example**, first download [LLaVA-Video-Qwen](https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2), [CLIP-Large](https://huggingface.co/openai/clip-vit-large-patch14), and [208.mp4](https://drive.google.com/file/d/1FDIxxIcyjL0v2O6sGc0KljXi8MRLDFJk/view?usp=sharing), then modify the ``model_name_or_path`` and ``clip_path`` in ``toy_example.sh``. The script can be run on a single GPU with at least 28 GB of memory.

<div align="center">
<img src="https://github.com/Hui-design/TSPO/raw/main/assets/gif_short.gif" width="800" height="400" style="object-fit: contain;">
</div>
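As a rough illustration of the accuracy reward $R_A$ above (the answer format and parsing here are assumptions; the repo's actual parsing may differ), a multiple-choice reward can be computed by extracting the predicted option letter and comparing it to the ground truth:

```python
# Sketch of a multiple-choice accuracy reward R_A (illustrative only):
# reward 1.0 iff the first option letter found in the model output
# matches the ground-truth answer, else 0.0.
import re

def accuracy_reward(model_output: str, ground_truth: str) -> float:
    m = re.search(r"\b([A-D])\b", model_output.strip())
    return 1.0 if m and m.group(1) == ground_truth else 0.0

r_correct = accuracy_reward("The answer is (B).", "B")   # -> 1.0
r_wrong = accuracy_reward("I think the answer is C.", "B")  # -> 0.0
```

Because this reward depends only on the final answer, no frame-level labels are needed to supervise the temporal agent.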
## 🛠️ Set up

```
conda activate TSPO
pip install -r requirement.txt
pip install flash-attn==2.5.9.post1 --no-build-isolation
pip install qwen-vl-utils
pip install math_verify

cd lmms-eval
pip install -e .
cd ../
```
## 🎥 Demo

- Download [LLaVA-Video-Qwen-7B](https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2) or [Qwen2.5vl-7B](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct), and our 🤗 [TSPO-0.4B](https://huggingface.co/hzf666/TSPO-0.4B). Then, you can try ``demo/llava_video_tspo.py`` or ``demo/qwen25vl_tspo.py``.

- We provide example long videos: [208.mp4](https://drive.google.com/file/d/1FDIxxIcyjL0v2O6sGc0KljXi8MRLDFJk/view?usp=sharing), [7XWqI121-Q4.mp4](https://drive.google.com/file/d/1qh-8I1DsgH5TbqEbr05PPO5hdGtvUK23/view?usp=sharing), [5dJUUQufzw4.mp4](https://drive.google.com/file/d/1lBf6Oo7jkhi7-fSvrc_U7SqvqET3vhrh/view?usp=sharing). Feel free to edit the "video_path" and "question". The model will output its responses, and the sampled frames will be saved under the demo directory.

```
# using llava_video as backbone
CUDA_VISIBLE_DEVICES=0 python demo/llava_video_tspo.py

# using qwen2.5vl as backbone
CUDA_VISIBLE_DEVICES=0 python demo/qwen25vl_tspo.py
```

<div align="center">
<img src="https://github.com/Hui-design/TSPO/raw/main/assets/demo2.png" width="700" height="350" style="object-fit: contain;">
</div>

## 💾 Dataset

- Training
  - Download [LLaVA-Video-178K](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K). You don't need to download the llava_hound videos inside it.
  - Download our TSPO-10K training dataset, which is available at [🤗 TSPO-train-data](https://huggingface.co/datasets/Canhui99/TSPO-10K).

- Evaluation
  - Download [LongVideoBench](https://huggingface.co/datasets/longvideobench/LongVideoBench), [MLVU](https://huggingface.co/datasets/sy1998/MLVU_dev), [VideoMME](https://huggingface.co/datasets/lmms-lab/Video-MME), and [LVBench](https://huggingface.co/datasets/DongfuJiang/LVBench).
  - For LongVideoBench and LVBench, we use the original JSON files. For MLVU and VideoMME, we convert their Parquet files into JSON format. These JSON files are stored in `script/jsons`.
  - To adapt the data to our evaluation pipeline, we further organize them into TSV format and place them under `evaluation/data`.
  - The final directory structure is as follows:

```
- evaluation
  - data
    - *.tsv
  - videos
    - LongVideoBench
      - video
        - data
          - *.mp4
    - MLVU
```
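The JSON-to-TSV reorganization described above can be sketched with the standard library alone (the field names `video`, `question`, and `answer` are hypothetical here, not the repo's actual schema):

```python
# Sketch: flatten a list of JSON QA records into TSV rows.
# Field names are assumptions for illustration, not the repo's schema.
import csv
import io
import json

def json_to_tsv(json_text: str) -> str:
    items = json.loads(json_text)
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["video", "question", "answer"],
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for it in items:
        writer.writerow({k: it[k] for k in ("video", "question", "answer")})
    return buf.getvalue()

example = json.dumps([{"video": "208.mp4",
                       "question": "What is the opening scene?",
                       "answer": "A"}])
tsv = json_to_tsv(example)
```

In practice each benchmark's converted JSON would be written to one `.tsv` file under `evaluation/data`.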

## 🚀 Training

First download [LLaVA-Video-Qwen](https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2) and [CLIP-Large](https://huggingface.co/openai/clip-vit-large-patch14), and modify the ``model_name_or_path`` and ``clip_path`` in ``train_deepspeed.sh``. For the data paths, modify ``video_folder`` to point to LLaVA-Video-178K and ``jsonl_path`` to point to TSPO-10K.jsonl.

Then, you can run the following command:

```
bash train_deepspeed.sh
```

To get your trained TSPO-0.4B weights, run ``merge_weights.py``:

```
python scripts/merge_weights.py
```

## 🔮 Evaluation

- Extract CLIP features and select frame indices
  - You need to edit the `model_path`, `root`, and `save_root` in `mp_tools/vlmeval/config.py`.
  - The first run will save the features locally; subsequent runs will directly load the saved features, making the process much faster.

```
cd mp_tools
bash get_frame_idx.sh LongVideoBench TSPO # dataset_name method_name
cd ../
```

- Run lmms-eval
  - For LongVideoBench, VideoMME, and MLVU, we use lmms-eval, which we have adapted to the current project by adding files such as `llava_vid_tspo.py` and `qwen_2_5_vl_tspo.py`.
  - Run:

```
# For LLaVA-Video
bash eval_scripts/TSPO_llava_video.sh LongVideoBench TSPO # dataset_name method_name

# For Qwen2.5-VL+TSPO
bash eval_scripts/TSPO_qwen25_vl.sh LongVideoBench TSPO
```

- You can evaluate the original models without our TSPO by:

```
# For Original Qwen2.5-VL
bash eval_scripts/original_qwen25_vl.sh

# For Original LLaVA-Video
bash eval_scripts/original_llava_video.sh
```

- For [LVBench](https://github.com/zai-org/LVBench), we use its own evaluation protocol. The detailed code will be released soon.
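The "save once, reuse later" feature-caching step above can be sketched as follows (purely illustrative: the real pipeline extracts CLIP features via `get_frame_idx.sh`; the cache format and score values here are invented):

```python
# Sketch: cache per-frame relevance scores to disk, then select the
# indices of the k highest-scoring frames in temporal order.
# Illustrative only; not the repo's actual caching format.
import json
import pathlib
import tempfile

def select_frame_indices(scores, k):
    """Indices of the k highest-scoring frames, kept in temporal order."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)

def cached_scores(cache_path, compute):
    """Load per-frame scores from disk if present, else compute and save."""
    p = pathlib.Path(cache_path)
    if p.exists():
        return json.loads(p.read_text())
    scores = compute()
    p.write_text(json.dumps(scores))
    return scores

cache = pathlib.Path(tempfile.mkdtemp()) / "video0_scores.json"
scores = cached_scores(cache, lambda: [0.1, 0.9, 0.3, 0.8, 0.2])
idx = select_frame_indices(scores, k=2)  # -> [1, 3]
```

On a second run the `compute` callback is skipped and the scores are read straight from the cache, which is why subsequent evaluations are much faster.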

## Acknowledgements

[Open-LLaVA-Video-R1](https://github.com/Hui-design/Open-LLaVA-Video-R1), [Lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval), [VLMEvalKit](https://github.com/open-compass/VLMEvalKit), [AKS](https://github.com/ncTimTang/AKS)

## Citations

If you find our work helpful for your research, please consider citing our work.

```bibtex
@article{hu2025tspo,
  title={TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding},
  author={Hu, Zifei and Shen, Xiaoqian and Li, Tianchi and Wu, Lemeng and Long, Yang and Li, Hongsheng},
  journal={arXiv preprint arXiv:2508.04369},
  year={2025}
}
```