---
license: apache-2.0
pipeline_tag: video-text-to-text
library_name: transformers
tags:
  - video-understanding
  - reinforcement-learning
  - long-video
---

# TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding

[📖 Paper] [🌐 Project Page] [🤗 TSPO-model] [🤗 TSPO-train-data]

## 👀 Overview

To address the challenges of unsupervised and non-differentiable sparse frame sampling in Video-MLLMs, we propose **Temporal Sampling Policy Optimization (TSPO)**, a reinforcement-learning framework that advances long-form video understanding.

๐Ÿ† Performance

- Our method achieves 63.9% accuracy on LongVideoBench and 76.3% on MLVU, setting a new state of the art among 7B video-MLLMs.

- Our trained temporal agent (TSPO-0.4B) demonstrates strong generalizability. When applied to LLaVA-Video-7B, it achieves an average improvement of 4.3% across four benchmarks; with Qwen2.5VL-7B, the gain reaches 6.1%. Transferability to other backbones is further analyzed in Table 2 of our paper.

๐Ÿ“ Code

โœ’๏ธ TSPO

### 🧸 Toy example

We present a toy example to show how TSPO works. It follows the intuition that a Video-MLLM can give the correct answer only if the temporal agent samples the right keyframes. Thanks to our joint modeling of keyframe sampling and language generation, we can use the language-response accuracy reward $R_A$, derived from multiple-choice QA, to supervise the temporal agent without frame-level annotation.

- As shown in the GIF, through TSPO training the temporal agent learns to select frames that lead to the correct answer for the question "What is the scene at the beginning of the video?". As a result, $R_A$ increases, the predicted score peaks converge at the video's beginning, and the sampled frames converge to this time segment.

- To reproduce this example, first download LLaVA-Video-Qwen, CLIP-Large, and `208.mp4`, then modify `model_name_or_path` and `clip_path` in `toy_example.sh`. The script can be run on a single GPU with at least 28 GB of memory.
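The supervision signal above hinges only on answer correctness. As a rough illustration (not the repository's actual code), a binary multiple-choice accuracy reward of this kind can be computed as follows; the function name and the answer-extraction regex are assumptions:

```python
import re

def accuracy_reward(response: str, ground_truth: str) -> float:
    """Binary reward R_A for multiple-choice QA: 1.0 if the option
    letter found in the model's response matches the ground-truth
    letter, else 0.0 -- no frame-level annotation is required."""
    match = re.search(r"\b([A-D])\b", response.upper())
    if match is None:
        return 0.0
    return 1.0 if match.group(1) == ground_truth.strip().upper() else 0.0

print(accuracy_reward("The answer is (B).", "B"))  # -> 1.0
```

Because $R_A$ depends only on the final language response, it can supervise the otherwise non-differentiable frame sampler through policy-gradient updates: rollouts whose sampled frames let the Video-MLLM answer correctly receive reward 1.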

๐Ÿ“ Set up

```bash
conda create -n TSPO python=3.10
conda activate TSPO

pip install -r requirement.txt
pip install flash-attn==2.5.9.post1 --no-build-isolation
pip install qwen-vl-utils
pip install math_verify

cd lmms-eval
pip install -e .
cd ../
```

## 🎥 Demo

```bash
# Using LLaVA-Video as backbone
CUDA_VISIBLE_DEVICES=0 python demo/llava_video_tspo.py

# Using Qwen2.5-VL as backbone
CUDA_VISIBLE_DEVICES=0 python demo/qwen25vl_tspo.py
```

## 💾 Dataset

- Training

- Evaluation

  - Download LongVideoBench, MLVU, VideoMME, and LVBench.

  - For LongVideoBench and LVBench, we use the original JSON files. For MLVU and VideoMME, we convert their Parquet files into JSON format. These JSON files are stored in `script/jsons`.

  - To adapt the data to our evaluation pipeline, we further organize them into TSV format and place them under `evaluation/data`.

  - The final directory structure is as follows:

    ```
    - evaluation
        - data
            - *.tsv
            - videos
                - LongVideoBench
                    - video
                        - data
                            - *.mp4
                - MLVU
    ```
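As a concrete illustration of the JSON-to-TSV step, the sketch below uses only the standard library; the column names (`video`, `question`, `answer`) are assumptions for illustration, and the actual TSV schema follows the repository's evaluation pipeline:

```python
import csv
import json

def json_to_tsv(json_path: str, tsv_path: str,
                fields=("video", "question", "answer")):
    """Flatten a list-of-dicts benchmark JSON into a TSV file,
    keeping only the requested columns."""
    with open(json_path) as f:
        records = json.load(f)
    with open(tsv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(fields),
                                delimiter="\t", extrasaction="ignore")
        writer.writeheader()
        for rec in records:
            writer.writerow({k: rec.get(k, "") for k in fields})
```

A call such as `json_to_tsv("script/jsons/MLVU.json", "evaluation/data/MLVU.tsv")` (hypothetical paths) would then produce one TSV per benchmark under `evaluation/data`.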

## 🚀 Training

First download LLaVA-Video-Qwen and CLIP-Large, then modify `model_name_or_path` and `clip_path` in `train_deepspeed.sh`. For the data paths, set `video_folder` to the path of LLaVA-Video-178K and `jsonl_path` to the path of `TSPO-10K.jsonl`.

Then, you can run the following command:

```bash
bash train_deepspeed.sh
```

To obtain your trained TSPO-0.4B weights, run `merge_weights.py`:

```bash
python scripts/merge_weights.py
```

## 🔮 Evaluation

- Extract CLIP features and select frame indices

  - Edit `model_path`, `root`, and `save_root` in `mp_tools/vlmeval/config.py`.
  - The first run saves the features locally; subsequent runs load the saved features directly, which makes the process much faster.

  ```bash
  cd mp_tools
  bash get_frame_idx.sh LongVideoBench TSPO  # dataset_name method_name
  cd ../
  ```

- Run lmms-eval

  - For LongVideoBench, VideoMME, and MLVU, we use lmms-eval, which we have adapted to this project by adding files such as `llava_vid_tspo.py` and `qwen_2_5_vl_tspo.py`.
  - Run:

  ```bash
  # For LLaVA-Video
  bash eval_scripts/TSPO_llava_video.sh LongVideoBench TSPO  # dataset_name method_name

  # For Qwen2.5-VL + TSPO
  bash eval_scripts/TSPO_qwen25_vl.sh LongVideoBench TSPO
  ```

- You can evaluate the original models without TSPO by:

  ```bash
  # For original Qwen2.5-VL
  bash eval_scripts/original_qwen25_vl.sh

  # For original LLaVA-Video
  bash eval_scripts/original_llava_video.sh
  ```

- For LVBench, we use its own evaluation protocol; the detailed code will be released soon.
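The feature-caching behavior described above follows a common compute-once/reuse pattern. Here is a minimal stdlib sketch of the idea; the helper name, cache layout, and `compute_fn` hook are illustrative, not the repository's actual code:

```python
import os
import pickle

def get_features(video_id: str, save_root: str, compute_fn):
    """Load cached per-video features if present; otherwise compute
    them once with `compute_fn` and save them for later runs."""
    os.makedirs(save_root, exist_ok=True)
    cache_path = os.path.join(save_root, f"{video_id}.pkl")
    if os.path.exists(cache_path):       # fast path: reuse saved features
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    feats = compute_fn(video_id)         # slow path: first run only
    with open(cache_path, "wb") as f:
        pickle.dump(feats, f)
    return feats
```

With this pattern, the expensive CLIP forward pass runs once per video under `save_root`, and every later evaluation pass only pays the cost of loading the file.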

## Acknowledgements

Open-LLaVA-Video-R1, Lmms-eval, VLMEvalKit, AKS

## Citations

If you find our work helpful for your research, please consider citing:

```bibtex
@article{hu2025tspo,
  title={TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding},
  author={Hu, Zifei and Shen, Xiaoqian and Li, Tianchi and Wu, Lemeng and Long, Yang and Li, Hongsheng},
  journal={arXiv preprint arXiv:2508.04369},
  year={2025}
}
```