---
license: apache-2.0
pipeline_tag: video-text-to-text
library_name: transformers
tags:
  - video-understanding
  - reinforcement-learning
  - long-video
---

# TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding

[📖 Paper] [🌐 Project Page] [🤗 TSPO-model] [🤗 TSPO-train-data]

## 👀 Overview

To address the challenges of unsupervised and non-differentiable sparse frame sampling in Video-MLLMs, we propose **Temporal Sampling Policy Optimization (TSPO)**, a reinforcement-learning framework that advances long-form video understanding.

๐Ÿ† Performance

- Our method achieves 63.9% accuracy on LongVideoBench and 76.3% on MLVU, setting a new state of the art among 7B video-MLLMs.

- Our trained temporal agent (TSPO-0.4B) demonstrates strong generalizability. When applied to LLaVA-Video-7B, it achieves an average improvement of 4.3% across four benchmarks; with Qwen2.5VL-7B, the gain reaches 6.1%. Transferability to other backbones is further analyzed in Table 2 of our paper.

๐Ÿ“ Code

โœ’๏ธ TSPO

### 🧸 Toy example

We present a toy example to show how TSPO works. It follows the intuition that a Video-MLLM can give the correct answer only if the temporal agent samples the right keyframes. Thanks to our joint modeling of keyframe sampling and language generation, we can use the language-response accuracy reward $R_A$, derived from multiple-choice QA, to supervise the temporal agent without frame-level annotation.

- As shown in the GIF, through TSPO training the temporal agent learns to select frames that lead to the correct answer for the question "What is the scene at the beginning of the video?". As a result, $R_A$ increases, the predicted score peaks converge at the video's beginning, and the sampled frames converge to this time segment.

- To reproduce this example, first download LLaVA-Video-Qwen, CLIP-Large, and `208.mp4`, then modify `model_name_or_path` and `clip_path` in `toy_example.sh`. The script can be run on a single GPU with at least 28 GB of memory.
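The supervision signal above hinges only on answer correctness. As a rough illustration (not the repository's actual code), a binary multiple-choice accuracy reward of this kind can be computed as follows; the function name and the answer-extraction regex are assumptions:

```python
import re

def accuracy_reward(response: str, ground_truth: str) -> float:
    """Binary reward R_A for multiple-choice QA: 1.0 if the option
    letter found in the model's response matches the ground-truth
    letter, else 0.0 -- no frame-level annotation is required."""
    match = re.search(r"\b([A-D])\b", response.upper())
    if match is None:
        return 0.0
    return 1.0 if match.group(1) == ground_truth.strip().upper() else 0.0

print(accuracy_reward("The answer is (B).", "B"))  # -> 1.0
```

Because $R_A$ depends only on the final language response, it can supervise the otherwise non-differentiable frame sampler through policy-gradient updates: rollouts whose sampled frames let the Video-MLLM answer correctly receive reward 1.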

๐Ÿ“ Set up

```bash
conda create -n TSPO python=3.10
conda activate TSPO

pip install -r requirement.txt
pip install flash-attn==2.5.9.post1 --no-build-isolation
pip install qwen-vl-utils
pip install math_verify

cd lmms-eval
pip install -e .
cd ../
```

## 🎥 Demo

```bash
# Using LLaVA-Video as backbone
CUDA_VISIBLE_DEVICES=0 python demo/llava_video_tspo.py

# Using Qwen2.5-VL as backbone
CUDA_VISIBLE_DEVICES=0 python demo/qwen25vl_tspo.py
```

## 💾 Dataset

- Training

- Evaluation

  - Download LongVideoBench, MLVU, VideoMME, and LVBench.

  - For LongVideoBench and LVBench, we use the original JSON files. For MLVU and VideoMME, we convert their Parquet files into JSON format. These JSON files are stored in `script/jsons`.

  - To adapt the data to our evaluation pipeline, we further organize them into TSV format and place them under `evaluation/data`.

  - The final directory structure is as follows:

    ```
    - evaluation
        - data
            - *.tsv
            - videos
                - LongVideoBench
                    - video
                        - data
                            - *.mp4
                - MLVU
    ```
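As a concrete illustration of the JSON-to-TSV step, the sketch below uses only the standard library; the column names (`video`, `question`, `answer`) are assumptions for illustration, and the actual TSV schema follows the repository's evaluation pipeline:

```python
import csv
import json

def json_to_tsv(json_path: str, tsv_path: str,
                fields=("video", "question", "answer")):
    """Flatten a list-of-dicts benchmark JSON into a TSV file,
    keeping only the requested columns."""
    with open(json_path) as f:
        records = json.load(f)
    with open(tsv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(fields),
                                delimiter="\t", extrasaction="ignore")
        writer.writeheader()
        for rec in records:
            writer.writerow({k: rec.get(k, "") for k in fields})
```

A call such as `json_to_tsv("script/jsons/MLVU.json", "evaluation/data/MLVU.tsv")` (hypothetical paths) would then produce one TSV per benchmark under `evaluation/data`.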

## 🚀 Training

First download LLaVA-Video-Qwen and CLIP-Large, then modify `model_name_or_path` and `clip_path` in `train_deepspeed.sh`. For the data paths, set `video_folder` to the path of LLaVA-Video-178K and `jsonl_path` to the path of `TSPO-10K.jsonl`.

Then, you can run the following command:

```bash
bash train_deepspeed.sh
```

To obtain your trained TSPO-0.4B weights, run `merge_weights.py`:

```bash
python scripts/merge_weights.py
```

## 🔮 Evaluation

- Extract CLIP features and select frame indices

  - Edit `model_path`, `root`, and `save_root` in `mp_tools/vlmeval/config.py`.
  - The first run saves the features locally; subsequent runs load the saved features directly, which makes the process much faster.

  ```bash
  cd mp_tools
  bash get_frame_idx.sh LongVideoBench TSPO  # dataset_name method_name
  cd ../
  ```

- Run lmms-eval

  - For LongVideoBench, VideoMME, and MLVU, we use lmms-eval, which we have adapted to this project by adding files such as `llava_vid_tspo.py` and `qwen_2_5_vl_tspo.py`.
  - Run:

  ```bash
  # For LLaVA-Video
  bash eval_scripts/TSPO_llava_video.sh LongVideoBench TSPO  # dataset_name method_name

  # For Qwen2.5-VL + TSPO
  bash eval_scripts/TSPO_qwen25_vl.sh LongVideoBench TSPO
  ```

- You can evaluate the original models without TSPO by:

  ```bash
  # For original Qwen2.5-VL
  bash eval_scripts/original_qwen25_vl.sh

  # For original LLaVA-Video
  bash eval_scripts/original_llava_video.sh
  ```

- For LVBench, we use its own evaluation protocol; the detailed code will be released soon.
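The feature-caching behavior described above follows a common compute-once/reuse pattern. Here is a minimal stdlib sketch of the idea; the helper name, cache layout, and `compute_fn` hook are illustrative, not the repository's actual code:

```python
import os
import pickle

def get_features(video_id: str, save_root: str, compute_fn):
    """Load cached per-video features if present; otherwise compute
    them once with `compute_fn` and save them for later runs."""
    os.makedirs(save_root, exist_ok=True)
    cache_path = os.path.join(save_root, f"{video_id}.pkl")
    if os.path.exists(cache_path):       # fast path: reuse saved features
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    feats = compute_fn(video_id)         # slow path: first run only
    with open(cache_path, "wb") as f:
        pickle.dump(feats, f)
    return feats
```

With this pattern, the expensive CLIP forward pass runs once per video under `save_root`, and every later evaluation pass only pays the cost of loading the file.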

## Acknowledgements

Open-LLaVA-Video-R1, Lmms-eval, VLMEvalKit, AKS

## Citations

If you find our work helpful for your research, please consider citing:

```bibtex
@article{hu2025tspo,
  title={TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding},
  author={Hu, Zifei and Shen, Xiaoqian and Li, Tianchi and Wu, Lemeng and Long, Yang and Li, Hongsheng},
  journal={arXiv preprint arXiv:2508.04369},
  year={2025}
}
```