LongVT-RFT / README.md

Sudong Wang

Update README.md

4c9a2d4 verified 3 months ago

6.83 kB

	---
	base_model:
	- Qwen/Qwen2.5-VL-7B-Instruct
	datasets:
	- longvideotool/LongVT-Parquet
	license: apache-2.0
	library_name: transformers
	pipeline_tag: video-text-to-text
	---

	# LongVT: Incentivizing “Thinking with Long Videos” via Native Tool Calling

	<div align="center">

	[![Data](https://img.shields.io/badge/Data-0040A1?style=for-the-badge&logo=huggingface&logoColor=ffffff&labelColor)](https://huggingface.co/collections/lmms-lab/longvt)
	[![Paper](https://img.shields.io/badge/Paper-000000?style=for-the-badge&logo=arxiv&logoColor=white)](https://arxiv.org/abs/2511.20785)
	[![Project Page](https://img.shields.io/badge/Website-000000?style=for-the-badge&logo=google-chrome&logoColor=white)](https://evolvinglmms-lab.github.io/LongVT/)
	[![Github](https://img.shields.io/badge/Code-000000?style=for-the-badge&logo=github&logoColor=white)](https://github.com/EvolvingLMMs-Lab/LongVT)
	</div>

	## Overview

	Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought.
	However, they remain vulnerable to hallucination, especially when processing long-form videos where evidence is sparse and temporally dispersed.
	Inspired by how humans comprehend long videos-by first skimming globally and then examining relevant clips for details-we introduce LongVT, an end-to-end agentic framework that enables ``Thinking with Long Videos'' via interleaved Multimodal Chain-of-Tool-Thought.
	Specifically, we exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames.

	This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence.
	Given the scarcity of fine-grained question-answering (QA) data for the long video reasoning task, we curate and will release a data suite named VideoSIAH to facilitate both training and evaluation.
	Specifically, our training dataset consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning, respectively.
	Our evaluation benchmark consists of 1,280 QA pairs that are carefully curated through a semi-automatic data pipeline with human-in-the-loop validation.
	With a meticulously designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks.


	## Model Card

	The model is the RFT version of the LongVT and was trained on https://huggingface.co/datasets/longvideotool/LongVT-Parquet.

	## Usage & Evaluation

	For detailed instructions on inference and evaluation, please refer to our [GitHub repository](https://github.com/EvolvingLMMs-Lab/LongVT). We recommend using the scripts and environment provided there to reproduce our results.

	## Evaluation Results


	\| Model \| Reasoning Prompt \| Tool Calling \| VideoMME<br>(≈1018s) \| VideoMMMU<br>(subtitle) \| VideoMMMU<br>(adaptation) \| VideoMMMU<br>(comprehension) \| LVBench<br>(≈4101s) \| VideoSIAH-Eval<br>(≈1688s) \| Average Score \|
	\| :--- \| :---: \| :---: \| :---: \| :---: \| :---: \| :---: \| :---: \| :---: \| :---: \|
	\| Proprietary LMMs \| \| \| \| \| \| \| \| \| \|
	\| GPT-4o \| ✗ \| ✗ \| 77.2<sup>†</sup> \| 66.0<sup>†</sup> \| 62.0<sup>†</sup> \| 55.7<sup>†</sup> \| 30.8<sup>†</sup> \| 17.4 \| 51.5 \|
	\| Gemini 1.5 Pro \| ✗ \| ✗ \| 81.3<sup>†</sup> \| 59.0<sup>†</sup> \| 53.3<sup>†</sup> \| 49.3<sup>†</sup> \| 33.1<sup>†</sup> \| - \| 55.2 \|
	\| Open-Source (Sparse) \| \| \| \| \| \| \| \| \| \|
	\| Qwen2.5-VL-7B \| ✗ \| ✗ \| <u>62.6</u> \| <u>37.3</u> \| 28.0 \| 36.7 \| 30.7 \| <u>28.1</u> \| 37.2 \|
	\| Video-R1-7B \| ✓ \| ✗ \| 61.0 \| 36.3 \| 40.7 \| 52.3 \| 37.2 \| 27.9 \| <u>42.6</u> \|
	\| VideoRFT-7B \| ✓ \| ✗ \| 60.9 \| 36.7 \| 42.0 \| <u>53.0</u> \| 34.7 \| 26.5 \| 42.3 \|
	\| Video-Thinker-7B \| ✓ \| ✗ \| 61.0 \| 34.3 \| <u>44.7</u> \| <u>53.0</u> \| 52.2 \| 10.4 \| <u>42.6</u> \|
	\| LongVT-7B-SFT (Ours) \| ✓ \| ✓ \| 12.5 \| 37.7 \| 46.0 \| 58.3 \| 36.0 \| 26.8 \| 36.2 \|
	\| LongVT-7B-RL (Ours) \| ✓ \| ✓ \| 66.1 \| 32.7 \| <u>44.7</u> \| 50.0 \| <u>37.8</u> \| 31.0 \| 43.7 \|
	\| Open-Source (Dense) \| \| \| \| \| \| \| \| \| \|
	\| Qwen2.5-VL-7B \| ✗ \| ✗ \| 64.3 \| 35.7 \| 44.3 \| 56.7 \| 40.9 \| 33.8 \| 46.0 \|
	\| Video-R1-7B \| ✓ \| ✗ \| 60.5 \| <u>37.3</u> \| 38.7 \| 46.3 \| 40.1 \| 33.1 \| 42.7 \|
	\| VideoRFT-7B \| ✓ \| ✗ \| 49.2 \| 37.7 \| 40.7 \| 48.7 \| 18.7 \| 26.9 \| 37.0 \|
	\| Video-Thinker-7B \| ✓ \| ✗ \| 60.8 \| 37.7 \| 42.7 \| 55.3 \| 54.3 \| 6.6 \| 42.9 \|
	\| LongVT-7B-SFT (Ours) \| ✓ \| ✓ \| 64.9 \| 32.3 \| 42.0 \| 49.7 \| 41.1 \| 34.8 \| 44.1 \|
	\| LongVT-7B-RL (Ours) \| ✓ \| ✓ \| <u>66.1</u> \| 37.7 \| 42.3 \| <u>56.3</u> \| <u>41.4</u> \| <u>35.9</u> \| <u>46.6</u> \|
	\| LongVT-7B-RFT (Ours) \| ✓ \| ✓ \| 67.0 \| 35.7 \| <u>43.7</u> \| 56.7 \| 41.3 \| 42.0 \| 47.7 \|

	> Performance Comparison with Existing Video-Centric LMMs across Various Long Video Understanding and Reasoning Benchmarks. The best and second-best result among open-source models in each column is marked in bold and <u>underlined</u>, respectively. The numbers with "≈" denote the average video duration of each benchmark. <sup>†</sup> indicates results sourced from official reports. Reasoning Prompt indicates whether a standard reasoning-style prompt (✓) or a direct question-answering prompt (✗) is applied; Tool Calling denotes whether native tool calling is enabled (✓) or disabled (✗) in the prompt.

	## Citation

	If you find LongVT useful for your research and applications, please cite using this BibTeX:

	```bibtex
	@misc{yang2025longvtincentivizingthinkinglong,
	title={LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling},
	author={Zuhao Yang and Sudong Wang and Kaichen Zhang and Keming Wu and Sicong Leng and Yifan Zhang and Chengwei Qin and Shijian Lu and Xingxuan Li and Lidong Bing},
	year={2025},
	eprint={2511.20785},
	archivePrefix={arXiv},
	primaryClass={cs.CV},
	url={https://arxiv.org/abs/2511.20785},
	}
	```

	Check out this paper: https://arxiv.org/abs/2511.20785

	## Acknowledgements

	We gratefully acknowledge the following open-source projects that made this work possible:

	- [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) for providing the comprehensive evaluation framework for large multimodal models.
	- [lmms-engine](https://github.com/EvolvingLMMs-Lab/lmms-engine) for the SFT training infrastructure and tools.
	- [verl](https://github.com/volcengine/verl) for the reinforcement learning training framework.

	We thank the developers and contributors of these projects for their excellent work and for making their code publicly available.