---
library_name: transformers
pipeline_tag: video-text-to-text
---
# LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding
LongVideo-R1 is an active, reasoning-equipped multimodal large language model (MLLM) agent designed for efficient video context navigation. It addresses the challenge of understanding long videos with low computational budgets by avoiding exhaustive searches. At its core, it uses a reasoning module to infer which video clips are most informative for answering a query.
The model is fine-tuned from Qwen-3-8B using a two-stage training paradigm:
- Supervised Fine-Tuning (SFT): Using high-quality chain-of-thought-with-tool trajectories.
- Reinforcement Learning (RL): Employing a specifically designed reward function to maximize selective and efficient clip navigation.
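To make the RL objective concrete, here is a minimal sketch of an efficiency-aware navigation reward. This is a hypothetical illustration only (the paper's actual reward function is not reproduced here); the function name, weight, and scoring scheme are assumptions chosen to show the idea of trading answer correctness against the number of clips inspected.

```python
def navigation_reward(correct: bool, clips_viewed: int, total_clips: int,
                      efficiency_weight: float = 0.5) -> float:
    """Hypothetical reward: correctness bonus minus a penalty for clips viewed.

    This is an illustrative stand-in, not the reward used by LongVideo-R1.
    """
    accuracy_term = 1.0 if correct else 0.0
    # Fraction of the video the agent had to inspect (lower is better).
    cost_term = clips_viewed / total_clips
    return accuracy_term - efficiency_weight * cost_term

# A correct answer reached after viewing 2 of 100 clips scores higher
# than the same correct answer reached after viewing 80 of 100.
assert navigation_reward(True, 2, 100) > navigation_reward(True, 80, 100)
```

Any reward of this shape pushes the policy toward selective navigation: exhaustive viewing is penalized even when the final answer is correct.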
## Usage
For online testing with tool use and multi-round reasoning, use the `cli.py` script provided in the official repository. Note that the reasoning, caption, and `video_qa` models must each be deployed in vLLM serve mode.
```bash
python cli.py \
    --video_path /path/to/video.mp4 \
    --question "What is the man doing in this video?" \
    --reasoning_base_url http://127.0.0.1:25600/v1 \
    --caption_base_url http://127.0.0.1:9081/v1 \
    --videoqa_base_url http://127.0.0.1:9081/v1
```
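The base URLs above assume OpenAI-compatible endpoints already running. A minimal deployment sketch with `vllm serve` is shown below; the model names are placeholders (substitute the actual reasoning and caption/video-QA checkpoints you use), and the ports simply match the CLI example.

```shell
# Hypothetical deployment sketch: serve the models as OpenAI-compatible
# endpoints with vLLM. Model names below are placeholders, not official
# checkpoint names; ports match the cli.py invocation above.
vllm serve <reasoning-model> --port 25600 &
vllm serve <caption-and-videoqa-model> --port 9081 &
```

Here the caption and video-QA roles share one endpoint (port 9081), as in the CLI example; they can also be served separately on distinct ports.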
## Citation
If you find LongVideo-R1 useful for your research, please cite:
```bibtex
@article{qiu2026longvideo,
  title={LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding},
  author={Qiu, Jihao and Xie, Lingxi and Huo, Xinyue and Tian, Qi and Ye, Qixiang},
  journal={arXiv preprint arXiv:2602.20913},
  year={2026}
}
```