LongVideo-R1-Qwen3 / README.md
nielsr's picture
nielsr HF Staff
Improve model card and add metadata
8c1fa46 verified
|
raw
history blame
2.7 kB
metadata
pipeline_tag: video-text-to-text
library_name: transformers
tags:
  - video-understanding
  - long-video
  - reasoning
  - r1
  - multimodal

LongVideo-R1-Qwen3

This repository contains the weights for LongVideo-R1-Qwen3, an active, reasoning-equipped multimodal large language model (MLLM) agent designed for efficient long video understanding.

This model was introduced in the paper LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding, accepted at CVPR 2026.

Model Description

LongVideo-R1 addresses the challenge of understanding long videos under low computational budgets. Instead of an exhaustive search across all frames, the agent uses a reasoning module to navigate video context, leveraging high-level visual cues to infer the most informative video clips.

  • Backbone: Fine-tuned from Qwen-3-8B.
  • Training Paradigm: Two-stage approach involving Supervised Fine-Tuning (SFT) on 33K high-quality chain-of-thought-with-tool trajectories followed by Reinforcement Learning (RL).
  • Architecture: The agent initiates traversal from top-level visual summaries and iteratively refines its focus, halting once it has sufficient knowledge to answer the query.

Links

Usage

LongVideo-R1 can be deployed using vLLM for online testing, supporting tool use and multi-round reasoning.

1. Deploy the reasoning model

# Deploy the reasoning model
MODEL_PATH="ChurchillQAQ/LongVideo-R1-Qwen3"
PORT=25600

vllm serve $MODEL_PATH \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85 \
  --host 127.0.0.1 \
  --port $PORT \
  --served-model-name longvideor1 

2. Run Inference (CLI Demo)

Once the model is served (alongside the required caption and video-QA models as described in the GitHub README), you can use cli.py:

python cli.py \
  --video_path /path/to/video.mp4 \
  --question "What is the man doing in this video?" \
  --reasoning_base_url http://127.0.0.1:25600/v1 \
  --caption_base_url http://127.0.0.1:9081/v1 \
  --videoqa_base_url http://127.0.0.1:9081/v1

Citation

@article{qiu2026longvideo,
  title={LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding},
  author={Qiu, Jihao and Xie, Lingxi and Huo, Xinyue and Tian, Qi and Ye, Qixiang},
  journal={arXiv preprint arXiv:2602.20913},
  year={2026}
}