Improve model card and add metadata

#1
by nielsr - opened
Files changed (1)
  1. README.md +71 -1
README.md CHANGED
@@ -1 +1,71 @@
- https://arxiv.org/abs/2602.20913
+ ---
+ pipeline_tag: video-text-to-text
+ library_name: transformers
+ tags:
+ - video-understanding
+ - long-video
+ - reasoning
+ - r1
+ - multimodal
+ ---
+
+ # LongVideo-R1-Qwen3
+
+ This repository contains the weights for **LongVideo-R1-Qwen3**, an active, reasoning-equipped multimodal large language model (MLLM) agent designed for efficient long video understanding.
+
+ This model was introduced in the paper [LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding](https://arxiv.org/abs/2602.20913), accepted at CVPR 2026.
+
+ ## Model Description
+
+ LongVideo-R1 addresses the challenge of understanding long videos under low computational budgets. Instead of an exhaustive search across all frames, the agent uses a reasoning module to navigate the video context, leveraging high-level visual cues to infer the most informative video clips.
+
+ - **Backbone**: Fine-tuned from **Qwen3-8B**.
+ - **Training Paradigm**: Two-stage approach involving Supervised Fine-Tuning (SFT) on 33K high-quality chain-of-thought-with-tool trajectories, followed by Reinforcement Learning (RL).
+ - **Architecture**: The agent initiates traversal from top-level visual summaries and iteratively refines its focus, halting once it has sufficient knowledge to answer the query (see the sketch after this list).
+
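+ To make the navigation loop concrete, here is a minimal, hypothetical sketch of the coarse-to-fine traversal described above. Every name in it (`caption`, `reason`, `split`, `Decision`) is an illustrative stand-in, not the released API; see the GitHub repository for the actual agent implementation.
+
+ ```python
+ # Hypothetical sketch of LongVideo-R1's coarse-to-fine navigation.
+ # All helpers below are stand-ins for the served caption/reasoning models.
+ from dataclasses import dataclass
+
+ @dataclass
+ class Decision:
+     action: str        # "answer" or "zoom"
+     answer: str = ""
+     target: int = 0    # index of the clip to inspect more closely
+
+ def caption(clip: str) -> str:
+     """Stand-in for the caption model: summarize one clip."""
+     return f"summary of {clip}"
+
+ def split(clip: str) -> list[str]:
+     """Stand-in for finer segmentation of a chosen clip."""
+     return [f"{clip}/part-a", f"{clip}/part-b"]
+
+ def reason(query: str, context: list[str]) -> Decision:
+     """Stand-in for the reasoning model: answer, or pick a clip to zoom into."""
+     return Decision(action="answer", answer="(model answer)")
+
+ def navigate(clips: list[str], query: str, max_rounds: int = 8) -> str:
+     context = [caption(c) for c in clips]      # start from top-level summaries
+     for _ in range(max_rounds):
+         decision = reason(query, context)
+         if decision.action == "answer":        # enough knowledge: halt early
+             return decision.answer
+         # Otherwise refine focus: summarize finer segments of the chosen clip.
+         context += [caption(c) for c in split(clips[decision.target])]
+     return reason(query, context).answer
+
+ print(navigate(["clip-0", "clip-1"], "What is the man doing?"))
+ ```
+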
+ ## Links
+ - **Paper**: [arXiv:2602.20913](https://arxiv.org/abs/2602.20913)
+ - **Code**: [GitHub - qiujihao19/LongVideo-R1](https://github.com/qiujihao19/LongVideo-R1)
+ - **Data**: [LongVideo-R1-Data](https://huggingface.co/datasets/ChurchillQAQ/LongVideo-R1-Data)
+
+ ## Usage
+
+ LongVideo-R1 can be deployed using `vLLM` for online testing, supporting tool use and multi-round reasoning.
+
+ ### 1. Deploy the reasoning model
+ ```bash
+ # Deploy the reasoning model
+ MODEL_PATH="ChurchillQAQ/LongVideo-R1-Qwen3"
+ PORT=25600
+
+ vllm serve $MODEL_PATH \
+     --tensor-parallel-size 1 \
+     --max-model-len 32768 \
+     --gpu-memory-utilization 0.85 \
+     --host 127.0.0.1 \
+     --port $PORT \
+     --served-model-name longvideor1
+ ```
+
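+ Once the server is up, you can sanity-check it through vLLM's OpenAI-compatible API. A minimal sketch, assuming the `openai` Python package (`"EMPTY"` is the usual placeholder key for a local vLLM server):
+
+ ```python
+ # Quick smoke test against the endpoint started above.
+ from openai import OpenAI
+
+ client = OpenAI(base_url="http://127.0.0.1:25600/v1", api_key="EMPTY")
+
+ response = client.chat.completions.create(
+     model="longvideor1",  # matches --served-model-name above
+     messages=[{"role": "user", "content": "Reply with OK if you are up."}],
+     max_tokens=16,
+ )
+ print(response.choices[0].message.content)
+ ```
+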
+ ### 2. Run Inference (CLI Demo)
+ Once the model is served (alongside the required caption and video-QA models as described in the [GitHub README](https://github.com/qiujihao19/LongVideo-R1)), you can use `cli.py`:
+
+ ```bash
+ python cli.py \
+     --video_path /path/to/video.mp4 \
+     --question "What is the man doing in this video?" \
+     --reasoning_base_url http://127.0.0.1:25600/v1 \
+     --caption_base_url http://127.0.0.1:9081/v1 \
+     --videoqa_base_url http://127.0.0.1:9081/v1
+ ```
+
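+ For batch evaluation, the same CLI can be driven from Python. This is just a convenience wrapper around the demo above, using only the flags shown there (paths and questions are placeholders):
+
+ ```python
+ # Hypothetical batch driver around cli.py; uses only the flags shown above.
+ import subprocess
+
+ jobs = [
+     ("/path/to/video1.mp4", "What is the man doing in this video?"),
+     ("/path/to/video2.mp4", "Where does the scene take place?"),
+ ]
+
+ for video_path, question in jobs:
+     result = subprocess.run(
+         [
+             "python", "cli.py",
+             "--video_path", video_path,
+             "--question", question,
+             "--reasoning_base_url", "http://127.0.0.1:25600/v1",
+             "--caption_base_url", "http://127.0.0.1:9081/v1",
+             "--videoqa_base_url", "http://127.0.0.1:9081/v1",
+         ],
+         capture_output=True,
+         text=True,
+     )
+     print(video_path, "->", result.stdout.strip())
+ ```
+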
+ ## Citation
+
+ ```bibtex
+ @article{qiu2026longvideo,
+   title={LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding},
+   author={Qiu, Jihao and Xie, Lingxi and Huo, Xinyue and Tian, Qi and Ye, Qixiang},
+   journal={arXiv preprint arXiv:2602.20913},
+   year={2026}
+ }
+ ```