nielsr HF Staff commited on
Commit
8c1fa46
·
verified ·
1 Parent(s): 1f7abbe

Improve model card and add metadata

Browse files

Hi! I'm Niels from the community science team at Hugging Face.

I've opened this PR to improve the model card for LongVideo-R1-Qwen3. I've added YAML metadata including the `pipeline_tag` and `library_name` to help users discover the model, and structured the README to include information about the paper, the GitHub repository, and usage instructions derived from the official implementation.

Files changed (1) hide show
  1. README.md +71 -1
README.md CHANGED
@@ -1 +1,71 @@
1
- https://arxiv.org/abs/2602.20913
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ pipeline_tag: video-text-to-text
3
+ library_name: transformers
4
+ tags:
5
+ - video-understanding
6
+ - long-video
7
+ - reasoning
8
+ - r1
9
+ - multimodal
10
+ ---
11
+
12
+ # LongVideo-R1-Qwen3
13
+
14
+ This repository contains the weights for **LongVideo-R1-Qwen3**, an active, reasoning-equipped multimodal large language model (MLLM) agent designed for efficient long video understanding.
15
+
16
+ This model was introduced in the paper [LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding](https://arxiv.org/abs/2602.20913), accepted at CVPR 2026.
17
+
18
+ ## Model Description
19
+
20
+ LongVideo-R1 addresses the challenge of understanding long videos under low computational budgets. Instead of an exhaustive search across all frames, the agent uses a reasoning module to navigate video context, leveraging high-level visual cues to infer the most informative video clips.
21
+
22
+ - **Backbone**: Fine-tuned from **Qwen-3-8B**.
23
+ - **Training Paradigm**: Two-stage approach involving Supervised Fine-Tuning (SFT) on 33K high-quality chain-of-thought-with-tool trajectories followed by Reinforcement Learning (RL).
24
+ - **Architecture**: The agent initiates traversal from top-level visual summaries and iteratively refines its focus, halting once it has sufficient knowledge to answer the query.
25
+
26
+ ## Links
27
+ - **Paper**: [arXiv:2602.20913](https://arxiv.org/abs/2602.20913)
28
+ - **Code**: [GitHub - qiujihao19/LongVideo-R1](https://github.com/qiujihao19/LongVideo-R1)
29
+ - **Data**: [LongVideo-R1-Data](https://huggingface.co/datasets/ChurchillQAQ/LongVideo-R1-Data)
30
+
31
+ ## Usage
32
+
33
+ LongVideo-R1 can be deployed using `vLLM` for online testing, supporting tool use and multi-round reasoning.
34
+
35
+ ### 1. Deploy the reasoning model
36
+ ```bash
37
+ # Deploy the reasoning model
38
+ MODEL_PATH="ChurchillQAQ/LongVideo-R1-Qwen3"
39
+ PORT=25600
40
+
41
+ vllm serve $MODEL_PATH \
42
+ --tensor-parallel-size 1 \
43
+ --max-model-len 32768 \
44
+ --gpu-memory-utilization 0.85 \
45
+ --host 127.0.0.1 \
46
+ --port $PORT \
47
+ --served-model-name longvideor1
48
+ ```
49
+
50
+ ### 2. Run Inference (CLI Demo)
51
+ Once the model is served (alongside the required caption and video-QA models as described in the [GitHub README](https://github.com/qiujihao19/LongVideo-R1)), you can use `cli.py`:
52
+
53
+ ```bash
54
+ python cli.py \
55
+ --video_path /path/to/video.mp4 \
56
+ --question "What is the man doing in this video?" \
57
+ --reasoning_base_url http://127.0.0.1:25600/v1 \
58
+ --caption_base_url http://127.0.0.1:9081/v1 \
59
+ --videoqa_base_url http://127.0.0.1:9081/v1
60
+ ```
61
+
62
+ ## Citation
63
+
64
+ ```bibtex
65
+ @article{qiu2026longvideo,
66
+ title={LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding},
67
+ author={Qiu, Jihao and Xie, Lingxi and Huo, Xinyue and Tian, Qi and Ye, Qixiang},
68
+ journal={arXiv preprint arXiv:2602.20913},
69
+ year={2026}
70
+ }
71
+ ```