AaronHuangWei committed (verified) · Commit 2e62237 · 1 parent: dec1ba9

Update README.md

Files changed (1): README.md (+152 −10)
<p align="center" width="100%">
<img src="https://raw.githubusercontent.com/NVlabs/Long-RL/main/assets/long-rl-logo.png" alt="Long-RL logo" style="width: 100%; min-width: 300px; display: block; margin: auto;">
</p>

# Long-RL: Scaling RL to Long Sequences (LongVideo-Reason Dataset)

[![Paper](https://img.shields.io/badge/Paper-arXiv%20Link-green)](https://arxiv.org/abs/2507.07966)
[![Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-yellow.svg)](https://github.com/NVlabs/Long-RL/blob/main/LICENSE)

<div align="center">

[![Watch the video](https://raw.githubusercontent.com/NVlabs/Long-RL/main/assets/demo_video_first_frame.png)](https://www.youtube.com/watch?v=ykbblK2jiEg)

</div>

## Data Distribution

<p align="center" width="100%">
<img src="https://raw.githubusercontent.com/NVlabs/Long-RL/main/assets/data_distribution.png" alt="LongVideo-Reason data distribution" style="width: 100%; min-width: 300px; display: block; margin: auto;">
</p>

We strategically construct a high-quality dataset with CoT annotations for long-video reasoning, named LongVideo-Reason. Leveraging a powerful VLM (NVILA-8B) and a leading open-source reasoning LLM, we build a dataset of 52K high-quality Question-Reasoning-Answer pairs for long videos. We use 18K high-quality samples for Long-CoT-SFT to initialize the model's reasoning and instruction-following abilities, and 33K samples, together with an additional 110K videos, for reinforcement learning. This two-stage training combines high-quality reasoning annotations with reinforcement learning, enabling LongVILA-R1 to achieve superior and generalized video reasoning. We also manually curate a balanced set of 1K long-video samples to build a new benchmark, LongVideo-Reason-eval, which evaluates performance from four perspectives (Temporal, Goal and Purpose, Spatial, and Plot and Narrative) for a comprehensive assessment.
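For quick reference, the data plan described above can be summarized in code (numbers in thousands, taken directly from the paragraph; the dict layout itself is only an illustration):

```python
# Sizes from the LongVideo-Reason data plan described above, in thousands.
DATA_PLAN = {
    "total_qra_pairs": 52,   # high-quality Question-Reasoning-Answer pairs
    "long_cot_sft": 18,      # stage 1: Long-CoT-SFT initialization
    "rl_samples": 33,        # stage 2: reinforcement learning samples
    "extra_rl_videos": 110,  # additional video data used for RL
    "eval_benchmark": 1,     # LongVideo-Reason-eval benchmark
}

# The SFT and RL subsets are drawn from the 52K pool.
assert DATA_PLAN["long_cot_sft"] + DATA_PLAN["rl_samples"] <= DATA_PLAN["total_qra_pairs"]
```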

**LongVideo-Reason (Train, 52K) [[Data Link](https://huggingface.co/datasets/LongVideo-Reason/longvideo-reason)]**

**LongVideo-Reason-eval (Test, 1K) [[Data Link](https://huggingface.co/datasets/LongVideo-Reason/longvideo_eval_videos)]**

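The training set can be pulled with the Hugging Face `datasets` library; a minimal sketch (the split, config, and field names are assumptions, so check the dataset card before relying on this):

```python
def load_longvideo_reason(split: str = "train"):
    """Load the LongVideo-Reason QRA pairs from the Hugging Face Hub.

    Sketch only: the exact config/split/field names may differ;
    see the dataset card for the authoritative layout.
    """
    from datasets import load_dataset  # pip install datasets
    return load_dataset("LongVideo-Reason/longvideo-reason", split=split)
```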

## Installation

```bash
git clone https://github.com/NVlabs/Long-RL.git
cd Long-RL
pip install -e .
```

If you want to train Qwen-Omni models, please run
```bash
bash vllm_replace.sh
```

## Training
### Single node
For single-node jobs (within 8 GPUs), you can refer to the training scripts in the `examples` directory. For example:
```bash
bash examples/new_supports/qwen2_5_vl_3b_video_grpo.sh $VIDEO_PATH
```

### Multi-node
For jobs that require multiple nodes, you can refer to the approach described in the EasyR1 repo, [here](https://github.com/hiyouga/EasyR1/tree/main?tab=readme-ov-file#how-to-run-70b-model-in-multi-node-environment).

We also provide example `sbatch` scripts such as the one below, where `TRAIN_SCRIPT` is the single-node training script and `NNODES` is the number of nodes required:
```bash
bash scripts/srun_multi_nodes.sh $TRAIN_SCRIPT $NNODES
```

For example:
```bash
bash scripts/srun_multi_nodes.sh examples/new_supports/qwen2_5_vl_3b_video_grpo.sh 2
```

### Merge Checkpoint in Hugging Face Format
This follows the approach used in the EasyR1 repo.
```bash
python3 scripts/model_merger.py --local_dir checkpoints/easy_r1/exp_name/global_step_1/actor
```

## Evaluation
We provide instructions for evaluating models on our `LongVideo-Reason` benchmark in the `eval` [directory](https://github.com/NVlabs/Long-RL/tree/main/eval).
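Once per-question results are available, a per-perspective breakdown like the one the benchmark reports can be tallied with a small helper (the result format here is illustrative, not the benchmark's actual output schema):

```python
from collections import defaultdict

# The four evaluation perspectives named above.
CATEGORIES = ["Temporal", "Goal and Purpose", "Spatial", "Plot and Narrative"]

def per_category_accuracy(results):
    """Tally accuracy per category from (category, is_correct) pairs."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, is_correct in results:
        total[category] += 1
        correct[category] += int(is_correct)
    return {c: correct[c] / total[c] for c in total}

# Toy input: two Temporal questions (one right), one Spatial question (right).
acc = per_category_accuracy([
    ("Temporal", True), ("Temporal", False), ("Spatial", True),
])
```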

## Examples
<div align="center">
<a href="https://drive.google.com/file/d/1QJ-ZsDrmYS8v1XU4eWfYu5oHuXeyGSdK/view?usp=share_link">Football Video</a>
</div>
<p align="center" width="100%">
<img src="https://raw.githubusercontent.com/NVlabs/Long-RL/main/assets/example-football.png" alt="Football example" style="width: 100%; min-width: 300px; display: block; margin: auto;">
</p>

<div align="center">
<a href="https://drive.google.com/file/d/1U0N563a2s24o_NDie1VfWauxFuSu31wC/view?usp=share_link">Texas Hold’em Video</a>
</div>
<p align="center" width="100%">
<img src="https://raw.githubusercontent.com/NVlabs/Long-RL/main/assets/example-TexasHold.png" alt="Texas Hold'em example" style="width: 100%; min-width: 300px; display: block; margin: auto;">
</p>

<div align="center">
<a href="https://drive.google.com/file/d/1rnF4I6-EBpqhzA0SnwyajpxbAhMezDCn/view?usp=share_link">StarCraft II Video</a>
</div>
<p align="center" width="100%">
<img src="https://raw.githubusercontent.com/NVlabs/Long-RL/main/assets/example-starcraft2.png" alt="StarCraft II example" style="width: 100%; min-width: 300px; display: block; margin: auto;">
</p>

<div align="center">
<a href="https://drive.google.com/file/d/1lo1E_bXXnMmWnFRudaSUgxMNxetEDHP9/view?usp=share_link">Moving Cup Video</a>
</div>
<p align="center" width="100%">
<img src="https://raw.githubusercontent.com/NVlabs/Long-RL/main/assets/example-movingcup.png" alt="Moving cup example" style="width: 100%; min-width: 300px; display: block; margin: auto;">
</p>

## How to contribute
- Make sure you have git installed.
- Create your own [fork](https://github.com/NVlabs/Long-RL/fork) of the project.
- Clone the repository to your local machine with `git clone`, using the URL of your fork.
- Read the `Installation` section above.
- Commit and push your changes.
- Open a pull request when you have finished modifying the project.

## Core Contributors
[Yukang Chen](https://yukangchen.com/), [Wei Huang](https://aaron-weihuang.com/), [Shuai Yang](https://andysonys.github.io), [Qinghao Hu](https://tonyhao.xyz/), [Baifeng Shi](https://bfshi.github.io/), [Hanrong Ye](https://sites.google.com/site/yhrspace/home), [Ligeng Zhu](https://lzhu.me/).

We welcome all possible contributions and will acknowledge all contributors clearly.

## Citation
Please consider citing our paper and this framework if they are helpful in your research.

```bibtex
@misc{long-rl,
  title        = {Long-RL: Scaling RL to Long Sequences},
  author       = {Yukang Chen and Wei Huang and Shuai Yang and Qinghao Hu and Baifeng Shi and Hanrong Ye and Ligeng Zhu and Zhijian Liu and Pavlo Molchanov and Jan Kautz and Xiaojuan Qi and Sifei Liu and Hongxu Yin and Yao Lu and Song Han},
  year         = {2025},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/NVlabs/Long-RL}},
}
```
```bibtex
@article{chen2025longvila-r1,
  title={Scaling RL to Long Videos},
  author={Yukang Chen and Wei Huang and Baifeng Shi and Qinghao Hu and Hanrong Ye and Ligeng Zhu and Zhijian Liu and Pavlo Molchanov and Jan Kautz and Xiaojuan Qi and Sifei Liu and Hongxu Yin and Yao Lu and Song Han},
  year={2025},
  eprint={2507.07966},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```
```bibtex
@inproceedings{chen2024longvila,
  title={LongVILA: Scaling Long-Context Visual Language Models for Long Videos},
  author={Yukang Chen and Fuzhao Xue and Dacheng Li and Qinghao Hu and Ligeng Zhu and Xiuyu Li and Yunhao Fang and Haotian Tang and Shang Yang and Zhijian Liu and Ethan He and Hongxu Yin and Pavlo Molchanov and Jan Kautz and Linxi Fan and Yuke Zhu and Yao Lu and Song Han},
  booktitle={The International Conference on Learning Representations (ICLR)},
  year={2025},
}
```

## Acknowledgement
- [EasyR1](https://github.com/hiyouga/EasyR1): the codebase we built upon. Thanks for their wonderful work.
- [verl](https://github.com/volcengine/verl): the RL training framework we built upon.
- [vllm](https://github.com/vllm-project/vllm): we build on vLLM for the rollout engine.
- [Flow-GRPO](https://github.com/yifan123/flow_grpo): we refer to Flow-GRPO for the image/video generation RL part.