---
base_model:
- Qwen/Qwen2.5-Omni-7B
pipeline_tag: video-text-to-text
license: bsd-2-clause
library_name: transformers
---
<div align="center">
# Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration
Hao Zhong<sup>\*</sup>,
[Muzhi Zhu](https://scholar.google.com/citations?user=064gBH4AAAAJ&hl=zh-CN&oi=ao)<sup>*</sup>,
Zongze Du<sup>\*</sup>,
Zheng Huang<sup></sup>,
[Canyu Zhao](https://github.com/volcverse)<sup></sup>,
[Mingyu Liu](https://mingyulau.github.io/)<sup></sup>,
[Wen Wang](https://github.com/encounter1997)<sup></sup>,
[Hao Chen](https://scholar.google.com/citations?user=FaOqRpcAAAAJ)<sup></sup>,
[Chunhua Shen](https://cshen.github.io)<sup></sup>
[Zhejiang University](https://www.zju.edu.cn/english/)
*Equal contribution
[📖 **Paper**](https://arxiv.org/abs/2505.20256) | [🌐 **Project Page**](https://aim-uofa.github.io/OmniR1/) | [🤖 **Model Weights (ModelScope)**](https://www.modelscope.cn/models/jxzh2020/Omni-R1) | [🤗 **Model Weights (Hugging Face)**](https://huggingface.co/Haoz0206/Omni-R1)
</div>
## 🖊️ Citation
If you find this work helpful for your research, please cite:
```BibTeX
@article{zhong2025omnir1reinforcementlearningomnimodal,
title={Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration},
author={Hao Zhong and Muzhi Zhu and Zongze Du and Zheng Huang and Canyu Zhao and Mingyu Liu and Wen Wang and Hao Chen and Chunhua Shen},
year={2025},
eprint={2505.20256},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.20256},
}
```
## 🔍 Overview
<div align="center">
<img width="800" alt="Omni-R1" src="assets/main_intro.png">
</div>
## 📄 Description
Video-audio reasoning and fine-grained pixel understanding impose conflicting requirements on
multimodal models: dense temporal coverage demands many low-resolution frames, whereas precise grounding
calls for high-resolution inputs. We tackle this trade-off with a <strong>two-system architecture</strong>:
a <strong>Global Reasoning System</strong> selects informative keyframes and rewrites the task at low spatial cost,
while a <strong>Detail Understanding System</strong> performs pixel-level grounding on the selected high-resolution snippets.
Because optimal keyframe selection and task reformulation are ambiguous and hard to supervise, we formulate them
as a reinforcement-learning (RL) problem and present <strong>Omni-R1</strong>, an end-to-end RL framework
built on Group Relative Policy Optimization (GRPO).
<strong>Omni-R1</strong> trains the Global Reasoning System with hierarchical rewards obtained through online
collaboration with the Detail Understanding System, requiring only one epoch of RL on small task splits.
Experiments on two challenging benchmarks, Referring Audio-Visual Segmentation (RefAVS) and Reasoning Video
Object Segmentation (ReVOS), show that Omni-R1 not only surpasses strong supervised baselines but also
outperforms specialized state-of-the-art models, while substantially improving out-of-domain generalization
and mitigating multimodal hallucination.
Our results demonstrate the first successful application of RL to large-scale omnimodal reasoning and
highlight a scalable path toward universal foundation models.
## 🚩 Plan
<!-- - [ ] Release the weights. -->
- [X] Release model weights and demo.
- [ ] Release the segmentation and evaluation code.
- [ ] Release the training scripts.
<!-- --- -->
## 🛠️ Getting Started
### 🔧 Set up Environment
```bash
git clone https://github.com/aim-uofa/Omni-R1
cd Omni-R1
# build environment
conda create -n omni python=3.10
conda activate omni
# install packages
pip install -r requirements.txt
pip install -e src/qwen-omni-utils[decord]
pip install flash-attn --no-build-isolation
pip install transformers/transformers_omni.zip
# replace transformers Qwen2.5-Omni .py file
bash replace_omni.sh
```
This project also supports `uv`. If you prefer it:
```bash
uv sync --no-build-isolation-package flash-attn
source .venv/bin/activate
# replace transformers Qwen2.5-Omni .py file
bash replace_omni.sh
```
### 📦 Download Datasets
Download and extract the datasets you need, then prepare `src/r1-v/datasets.json` following the format of `src/r1-v/datasets_demo.json`.
- The ReVOS and MeVIS datasets are taken directly from the [Sa2VA](https://github.com/magic-research/Sa2VA) training data, which can be downloaded [here](https://huggingface.co/datasets/Dense-World/Sa2VA-Training). Please refer to Sa2VA for usage.
- refCOCOg_2k_840 from [SegZero](https://github.com/dvlab-research/Seg-Zero) can be downloaded [here](https://huggingface.co/datasets/Ricky06662/refCOCOg_2k_840).
- [RefAVS](https://gewu-lab.github.io/Ref-AVS/) can be downloaded [here](https://zenodo.org/records/12970978/files/RefAVSBench.tar.gz?download=1).
### 🏋️ Training
```bash
# for uv: source .venv/bin/activate
conda activate omni
# Start the SAM server first. It is only required when training VOS with alpha_g > 0.0.
bash src/scripts/run_sam_server.sh
# Start training. By default this script does not require a SAM server.
bash src/scripts/omni_r1_run_training.sh
```
To connect to an existing SAM server, set `SAM_HOST` and `SAM_PORT` as environment variables in `src/scripts/omni_r1_run_training.sh`.
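As a minimal sketch, assuming the training script picks up exported environment variables, you can point it at an already-running server like this (the host and port values below are hypothetical placeholders):

```bash
# Hypothetical values: point these at your running SAM server
export SAM_HOST=127.0.0.1
export SAM_PORT=8000
```

Then launch `bash src/scripts/omni_r1_run_training.sh` as usual.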
## 🚀 Inference
Full inference and evaluation code is coming soon. A minimal inference example:
```python
import torch
from transformers import (
    Qwen2_5OmniProcessor,
    GenerationConfig,
    Qwen2_5OmniThinkerForConditionalGeneration,
)
from qwen_omni_utils import process_mm_info

omni_path = "/path/to/Omni-R1"

# Omni-R1 is a Qwen2_5OmniThinker, not a Qwen2_5OmniModel, so the inference code
# differs from the official Qwen examples.
model = Qwen2_5OmniThinkerForConditionalGeneration.from_pretrained(
    omni_path,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).eval()
processor = Qwen2_5OmniProcessor.from_pretrained(omni_path)
generation_config = GenerationConfig(
    use_cache=True, max_new_tokens=1024, do_sample=False
)


def inference(video_path, prompt, sys_prompt):
    messages = [
        {"role": "system", "content": [{"type": "text", "text": sys_prompt}]},
        {
            "role": "user",
            "content": [
                {"type": "video", "video": video_path},
                {"type": "text", "text": prompt},
            ],
        },
    ]
    text_input = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    audio_input, image_input, video_input, process_args = process_mm_info(
        messages, use_audio_in_video=False
    )
    inputs = processor(
        text=text_input,
        images=image_input,
        audios=audio_input,
        videos=video_input,
        return_tensors="pt",
        do_resize=True,
    )
    inputs = inputs.to(model.device)
    # Generate the output
    with torch.inference_mode():
        generated_ids = model.generate(**inputs, generation_config=generation_config)
    prompt_length = inputs["input_ids"].size(1)
    completion_ids = generated_ids[:, prompt_length:]
    # Decode the generated completions
    text = processor.batch_decode(completion_ids, skip_special_tokens=True)
    return text


video_path = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/shopping.mp4"
prompt = "How many kinds of drinks can you see in the video?"

# Run inference with the local model.
response = inference(
    video_path, prompt=prompt, sys_prompt="You are a helpful assistant."
)
print(response[0])
```
## 📜 License
For academic usage, this project is licensed under [the 2-clause BSD License](LICENSE). For commercial inquiries, please contact [Chunhua Shen](mailto:chhshen@gmail.com).
## 🙏 Acknowledgements
We sincerely appreciate the contributions of the open-source community. The related projects are as follows: [Sa2VA](https://github.com/magic-research/Sa2VA), [Video-R1](https://github.com/tulerfeng/Video-R1), [R1-V](https://github.com/Deep-Agent/R1-V), [DeepSeek-R1](https://github.com/deepseek-ai/DeepSeek-R1).