---
library_name: transformers
tags:
  - multimodal
  - reasoning
  - sft
  - rl
datasets:
  - multimodal-reasoning-lab/Zebra-CoT
  - ModalityDance/Omni-Bench
base_model:
  - GAIR/Anole-7b-v0.1
---

# Omni-R1

[Paper](https://arxiv.org/abs/2601.09536) · [Code](https://github.com/ModalityDance/Omni-R1) · [Omni-Bench](https://huggingface.co/datasets/ModalityDance/Omni-Bench)

## Overview

Omni-R1 is trained with multimodal interleaved supervision: it first applies PeSFT for stable functional image generation, then PeRPO for reinforcement-learning refinement on unified tasks, enabling interleaved multimodal reasoning trajectories.

## Usage

```python
import torch
from PIL import Image
from transformers import ChameleonProcessor, ChameleonForConditionalGeneration

# 1) Load the processor and model
model_id = "ModalityDance/Omni-R1"  # or a local checkpoint path
processor = ChameleonProcessor.from_pretrained(model_id)
model = ChameleonForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

# 2) Prepare a single input (prompt text plus one image)
prompt = "What is the smiling man in the image wearing? <image>"
image = Image.open("image.png").convert("RGB")

inputs = processor(
    prompt,
    images=[image],
    padding=False,
    return_for_text_completion=True,
    return_tensors="pt",
).to(model.device)

# 3) Generate; "unrestricted" mode lets the model interleave text and image tokens
outputs = model.generate(
    **inputs,
    max_length=4096,
    do_sample=True,
    temperature=0.5,
    top_p=0.9,
    pad_token_id=1,
    multimodal_generation_mode="unrestricted",
)

# 4) Decode the full token sequence (special tokens kept so interleaved output is visible)
text = processor.batch_decode(outputs, skip_special_tokens=False)[0]
print(text)
```

For full scripts (batch JSONL inference, interleaved decoding, and vLLM-based evaluation), please refer to the official GitHub repository:
https://github.com/ModalityDance/Omni-R1
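
As a rough illustration of what batch JSONL inference can look like, the sketch below loops the single-example pipeline above over a file of prompts. It is an assumption-based sketch, not the official script: the `inputs.jsonl` / `outputs.jsonl` filenames and the per-line `prompt` / `image` fields are hypothetical, and `processor` and `model` are the objects loaded in the snippet above.

```python
import json
from PIL import Image

# Minimal, hypothetical batch-inference loop (sketch only; see the official repo scripts).
# Assumed per-line schema:
#   {"prompt": "Describe the scene. <image>", "image": "path/to/img.png"}
with open("inputs.jsonl") as f, open("outputs.jsonl", "w") as out:
    for line in f:
        example = json.loads(line)
        image = Image.open(example["image"]).convert("RGB")
        inputs = processor(
            example["prompt"],
            images=[image],
            return_for_text_completion=True,
            return_tensors="pt",
        ).to(model.device)
        output_ids = model.generate(
            **inputs,
            max_length=4096,
            do_sample=True,
            temperature=0.5,
            top_p=0.9,
            pad_token_id=1,
            multimodal_generation_mode="unrestricted",
        )
        text = processor.batch_decode(output_ids, skip_special_tokens=False)[0]
        out.write(json.dumps({"prompt": example["prompt"], "output": text}) + "\n")
```

For interleaved decoding of generated image tokens and for vLLM-based evaluation, use the scripts in the repository above.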

## License

This project is licensed under the MIT License.
It also complies with the licenses of referenced third-party projects and dependencies, including the Chameleon Research License.

## Citation

```bibtex
@misc{cheng2026omnir1unifiedgenerativeparadigm,
      title={Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning},
      author={Dongjie Cheng and Yongqi Li and Zhixin Ma and Hongru Cai and Yupeng Hu and Wenjie Wang and Liqiang Nie and Wenjie Li},
      year={2026},
      eprint={2601.09536},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2601.09536},
}
```