# AR-Omni-Pretrain

## Overview

AR-Omni-Pretrain is a single-decoder, single-token-stream autoregressive any-to-any model that supports text, images, and speech in a unified next-token prediction framework, without expert decoders.
This checkpoint serves as the base pretrained model for downstream instruction tuning and for chat-style interleaved any-to-any conversation.
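To illustrate what "single-token-stream" means in practice, here is a toy sketch of next-token prediction over one shared multimodal vocabulary. The token-range layout, special-token ids, and the stand-in model below are illustrative assumptions only, not AR-Omni's actual tokenizer or decoder.

```python
# Toy sketch: one decoder, one token stream, one shared vocabulary.
# The id ranges below are ASSUMPTIONS for illustration; AR-Omni's real
# vocabulary layout may differ.
import random

TEXT_RANGE = range(0, 1000)       # hypothetical text token ids
IMAGE_RANGE = range(1000, 2000)   # hypothetical image token ids
SPEECH_RANGE = range(2000, 3000)  # hypothetical speech token ids
BOS, EOS = 3000, 3001             # hypothetical special tokens

def dummy_model(prefix):
    """Stand-in for the decoder: returns the next token id.

    A real model would produce logits over the whole shared vocabulary;
    here we simply emit speech tokens until a fixed length, then EOS.
    """
    if len(prefix) >= 8:
        return EOS
    return random.choice(SPEECH_RANGE)

def generate(prompt_tokens, max_len=16):
    # Plain autoregressive loop: every modality flows through the same
    # next-token prediction step, with no expert decoders.
    seq = list(prompt_tokens)
    while len(seq) < max_len:
        nxt = dummy_model(seq)
        seq.append(nxt)
        if nxt == EOS:
            break
    return seq

# A "text" prompt continued with "speech" tokens in the same stream.
seq = generate([BOS, 5, 42, 7])
print(seq)
```

The point of the sketch is only the control flow: a TTS-style request is just a prompt of text tokens whose continuation happens to land in the speech id range.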
## Usage

The recommended way to run AR-Omni v0.1 is via the official repository, which ships vendored, editable copies of `transformers` and `accelerate`.
For full setup details, follow: https://github.com/ModalityDance/AR-Omni
## Inference

### (1) TTS

```shell
python inference/inference_pretrain.py \
  --ckpt_path /path/to/AR-Omni-Pretrain-v0.1 \
  --tokenizer_path /path/to/AR-Omni-Pretrain-v0.1 \
  --out_dir ./outputs/tts \
  --device 0 \
  tts \
  --text "Good afternoon! How are you today?" \
  --instruction "Convert this text into speech." \
  --wavtokenizer_root /path/to/WavTokenizer \
  --wavtokenizer_config /path/to/wavtokenizer.yaml \
  --wavtokenizer_ckpt /path/to/wavtokenizer.ckpt \
  --max_gen_len 1024 \
  --out_name tts.wav
```
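If you want to drive the TTS command from Python, for example to batch-synthesize several sentences, a minimal sketch follows. All `/path/to/...` values are placeholders you must fill in; the flag names mirror the command above.

```python
# Sketch: build the TTS command line programmatically and (optionally)
# run it with subprocess. Paths are placeholders, not real locations.
import subprocess

def build_tts_cmd(text, out_name, ckpt="/path/to/AR-Omni-Pretrain-v0.1"):
    return [
        "python", "inference/inference_pretrain.py",
        "--ckpt_path", ckpt,
        "--tokenizer_path", ckpt,
        "--out_dir", "./outputs/tts",
        "--device", "0",
        "tts",
        "--text", text,
        "--instruction", "Convert this text into speech.",
        "--wavtokenizer_root", "/path/to/WavTokenizer",
        "--wavtokenizer_config", "/path/to/wavtokenizer.yaml",
        "--wavtokenizer_ckpt", "/path/to/wavtokenizer.ckpt",
        "--max_gen_len", "1024",
        "--out_name", out_name,
    ]

cmds = [build_tts_cmd(t, f"tts_{i}.wav")
        for i, t in enumerate(["Hello there.", "See you soon."])]
# Uncomment to actually run (requires the repo and checkpoints):
# for cmd in cmds:
#     subprocess.run(cmd, check=True)
print(len(cmds))
```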
### (2) ASR

```shell
python inference/inference_pretrain.py \
  --ckpt_path /path/to/AR-Omni-Pretrain-v0.1 \
  --tokenizer_path /path/to/AR-Omni-Pretrain-v0.1 \
  --out_dir ./outputs/asr \
  --device 0 \
  asr \
  --audio_path inference/ref.wav \
  --wavtokenizer_root /path/to/WavTokenizer \
  --wavtokenizer_config /path/to/wavtokenizer.yaml \
  --wavtokenizer_ckpt /path/to/wavtokenizer.ckpt \
  --instruction "Can you please convert this speech into written text?" \
  --max_seq_len 512
```
### (3) Image Captioning

```shell
python inference/inference_pretrain.py \
  --ckpt_path /path/to/AR-Omni-Pretrain-v0.1 \
  --tokenizer_path /path/to/AR-Omni-Pretrain-v0.1 \
  --out_dir ./outputs/caption \
  --device 0 \
  caption \
  --image_path inference/demo_test.jpg \
  --instruction "Describe this image in detail." \
  --max_gen_len 256
```
### (4) Text-to-Image (T2I)

```shell
python inference/inference_pretrain.py \
  --ckpt_path /path/to/AR-Omni-Pretrain-v0.1 \
  --tokenizer_path /path/to/AR-Omni-Pretrain-v0.1 \
  --out_dir ./outputs/t2i \
  --device 0 \
  t2i \
  --text "a bunch of ripe strawberries on a plate" \
  --temp 1.0 \
  --guidance_scale_image 1.32 \
  --out_name t2i_test.png
```
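Since `--guidance_scale_image` noticeably affects T2I output, it can be useful to sweep it and compare the resulting images. A small sketch, again with placeholder paths and flag names mirroring the command above:

```python
# Sketch: build T2I commands for several guidance scales to compare
# outputs side by side. Paths are placeholders, not real locations.
def build_t2i_cmd(prompt, guidance, out_name,
                  ckpt="/path/to/AR-Omni-Pretrain-v0.1"):
    return [
        "python", "inference/inference_pretrain.py",
        "--ckpt_path", ckpt,
        "--tokenizer_path", ckpt,
        "--out_dir", "./outputs/t2i",
        "--device", "0",
        "t2i",
        "--text", prompt,
        "--temp", "1.0",
        "--guidance_scale_image", str(guidance),
        "--out_name", out_name,
    ]

scales = [1.0, 1.32, 2.0]
cmds = [build_t2i_cmd("a bunch of ripe strawberries on a plate",
                      g, f"t2i_g{g}.png")
        for g in scales]
print(len(cmds))
```

Run each command (e.g. via `subprocess.run`) and inspect `./outputs/t2i` to pick a scale that suits your prompt.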
## License

This project is licensed under the MIT License.
It also complies with the licenses of referenced third-party projects and dependencies, including the Chameleon Research License.
## Citation

```bibtex
@misc{cheng2026aromniunifiedautoregressivemodel,
  title={AR-Omni: A Unified Autoregressive Model for Any-to-Any Generation},
  author={Dongjie Cheng and Ruifeng Yuan and Yongqi Li and Runyang You and Wenjie Wang and Liqiang Nie and Lei Zhang and Wenjie Li},
  year={2026},
  eprint={2601.17761},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2601.17761},
}
```