# AR-Omni-Pretrain

## Overview

AR-Omni-Pretrain is a single-decoder, single-token-stream autoregressive any-to-any model that supports text, images, and speech in a unified next-token prediction framework, without expert decoders.
This checkpoint serves as the base pretrained model for downstream instruction tuning and for chat-style interleaved any-to-any conversation.
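To illustrate what "single-token-stream" means in practice, here is a toy sketch of next-token prediction over one shared multimodal vocabulary. The token-range layout, special-token ids, and the stand-in model below are illustrative assumptions only, not AR-Omni's actual tokenizer or decoder.

```python
# Toy sketch: one decoder, one token stream, one shared vocabulary.
# The id ranges below are ASSUMPTIONS for illustration; AR-Omni's real
# vocabulary layout may differ.
import random

TEXT_RANGE = range(0, 1000)       # hypothetical text token ids
IMAGE_RANGE = range(1000, 2000)   # hypothetical image token ids
SPEECH_RANGE = range(2000, 3000)  # hypothetical speech token ids
BOS, EOS = 3000, 3001             # hypothetical special tokens

def dummy_model(prefix):
    """Stand-in for the decoder: returns the next token id.

    A real model would produce logits over the whole shared vocabulary;
    here we simply emit speech tokens until a fixed length, then EOS.
    """
    if len(prefix) >= 8:
        return EOS
    return random.choice(SPEECH_RANGE)

def generate(prompt_tokens, max_len=16):
    # Plain autoregressive loop: every modality flows through the same
    # next-token prediction step, with no expert decoders.
    seq = list(prompt_tokens)
    while len(seq) < max_len:
        nxt = dummy_model(seq)
        seq.append(nxt)
        if nxt == EOS:
            break
    return seq

# A "text" prompt continued with "speech" tokens in the same stream.
seq = generate([BOS, 5, 42, 7])
print(seq)
```

The point of the sketch is only the control flow: a TTS-style request is just a prompt of text tokens whose continuation happens to land in the speech id range.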
## Usage

The recommended way to run AR-Omni v0.1 is via the official repository, which ships vendored, editable copies of `transformers` and `accelerate`.
For full setup details, follow: https://github.com/ModalityDance/AR-Omni
## Inference

### (1) TTS

```shell
python inference/inference_pretrain.py \
  --ckpt_path /path/to/AR-Omni-Pretrain-v0.1 \
  --tokenizer_path /path/to/AR-Omni-Pretrain-v0.1 \
  --out_dir ./outputs/tts \
  --device 0 \
  tts \
  --text "Good afternoon! How are you today?" \
  --instruction "Convert this text into speech." \
  --wavtokenizer_root /path/to/WavTokenizer \
  --wavtokenizer_config /path/to/wavtokenizer.yaml \
  --wavtokenizer_ckpt /path/to/wavtokenizer.ckpt \
  --max_gen_len 1024 \
  --out_name tts.wav
```
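If you want to drive the TTS command from Python, for example to batch-synthesize several sentences, a minimal sketch follows. All `/path/to/...` values are placeholders you must fill in; the flag names mirror the command above.

```python
# Sketch: build the TTS command line programmatically and (optionally)
# run it with subprocess. Paths are placeholders, not real locations.
import subprocess

def build_tts_cmd(text, out_name, ckpt="/path/to/AR-Omni-Pretrain-v0.1"):
    return [
        "python", "inference/inference_pretrain.py",
        "--ckpt_path", ckpt,
        "--tokenizer_path", ckpt,
        "--out_dir", "./outputs/tts",
        "--device", "0",
        "tts",
        "--text", text,
        "--instruction", "Convert this text into speech.",
        "--wavtokenizer_root", "/path/to/WavTokenizer",
        "--wavtokenizer_config", "/path/to/wavtokenizer.yaml",
        "--wavtokenizer_ckpt", "/path/to/wavtokenizer.ckpt",
        "--max_gen_len", "1024",
        "--out_name", out_name,
    ]

cmds = [build_tts_cmd(t, f"tts_{i}.wav")
        for i, t in enumerate(["Hello there.", "See you soon."])]
# Uncomment to actually run (requires the repo and checkpoints):
# for cmd in cmds:
#     subprocess.run(cmd, check=True)
print(len(cmds))
```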
### (2) ASR

```shell
python inference/inference_pretrain.py \
  --ckpt_path /path/to/AR-Omni-Pretrain-v0.1 \
  --tokenizer_path /path/to/AR-Omni-Pretrain-v0.1 \
  --out_dir ./outputs/asr \
  --device 0 \
  asr \
  --audio_path inference/ref.wav \
  --wavtokenizer_root /path/to/WavTokenizer \
  --wavtokenizer_config /path/to/wavtokenizer.yaml \
  --wavtokenizer_ckpt /path/to/wavtokenizer.ckpt \
  --instruction "Can you please convert this speech into written text?" \
  --max_seq_len 512
```
### (3) Image Captioning

```shell
python inference/inference_pretrain.py \
  --ckpt_path /path/to/AR-Omni-Pretrain-v0.1 \
  --tokenizer_path /path/to/AR-Omni-Pretrain-v0.1 \
  --out_dir ./outputs/caption \
  --device 0 \
  caption \
  --image_path inference/demo_test.jpg \
  --instruction "Describe this image in detail." \
  --max_gen_len 256
```
### (4) Text-to-Image (T2I)

```shell
python inference/inference_pretrain.py \
  --ckpt_path /path/to/AR-Omni-Pretrain-v0.1 \
  --tokenizer_path /path/to/AR-Omni-Pretrain-v0.1 \
  --out_dir ./outputs/t2i \
  --device 0 \
  t2i \
  --text "a bunch of ripe strawberries on a plate" \
  --temp 1.0 \
  --guidance_scale_image 1.32 \
  --out_name t2i_test.png
```
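Since `--guidance_scale_image` noticeably affects T2I output, it can be useful to sweep it and compare the resulting images. A small sketch, again with placeholder paths and flag names mirroring the command above:

```python
# Sketch: build T2I commands for several guidance scales to compare
# outputs side by side. Paths are placeholders, not real locations.
def build_t2i_cmd(prompt, guidance, out_name,
                  ckpt="/path/to/AR-Omni-Pretrain-v0.1"):
    return [
        "python", "inference/inference_pretrain.py",
        "--ckpt_path", ckpt,
        "--tokenizer_path", ckpt,
        "--out_dir", "./outputs/t2i",
        "--device", "0",
        "t2i",
        "--text", prompt,
        "--temp", "1.0",
        "--guidance_scale_image", str(guidance),
        "--out_name", out_name,
    ]

scales = [1.0, 1.32, 2.0]
cmds = [build_t2i_cmd("a bunch of ripe strawberries on a plate",
                      g, f"t2i_g{g}.png")
        for g in scales]
print(len(cmds))
```

Run each command (e.g. via `subprocess.run`) and inspect `./outputs/t2i` to pick a scale that suits your prompt.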
## License

This project is licensed under the MIT License.
It also complies with the licenses of referenced third-party projects and dependencies, including the Chameleon Research License.
## Citation

```bibtex
@misc{cheng2026aromniunifiedautoregressivemodel,
  title={AR-Omni: A Unified Autoregressive Model for Any-to-Any Generation},
  author={Dongjie Cheng and Ruifeng Yuan and Yongqi Li and Runyang You and Wenjie Wang and Liqiang Nie and Lei Zhang and Wenjie Li},
  year={2026},
  eprint={2601.17761},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2601.17761},
}
```