AR-Omni-Chat


Overview

AR-Omni-Chat is the chat variant of AR-Omni: a single-decoder, single-token-stream autoregressive any-to-any model that supports text, images, and speech in a unified next-token prediction framework.

This checkpoint is instruction-tuned on AR-Omni-Instruct-v0.1 and is intended for interleaved any-to-any conversation.

Usage

The recommended way to run AR-Omni v0.1 is via the official repository, which vendors editable installs of transformers and accelerate.
For full setup details, please follow: https://github.com/ModalityDance/AR-Omni

Run AR-Omni-Chat

Running the chat script requires CosyVoice2 and WavTokenizer assets and configuration.
If CosyVoice is not installed as a package, set PYTHONPATH to include its repository (and its vendored Matcha-TTS), as in the command below.

```bash
PYTHONPATH=/path/to/CosyVoice/third_party/Matcha-TTS:/path/to/CosyVoice${PYTHONPATH:+:$PYTHONPATH} \
python3 inference/inference_chat.py \
  --input ./infer_test.json \
  --output_dir ./test_results \
  --model_root /path/to/converted_model_root \
  --hf_tokenizer /path/to/converted_model_root \
  --cosyvoice_model_dir /path/to/CosyVoice2-0.5B \
  --wavtokenizer_cfg_path /path/to/wavtokenizer.yaml \
  --wavtokenizer_ckpt_path /path/to/wavtokenizer.ckpt \
  --save_audio --save_images
```

Common optional flags:

  • --txt_temp, --txt_top_p : sampling temperature and top-p for text
  • --img_temp : sampling temperature for image tokens
  • --bandwidth_id : WavTokenizer bandwidth id
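As a sketch, the full invocation with the optional flags can also be assembled programmatically, e.g. to launch runs from a driver script. The flag values below are illustrative placeholders, not recommended defaults:

```python
import shlex

# Build the inference command; sampling values here are placeholders.
cmd = [
    "python3", "inference/inference_chat.py",
    "--input", "./infer_test.json",
    "--output_dir", "./test_results",
    "--model_root", "/path/to/converted_model_root",
    "--hf_tokenizer", "/path/to/converted_model_root",
    "--txt_temp", "0.9",      # text sampling temperature
    "--txt_top_p", "0.95",    # text nucleus-sampling threshold
    "--img_temp", "1.0",      # image-token sampling temperature
    "--bandwidth_id", "0",    # WavTokenizer bandwidth id
    "--save_audio", "--save_images",
]
print(shlex.join(cmd))  # copy-paste-ready shell command
```

The resulting string can be passed to `subprocess.run(cmd)` directly, or printed and pasted into a shell.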

Input Example

```json
{
  "dialog_id": "demo_0001",
  "speaker_wav": "./inference/ref.wav",
  "turns": [
    {
      "text": "Describe the image in detail.",
      "image_paths": ["inference/demo_test.jpg"],
      "user_append_text": "Please acknowledge the user's vocal input, create a textual response.",
      "reset": true
    }
  ]
}
{
  "dialog_id": "demo_0002",
  "speaker_wav": "./inference/ref.wav",
  "turns": [
    {
      "text": "Can you show me the sunset?",
      "user_append_text": "Please transcribe the user's vocal input, create a picture of it.",
      "reset": true
    }
  ]
}
```
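The input file can also be generated programmatically. A minimal sketch, assuming the file is JSON Lines style (one dialog object per line) with the fields shown above:

```python
import json

# Two example dialogs mirroring the fields in the example above.
dialogs = [
    {
        "dialog_id": "demo_0001",
        "speaker_wav": "./inference/ref.wav",
        "turns": [{
            "text": "Describe the image in detail.",
            "image_paths": ["inference/demo_test.jpg"],
            "user_append_text": "Please acknowledge the user's vocal input, create a textual response.",
            "reset": True,
        }],
    },
    {
        "dialog_id": "demo_0002",
        "speaker_wav": "./inference/ref.wav",
        "turns": [{
            "text": "Can you show me the sunset?",
            "user_append_text": "Please transcribe the user's vocal input, create a picture of it.",
            "reset": True,
        }],
    },
]

# Write one JSON object per line.
with open("infer_test.json", "w", encoding="utf-8") as f:
    for d in dialogs:
        f.write(json.dumps(d, ensure_ascii=False) + "\n")
```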

License

This project is licensed under the MIT License.
It also complies with the licenses of referenced third-party projects and dependencies, including the Chameleon Research License.

Citation

```bibtex
@misc{cheng2026aromniunifiedautoregressivemodel,
      title={AR-Omni: A Unified Autoregressive Model for Any-to-Any Generation},
      author={Dongjie Cheng and Ruifeng Yuan and Yongqi Li and Runyang You and Wenjie Wang and Liqiang Nie and Lei Zhang and Wenjie Li},
      year={2026},
      eprint={2601.17761},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2601.17761},
}
```
Model Tree

Base model: GAIR/Anole-7b-v0.1 (this checkpoint is fine-tuned from it).