# AR-Omni-Chat

## Overview
AR-Omni-Chat is the chat variant of AR-Omni: a single-decoder, single-token-stream autoregressive any-to-any model that supports text, images, and speech in a unified next-token prediction framework.
This checkpoint is instruction-tuned on AR-Omni-Instruct-v0.1 and is intended for interleaved any-to-any conversation.
## Usage

The recommended way to run AR-Omni v0.1 is via the official repository (with vendored editable `transformers` and `accelerate`).
For full setup details, please follow: https://github.com/ModalityDance/AR-Omni
### Run AR-Omni-Chat

Requires CosyVoice2 and WavTokenizer assets/config. If CosyVoice is not installed as a package, set `PYTHONPATH` to include its repo.
```shell
PYTHONPATH=/path/to/CosyVoice/third_party/Matcha-TTS:/path/to/CosyVoice${PYTHONPATH:+:$PYTHONPATH} \
python3 inference/inference_chat.py \
  --input ./infer_test.json \
  --output_dir ./test_results \
  --model_root /path/to/converted_model_root \
  --hf_tokenizer /path/to/converted_model_root \
  --cosyvoice_model_dir /path/to/CosyVoice2-0.5B \
  --wavtokenizer_cfg_path /path/to/wavtokenizer.yaml \
  --wavtokenizer_ckpt_path /path/to/wavtokenizer.ckpt \
  --save_audio --save_images
```
Common optional flags:

- `--txt_temp`, `--txt_top_p`: sampling settings for text
- `--img_temp`: sampling temperature for image tokens
- `--bandwidth_id`: WavTokenizer bandwidth id
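As an illustration, the invocation above can also be assembled and launched from Python, with the optional sampling flags appended. The paths are the same placeholders as in the command above, and the flag values (`0.7`, `0.9`, `1.0`) are purely illustrative defaults, not values recommended by the repository.

```python
import subprocess

# Placeholder paths -- replace with your local checkout/checkpoints.
MODEL_ROOT = "/path/to/converted_model_root"

cmd = [
    "python3", "inference/inference_chat.py",
    "--input", "./infer_test.json",
    "--output_dir", "./test_results",
    "--model_root", MODEL_ROOT,
    "--hf_tokenizer", MODEL_ROOT,
    "--cosyvoice_model_dir", "/path/to/CosyVoice2-0.5B",
    "--wavtokenizer_cfg_path", "/path/to/wavtokenizer.yaml",
    "--wavtokenizer_ckpt_path", "/path/to/wavtokenizer.ckpt",
    "--save_audio", "--save_images",
    # Optional sampling controls (illustrative values):
    "--txt_temp", "0.7",
    "--txt_top_p", "0.9",
    "--img_temp", "1.0",
]

# subprocess.run(cmd, check=True)  # uncomment once the paths are real
```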
## Input Example

```json
{
  "dialog_id": "demo_0001",
  "speaker_wav": "./inference/ref.wav",
  "turns": [
    {
      "text": "Describe the image in detail.",
      "image_paths": ["inference/demo_test.jpg"],
      "user_append_text": "Please acknowledge the user's vocal input, create a textual response.",
      "reset": true
    }
  ]
}
{
  "dialog_id": "demo_0002",
  "speaker_wav": "./inference/ref.wav",
  "turns": [
    {
      "text": "Can you show me the sunset?",
      "user_append_text": "Please transcribe the user's vocal input, create a picture of it.",
      "reset": true
    }
  ]
}
```
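A minimal sketch of generating such an input file in Python. It assumes one JSON object per line (JSON Lines), which matches how the two example dialogs appear above; check the repository's loader if your file should instead be a single JSON list.

```python
import json

# Dialogs mirroring the examples above; "reset": True starts a fresh context.
dialogs = [
    {
        "dialog_id": "demo_0001",
        "speaker_wav": "./inference/ref.wav",
        "turns": [
            {
                "text": "Describe the image in detail.",
                "image_paths": ["inference/demo_test.jpg"],
                "user_append_text": "Please acknowledge the user's vocal input, create a textual response.",
                "reset": True,
            }
        ],
    },
    {
        "dialog_id": "demo_0002",
        "speaker_wav": "./inference/ref.wav",
        "turns": [
            {
                "text": "Can you show me the sunset?",
                "user_append_text": "Please transcribe the user's vocal input, create a picture of it.",
                "reset": True,
            }
        ],
    },
]

# Write one JSON object per line; adjust if the loader expects a list instead.
with open("infer_test.json", "w", encoding="utf-8") as f:
    for dialog in dialogs:
        f.write(json.dumps(dialog, ensure_ascii=False) + "\n")
```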
## License
This project is licensed under the MIT License.
It also complies with the licenses of referenced third-party projects and dependencies, including the Chameleon Research License.
## Citation
```bibtex
@misc{cheng2026aromniunifiedautoregressivemodel,
  title={AR-Omni: A Unified Autoregressive Model for Any-to-Any Generation},
  author={Dongjie Cheng and Ruifeng Yuan and Yongqi Li and Runyang You and Wenjie Wang and Liqiang Nie and Lei Zhang and Wenjie Li},
  year={2026},
  eprint={2601.17761},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2601.17761},
}
```
## Model tree for ModalityDance/AR-Omni-Chat-v0.1

Base model: `GAIR/Anole-7b-v0.1`