OmniVoice 🌍
OmniVoice is a state-of-the-art zero-shot multilingual TTS model supporting more than 600 languages. Built on a novel diffusion language model architecture, it generates high-quality speech at high inference speed and supports both voice cloning and voice design.
Contents: Key Features | Installation | Quick Start | Python API | Command-Line Tools | Training & Evaluation | Discussion | Citation
Key Features
- 600+ Languages Supported: The broadest language coverage among zero-shot TTS models (full list)
- Voice Cloning: State-of-the-art voice cloning quality.
- Voice Design: Control voices via assigned speaker attributes (gender, age, pitch, dialect/accent, whisper, etc.).
- Fast Inference: RTF as low as 0.025 (40x faster than real-time).
- Diffusion Language Model Architecture: A clean, streamlined, and scalable design that delivers both quality and speed.
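For context, the real-time factor (RTF) quoted above is the time spent synthesizing divided by the duration of the audio produced, so lower is faster. A quick sanity check of the arithmetic (illustrative numbers only):

```python
# RTF (real-time factor) = synthesis time / duration of synthesized audio.
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    return synthesis_seconds / audio_seconds

# An RTF of 0.025 means 10 s of audio takes 0.25 s to synthesize,
# i.e. 1 / 0.025 = 40x faster than real time.
rtf = real_time_factor(synthesis_seconds=0.25, audio_seconds=10.0)
print(rtf)       # 0.025
print(1 / rtf)   # 40.0
```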
Installation
Choose one of the following methods: pip or uv.
pip
We recommend using a fresh virtual environment (e.g., conda, venv) to avoid conflicts.
Step 1: Install PyTorch
NVIDIA GPU
# Install pytorch with your CUDA version, e.g.
pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128
See the PyTorch official site for installing other versions.
Apple Silicon
pip install torch==2.8.0 torchaudio==2.8.0
Step 2: Install OmniVoice (choose one)
# From PyPI (stable release)
pip install omnivoice
# From the latest source on GitHub (no need to clone)
pip install git+https://github.com/k2-fsa/OmniVoice.git
# For development (clone first, editable install)
git clone https://github.com/k2-fsa/OmniVoice.git
cd OmniVoice
pip install -e .
uv
Clone the repository and sync dependencies:
git clone https://github.com/k2-fsa/OmniVoice.git
cd OmniVoice
uv sync
Tip: You can use a mirror with
uv sync --default-index "https://mirrors.aliyun.com/pypi/simple"
Quick Start
Try OmniVoice without coding:
Launch the local web UI:
omnivoice-demo --ip 0.0.0.0 --port 8001
Or try it directly on HuggingFace Space.
If you have trouble connecting to HuggingFace when downloading the pre-trained models, set
export HF_ENDPOINT="https://hf-mirror.com"
before running.
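The same mirror can be configured from Python instead of the shell, since huggingface_hub reads the HF_ENDPOINT environment variable. It must be set before huggingface_hub (and therefore omnivoice) is imported, because the variable is read at import time:

```python
import os

# Point Hugging Face downloads at a mirror. Set this before importing
# omnivoice / huggingface_hub, which read HF_ENDPOINT at import time.
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
print(os.environ["HF_ENDPOINT"])
```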
For full usage, see the Python API and Command-Line Tools sections below.
Python API
The OmniVoice model supports three generation modes. All features in this section are also available via command-line tools.
Voice Cloning
Clone a voice from a short reference audio. Provide ref_audio and ref_text:
from omnivoice import OmniVoice
import torch
import torchaudio
model = OmniVoice.from_pretrained(
"k2-fsa/OmniVoice",
device_map="cuda:0",
dtype=torch.float16
)
# Apple Silicon users: use device_map="mps" instead
audio = model.generate(
text="Hello, this is a test of zero-shot voice cloning.",
ref_audio="ref.wav",
ref_text="Transcription of the reference audio.",
) # audio is a list of `torch.Tensor` with shape (1, T) at 24 kHz.
# If you prefer not to type `ref_text` manually, simply omit it;
# the model will use Whisper ASR to auto-transcribe the reference audio.
torchaudio.save("out.wav", audio[0], 24000)
Voice Design
Describe the desired voice with speaker attributes — no reference audio needed. Supported attributes: gender (male/female), age (child to elderly), pitch (very low to very high), style (whisper), English accent (American, British, etc.), and Chinese dialect (四川话, 陕西话, etc.). Attributes are comma-separated and freely combinable across categories.
audio = model.generate(
text="Hello, this is a test of zero-shot voice design.",
instruct="female, low pitch, british accent",
)
See docs/voice-design.md for the full attribute reference, Chinese equivalents, and usage tips.
Auto Voice
Let the model choose a voice automatically:
audio = model.generate(text="This is a sentence without any voice prompt.")
Generation Parameters
All three modes above share the same model.generate() API. You can further control generation behavior via keyword arguments:
audio = model.generate(
text="...",
num_step=32, # diffusion steps (or 16 for faster inference)
speed=1.0, # speed factor (>1.0 faster, <1.0 slower)
duration=10.0, # fixed output duration in seconds (overrides speed)
# ... more options
)
See more detailed control in docs/generation-parameters.md.
Non-Verbal & Pronunciation Control
OmniVoice supports inline non-verbal symbols and pronunciation hints within the input text.
Non-verbal symbols: Insert tags like [laughter] directly in the text to add expressive non-verbal sounds.
audio = model.generate(text="[laughter] You really got me. I didn't see that coming at all.")
Supported tags: [laughter], [confirmation-en], [question-en], [question-ah], [question-oh], [question-ei], [question-yi], [surprise-ah], [surprise-oh], [surprise-wa], [surprise-yo], [dissatisfaction-hnn], [sniff], [sigh]
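Because the tags are plain bracketed markers, it is easy to check input text against the documented list before generation. A small helper (illustrative; the model itself does not require this step):

```python
import re

# The documented non-verbal tags from the list above.
SUPPORTED_TAGS = {
    "[laughter]", "[confirmation-en]", "[question-en]", "[question-ah]",
    "[question-oh]", "[question-ei]", "[question-yi]", "[surprise-ah]",
    "[surprise-oh]", "[surprise-wa]", "[surprise-yo]",
    "[dissatisfaction-hnn]", "[sniff]", "[sigh]",
}

def unknown_tags(text: str) -> list[str]:
    """Return bracketed lowercase tags in `text` not in the supported set."""
    return [t for t in re.findall(r"\[[a-z-]+\]", text) if t not in SUPPORTED_TAGS]

print(unknown_tags("[laughter] You really got me."))  # []
print(unknown_tags("[giggle] Oh no."))                # ['[giggle]']
```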
Pronunciation control (Chinese): Use pinyin with tone numbers to correct specific character pronunciations.
audio = model.generate(text="这批货物打ZHE2出售后他严重SHE2本了,再也经不起ZHE1腾了。")
Pronunciation control (English): Use CMU pronunciation dictionary (uppercase, in brackets) to override default English pronunciations.
audio = model.generate(text="You could probably still make [IH1 T] look good.")
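Both pronunciation notations are plain inline markers, so they are straightforward to spot or strip programmatically. A regex sketch of the two shapes shown above (this illustrates the notation only; it is not OmniVoice's actual parser):

```python
import re

# Pinyin hint: uppercase letters followed by a tone digit 1-5, e.g. ZHE2.
PINYIN_HINT = re.compile(r"[A-Z]+[1-5]")
# CMU hint: bracketed, space-separated uppercase phonemes with optional
# stress digits 0-2, e.g. [IH1 T].
CMU_HINT = re.compile(r"\[[A-Z]+[0-2]?(?: [A-Z]+[0-2]?)*\]")

zh = "这批货物打ZHE2出售后他严重SHE2本了"
en = "You could probably still make [IH1 T] look good."

print(PINYIN_HINT.findall(zh))  # ['ZHE2', 'SHE2']
print(CMU_HINT.findall(en))     # ['[IH1 T]']
```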
Command-Line Tools
Three CLI entry points are provided. The CLI tools support all features available in the Python API (voice cloning, voice design, auto voice, generation parameters, etc.) — all controlled via command-line arguments.
| Command | Description | Source |
|---|---|---|
| omnivoice-demo | Interactive Gradio web demo | omnivoice/cli/demo.py |
| omnivoice-infer | Single-item inference | omnivoice/cli/infer.py |
| omnivoice-infer-batch | Batch inference across multiple GPUs | omnivoice/cli/infer_batch.py |
Demo
omnivoice-demo --ip 0.0.0.0 --port 8001
Provides a web UI for voice cloning and voice design. See omnivoice-demo --help for all options.
Single Inference
# Voice Cloning
# ref_text can be omitted (Whisper will auto-transcribe ref_audio to get it).
omnivoice-infer \
--model k2-fsa/OmniVoice \
--text "This is a test for text to speech." \
--ref_audio ref.wav \
--ref_text "Transcription of the reference audio." \
--output hello.wav
# Voice Design
omnivoice-infer --model k2-fsa/OmniVoice \
--text "This is a test for text to speech." \
--instruct "male, British accent" \
--output hello.wav
# Auto Voice
omnivoice-infer \
--model k2-fsa/OmniVoice \
--text "This is a test for text to speech." \
--output hello.wav
Batch Inference
omnivoice-infer-batch distributes inference across multiple GPUs and is designed for large-scale TTS tasks.
omnivoice-infer-batch \
--model k2-fsa/OmniVoice \
--test_list test.jsonl \
--res_dir results/
The test list is a JSONL file where each line is a JSON object:
{"id": "sample_001", "text": "Hello world", "ref_audio": "/path/to/ref.wav", "ref_text": "Reference transcript", "instruct": "female, british accent", "language_id": "en", "language_name": "English", "duration": 10.0, "speed": 1.0}
Only id and text are mandatory fields. ref_audio and ref_text are used in voice cloning mode; instruct is used in voice design mode. If neither ref_audio nor instruct is provided, the model generates speech in a random voice.
language_id, language_name, duration, and speed are optional. duration (in seconds) fixes the output length; speed controls the speaking rate. If both duration and speed are provided, speed is ignored.
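A test list like the one above can be written with a short standard-library script (the entries below are illustrative; only id and text are required):

```python
import json

rows = [
    # Voice cloning entry: ref_audio (+ optional ref_text) selects the voice.
    {"id": "sample_001", "text": "Hello world",
     "ref_audio": "/path/to/ref.wav", "ref_text": "Reference transcript"},
    # Voice design entry: instruct describes the desired voice instead.
    {"id": "sample_002", "text": "Hello again",
     "instruct": "female, british accent"},
    # Minimal entry: mandatory fields only; a random voice is used.
    {"id": "sample_003", "text": "Auto voice"},
]

# One JSON object per line, as omnivoice-infer-batch expects.
with open("test.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```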
Training & Evaluation
See examples/ for the complete pipeline — from data preparation to training, evaluation, and finetuning.
Discussion & Communication
Discussions are welcome on GitHub Issues.
You can also scan the QR code to join our WeChat group or follow our WeChat official account.
Citation
@article{zhu2026omnivoice,
title={OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models},
author={Zhu, Han and Ye, Lingxuan and Kang, Wei and Yao, Zengwei and Guo, Liyong and Kuang, Fangjun and Han, Zhifeng and Zhuang, Weiji and Lin, Long and Povey, Daniel},
journal={arXiv preprint arXiv:2604.00688},
year={2026}
}