OmniVoice 🌍
OmniVoice is a state-of-the-art zero-shot multilingual TTS model supporting more than 600 languages. Built on a novel diffusion language model architecture, it generates high-quality speech at high inference speed and supports both voice cloning and voice design.
Contents: Key Features | Installation | Quick Start | Python API | Command-Line Tools | Training & Evaluation | Discussion | Citation
Key Features
- 600+ Languages Supported: The broadest language coverage among zero-shot TTS models (full list)
- Voice Cloning: State-of-the-art voice cloning quality.
- Voice Design: Control voices via assigned speaker attributes (gender, age, pitch, dialect/accent, whisper, etc.).
- Fast Inference: RTF as low as 0.025 (40x faster than real-time).
- Diffusion Language Model Architecture: A clean, streamlined, and scalable design that delivers both quality and speed.
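For context, the real-time factor (RTF) quoted above is the time spent synthesizing divided by the duration of the audio produced, so lower is faster. A quick sanity check of the arithmetic (illustrative numbers only):

```python
# RTF (real-time factor) = synthesis time / duration of synthesized audio.
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    return synthesis_seconds / audio_seconds

# An RTF of 0.025 means 10 s of audio takes 0.25 s to synthesize,
# i.e. 1 / 0.025 = 40x faster than real time.
rtf = real_time_factor(synthesis_seconds=0.25, audio_seconds=10.0)
print(rtf)       # 0.025
print(1 / rtf)   # 40.0
```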
Installation
Choose one of the following methods: pip or uv.
pip
We recommend using a fresh virtual environment (e.g., conda, venv) to avoid conflicts.
Step 1: Install PyTorch
NVIDIA GPU
# Install pytorch with your CUDA version, e.g.
pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128
See the PyTorch official site for installing other versions.
Apple Silicon
pip install torch==2.8.0 torchaudio==2.8.0
Step 2: Install OmniVoice (choose one)
# From PyPI (stable release)
pip install omnivoice
# From the latest source on GitHub (no need to clone)
pip install git+https://github.com/k2-fsa/OmniVoice.git
# For development (clone first, editable install)
git clone https://github.com/k2-fsa/OmniVoice.git
cd OmniVoice
pip install -e .
uv
Clone the repository and sync dependencies:
git clone https://github.com/k2-fsa/OmniVoice.git
cd OmniVoice
uv sync
Tip: You can use a mirror with
uv sync --default-index "https://mirrors.aliyun.com/pypi/simple"
Quick Start
Try OmniVoice without coding:
Launch the local web UI:
omnivoice-demo --ip 0.0.0.0 --port 8001
Or try it directly on HuggingFace Space.
If you have trouble connecting to HuggingFace when downloading the pre-trained models, set
export HF_ENDPOINT="https://hf-mirror.com"
before running.
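The same mirror can be configured from Python instead of the shell, since huggingface_hub reads the HF_ENDPOINT environment variable. It must be set before huggingface_hub (and therefore omnivoice) is imported, because the variable is read at import time:

```python
import os

# Point Hugging Face downloads at a mirror. Set this before importing
# omnivoice / huggingface_hub, which read HF_ENDPOINT at import time.
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
print(os.environ["HF_ENDPOINT"])
```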
For full usage, see the Python API and Command-Line Tools sections below.
Python API
The OmniVoice model supports three generation modes. All features in this section are also available via command-line tools.
Voice Cloning
Clone a voice from a short reference audio. Provide ref_audio and ref_text:
from omnivoice import OmniVoice
import torch
import torchaudio
model = OmniVoice.from_pretrained(
"k2-fsa/OmniVoice",
device_map="cuda:0",
dtype=torch.float16
)
# Apple Silicon users: use device_map="mps" instead
audio = model.generate(
text="Hello, this is a test of zero-shot voice cloning.",
ref_audio="ref.wav",
ref_text="Transcription of the reference audio.",
) # audio is a list of `torch.Tensor` with shape (1, T) at 24 kHz.
# If you prefer not to type `ref_text` manually, simply omit it;
# the model will use Whisper ASR to auto-transcribe the reference audio.
torchaudio.save("out.wav", audio[0], 24000)
Voice Design
Describe the desired voice with speaker attributes — no reference audio needed. Supported attributes: gender (male/female), age (child to elderly), pitch (very low to very high), style (whisper), English accent (American, British, etc.), and Chinese dialect (四川话, 陕西话, etc.). Attributes are comma-separated and freely combinable across categories.
audio = model.generate(
text="Hello, this is a test of zero-shot voice design.",
instruct="female, low pitch, british accent",
)
See docs/voice-design.md for the full attribute reference, Chinese equivalents, and usage tips.
Auto Voice
Let the model choose a voice automatically:
audio = model.generate(text="This is a sentence without any voice prompt.")
Generation Parameters
All three modes above share the same model.generate() API. You can further control generation behavior via keyword arguments:
audio = model.generate(
text="...",
num_step=32, # diffusion steps (or 16 for faster inference)
speed=1.0, # speed factor (>1.0 faster, <1.0 slower)
duration=10.0, # fixed output duration in seconds (overrides speed)
# ... more options
)
See more detailed control in docs/generation-parameters.md.
Non-Verbal & Pronunciation Control
OmniVoice supports inline non-verbal symbols and pronunciation hints within the input text.
Non-verbal symbols: Insert tags like [laughter] directly in the text to add expressive non-verbal sounds.
audio = model.generate(text="[laughter] You really got me. I didn't see that coming at all.")
Supported tags: [laughter], [confirmation-en], [question-en], [question-ah], [question-oh], [question-ei], [question-yi], [surprise-ah], [surprise-oh], [surprise-wa], [surprise-yo], [dissatisfaction-hnn], [sniff], [sigh]
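Because the tags are plain bracketed markers, it is easy to check input text against the documented list before generation. A small helper (illustrative; the model itself does not require this step):

```python
import re

# The documented non-verbal tags from the list above.
SUPPORTED_TAGS = {
    "[laughter]", "[confirmation-en]", "[question-en]", "[question-ah]",
    "[question-oh]", "[question-ei]", "[question-yi]", "[surprise-ah]",
    "[surprise-oh]", "[surprise-wa]", "[surprise-yo]",
    "[dissatisfaction-hnn]", "[sniff]", "[sigh]",
}

def unknown_tags(text: str) -> list[str]:
    """Return bracketed lowercase tags in `text` not in the supported set."""
    return [t for t in re.findall(r"\[[a-z-]+\]", text) if t not in SUPPORTED_TAGS]

print(unknown_tags("[laughter] You really got me."))  # []
print(unknown_tags("[giggle] Oh no."))                # ['[giggle]']
```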
Pronunciation control (Chinese): Use pinyin with tone numbers to correct specific character pronunciations.
audio = model.generate(text="这批货物打ZHE2出售后他严重SHE2本了,再也经不起ZHE1腾了。")
Pronunciation control (English): Use CMU pronunciation dictionary (uppercase, in brackets) to override default English pronunciations.
audio = model.generate(text="You could probably still make [IH1 T] look good.")
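Both pronunciation notations are plain inline markers, so they are straightforward to spot or strip programmatically. A regex sketch of the two shapes shown above (this illustrates the notation only; it is not OmniVoice's actual parser):

```python
import re

# Pinyin hint: uppercase letters followed by a tone digit 1-5, e.g. ZHE2.
PINYIN_HINT = re.compile(r"[A-Z]+[1-5]")
# CMU hint: bracketed, space-separated uppercase phonemes with optional
# stress digits 0-2, e.g. [IH1 T].
CMU_HINT = re.compile(r"\[[A-Z]+[0-2]?(?: [A-Z]+[0-2]?)*\]")

zh = "这批货物打ZHE2出售后他严重SHE2本了"
en = "You could probably still make [IH1 T] look good."

print(PINYIN_HINT.findall(zh))  # ['ZHE2', 'SHE2']
print(CMU_HINT.findall(en))     # ['[IH1 T]']
```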
Command-Line Tools
Three CLI entry points are provided. The CLI tools support all features available in the Python API (voice cloning, voice design, auto voice, generation parameters, etc.) — all controlled via command-line arguments.
| Command | Description | Source |
|---|---|---|
| omnivoice-demo | Interactive Gradio web demo | omnivoice/cli/demo.py |
| omnivoice-infer | Single-item inference | omnivoice/cli/infer.py |
| omnivoice-infer-batch | Batch inference across multiple GPUs | omnivoice/cli/infer_batch.py |
Demo
omnivoice-demo --ip 0.0.0.0 --port 8001
Provides a web UI for voice cloning and voice design. See omnivoice-demo --help for all options.
Single Inference
# Voice Cloning
# ref_text can be omitted (Whisper will auto-transcribe ref_audio to get it).
omnivoice-infer \
--model k2-fsa/OmniVoice \
--text "This is a test for text to speech." \
--ref_audio ref.wav \
--ref_text "Transcription of the reference audio." \
--output hello.wav
# Voice Design
omnivoice-infer --model k2-fsa/OmniVoice \
--text "This is a test for text to speech." \
--instruct "male, British accent" \
--output hello.wav
# Auto Voice
omnivoice-infer \
--model k2-fsa/OmniVoice \
--text "This is a test for text to speech." \
--output hello.wav
Batch Inference
omnivoice-infer-batch distributes inference across multiple GPUs and is designed for large-scale TTS tasks.
omnivoice-infer-batch \
--model k2-fsa/OmniVoice \
--test_list test.jsonl \
--res_dir results/
The test list is a JSONL file where each line is a JSON object:
{"id": "sample_001", "text": "Hello world", "ref_audio": "/path/to/ref.wav", "ref_text": "Reference transcript", "instruct": "female, british accent", "language_id": "en", "language_name": "English", "duration": 10.0, "speed": 1.0}
Only id and text are mandatory fields. ref_audio and ref_text are used in voice cloning mode; instruct is used in voice design mode. If neither ref_audio nor instruct is provided, the model generates speech in a random voice.
language_id, language_name, duration, and speed are optional. duration (in seconds) fixes the output length; speed controls the speaking rate. If both duration and speed are provided, speed is ignored.
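A test list like the one above can be written with a short standard-library script (the entries below are illustrative; only id and text are required):

```python
import json

rows = [
    # Voice cloning entry: ref_audio (+ optional ref_text) selects the voice.
    {"id": "sample_001", "text": "Hello world",
     "ref_audio": "/path/to/ref.wav", "ref_text": "Reference transcript"},
    # Voice design entry: instruct describes the desired voice instead.
    {"id": "sample_002", "text": "Hello again",
     "instruct": "female, british accent"},
    # Minimal entry: mandatory fields only; a random voice is used.
    {"id": "sample_003", "text": "Auto voice"},
]

# One JSON object per line, as omnivoice-infer-batch expects.
with open("test.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```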
Training & Evaluation
See examples/ for the complete pipeline — from data preparation to training, evaluation, and finetuning.
Discussion & Communication
Discussions are welcome on GitHub Issues.
You can also scan the QR code to join our WeChat group or follow our WeChat official account.
Citation
@article{zhu2026omnivoice,
title={OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models},
author={Zhu, Han and Ye, Lingxuan and Kang, Wei and Yao, Zengwei and Guo, Liyong and Kuang, Fangjun and Han, Zhifeng and Zhuang, Weiji and Lin, Long and Povey, Daniel},
journal={arXiv preprint arXiv:2604.00688},
year={2026}
}