license: apache-2.0
base_model:
- Qwen/Qwen3-0.6B
pipeline_tag: text-to-speech
language:
- aae
- aal
- aao
- ab
- abb
- abn
- abr
- abs
- abv
- acm
- acw
- acx
- adf
- adx
- ady
- aeb
- aec
- af
- afb
- afo
- ahl
- ahs
- ajg
- aju
- ala
- aln
- alo
- am
- amu
- an
- anc
- ank
- anp
- anw
- aom
- apc
- apd
- arb
- arq
- ars
- ary
- arz
- as
- ast
- avl
- awo
- ayl
- ayp
- az
- ba
- bag
- bas
- bax
- bba
- bbj
- bbl
- bbu
- bce
- bci
- bcs
- bcy
- bda
- bde
- bdm
- be
- beb
- bew
- bfd
- bft
- bg
- bgp
- bhb
- bhh
- bho
- bhp
- bhr
- bjj
- bjk
- bjn
- bjt
- bkh
- bkm
- bky
- bmm
- bmq
- bn
- bnm
- bnn
- bns
- bo
- bou
- bqg
- br
- bra
- brh
- bri
- brx
- bs
- bsh
- bsj
- bsk
- btm
- btv
- bug
- bum
- buo
- bux
- bwr
- bxf
- byc
- bys
- byv
- byx
- bzc
- bzw
- ca
- ccg
- ceb
- cen
- cfa
- cgg
- chq
- cjk
- ckb
- ckl
- ckr
- cky
- cnh
- cpy
- cs
- cte
- ctl
- cut
- cux
- cv
- cy
- da
- dag
- dar
- dav
- dbd
- dcc
- de
- deg
- dgh
- dgo
- dje
- dmk
- dml
- dru
- dty
- dua
- dv
- dyu
- dzg
- ebr
- ebu
- ego
- eiv
- eko
- ekr
- el
- elm
- en
- eo
- es
- esu
- et
- eto
- ets
- etu
- eu
- ewo
- ext
- eyo
- fa
- fan
- fat
- ff
- ffm
- fi
- fia
- fil
- fip
- fkk
- fmp
- fr
- fub
- fuc
- fue
- fuf
- fuh
- fui
- fuq
- fuv
- fy
- ga
- gbm
- gbr
- gby
- gcc
- gdf
- gej
- ges
- ggg
- gid
- gig
- giz
- gjk
- gju
- gl
- glw
- gn
- gol
- gom
- gsl
- gu
- gui
- gur
- guz
- gv
- gwc
- gwe
- gwt
- gya
- gyz
- ha
- hah
- hao
- haw
- haz
- hbb
- he
- hem
- hi
- hia
- hkk
- hla
- hno
- hoj
- hr
- hsb
- ht
- hu
- hue
- hul
- hux
- hwo
- hy
- hz
- ia
- ibb
- id
- ida
- idu
- ig
- ijc
- ijn
- ik
- ikw
- is
- ish
- iso
- it
- its
- itw
- itz
- ja
- jal
- jax
- jgo
- jmx
- jns
- jqr
- juk
- juo
- jv
- ka
- kab
- kai
- kaj
- kam
- kbd
- kbl
- kbt
- kcq
- kdh
- kea
- keu
- kfe
- kfk
- kfp
- khg
- khw
- kj
- kjc
- kjk
- kk
- kln
- kls
- km
- kmr
- kmy
- kn
- kna
- knn
- ko
- kol
- koo
- kpo
- kqo
- ks
- ksd
- ksf
- kto
- kuh
- kvx
- kw
- kwm
- kxp
- ky
- kyx
- lag
- lb
- lcm
- ldb
- lg
- lij
- lir
- lkb
- lla
- ln
- lnu
- lo
- loa
- lrk
- lss
- lt
- ltg
- lto
- lua
- luo
- lus
- lv
- lwg
- mab
- maf
- mai
- mau
- max
- mbo
- mcf
- mcn
- mcx
- mdd
- mde
- mdf
- mek
- mer
- meu
- mfm
- mfn
- mfo
- mfv
- mgg
- mgi
- mhk
- mhr
- mi
- mig
- miu
- mk
- mkf
- mki
- ml
- mlq
- mn
- mne
- mni
- mqy
- mr
- mrj
- mrr
- mrt
- ms
- mse
- msh
- msw
- mt
- mtr
- mtu
- mtx
- mua
- mug
- mui
- mve
- mvy
- mxs
- mxu
- mxy
- my
- myv
- mzl
- nal
- nan
- nap
- nb
- nbh
- ncf
- nco
- ncx
- ndi
- ng
- ngi
- nhg
- nhi
- nhn
- nhq
- nja
- nl
- nla
- nlv
- nmg
- nmz
- nn
- nnh
- 'no'
- noe
- npi
- nso
- ny
- nyu
- oc
- odk
- odu
- ogo
- om
- orc
- oru
- ory
- os
- pa
- pbs
- pbt
- pbu
- pcm
- pex
- phl
- phr
- pip
- piy
- pko
- pl
- plk
- plt
- pmq
- pms
- pmy
- pnb
- poc
- poe
- pow
- prq
- ps
- pst
- pt
- pua
- pwn
- qug
- qum
- qup
- qur
- qus
- quv
- qux
- quy
- qva
- qvi
- qvj
- qvl
- qwa
- qws
- qxa
- qxp
- qxt
- qxu
- qxw
- rag
- rm
- ro
- rob
- rof
- roo
- rth
- ru
- rup
- rw
- sa
- sah
- sat
- sau
- say
- sbn
- sc
- scl
- scn
- sd
- sei
- shu
- si
- sip
- siw
- sjr
- sk
- skg
- skr
- sl
- sn
- snc
- snk
- so
- sol
- sps
- sq
- sr
- src
- sro
- ssi
- ste
- sua
- sv
- sva
- sw
- szy
- ta
- tan
- tar
- tay
- tbf
- tcf
- tcy
- tdn
- tdx
- te
- tg
- tgc
- th
- the
- thq
- thr
- thv
- ti
- tig
- tio
- tk
- tkg
- tkt
- tli
- tlp
- tn
- tok
- tpl
- tpz
- tqp
- tr
- trp
- trq
- trv
- trw
- tt
- ttj
- ttr
- ttu
- tui
- tul
- tuq
- tuv
- tuy
- tvo
- tvu
- tw
- twu
- txs
- txy
- udl
- ug
- uk
- uki
- umb
- ur
- ush
- uz
- uzn
- vai
- var
- ver
- vi
- vmc
- vmj
- vmm
- vmp
- vmz
- vot
- vro
- wbl
- wci
- weo
- wes
- wja
- wji
- wo
- wof
- xh
- xhe
- xka
- xmf
- xmv
- xmw
- xpe
- xti
- xtu
- yaq
- yav
- yay
- ydd
- ydg
- yer
- 'yes'
- yi
- yo
- yue
- zga
- zgh
- zh
- zoc
- zoh
- zor
- zpv
- zpy
- ztg
- ztn
- ztp
- zts
- ztu
- zu
- zza
# OmniVoice 🌍

OmniVoice is a state-of-the-art zero-shot multilingual TTS model supporting more than 600 languages. Built on a novel diffusion language model architecture, it generates high-quality speech with fast inference, and supports both voice cloning and voice design.
Contents: Key Features | Installation | Quick Start | Python API | Command-Line Tools | Training & Evaluation | Discussion | Citation
## Key Features

- 600+ Languages Supported: the broadest language coverage among zero-shot TTS models (full list).
- Voice Cloning: state-of-the-art voice cloning quality.
- Voice Design: control voices via assigned speaker attributes (gender, age, pitch, dialect/accent, whisper, etc.).
- Fast Inference: RTF as low as 0.025 (40x faster than real time).
- Diffusion Language Model Architecture: a clean, streamlined, and scalable design that delivers both quality and speed.
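The RTF figure above is wall-clock synthesis time divided by the duration of the audio produced; a quick sketch of the arithmetic (the timings below are hypothetical, not measured):

```python
def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: wall-clock synthesis time / generated audio duration."""
    return synthesis_seconds / audio_seconds

# Hypothetical timing: 0.25 s of compute to generate 10 s of audio.
print(rtf(0.25, 10.0))  # 0.025, i.e. 40x faster than real time
```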
## Installation

Choose one of the following methods: pip or uv.

### pip

We recommend using a fresh virtual environment (e.g., conda or venv) to avoid conflicts.
#### Step 1: Install PyTorch

**NVIDIA GPU**

```bash
# Install PyTorch matching your CUDA version, e.g.
pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128
```

See the PyTorch official site for installing other versions.

**Apple Silicon**

```bash
pip install torch==2.8.0 torchaudio==2.8.0
```
#### Step 2: Install OmniVoice (choose one)

```bash
# From PyPI (stable release)
pip install omnivoice

# From the latest source on GitHub (no need to clone)
pip install git+https://github.com/k2-fsa/OmniVoice.git

# For development (clone first, editable install)
git clone https://github.com/k2-fsa/OmniVoice.git
cd OmniVoice
pip install -e .
```
### uv

Clone the repository and sync dependencies:

```bash
git clone https://github.com/k2-fsa/OmniVoice.git
cd OmniVoice
uv sync
```

Tip: you can use a mirror with:

```bash
uv sync --default-index "https://mirrors.aliyun.com/pypi/simple"
```
## Quick Start

Try OmniVoice without writing any code. Launch the local web UI:

```bash
omnivoice-demo --ip 0.0.0.0 --port 8001
```

Or try it directly on HuggingFace Space.

If you have trouble connecting to HuggingFace when downloading the pre-trained models, set the following before running:

```bash
export HF_ENDPOINT="https://hf-mirror.com"
```

For full usage, see the Python API and Command-Line Tools sections below.
## Python API

The OmniVoice model supports three generation modes. All features in this section are also available via command-line tools.

### Voice Cloning

Clone a voice from a short reference audio. Provide `ref_audio` and `ref_text`:

```python
import torch
import torchaudio

from omnivoice import OmniVoice

model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice",
    device_map="cuda:0",  # Apple Silicon users: use device_map="mps" instead
    dtype=torch.float16,
)

audio = model.generate(
    text="Hello, this is a test of zero-shot voice cloning.",
    ref_audio="ref.wav",
    ref_text="Transcription of the reference audio.",
)  # audio is a list of `torch.Tensor` with shape (1, T) at 24 kHz.
# `ref_text` can be omitted entirely; the model will then use Whisper ASR
# to auto-transcribe the reference audio.

torchaudio.save("out.wav", audio[0], 24000)
```
### Voice Design

Describe the desired voice with speaker attributes; no reference audio is needed. Supported attributes: gender (male/female), age (child to elderly), pitch (very low to very high), style (whisper), English accent (American, British, etc.), and Chinese dialect (Sichuanese 四川话, Shaanxi 陕西话, etc.). Attributes are comma-separated and freely combinable across categories.

```python
audio = model.generate(
    text="Hello, this is a test of zero-shot voice design.",
    instruct="female, low pitch, british accent",
)
```

See docs/voice-design.md for the full attribute reference, Chinese equivalents, and usage tips.
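Since the attribute string is plain comma-separated text, it can be assembled programmatically. A small helper (hypothetical, not part of the omnivoice API) illustrates the format:

```python
def build_instruct(gender=None, age=None, pitch=None,
                   style=None, accent=None, dialect=None):
    """Join the provided speaker attributes into a comma-separated
    instruct string; omitted categories are simply left out."""
    parts = [p for p in (gender, age, pitch, style, accent, dialect) if p]
    return ", ".join(parts)

print(build_instruct(gender="female", pitch="low pitch", accent="british accent"))
# female, low pitch, british accent
```

The resulting string can be passed directly as the `instruct` argument of `model.generate()`.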
### Auto Voice

Let the model choose a voice automatically:

```python
audio = model.generate(text="This is a sentence without any voice prompt.")
```
### Generation Parameters

All three modes above share the same `model.generate()` API. You can further control the generation behavior via keyword arguments:

```python
audio = model.generate(
    text="...",
    num_step=32,    # diffusion steps (use 16 for faster inference)
    speed=1.0,      # speed factor (>1.0 faster, <1.0 slower)
    duration=10.0,  # fixed output duration in seconds (overrides speed)
    # ... more options
)
```

See docs/generation-parameters.md for more detailed control.
### Non-Verbal & Pronunciation Control

OmniVoice supports inline non-verbal symbols and pronunciation hints within the input text.

Non-verbal symbols: insert tags like `[laughter]` directly in the text to add expressive non-verbal sounds.

```python
audio = model.generate(text="[laughter] You really got me. I didn't see that coming at all.")
```

Supported tags: `[laughter]`, `[confirmation-en]`, `[question-en]`, `[question-ah]`, `[question-oh]`, `[question-ei]`, `[question-yi]`, `[surprise-ah]`, `[surprise-oh]`, `[surprise-wa]`, `[surprise-yo]`, `[dissatisfaction-hnn]`, `[sniff]`, `[sigh]`

Pronunciation control (Chinese): use pinyin with tone numbers to correct the pronunciation of specific characters.

```python
audio = model.generate(text="这批货物打ZHE2出售后他严重SHE2本了,再也经不起ZHE1腾了。")
```

Pronunciation control (English): use CMU pronunciation dictionary phones (uppercase, in brackets) to override default English pronunciations.

```python
audio = model.generate(text="You could probably still make [IH1 T] look good.")
```
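Because both non-verbal symbols and CMU pronunciation hints share the same bracket syntax, a typo in a tag can silently change the output. A small, hypothetical validation helper (not part of the omnivoice API) can flag bracketed spans that are neither a supported tag nor an uppercase CMU-style phone sequence:

```python
import re

# Supported non-verbal tags, as listed above.
SUPPORTED_TAGS = {
    "[laughter]", "[confirmation-en]", "[question-en]", "[question-ah]",
    "[question-oh]", "[question-ei]", "[question-yi]", "[surprise-ah]",
    "[surprise-oh]", "[surprise-wa]", "[surprise-yo]",
    "[dissatisfaction-hnn]", "[sniff]", "[sigh]",
}

def unknown_tags(text: str) -> list[str]:
    """Return bracketed spans in `text` that are neither supported
    non-verbal tags nor CMU-style pronunciation hints (uppercase phones)."""
    tags = re.findall(r"\[[^\]]+\]", text)
    return [t for t in tags
            if t not in SUPPORTED_TAGS
            and not re.fullmatch(r"\[[A-Z0-9 ]+\]", t)]

print(unknown_tags("[laughter] Make [IH1 T] look good. [cough]"))  # ['[cough]']
```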
## Command-Line Tools

Three CLI entry points are provided. They support all features available in the Python API (voice cloning, voice design, auto voice, generation parameters, etc.), all controlled via command-line arguments.

| Command | Description | Source |
|---|---|---|
| `omnivoice-demo` | Interactive Gradio web demo | `omnivoice/cli/demo.py` |
| `omnivoice-infer` | Single-item inference | `omnivoice/cli/infer.py` |
| `omnivoice-infer-batch` | Batch inference across multiple GPUs | `omnivoice/cli/infer_batch.py` |
### Demo

```bash
omnivoice-demo --ip 0.0.0.0 --port 8001
```

Provides a web UI for voice cloning and voice design. See `omnivoice-demo --help` for all options.
### Single Inference

```bash
# Voice Cloning
# ref_text can be omitted (Whisper will auto-transcribe ref_audio to get it).
omnivoice-infer \
    --model k2-fsa/OmniVoice \
    --text "This is a test for text to speech." \
    --ref_audio ref.wav \
    --ref_text "Transcription of the reference audio." \
    --output hello.wav

# Voice Design
omnivoice-infer \
    --model k2-fsa/OmniVoice \
    --text "This is a test for text to speech." \
    --instruct "male, British accent" \
    --output hello.wav

# Auto Voice
omnivoice-infer \
    --model k2-fsa/OmniVoice \
    --text "This is a test for text to speech." \
    --output hello.wav
```
### Batch Inference

`omnivoice-infer-batch` distributes batch inference across multiple GPUs and is designed for large-scale TTS tasks.

```bash
omnivoice-infer-batch \
    --model k2-fsa/OmniVoice \
    --test_list test.jsonl \
    --res_dir results/
```

The test list is a JSONL file where each line is a JSON object:

```json
{"id": "sample_001", "text": "Hello world", "ref_audio": "/path/to/ref.wav", "ref_text": "Reference transcript", "instruct": "female, british accent", "language_id": "en", "language_name": "English", "duration": 10.0, "speed": 1.0}
```

Only `id` and `text` are mandatory. `ref_audio` and `ref_text` are used in voice cloning mode; `instruct` is used in voice design mode. If neither reference audio nor `instruct` is provided, the model generates speech in a random voice.

`language_id`, `language_name`, `duration`, and `speed` are optional. `duration` (in seconds) fixes the output length; `speed` controls the speaking rate. If both `duration` and `speed` are provided, `speed` is ignored.
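A test list like this can be produced with the standard library alone. A minimal sketch (the entry contents are illustrative, not real data):

```python
import json

# Illustrative entries; only "id" and "text" are mandatory.
entries = [
    {"id": "sample_001", "text": "Hello world",
     "ref_audio": "/path/to/ref.wav",
     "ref_text": "Reference transcript"},          # voice cloning mode
    {"id": "sample_002", "text": "Hello again",
     "instruct": "female, british accent"},        # voice design mode
    {"id": "sample_003", "text": "A third sentence.",
     "duration": 5.0},                             # auto voice, fixed 5 s output
]

# JSONL: one JSON object per line.
with open("test.jsonl", "w", encoding="utf-8") as f:
    for entry in entries:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```

The resulting `test.jsonl` can be passed to `omnivoice-infer-batch --test_list test.jsonl`.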
## Training & Evaluation

See examples/ for the complete pipeline, from data preparation to training, evaluation, and finetuning.

## Discussion & Communication

You can discuss directly on GitHub Issues, or scan the QR code to join our WeChat group or follow our WeChat official account.
## Citation

```bibtex
@article{zhu2026omnivoice,
  title={OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models},
  author={Zhu, Han and Ye, Lingxuan and Kang, Wei and Yao, Zengwei and Guo, Liyong and Kuang, Fangjun and Han, Zhifeng and Zhuang, Weiji and Lin, Long and Povey, Daniel},
  journal={arXiv preprint arXiv:2604.00688},
  year={2026}
}
```

