license: apache-2.0
base_model:
- Qwen/Qwen3-0.6B
pipeline_tag: text-to-speech
language:
- aae
- aal
- aao
- ab
- abb
- abn
- abr
- abs
- abv
- acm
- acw
- acx
- adf
- adx
- ady
- aeb
- aec
- af
- afb
- afo
- ahl
- ahs
- ajg
- aju
- ala
- aln
- alo
- am
- amu
- an
- anc
- ank
- anp
- anw
- aom
- apc
- apd
- arb
- arq
- ars
- ary
- arz
- as
- ast
- avl
- awo
- ayl
- ayp
- az
- ba
- bag
- bas
- bax
- bba
- bbj
- bbl
- bbu
- bce
- bci
- bcs
- bcy
- bda
- bde
- bdm
- be
- beb
- bew
- bfd
- bft
- bg
- bgp
- bhb
- bhh
- bho
- bhp
- bhr
- bjj
- bjk
- bjn
- bjt
- bkh
- bkm
- bky
- bmm
- bmq
- bn
- bnm
- bnn
- bns
- bo
- bou
- bqg
- br
- bra
- brh
- bri
- brx
- bs
- bsh
- bsj
- bsk
- btm
- btv
- bug
- bum
- buo
- bux
- bwr
- bxf
- byc
- bys
- byv
- byx
- bzc
- bzw
- ca
- ccg
- ceb
- cen
- cfa
- cgg
- chq
- cjk
- ckb
- ckl
- ckr
- cky
- cnh
- cpy
- cs
- cte
- ctl
- cut
- cux
- cv
- cy
- da
- dag
- dar
- dav
- dbd
- dcc
- de
- deg
- dgh
- dgo
- dje
- dmk
- dml
- dru
- dty
- dua
- dv
- dyu
- dzg
- ebr
- ebu
- ego
- eiv
- eko
- ekr
- el
- elm
- en
- eo
- es
- esu
- et
- eto
- ets
- etu
- eu
- ewo
- ext
- eyo
- fa
- fan
- fat
- ff
- ffm
- fi
- fia
- fil
- fip
- fkk
- fmp
- fr
- fub
- fuc
- fue
- fuf
- fuh
- fui
- fuq
- fuv
- fy
- ga
- gbm
- gbr
- gby
- gcc
- gdf
- gej
- ges
- ggg
- gid
- gig
- giz
- gjk
- gju
- gl
- glw
- gn
- gol
- gom
- gsl
- gu
- gui
- gur
- guz
- gv
- gwc
- gwe
- gwt
- gya
- gyz
- ha
- hah
- hao
- haw
- haz
- hbb
- he
- hem
- hi
- hia
- hkk
- hla
- hno
- hoj
- hr
- hsb
- ht
- hu
- hue
- hul
- hux
- hwo
- hy
- hz
- ia
- ibb
- id
- ida
- idu
- ig
- ijc
- ijn
- ik
- ikw
- is
- ish
- iso
- it
- its
- itw
- itz
- ja
- jal
- jax
- jgo
- jmx
- jns
- jqr
- juk
- juo
- jv
- ka
- kab
- kai
- kaj
- kam
- kbd
- kbl
- kbt
- kcq
- kdh
- kea
- keu
- kfe
- kfk
- kfp
- khg
- khw
- kj
- kjc
- kjk
- kk
- kln
- kls
- km
- kmr
- kmy
- kn
- kna
- knn
- ko
- kol
- koo
- kpo
- kqo
- ks
- ksd
- ksf
- kto
- kuh
- kvx
- kw
- kwm
- kxp
- ky
- kyx
- lag
- lb
- lcm
- ldb
- lg
- lij
- lir
- lkb
- lla
- ln
- lnu
- lo
- loa
- lrk
- lss
- lt
- ltg
- lto
- lua
- luo
- lus
- lv
- lwg
- mab
- maf
- mai
- mau
- max
- mbo
- mcf
- mcn
- mcx
- mdd
- mde
- mdf
- mek
- mer
- meu
- mfm
- mfn
- mfo
- mfv
- mgg
- mgi
- mhk
- mhr
- mi
- mig
- miu
- mk
- mkf
- mki
- ml
- mlq
- mn
- mne
- mni
- mqy
- mr
- mrj
- mrr
- mrt
- ms
- mse
- msh
- msw
- mt
- mtr
- mtu
- mtx
- mua
- mug
- mui
- mve
- mvy
- mxs
- mxu
- mxy
- my
- myv
- mzl
- nal
- nan
- nap
- nb
- nbh
- ncf
- nco
- ncx
- ndi
- ng
- ngi
- nhg
- nhi
- nhn
- nhq
- nja
- nl
- nla
- nlv
- nmg
- nmz
- nn
- nnh
- 'no'
- noe
- npi
- nso
- ny
- nyu
- oc
- odk
- odu
- ogo
- om
- orc
- oru
- ory
- os
- pa
- pbs
- pbt
- pbu
- pcm
- pex
- phl
- phr
- pip
- piy
- pko
- pl
- plk
- plt
- pmq
- pms
- pmy
- pnb
- poc
- poe
- pow
- prq
- ps
- pst
- pt
- pua
- pwn
- qug
- qum
- qup
- qur
- qus
- quv
- qux
- quy
- qva
- qvi
- qvj
- qvl
- qwa
- qws
- qxa
- qxp
- qxt
- qxu
- qxw
- rag
- rm
- ro
- rob
- rof
- roo
- rth
- ru
- rup
- rw
- sa
- sah
- sat
- sau
- say
- sbn
- sc
- scl
- scn
- sd
- sei
- shu
- si
- sip
- siw
- sjr
- sk
- skg
- skr
- sl
- sn
- snc
- snk
- so
- sol
- sps
- sq
- sr
- src
- sro
- ssi
- ste
- sua
- sv
- sva
- sw
- szy
- ta
- tan
- tar
- tay
- tbf
- tcf
- tcy
- tdn
- tdx
- te
- tg
- tgc
- th
- the
- thq
- thr
- thv
- ti
- tig
- tio
- tk
- tkg
- tkt
- tli
- tlp
- tn
- tok
- tpl
- tpz
- tqp
- tr
- trp
- trq
- trv
- trw
- tt
- ttj
- ttr
- ttu
- tui
- tul
- tuq
- tuv
- tuy
- tvo
- tvu
- tw
- twu
- txs
- txy
- udl
- ug
- uk
- uki
- umb
- ur
- ush
- uz
- uzn
- vai
- var
- ver
- vi
- vmc
- vmj
- vmm
- vmp
- vmz
- vot
- vro
- wbl
- wci
- weo
- wes
- wja
- wji
- wo
- wof
- xh
- xhe
- xka
- xmf
- xmv
- xmw
- xpe
- xti
- xtu
- yaq
- yav
- yay
- ydd
- ydg
- yer
- 'yes'
- yi
- yo
- yue
- zga
- zgh
- zh
- zoc
- zoh
- zor
- zpv
- zpy
- ztg
- ztn
- ztp
- zts
- ztu
- zu
- zza
# OmniVoice 🌍

OmniVoice is a state-of-the-art zero-shot multilingual TTS model supporting more than 600 languages. Built on a novel diffusion language model architecture, it generates high-quality speech with fast inference, and supports both voice cloning and voice design.
Contents: Key Features | Installation | Quick Start | Python API | Command-Line Tools | Training & Evaluation | Discussion | Citation
## Key Features

- 600+ Languages Supported: the broadest language coverage among zero-shot TTS models (full list).
- Voice Cloning: state-of-the-art voice cloning quality.
- Voice Design: control voices via assigned speaker attributes (gender, age, pitch, dialect/accent, whisper, etc.).
- Fast Inference: RTF as low as 0.025 (40x faster than real time).
- Diffusion Language Model Architecture: a clean, streamlined, and scalable design that delivers both quality and speed.
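The RTF figure above is wall-clock synthesis time divided by the duration of the audio produced; a quick sketch of the arithmetic (the timings below are hypothetical, not measured):

```python
def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: wall-clock synthesis time / generated audio duration."""
    return synthesis_seconds / audio_seconds

# Hypothetical timing: 0.25 s of compute to generate 10 s of audio.
print(rtf(0.25, 10.0))  # 0.025, i.e. 40x faster than real time
```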
## Installation

Choose one of the following methods: pip or uv.

### pip

We recommend using a fresh virtual environment (e.g., conda or venv) to avoid conflicts.
#### Step 1: Install PyTorch

**NVIDIA GPU**

```bash
# Install PyTorch matching your CUDA version, e.g.
pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128
```

See the PyTorch official site for installing other versions.

**Apple Silicon**

```bash
pip install torch==2.8.0 torchaudio==2.8.0
```
#### Step 2: Install OmniVoice (choose one)

```bash
# From PyPI (stable release)
pip install omnivoice

# From the latest source on GitHub (no need to clone)
pip install git+https://github.com/k2-fsa/OmniVoice.git

# For development (clone first, editable install)
git clone https://github.com/k2-fsa/OmniVoice.git
cd OmniVoice
pip install -e .
```
### uv

Clone the repository and sync dependencies:

```bash
git clone https://github.com/k2-fsa/OmniVoice.git
cd OmniVoice
uv sync
```

Tip: you can use a mirror with:

```bash
uv sync --default-index "https://mirrors.aliyun.com/pypi/simple"
```
## Quick Start

Try OmniVoice without writing any code. Launch the local web UI:

```bash
omnivoice-demo --ip 0.0.0.0 --port 8001
```

Or try it directly on HuggingFace Space.

If you have trouble connecting to HuggingFace when downloading the pre-trained models, set the following before running:

```bash
export HF_ENDPOINT="https://hf-mirror.com"
```

For full usage, see the Python API and Command-Line Tools sections below.
## Python API

The OmniVoice model supports three generation modes. All features in this section are also available via command-line tools.

### Voice Cloning

Clone a voice from a short reference audio. Provide `ref_audio` and `ref_text`:

```python
import torch
import torchaudio

from omnivoice import OmniVoice

model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice",
    device_map="cuda:0",  # Apple Silicon users: use device_map="mps" instead
    dtype=torch.float16,
)

audio = model.generate(
    text="Hello, this is a test of zero-shot voice cloning.",
    ref_audio="ref.wav",
    ref_text="Transcription of the reference audio.",
)  # audio is a list of `torch.Tensor` with shape (1, T) at 24 kHz.
# `ref_text` can be omitted entirely; the model will then use Whisper ASR
# to auto-transcribe the reference audio.

torchaudio.save("out.wav", audio[0], 24000)
```
### Voice Design

Describe the desired voice with speaker attributes; no reference audio is needed. Supported attributes: gender (male/female), age (child to elderly), pitch (very low to very high), style (whisper), English accent (American, British, etc.), and Chinese dialect (Sichuanese 四川话, Shaanxi 陕西话, etc.). Attributes are comma-separated and freely combinable across categories.

```python
audio = model.generate(
    text="Hello, this is a test of zero-shot voice design.",
    instruct="female, low pitch, british accent",
)
```

See docs/voice-design.md for the full attribute reference, Chinese equivalents, and usage tips.
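Since the attribute string is plain comma-separated text, it can be assembled programmatically. A small helper (hypothetical, not part of the omnivoice API) illustrates the format:

```python
def build_instruct(gender=None, age=None, pitch=None,
                   style=None, accent=None, dialect=None):
    """Join the provided speaker attributes into a comma-separated
    instruct string; omitted categories are simply left out."""
    parts = [p for p in (gender, age, pitch, style, accent, dialect) if p]
    return ", ".join(parts)

print(build_instruct(gender="female", pitch="low pitch", accent="british accent"))
# female, low pitch, british accent
```

The resulting string can be passed directly as the `instruct` argument of `model.generate()`.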
### Auto Voice

Let the model choose a voice automatically:

```python
audio = model.generate(text="This is a sentence without any voice prompt.")
```
### Generation Parameters

All three modes above share the same `model.generate()` API. You can further control the generation behavior via keyword arguments:

```python
audio = model.generate(
    text="...",
    num_step=32,    # diffusion steps (use 16 for faster inference)
    speed=1.0,      # speed factor (>1.0 faster, <1.0 slower)
    duration=10.0,  # fixed output duration in seconds (overrides speed)
    # ... more options
)
```

See docs/generation-parameters.md for more detailed control.
### Non-Verbal & Pronunciation Control

OmniVoice supports inline non-verbal symbols and pronunciation hints within the input text.

Non-verbal symbols: insert tags like `[laughter]` directly in the text to add expressive non-verbal sounds.

```python
audio = model.generate(text="[laughter] You really got me. I didn't see that coming at all.")
```

Supported tags: `[laughter]`, `[confirmation-en]`, `[question-en]`, `[question-ah]`, `[question-oh]`, `[question-ei]`, `[question-yi]`, `[surprise-ah]`, `[surprise-oh]`, `[surprise-wa]`, `[surprise-yo]`, `[dissatisfaction-hnn]`, `[sniff]`, `[sigh]`

Pronunciation control (Chinese): use pinyin with tone numbers to correct the pronunciation of specific characters.

```python
audio = model.generate(text="这批货物打ZHE2出售后他严重SHE2本了,再也经不起ZHE1腾了。")
```

Pronunciation control (English): use CMU pronunciation dictionary phones (uppercase, in brackets) to override default English pronunciations.

```python
audio = model.generate(text="You could probably still make [IH1 T] look good.")
```
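Because both non-verbal symbols and CMU pronunciation hints share the same bracket syntax, a typo in a tag can silently change the output. A small, hypothetical validation helper (not part of the omnivoice API) can flag bracketed spans that are neither a supported tag nor an uppercase CMU-style phone sequence:

```python
import re

# Supported non-verbal tags, as listed above.
SUPPORTED_TAGS = {
    "[laughter]", "[confirmation-en]", "[question-en]", "[question-ah]",
    "[question-oh]", "[question-ei]", "[question-yi]", "[surprise-ah]",
    "[surprise-oh]", "[surprise-wa]", "[surprise-yo]",
    "[dissatisfaction-hnn]", "[sniff]", "[sigh]",
}

def unknown_tags(text: str) -> list[str]:
    """Return bracketed spans in `text` that are neither supported
    non-verbal tags nor CMU-style pronunciation hints (uppercase phones)."""
    tags = re.findall(r"\[[^\]]+\]", text)
    return [t for t in tags
            if t not in SUPPORTED_TAGS
            and not re.fullmatch(r"\[[A-Z0-9 ]+\]", t)]

print(unknown_tags("[laughter] Make [IH1 T] look good. [cough]"))  # ['[cough]']
```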
## Command-Line Tools

Three CLI entry points are provided. They support all features available in the Python API (voice cloning, voice design, auto voice, generation parameters, etc.), all controlled via command-line arguments.

| Command | Description | Source |
|---|---|---|
| `omnivoice-demo` | Interactive Gradio web demo | `omnivoice/cli/demo.py` |
| `omnivoice-infer` | Single-item inference | `omnivoice/cli/infer.py` |
| `omnivoice-infer-batch` | Batch inference across multiple GPUs | `omnivoice/cli/infer_batch.py` |
### Demo

```bash
omnivoice-demo --ip 0.0.0.0 --port 8001
```

Provides a web UI for voice cloning and voice design. See `omnivoice-demo --help` for all options.
### Single Inference

```bash
# Voice Cloning
# ref_text can be omitted (Whisper will auto-transcribe ref_audio to get it).
omnivoice-infer \
    --model k2-fsa/OmniVoice \
    --text "This is a test for text to speech." \
    --ref_audio ref.wav \
    --ref_text "Transcription of the reference audio." \
    --output hello.wav

# Voice Design
omnivoice-infer \
    --model k2-fsa/OmniVoice \
    --text "This is a test for text to speech." \
    --instruct "male, British accent" \
    --output hello.wav

# Auto Voice
omnivoice-infer \
    --model k2-fsa/OmniVoice \
    --text "This is a test for text to speech." \
    --output hello.wav
```
### Batch Inference

`omnivoice-infer-batch` distributes batch inference across multiple GPUs and is designed for large-scale TTS tasks.

```bash
omnivoice-infer-batch \
    --model k2-fsa/OmniVoice \
    --test_list test.jsonl \
    --res_dir results/
```

The test list is a JSONL file where each line is a JSON object:

```json
{"id": "sample_001", "text": "Hello world", "ref_audio": "/path/to/ref.wav", "ref_text": "Reference transcript", "instruct": "female, british accent", "language_id": "en", "language_name": "English", "duration": 10.0, "speed": 1.0}
```

Only `id` and `text` are mandatory. `ref_audio` and `ref_text` are used in voice cloning mode; `instruct` is used in voice design mode. If neither reference audio nor `instruct` is provided, the model generates speech in a random voice.

`language_id`, `language_name`, `duration`, and `speed` are optional. `duration` (in seconds) fixes the output length; `speed` controls the speaking rate. If both `duration` and `speed` are provided, `speed` is ignored.
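A test list like this can be produced with the standard library alone. A minimal sketch (the entry contents are illustrative, not real data):

```python
import json

# Illustrative entries; only "id" and "text" are mandatory.
entries = [
    {"id": "sample_001", "text": "Hello world",
     "ref_audio": "/path/to/ref.wav",
     "ref_text": "Reference transcript"},          # voice cloning mode
    {"id": "sample_002", "text": "Hello again",
     "instruct": "female, british accent"},        # voice design mode
    {"id": "sample_003", "text": "A third sentence.",
     "duration": 5.0},                             # auto voice, fixed 5 s output
]

# JSONL: one JSON object per line.
with open("test.jsonl", "w", encoding="utf-8") as f:
    for entry in entries:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```

The resulting `test.jsonl` can be passed to `omnivoice-infer-batch --test_list test.jsonl`.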
## Training & Evaluation

See examples/ for the complete pipeline, from data preparation to training, evaluation, and finetuning.

## Discussion & Communication

You can discuss directly on GitHub Issues, or scan the QR code to join our WeChat group or follow our WeChat official account.
## Citation

```bibtex
@article{zhu2026omnivoice,
  title={OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models},
  author={Zhu, Han and Ye, Lingxuan and Kang, Wei and Yao, Zengwei and Guo, Liyong and Kuang, Fangjun and Han, Zhifeng and Zhuang, Weiji and Lin, Long and Povey, Daniel},
  journal={arXiv preprint arXiv:2604.00688},
  year={2026}
}
```

