TTS & Instruct TTS Models

TTS and Instruct TTS Leaderboard

Bold and underlined values denote the best and second-best results.

TTS results on WenetSpeech-Wu-Bench.

| Model | CER (%)↓ | SIM ↑ | IMOS ↑ | SMOS ↑ | AMOS ↑ | CER (%)↓ | SIM ↑ | IMOS ↑ | SMOS ↑ | AMOS ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-TTS† | 5.95 | -- | 4.35 | -- | 4.19 | 16.45 | -- | 4.03 | -- | 3.91 |
| DiaMoE-TTS | 57.05 | 0.702 | 3.11 | 3.43 | 3.52 | 82.52 | 0.587 | 2.83 | 3.14 | 3.22 |
| CosyVoice2 | 10.33 | 0.713 | 3.83 | 3.71 | 3.84 | 82.49 | 0.618 | 3.24 | 3.42 | 3.37 |
| CosyVoice2-Wu-CPT⭐ | 6.35 | 0.727 | 4.01 | 3.84 | 3.92 | 32.97 | 0.620 | 3.72 | 3.55 | 3.63 |
| CosyVoice2-Wu-SFT⭐ | 6.19 | 0.726 | 4.32 | 3.78 | 4.11 | 25.00 | 0.601 | 3.96 | 3.48 | 3.76 |
| CosyVoice2-Wu-SS⭐ | 5.42 | -- | 4.37 | -- | 4.21 | 15.45 | -- | 4.04 | -- | 3.88 |

Performance of instruct TTS model.

| Type | Metric | CosyVoice2-Wu-SFT⭐ | CosyVoice2-Wu-instruct⭐ |
|---|---|---|---|
| Emotion | Happy ↑ | 0.87 | 0.94 |
| | Angry ↑ | 0.83 | 0.87 |
| | Sad ↑ | 0.84 | 0.88 |
| | Surprised ↑ | 0.67 | 0.73 |
| | EMOS ↑ | 3.66 | 3.83 |
| Prosody | Pitch ↑ | 0.24 | 0.74 |
| | Speech Rate ↑ | 0.26 | 0.82 |
| | PMOS ↑ | 2.13 | 3.68 |

TTS Inference

Install

Clone and install

  • Clone the repo:

```shell
git clone https://github.com/ASLP-lab/WenetSpeech-Wu-Repo.git
cd WenetSpeech-Wu-Repo/Generation
```

  • Create the Conda environment:

```shell
conda create -n cosyvoice python=3.10
conda activate cosyvoice
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
```

Model download

```python
from huggingface_hub import snapshot_download

snapshot_download('ASLP-lab/WenetSpeech-Wu-Speech-Generation', local_dir='pretrained_models')
```

Usage

CosyVoice2-Wu-SFT

```shell
# Link the shared CosyVoice2 assets into the SFT directory,
# then rename the SFT checkpoint to llm.pt so CosyVoice2 can load it
ln -s ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2/* ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2-Wu-SFT/
mv ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2-Wu-SFT/SFT.pt ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2-Wu-SFT/llm.pt
```
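The link-and-rename preparation above can also be sketched in Python. This is a hypothetical helper (the function name and parameters are ours, not part of the repo); it mirrors the shell steps for any model variant:

```python
from pathlib import Path


def prepare_variant(base_dir: str, variant_dir: str, ckpt_name: str) -> Path:
    """Symlink shared CosyVoice2 assets into a variant directory and
    rename that variant's checkpoint to llm.pt (mirrors the shell steps)."""
    base, variant = Path(base_dir), Path(variant_dir)
    variant.mkdir(parents=True, exist_ok=True)
    for src in base.iterdir():
        dst = variant / src.name
        if not dst.exists():  # keep the variant's own files (e.g. its checkpoint)
            dst.symlink_to(src.resolve())
    ckpt = variant / ckpt_name
    target = variant / 'llm.pt'
    if ckpt.exists():
        ckpt.rename(target)
    return target
```

The same helper works for the instruct variants below by passing their checkpoint names (e.g. `instruct_Emo.pt`).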
```python
import sys
sys.path.append('third_party/Matcha-TTS')

import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav

# Base model: original CosyVoice2 weights
cosyvoice_base = CosyVoice2(
    'ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2',
    load_jit=False, load_trt=False, load_vllm=False, fp16=False
)

# SFT model: CosyVoice2 fine-tuned on Wu-dialect data
cosyvoice_sft = CosyVoice2(
    'ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2-Wu-SFT',
    load_jit=False, load_trt=False, load_vllm=False, fp16=False
)

prompt_speech_16k = load_wav('figs/A0002_S0003_0_G0003_G0004_33.wav', 16000)
prompt_text = "最少辰光阿拉是做撒呃喃,有钞票就是到银行里保本保息。"
# The <|wuyu|> tag marks the target dialect (Wu)
text = "<|wuyu|>" + "阿拉屋里向养了一只小猫,伊老欢喜晒太阳的,每日下半天总归蹲辣窗口。"

# Base model, instruct-style inference; the instruction means "say this sentence in Shanghainese"
for i, j in enumerate(cosyvoice_base.inference_instruct2(text, '用上海话说这句话', prompt_speech_16k, stream=False)):
    torchaudio.save('base_A0002_S0003_0_G0003_G0004_33_{}.wav'.format(i), j['tts_speech'], cosyvoice_base.sample_rate)

# SFT model, zero-shot inference with a transcribed speech prompt
for i, j in enumerate(cosyvoice_sft.inference_zero_shot(text, prompt_text, prompt_speech_16k, stream=False)):
    torchaudio.save('sft_A0002_S0003_0_G0003_G0004_33_{}.wav'.format(i), j['tts_speech'], cosyvoice_sft.sample_rate)
```

CosyVoice2-Wu-instruct

```shell
# Emotion-instruct variant: link shared assets, rename its checkpoint to llm.pt
ln -s ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2/* ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2-Wu-instruct-emotion/
mv ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2-Wu-instruct-emotion/instruct_Emo.pt ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2-Wu-instruct-emotion/llm.pt

# Prosody-instruct variant: same preparation
ln -s ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2/* ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2-Wu-instruct-prosody/
mv ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2-Wu-instruct-prosody/instruct_Pro.pt ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2-Wu-instruct-prosody/llm.pt
```
```python
import sys
sys.path.append('third_party/Matcha-TTS')

import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav

# Emotion-controlled model
cosyvoice_emo = CosyVoice2(
    'ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2-Wu-instruct-emotion',
    load_jit=False, load_trt=False, load_vllm=False, fp16=False
)

# Prosody-controlled model
cosyvoice_pro = CosyVoice2(
    'ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2-Wu-instruct-prosody',
    load_jit=False, load_trt=False, load_vllm=False, fp16=False
)

prompt_speech_16k = load_wav('figs/A0002_S0003_0_G0003_G0004_33.wav', 16000)
prompt_text = "最少辰光阿拉是做撒呃喃,有钞票就是到银行里保本保息。"
text = "阿拉屋里向养了一只小猫,伊老欢喜晒太阳的,每日下半天总归蹲辣窗口。"

# Emotion control: <|开心|> = happy; the instruction means "say it with a happy emotion"
emo_text = "<|开心|><|wuyu|>" + text
for i, j in enumerate(cosyvoice_emo.inference_instruct2(emo_text, '用开心的情感说', prompt_speech_16k, stream=False)):
    torchaudio.save('emo_A0002_S0003_0_G0003_G0004_33_{}.wav'.format(i), j['tts_speech'], cosyvoice_emo.sample_rate)

# Prosody control: <|男性|> = male, <|语速快|> = fast speech rate, <|基频高|> = high pitch;
# the instruction means "this is a male speaker, speaking fast with a high pitch"
pro_text = "<|男性|><|语速快|><|基频高|><|wuyu|>" + text
for i, j in enumerate(cosyvoice_pro.inference_instruct2(pro_text, '这是一位男性,音调很高语速很快地说', prompt_speech_16k, stream=False)):
    torchaudio.save('pro_A0002_S0003_0_G0003_G0004_33_{}.wav'.format(i), j['tts_speech'], cosyvoice_pro.sample_rate)
```
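The control markers above (`<|wuyu|>`, `<|开心|>`, `<|男性|>`, `<|语速快|>`, `<|基频高|>`) are plain string prefixes prepended to the input text. A tiny hypothetical helper (the function name is ours) keeps the tag formatting consistent:

```python
def with_tags(text: str, *tags: str) -> str:
    """Prefix text with <|...|> control tags (dialect, emotion, prosody)
    in the order given. Hypothetical convenience helper."""
    return ''.join(f'<|{t}|>' for t in tags) + text


# Emotion + dialect, as in the emotion example above
emo_text = with_tags('阿拉屋里向养了一只小猫。', '开心', 'wuyu')

# Gender + rate + pitch + dialect, as in the prosody example above
pro_text = with_tags('阿拉屋里向养了一只小猫。', '男性', '语速快', '基频高', 'wuyu')
```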