TTS & Instruct TTS Models

TTS and Instruct TTS Leaderboard

Bold and underlined values denote the best and second-best results.

TTS results on WenetSpeech-Wu-Bench.

| Model | CER (%)↓ | SIM ↑ | IMOS ↑ | SMOS ↑ | AMOS ↑ | CER (%)↓ | SIM ↑ | IMOS ↑ | SMOS ↑ | AMOS ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-TTS† | 5.95 | -- | 4.35 | -- | 4.19 | 16.45 | -- | 4.03 | -- | 3.91 |
| DiaMoE-TTS | 57.05 | 0.702 | 3.11 | 3.43 | 3.52 | 82.52 | 0.587 | 2.83 | 3.14 | 3.22 |
| CosyVoice2 | 10.33 | 0.713 | 3.83 | 3.71 | 3.84 | 82.49 | 0.618 | 3.24 | 3.42 | 3.37 |
| CosyVoice2-Wu-CPT⭐ | 6.35 | 0.727 | 4.01 | 3.84 | 3.92 | 32.97 | 0.620 | 3.72 | 3.55 | 3.63 |
| CosyVoice2-Wu-SFT⭐ | 6.19 | 0.726 | 4.32 | 3.78 | 4.11 | 25.00 | 0.601 | 3.96 | 3.48 | 3.76 |
| CosyVoice2-Wu-SS⭐ | 5.42 | -- | 4.37 | -- | 4.21 | 15.45 | -- | 4.04 | -- | 3.88 |

Performance of instruct TTS model.

| Type | Metric | CosyVoice2-Wu-SFT⭐ | CosyVoice2-Wu-instruct⭐ |
|---|---|---|---|
| Emotion | Happy ↑ | 0.87 | 0.94 |
| | Angry ↑ | 0.83 | 0.87 |
| | Sad ↑ | 0.84 | 0.88 |
| | Surprised ↑ | 0.67 | 0.73 |
| | EMOS ↑ | 3.66 | 3.83 |
| Prosody | Pitch ↑ | 0.24 | 0.74 |
| | Speech Rate ↑ | 0.26 | 0.82 |
| | PMOS ↑ | 2.13 | 3.68 |

TTS Inference

Install

Clone and install

  • Clone the repo:

```shell
git clone https://github.com/ASLP-lab/WenetSpeech-Wu-Repo.git
cd WenetSpeech-Wu-Repo/Generation
```

  • Create the Conda environment:

```shell
conda create -n cosyvoice python=3.10
conda activate cosyvoice
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
```

Model download

```python
from huggingface_hub import snapshot_download

snapshot_download('ASLP-lab/WenetSpeech-Wu-Speech-Generation', local_dir='pretrained_models')
```

Usage

CosyVoice2-Wu-SFT

```shell
# Link the shared CosyVoice2 assets into the SFT directory,
# then rename the SFT checkpoint to llm.pt so CosyVoice2 can load it
ln -s ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2/* ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2-Wu-SFT/
mv ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2-Wu-SFT/SFT.pt ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2-Wu-SFT/llm.pt
```
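The link-and-rename preparation above can also be sketched in Python. This is a hypothetical helper (the function name and parameters are ours, not part of the repo); it mirrors the shell steps for any model variant:

```python
from pathlib import Path


def prepare_variant(base_dir: str, variant_dir: str, ckpt_name: str) -> Path:
    """Symlink shared CosyVoice2 assets into a variant directory and
    rename that variant's checkpoint to llm.pt (mirrors the shell steps)."""
    base, variant = Path(base_dir), Path(variant_dir)
    variant.mkdir(parents=True, exist_ok=True)
    for src in base.iterdir():
        dst = variant / src.name
        if not dst.exists():  # keep the variant's own files (e.g. its checkpoint)
            dst.symlink_to(src.resolve())
    ckpt = variant / ckpt_name
    target = variant / 'llm.pt'
    if ckpt.exists():
        ckpt.rename(target)
    return target
```

The same helper works for the instruct variants below by passing their checkpoint names (e.g. `instruct_Emo.pt`).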
```python
import sys
sys.path.append('third_party/Matcha-TTS')

import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav

# Base model: original CosyVoice2 weights
cosyvoice_base = CosyVoice2(
    'ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2',
    load_jit=False, load_trt=False, load_vllm=False, fp16=False
)

# SFT model: CosyVoice2 fine-tuned on Wu-dialect data
cosyvoice_sft = CosyVoice2(
    'ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2-Wu-SFT',
    load_jit=False, load_trt=False, load_vllm=False, fp16=False
)

prompt_speech_16k = load_wav('figs/A0002_S0003_0_G0003_G0004_33.wav', 16000)
prompt_text = "最少辰光阿拉是做撒呃喃,有钞票就是到银行里保本保息。"
# The <|wuyu|> tag marks the target dialect (Wu)
text = "<|wuyu|>" + "阿拉屋里向养了一只小猫,伊老欢喜晒太阳的,每日下半天总归蹲辣窗口。"

# Base model, instruct-style inference; the instruction means "say this sentence in Shanghainese"
for i, j in enumerate(cosyvoice_base.inference_instruct2(text, '用上海话说这句话', prompt_speech_16k, stream=False)):
    torchaudio.save('base_A0002_S0003_0_G0003_G0004_33_{}.wav'.format(i), j['tts_speech'], cosyvoice_base.sample_rate)

# SFT model, zero-shot inference with a transcribed speech prompt
for i, j in enumerate(cosyvoice_sft.inference_zero_shot(text, prompt_text, prompt_speech_16k, stream=False)):
    torchaudio.save('sft_A0002_S0003_0_G0003_G0004_33_{}.wav'.format(i), j['tts_speech'], cosyvoice_sft.sample_rate)
```

CosyVoice2-Wu-instruct

```shell
# Emotion-instruct variant: link shared assets, rename its checkpoint to llm.pt
ln -s ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2/* ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2-Wu-instruct-emotion/
mv ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2-Wu-instruct-emotion/instruct_Emo.pt ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2-Wu-instruct-emotion/llm.pt

# Prosody-instruct variant: same preparation
ln -s ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2/* ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2-Wu-instruct-prosody/
mv ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2-Wu-instruct-prosody/instruct_Pro.pt ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2-Wu-instruct-prosody/llm.pt
```
```python
import sys
sys.path.append('third_party/Matcha-TTS')

import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav

# Emotion-controlled model
cosyvoice_emo = CosyVoice2(
    'ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2-Wu-instruct-emotion',
    load_jit=False, load_trt=False, load_vllm=False, fp16=False
)

# Prosody-controlled model
cosyvoice_pro = CosyVoice2(
    'ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2-Wu-instruct-prosody',
    load_jit=False, load_trt=False, load_vllm=False, fp16=False
)

prompt_speech_16k = load_wav('figs/A0002_S0003_0_G0003_G0004_33.wav', 16000)
prompt_text = "最少辰光阿拉是做撒呃喃,有钞票就是到银行里保本保息。"
text = "阿拉屋里向养了一只小猫,伊老欢喜晒太阳的,每日下半天总归蹲辣窗口。"

# Emotion control: <|开心|> = happy; the instruction means "say it with a happy emotion"
emo_text = "<|开心|><|wuyu|>" + text
for i, j in enumerate(cosyvoice_emo.inference_instruct2(emo_text, '用开心的情感说', prompt_speech_16k, stream=False)):
    torchaudio.save('emo_A0002_S0003_0_G0003_G0004_33_{}.wav'.format(i), j['tts_speech'], cosyvoice_emo.sample_rate)

# Prosody control: <|男性|> = male, <|语速快|> = fast speech rate, <|基频高|> = high pitch;
# the instruction means "this is a male speaker, speaking fast with a high pitch"
pro_text = "<|男性|><|语速快|><|基频高|><|wuyu|>" + text
for i, j in enumerate(cosyvoice_pro.inference_instruct2(pro_text, '这是一位男性,音调很高语速很快地说', prompt_speech_16k, stream=False)):
    torchaudio.save('pro_A0002_S0003_0_G0003_G0004_33_{}.wav'.format(i), j['tts_speech'], cosyvoice_pro.sample_rate)
```
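The control markers above (`<|wuyu|>`, `<|开心|>`, `<|男性|>`, `<|语速快|>`, `<|基频高|>`) are plain string prefixes prepended to the input text. A tiny hypothetical helper (the function name is ours) keeps the tag formatting consistent:

```python
def with_tags(text: str, *tags: str) -> str:
    """Prefix text with <|...|> control tags (dialect, emotion, prosody)
    in the order given. Hypothetical convenience helper."""
    return ''.join(f'<|{t}|>' for t in tags) + text


# Emotion + dialect, as in the emotion example above
emo_text = with_tags('阿拉屋里向养了一只小猫。', '开心', 'wuyu')

# Gender + rate + pitch + dialect, as in the prosody example above
pro_text = with_tags('阿拉屋里向养了一只小猫。', '男性', '语速快', '基频高', 'wuyu')
```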