WenetSpeech-Wu
TTS results on WenetSpeech-Wu-Bench. Bold and underlined values denote the best and second-best results, respectively.
| Model | CER (%)↓ | SIM ↑ | IMOS ↑ | SMOS ↑ | AMOS ↑ | CER (%)↓ | SIM ↑ | IMOS ↑ | SMOS ↑ | AMOS ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-TTS† | 5.95 | -- | 4.35 | -- | 4.19 | 16.45 | -- | 4.03 | -- | 3.91 |
| DiaMoE-TTS | 57.05 | 0.702 | 3.11 | 3.43 | 3.52 | 82.52 | 0.587 | 2.83 | 3.14 | 3.22 |
| CosyVoice2 | 10.33 | 0.713 | 3.83 | 3.71 | 3.84 | 82.49 | 0.618 | 3.24 | 3.42 | 3.37 |
| CosyVoice2-Wu-CPT⭐ | 6.35 | 0.727 | 4.01 | 3.84 | 3.92 | 32.97 | 0.620 | 3.72 | 3.55 | 3.63 |
| CosyVoice2-Wu-SFT⭐ | 6.19 | 0.726 | 4.32 | 3.78 | 4.11 | 25.00 | 0.601 | 3.96 | 3.48 | 3.76 |
| CosyVoice2-Wu-SS⭐ | 5.42 | -- | 4.37 | -- | 4.21 | 15.45 | -- | 4.04 | -- | 3.88 |
Performance of the instruct TTS models.
| Type | Metric | CosyVoice2-Wu-SFT⭐ | CosyVoice2-Wu-instruct⭐ |
|---|---|---|---|
| Emotion | Happy ↑ | 0.87 | 0.94 |
| | Angry ↑ | 0.83 | 0.87 |
| | Sad ↑ | 0.84 | 0.88 |
| | Surprised ↑ | 0.67 | 0.73 |
| | EMOS ↑ | 3.66 | 3.83 |
| Prosody | Pitch ↑ | 0.24 | 0.74 |
| | Speech Rate ↑ | 0.26 | 0.82 |
| | PMOS ↑ | 2.13 | 3.68 |
Clone and install
```shell
git clone https://github.com/ASLP-lab/WenetSpeech-Wu-Repo.git
cd WenetSpeech-Wu-Repo/Generation
conda create -n cosyvoice python=3.10
conda activate cosyvoice
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
```
Download the pretrained models into the directory layout used by the commands below:

```python
from huggingface_hub import snapshot_download
snapshot_download('ASLP-lab/WenetSpeech-Wu-Speech-Generation', local_dir='ASLP-lab/WenetSpeech-Wu-Speech-Generation')
```

Assemble the CosyVoice2-Wu-SFT model directory by linking the shared CosyVoice2 assets and renaming the SFT checkpoint to the `llm.pt` filename CosyVoice2 expects:

```shell
# Use absolute link targets so the symlinks resolve from inside the model directory.
ln -s "$(pwd)"/ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2/* ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2-Wu-SFT/
mv ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2-Wu-SFT/SFT.pt ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2-Wu-SFT/llm.pt
```
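The same link-and-rename step can be done in Python, which avoids shell-specific symlink pitfalls. This is our own sketch, not part of the repository; the helper name is ours, and the directory layout is taken from the commands above:

```python
from pathlib import Path

# Root of the downloaded snapshot; adjust if you used a different local_dir.
BASE = Path('ASLP-lab/WenetSpeech-Wu-Speech-Generation')

def assemble_model_dir(variant_dir, checkpoint_name):
    """Link the shared CosyVoice2 assets into a fine-tuned variant directory
    and rename its checkpoint to the llm.pt filename CosyVoice2 expects."""
    variant = BASE / variant_dir
    variant.mkdir(parents=True, exist_ok=True)
    for asset in (BASE / 'CosyVoice2').iterdir():
        link = variant / asset.name
        if not link.exists():
            # Absolute targets keep the links valid regardless of the cwd.
            link.symlink_to(asset.resolve())
    ckpt = variant / checkpoint_name
    if ckpt.exists():
        ckpt.rename(variant / 'llm.pt')
    return variant

# Example: assemble_model_dir('CosyVoice2-Wu-SFT', 'SFT.pt')
```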
Instruct inference with the base model and zero-shot voice cloning with the SFT model:

```python
import sys
sys.path.append('third_party/Matcha-TTS')
import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav

cosyvoice_base = CosyVoice2(
    'ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2',
    load_jit=False, load_trt=False, load_vllm=False, fp16=False
)
cosyvoice_sft = CosyVoice2(
    'ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2-Wu-SFT',
    load_jit=False, load_trt=False, load_vllm=False, fp16=False
)

prompt_speech_16k = load_wav('figs/A0002_S0003_0_G0003_G0004_33.wav', 16000)
prompt_text = "最少辰光阿拉是做撒呃喃,有钞票就是到银行里保本保息。"
# Prepend the <|wuyu|> language tag to synthesize Wu (Shanghainese) speech.
text = "<|wuyu|>" + "阿拉屋里向养了一只小猫,伊老欢喜晒太阳的,每日下半天总归蹲辣窗口。"

# Instruct inference with the base model ("say this sentence in Shanghainese").
for i, j in enumerate(cosyvoice_base.inference_instruct2(text, '用上海话说这句话', prompt_speech_16k, stream=False)):
    torchaudio.save('A0002_S0003_0_G0003_G0004_33_base_{}.wav'.format(i), j['tts_speech'], cosyvoice_base.sample_rate)

# Zero-shot voice cloning with the SFT model.
for i, j in enumerate(cosyvoice_sft.inference_zero_shot(text, prompt_text, prompt_speech_16k, stream=False)):
    torchaudio.save('A0002_S0003_0_G0003_G0004_33_sft_{}.wav'.format(i), j['tts_speech'], cosyvoice_sft.sample_rate)
```
Assemble the instruct model directories in the same way:

```shell
# Use absolute link targets so the symlinks resolve from inside the model directories.
ln -s "$(pwd)"/ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2/* ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2-Wu-instruct-emotion/
mv ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2-Wu-instruct-emotion/instruct_Emo.pt ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2-Wu-instruct-emotion/llm.pt
ln -s "$(pwd)"/ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2/* ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2-Wu-instruct-prosody/
mv ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2-Wu-instruct-prosody/instruct_Pro.pt ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2-Wu-instruct-prosody/llm.pt
```
Emotion- and prosody-controlled inference:

```python
import sys
sys.path.append('third_party/Matcha-TTS')
import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav

cosyvoice_emo = CosyVoice2(
    'ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2-Wu-instruct-emotion',
    load_jit=False, load_trt=False, load_vllm=False, fp16=False
)
cosyvoice_pro = CosyVoice2(
    'ASLP-lab/WenetSpeech-Wu-Speech-Generation/CosyVoice2-Wu-instruct-prosody',
    load_jit=False, load_trt=False, load_vllm=False, fp16=False
)

prompt_speech_16k = load_wav('figs/A0002_S0003_0_G0003_G0004_33.wav', 16000)
prompt_text = "最少辰光阿拉是做撒呃喃,有钞票就是到银行里保本保息。"
text = "阿拉屋里向养了一只小猫,伊老欢喜晒太阳的,每日下半天总归蹲辣窗口。"

# Emotion control: prepend an emotion tag (here "happy") before the language tag;
# the instruction string means "say it with a happy emotion".
emo_text = "<|开心|><|wuyu|>" + text
for i, j in enumerate(cosyvoice_emo.inference_instruct2(emo_text, '用开心的情感说', prompt_speech_16k, stream=False)):
    torchaudio.save('A0002_S0003_0_G0003_G0004_33_emo_{}.wav'.format(i), j['tts_speech'], cosyvoice_emo.sample_rate)

# Prosody control: prepend speaker-gender, speech-rate, and pitch tags;
# the instruction string means "a male speaking with high pitch and fast rate".
pro_text = "<|男性|><|语速快|><|基频高|><|wuyu|>" + text
for i, j in enumerate(cosyvoice_pro.inference_instruct2(pro_text, '这是一位男性,音调很高语速很快地说', prompt_speech_16k, stream=False)):
    torchaudio.save('A0002_S0003_0_G0003_G0004_33_pro_{}.wav'.format(i), j['tts_speech'], cosyvoice_pro.sample_rate)
```
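The control prefixes used above can be assembled with a small helper. This is our own sketch, not an official API: it covers only the tags that actually appear in this card, and the dictionary key names are assumptions.

```python
# Control tags observed in the examples above; the full tag inventory
# supported by the models is not documented here, so treat this as partial.
EMOTION_TAGS = {'happy': '<|开心|>'}
PROSODY_TAGS = {'male': '<|男性|>', 'fast': '<|语速快|>', 'high_pitch': '<|基频高|>'}
LANG_TAG = '<|wuyu|>'

def build_instruct_text(text, emotion=None, prosody=()):
    """Prepend optional emotion/prosody control tags plus the Wu language tag."""
    prefix = ''
    if emotion is not None:
        prefix += EMOTION_TAGS[emotion]
    for attr in prosody:
        prefix += PROSODY_TAGS[attr]
    return prefix + LANG_TAG + text
```

For example, `build_instruct_text(text, emotion='happy')` reproduces the `emo_text` string above, and `build_instruct_text(text, prosody=('male', 'fast', 'high_pitch'))` reproduces `pro_text`.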
Base model
FunAudioLLM/CosyVoice2-0.5B