	---
	language:
	- de
	license: apache-2.0
	tags:
	- text-to-speech
	- german
	- cosyvoice3
	- thorsten-voice
	pipeline_tag: text-to-speech
	datasets:
	- Thorsten-Voice/thorsten-voice-2022-10
	---

	# Thorsten-Voice · CosyVoice3

	German Text-to-Speech fine-tune of [FunAudioLLM/Fun-CosyVoice3-0.5B-2512](https://huggingface.co/FunAudioLLM/Fun-CosyVoice3-0.5B-2512) on the [Thorsten-Voice 2022.10](https://www.thorsten-voice.de) dataset (12,283 German utterances).

	This repository contains two fine-tuned model components:

	| File | Size | Description |
	|------|------|-------------|
	| `llm.pt` | 1.9 GB | Fine-tuned LLM (speech rhythm, prosody, speaker style) |
	| `flow.pt` | 1.3 GB | Fine-tuned Flow Decoder (voice timbre, spectral characteristics) |

	The HiFi-GAN vocoder (`hift.pt`) is used unchanged from the base model.

	---

	## Quickstart with Docker

	The easiest way to use this model is via the official Docker container:

	```bash
	docker run -p 8000:8000 \
	-v cosyvoice_models:/app/CosyVoice/pretrained_models \
	thorstenvoice/cosyvoice-tts

	# Then, from a second terminal, generate audio:
	curl -X POST http://localhost:8000/tts \
	-F "text=Hallo, ich bin Thorsten. Schön, dass du da bist." \
	--output thorsten.wav
	```

	→ [Docker Hub: thorstenvoice/cosyvoice-tts](https://hub.docker.com/r/thorstenvoice/cosyvoice-tts)
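
For scripted use, the same request can be sent from Python. The sketch below is a minimal standard-library client, assuming the container exposes the `/tts` endpoint on port 8000 and accepts a multipart form field named `text`, exactly as in the `curl` call above. The helper names `build_multipart` and `synthesize` are illustrative and not part of the container's API.

```python
import urllib.request
import uuid


def build_multipart(fields: dict[str, str]) -> tuple[bytes, str]:
    """Encode plain text fields as a multipart/form-data body.

    Returns the encoded body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        )
    parts.append(f"--{boundary}--\r\n")
    body = "".join(parts).encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"


def synthesize(text, out_path="thorsten.wav",
               url="http://localhost:8000/tts", timeout=600):
    """POST the text to the (assumed) /tts endpoint and save the reply."""
    body, content_type = build_multipart({"text": text})
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        audio = response.read()
    with open(out_path, "wb") as wav_file:
        wav_file.write(audio)
    return out_path
```

With the container running, `synthesize("Hallo, ich bin Thorsten. Schön, dass du da bist.")` should write `thorsten.wav` to the current directory. The generous timeout is deliberate, since CPU-only synthesis can take minutes.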

	---

	## Manual Installation

	### 1. Clone CosyVoice at the correct commit

	```bash
	git clone https://github.com/FunAudioLLM/CosyVoice.git
	cd CosyVoice
	git checkout ace7c47
	git submodule update --init --recursive
	```

	### 2. Install dependencies

	Python 3.10 or 3.11 is recommended; Python 3.12 needs the changes listed under "Python 3.12 patches".

	```bash
	sudo apt-get install -y sox libsox-fmt-all ffmpeg
	pip install setuptools --upgrade
	pip install openai-whisper
	grep -v "openai-whisper" requirements.txt > requirements_fixed.txt
	pip install -r requirements_fixed.txt
	```
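
Before installing, it can help to confirm which interpreter is active. The following sketch classifies the running Python version against the guidance in this README. The function name `python_status` and the label `"untested"` for other versions are assumptions made for illustration.

```python
import sys


def python_status(version_info=sys.version_info):
    """Classify an interpreter version against this README's guidance."""
    major_minor = (version_info[0], version_info[1])
    if major_minor in ((3, 10), (3, 11)):
        return "recommended"
    if major_minor == (3, 12):
        return "needs patches"
    # Assumption: anything else is simply not covered by this README.
    return "untested"


print(f"Python {sys.version_info[0]}.{sys.version_info[1]}: {python_status()}")
```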

	### 3. Set PYTHONPATH

	```bash
	export PYTHONPATH=/path/to/CosyVoice:/path/to/CosyVoice/third_party/Matcha-TTS:$PYTHONPATH
	```
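
A quick way to check that `PYTHONPATH` is set correctly is to ask Python whether it can locate the two packages without importing them. This sketch assumes the top-level package names are `cosyvoice` and `matcha`, which matches the directory names used elsewhere in this README; `find_missing` is a hypothetical helper.

```python
import importlib.util


def find_missing(modules=("cosyvoice", "matcha")):
    """Return the top-level packages that Python cannot locate."""
    return [name for name in modules
            if importlib.util.find_spec(name) is None]


missing = find_missing()
if missing:
    print("Not importable, check PYTHONPATH:", ", ".join(missing))
else:
    print("cosyvoice and matcha are both importable.")
```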

	### 4. Download models

	```bash
	pip install huggingface_hub

	# Base model
	hf download FunAudioLLM/Fun-CosyVoice3-0.5B-2512 \
	--local-dir pretrained_models/CosyVoice3-0.5B

	# Thorsten fine-tuned weights
	hf download Thorsten-Voice/CosyVoice3 \
	--local-dir pretrained_models/CosyVoice3-0.5B \
	--include "llm.pt" "flow.pt" "spk2info.pt" "infer_thorsten.py"
	```
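
After both downloads finish, the model directory should contain the base model files plus the Thorsten weights. The sketch below checks only the files named in this README. It assumes `hift.pt` sits at the top level of the base model download, and `missing_files` is a hypothetical helper.

```python
from pathlib import Path

# Files named in this README; adjust if the repository layout differs.
EXPECTED_FILES = ("llm.pt", "flow.pt", "hift.pt",
                  "spk2info.pt", "infer_thorsten.py")


def missing_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]


missing = missing_files("pretrained_models/CosyVoice3-0.5B")
print("Missing files:", missing if missing else "none")
```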

	### 5. Generate audio

	```bash
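	# Step 4 saved infer_thorsten.py under pretrained_models/CosyVoice3-0.5B/;
	# copy it to the current directory or adjust the path below.
	python3 infer_thorsten.py \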
	--text "Hallo, ich bin Thorsten. Schön, dass du da bist." \
	--output thorsten.wav
	```
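
To synthesize several sentences in one go, the command above can be wrapped in a small loop. This sketch uses only the `--text` and `--output` flags shown above; the helper names and the output naming scheme are illustrative. Each call starts a new process, so the model is presumably reloaded every time, which makes this convenient rather than fast.

```python
import subprocess


def build_command(text, output, script="infer_thorsten.py"):
    """Build the argument list for one call to the inference script."""
    return ["python3", script, "--text", text, "--output", output]


def synthesize_all(sentences, prefix="thorsten"):
    """Run the script once per sentence; returns the output file names."""
    outputs = []
    for index, sentence in enumerate(sentences, start=1):
        output = f"{prefix}_{index:03d}.wav"
        # check=True raises CalledProcessError if the script fails.
        subprocess.run(build_command(sentence, output), check=True)
        outputs.append(output)
    return outputs
```

For example, `synthesize_all(["Guten Morgen.", "Gute Nacht."])` would write `thorsten_001.wav` and `thorsten_002.wav`.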

	---

	## Performance

	The timings in the table below were measured with these two test texts:

	Short (~8 words):
	> "Hallo, hier ist Thorsten. Schön, dass Du da bist."

	Long (57 words):
	> "Für mich sind alle Menschen gleich, unabhängig von Geschlecht, sexueller Orientierung, Religion, Hautfarbe oder Geokoordinaten der Geburt. Ich glaube an eine globale Welt, wo jeder überall willkommen ist und freies Wissen und Bildung kostenfrei für jeden zur Verfügung steht. Ich habe meine Stimme der Allgemeinheit gespendet, in der Hoffnung darauf, dass sie in diesem Sinne genutzt wird."

	| Hardware | Short text | Long text |
	|----------|------------|-----------|
	| MacBook Air M1 (CPU) | 47 s | 4 min 30 s |
	| QNAP NAS Intel (CPU) | 50 s | n/a |
	| RunPod RTX 4090 (GPU) | 2.9 s | 12.9 s |
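
To put these numbers in relation, the following sketch computes how much longer each system takes than the GPU. The timings are copied from the table above, with 4:30 min converted to 270 s. For the MacBook Air M1 this works out to roughly 16 times the GPU time on the short text and roughly 21 times on the long text. The dictionary layout and the `speedup` helper are illustrative.

```python
# Timings in seconds, copied from the table above (4:30 min = 270 s).
TIMINGS = {
    "MacBook Air M1 (CPU)": {"short": 47.0, "long": 270.0},
    "QNAP NAS Intel (CPU)": {"short": 50.0, "long": None},  # no long-text value
    "RunPod RTX 4090 (GPU)": {"short": 2.9, "long": 12.9},
}


def speedup(baseline, reference="RunPod RTX 4090 (GPU)", text="short"):
    """Return baseline time divided by reference time, or None if unknown."""
    slow = TIMINGS[baseline][text]
    fast = TIMINGS[reference][text]
    if slow is None or fast is None:
        return None
    return slow / fast


for name in TIMINGS:
    for text in ("short", "long"):
        ratio = speedup(name, text=text)
        if ratio is not None:
            print(f"{name}, {text} text: {ratio:.1f}x the GPU time")
```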

	---

	## Python 3.12 patches

	Apply these two patches when running on Python 3.12.

	1. In `cosyvoice/flow/flow.py`, add the following after `conds = conds.transpose(1, 2)` in `CausalMaskedDiffWithDiT.forward()`:

	```python
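	# Trim every tensor to the shorter of the two sequence lengths
	# so that h, feat, conds and mask agree along the time axis.
	min_len = min(h.shape[1], feat.shape[1])
	h = h[:, :min_len, :]
	feat = feat[:, :min_len, :]
	conds = conds[:, :, :min_len]
	mask = mask[:, :min_len]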
	```

	2. Empty `third_party/Matcha-TTS/matcha/utils/__init__.py` (the command overwrites the file with a single newline):

	```bash
	echo "" > third_party/Matcha-TTS/matcha/utils/__init__.py
	```
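
The effect of the first patch can be reproduced on dummy arrays. In this sketch NumPy stands in for PyTorch, since the slicing syntax is the same. The variable names come from the patch, but the shapes, the one-frame mismatch, and the role described in each comment are assumptions chosen for illustration, not values taken from the model.

```python
import numpy as np

batch, dim = 2, 80  # illustrative sizes, not taken from the model config

h = np.zeros((batch, 101, dim))      # assumed: one frame longer than feat
feat = np.zeros((batch, 100, dim))   # assumed: (batch, time, dim)
conds = np.zeros((batch, dim, 100))  # already transposed: (batch, dim, time)
mask = np.ones((batch, 101))         # assumed: (batch, time)

# Same statements as in the patch: cut everything to the shorter length.
min_len = min(h.shape[1], feat.shape[1])
h = h[:, :min_len, :]
feat = feat[:, :min_len, :]
conds = conds[:, :, :min_len]
mask = mask[:, :min_len]

print(h.shape, feat.shape, conds.shape, mask.shape)
# (2, 100, 80) (2, 100, 80) (2, 80, 100) (2, 100)
```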

	---

	## Training details

	| Component | Base model | Epochs | Dataset |
	|-----------|------------|--------|---------|
	| LLM | Fun-CosyVoice3-0.5B-2512 | 1 | Thorsten-Voice 2022.10 (12,283 utterances) |
	| Flow Decoder | Fun-CosyVoice3-0.5B-2512 | 9 | Thorsten-Voice 2022.10 (12,283 utterances) |
	| HiFi-GAN | Fun-CosyVoice3-0.5B-2512 | n/a | not fine-tuned |

	Hardware: NVIDIA A40 (48 GB VRAM)

	---

	## License

	Apache 2.0 — same as the base model.
	The Thorsten-Voice dataset is licensed under [CC0](https://creativecommons.org/publicdomain/zero/1.0/).

	---

	## Citation

	```bibtex
	@article{du2025cosyvoice,
	title={CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training},
	author={Du, Zhihao and others},
	journal={arXiv preprint arXiv:2505.17589},
	year={2025}
	}
	```

	---

	## Links

	- [Thorsten-Voice Website](https://www.thorsten-voice.de)
	- [Docker Container](https://hub.docker.com/r/thorstenvoice/cosyvoice-tts)
	- [CosyVoice GitHub](https://github.com/FunAudioLLM/CosyVoice)
	- [Base Model](https://huggingface.co/FunAudioLLM/Fun-CosyVoice3-0.5B-2512)
	- [Source on GitHub](https://github.com/thorstenMueller/Thorsten-Voice/tree/main/docker/cosyvoice)