Added performance test values
README.md
CHANGED
@@ -27,12 +27,28 @@ The HiFi-GAN vocoder (`hift.pt`) is used unchanged from the base model.
 
 ---
 
-##
 
-
 
-
-
 
 ```bash
 git clone https://github.com/FunAudioLLM/CosyVoice.git
@@ -43,17 +59,12 @@ git submodule update --init --recursive
 
 ### 2. Install dependencies
 
-Python 3.10 or 3.11
 
 ```bash
-# System dependencies
 sudo apt-get install -y sox libsox-fmt-all ffmpeg
-
-# Install openai-whisper first (before requirements.txt)
 pip install setuptools --upgrade
 pip install openai-whisper
-
-# Install remaining dependencies
 grep -v "openai-whisper" requirements.txt > requirements_fixed.txt
 pip install -r requirements_fixed.txt
 ```
@@ -64,31 +75,22 @@ pip install -r requirements_fixed.txt
 export PYTHONPATH=/path/to/CosyVoice:/path/to/CosyVoice/third_party/Matcha-TTS:$PYTHONPATH
 ```
 
-### 4. Download
 
 ```bash
 pip install huggingface_hub
 
 hf download FunAudioLLM/Fun-CosyVoice3-0.5B-2512 \
 --local-dir pretrained_models/CosyVoice3-0.5B
-```
-
-### 5. Download this model
 
-
 hf download Thorsten-Voice/CosyVoice3 \
 --local-dir pretrained_models/CosyVoice3-0.5B \
 --include "llm.pt" "flow.pt" "spk2info.pt" "infer_thorsten.py"
 ```
 
-
-All other files (`hift.pt`, `campplus.onnx`, etc.) remain from the base model.
-
----
-
-## Inference
-
-### Quick start
 
 ```bash
 python3 infer_thorsten.py \
@@ -96,27 +98,31 @@ python3 infer_thorsten.py \
 --output thorsten.wav
 ```
 
-
 
-
-
-
-
-
-
-
 
 ---
 
 ## Python 3.12 patches
 
-
-
-**1. `cosyvoice/flow/flow.py`** — add these lines inside `CausalMaskedDiffWithDiT.forward()`,
-directly after `conds = conds.transpose(1, 2)` and before `loss, _ = self.decoder.compute_loss(...)`:
 
 ```python
-# Alignment fix
 min_len = min(h.shape[1], feat.shape[1])
 h = h[:, :min_len, :]
 feat = feat[:, :min_len, :]
@@ -124,7 +130,7 @@ conds = conds[:, :, :min_len]
 mask = mask[:, :min_len]
 ```
 
-**2. `third_party/Matcha-TTS/matcha/utils/__init__.py`**
 
 ```bash
 echo "" > third_party/Matcha-TTS/matcha/utils/__init__.py
@@ -146,9 +152,8 @@ echo "" > third_party/Matcha-TTS/matcha/utils/__init__.py
 
 ## License
 
-
-
-The Thorsten-Voice dataset is licensed under [Creative Commons Zero (CC0)](https://creativecommons.org/publicdomain/zero/1.0/).
 
 ---
 
@@ -168,5 +173,7 @@ The Thorsten-Voice dataset is licensed under [Creative Commons Zero (CC0)](https
 ## Links
 
 - [Thorsten-Voice Website](https://www.thorsten-voice.de)
 - [CosyVoice GitHub](https://github.com/FunAudioLLM/CosyVoice)
-- [Base Model
@@ -27,12 +27,28 @@ The HiFi-GAN vocoder (`hift.pt`) is used unchanged from the base model.
 
 ---
 
+## Quickstart with Docker
 
+The easiest way to use this model is via the official Docker container:
+
+```bash
+docker run -p 8000:8000 \
+-v cosyvoice_models:/app/CosyVoice/pretrained_models \
+thorstenvoice/cosyvoice-tts
+
+# Then generate audio:
+curl -X POST http://localhost:8000/tts \
+-F "text=Hallo, ich bin Thorsten. Schön, dass du da bist." \
+--output thorsten.wav
+```
+
+→ [Docker Hub: thorstenvoice/cosyvoice-tts](https://hub.docker.com/r/thorstenvoice/cosyvoice-tts)
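
For callers who prefer Python over `curl`, the same request can be sketched with the standard library alone. The endpoint URL and the `text` form field come from the `curl` command above; that the response body is raw WAV audio is an assumption based on `--output thorsten.wav`, and the function names `build_tts_request` and `synthesize` are hypothetical.

```python
import urllib.request
import uuid


def build_tts_request(text: str, url: str = "http://localhost:8000/tts") -> urllib.request.Request:
    """Build a multipart/form-data POST carrying one `text` field, like `curl -F`."""
    boundary = uuid.uuid4().hex
    body = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="text"\r\n'
        "\r\n"
        f"{text}\r\n"
        f"--{boundary}--\r\n"
    ).encode("utf-8")
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def synthesize(text: str, out_path: str = "thorsten.wav", timeout: float = 600.0) -> None:
    """Send the request and save the response body (assumed to be WAV audio)."""
    with urllib.request.urlopen(build_tts_request(text), timeout=timeout) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
```

With the container running, `synthesize("Hallo, ich bin Thorsten. Schön, dass du da bist.")` should write `thorsten.wav` to the current directory. The generous default timeout is deliberate, since CPU-only synthesis can take well over a minute for longer texts.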
+
+---
 
+## Manual Installation
+
+### 1. Clone CosyVoice at the correct commit
 
 ```bash
 git clone https://github.com/FunAudioLLM/CosyVoice.git
@@ -43,17 +59,12 @@ git submodule update --init --recursive
 
 ### 2. Install dependencies
 
+Python 3.10 or 3.11 recommended.
 
 ```bash
 sudo apt-get install -y sox libsox-fmt-all ffmpeg
 pip install setuptools --upgrade
 pip install openai-whisper
 grep -v "openai-whisper" requirements.txt > requirements_fixed.txt
 pip install -r requirements_fixed.txt
 ```
@@ -64,31 +75,22 @@ pip install -r requirements_fixed.txt
 export PYTHONPATH=/path/to/CosyVoice:/path/to/CosyVoice/third_party/Matcha-TTS:$PYTHONPATH
 ```
 
+### 4. Download models
 
 ```bash
 pip install huggingface_hub
 
+# Base model
 hf download FunAudioLLM/Fun-CosyVoice3-0.5B-2512 \
 --local-dir pretrained_models/CosyVoice3-0.5B
 
+# Thorsten fine-tuned weights
 hf download Thorsten-Voice/CosyVoice3 \
 --local-dir pretrained_models/CosyVoice3-0.5B \
 --include "llm.pt" "flow.pt" "spk2info.pt" "infer_thorsten.py"
 ```
```
|
| 92 |
|
| 93 |
+
### 5. Generate audio
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 94 |
|
| 95 |
```bash
|
| 96 |
python3 infer_thorsten.py \
|
|
|
|
@@ -96,27 +98,31 @@ python3 infer_thorsten.py \
 --output thorsten.wav
 ```
 
+---
 
+## Performance
+
+Benchmarked with these two test texts:
+
+**Short** (~8 words):
+> "Hallo, hier ist Thorsten. Schön, dass Du da bist."
+
+**Long** (~80 words):
+> "Für mich sind alle Menschen gleich, unabhängig von Geschlecht, sexueller Orientierung, Religion, Hautfarbe oder Geokoordinaten der Geburt. Ich glaube an eine globale Welt, wo jeder überall willkommen ist und freies Wissen und Bildung kostenfrei für jeden zur Verfügung steht. Ich habe meine Stimme der Allgemeinheit gespendet, in der Hoffnung darauf, dass sie in diesem Sinne genutzt wird."
+
+| Hardware | Short text | Long text |
+|----------|-----------|-----------|
+| MacBook Air M1 (CPU) | 47s | 4:30 min |
+| QNAP NAS Intel (CPU) | 50s | — |
+| RunPod RTX 4090 (GPU) | **2.9s** | **12.9s** |
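
To put the table in relative terms, the short sketch below converts the reported timings to seconds and computes how much slower each CPU machine is than the RTX 4090 run. The numbers are copied from the table; the helper names `to_seconds` and `slowdown_vs` are hypothetical, and the missing QNAP long-text value stays `None` rather than being estimated.

```python
def to_seconds(value: str) -> float:
    """Parse '47s', '**2.9s**', or '4:30 min' into seconds."""
    value = value.replace("*", "").strip()
    if ":" in value:
        minutes, seconds = value.split()[0].split(":")
        return int(minutes) * 60 + float(seconds)
    return float(value.rstrip("s"))


# Timings as reported in the table; None marks the run that was not measured.
TIMINGS = {
    "MacBook Air M1 (CPU)": ("47s", "4:30 min"),
    "QNAP NAS Intel (CPU)": ("50s", None),
    "RunPod RTX 4090 (GPU)": ("**2.9s**", "**12.9s**"),
}


def slowdown_vs(baseline: str = "RunPod RTX 4090 (GPU)") -> dict:
    """Return (short, long) slowdown factors of each machine relative to baseline."""
    base_short, base_long = (to_seconds(v) for v in TIMINGS[baseline])
    result = {}
    for name, (short, long_) in TIMINGS.items():
        if name == baseline:
            continue
        result[name] = (
            round(to_seconds(short) / base_short, 1),
            round(to_seconds(long_) / base_long, 1) if long_ else None,
        )
    return result


for machine, factors in slowdown_vs().items():
    print(machine, factors)
```

On these figures the M1 CPU run is roughly 16 times slower than the GPU for the short text and roughly 21 times slower for the long one.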
 
 ---
 
 ## Python 3.12 patches
 
+**1. `cosyvoice/flow/flow.py`** — add after `conds = conds.transpose(1, 2)` in `CausalMaskedDiffWithDiT.forward()`:
 
 ```python
 min_len = min(h.shape[1], feat.shape[1])
 h = h[:, :min_len, :]
 feat = feat[:, :min_len, :]
@@ -124,7 +130,7 @@ conds = conds[:, :, :min_len]
 mask = mask[:, :min_len]
 ```
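
What the patch does can be tried in isolation. The sketch below uses NumPy arrays as a stand-in for the PyTorch tensors in `flow.py`, since the slicing syntax is the same. The axis layouts, (batch, time, features) for `h` and `feat`, (batch, channels, time) for `conds`, and (batch, time) for `mask`, are inferred from the slices in the patch, and the concrete shapes and the wrapper name `align_lengths` are hypothetical.

```python
import numpy as np


def align_lengths(h, feat, conds, mask):
    """Truncate all four arrays to the shorter time length of h and feat."""
    min_len = min(h.shape[1], feat.shape[1])
    h = h[:, :min_len, :]
    feat = feat[:, :min_len, :]
    conds = conds[:, :, :min_len]
    mask = mask[:, :min_len]
    return h, feat, conds, mask


# Hypothetical shapes: batch of 2, 80 channels, time lengths that differ by one frame.
h = np.zeros((2, 101, 80))
feat = np.zeros((2, 100, 80))
conds = np.zeros((2, 80, 100))  # already transposed to (batch, channels, time)
mask = np.ones((2, 101))

h, feat, conds, mask = align_lengths(h, feat, conds, mask)
print(h.shape, feat.shape, conds.shape, mask.shape)
# (2, 100, 80) (2, 100, 80) (2, 80, 100) (2, 100)
```

After truncation every array shares the same time length, which is the condition the patch establishes before the loss is computed.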
 
+**2. `third_party/Matcha-TTS/matcha/utils/__init__.py`:**
 
 ```bash
 echo "" > third_party/Matcha-TTS/matcha/utils/__init__.py
@@ -146,9 +152,8 @@ echo "" > third_party/Matcha-TTS/matcha/utils/__init__.py
 
 ## License
 
+Apache 2.0 — same as the base model.
+The Thorsten-Voice dataset is licensed under [CC0](https://creativecommons.org/publicdomain/zero/1.0/).
 
 ---
 
@@ -168,5 +173,7 @@ The Thorsten-Voice dataset is licensed under [Creative Commons Zero (CC0)](https
 ## Links
 
 - [Thorsten-Voice Website](https://www.thorsten-voice.de)
+- [Docker Container](https://hub.docker.com/r/thorstenvoice/cosyvoice-tts)
 - [CosyVoice GitHub](https://github.com/FunAudioLLM/CosyVoice)
+- [Base Model](https://huggingface.co/FunAudioLLM/Fun-CosyVoice3-0.5B-2512)
+- [Source on GitHub](https://github.com/thorstenMueller/Thorsten-Voice/tree/main/docker/cosyvoice)