Text-to-Speech
German
german
cosyvoice3
thorsten-voice
Thorsten-Voice committed
Commit 5a34fb1 · verified · 1 Parent(s): 1e89a21

Added performance test values

Files changed (1)
  1. README.md +48 -41
README.md CHANGED
@@ -27,12 +27,28 @@ The HiFi-GAN vocoder (`hift.pt`) is used unchanged from the base model.
27
 
28
  ---
29
 
30
- ## Requirements
31
 
32
- ### 1. Clone CosyVoice at the correct commit
33
 
34
- This model was trained and tested against a specific version of CosyVoice.
35
- Using a different commit may cause compatibility issues.
 
36
 
37
  ```bash
38
  git clone https://github.com/FunAudioLLM/CosyVoice.git
@@ -43,17 +59,12 @@ git submodule update --init --recursive
43
 
44
  ### 2. Install dependencies
45
 
46
- Python 3.10 or 3.11 is recommended. With Python 3.12 additional patches are required (see below).
47
 
48
  ```bash
49
- # System dependencies
50
  sudo apt-get install -y sox libsox-fmt-all ffmpeg
51
-
52
- # Install openai-whisper first (before requirements.txt)
53
  pip install setuptools --upgrade
54
  pip install openai-whisper
55
-
56
- # Install remaining dependencies
57
  grep -v "openai-whisper" requirements.txt > requirements_fixed.txt
58
  pip install -r requirements_fixed.txt
59
  ```
@@ -64,31 +75,22 @@ pip install -r requirements_fixed.txt
64
  export PYTHONPATH=/path/to/CosyVoice:/path/to/CosyVoice/third_party/Matcha-TTS:$PYTHONPATH
65
  ```
66
 
67
- ### 4. Download the base model
68
 
69
  ```bash
70
  pip install huggingface_hub
71
 
 
72
  hf download FunAudioLLM/Fun-CosyVoice3-0.5B-2512 \
73
  --local-dir pretrained_models/CosyVoice3-0.5B
74
- ```
75
-
76
- ### 5. Download this model
77
 
78
- ```bash
79
  hf download Thorsten-Voice/CosyVoice3 \
80
  --local-dir pretrained_models/CosyVoice3-0.5B \
81
  --include "llm.pt" "flow.pt" "spk2info.pt" "infer_thorsten.py"
82
  ```
83
 
84
- This overwrites only `llm.pt` and `flow.pt` in the base model directory.
85
- All other files (`hift.pt`, `campplus.onnx`, etc.) remain from the base model.
86
-
87
- ---
88
-
89
- ## Inference
90
-
91
- ### Quick start
92
 
93
  ```bash
94
  python3 infer_thorsten.py \
@@ -96,27 +98,31 @@ python3 infer_thorsten.py \
96
  --output thorsten.wav
97
  ```
98
 
99
- Optional parameters:
100
 
101
- ```bash
102
- python3 infer_thorsten.py \
103
- --text "Ein längerer Text hier." \
104
- --model_dir pretrained_models/CosyVoice3-0.5B \
105
- --output output.wav \
106
- --speed 0.9
107
- ```
108
 
109
  ---
110
 
111
  ## Python 3.12 patches
112
 
113
- When using Python 3.12, two additional source patches are required.
114
-
115
- **1. `cosyvoice/flow/flow.py`** — add these lines inside `CausalMaskedDiffWithDiT.forward()`,
116
- directly after `conds = conds.transpose(1, 2)` and before `loss, _ = self.decoder.compute_loss(...)`:
117
 
118
  ```python
119
- # Alignment fix
120
  min_len = min(h.shape[1], feat.shape[1])
121
  h = h[:, :min_len, :]
122
  feat = feat[:, :min_len, :]
@@ -124,7 +130,7 @@ conds = conds[:, :, :min_len]
124
  mask = mask[:, :min_len]
125
  ```
126
 
127
- **2. `third_party/Matcha-TTS/matcha/utils/__init__.py`** — clear this file:
128
 
129
  ```bash
130
  echo "" > third_party/Matcha-TTS/matcha/utils/__init__.py
@@ -146,9 +152,8 @@ echo "" > third_party/Matcha-TTS/matcha/utils/__init__.py
146
 
147
  ## License
148
 
149
- The fine-tuned weights follow the [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) license of the base model.
150
-
151
- The Thorsten-Voice dataset is licensed under [Creative Commons Zero (CC0)](https://creativecommons.org/publicdomain/zero/1.0/).
152
 
153
  ---
154
 
@@ -168,5 +173,7 @@ The Thorsten-Voice dataset is licensed under [Creative Commons Zero (CC0)](https
168
  ## Links
169
 
170
  - [Thorsten-Voice Website](https://www.thorsten-voice.de)
 
171
  - [CosyVoice GitHub](https://github.com/FunAudioLLM/CosyVoice)
172
- - [Base Model on HuggingFace](https://huggingface.co/FunAudioLLM/Fun-CosyVoice3-0.5B-2512)
 
 
27
 
28
  ---
29
 
30
+ ## Quickstart with Docker
31
 
32
+ The easiest way to use this model is via the official Docker container:
33
+
34
+ ```bash
35
+ docker run -p 8000:8000 \
36
+ -v cosyvoice_models:/app/CosyVoice/pretrained_models \
37
+ thorstenvoice/cosyvoice-tts
38
+
39
+ # Then generate audio:
40
+ curl -X POST http://localhost:8000/tts \
41
+ -F "text=Hallo, ich bin Thorsten. Schön, dass du da bist." \
42
+ --output thorsten.wav
43
+ ```
44
+
45
+ → [Docker Hub: thorstenvoice/cosyvoice-tts](https://hub.docker.com/r/thorstenvoice/cosyvoice-tts)
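
For callers who prefer Python to `curl`, the request above can be sketched with the standard library alone. This assumes the `/tts` endpoint accepts a `multipart/form-data` POST with a single `text` field and answers with the WAV bytes, which is what the `curl -F ... --output thorsten.wav` call implies but is not otherwise documented here. The helper names `build_multipart` and `synthesize` are illustrative and not part of the container's API.

```python
import urllib.request
import uuid


def build_multipart(fields: dict[str, str]) -> tuple[bytes, str]:
    """Encode plain text fields as multipart/form-data, similar to `curl -F`."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n'
            "\r\n"
            f"{value}\r\n"
        )
    parts.append(f"--{boundary}--\r\n")
    body = "".join(parts).encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"


def synthesize(text: str, out_path: str,
               url: str = "http://localhost:8000/tts") -> None:
    """POST the text to the (assumed) /tts endpoint and save the response body."""
    body, content_type = build_multipart({"text": text})
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as f:
        f.write(response.read())
```

With the container running, `synthesize("Hallo, ich bin Thorsten. Schön, dass du da bist.", "thorsten.wav")` should be equivalent to the `curl` call, provided the assumptions about the endpoint hold.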
46
+
47
+ ---
48
 
49
+ ## Manual Installation
50
+
51
+ ### 1. Clone CosyVoice at the correct commit
52
 
53
  ```bash
54
  git clone https://github.com/FunAudioLLM/CosyVoice.git
 
59
 
60
  ### 2. Install dependencies
61
 
62
+ Python 3.10 or 3.11 is recommended. Python 3.12 requires the source patches described under "Python 3.12 patches".
63
 
64
  ```bash
 
65
  sudo apt-get install -y sox libsox-fmt-all ffmpeg
 
 
66
  pip install setuptools --upgrade
67
  pip install openai-whisper
 
 
68
  grep -v "openai-whisper" requirements.txt > requirements_fixed.txt
69
  pip install -r requirements_fixed.txt
70
  ```
 
75
  export PYTHONPATH=/path/to/CosyVoice:/path/to/CosyVoice/third_party/Matcha-TTS:$PYTHONPATH
76
  ```
77
 
78
+ ### 4. Download models
79
 
80
  ```bash
81
  pip install huggingface_hub
82
 
83
+ # Base model
84
  hf download FunAudioLLM/Fun-CosyVoice3-0.5B-2512 \
85
  --local-dir pretrained_models/CosyVoice3-0.5B
 
 
 
86
 
87
+ # Thorsten fine-tuned weights
88
  hf download Thorsten-Voice/CosyVoice3 \
89
  --local-dir pretrained_models/CosyVoice3-0.5B \
90
  --include "llm.pt" "flow.pt" "spk2info.pt" "infer_thorsten.py"
91
  ```
92
 
93
+ ### 5. Generate audio
94
 
95
  ```bash
96
  python3 infer_thorsten.py \
 
98
  --output thorsten.wav
99
  ```
100
 
101
+ ---
102
 
103
+ ## Performance
104
+
105
+ Benchmarked with these two test texts:
106
+
107
+ **Short** (9 words):
108
+ > "Hallo, hier ist Thorsten. Schön, dass Du da bist."
109
+
110
+ **Long** (57 words):
111
+ > "Für mich sind alle Menschen gleich, unabhängig von Geschlecht, sexueller Orientierung, Religion, Hautfarbe oder Geokoordinaten der Geburt. Ich glaube an eine globale Welt, wo jeder überall willkommen ist und freies Wissen und Bildung kostenfrei für jeden zur Verfügung steht. Ich habe meine Stimme der Allgemeinheit gespendet, in der Hoffnung darauf, dass sie in diesem Sinne genutzt wird."
112
+
113
+ | Hardware | Short text | Long text |
114
+ |----------|-----------|-----------|
115
+ | MacBook Air M1 (CPU) | 47s | 4:30 min |
116
+ | QNAP NAS Intel (CPU) | 50s | — |
117
+ | RunPod RTX 4090 (GPU) | **2.9s** | **12.9s** |
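
To put the table in relative terms, the timings can be converted to seconds and divided by the GPU figures. The values below are copied from the table above; the helper `to_seconds` and the dictionary layout are illustrative only.

```python
def to_seconds(value: str) -> float:
    """Parse table entries such as '47s', '2.9s' or '4:30 min' into seconds."""
    value = value.replace("min", "").strip()
    if ":" in value:
        minutes, seconds = value.split(":")
        return int(minutes) * 60 + float(seconds)
    return float(value.rstrip("s"))


# Timings copied from the table above: (short text, long text).
timings = {
    "MacBook Air M1 (CPU)": ("47s", "4:30 min"),
    "QNAP NAS Intel (CPU)": ("50s", None),  # no long-text measurement listed
    "RunPod RTX 4090 (GPU)": ("2.9s", "12.9s"),
}

gpu_short, gpu_long = (to_seconds(t) for t in timings["RunPod RTX 4090 (GPU)"])

# Speedup of the GPU run relative to each row, rounded to one decimal place.
speedups = {}
for hardware, (short, long_) in timings.items():
    speedups[hardware] = (
        round(to_seconds(short) / gpu_short, 1),
        round(to_seconds(long_) / gpu_long, 1) if long_ else None,
    )

for hardware, (short_x, long_x) in speedups.items():
    print(f"{hardware}: short {short_x}x, long {long_x}x")
```

On these numbers the RTX 4090 run is roughly 16 times faster than the M1 CPU on the short text and about 21 times faster on the long one.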
118
 
119
  ---
120
 
121
  ## Python 3.12 patches
122
 
123
+ **1. `cosyvoice/flow/flow.py`:** add these lines inside `CausalMaskedDiffWithDiT.forward()`, directly after `conds = conds.transpose(1, 2)` and before `loss, _ = self.decoder.compute_loss(...)`:
 
 
 
124
 
125
  ```python
 
126
  min_len = min(h.shape[1], feat.shape[1])
127
  h = h[:, :min_len, :]
128
  feat = feat[:, :min_len, :]
 
130
  mask = mask[:, :min_len]
131
  ```
132
 
133
+ **2. `third_party/Matcha-TTS/matcha/utils/__init__.py`:** clear this file:
134
 
135
  ```bash
136
  echo "" > third_party/Matcha-TTS/matcha/utils/__init__.py
 
152
 
153
  ## License
154
 
155
+ The fine-tuned weights are licensed under Apache 2.0, the same license as the base model.
156
+ The Thorsten-Voice dataset is licensed under [CC0](https://creativecommons.org/publicdomain/zero/1.0/).
 
157
 
158
  ---
159
 
 
173
  ## Links
174
 
175
  - [Thorsten-Voice Website](https://www.thorsten-voice.de)
176
+ - [Docker Container](https://hub.docker.com/r/thorstenvoice/cosyvoice-tts)
177
  - [CosyVoice GitHub](https://github.com/FunAudioLLM/CosyVoice)
178
+ - [Base Model](https://huggingface.co/FunAudioLLM/Fun-CosyVoice3-0.5B-2512)
179
+ - [Source on GitHub](https://github.com/thorstenMueller/Thorsten-Voice/tree/main/docker/cosyvoice)