update readme
README.md
CHANGED
@@ -35,9 +35,9 @@ When a single piece of audio needs to **sound like a real person**, **pronounce
-
-
-
MOSS-TTS is the **flagship base model** in our open-source **TTS Family**. It is designed as a production-ready synthesis backbone that can serve as the primary high-quality engine for scalable voice applications, and as a strong research baseline for controllable TTS and discrete audio token modeling.

**Design goals**
@@ -48,7 +48,7 @@ MOSS-TTS is the **flagship base model** in our open-source **TTS Family**. It is
-

MOSS-TTS delivers state-of-the-art quality while providing the fine-grained controllability and long-form stability required for production-grade voice applications, from zero-shot cloning and hour-long narration to token- and phoneme-level control across multilingual and code-switched speech.
@@ -66,7 +66,7 @@ MOSS-TTS delivers state-of-the-art quality while providing the fine-grained cont
-

MOSS-TTS includes **two complementary architectures**, both trained and released to explore different performance/latency tradeoffs and to support downstream research.
@@ -91,7 +91,7 @@ For full details, see:
-

| Model | Description |
|---|---|
@@ -109,7 +109,7 @@ For full details, see:
-

> Tip: For production usage, prioritize **MOSS-TTSDelay-8B**. The examples below use this model; **MOSS-TTSLocal-1.7B** supports the same API, and a practical walkthrough is available in [moss_tts_local/README.md](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Local-Transformer).
@@ -205,7 +205,7 @@ with torch.no_grad():

```

-

MOSS-TTS supports continuation-based cloning: provide a prefix audio clip in the assistant message, and make sure the **prefix transcript** is included in the text. The model continues in the same speaker identity and style.
@@ -291,7 +291,7 @@ with torch.no_grad():
-

**UserMessage**
@@ -309,7 +309,7 @@ with torch.no_grad():
-

| Parameter | Type | Default | Description |
|---|---|---:|---|
@@ -323,7 +323,7 @@ with torch.no_grad():
-

Use tone-numbered Pinyin such as `ni3 hao3 wo3 men1`. You can convert Chinese text with [pypinyin](https://github.com/mozillazg/python-pinyin), then adjust tones for pronunciation control.
@@ -361,7 +361,7 @@ print(text)
-

Use `/.../` to wrap IPA sequences so they are distinct from normal text. You can use [DeepPhonemizer](https://github.com/spring-media/DeepPhonemizer) to convert English paragraphs or words into IPA sequences.
@@ -386,7 +386,7 @@ print(model_input_text)
-
MOSS-TTS achieved state-of-the-art results on the open-source zero-shot TTS benchmark Seed-TTS-eval, not only surpassing all open-source models but also rivaling the most powerful closed-source models.

| Model | Params | Open-source | EN WER (%) ↓ | EN SIM (%) ↑ | ZH CER (%) ↓ | ZH SIM (%) ↑ |

+# MOSS-TTS
+## 1. Overview
+### 1.1 TTS Family Positioning
MOSS-TTS is the **flagship base model** in our open-source **TTS Family**. It is designed as a production-ready synthesis backbone that can serve as the primary high-quality engine for scalable voice applications, and as a strong research baseline for controllable TTS and discrete audio token modeling.

**Design goals**

+### 1.2 Key Capabilities

MOSS-TTS delivers state-of-the-art quality while providing the fine-grained controllability and long-form stability required for production-grade voice applications, from zero-shot cloning and hour-long narration to token- and phoneme-level control across multilingual and code-switched speech.

+### 1.3 Model Architecture

MOSS-TTS includes **two complementary architectures**, both trained and released to explore different performance/latency tradeoffs and to support downstream research.

+### 1.4 Released Models

| Model | Description |
|---|---|

+## 2. Quick Start

> Tip: For production usage, prioritize **MOSS-TTSDelay-8B**. The examples below use this model; **MOSS-TTSLocal-1.7B** supports the same API, and a practical walkthrough is available in [moss_tts_local/README.md](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Local-Transformer).

```

+### Continuation + Voice Cloning (Prefix Audio + Text)

MOSS-TTS supports continuation-based cloning: provide a prefix audio clip in the assistant message, and make sure the **prefix transcript** is included in the text. The model continues in the same speaker identity and style.
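
The transcript requirement above can be sketched as a small text-assembly helper. This is only an illustration under stated assumptions: `build_continuation_text` is a hypothetical name, not part of the MOSS-TTS API, and the spacing rule (a single space between ASCII text, no separator next to non-ASCII text such as Chinese) is a guess rather than documented behavior.

```python
def build_continuation_text(prefix_transcript: str, new_text: str) -> str:
    """Return the text field for continuation: prefix transcript first, then the new text.

    Hypothetical helper, not a MOSS-TTS function.
    """
    prefix = prefix_transcript.strip()
    continuation = new_text.strip()
    if not prefix:
        # Without the transcript, the text would not cover the prefix audio.
        raise ValueError("prefix transcript must not be empty")
    if not continuation:
        return prefix
    # Assumed spacing rule: space between ASCII text, none around CJK text.
    needs_space = prefix[-1].isascii() and continuation[0].isascii()
    return prefix + (" " if needs_space else "") + continuation


print(build_continuation_text("Hello there. ", "How are you today?"))
```

The resulting string would be passed as the text, while the prefix audio clip goes in the assistant message as described above.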

+### Input Types

**UserMessage**

+### Generation Hyperparameters

| Parameter | Type | Default | Description |
|---|---|---:|---|

+### Pinyin Input

Use tone-numbered Pinyin such as `ni3 hao3 wo3 men1`. You can convert Chinese text with [pypinyin](https://github.com/mozillazg/python-pinyin), then adjust tones for pronunciation control.
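
As a minimal sketch of the "adjust tones" step, the helper below rewrites the tone digit of one syllable in a tone-numbered string. `set_tone` is a hypothetical name introduced here, and accepting digits 1 to 5 is an assumption (the example above only shows tones 1 and 3). A string in this format can be produced with pypinyin, for example via `lazy_pinyin(text, style=Style.TONE3)`; the sketch itself uses only the standard library.

```python
import re

# One tone-numbered syllable: letters followed by a single tone digit.
# Allowing digits 1-5 is an assumption, not a documented MOSS-TTS rule.
SYLLABLE = re.compile(r"^([a-zA-Zü]+)([1-5])$")


def set_tone(pinyin: str, index: int, tone: int) -> str:
    """Return `pinyin` with the tone of the syllable at `index` replaced by `tone`."""
    if tone not in range(1, 6):
        raise ValueError("tone must be between 1 and 5")
    syllables = pinyin.split()
    match = SYLLABLE.match(syllables[index])
    if match is None:
        raise ValueError(f"not a tone-numbered syllable: {syllables[index]!r}")
    syllables[index] = f"{match.group(1)}{tone}"
    return " ".join(syllables)


# Apply third-tone sandhi by hand: "ni3 hao3" is spoken as "ni2 hao3".
print(set_tone("ni3 hao3 wo3 men1", 0, 2))  # ni2 hao3 wo3 men1
```

Editing the digits explicitly keeps the pronunciation override visible in the input text, which is the point of using Pinyin input instead of raw characters.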

+### IPA Input

Use `/.../` to wrap IPA sequences so they are distinct from normal text. You can use [DeepPhonemizer](https://github.com/spring-media/DeepPhonemizer) to convert English paragraphs or words into IPA sequences.
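
The wrapping convention can be sketched as follows. This is an illustration under assumptions: `wrap_ipa` and the hand-written `lexicon` are hypothetical, and in practice the IPA strings would come from a phonemizer such as DeepPhonemizer. Whether the model expects stress marks or particular spacing inside the slashes is not stated here, so treat those details as guesses.

```python
import re


def wrap_ipa(text: str, lexicon: dict[str, str]) -> str:
    """Replace words found in `lexicon` with their IPA, wrapped as /.../.

    Hypothetical helper; words not in the lexicon are left as normal text.
    """
    def substitute(match: re.Match) -> str:
        word = match.group(0)
        ipa = lexicon.get(word.lower())
        return f"/{ipa}/" if ipa is not None else word

    # Match runs of letters and apostrophes as candidate words.
    return re.sub(r"[A-Za-z']+", substitute, text)


# Hand-written entry for illustration only.
lexicon = {"tomato": "təˈmeɪtoʊ"}
print(wrap_ipa("I say tomato.", lexicon))  # I say /təˈmeɪtoʊ/.
```

Only the words that need a pronunciation override are wrapped, so the rest of the sentence stays readable as plain text.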

+## 3. Evaluation
MOSS-TTS achieved state-of-the-art results on the open-source zero-shot TTS benchmark Seed-TTS-eval, not only surpassing all open-source models but also rivaling the most powerful closed-source models.
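
For readers unfamiliar with the WER and CER columns in the results table, the sketch below shows the usual edit-distance definition of both metrics. It is not the benchmark's scoring script: an actual evaluation would first transcribe the generated audio with an ASR model and apply its own text normalization, and the function names here are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent, over whitespace-separated words."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent, ignoring spaces."""
    ref_chars = list(reference.replace(" ", ""))
    hyp_chars = list(hypothesis.replace(" ", ""))
    return 100.0 * edit_distance(ref_chars, hyp_chars) / len(ref_chars)


print(round(wer("the cat sat on the mat", "the cat sat on mat"), 2))
print(round(cer("今天天气很好", "今天天汽很好"), 2))
```

Lower is better for both, which is what the ↓ arrows in the table header indicate.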

| Model | Params | Open-source | EN WER (%) ↓ | EN SIM (%) ↑ | ZH CER (%) ↓ | ZH SIM (%) ↑ |