YWMditto committed on
Commit 04fb082 · 1 Parent(s): 5370a09

update readme

Files changed (1)
  1. README.md +13 -13
README.md CHANGED
@@ -35,9 +35,9 @@ When a single piece of audio needs to **sound like a real person**, **pronounce
 
 
 
-## MOSS-TTS
-### 1. Overview
-#### 1.1 TTS Family Positioning
+# MOSS-TTS
+## 1. Overview
+### 1.1 TTS Family Positioning
 MOSS-TTS is the **flagship base model** in our open-source **TTS Family**. It is designed as a production-ready synthesis backbone that can serve as the primary high-quality engine for scalable voice applications, and as a strong research baseline for controllable TTS and discrete audio token modeling.
 
 **Design goals**
@@ -48,7 +48,7 @@ MOSS-TTS is the **flagship base model** in our open-source **TTS Family**. It is
 
 
 
-#### 1.2 Key Capabilities
+### 1.2 Key Capabilities
 
 MOSS-TTS delivers state-of-the-art quality while providing the fine-grained controllability and long-form stability required for production-grade voice applications, from zero-shot cloning and hour-long narration to token- and phoneme-level control across multilingual and code-switched speech.
 
@@ -66,7 +66,7 @@ MOSS-TTS delivers state-of-the-art quality while providing the fine-grained cont
 
 
 
-#### 1.3 Model Architecture
+### 1.3 Model Architecture
 
 MOSS-TTS includes **two complementary architectures**, both trained and released to explore different performance/latency tradeoffs and to support downstream research.
 
@@ -91,7 +91,7 @@ For full details, see:
 
 
 
-#### 1.4 Released Models
+### 1.4 Released Models
 
 | Model | Description |
 |---|---|
@@ -109,7 +109,7 @@ For full details, see:
 
 
 
-### 2. Quick Start
+## 2. Quick Start
 
 > Tip: For production usage, prioritize **MOSS-TTSDelay-8B**. The examples below use this model; **MOSS-TTSLocal-1.7B** supports the same API, and a practical walkthrough is available in [moss_tts_local/README.md](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Local-Transformer).
 
@@ -205,7 +205,7 @@ with torch.no_grad():
 
 ```
 
-#### Continuation + Voice Cloning (Prefix Audio + Text)
+### Continuation + Voice Cloning (Prefix Audio + Text)
 
 MOSS-TTS supports continuation-based cloning: provide a prefix audio clip in the assistant message, and make sure the **prefix transcript** is included in the text. The model continues in the same speaker identity and style.
 
@@ -291,7 +291,7 @@ with torch.no_grad():
 
 
 
-#### Input Types
+### Input Types
 
 **UserMessage**
 
@@ -309,7 +309,7 @@ with torch.no_grad():
 
 
 
-#### Generation Hyperparameters
+### Generation Hyperparameters
 
 | Parameter | Type | Default | Description |
 |---|---|---:|---|
@@ -323,7 +323,7 @@ with torch.no_grad():
 
 
 
-#### Pinyin Input
+### Pinyin Input
 
 Use tone-numbered Pinyin such as `ni3 hao3 wo3 men1`. You can convert Chinese text with [pypinyin](https://github.com/mozillazg/python-pinyin), then adjust tones for pronunciation control.
 
@@ -361,7 +361,7 @@ print(text)
 
 
 
-#### IPA Input
+### IPA Input
 
 Use `/.../` to wrap IPA sequences so they are distinct from normal text. You can use [DeepPhonemizer](https://github.com/spring-media/DeepPhonemizer) to convert English paragraphs or words into IPA sequences.
 
@@ -386,7 +386,7 @@ print(model_input_text)
 
 
 
-### 3. Evaluation
+## 3. Evaluation
 MOSS-TTS achieved state-of-the-art results on the open-source zero-shot TTS benchmark Seed-TTS-eval, not only surpassing all open-source models but also rivaling the most powerful closed-source models.
 
 | Model | Params | Open-source | EN WER (%) ↓ | EN SIM (%) ↑ | ZH CER (%) ↓ | ZH SIM (%) ↑ |