OpenMOSS-Team/MOSS-TTS

YWMditto committed 4 days ago

Commit 04fb082 (1 parent: 5370a09)

update readme

README.md +13 -13

@@ -35,9 +35,9 @@ When a single piece of audio needs to **sound like a real person**, **pronounce
-## MOSS-TTS
-### 1. Overview
-#### 1.1 TTS Family Positioning
 MOSS-TTS is the **flagship base model** in our open-source **TTS Family**. It is designed as a production-ready synthesis backbone that can serve as the primary high-quality engine for scalable voice applications, and as a strong research baseline for controllable TTS and discrete audio token modeling.
 **Design goals**
@@ -48,7 +48,7 @@ MOSS-TTS is the **flagship base model** in our open-source **TTS Family**. It is
-#### 1.2 Key Capabilities
 MOSS-TTS delivers state-of-the-art quality while providing the fine-grained controllability and long-form stability required for production-grade voice applications, from zero-shot cloning and hour-long narration to token- and phoneme-level control across multilingual and code-switched speech.
@@ -66,7 +66,7 @@ MOSS-TTS delivers state-of-the-art quality while providing the fine-grained cont
-#### 1.3 Model Architecture
 MOSS-TTS includes **two complementary architectures**, both trained and released to explore different performance/latency tradeoffs and to support downstream research.
@@ -91,7 +91,7 @@ For full details, see:
-#### 1.4 Released Models
 | Model | Description |
 |---|---|
@@ -109,7 +109,7 @@ For full details, see:
-### 2. Quick Start
 > Tip: For production usage, prioritize **MOSS-TTSDelay-8B**. The examples below use this model; **MOSS-TTSLocal-1.7B** supports the same API, and a practical walkthrough is available in [moss_tts_local/README.md](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Local-Transformer).
@@ -205,7 +205,7 @@ with torch.no_grad():
 ```
-#### Continuation + Voice Cloning (Prefix Audio + Text)
 MOSS-TTS supports continuation-based cloning: provide a prefix audio clip in the assistant message, and make sure the **prefix transcript** is included in the text. The model continues in the same speaker identity and style.
@@ -291,7 +291,7 @@ with torch.no_grad():
-#### Input Types
 **UserMessage**
@@ -309,7 +309,7 @@ with torch.no_grad():
-#### Generation Hyperparameters
 | Parameter | Type | Default | Description |
 |---|---|---:|---|
@@ -323,7 +323,7 @@ with torch.no_grad():
-#### Pinyin Input
 Use tone-numbered Pinyin such as `ni3 hao3 wo3 men1`. You can convert Chinese text with [pypinyin](https://github.com/mozillazg/python-pinyin), then adjust tones for pronunciation control.
@@ -361,7 +361,7 @@ print(text)
-#### IPA Input
 Use `/.../` to wrap IPA sequences so they are distinct from normal text. You can use [DeepPhonemizer](https://github.com/spring-media/DeepPhonemizer) to convert English paragraphs or words into IPA sequences.
@@ -386,7 +386,7 @@ print(model_input_text)
-### 3. Evaluation
 MOSS-TTS achieved state-of-the-art results on the open-source zero-shot TTS benchmark Seed-TTS-eval, not only surpassing all open-source models but also rivaling the most powerful closed-source models.
 | Model | Params | Open-source | EN WER (%) ↓ | EN SIM (%) ↑ | ZH CER (%) ↓ | ZH SIM (%) ↑ |

+# MOSS-TTS
+## 1. Overview
+### 1.1 TTS Family Positioning
 MOSS-TTS is the **flagship base model** in our open-source **TTS Family**. It is designed as a production-ready synthesis backbone that can serve as the primary high-quality engine for scalable voice applications, and as a strong research baseline for controllable TTS and discrete audio token modeling.
 **Design goals**
+### 1.2 Key Capabilities
 MOSS-TTS delivers state-of-the-art quality while providing the fine-grained controllability and long-form stability required for production-grade voice applications, from zero-shot cloning and hour-long narration to token- and phoneme-level control across multilingual and code-switched speech.
+### 1.3 Model Architecture
 MOSS-TTS includes **two complementary architectures**, both trained and released to explore different performance/latency tradeoffs and to support downstream research.
+### 1.4 Released Models
 | Model | Description |
 |---|---|
+## 2. Quick Start
 > Tip: For production usage, prioritize **MOSS-TTSDelay-8B**. The examples below use this model; **MOSS-TTSLocal-1.7B** supports the same API, and a practical walkthrough is available in [moss_tts_local/README.md](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Local-Transformer).
 ```
+### Continuation + Voice Cloning (Prefix Audio + Text)
 MOSS-TTS supports continuation-based cloning: provide a prefix audio clip in the assistant message, and make sure the **prefix transcript** is included in the text. The model continues in the same speaker identity and style.
+### Input Types
 **UserMessage**
+### Generation Hyperparameters
 | Parameter | Type | Default | Description |
 |---|---|---:|---|
+### Pinyin Input
 Use tone-numbered Pinyin such as `ni3 hao3 wo3 men1`. You can convert Chinese text with [pypinyin](https://github.com/mozillazg/python-pinyin), then adjust tones for pronunciation control.
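When tones are adjusted by hand, it is easy to drop a tone digit. The following is a minimal sketch of a sanity check, not part of the MOSS-TTS API: the helper name `untoned_syllables` is hypothetical, and it assumes syllables are separated by whitespace and carry a trailing tone digit from 1 to 5 (whether the model expects 5 for the neutral tone is an assumption, not something this README states).

```python
import re

# Assumption: a syllable is lowercase letters (including "ü") followed by one tone digit 1-5.
_SYLLABLE = re.compile(r"^[a-zü]+[1-5]$")


def untoned_syllables(pinyin_text: str) -> list[str]:
    """Return the whitespace-separated tokens that are not tone-numbered syllables."""
    return [tok for tok in pinyin_text.lower().split() if not _SYLLABLE.match(tok)]


print(untoned_syllables("ni3 hao3 wo3 men1"))  # []
print(untoned_syllables("ni3 hao wo3"))        # ['hao']
```

Punctuation and non-Pinyin words are reported as well, so an empty result only means every token looks like a tone-numbered syllable.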
+### IPA Input
 Use `/.../` to wrap IPA sequences so they are distinct from normal text. You can use [DeepPhonemizer](https://github.com/spring-media/DeepPhonemizer) to convert English paragraphs or words into IPA sequences.
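The wrapping convention can be sketched with plain string handling. This is an illustrative sketch only: `wrap_ipa` and `mix_text` are hypothetical helpers, the IPA string is written by hand rather than produced by DeepPhonemizer, and mixing IPA with plain text inside one sentence is an assumption about how the input is used.

```python
def wrap_ipa(ipa: str) -> str:
    """Wrap an IPA sequence in slashes so it stands apart from plain text."""
    return f"/{ipa.strip().strip('/')}/"


def mix_text(parts: list[tuple[str, bool]]) -> str:
    """Join (text, is_ipa) pairs into one input string, wrapping the IPA parts."""
    return " ".join(wrap_ipa(text) if is_ipa else text for text, is_ipa in parts)


model_input_text = mix_text(
    [("Please say", False), ("həˈloʊ", True), ("slowly.", False)]
)
print(model_input_text)  # Please say /həˈloʊ/ slowly.
```

Stripping existing slashes before wrapping keeps the helper idempotent, so a sequence that is already wrapped is not wrapped twice.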
+## 3. Evaluation
 MOSS-TTS achieved state-of-the-art results on the open-source zero-shot TTS benchmark Seed-TTS-eval, not only surpassing all open-source models but also rivaling the most powerful closed-source models.
 | Model | Params | Open-source | EN WER (%) ↓ | EN SIM (%) ↑ | ZH CER (%) ↓ | ZH SIM (%) ↑ |