update readme
README.md
CHANGED
@@ -13,13 +13,14 @@ MOSS‑TTS Family is an open‑source **speech and sound generation model family
 <img src="https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_imgaes_demo/moss_tts_family_arch.jpeg" width="85%" />
 </p>

+
 When a single piece of audio needs to **sound like a real person**, **pronounce every word accurately**, **switch speaking styles across content**, **remain stable over tens of minutes**, and **support dialogue, role‑play, and real‑time interaction**, a single TTS model is often not enough. The **MOSS‑TTS Family** breaks the workflow into five production‑ready models that can be used independently or composed into a complete pipeline.

-- **MOSS‑TTS**: MOSS-TTS is the flagship
-- **MOSS‑TTSD**: MOSS-TTSD is a production
-- **MOSS‑VoiceGenerator**: MOSS-VoiceGenerator is an open-source voice design
-- **MOSS‑SoundEffect**: MOSS-SoundEffect is a high-fidelity sound
-- **MOSS‑TTS‑Realtime**: MOSS-TTS-Realtime is a context-aware, multi-turn streaming TTS
+- **MOSS‑TTS**: MOSS-TTS is the flagship production TTS foundation model, centered on high-fidelity zero-shot voice cloning with controllable long-form synthesis, pronunciation, and multilingual/code-switched speech. It serves as the core engine for scalable narration, dubbing, and voice-driven products.
+- **MOSS‑TTSD**: MOSS-TTSD is a production long-form dialogue model for expressive multi-speaker conversational audio at scale. It supports long-duration continuity, turn-taking control, and zero-shot voice cloning from short references for podcasts, audiobooks, commentary, dubbing, and entertainment dialogue.
+- **MOSS‑VoiceGenerator**: MOSS-VoiceGenerator is an open-source voice design model that creates speaker timbres directly from free-form text, without reference audio. It unifies timbre design, style control, and content synthesis, and can be used standalone or as a voice-design layer for downstream TTS.
+- **MOSS‑SoundEffect**: MOSS-SoundEffect is a high-fidelity text-to-sound model with broad category coverage and controllable duration for real content production. It generates stable audio from prompts across ambience, urban scenes, creatures, human actions, and music-like clips for film, games, interactive media, and data synthesis.
+- **MOSS‑TTS‑Realtime**: MOSS-TTS-Realtime is a context-aware, multi-turn streaming TTS model for real-time voice agents. By conditioning on dialogue history across both text and prior user acoustics, it delivers low-latency synthesis with coherent, consistent voice responses across turns.


 ## Released Models