---
license: apache-2.0
tags:
- text-to-speech
language:
- zh
- en
- de
- es
- fr
- ja
- it
- he
- ko
- ru
- fa
- ar
- pl
- pt
- cs
- da
- sv
- hu
- el
- tr
---

# MOSS-TTS Family

<br>

<p align="center">
<img src="https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_imgaes_demo/openmoss_x_mosi" height="50" align="middle" />
</p>

<div align="center">
<a href="https://github.com/OpenMOSS/MOSS-TTS/tree/main"><img src="https://img.shields.io/badge/Project%20Page-GitHub-blue"></a>
<a href="https://modelscope.cn/collections/OpenMOSS-Team/MOSS-TTS"><img src="https://img.shields.io/badge/ModelScope-Models-lightgrey?logo=modelscope&"></a>
<a href="https://mosi.cn/#models"><img src="https://img.shields.io/badge/Blog-View-blue?logo=internet-explorer&"></a>
<a href="https://github.com/OpenMOSS/MOSS-TTS"><img src="https://img.shields.io/badge/Arxiv-Coming%20soon-red?logo=arxiv&"></a>
<a href="https://studio.mosi.cn"><img src="https://img.shields.io/badge/AIStudio-Try-green?logo=internet-explorer&"></a>
<a href="https://studio.mosi.cn/docs/moss-tts"><img src="https://img.shields.io/badge/API-Docs-00A3FF?logo=fastapi&"></a>
<a href="https://x.com/Open_MOSS"><img src="https://img.shields.io/badge/Twitter-Follow-black?logo=x&"></a>
<a href="https://discord.gg/fvm5TaWjU3"><img src="https://img.shields.io/badge/Discord-Join-5865F2?logo=discord&"></a>
</div>

## Overview

MOSS-TTS Family is an open-source **speech and sound generation model family** from [MOSI.AI](https://mosi.cn/#hero) and the [OpenMOSS team](https://www.open-moss.com/). It is designed for **high-fidelity**, **high-expressiveness**, and **complex real-world scenarios**, covering stable long-form speech, multi-speaker dialogue, voice/character design, environmental sound effects, and real-time streaming TTS.

## Introduction

<p align="center">
<img src="https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_imgaes_demo/moss_tts_family_arch.jpeg" width="85%" />
</p>

When a single piece of audio needs to **sound like a real person**, **pronounce every word accurately**, **switch speaking styles across content**, **remain stable over tens of minutes**, and **support dialogue, role-play, and real-time interaction**, a single TTS model is often not enough. The **MOSS-TTS Family** breaks the workflow into five production-ready models that can be used independently or composed into a complete pipeline.

- **MOSS-TTS**: The flagship production TTS foundation model, centered on high-fidelity zero-shot voice cloning with controllable long-form synthesis, pronunciation, and multilingual/code-switched speech. It serves as the core engine for scalable narration, dubbing, and voice-driven products.
- **MOSS-TTSD**: A production long-form dialogue model for expressive multi-speaker conversational audio at scale. It supports long-duration continuity, turn-taking control, and zero-shot voice cloning from short references for podcasts, audiobooks, commentary, dubbing, and entertainment dialogue.
- **MOSS-VoiceGenerator**: An open-source voice design model that creates speaker timbres directly from free-form text, without reference audio. It unifies timbre design, style control, and content synthesis, and can be used standalone or as a voice-design layer for downstream TTS.
- **MOSS-SoundEffect**: A high-fidelity text-to-sound model with broad category coverage and controllable duration for real content production. It generates stable audio from prompts across ambience, urban scenes, creatures, human actions, and music-like clips for film, games, interactive media, and data synthesis.
- **MOSS-TTS-Realtime**: A context-aware, multi-turn streaming TTS model for real-time voice agents. By conditioning on dialogue history across both text and prior user acoustics, it delivers low-latency synthesis with coherent, consistent voice responses across turns.

## Released Models

| Model | Architecture | Size | Model Card | Hugging Face |
|---|---|---:|---|---|
| **MOSS-TTS** | MossTTSDelay | 8B | [moss_tts_model_card.md](https://github.com/OpenMOSS/MOSS-TTS/blob/main/docs/moss_tts_model_card.md) | 🤗 [Hugging Face](https://huggingface.co/OpenMOSS-Team/MOSS-TTS) |
| | MossTTSLocal | 1.7B | [moss_tts_model_card.md](https://github.com/OpenMOSS/MOSS-TTS/blob/main/docs/moss_tts_model_card.md) | 🤗 [Hugging Face](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Local-Transformer) |
| **MOSS-TTSD-V1.0** | MossTTSDelay | 8B | [moss_ttsd_model_card.md](https://github.com/OpenMOSS/MOSS-TTS/blob/main/docs/moss_ttsd_model_card.md) | 🤗 [Hugging Face](https://huggingface.co/OpenMOSS-Team/MOSS-TTSD-v1.0) |
| **MOSS-VoiceGenerator** | MossTTSDelay | 1.7B | [moss_voice_generator_model_card.md](https://github.com/OpenMOSS/MOSS-TTS/blob/main/docs/moss_voice_generator_model_card.md) | 🤗 [Hugging Face](https://huggingface.co/OpenMOSS-Team/MOSS-Voice-Generator) |
| **MOSS-SoundEffect** | MossTTSDelay | 8B | [moss_sound_effect_model_card.md](https://github.com/OpenMOSS/MOSS-TTS/blob/main/docs/moss_sound_effect_model_card.md) | 🤗 [Hugging Face](https://huggingface.co/OpenMOSS-Team/MOSS-SoundEffect) |
| **MOSS-TTS-Realtime** | MossTTSRealtime | 1.7B | [moss_tts_realtime_model_card.md](https://github.com/OpenMOSS/MOSS-TTS/blob/main/docs/moss_tts_realtime_model_card.md) | 🤗 [Hugging Face](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Realtime) |

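
For scripted downloads, the Hugging Face repository IDs listed in the table can be collected into a small lookup. The following is a minimal sketch rather than part of the MOSS-TTS codebase: the dictionary keys and the `download_checkpoint` helper are hypothetical names chosen for illustration, and it assumes the `huggingface_hub` package is installed.

```python
from typing import Optional

# Repository IDs copied from the "Hugging Face" column of the table.
# The dictionary keys are illustrative labels, not official identifiers.
REPO_IDS = {
    "MOSS-TTS": "OpenMOSS-Team/MOSS-TTS",
    "MOSS-TTS-Local": "OpenMOSS-Team/MOSS-TTS-Local-Transformer",
    "MOSS-TTSD-V1.0": "OpenMOSS-Team/MOSS-TTSD-v1.0",
    "MOSS-VoiceGenerator": "OpenMOSS-Team/MOSS-Voice-Generator",
    "MOSS-SoundEffect": "OpenMOSS-Team/MOSS-SoundEffect",
    "MOSS-TTS-Realtime": "OpenMOSS-Team/MOSS-TTS-Realtime",
}


def download_checkpoint(model_name: str, local_dir: Optional[str] = None) -> str:
    """Download one released checkpoint and return its local path.

    Hypothetical helper: it only wraps huggingface_hub.snapshot_download.
    """
    # Imported lazily so the lookup table is usable without the package.
    from huggingface_hub import snapshot_download

    if model_name not in REPO_IDS:
        raise KeyError(f"Unknown model {model_name!r}; choose from {sorted(REPO_IDS)}")
    return snapshot_download(repo_id=REPO_IDS[model_name], local_dir=local_dir)
```

Calling `download_checkpoint("MOSS-TTS-Realtime")` would fetch the snapshot into the local Hugging Face cache (or into `local_dir`, if given) and return the directory path.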

# MOSS-TTS-Realtime

## 1. Overview

### 1.1 TTS Family Positioning

**MOSS-TTS-Realtime** is a high-performance, real-time speech synthesis model within the broader MOSS-TTS Family. It is designed for interactive voice agents that require low-latency, continuous speech generation across multi-turn conversations. Unlike conventional streaming TTS systems that synthesize each response in isolation, MOSS-TTS-Realtime natively models dialogue context by conditioning speech generation on both textual and acoustic information from previous turns. By tightly integrating multi-turn context awareness with incremental streaming synthesis, it produces natural, coherent, and voice-consistent audio responses, enabling fluid and human-like spoken interactions for real-time applications.

**Key Capabilities**

* **Context-Aware & Expressive Speech Generation**: Generates expressive and coherent speech by modeling both textual and acoustic context across multiple dialogue turns.
* **High-Fidelity Voice Cloning with Multi-Turn Consistency**: Achieves exceptionally high voice similarity while maintaining strong speaker identity consistency across multiple dialogue turns.
* **Long-Context**: Supports long-range context with a maximum context length of 32K tokens (about 40 minutes), enabling stable and consistent speech generation in extended conversations.
* **Highly Human-Like Speech with Natural Prosody**: Trained on over 2.5 million hours of single-speaker speech and more than 1 million hours of two-speaker and multi-speaker conversational data, resulting in highly natural prosody and strong human-like expressiveness.
* **Multilingual Speech Support**: Supports over 10 languages beyond Chinese and English, including Korean, Japanese, German, and French, enabling consistent and expressive speech across languages.
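
As a rough sanity check on the long-context figure, the stated 32K context and 40-minute duration together imply an average budget of roughly 13.65 context positions per second of audio. The sketch below rests on two assumptions that are not stated in this card: that 32K means 32 × 1024, and that the whole window is spent on audio. Text and earlier turns may share the same budget, so treat the number as an estimate only.

```python
def implied_positions_per_second(context_len: int, minutes: float) -> float:
    """Average number of context positions available per second of audio."""
    return context_len / (minutes * 60)


# Assumption: "32K" is read as 32 * 1024 positions.
CONTEXT_LEN = 32 * 1024
# "About 40 minutes", as stated in the capability list.
DURATION_MIN = 40

rate = implied_positions_per_second(CONTEXT_LEN, DURATION_MIN)
print(f"{rate:.2f} positions per second")  # prints "13.65 positions per second"
```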

### 1.2 Model Architecture

<p align="center">
<img src="https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_imgaes_demo/moss_tts_realtime" width="60%"/>
</p>

## 2. Quick Start

### 2.1 Environment Setup

We recommend a clean, isolated Python environment with **Transformers 5.0.0** to avoid dependency conflicts.

```bash
conda create -n moss-tts python=3.12 -y
conda activate moss-tts
```

Install all required dependencies:

```bash
git clone https://github.com/OpenMOSS/MOSS-TTS.git
cd MOSS-TTS
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e .
```
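
After installation, it can help to confirm that the environment matches the recommendation above. The following is a minimal sketch using only the Python standard library; `installed_versions` and `check_environment` are hypothetical helper names, and the assumption that `torch` is among the installed dependencies is inferred from the PyTorch wheel index in the install command, not from an explicit requirements list.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version or None} without importing the packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def check_environment(expected_transformers="5.0.0"):
    """Compare installed packages against the recommended setup.

    Returns the versions found and a list of human-readable problems.
    """
    found = installed_versions(["torch", "transformers"])
    problems = [f"{name} is not installed" for name, v in found.items() if v is None]

    transformers_version = found["transformers"]
    if transformers_version is not None and transformers_version != expected_transformers:
        problems.append(
            f"transformers {transformers_version} found, "
            f"{expected_transformers} recommended"
        )
    return found, problems


found, problems = check_environment()
print("Installed:", found)
print("Problems:", problems or "none")
```

An empty problem list means both packages are present and Transformers matches the recommended version; any other result is a hint to recreate the environment before debugging further.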

### 2.2 Usage

See the MOSS-TTS-Realtime model card on GitHub for detailed usage instructions and examples:

👉 **Usage Guide**:
https://github.com/OpenMOSS/MOSS-TTS/blob/main/docs/moss_tts_realtime_model_card.md

---