	---
	license: apache-2.0
	tags:
	- text-to-speech
	language:
	- zh
	- en
	- de
	- es
	- fr
	- ja
	- it
	- he
	- ko
	- ru
	- fa
	- ar
	- pl
	- pt
	- cs
	- da
	- sv
	- hu
	- el
	- tr
	---
	# MOSS-TTS Family

	<br>

	<p align="center">

	<img src="https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_imgaes_demo/openmoss_x_mosi" height="50" align="middle" />
	</p>



	<div align="center">
	<a href="https://github.com/OpenMOSS/MOSS-TTS/tree/main"><img src="https://img.shields.io/badge/Project%20Page-GitHub-blue"></a>
	<a href="https://modelscope.cn/collections/OpenMOSS-Team/MOSS-TTS"><img src="https://img.shields.io/badge/ModelScope-Models-lightgrey?logo=modelscope&amp"></a>
	<a href="https://mosi.cn/#models"><img src="https://img.shields.io/badge/Blog-View-blue?logo=internet-explorer&amp"></a>
	<a href="https://github.com/OpenMOSS/MOSS-TTS"><img src="https://img.shields.io/badge/Arxiv-Coming%20soon-red?logo=arxiv&amp"></a>

	<a href="https://studio.mosi.cn"><img src="https://img.shields.io/badge/AIStudio-Try-green?logo=internet-explorer&amp"></a>
	<a href="https://studio.mosi.cn/docs/moss-tts"><img src="https://img.shields.io/badge/API-Docs-00A3FF?logo=fastapi&amp"></a>
	<a href="https://x.com/Open_MOSS"><img src="https://img.shields.io/badge/Twitter-Follow-black?logo=x&amp"></a>
	<a href="https://discord.gg/fvm5TaWjU3"><img src="https://img.shields.io/badge/Discord-Join-5865F2?logo=discord&amp"></a>
	</div>

	## Overview
	MOSS‑TTS Family is an open‑source family of speech and sound generation models from [MOSI.AI](https://mosi.cn/#hero) and the [OpenMOSS team](https://www.open-moss.com/). It targets high‑fidelity, highly expressive generation in complex real‑world scenarios, covering stable long‑form speech, multi‑speaker dialogue, voice/character design, environmental sound effects, and real‑time streaming TTS.


	## Introduction

	<p align="center">
	<img src="https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_imgaes_demo/moss_tts_family_arch.jpeg" width="85%" />
	</p>


	When a single piece of audio needs to sound like a real person, pronounce every word accurately, switch speaking styles across content, remain stable over tens of minutes, and support dialogue, role‑play, and real‑time interaction, a single TTS model is often not enough. The MOSS‑TTS Family breaks the workflow into five production‑ready models that can be used independently or composed into a complete pipeline.

	- MOSS‑TTS: The flagship production TTS foundation model, centered on high-fidelity zero-shot voice cloning with controllable long-form synthesis, pronunciation, and multilingual/code-switched speech. It serves as the core engine for scalable narration, dubbing, and voice-driven products.
	- MOSS‑TTSD: A production long-form dialogue model for expressive multi-speaker conversational audio at scale. It supports long-duration continuity, turn-taking control, and zero-shot voice cloning from short references for podcasts, audiobooks, commentary, dubbing, and entertainment dialogue.
	- MOSS‑VoiceGenerator: An open-source voice design model that creates speaker timbres directly from free-form text, without reference audio. It unifies timbre design, style control, and content synthesis, and can be used standalone or as a voice-design layer for downstream TTS.
	- MOSS‑SoundEffect: A high-fidelity text-to-sound model with broad category coverage and controllable duration for content production. It generates stable audio from prompts covering ambience, urban scenes, creatures, human actions, and music-like clips for film, games, interactive media, and data synthesis.
	- MOSS‑TTS‑Realtime: A context-aware, multi-turn streaming TTS model for real-time voice agents. By conditioning on dialogue history across both text and prior user acoustics, it delivers low-latency synthesis with coherent, consistent voice responses across turns.


	## Released Models

	| Model | Architecture | Size | Model Card | Hugging Face |
	|---|---|---:|---|---|
	| MOSS-TTS | MossTTSDelay | 8B | [moss_tts_model_card.md](https://github.com/OpenMOSS/MOSS-TTS/blob/main/docs/moss_tts_model_card.md) | 🤗 [Hugging Face](https://huggingface.co/OpenMOSS-Team/MOSS-TTS) |
	| | MossTTSLocal | 1.7B | [moss_tts_model_card.md](https://github.com/OpenMOSS/MOSS-TTS/blob/main/docs/moss_tts_model_card.md) | 🤗 [Hugging Face](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Local-Transformer) |
	| MOSS‑TTSD‑V1.0 | MossTTSDelay | 8B | [moss_ttsd_model_card.md](https://github.com/OpenMOSS/MOSS-TTS/blob/main/docs/moss_ttsd_model_card.md) | 🤗 [Hugging Face](https://huggingface.co/OpenMOSS-Team/MOSS-TTSD-v1.0) |
	| MOSS‑VoiceGenerator | MossTTSDelay | 1.7B | [moss_voice_generator_model_card.md](https://github.com/OpenMOSS/MOSS-TTS/blob/main/docs/moss_voice_generator_model_card.md) | 🤗 [Hugging Face](https://huggingface.co/OpenMOSS-Team/MOSS-Voice-Generator) |
	| MOSS‑SoundEffect | MossTTSDelay | 8B | [moss_sound_effect_model_card.md](https://github.com/OpenMOSS/MOSS-TTS/blob/main/docs/moss_sound_effect_model_card.md) | 🤗 [Hugging Face](https://huggingface.co/OpenMOSS-Team/MOSS-SoundEffect) |
	| MOSS‑TTS‑Realtime | MossTTSRealtime | 1.7B | [moss_tts_realtime_model_card.md](https://github.com/OpenMOSS/MOSS-TTS/blob/main/docs/moss_tts_realtime_model_card.md) | 🤗 [Hugging Face](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Realtime) |



	# MOSS-TTS-Realtime

	## 1. Overview

	### 1.1 TTS Family Positioning

	MOSS-TTS-Realtime is the real-time speech synthesis model in the MOSS-TTS Family. It is designed for interactive voice agents that need low-latency, continuous speech generation across multi-turn conversations. Unlike conventional streaming TTS systems, which synthesize each response in isolation, MOSS-TTS-Realtime models dialogue context natively by conditioning speech generation on both textual and acoustic information from previous turns. Combining multi-turn context awareness with incremental streaming synthesis lets it produce natural, coherent, and voice-consistent responses for fluid, human-like spoken interaction in real-time applications.

	Key Capabilities
	* Context-Aware & Expressive Speech Generation: Generates expressive and coherent speech by modeling both textual and acoustic context across multiple dialogue turns.

	* High-Fidelity Voice Cloning with Multi-Turn Consistency: Achieves high voice similarity to the reference speaker while keeping speaker identity consistent across dialogue turns.

	* Long-Context: Supports a maximum context length of 32K tokens (about 40 minutes of audio), enabling stable and consistent speech generation in extended conversations.

	* Highly Human-Like Speech with Natural Prosody: Trained on over 2.5 million hours of single-speaker speech and more than 1 million hours of two-speaker and multi-speaker conversational data, which yields natural prosody and human-like expressiveness.

	* Multilingual Speech Support: Supports over 10 languages beyond Chinese and English, including Korean, Japanese, German, and French, enabling consistent and expressive speech across languages.
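
To make the long-context figure above concrete, the sketch below derives the average token rate implied by "32K (about 40 minutes)" and uses it to trim a multi-turn history so that it fits the context window. This is an illustration, not the model's actual context handling: reading 32K as 32 × 1024 tokens, the `Turn` class, the cost model, and `trim_history` are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass

MAX_CONTEXT_TOKENS = 32 * 1024  # "32K" from the text; the exact value is an assumption
APPROX_MINUTES = 40             # "about 40 minutes" from the text

# Average rate implied if the whole context were spent on audio (roughly 13.7 tokens/s).
TOKENS_PER_SECOND = MAX_CONTEXT_TOKENS / (APPROX_MINUTES * 60)


@dataclass
class Turn:
    """One dialogue turn (hypothetical structure, not the model's API)."""
    speaker: str
    text_tokens: int
    audio_seconds: float

    @property
    def cost(self) -> int:
        # Hypothetical cost model: text tokens plus audio at the implied rate.
        return self.text_tokens + round(self.audio_seconds * TOKENS_PER_SECOND)


def trim_history(turns, budget=MAX_CONTEXT_TOKENS):
    """Keep the most recent contiguous turns whose combined cost fits the budget."""
    kept = deque()
    used = 0
    for turn in reversed(turns):
        if used + turn.cost > budget:
            break  # older turns no longer fit
        kept.appendleft(turn)
        used += turn.cost
    return list(kept), used
```

Dropping the oldest turns first is only one possible policy. An agent that must preserve an early voice reference would need a different eviction rule.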

	### 1.2 Model Architecture

	<p align="center">
	<img src="https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_imgaes_demo/moss_tts_realtime" width="60%"/>
	</p>

	## 2. Quick Start
	### 2.1 Environment Setup

	We recommend a clean, isolated Python environment with Transformers 5.0.0 to avoid dependency conflicts.

	```bash
	conda create -n moss-tts python=3.12 -y
	conda activate moss-tts
	```

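	Clone the repository and install it, together with its dependencies, in editable mode. The extra index URL pulls PyTorch wheels built for CUDA 12.8:

	```bash
	# Fetch the source code
	git clone https://github.com/OpenMOSS/MOSS-TTS.git
	cd MOSS-TTS
	# Editable install; cu128 selects the CUDA 12.8 PyTorch wheel index
	pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e .
	```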
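
After installation, a quick sanity check can confirm that the environment matches the recommendation above. The following is a minimal sketch using only the standard library. The helper names are illustrative, and treating an exact match on the first three version components as "ok" is an assumption, not a requirement stated by the project.

```python
from importlib import metadata

RECOMMENDED_TRANSFORMERS = "5.0.0"  # version recommended in the text above


def parse_version(version: str) -> tuple:
    """Turn '5.0.0' or '2.7.1+cu128' into a tuple of its leading integers."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at suffixes such as 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def installed_version(package: str):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment() -> dict:
    """Report installed versions and whether Transformers matches the recommendation."""
    report = {name: installed_version(name) for name in ("transformers", "torch")}
    found = report["transformers"]
    report["transformers_ok"] = (
        found is not None
        and parse_version(found)[:3] == parse_version(RECOMMENDED_TRANSFORMERS)
    )
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```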

	### 2.2 Usage
	For detailed usage instructions and examples, see the model card in the GitHub repository:

	👉 Usage Guide:
	https://github.com/OpenMOSS/MOSS-TTS/blob/main/docs/moss_tts_realtime_model_card.md
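
Before following the usage guide, the weights can be fetched ahead of time. The sketch below assumes the `huggingface_hub` package is installed and uses its `snapshot_download` function. The repository ID is taken from the Released Models table, and `download_weights` is a hypothetical helper. Inference itself is not shown, because the synthesis API is documented only in the linked model card.

```python
MODEL_REPO_ID = "OpenMOSS-Team/MOSS-TTS-Realtime"  # from the Released Models table


def download_weights(local_dir=None):
    """Download a snapshot of the model repository and return its local path."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    # With local_dir=None the files go to the default Hugging Face cache.
    return snapshot_download(repo_id=MODEL_REPO_ID, local_dir=local_dir)


# Example call (requires network access and enough disk space for the checkpoint):
# path = download_weights()
```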

	---