YWMditto commited on
Commit
a03f4bb
·
1 Parent(s): 0c8df99

update arxiv link

Browse files
Files changed (1) hide show
  1. README.md +47 -31
README.md CHANGED
@@ -40,7 +40,7 @@ language:
40
  <a href="https://github.com/OpenMOSS/MOSS-TTS/tree/main"><img src="https://img.shields.io/badge/Project%20Page-GitHub-blue"></a>
41
  <a href="https://modelscope.cn/collections/OpenMOSS-Team/MOSS-TTS"><img src="https://img.shields.io/badge/ModelScope-Models-lightgrey?logo=modelscope&amp"></a>
42
  <a href="https://mosi.cn/#models"><img src="https://img.shields.io/badge/Blog-View-blue?logo=internet-explorer&amp"></a>
43
- <a href="https://github.com/OpenMOSS/MOSS-TTS"><img src="https://img.shields.io/badge/Arxiv-Coming%20soon-red?logo=arxiv&amp"></a>
44
 
45
  <a href="https://studio.mosi.cn"><img src="https://img.shields.io/badge/AIStudio-Try-green?logo=internet-explorer&amp"></a>
46
  <a href="https://studio.mosi.cn/docs/moss-tts"><img src="https://img.shields.io/badge/API-Docs-00A3FF?logo=fastapi&amp"></a>
@@ -62,23 +62,37 @@ MOSS‑TTS Family is an open‑source **speech and sound generation model family
62
 
63
  When a single piece of audio needs to **sound like a real person**, **pronounce every word accurately**, **switch speaking styles across content**, **remain stable over tens of minutes**, and **support dialogue, role‑play, and real‑time interaction**, a single TTS model is often not enough. The **MOSS‑TTS Family** breaks the workflow into five production‑ready models that can be used independently or composed into a complete pipeline.
64
 
65
- - **MOSS‑TTS**: MOSS-TTS is the flagship production TTS foundation model, centered on high-fidelity zero-shot voice cloning with controllable long-form synthesis, pronunciation, and multilingual/code-switched speech. It serves as the core engine for scalable narration, dubbing, and voice-driven products.
66
- - **MOSS‑TTSD**: MOSS-TTSD is a production long-form dialogue model for expressive multi-speaker conversational audio at scale. It supports long-duration continuity, turn-taking control, and zero-shot voice cloning from short references for podcasts, audiobooks, commentary, dubbing, and entertainment dialogue.
67
- - **MOSS‑VoiceGenerator**: MOSS-VoiceGenerator is an open-source voice design model that creates speaker timbres directly from free-form text, without reference audio. It unifies timbre design, style control, and content synthesis, and can be used standalone or as a voice-design layer for downstream TTS.
68
- - **MOSS‑SoundEffect**: MOSS-SoundEffect is a high-fidelity text-to-sound model with broad category coverage and controllable duration for real content production. It generates stable audio from prompts across ambience, urban scenes, creatures, human actions, and music-like clips for film, games, interactive media, and data synthesis.
69
- - **MOSS‑TTS‑Realtime**: MOSS-TTS-Realtime is a context-aware, multi-turn streaming TTS model for real-time voice agents. By conditioning on dialogue history across both text and prior user acoustics, it delivers low-latency synthesis with coherent, consistent voice responses across turns.
70
 
71
 
 
 
 
 
 
 
 
 
 
 
 
 
 
72
  ## Released Models
73
 
74
- | Model | Architecture | Size | Model Card | Hugging Face |
75
- |---|---|---:|---|---|
76
- | **MOSS-TTS** | MossTTSDelay | 8B | [moss_tts_model_card.md](https://github.com/OpenMOSS/MOSS-TTS/blob/main/docs/moss_tts_model_card.md) | 🤗 [Huggingface](https://huggingface.co/OpenMOSS-Team/MOSS-TTS) |
77
- | | MossTTSLocal | 1.7B | [moss_tts_model_card.md](https://github.com/OpenMOSS/MOSS-TTS/blob/main/docs/moss_tts_model_card.md) | 🤗 [Huggingface](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Local-Transformer) |
78
- | **MOSS‑TTSD‑V1.0** | MossTTSDelay | 8B | [moss_ttsd_model_card.md](https://github.com/OpenMOSS/MOSS-TTS/blob/main/docs/moss_ttsd_model_card.md) | 🤗 [Huggingface](https://huggingface.co/OpenMOSS-Team/MOSS-TTSD-v1.0) |
79
- | **MOSS‑VoiceGenerator** | MossTTSDelay | 1.7B | [moss_voice_generator_model_card.md](https://github.com/OpenMOSS/MOSS-TTS/blob/main/docs/moss_voice_generator_model_card.md) | 🤗 [Huggingface](https://huggingface.co/OpenMOSS-Team/MOSS-Voice-Generator) |
80
- | **MOSS‑SoundEffect** | MossTTSDelay | 8B | [moss_sound_effect_model_card.md](https://github.com/OpenMOSS/MOSS-TTS/blob/main/docs/moss_sound_effect_model_card.md) | 🤗 [Huggingface](https://huggingface.co/OpenMOSS-Team/MOSS-SoundEffect) |
81
- | **MOSS‑TTS‑Realtime** | MossTTSRealtime | 1.7B | [moss_tts_realtime_model_card.md](https://github.com/OpenMOSS/MOSS-TTS/blob/main/docs/moss_tts_realtime_model_card.md) | 🤗 [Huggingface](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Realtime) |
 
82
 
83
  ## Supported Languages
84
 
@@ -88,10 +102,10 @@ MOSS-TTS, MOSS-TTSD and MOSS-TTS-Realtime currently supports **20 languages**:
88
  |---|---|---|---|---|---|---|---|---|
89
  | Chinese | zh | 🇨🇳 | English | en | 🇺🇸 | German | de | 🇩🇪 |
90
  | Spanish | es | 🇪🇸 | French | fr | 🇫🇷 | Japanese | ja | 🇯🇵 |
91
- | Italian | it | 🇮🇹 | Hebrew | he | 🇮🇱 | Korean | ko | 🇰🇷 |
92
  | Russian | ru | 🇷🇺 | Persian (Farsi) | fa | 🇮🇷 | Arabic | ar | 🇸🇦 |
93
  | Polish | pl | 🇵🇱 | Portuguese | pt | 🇵🇹 | Czech | cs | 🇨🇿 |
94
- | Danish | da | 🇩🇰 | Swedish | sv | 🇸🇪 | Hungarian | hu | 🇭🇺 |
95
  | Greek | el | 🇬🇷 | Turkish | tr | 🇹🇷 | | | |
96
 
97
  # MOSS-TTS
@@ -535,29 +549,31 @@ print(model_input_text)
535
  ## 3. Evaluation
536
  MOSS-TTS achieved state-of-the-art results on the open-source zero-shot TTS benchmark Seed-TTS-eval, not only surpassing all open-source models but also rivaling the most powerful closed-source models.
537
 
538
- | Model | Params | Open-source | EN WER (%) ↓ | EN SIM (%) ↑ | ZH CER (%) ↓ | ZH SIM (%) ↑ |
539
  |---|---:|:---:|---:|---:|---:|---:|
540
  | DiTAR | 0.6B | ❌ | 1.69 | 73.5 | 1.02 | 75.3 |
541
- | FishAudio-S1 | 4B | ❌ | 1.72 | 62.57 | 1.22 | 72.1 |
542
- | Seed-TTS | | ❌ | 2.25 | 76.2 | 1.12 | 79.6 |
543
- | MiniMax-Speech | | ❌ | 1.65 | 69.2 | 0.83 | 78.3 |
 
544
  | | | | | | | |
545
  | CosyVoice | 0.3B | ✅ | 4.29 | 60.9 | 3.63 | 72.3 |
546
  | CosyVoice2 | 0.5B | ✅ | 3.09 | 65.9 | 1.38 | 75.7 |
547
  | CosyVoice3 | 0.5B | ✅ | 2.02 | 71.8 | 1.16 | 78 |
548
- | CosyVoice3 | 1.5B | ✅ | 2.22 | 72 | 1.12 | 78.1 |
549
- | F5-TTS | 0.3B | ✅ | 2 | 67 | 1.53 | 76 |
550
  | SparkTTS | 0.5B | ✅ | 3.14 | 57.3 | 1.54 | 66 |
551
  | FireRedTTS | 0.5B | ✅ | 3.82 | 46 | 1.51 | 63.5 |
552
- | FireRedTTS-2 | 1.5B | ✅ | 1.95 | 66.5 | 1.14 | 73.6 |
553
- | Qwen2.5-Omni | 7B | ✅ | 2.72 | 63.2 | 1.7 | 75.2 |
554
- | FishAudio-S1-mini | 0.5B | ✅ | 1.94 | 55 | 1.18 | 68.5 |
555
  | IndexTTS2 | 1.5B | ✅ | 2.23 | 70.6 | 1.03 | 76.5 |
556
  | VibeVoice | 1.5B | ✅ | 3.04 | 68.9 | 1.16 | 74.4 |
557
- | HiggsAudio-v2 | 3B | ✅ | 2.44 | 67.7 | 1.5 | 74 |
558
- | VoxCPM | 0.5B | ✅ | 1.85 | 72.9 | **0.93** | 77.2 |
559
- | Qwen3-TTS | 0.6B | ✅ | 1.68 | 70.39 | 1.23 | 76.4 |
560
- | Qwen3-TTS | 1.7B | ✅ | **1.5** | 71.45 | 1.33 | 76.72 |
 
 
561
  | | | | | | | |
562
- | MossTTSDelay | 8B | ✅ | 1.79 | 71.46 | 1.32 | 77.05 |
563
- | MossTTSLocal | 1.7B | ✅ | 1.85 | **73.42** | 1.2 | **78.82** |
 
40
  <a href="https://github.com/OpenMOSS/MOSS-TTS/tree/main"><img src="https://img.shields.io/badge/Project%20Page-GitHub-blue"></a>
41
  <a href="https://modelscope.cn/collections/OpenMOSS-Team/MOSS-TTS"><img src="https://img.shields.io/badge/ModelScope-Models-lightgrey?logo=modelscope&amp"></a>
42
  <a href="https://mosi.cn/#models"><img src="https://img.shields.io/badge/Blog-View-blue?logo=internet-explorer&amp"></a>
43
+ <a href="https://arxiv.org/pdf/2603.18090"><img src="https://img.shields.io/badge/Arxiv-2603.18090-red?logo=Arxiv&amp"></a>
44
 
45
  <a href="https://studio.mosi.cn"><img src="https://img.shields.io/badge/AIStudio-Try-green?logo=internet-explorer&amp"></a>
46
  <a href="https://studio.mosi.cn/docs/moss-tts"><img src="https://img.shields.io/badge/API-Docs-00A3FF?logo=fastapi&amp"></a>
 
62
 
63
  When a single piece of audio needs to **sound like a real person**, **pronounce every word accurately**, **switch speaking styles across content**, **remain stable over tens of minutes**, and **support dialogue, role‑play, and real‑time interaction**, a single TTS model is often not enough. The **MOSS‑TTS Family** breaks the workflow into five production‑ready models that can be used independently or composed into a complete pipeline.
64
 
65
+ - **MOSS‑TTS**: The flagship production model featuring high fidelity and optimal zero-shot voice cloning. It supports **long-speech generation**, **fine-grained control over Pinyin, phonemes, and duration**, as well as **multilingual/code-switched synthesis**.
66
+ - **MOSS‑TTSD**: A spoken dialogue generation model for expressive, multi-speaker, and ultra-long dialogues. The new **v1.0 version** achieves **industry-leading performance on objective metrics** and **outperformed top closed-source models like Doubao and Gemini 2.5-pro** in subjective evaluations. You can visit the [MOSS-TTSD repository](https://github.com/OpenMOSS/MOSS-TTSD) for details.
67
+ - **MOSS‑VoiceGenerator**: An open-source voice design model capable of generating diverse voices and styles directly from text prompts, **without any reference speech**. It unifies voice design, style control, and synthesis, functioning independently or as a design layer for downstream TTS. Its performance **surpasses other top-tier voice design models in arena ratings**.
68
+ - **MOSS‑TTS‑Realtime**: A multi-turn context-aware model for real-time voice agents. It uses incremental synthesis to ensure natural and coherent replies, making it **ideal for building low-latency voice agents when paired with text models**. The TTFB (Time To First Byte) of MOSS-TTS-Realtime reaches 180 ms, and the $T_{\text{LLM-first-sentence}} + T_{\text{MOSS-TTS-Realtime-TTFB}}$ is 377 ms.
69
+ - **MOSS‑SoundEffect**: A content creation model specialized in **sound effect generation** with wide category coverage and controllable duration. It generates audio for natural environments, urban scenes, biological sounds, human actions, and musical fragments, suitable for film, games, and interactive experiences.
70
 
71
 
72
+ ## Model Architecture
73
+
74
+ We train **MossTTSDelay** and **MossTTSLocal** as complementary baselines under one training/evaluation setup: **Delay** emphasizes long-context stability, inference speed, and production readiness, while **Local** emphasizes lightweight flexibility and strong objective performance for streaming-oriented systems. Together they provide reproducible references for deployment and research.
75
+
76
+ **MossTTSRealtime** is not a third comparison baseline; it is a capability-driven design for voice agents. By modeling multi-turn context from both prior text and user acoustics, it delivers low-latency streaming speech that stays coherent and voice-consistent across turns.
77
+
78
+
79
+ | Architecture | Core Mechanism | Arch Details |
80
+ |---|---|---|
81
+ | `MossTTSDelay` | Multi‑head parallel RVQ prediction with delay‑pattern scheduling | [![Arch Details](https://img.shields.io/badge/Model%20Card-View-blue?logo=markdown)](https://github.com/OpenMOSS/MOSS-TTS/blob/main/moss_tts_delay/README.md) |
82
+ | `MossTTSLocal` | Time‑synchronous RVQ blocks with a depth transformer | [![Arch Details](https://img.shields.io/badge/Model%20Card-View-blue?logo=markdown)](https://github.com/OpenMOSS/MOSS-TTS/blob/main/moss_tts_local/README.md) |
83
+ | `MossTTSRealtime` | Hierarchical text–audio inputs for realtime synthesis | [![Arch Details](https://img.shields.io/badge/Model%20Card-View-blue?logo=markdown)](https://github.com/OpenMOSS/MOSS-TTS/blob/main/moss_tts_realtime/README.md) |
84
+
85
  ## Released Models
86
 
87
+
88
+ | Model | Architecture | Size | Model Card | Hugging Face | ModelScope |
89
+ |---|---|---:|---|---|---|
90
+ | **MOSS-TTS** | `MossTTSDelay` | 8B | [![Model Card](https://img.shields.io/badge/Model%20Card-View-blue?logo=markdown)](https://github.com/OpenMOSS/MOSS-TTS/blob/main/docs/moss_tts_model_card.md) | [![Hugging Face](https://img.shields.io/badge/Huggingface-Model-orange?logo=huggingface)](https://huggingface.co/OpenMOSS-Team/MOSS-TTS) | [![ModelScope](https://img.shields.io/badge/ModelScope-Model-lightgrey?logo=modelscope)](https://modelscope.cn/models/openmoss/MOSS-TTS) |
91
+ | | `MossTTSLocal` | 1.7B | [![Model Card](https://img.shields.io/badge/Model%20Card-View-blue?logo=markdown)](https://github.com/OpenMOSS/MOSS-TTS/blob/main/docs/moss_tts_model_card.md) | [![Hugging Face](https://img.shields.io/badge/Huggingface-Model-orange?logo=huggingface)](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Local-Transformer) | [![ModelScope](https://img.shields.io/badge/ModelScope-Model-lightgrey?logo=modelscope)](https://modelscope.cn/models/openmoss/MOSS-TTS-Local-Transformer) |
92
+ | **MOSS‑TTSD‑V1.0** | `MossTTSDelay` | 8B | [![Model Card](https://img.shields.io/badge/Model%20Card-View-blue?logo=markdown)](https://github.com/OpenMOSS/MOSS-TTS/blob/main/docs/moss_ttsd_model_card.md) | [![Hugging Face](https://img.shields.io/badge/Huggingface-Model-orange?logo=huggingface)](https://huggingface.co/OpenMOSS-Team/MOSS-TTSD-v1.0) | [![ModelScope](https://img.shields.io/badge/ModelScope-Model-lightgrey?logo=modelscope)](https://modelscope.cn/models/openmoss/MOSS-TTSD-v1.0) |
93
+ | **MOSS‑VoiceGenerator** | `MossTTSDelay` | 1.7B | [![Model Card](https://img.shields.io/badge/Model%20Card-View-blue?logo=markdown)](https://github.com/OpenMOSS/MOSS-TTS/blob/main/docs/moss_voice_generator_model_card.md) | [![Hugging Face](https://img.shields.io/badge/Huggingface-Model-orange?logo=huggingface)](https://huggingface.co/OpenMOSS-Team/MOSS-VoiceGenerator) | [![ModelScope](https://img.shields.io/badge/ModelScope-Model-lightgrey?logo=modelscope)](https://modelscope.cn/models/openmoss/MOSS-VoiceGenerator) |
94
+ | **MOSS‑SoundEffect** | `MossTTSDelay` | 8B | [![Model Card](https://img.shields.io/badge/Model%20Card-View-blue?logo=markdown)](https://github.com/OpenMOSS/MOSS-TTS/blob/main/docs/moss_sound_effect_model_card.md) | [![Hugging Face](https://img.shields.io/badge/Huggingface-Model-orange?logo=huggingface)](https://huggingface.co/OpenMOSS-Team/MOSS-SoundEffect) | [![ModelScope](https://img.shields.io/badge/ModelScope-Model-lightgrey?logo=modelscope)](https://modelscope.cn/models/openmoss/MOSS-SoundEffect) |
95
+ | **MOSS‑TTS‑Realtime** | `MossTTSRealtime` | 1.7B | [![Model Card](https://img.shields.io/badge/Model%20Card-View-blue?logo=markdown)](https://github.com/OpenMOSS/MOSS-TTS/blob/main/docs/moss_tts_realtime_model_card.md) | [![Hugging Face](https://img.shields.io/badge/Huggingface-Model-orange?logo=huggingface)](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Realtime) | [![ModelScope](https://img.shields.io/badge/ModelScope-Model-lightgrey?logo=modelscope)](https://modelscope.cn/models/openmoss/MOSS-TTS-Realtime) |
96
 
97
  ## Supported Languages
98
 
 
102
  |---|---|---|---|---|---|---|---|---|
103
  | Chinese | zh | 🇨🇳 | English | en | 🇺🇸 | German | de | 🇩🇪 |
104
  | Spanish | es | 🇪🇸 | French | fr | 🇫🇷 | Japanese | ja | 🇯🇵 |
105
+ | Italian | it | 🇮🇹 | Hungarian | hu | 🇭🇺 | Korean | ko | 🇰🇷 |
106
  | Russian | ru | 🇷🇺 | Persian (Farsi) | fa | 🇮🇷 | Arabic | ar | 🇸🇦 |
107
  | Polish | pl | 🇵🇱 | Portuguese | pt | 🇵🇹 | Czech | cs | 🇨🇿 |
108
+ | Danish | da | 🇩🇰 | Swedish | sv | 🇸🇪 | | | |
109
  | Greek | el | 🇬🇷 | Turkish | tr | 🇹🇷 | | | |
110
 
111
  # MOSS-TTS
 
549
  ## 3. Evaluation
550
  MOSS-TTS achieved state-of-the-art results on the open-source zero-shot TTS benchmark Seed-TTS-eval, not only surpassing all open-source models but also rivaling the most powerful closed-source models.
551
 
552
+ | Model | Params | Opensource | EN WER (%) ↓ | EN SIM (%) ↑ | ZH CER (%) ↓ | ZH SIM (%) ↑ |
553
  |---|---:|:---:|---:|---:|---:|---:|
554
  | DiTAR | 0.6B | ❌ | 1.69 | 73.5 | 1.02 | 75.3 |
555
+ | FishAudioS1 | 4B | ❌ | 1.72 | 62.57 | 1.22 | 72.1 |
556
+ | CosyVoice3 | 1.5B | ❌ | 2.22 | 72 | 1.12 | 78.1 |
557
+ | Seed‑TTS | | ❌ | 2.25 | 76.2 | 1.12 | 79.6 |
558
+ | MiniMax‑Speech | | ❌ | 1.65 | 69.2 | 0.83 | 78.3 |
559
  | | | | | | | |
560
  | CosyVoice | 0.3B | ✅ | 4.29 | 60.9 | 3.63 | 72.3 |
561
  | CosyVoice2 | 0.5B | ✅ | 3.09 | 65.9 | 1.38 | 75.7 |
562
  | CosyVoice3 | 0.5B | ✅ | 2.02 | 71.8 | 1.16 | 78 |
563
+ | F5‑TTS | 0.3B | ✅ | 2 | 67 | 1.53 | 76 |
 
564
  | SparkTTS | 0.5B | ✅ | 3.14 | 57.3 | 1.54 | 66 |
565
  | FireRedTTS | 0.5B | ✅ | 3.82 | 46 | 1.51 | 63.5 |
566
+ | FireRedTTS2 | 1.5B | ✅ | 1.95 | 66.5 | 1.14 | 73.6 |
567
+ | Qwen2.5Omni | 7B | ✅ | 2.72 | 63.2 | 1.7 | 75.2 |
568
+ | FishAudioS1mini | 0.5B | ✅ | 1.94 | 55 | 1.18 | 68.5 |
569
  | IndexTTS2 | 1.5B | ✅ | 2.23 | 70.6 | 1.03 | 76.5 |
570
  | VibeVoice | 1.5B | ✅ | 3.04 | 68.9 | 1.16 | 74.4 |
571
+ | HiggsAudiov2 | 3B | ✅ | 2.44 | 67.7 | 1.5 | 74 |
572
+ | GLM-TTS | 1.5B | ✅ | 2.23 | 67.2 | 1.03 | 76.1 |
573
+ | GLM-TTS-RL | 1.5B | ✅ | 1.91 | 68.1 | **0.89** | 76.4 |
574
+ | VoxCPM | 0.5B | ✅ | 1.85 | 72.9 | 0.93 | 77.2 |
575
+ | Qwen3‑TTS | 0.6B | ✅ | 1.68 | 70.39 | 1.23 | 76.4 |
576
+ | Qwen3‑TTS | 1.7B | ✅ | **1.5** | 71.45 | 1.33 | 76.72 |
577
  | | | | | | | |
578
+ | **MossTTSDelay** | **8B** | ✅ | 1.84 | 70.86 | 1.37 | 76.98 |
579
+ | **MossTTSLocal** | **1.7B** | ✅ | 1.93 | **73.28** | 1.44 | **79.62** |