Text-to-Speech
PyTorch
moss_tts_nano
custom_code
Kuangwei Chen committed on
Commit 37f06d9 · 1 Parent(s): c3158a0

Update readme

.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ *.png filter=lfs diff=lfs merge=lfs -text
+ *.jpg filter=lfs diff=lfs merge=lfs -text
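As a rough illustration of what the two added patterns do, the sketch below approximates how slash-free patterns such as `*.png` select files by basename. It uses Python's `fnmatch` as a stand-in, which is an assumption: real gitattributes matching has additional rules (for example for patterns containing `/`), so `git check-attr filter -- <path>` remains the authoritative check. The helper name `is_lfs_tracked` is hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Slash-free LFS patterns visible in the hunk above.
LFS_PATTERNS = ["*.zip", "*.zst", "*tfevents*", "*.png", "*.jpg"]


def is_lfs_tracked(path: str, patterns=LFS_PATTERNS) -> bool:
    """Approximate check: does the file's basename match any LFS pattern?"""
    name = PurePosixPath(path).name
    return any(fnmatchcase(name, pattern) for pattern in patterns)


for candidate in ["assets/images/wechat.jpg", "assets/images/concept.png", "README.md"]:
    print(candidate, is_lfs_tracked(candidate))
```

Under this approximation, the five images added in this commit match the new patterns, while `README.md` does not.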
README.md CHANGED
@@ -1,3 +1,226 @@
- ---
- license: apache-2.0
- ---
+ # MOSS-TTS-Nano
+
+ <br>
+
+ <p align="center">
+ <img src="./assets/images/OpenMOSS_Logo.png" height="70" align="middle" />
+ &nbsp;&nbsp;&nbsp;&nbsp;
+ <img src="./assets/images/mosi-logo.png" height="50" align="middle" />
+ </p>
+
+ <div align="center">
+ <a href="https://clawhub.ai/luogao2333/moss-tts-voice"><img src="https://img.shields.io/badge/🦞_OpenClaw-Skills-8A2BE2" alt="OpenClaw"></a>
+ <a href="https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Nano"><img src="https://img.shields.io/badge/Huggingface-Models-orange?logo=huggingface&amp"></a>
+ <a href="https://modelscope.cn/collections/OpenMOSS-Team/MOSS-TTS-Nano"><img src="https://img.shields.io/badge/ModelScope-Models-lightgrey?logo=modelscope&amp"></a>
+ <a href="https://mosi.cn/#models"><img src="https://img.shields.io/badge/Blog-View-blue?logo=internet-explorer&amp"></a>
+ <a href="https://arxiv.org/abs/2603.18090"><img src="https://img.shields.io/badge/Arxiv-2603.18090-red?logo=arxiv&amp"></a>
+
+ <a href="https://studio.mosi.cn/experiments/moss-tts-nano"><img src="https://img.shields.io/badge/AIStudio-Try-green?logo=internet-explorer&amp"></a>
+ <a href="https://studio.mosi.cn/docs/moss-tts-nano"><img src="https://img.shields.io/badge/API-Docs-00A3FF?logo=fastapi&amp"></a>
+ <a href="https://x.com/Open_MOSS"><img src="https://img.shields.io/badge/Twitter-Follow-black?logo=x&amp"></a>
+ <a href="https://discord.gg/Xf3aXddCjc"><img src="https://img.shields.io/badge/Discord-Join-5865F2?logo=discord&amp"></a>
+ <a href="./assets/images/wechat.jpg"><img src="https://img.shields.io/badge/WeChat-Join-07C160?logo=wechat&amp;logoColor=white" alt="WeChat"></a>
+ </div>
+
+ MOSS-TTS-Nano is an open-source **multilingual tiny speech generation model** from [MOSI.AI](https://mosi.cn/#hero) and the [OpenMOSS team](https://www.open-moss.com/). With only **0.1B parameters**, it is designed for **realtime speech generation**, can run directly on **CPU without a GPU**, and keeps the deployment stack simple enough for local demos, web serving, and lightweight product integration.
+
+ ## News
+
+ * 2026.4.10: We released **MOSS-TTS-Nano**. A demo Space is available at [OpenMOSS-Team/MOSS-TTS-Nano](https://huggingface.co/spaces/OpenMOSS-Team/MOSS-TTS-Nano). You can also view the demo and more details at [openmoss.github.io/MOSS-TTS-Nano-Demo/](https://openmoss.github.io/MOSS-TTS-Nano-Demo/).
+
+ ## Demo
+
+ - Online Demo: [https://openmoss.github.io/MOSS-TTS-Nano-Demo/](https://openmoss.github.io/MOSS-TTS-Nano-Demo/)
+ - Hugging Face Space: [OpenMOSS-Team/MOSS-TTS-Nano](https://huggingface.co/spaces/OpenMOSS-Team/MOSS-TTS-Nano)
+
+ ## Contents
+
+ - [News](#news)
+ - [Demo](#demo)
+ - [Introduction](#introduction)
+ - [Main Features](#main-features)
+ - [Supported Languages](#supported-languages)
+ - [Quickstart](#quickstart)
+ - [Environment Setup](#environment-setup)
+ - [Voice Clone with `infer.py`](#voice-clone-with-inferpy)
+ - [Local Web Demo with `app.py`](#local-web-demo-with-apppy)
+ - [CLI Command: `moss-tts-nano generate`](#cli-command-moss-tts-nano-generate)
+ - [CLI Command: `moss-tts-nano serve`](#cli-command-moss-tts-nano-serve)
+ - [MOSS-Audio-Tokenizer-Nano](#moss-audio-tokenizer-nano)
+ - [License](#license)
+ - [Citation](#citation)
+ - [Star History](#star-history)
+
+ ## Introduction
+
+ <p align="center">
+ <img src="./assets/images/concept.png" alt="MOSS-TTS-Nano concept" width="85%" />
+ </p>
+
+ MOSS-TTS-Nano focuses on the part of TTS deployment that matters most in practice: **small footprint**, **low latency**, **good enough quality for realtime products**, and **simple local setup**. It uses a pure autoregressive **Audio Tokenizer + LLM** pipeline and keeps the inference workflow friendly for both terminal users and web-demo users.
+
+ ### Main Features
+
+ - **Tiny model size**: only **0.1B parameters**
+ - **Native audio format**: **48 kHz**, **2-channel** output
+ - **Multilingual**: supports **Chinese, English, and more**
+ - **Pure autoregressive architecture**: built on **Audio Tokenizer + LLM**
+ - **Streaming inference**: low latency and a short time to first audio
+ - **CPU friendly**: streaming generation can run on a **4-core CPU**
+ - **Long-text capable**: supports long input with automatic chunked voice cloning
+ - **Open-source deployment**: direct `python infer.py`, `python app.py`, and packaged CLI support
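The long-text bullet above mentions automatic chunking. One common way to chunk text for TTS is to pack whole sentences greedily into pieces under a length budget, and the sketch below shows only that general idea. It is an assumption for illustration, not the repository's actual chunking code; `chunk_text` and `max_chars` are hypothetical names.

```python
import re


def chunk_text(text: str, max_chars: int = 120) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut mid-sentence.
    """
    # Split after Latin or CJK sentence-ending punctuation, keeping the punctuation.
    # Caveat: this naive rule also splits at decimal points and abbreviations.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?。！？])\s*", text) if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Sentences are joined with a space, which suits Latin scripts better than CJK.
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(chunk_text("One. Two. Three.", max_chars=9))
```

Each chunk could then be synthesized with the same reference audio and the resulting audio concatenated, which is one plausible reading of "chunked voice cloning".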
+
+ ## Supported Languages
+
+ MOSS-TTS-Nano currently supports **20 languages**:
+
+ | Language | Code | Flag | Language | Code | Flag | Language | Code | Flag |
+ |---|---|---|---|---|---|---|---|---|
+ | Chinese | zh | 🇨🇳 | English | en | 🇺🇸 | German | de | 🇩🇪 |
+ | Spanish | es | 🇪🇸 | French | fr | 🇫🇷 | Japanese | ja | 🇯🇵 |
+ | Italian | it | 🇮🇹 | Hungarian | hu | 🇭🇺 | Korean | ko | 🇰🇷 |
+ | Russian | ru | 🇷🇺 | Persian (Farsi) | fa | 🇮🇷 | Arabic | ar | 🇸🇦 |
+ | Polish | pl | 🇵🇱 | Portuguese | pt | 🇵🇹 | Czech | cs | 🇨🇿 |
+ | Danish | da | 🇩🇰 | Swedish | sv | 🇸🇪 | Greek | el | 🇬🇷 |
+ | Turkish | tr | 🇹🇷 | | | | | | |
+
+ ## Quickstart
+
+ ### Environment Setup
+
+ We recommend creating a clean Python environment first, then installing the project in editable mode so that the `moss-tts-nano` command becomes available locally.
+ The examples below intentionally keep arguments minimal and rely on the repository defaults.
+ By default, the code loads `OpenMOSS-Team/MOSS-TTS-Nano` and `OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano`.
+
+ #### Using Conda
+
+ ```bash
+ conda create -n moss-tts-nano python=3.12 -y
+ conda activate moss-tts-nano
+
+ git clone https://github.com/OpenMOSS/MOSS-TTS-Nano.git
+ cd MOSS-TTS-Nano
+
+ pip install -r requirements.txt
+ pip install -e .
+ ```
+
+ If `WeTextProcessing` fails to install from `requirements.txt`, try installing it manually in the same environment:
+
+ ```bash
+ conda install -c conda-forge pynini=2.1.6.post1 -y
+ pip install git+https://github.com/WhizZest/WeTextProcessing.git
+ ```
+
+ ### Voice Clone with `infer.py`
+
+ This repository keeps the direct Python entrypoint for local inference. The example below uses **voice clone mode**, which is the main recommended workflow for MOSS-TTS-Nano.
+
+ ```bash
+ python infer.py \
+   --prompt-audio-path assets/audio/zh_1.wav \
+   --text "欢迎关注模思智能、上海创智学院与复旦大学自然语言处理实验室。"
+ ```
+
+ This writes audio to `generated_audio/infer_output.wav` by default.
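To confirm that a generated file matches the advertised 48 kHz, 2-channel format, a small check with Python's standard `wave` module can help. This is a minimal sketch that assumes the output is a PCM WAV file (the `wave` module raises `wave.Error` for other encodings); `describe_wav` is a hypothetical helper, not part of the repository.

```python
import os
import wave


def describe_wav(path: str) -> tuple[int, int, float]:
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, channels, duration


# Default output path stated above; skipped if nothing has been generated yet.
OUTPUT_PATH = "generated_audio/infer_output.wav"
if os.path.exists(OUTPUT_PATH):
    rate, channels, seconds = describe_wav(OUTPUT_PATH)
    print(f"{rate} Hz, {channels} channel(s), {seconds:.2f} s")
```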
+
+ ### Local Web Demo with `app.py`
+
+ You can launch the local FastAPI demo for browser-based testing:
+
+ ```bash
+ python app.py
+ ```
+
+ Then open `http://127.0.0.1:18083` in your browser.
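Loading the model can take a moment, so scripts that start the demo may want to wait until the port accepts connections before opening the page. The sketch below only probes the TCP port with the standard library; it makes no assumption about the demo's HTTP routes. `wait_for_port` is a hypothetical helper name.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 30.0) -> bool:
    """Poll until a TCP connection to (host, port) succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            # Nothing is listening yet; retry shortly.
            time.sleep(0.5)
    return False
```

For the address given above, `wait_for_port("127.0.0.1", 18083)` returns `True` once the server is listening and `False` if it has not come up within the timeout.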
+
+ ### CLI Command: `moss-tts-nano generate`
+
+ After `pip install -e .`, you can call the packaged CLI directly:
+
+ ```bash
+ moss-tts-nano generate \
+   --prompt-speech assets/audio/zh_1.wav \
+   --text "欢迎关注模思智能、上海创智学院与复旦大学自然语言处理实验室。"
+ ```
+
+ Useful notes:
+
+ - `moss-tts-nano generate` writes to `generated_audio/moss_tts_nano_output.wav` by default.
+ - `--prompt-speech` is the friendly alias for the reference audio path used by voice cloning.
+ - `--text-file` is supported for long-form synthesis.
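For long-form synthesis from a script, the command can be assembled programmatically. The sketch below uses only the flags named above; it assumes that `--text-file` takes the path to a text file and is used in place of `--text`, which this README does not spell out. `build_generate_command` and `long_text.txt` are hypothetical.

```python
import shlex


def build_generate_command(prompt_speech: str, text_file: str) -> list[str]:
    """Build an argument list for `moss-tts-nano generate` in long-form mode."""
    return [
        "moss-tts-nano", "generate",
        "--prompt-speech", prompt_speech,  # reference audio for voice cloning
        "--text-file", text_file,          # assumed: path to the text to synthesize
    ]


cmd = build_generate_command("assets/audio/zh_1.wav", "long_text.txt")
print(shlex.join(cmd))
# To run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.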
+
+ ### CLI Command: `moss-tts-nano serve`
+
+ You can also launch the web demo through the packaged CLI:
+
+ ```bash
+ moss-tts-nano serve
+ ```
+
+ This command forwards to `app.py`, keeps the model loaded in memory, and serves the local browser demo plus HTTP generation endpoints.
+
+ ## MOSS-Audio-Tokenizer-Nano
+
+ <a id="mat-intro"></a>
+ ### Introduction
+ **MOSS-Audio-Tokenizer** is the unified discrete audio interface for the entire MOSS-TTS family. It is built on the **Cat** (**C**ausal **A**udio **T**okenizer with **T**ransformer) architecture, a CNN-free audio tokenizer composed entirely of causal Transformer blocks. It serves as the shared audio backbone for MOSS-TTS, MOSS-TTS-Nano, MOSS-TTSD, MOSS-VoiceGenerator, MOSS-SoundEffect, and MOSS-TTS-Realtime, providing a consistent audio representation across the full product family.
+
+ To further improve perceptual quality while reducing inference cost, we trained **MOSS-Audio-Tokenizer-Nano**, a lightweight tokenizer with approximately **20 million parameters** designed for high-fidelity audio compression. It supports **48 kHz** input and output as well as **stereo audio**, which helps reduce compression loss and improve listening quality. It can compress **48 kHz stereo audio** into a **12.5 Hz** token stream and uses **RVQ with 16 codebooks**, enabling high-fidelity reconstruction across variable bitrates from **0.125 kbps to 4 kbps**.
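The figures in this paragraph can be related with simple arithmetic: a 12.5 Hz token stream at 48 kHz means each token frame covers a fixed number of samples per channel, and a bitrate divided by the frame rate gives the bits spent per frame. The sketch below computes only those totals. How the bits are distributed across the 16 codebooks and the two channels is not stated here, so no per-codebook size is assumed; the function names are illustrative.

```python
SAMPLE_RATE_HZ = 48_000
TOKEN_RATE_HZ = 12.5
MIN_BITRATE_BPS = 125    # 0.125 kbps
MAX_BITRATE_BPS = 4_000  # 4 kbps


def samples_per_frame(sample_rate_hz: float, token_rate_hz: float) -> float:
    """Audio samples (per channel) summarized by one token frame."""
    return sample_rate_hz / token_rate_hz


def bits_per_frame(bitrate_bps: float, token_rate_hz: float) -> float:
    """Total bits available to encode one token frame at a given bitrate."""
    return bitrate_bps / token_rate_hz


print(samples_per_frame(SAMPLE_RATE_HZ, TOKEN_RATE_HZ))  # 3840.0
print(bits_per_frame(MIN_BITRATE_BPS, TOKEN_RATE_HZ))    # 10.0
print(bits_per_frame(MAX_BITRATE_BPS, TOKEN_RATE_HZ))    # 320.0
```

Each frame therefore spans 80 ms of audio (3840 samples at 48 kHz), encoded with between 10 and 320 bits depending on the chosen bitrate.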
+
+
+ To learn more about setup, advanced usage, and evaluation metrics, please visit the [MOSS-Audio-Tokenizer Repository](https://github.com/OpenMOSS/MOSS-Audio-Tokenizer).
+
+ <p align="center">
+ <img src="./assets/images/arch_moss_audio_tokenizer_nano.png" alt="MOSS-Audio-Tokenizer-Nano architecture" width="100%" />
+ Architecture of MOSS-Audio-Tokenizer-Nano
+ </p>
+
+ ### Model Weights
+
+ | Model | Hugging Face | ModelScope |
+ |:-----:|:------------:|:----------:|
+ | **MOSS-Audio-Tokenizer-Nano** | [![Hugging Face](https://img.shields.io/badge/Huggingface-Model-orange?logo=huggingface)](https://huggingface.co/OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano) | [![ModelScope](https://img.shields.io/badge/ModelScope-Model-lightgrey?logo=modelscope)](https://modelscope.cn/models/openmoss/MOSS-Audio-Tokenizer-Nano) |
+
+
+ ## License
+
+ This repository will follow the license specified in the root `LICENSE` file. If you are reading this before that file is published, please treat the repository as **not yet licensed for redistribution**.
+
+ ## Citation
+
+ If you use the MOSS-TTS work in your research or product, please cite:
+
+ ```bibtex
+ @misc{openmoss2026mossttsnano,
+   title={MOSS-TTS-Nano},
+   author={OpenMOSS Team},
+   year={2026},
+   howpublished={GitHub repository},
+   url={https://github.com/OpenMOSS/MOSS-TTS-Nano}
+ }
+ ```
+
+ ```bibtex
+ @misc{gong2026mossttstechnicalreport,
+   title={MOSS-TTS Technical Report},
+   author={Yitian Gong and Botian Jiang and Yiwei Zhao and Yucheng Yuan and Kuangwei Chen and Yaozhou Jiang and Cheng Chang and Dong Hong and Mingshu Chen and Ruixiao Li and Yiyang Zhang and Yang Gao and Hanfu Chen and Ke Chen and Songlin Wang and Xiaogui Yang and Yuqian Zhang and Kexin Huang and ZhengYuan Lin and Kang Yu and Ziqi Chen and Jin Wang and Zhaoye Fei and Qinyuan Cheng and Shimin Li and Xipeng Qiu},
+   year={2026},
+   eprint={2603.18090},
+   archivePrefix={arXiv},
+   primaryClass={cs.SD},
+   url={https://arxiv.org/abs/2603.18090}
+ }
+ ```
+
+ ```bibtex
+ @misc{gong2026mossaudiotokenizerscalingaudiotokenizers,
+   title={MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models},
+   author={Yitian Gong and Kuangwei Chen and Zhaoye Fei and Xiaogui Yang and Ke Chen and Yang Wang and Kexin Huang and Mingshu Chen and Ruixiao Li and Qingyuan Cheng and Shimin Li and Xipeng Qiu},
+   year={2026},
+   eprint={2602.10934},
+   archivePrefix={arXiv},
+   primaryClass={cs.SD},
+   url={https://arxiv.org/abs/2602.10934},
+ }
+ ```
assets/images/OpenMOSS_Logo.png ADDED

Git LFS Details

  • SHA256: 1693063e8714371ed4b04046d49173d120ce8c42d29168dd148afc555cb919d3
  • Pointer size: 130 Bytes
  • Size of remote file: 31.9 kB
assets/images/arch_moss_audio_tokenizer_nano.png ADDED

Git LFS Details

  • SHA256: 2975096ead35b386724868a86a79de46a044eea2cbb815fb75b16f8ac9511db4
  • Pointer size: 131 Bytes
  • Size of remote file: 174 kB
assets/images/concept.png ADDED

Git LFS Details

  • SHA256: 18c079211d63da4e3bc622d49c72ecb96d0f5f078fc912fb5f29065cd4ad3a5f
  • Pointer size: 132 Bytes
  • Size of remote file: 2.23 MB
assets/images/mosi-logo.png ADDED

Git LFS Details

  • SHA256: d83a75af3f1822ea51f8c5ee07df5de580d7edc734924a4ca706b1856878fcfe
  • Pointer size: 130 Bytes
  • Size of remote file: 25.2 kB
assets/images/wechat.jpg ADDED

Git LFS Details

  • SHA256: d14f8415797dbf747bf2236eb6a5f7d3ece7d5b6ba26f090b05f2e59e4a8b8bd
  • Pointer size: 130 Bytes
  • Size of remote file: 11.1 kB