niobures committed on Apr 21

Commit

052c0a4

verified ·

1 Parent(s): 5d5fac5

MOSS-TTS-Nano-100M

Files changed (22)

.gitattributes +2 -0
models/MOSS-TTS-Nano-100M/.gitattributes +37 -0
models/MOSS-TTS-Nano-100M/.gitignore +1 -0
models/MOSS-TTS-Nano-100M/README.md +253 -0
models/MOSS-TTS-Nano-100M/__init__.py +31 -0
models/MOSS-TTS-Nano-100M/assets/images/OpenMOSS_Logo.png +0 -0
models/MOSS-TTS-Nano-100M/assets/images/arch_moss_audio_tokenizer_nano.png +3 -0
models/MOSS-TTS-Nano-100M/assets/images/concept.png +3 -0
models/MOSS-TTS-Nano-100M/assets/images/mosi-logo.png +0 -0
models/MOSS-TTS-Nano-100M/assets/images/wechat.jpg +0 -0
models/MOSS-TTS-Nano-100M/config.json +197 -0
models/MOSS-TTS-Nano-100M/configuration_moss_tts_nano.py +108 -0
models/MOSS-TTS-Nano-100M/gpt2_decoder.py +618 -0
models/MOSS-TTS-Nano-100M/languages.txt +19 -0
models/MOSS-TTS-Nano-100M/modeling_moss_tts_nano.py +0 -0
models/MOSS-TTS-Nano-100M/prompting.py +92 -0
models/MOSS-TTS-Nano-100M/pytorch_model.bin +3 -0
models/MOSS-TTS-Nano-100M/source.txt +1 -0
models/MOSS-TTS-Nano-100M/special_tokens_map.json +30 -0
models/MOSS-TTS-Nano-100M/tokenization_moss_tts_nano.py +106 -0
models/MOSS-TTS-Nano-100M/tokenizer.model +3 -0
models/MOSS-TTS-Nano-100M/tokenizer_config.json +52 -0

.gitattributes CHANGED

@@ -72,3 +72,5 @@ MOSS-TTS[[:space:]]Technical[[:space:]]Report.pdf filter=lfs diff=lfs merge=lfs
 models/MOSS-TTS-Nano-100M-ONNX/moss_tts_global_shared.data filter=lfs diff=lfs merge=lfs -text
 models/MOSS-TTS-Nano-100M-ONNX/moss_tts_local_shared.data filter=lfs diff=lfs merge=lfs -text
 models/MOSS-TTS-Local-Transformer/tokenizer.json filter=lfs diff=lfs merge=lfs -text
+models/MOSS-TTS-Nano-100M/assets/images/arch_moss_audio_tokenizer_nano.png filter=lfs diff=lfs merge=lfs -text
+models/MOSS-TTS-Nano-100M/assets/images/concept.png filter=lfs diff=lfs merge=lfs -text

models/MOSS-TTS-Nano-100M/.gitattributes ADDED

	@@ -0,0 +1,37 @@

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
+*.png filter=lfs diff=lfs merge=lfs -text
+*.jpg filter=lfs diff=lfs merge=lfs -text

models/MOSS-TTS-Nano-100M/.gitignore ADDED

	@@ -0,0 +1 @@


+__pycache__

models/MOSS-TTS-Nano-100M/README.md ADDED

	@@ -0,0 +1,253 @@

+---
+license: apache-2.0
+tags:
+- text-to-speech
+language:
+- zh
+- en
+- de
+- es
+- fr
+- ja
+- it
+- he
+- ko
+- ru
+- fa
+- ar
+- pl
+- pt
+- cs
+- da
+- sv
+- hu
+- el
+- tr
+---
+# MOSS-TTS-Nano
+<br>
+<p align="center">
+  <img src="./assets/images/OpenMOSS_Logo.png" height="70" align="middle" />
+  &nbsp;&nbsp;&nbsp;&nbsp;
+  <img src="./assets/images/mosi-logo.png" height="50" align="middle" />
+</p>
+<div align="center">
+  <a href="https://clawhub.ai/luogao2333/moss-tts-voice"><img src="https://img.shields.io/badge/🦞_OpenClaw-Skills-8A2BE2" alt="OpenClaw"></a>
+  <a href="https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Nano"><img src="https://img.shields.io/badge/Huggingface-Models-orange?logo=huggingface&amp"></a>
+  <a href="https://modelscope.cn/models/openmoss/MOSS-TTS-Nano"><img src="https://img.shields.io/badge/ModelScope-Models-7B61FF?logo=modelscope&amp;logoColor=white"></a>
+  <a href="https://mosi.cn/#models"><img src="https://img.shields.io/badge/Blog-View-blue?logo=internet-explorer&amp"></a>
+  <a href="https://arxiv.org/abs/2603.18090"><img src="https://img.shields.io/badge/Arxiv-2603.18090-red?logo=arxiv&amp"></a>
+  <a href="https://studio.mosi.cn/experiments/moss-tts-nano"><img src="https://img.shields.io/badge/AIStudio-Try-green?logo=internet-explorer&amp"></a>
+  <a href="https://studio.mosi.cn/docs/moss-tts-nano"><img src="https://img.shields.io/badge/API-Docs-00A3FF?logo=fastapi&amp"></a>
+  <a href="https://x.com/Open_MOSS"><img src="https://img.shields.io/badge/Twitter-Follow-black?logo=x&amp"></a>
+  <a href="https://discord.gg/Xf3aXddCjc"><img src="https://img.shields.io/badge/Discord-Join-5865F2?logo=discord&amp"></a>
+  <a href="./assets/images/wechat.jpg"><img src="https://img.shields.io/badge/WeChat-Join-07C160?logo=wechat&amp;logoColor=white" alt="WeChat"></a>
+</div>
+MOSS-TTS-Nano is an open-source **multilingual tiny speech generation model** from [MOSI.AI](https://mosi.cn/#hero) and the [OpenMOSS team](https://www.open-moss.com/). With only **0.1B parameters**, it is designed for **realtime speech generation**, can run directly on **CPU without a GPU**, and keeps the deployment stack simple enough for local demos, web serving, and lightweight product integration.
+## News
+* 2026.4.10: We release **MOSS-TTS-Nano**. A demo Space is available at [OpenMOSS-Team/MOSS-TTS-Nano](https://huggingface.co/spaces/OpenMOSS-Team/MOSS-TTS-Nano). You can also view the demo and more details at [openmoss.github.io/MOSS-TTS-Nano-Demo/](https://openmoss.github.io/MOSS-TTS-Nano-Demo/).
+## Demo
+- Online Demo: [https://openmoss.github.io/MOSS-TTS-Nano-Demo/](https://openmoss.github.io/MOSS-TTS-Nano-Demo/)
+- Hugging Face Space: [OpenMOSS-Team/MOSS-TTS-Nano](https://huggingface.co/spaces/OpenMOSS-Team/MOSS-TTS-Nano)
+## Contents
+- [News](#news)
+- [Demo](#demo)
+- [Introduction](#introduction)
+  - [Main Features](#main-features)
+- [Supported Languages](#supported-languages)
+- [Quickstart](#quickstart)
+  - [Environment Setup](#environment-setup)
+  - [Voice Clone with `infer.py`](#voice-clone-with-inferpy)
+  - [Local Web Demo with `app.py`](#local-web-demo-with-apppy)
+  - [CLI Command: `moss-tts-nano generate`](#cli-command-moss-tts-nano-generate)
+  - [CLI Command: `moss-tts-nano serve`](#cli-command-moss-tts-nano-serve)
+- [MOSS-Audio-Tokenizer-Nano](#moss-audio-tokenizer-nano)
+- [License](#license)
+- [Citation](#citation)
+- [Star History](#star-history)
+## Introduction
+<p align="center">
+  <img src="./assets/images/concept.png" alt="MOSS-TTS-Nano concept" width="85%" />
+</p>
+MOSS-TTS-Nano focuses on the part of TTS deployment that matters most in practice: **small footprint**, **low latency**, **good enough quality for realtime products**, and **simple local setup**. It uses a pure autoregressive **Audio Tokenizer + LLM** pipeline and keeps the inference workflow friendly for both terminal users and web-demo users.
+### Main Features
+- **Tiny model size**: only **0.1B parameters**
+- **Native audio format**: **48 kHz**, **2-channel** output
+- **Multilingual**: supports **Chinese, English, and more**
+- **Pure autoregressive architecture**: built on **Audio Tokenizer + LLM**
+- **Streaming inference**: low realtime latency and fast first audio
+- **CPU friendly**: streaming generation can run on a **4-core CPU**
+- **Long-text capable**: supports long input with automatic chunked voice cloning
+- **Open-source deployment**: direct `python infer.py`, `python app.py`, and packaged CLI support
+## Supported Languages
+MOSS-TTS-Nano currently supports **20 languages**:
+| Language | Code | Flag | Language | Code | Flag | Language | Code | Flag |
+|---|---|---|---|---|---|---|---|---|
+| Chinese | zh | 🇨🇳 | English | en | 🇺🇸 | German | de | 🇩🇪 |
+| Spanish | es | 🇪🇸 | French | fr | 🇫🇷 | Japanese | ja | 🇯🇵 |
+| Italian | it | 🇮🇹 | Hungarian | hu | 🇭🇺 | Korean | ko | 🇰🇷 |
+| Russian | ru | 🇷🇺 | Persian (Farsi) | fa | 🇮🇷 | Arabic | ar | 🇸🇦 |
+| Polish | pl | 🇵🇱 | Portuguese | pt | 🇵🇹 | Czech | cs | 🇨🇿 |
+| Danish | da | 🇩🇰 | Swedish | sv | 🇸🇪 | Greek | el | 🇬🇷 |
+| Turkish | tr | 🇹🇷 |  |  |  |  |  |  |
+## Quickstart
+### Environment Setup
+We recommend a clean Python environment first, then installing the project in editable mode so the `moss-tts-nano` command becomes available locally.
+The examples below intentionally keep arguments minimal and rely on the repository defaults.
+By default, the code loads `OpenMOSS-Team/MOSS-TTS-Nano` and `OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano`.
+#### Using Conda
+```bash
+conda create -n moss-tts-nano python=3.12 -y
+conda activate moss-tts-nano
+git clone https://github.com/OpenMOSS/MOSS-TTS-Nano.git
+cd MOSS-TTS-Nano
+pip install -r requirements.txt
+pip install -e .
+```
+If `WeTextProcessing` fails to install from `requirements.txt`, try installing it manually in the same environment:
+```bash
+conda install -c conda-forge pynini=2.1.6.post1 -y
+pip install git+https://github.com/WhizZest/WeTextProcessing.git
+```
+### Voice Clone with `infer.py`
+This repository keeps the direct Python entrypoint for local inference. The example below uses **voice clone mode**, which is the main recommended workflow for MOSS-TTS-Nano.
+```bash
+python infer.py \
+  --prompt-audio-path assets/audio/zh_1.wav \
+  --text "欢迎关注模思智能、上海创智学院与复旦大学自然语言处理实验室。"
+```
+This writes audio to `generated_audio/infer_output.wav` by default.
+### Local Web Demo with `app.py`
+You can launch the local FastAPI demo for browser-based testing:
+```bash
+python app.py
+```
+Then open `http://127.0.0.1:18083` in your browser.
+### CLI Command: `moss-tts-nano generate`
+After `pip install -e .`, you can call the packaged CLI directly:
+```bash
+moss-tts-nano generate \
+  --prompt-speech assets/audio/zh_1.wav \
+  --text "欢迎关注模思智能、上海创智学院与复旦大学自然语言处理实验室。"
+```
+Useful notes:
+- `moss-tts-nano generate` writes to `generated_audio/moss_tts_nano_output.wav` by default.
+- `--prompt-speech` is the friendly alias for the reference audio path used by voice cloning.
+- `--text-file` is supported for long-form synthesis.
+### CLI Command: `moss-tts-nano serve`
+You can also launch the web demo through the packaged CLI:
+```bash
+moss-tts-nano serve
+```
+This command forwards to `app.py`, keeps the model loaded in memory, and serves the local browser demo plus HTTP generation endpoints.
+## MOSS-Audio-Tokenizer-Nano
+<a id="mat-intro"></a>
+### Introduction
+**MOSS-Audio-Tokenizer** is the unified discrete audio interface for the entire MOSS-TTS family. It is built on the **Cat** (**C**ausal **A**udio **T**okenizer with **T**ransformer) architecture, a CNN-free audio tokenizer composed entirely of causal Transformer blocks. It serves as the shared audio backbone for MOSS-TTS, MOSS-TTS-Nano, MOSS-TTSD, MOSS-VoiceGenerator, MOSS-SoundEffect, and MOSS-TTS-Realtime, providing a consistent audio representation across the full product family.
+To further improve perceptual quality while reducing inference cost, we trained **MOSS-Audio-Tokenizer-Nano**, a lightweight tokenizer with approximately **20 million parameters** designed for high-fidelity audio compression. It supports **48 kHz** input and output as well as **stereo audio**, which helps reduce compression loss and improve listening quality. It can compress **48 kHz stereo audio** into a **12.5 Hz** token stream and uses **RVQ with 16 codebooks**, enabling high-fidelity reconstruction across variable bitrates from **0.125 kbps to 4 kbps**.
+To learn more about setup, advanced usage, and evaluation metrics, please visit the [MOSS-Audio-Tokenizer repository](https://github.com/OpenMOSS/MOSS-Audio-Tokenizer).
+<p align="center">
+  <img src="./assets/images/arch_moss_audio_tokenizer_nano.png" alt="MOSS-Audio-Tokenizer-Nano architecture" width="100%" />
+  Architecture of MOSS-Audio-Tokenizer-Nano
+</p>
+### Model Weights
+| Model | Hugging Face | ModelScope |
+|:-----:|:------------:|:----------:|
+| **MOSS-Audio-Tokenizer-Nano** | [![Hugging Face](https://img.shields.io/badge/Huggingface-Model-orange?logo=huggingface)](https://huggingface.co/OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano) | [![ModelScope](https://img.shields.io/badge/ModelScope-Models-7B61FF?logo=modelscope&amp;logoColor=white)](https://modelscope.cn/models/openmoss/MOSS-Audio-Tokenizer-Nano) |
+## License
+This repository will follow the license specified in the root `LICENSE` file. If you are reading this before that file is published, please treat the repository as **not yet licensed for redistribution**.
+## Citation
+If you use the MOSS-TTS work in your research or product, please cite:
+```bibtex
+@misc{openmoss2026mossttsnano,
+  title={MOSS-TTS-Nano},
+  author={OpenMOSS Team},
+  year={2026},
+  howpublished={GitHub repository},
+  url={https://github.com/OpenMOSS/MOSS-TTS-Nano}
+}
+```
+```bibtex
+@misc{gong2026mossttstechnicalreport,
+  title={MOSS-TTS Technical Report},
+  author={Yitian Gong and Botian Jiang and Yiwei Zhao and Yucheng Yuan and Kuangwei Chen and Yaozhou Jiang and Cheng Chang and Dong Hong and Mingshu Chen and Ruixiao Li and Yiyang Zhang and Yang Gao and Hanfu Chen and Ke Chen and Songlin Wang and Xiaogui Yang and Yuqian Zhang and Kexin Huang and ZhengYuan Lin and Kang Yu and Ziqi Chen and Jin Wang and Zhaoye Fei and Qinyuan Cheng and Shimin Li and Xipeng Qiu},
+  year={2026},
+  eprint={2603.18090},
+  archivePrefix={arXiv},
+  primaryClass={cs.SD},
+  url={https://arxiv.org/abs/2603.18090}
+}
+```
+```bibtex
+@misc{gong2026mossaudiotokenizerscalingaudiotokenizers,
+  title={MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models},
+  author={Yitian Gong and Kuangwei Chen and Zhaoye Fei and Xiaogui Yang and Ke Chen and Yang Wang and Kexin Huang and Mingshu Chen and Ruixiao Li and Qingyuan Cheng and Shimin Li and Xipeng Qiu},
+  year={2026},
+  eprint={2602.10934},
+  archivePrefix={arXiv},
+  primaryClass={cs.SD},
+  url={https://arxiv.org/abs/2602.10934},
+}
+```
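
The tokenizer figures quoted in the README (a 12.5 Hz token stream, RVQ with 16 codebooks) can be sanity-checked with a little arithmetic. The sketch below assumes the usual RVQ accounting, in which each codebook contributes log2(codebook size) bits per frame. The 1024-entry codebook size is taken from `config.json` in this commit; the function name `rvq_bitrate_kbps` is illustrative only. A single codebook reproduces the README's 0.125 kbps lower bound. How the 4 kbps upper bound is counted (for example, whether it covers both stereo channels) is not stated, so the sketch does not try to derive it.

```python
import math


def rvq_bitrate_kbps(frame_rate_hz: float, n_codebooks: int, codebook_size: int) -> float:
    """Bitrate of one RVQ token stream in kbps (1 kbps = 1000 bit/s)."""
    bits_per_code = math.log2(codebook_size)  # 1024 entries -> 10 bits
    return frame_rate_hz * n_codebooks * bits_per_code / 1000.0


FRAME_RATE_HZ = 12.5   # token rate quoted in the README
CODEBOOK_SIZE = 1024   # audio_codebook_sizes entries in config.json

one_codebook = rvq_bitrate_kbps(FRAME_RATE_HZ, 1, CODEBOOK_SIZE)
all_codebooks = rvq_bitrate_kbps(FRAME_RATE_HZ, 16, CODEBOOK_SIZE)

print(f"1 codebook:   {one_codebook} kbps")
print(f"16 codebooks: {all_codebooks} kbps per token stream")
```

Under these assumptions, dropping trailing codebooks trades bitrate for fidelity in steps of 0.125 kbps per codebook.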

models/MOSS-TTS-Nano-100M/__init__.py ADDED

	@@ -0,0 +1,31 @@

+from .configuration_moss_tts_nano import MossTTSNanoConfig
+from .modeling_moss_tts_nano import (
+    MossTTSNanoForCausalLM,
+    MossTTSNanoGenerationOutput,
+    MossTTSNanoOutput,
+)
+from .tokenization_moss_tts_nano import MossTTSNanoSentencePieceTokenizer
+try:
+    MossTTSNanoConfig.register_for_auto_class()
+except Exception:
+    pass
+for auto_class_name in ("AutoModel", "AutoModelForCausalLM"):
+    try:
+        MossTTSNanoForCausalLM.register_for_auto_class(auto_class_name)
+    except Exception:
+        pass
+try:
+    MossTTSNanoSentencePieceTokenizer.register_for_auto_class("AutoTokenizer")
+except Exception:
+    pass
+__all__ = [
+    "MossTTSNanoConfig",
+    "MossTTSNanoForCausalLM",
+    "MossTTSNanoSentencePieceTokenizer",
+    "MossTTSNanoGenerationOutput",
+    "MossTTSNanoOutput",
+]
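
The `register_for_auto_class` calls above pair with the `auto_map` entry in this commit's `config.json`: each `auto_map` value names a module file in the repository and a class inside it. The sketch below only illustrates how such a `module.Class` string splits into those two parts. It is not the Transformers loader itself, and the helper name `split_auto_map_entry` is hypothetical.

```python
from typing import Dict, Tuple

# Copied from the auto_map block of config.json in this commit.
AUTO_MAP: Dict[str, str] = {
    "AutoConfig": "configuration_moss_tts_nano.MossTTSNanoConfig",
    "AutoModel": "modeling_moss_tts_nano.MossTTSNanoForCausalLM",
    "AutoModelForCausalLM": "modeling_moss_tts_nano.MossTTSNanoForCausalLM",
}


def split_auto_map_entry(entry: str) -> Tuple[str, str]:
    """Split 'module.Class' into (module file name, class name)."""
    module_name, _, class_name = entry.rpartition(".")
    if not module_name or not class_name:
        raise ValueError(f"expected 'module.Class', got {entry!r}")
    return f"{module_name}.py", class_name


resolved = {auto_class: split_auto_map_entry(entry) for auto_class, entry in AUTO_MAP.items()}
for auto_class, (file_name, class_name) in resolved.items():
    print(f"{auto_class}: {class_name} defined in {file_name}")
```

Both model entries point at the same class, which matches the loop in `__init__.py` that registers `MossTTSNanoForCausalLM` for `AutoModel` and `AutoModelForCausalLM`.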

models/MOSS-TTS-Nano-100M/assets/images/OpenMOSS_Logo.png ADDED

models/MOSS-TTS-Nano-100M/assets/images/arch_moss_audio_tokenizer_nano.png ADDED

Git LFS Details

SHA256: 2975096ead35b386724868a86a79de46a044eea2cbb815fb75b16f8ac9511db4
Pointer size: 131 Bytes
Size of remote file: 174 kB

models/MOSS-TTS-Nano-100M/assets/images/concept.png ADDED

Git LFS Details

SHA256: 18c079211d63da4e3bc622d49c72ecb96d0f5f078fc912fb5f29065cd4ad3a5f
Pointer size: 132 Bytes
Size of remote file: 2.23 MB

models/MOSS-TTS-Nano-100M/assets/images/mosi-logo.png ADDED

models/MOSS-TTS-Nano-100M/assets/images/wechat.jpg ADDED

models/MOSS-TTS-Nano-100M/config.json ADDED

	@@ -0,0 +1,197 @@

+{
+  "add_cross_attention": false,
+  "architectures": [
+    "MossTTSNanoForCausalLM"
+  ],
+  "attn_implementation": "flash_attention_2",
+  "audio_assistant_slot_token_id": 9,
+  "audio_codebook_sizes": [
+    1024,
+    1024,
+    1024,
+    1024,
+    1024,
+    1024,
+    1024,
+    1024,
+    1024,
+    1024,
+    1024,
+    1024,
+    1024,
+    1024,
+    1024,
+    1024
+  ],
+  "audio_end_token_id": 7,
+  "audio_pad_token_id": 1024,
+  "audio_start_token_id": 6,
+  "audio_tokenizer_pretrained_name_or_path": "OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano",
+  "audio_tokenizer_sample_rate": 48000,
+  "audio_tokenizer_type": "moss-audio-tokenizer-nano",
+  "audio_user_slot_token_id": 8,
+  "audio_vocab_size": 1024,
+  "bad_words_ids": null,
+  "begin_suppress_tokens": null,
+  "bos_token_id": null,
+  "chunk_size_feed_forward": 0,
+  "cross_attention_hidden_size": null,
+  "decoder_start_token_id": null,
+  "diversity_penalty": 0.0,
+  "do_sample": false,
+  "dtype": "float32",
+  "early_stopping": false,
+  "encoder_no_repeat_ngram_size": 0,
+  "eos_token_id": null,
+  "exponential_decay_length_penalty": null,
+  "finetuning_task": null,
+  "forced_bos_token_id": null,
+  "forced_eos_token_id": null,
+  "gpt2_config": {
+    "_name_or_path": "",
+    "activation_function": "gelu_new",
+    "add_cross_attention": false,
+    "architectures": null,
+    "attn_pdrop": 0.0,
+    "bad_words_ids": null,
+    "begin_suppress_tokens": null,
+    "bos_token_id": 1,
+    "chunk_size_feed_forward": 0,
+    "cross_attention_hidden_size": null,
+    "decoder_start_token_id": null,
+    "diversity_penalty": 0.0,
+    "do_sample": false,
+    "dtype": null,
+    "early_stopping": false,
+    "embd_pdrop": 0.0,
+    "encoder_no_repeat_ngram_size": 0,
+    "eos_token_id": 2,
+    "exponential_decay_length_penalty": null,
+    "finetuning_task": null,
+    "forced_bos_token_id": null,
+    "forced_eos_token_id": null,
+    "id2label": {
+      "0": "LABEL_0",
+      "1": "LABEL_1"
+    },
+    "initializer_range": 0.02,
+    "is_decoder": false,
+    "is_encoder_decoder": false,
+    "label2id": {
+      "LABEL_0": 0,
+      "LABEL_1": 1
+    },
+    "layer_norm_epsilon": 1e-05,
+    "length_penalty": 1.0,
+    "max_length": 20,
+    "min_length": 0,
+    "model_type": "gpt2",
+    "n_ctx": 32768,
+    "n_embd": 768,
+    "n_head": 12,
+    "n_inner": 3072,
+    "n_layer": 12,
+    "n_positions": 32768,
+    "no_repeat_ngram_size": 0,
+    "num_beam_groups": 1,
+    "num_beams": 1,
+    "num_return_sequences": 1,
+    "output_attentions": false,
+    "output_hidden_states": false,
+    "output_scores": false,
+    "pad_token_id": 3,
+    "position_embedding_type": "rope",
+    "prefix": null,
+    "problem_type": null,
+    "pruned_heads": {},
+    "remove_invalid_values": false,
+    "reorder_and_upcast_attn": false,
+    "repetition_penalty": 1.0,
+    "resid_pdrop": 0.0,
+    "return_dict": true,
+    "return_dict_in_generate": false,
+    "rope_base": 10000.0,
+    "scale_attn_by_inverse_layer_idx": false,
+    "scale_attn_weights": true,
+    "sep_token_id": null,
+    "summary_activation": null,
+    "summary_first_dropout": 0.1,
+    "summary_proj_to_labels": true,
+    "summary_type": "cls_index",
+    "summary_use_proj": true,
+    "suppress_tokens": null,
+    "task_specific_params": null,
+    "temperature": 1.0,
+    "tf_legacy_loss": false,
+    "tie_encoder_decoder": false,
+    "tie_word_embeddings": true,
+    "tokenizer_class": null,
+    "top_k": 50,
+    "top_p": 1.0,
+    "torchscript": false,
+    "transformers_version": "4.57.1",
+    "typical_p": 1.0,
+    "use_bfloat16": false,
+    "use_cache": true,
+    "vocab_size": 16384
+  },
+  "hidden_size": 768,
+  "id2label": {
+    "0": "LABEL_0",
+    "1": "LABEL_1"
+  },
+  "im_end_token_id": 5,
+  "im_start_token_id": 4,
+  "initializer_range": 0.02,
+  "is_decoder": false,
+  "is_encoder_decoder": false,
+  "label2id": {
+    "LABEL_0": 0,
+    "LABEL_1": 1
+  },
+  "length_penalty": 1.0,
+  "local_transformer_attn_implementation": "flash_attention_2",
+  "local_transformer_layers": 1,
+  "max_length": 20,
+  "max_position_embeddings": 32768,
+  "min_length": 0,
+  "model_architecture": "global_local_transformer",
+  "model_type": "moss_tts_nano",
+  "n_vq": 16,
+  "no_repeat_ngram_size": 0,
+  "num_beam_groups": 1,
+  "num_beams": 1,
+  "num_return_sequences": 1,
+  "output_attentions": false,
+  "output_hidden_states": false,
+  "output_scores": false,
+  "pad_token_id": 3,
+  "prefix": null,
+  "problem_type": null,
+  "pruned_heads": {},
+  "remove_invalid_values": false,
+  "repetition_penalty": 1.0,
+  "return_dict": true,
+  "return_dict_in_generate": false,
+  "sep_token_id": null,
+  "suppress_tokens": null,
+  "task_specific_params": null,
+  "temperature": 1.0,
+  "tf_legacy_loss": false,
+  "tie_encoder_decoder": false,
+  "tie_word_embeddings": true,
+  "tokenizer_class": "MossTTSNanoSentencePieceTokenizer",
+  "tokenizer_use_fast": false,
+  "top_k": 50,
+  "top_p": 1.0,
+  "torchscript": false,
+  "transformers_version": "4.57.1",
+  "typical_p": 1.0,
+  "use_bfloat16": false,
+  "vocab_size": 16384,
+  "auto_map": {
+    "AutoConfig": "configuration_moss_tts_nano.MossTTSNanoConfig",
+    "AutoModel": "modeling_moss_tts_nano.MossTTSNanoForCausalLM",
+    "AutoModelForCausalLM": "modeling_moss_tts_nano.MossTTSNanoForCausalLM"
+  }
+}
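
As a rough cross-check of the README's "0.1B parameters" figure, the `gpt2_config` values above are enough to estimate the size of the global transformer backbone. The sketch assumes standard projection shapes (four `n_embd × n_embd` attention matrices and two `n_embd × n_inner` MLP matrices per layer) and counts the tied token embedding once. It ignores biases, layer norms, the audio embeddings and heads, and the local transformer, none of which can be sized from the config alone, so treat the result as a lower-bound estimate rather than the model's actual parameter count.

```python
# Values copied from the gpt2_config block of config.json in this commit.
N_LAYER = 12
N_EMBD = 768
N_INNER = 3072
VOCAB_SIZE = 16384


def backbone_weight_count(n_layer: int, n_embd: int, n_inner: int, vocab_size: int) -> int:
    """Estimate weight-matrix parameters of a GPT-2 style decoder stack."""
    attention = 4 * n_embd * n_embd    # Q, K, V and output projections (assumed shapes)
    mlp = 2 * n_embd * n_inner         # input and output projections of the MLP
    embeddings = vocab_size * n_embd   # counted once because tie_word_embeddings is true
    return n_layer * (attention + mlp) + embeddings


total = backbone_weight_count(N_LAYER, N_EMBD, N_INNER, VOCAB_SIZE)
print(f"{total:,} weights (~{total / 1e9:.3f}B)")
```

There is no learned position table in this estimate because the config sets `position_embedding_type` to `rope`.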

models/MOSS-TTS-Nano-100M/configuration_moss_tts_nano.py ADDED

	@@ -0,0 +1,108 @@

+# coding=utf-8
+from typing import Any, Dict, Optional, Union
+from transformers.configuration_utils import PretrainedConfig
+from transformers.models.gpt2.configuration_gpt2 import GPT2Config
+class MossTTSNanoConfig(PretrainedConfig):
+    model_type = "moss_tts_nano"
+    keys_to_ignore_at_inference = ["past_key_values"]
+    def __init__(
+        self,
+        gpt2_config: Optional[Union[GPT2Config, Dict[str, Any]]] = None,
+        n_vq: int = 8,
+        audio_vocab_size: Optional[int] = 1024,
+        audio_codebook_sizes: Optional[list[int]] = None,
+        audio_pad_token_id: int = 1024,
+        pad_token_id: int = 151643,
+        im_start_token_id: int = 151644,
+        im_end_token_id: int = 151645,
+        audio_start_token_id: int = 151652,
+        audio_end_token_id: int = 151653,
+        audio_user_slot_token_id: int = 151654,
+        audio_assistant_slot_token_id: int = 151656,
+        tokenizer_use_fast: bool = False,
+        audio_tokenizer_type: str = "moss-audio-tokenizer-nano",
+        audio_tokenizer_pretrained_name_or_path: Optional[str] = "OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano",
+        audio_tokenizer_sample_rate: int = 48000,
+        attn_implementation: str = "flash_attention_2",
+        initializer_range: float = 0.02,
+        model_architecture: str = "global_local_transformer",
+        local_transformer_layers: int = 4,
+        local_transformer_attn_implementation: Optional[str] = None,
+        **kwargs: Any,
+    ) -> None:
+        if isinstance(gpt2_config, dict):
+            self.gpt2_config = GPT2Config(**gpt2_config)
+        elif gpt2_config is None:
+            self.gpt2_config = GPT2Config()
+        else:
+            self.gpt2_config = gpt2_config
+        self.n_vq = int(n_vq)
+        if audio_codebook_sizes is None:
+            if audio_vocab_size is None:
+                raise ValueError("audio_vocab_size must be set when audio_codebook_sizes is not provided.")
+            resolved_audio_codebook_sizes = [int(audio_vocab_size)] * self.n_vq
+        else:
+            resolved_audio_codebook_sizes = [int(codebook_size) for codebook_size in audio_codebook_sizes]
+        if len(resolved_audio_codebook_sizes) != self.n_vq:
+            raise ValueError(
+                "audio_codebook_sizes must have length n_vq "
+                f"(expected {self.n_vq}, got {len(resolved_audio_codebook_sizes)})."
+            )
+        if any(codebook_size <= 0 for codebook_size in resolved_audio_codebook_sizes):
+            raise ValueError("audio_codebook_sizes must contain positive integers.")
+        max_audio_codebook_size = max(resolved_audio_codebook_sizes)
+        if audio_vocab_size is not None and int(audio_vocab_size) < max_audio_codebook_size:
+            raise ValueError(
+                "audio_vocab_size must be >= max(audio_codebook_sizes) "
+                f"(got {audio_vocab_size}, expected at least {max_audio_codebook_size})."
+            )
+        self.audio_codebook_sizes = resolved_audio_codebook_sizes
+        self.audio_vocab_size = (
+            max_audio_codebook_size if audio_vocab_size is None else int(audio_vocab_size)
+        )
+        self.audio_pad_token_id = int(audio_pad_token_id)
+        if self.audio_pad_token_id < max_audio_codebook_size:
+            raise ValueError(
+                "audio_pad_token_id must be >= max(audio_codebook_sizes) so pad stays outside every codebook "
+                f"(got {self.audio_pad_token_id}, max codebook size {max_audio_codebook_size})."
+            )
+        self.pad_token_id = pad_token_id
+        self.im_start_token_id = im_start_token_id
+        self.im_end_token_id = im_end_token_id
+        self.audio_start_token_id = audio_start_token_id
+        self.audio_end_token_id = audio_end_token_id
+        self.audio_user_slot_token_id = audio_user_slot_token_id
+        self.audio_assistant_slot_token_id = audio_assistant_slot_token_id
+        self.tokenizer_use_fast = tokenizer_use_fast
+        self.audio_tokenizer_type = audio_tokenizer_type
+        self.audio_tokenizer_pretrained_name_or_path = audio_tokenizer_pretrained_name_or_path
+        self.audio_tokenizer_sample_rate = audio_tokenizer_sample_rate
+        self.attn_implementation = attn_implementation
+        self.initializer_range = initializer_range
+        self.model_architecture = model_architecture
+        self.local_transformer_layers = local_transformer_layers
+        self.local_transformer_attn_implementation = (
+            attn_implementation
+            if local_transformer_attn_implementation is None
+            else local_transformer_attn_implementation
+        )
+        self.vocab_size = self.gpt2_config.vocab_size
+        self.hidden_size = self.gpt2_config.hidden_size
+        self.max_position_embeddings = self.gpt2_config.n_positions
+        super().__init__(pad_token_id=pad_token_id, **kwargs)
+    def to_dict(self) -> Dict[str, Any]:
+        output = super().to_dict()
+        output["gpt2_config"] = self.gpt2_config.to_dict()
+        return output
+__all__ = ["MossTTSNanoConfig"]
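
The codebook checks in `MossTTSNanoConfig.__init__` can be exercised without installing Transformers. The function below is a standalone restatement of that validation logic for illustration. The name `resolve_audio_codebook_sizes` is hypothetical and is not part of the committed file.

```python
from typing import List, Optional


def resolve_audio_codebook_sizes(
    n_vq: int,
    audio_vocab_size: Optional[int],
    audio_codebook_sizes: Optional[List[int]],
    audio_pad_token_id: int,
) -> List[int]:
    """Mirror the config's codebook validation and return per-codebook sizes."""
    if audio_codebook_sizes is None:
        if audio_vocab_size is None:
            raise ValueError("audio_vocab_size must be set when audio_codebook_sizes is not provided.")
        sizes = [int(audio_vocab_size)] * n_vq  # same size for every codebook
    else:
        sizes = [int(size) for size in audio_codebook_sizes]
    if len(sizes) != n_vq:
        raise ValueError(f"expected {n_vq} codebook sizes, got {len(sizes)}.")
    if any(size <= 0 for size in sizes):
        raise ValueError("audio_codebook_sizes must contain positive integers.")
    largest = max(sizes)
    if audio_vocab_size is not None and int(audio_vocab_size) < largest:
        raise ValueError("audio_vocab_size must be >= max(audio_codebook_sizes).")
    if audio_pad_token_id < largest:
        # The pad ID has to sit outside every codebook's valid index range.
        raise ValueError("audio_pad_token_id must be >= max(audio_codebook_sizes).")
    return sizes


# Values from config.json in this commit: 16 codebooks of 1024 entries, pad ID 1024.
sizes = resolve_audio_codebook_sizes(16, 1024, None, 1024)
print(len(sizes), "codebooks of size", sizes[0])
```

Note that the class defaults (`n_vq=8`, `local_transformer_layers=4`, and token IDs in the 151643+ range) differ from the values shipped in `config.json` (`n_vq=16`, one local layer, single-digit token IDs), so constructing the config without the JSON file would not describe this checkpoint.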

models/MOSS-TTS-Nano-100M/gpt2_decoder.py ADDED

	@@ -0,0 +1,618 @@

+# coding=utf-8
+from __future__ import annotations
+from dataclasses import dataclass
+from typing import Optional
+import torch
+import torch.nn as nn
+import torch.utils.checkpoint
+from transformers.activations import ACT2FN
+from transformers.modeling_outputs import BaseModelOutputWithPast
+from transformers.models.gpt2.configuration_gpt2 import GPT2Config
+try:
+    from flash_attn import flash_attn_func, flash_attn_varlen_func
+    from flash_attn.bert_padding import pad_input, unpad_input
+    _FLASH_ATTN_AVAILABLE = True
+except Exception:
+    flash_attn_func = None
+    flash_attn_varlen_func = None
+    pad_input = None
+    unpad_input = None
+    _FLASH_ATTN_AVAILABLE = False
+@dataclass
+class PackedSequenceMetadata:
+    cu_seqlens: torch.Tensor
+    max_seqlen: int
+    indices: Optional[torch.Tensor] = None
+    batch_size: Optional[int] = None
+    seq_len: Optional[int] = None
+class MossTTSNanoGPT2RotaryEmbedding(nn.Module):
+    def __init__(self, dim: int, base: float = 10000.0) -> None:
+        super().__init__()
+        if dim % 2 != 0:
+            raise ValueError(f"RoPE head_dim must be even, got {dim}")
+        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
+        self.register_buffer("inv_freq", inv_freq, persistent=False)
+    def forward(
+        self,
+        position_ids: torch.LongTensor,
+        *,
+        device: torch.device,
+        dtype: torch.dtype,
+    ) -> tuple[torch.Tensor, torch.Tensor]:
+        if position_ids.ndim == 1:
+            position_ids = position_ids.unsqueeze(0)
+        freqs = torch.einsum("bs,d->bsd", position_ids.to(device=device, dtype=self.inv_freq.dtype), self.inv_freq)
+        cos = freqs.cos().repeat_interleave(2, dim=-1).unsqueeze(2).to(dtype=dtype)
+        sin = freqs.sin().repeat_interleave(2, dim=-1).unsqueeze(2).to(dtype=dtype)
+        return cos, sin
+def rotate_half(hidden_states: torch.Tensor) -> torch.Tensor:
+    even = hidden_states[..., ::2]
+    odd = hidden_states[..., 1::2]
+    return torch.stack((-odd, even), dim=-1).reshape_as(hidden_states)
+def apply_rotary_pos_emb(
+    hidden_states: torch.Tensor,
+    cos: torch.Tensor,
+    sin: torch.Tensor,
+) -> torch.Tensor:
+    return (hidden_states * cos) + (rotate_half(hidden_states) * sin)
+class MossTTSNanoGPT2MLP(nn.Module):
+    def __init__(self, config: GPT2Config) -> None:
+        super().__init__()
+        hidden_size = int(config.hidden_size)
+        inner_size = int(config.n_inner or 4 * hidden_size)
+        self.fc_in = nn.Linear(hidden_size, inner_size)
+        self.fc_out = nn.Linear(inner_size, hidden_size)
+        self.act = ACT2FN[config.activation_function]
+        self.dropout = nn.Dropout(config.resid_pdrop)
+    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+        hidden_states = self.fc_in(hidden_states)
+        hidden_states = self.act(hidden_states)
+        hidden_states = self.fc_out(hidden_states)
+        return self.dropout(hidden_states)
+class MossTTSNanoGPT2Attention(nn.Module):
+    def __init__(self, config: GPT2Config, layer_idx: int, attn_implementation: str) -> None:
+        super().__init__()
+        hidden_size = int(config.hidden_size)
+        num_heads = int(config.num_attention_heads)
+        if hidden_size % num_heads != 0:
+            raise ValueError(f"hidden_size={hidden_size} must be divisible by num_attention_heads={num_heads}")
+        self.num_heads = num_heads
+        self.head_dim = hidden_size // num_heads
+        self.embed_dim = hidden_size
+        self.layer_idx = layer_idx
+        self.attn_implementation = attn_implementation
+        self.attn_dropout = float(config.attn_pdrop)
+        self.resid_dropout = nn.Dropout(config.resid_pdrop)
+        self.scale_attn_weights = bool(getattr(config, "scale_attn_weights", True))
+        self.scale_attn_by_inverse_layer_idx = bool(getattr(config, "scale_attn_by_inverse_layer_idx", False))
+        self.position_embedding_type = str(getattr(config, "position_embedding_type", "absolute")).lower()
+        if self.position_embedding_type not in {"absolute", "rope"}:
+            raise ValueError(f"Unsupported position_embedding_type={self.position_embedding_type!r}")
+        self.c_attn = nn.Linear(hidden_size, 3 * hidden_size)
+        self.c_proj = nn.Linear(hidden_size, hidden_size)
+        self.rotary_emb = None
+        if self.position_embedding_type == "rope":
+            self.rotary_emb = MossTTSNanoGPT2RotaryEmbedding(
+                self.head_dim,
+                base=float(getattr(config, "rope_base", 10000.0)),
+            )
+    def _split_heads(self, tensor: torch.Tensor) -> torch.Tensor:
+        if tensor.ndim == 3:
+            batch_size, seq_len, _ = tensor.shape
+            return tensor.view(batch_size, seq_len, self.num_heads, self.head_dim)
+        if tensor.ndim == 2:
+            total_tokens, _ = tensor.shape
+            return tensor.view(total_tokens, self.num_heads, self.head_dim)
+        raise ValueError(f"Unsupported tensor rank for attention split: {tensor.ndim}")
+    def _merge_heads(self, tensor: torch.Tensor) -> torch.Tensor:
+        if tensor.ndim == 4:
+            batch_size, seq_len, _, _ = tensor.shape
+            return tensor.reshape(batch_size, seq_len, self.embed_dim)
+        if tensor.ndim == 3:
+            total_tokens, _, _ = tensor.shape
+            return tensor.reshape(total_tokens, self.embed_dim)
+        raise ValueError(f"Unsupported tensor rank for attention merge: {tensor.ndim}")
+    def _causal_attention_mask(
+        self,
+        attention_mask: Optional[torch.Tensor],
+        query_length: int,
+        key_length: int,
+        device: torch.device,
+    ) -> torch.Tensor:
+        query_positions = torch.arange(query_length, device=device, dtype=torch.long)
+        query_positions = query_positions + max(key_length - query_length, 0)
+        key_positions = torch.arange(key_length, device=device, dtype=torch.long)
+        causal = key_positions.unsqueeze(0) <= query_positions.unsqueeze(1)
+        causal = causal.unsqueeze(0).unsqueeze(0)
+        if attention_mask is None:
+            return causal
+        key_mask = attention_mask[:, None, None, :].to(dtype=torch.bool)
+        return causal & key_mask
+    def _eager_attention(
+        self,
+        query: torch.Tensor,
+        key: torch.Tensor,
+        value: torch.Tensor,
+        attention_mask: Optional[torch.Tensor],
+    ) -> torch.Tensor:
+        query = query.transpose(1, 2)
+        key = key.transpose(1, 2)
+        value = value.transpose(1, 2)
+        scale = 1.0
+        if self.scale_attn_weights:
+            scale /= self.head_dim ** 0.5
+        if self.scale_attn_by_inverse_layer_idx:
+            scale /= float(self.layer_idx + 1)
+        scores = torch.matmul(query, key.transpose(-1, -2)) * scale
+        causal_mask = self._causal_attention_mask(
+            attention_mask=attention_mask,
+            query_length=query.shape[-2],
+            key_length=key.shape[-2],
+            device=query.device,
+        )
+        scores = scores.masked_fill(~causal_mask, torch.finfo(scores.dtype).min)
+        probs = torch.softmax(scores, dim=-1)
+        if self.training and self.attn_dropout > 0:
+            probs = torch.dropout(probs, self.attn_dropout, train=True)
+        output = torch.matmul(probs, value)
+        return output.transpose(1, 2).contiguous()
+    def _sdpa_attention(
+        self,
+        query: torch.Tensor,
+        key: torch.Tensor,
+        value: torch.Tensor,
+        attention_mask: Optional[torch.Tensor],
+    ) -> torch.Tensor:
+        query = query.transpose(1, 2)
+        key = key.transpose(1, 2)
+        value = value.transpose(1, 2)
+        mask = None
+        query_attention_mask = None
+        if attention_mask is not None:
+            query_length = query.shape[-2]
+            key_length = key.shape[-2]
+            mask = self._causal_attention_mask(
+                attention_mask=attention_mask,
+                query_length=query_length,
+                key_length=key_length,
+                device=query.device,
+            )
+            query_attention_mask = attention_mask[:, -query_length:].to(dtype=torch.bool, device=query.device)
+            if not bool(query_attention_mask.all()):
+                # SDPA can produce NaNs when a query row is fully masked. For padded query positions,
+                # keep a single aligned key visible, then zero the query output after attention.
+                mask = mask.expand(query.shape[0], -1, -1, -1).clone()
+                invalid_batch, invalid_query = torch.nonzero(~query_attention_mask, as_tuple=True)
+                aligned_key = invalid_query + max(key_length - query_length, 0)
+                mask[invalid_batch, :, invalid_query, aligned_key] = True
+        output = torch.nn.functional.scaled_dot_product_attention(
+            query,
+            key,
+            value,
+            attn_mask=mask,
+            dropout_p=self.attn_dropout if self.training else 0.0,
+            is_causal=mask is None,
+        )
+        if query_attention_mask is not None and not bool(query_attention_mask.all()):
+            output = output.masked_fill(~query_attention_mask[:, None, :, None], 0.0)
+        return output.transpose(1, 2).contiguous()
+    def _flash_attention(
+        self,
+        query: torch.Tensor,
+        key: torch.Tensor,
+        value: torch.Tensor,
+        attention_mask: Optional[torch.Tensor],
+        packed_metadata: Optional[PackedSequenceMetadata],
+    ) -> torch.Tensor:
+        if not _FLASH_ATTN_AVAILABLE:
+            raise ImportError("flash_attn is not installed, but attn_implementation='flash_attention_2' was requested.")
+        if query.device.type != "cuda":
+            raise ValueError("flash_attention_2 requires CUDA tensors.")
+        if query.dtype not in (torch.float16, torch.bfloat16):
+            raise ValueError(
+                f"flash_attention_2 requires fp16/bf16 tensors, but received dtype={query.dtype}."
+            )
+        dropout_p = self.attn_dropout if self.training else 0.0
+        if packed_metadata is not None:
+            if packed_metadata.indices is not None:
+                query = query.reshape(-1, self.num_heads, self.head_dim).index_select(0, packed_metadata.indices)
+                key = key.reshape(-1, self.num_heads, self.head_dim).index_select(0, packed_metadata.indices)
+                value = value.reshape(-1, self.num_heads, self.head_dim).index_select(0, packed_metadata.indices)
+            output = flash_attn_varlen_func(
+                query,
+                key,
+                value,
+                packed_metadata.cu_seqlens,
+                packed_metadata.cu_seqlens,
+                packed_metadata.max_seqlen,
+                packed_metadata.max_seqlen,
+                dropout_p=dropout_p,
+                causal=True,
+            )
+            if packed_metadata.indices is None:
+                return output
+            return pad_input(
+                output,
+                packed_metadata.indices,
+                packed_metadata.batch_size,
+                packed_metadata.seq_len,
+            )
+        if attention_mask is None or bool(attention_mask.all()):
+            return flash_attn_func(
+                query,
+                key,
+                value,
+                dropout_p=dropout_p,
+                causal=True,
+            )
+        unpadded_query, indices, cu_seqlens, max_seqlen, _ = unpad_input(query, attention_mask)
+        unpadded_key, _, _, _, _ = unpad_input(key, attention_mask)
+        unpadded_value, _, _, _, _ = unpad_input(value, attention_mask)
+        output = flash_attn_varlen_func(
+            unpadded_query,
+            unpadded_key,
+            unpadded_value,
+            cu_seqlens,
+            cu_seqlens,
+            max_seqlen,
+            max_seqlen,
+            dropout_p=dropout_p,
+            causal=True,
+        )
+        return pad_input(output, indices, query.shape[0], query.shape[1])
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        packed_metadata: Optional[PackedSequenceMetadata] = None,
+        layer_past: Optional[tuple[torch.Tensor, torch.Tensor]] = None,
+        use_cache: bool = False,
+    ) -> tuple[torch.Tensor, Optional[tuple[torch.Tensor, torch.Tensor]]]:
+        qkv = self.c_attn(hidden_states)
+        query, key, value = qkv.split(self.embed_dim, dim=-1)
+        query = self._split_heads(query)
+        key = self._split_heads(key)
+        value = self._split_heads(value)
+        if self.rotary_emb is not None:
+            if position_ids is None:
+                raise ValueError("position_ids must be provided when position_embedding_type='rope'.")
+            cos, sin = self.rotary_emb(
+                position_ids.to(device=query.device),
+                device=query.device,
+                dtype=query.dtype,
+            )
+            query = apply_rotary_pos_emb(query, cos, sin)
+            key = apply_rotary_pos_emb(key, cos, sin)
+        if layer_past is not None:
+            past_key, past_value = layer_past
+            key = torch.cat([past_key.to(device=key.device, dtype=key.dtype), key], dim=1)
+            value = torch.cat([past_value.to(device=value.device, dtype=value.dtype), value], dim=1)
+        present = (key, value) if use_cache else None
+        if self.attn_implementation == "flash_attention_2" and layer_past is None:
+            attn_output = self._flash_attention(
+                query=query,
+                key=key,
+                value=value,
+                attention_mask=attention_mask,
+                packed_metadata=packed_metadata,
+            )
+        elif self.attn_implementation == "sdpa":
+            attn_output = self._sdpa_attention(
+                query=query,
+                key=key,
+                value=value,
+                attention_mask=attention_mask,
+            )
+        else:
+            attn_output = self._eager_attention(
+                query=query,
+                key=key,
+                value=value,
+                attention_mask=attention_mask,
+            )
+        attn_output = self._merge_heads(attn_output)
+        attn_output = self.c_proj(attn_output)
+        return self.resid_dropout(attn_output), present
+class MossTTSNanoGPT2Block(nn.Module):
+    def __init__(self, config: GPT2Config, layer_idx: int, attn_implementation: str) -> None:
+        super().__init__()
+        hidden_size = int(config.hidden_size)
+        self.ln_1 = nn.LayerNorm(hidden_size, eps=config.layer_norm_epsilon)
+        self.attn = MossTTSNanoGPT2Attention(config, layer_idx=layer_idx, attn_implementation=attn_implementation)
+        self.ln_2 = nn.LayerNorm(hidden_size, eps=config.layer_norm_epsilon)
+        self.mlp = MossTTSNanoGPT2MLP(config)
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        packed_metadata: Optional[PackedSequenceMetadata] = None,
+        layer_past: Optional[tuple[torch.Tensor, torch.Tensor]] = None,
+        use_cache: bool = False,
+    ) -> tuple[torch.Tensor, Optional[tuple[torch.Tensor, torch.Tensor]]]:
+        attn_output, present = self.attn(
+            self.ln_1(hidden_states),
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            packed_metadata=packed_metadata,
+            layer_past=layer_past,
+            use_cache=use_cache,
+        )
+        hidden_states = hidden_states + attn_output
+        hidden_states = hidden_states + self.mlp(self.ln_2(hidden_states))
+        return hidden_states, present
+class MossTTSNanoGPT2Model(nn.Module):
+    def __init__(self, config: GPT2Config, attn_implementation: str = "eager") -> None:
+        super().__init__()
+        self.config = config
+        self.attn_implementation = attn_implementation
+        self.position_embedding_type = str(getattr(config, "position_embedding_type", "absolute")).lower()
+        if self.position_embedding_type not in {"absolute", "rope"}:
+            raise ValueError(f"Unsupported position_embedding_type={self.position_embedding_type!r}")
+        hidden_size = int(config.hidden_size)
+        self.wte = nn.Embedding(config.vocab_size, hidden_size)
+        self.wpe = nn.Embedding(config.n_positions, hidden_size) if self.position_embedding_type == "absolute" else nn.Identity()
+        self.drop = nn.Dropout(config.embd_pdrop)
+        self.h = nn.ModuleList(
+            [MossTTSNanoGPT2Block(config, layer_idx=index, attn_implementation=attn_implementation) for index in range(config.n_layer)]
+        )
+        self.ln_f = nn.LayerNorm(hidden_size, eps=config.layer_norm_epsilon)
+        self.gradient_checkpointing = False
+        self._reset_parameters()
+    def _reset_parameters(self) -> None:
+        init_std = float(self.config.initializer_range)
+        for module in self.modules():
+            if isinstance(module, nn.Linear):
+                nn.init.normal_(module.weight, mean=0.0, std=init_std)
+                if module.bias is not None:
+                    nn.init.zeros_(module.bias)
+            elif isinstance(module, nn.Embedding):
+                nn.init.normal_(module.weight, mean=0.0, std=init_std)
+            elif isinstance(module, nn.LayerNorm):
+                nn.init.ones_(module.weight)
+                nn.init.zeros_(module.bias)
+    @staticmethod
+    def _normalize_num_sequences(
+        cu_seqlens: torch.Tensor,
+        num_sequences: Optional[torch.Tensor],
+        device: torch.device,
+    ) -> torch.Tensor:
+        if cu_seqlens.ndim == 1:
+            cu_seqlens = cu_seqlens.unsqueeze(0)
+        if num_sequences is None:
+            counts = []
+            for boundary in cu_seqlens:
+                diffs = boundary[1:] - boundary[:-1]
+                counts.append(int((diffs > 0).sum().item()))
+            return torch.tensor(counts, dtype=torch.int32, device=device)
+        if num_sequences.ndim == 0:
+            return num_sequences.unsqueeze(0)
+        return num_sequences
+    @staticmethod
+    def build_packed_position_ids(
+        attention_mask: Optional[torch.Tensor],
+        cu_seqlens: torch.Tensor,
+        num_sequences: Optional[torch.Tensor],
+    ) -> torch.Tensor:
+        if cu_seqlens.ndim == 1:
+            cu_seqlens = cu_seqlens.unsqueeze(0)
+        batch_size, seq_len = cu_seqlens.shape[0], cu_seqlens.shape[1] - 1
+        device = cu_seqlens.device
+        position_ids = torch.zeros((batch_size, seq_len), dtype=torch.long, device=device)
+        counts = MossTTSNanoGPT2Model._normalize_num_sequences(cu_seqlens, num_sequences, device=device)
+        for batch_index in range(batch_size):
+            sequence_count = int(counts[batch_index].item())
+            boundaries = cu_seqlens[batch_index, : sequence_count + 1].tolist()
+            for start, end in zip(boundaries[:-1], boundaries[1:]):
+                start = int(start)
+                end = int(end)
+                if end > start:
+                    position_ids[batch_index, start:end] = torch.arange(end - start, device=device)
+        if attention_mask is not None:
+            position_ids = position_ids * attention_mask.to(dtype=position_ids.dtype)
+        return position_ids
+    @staticmethod
+    def build_packed_metadata(
+        hidden_states: torch.Tensor,
+        cu_seqlens: torch.Tensor,
+        num_sequences: Optional[torch.Tensor],
+    ) -> PackedSequenceMetadata:
+        if cu_seqlens.ndim == 1:
+            cu_seqlens = cu_seqlens.unsqueeze(0)
+        device = hidden_states.device
+        counts = MossTTSNanoGPT2Model._normalize_num_sequences(cu_seqlens, num_sequences, device=device)
+        flat_indices = []
+        cumulative = [0]
+        max_seqlen = 0
+        seq_len = hidden_states.shape[1]
+        for batch_index in range(hidden_states.shape[0]):
+            sequence_count = int(counts[batch_index].item())
+            boundaries = cu_seqlens[batch_index, : sequence_count + 1].tolist()
+            for start, end in zip(boundaries[:-1], boundaries[1:]):
+                start = int(start)
+                end = int(end)
+                if end <= start:
+                    continue
+                segment_indices = batch_index * seq_len + torch.arange(start, end, device=device)
+                flat_indices.append(segment_indices)
+                cumulative.append(cumulative[-1] + (end - start))
+                max_seqlen = max(max_seqlen, end - start)
+        if not flat_indices:
+            raise ValueError("cu_seqlens did not describe any non-empty packed sequences.")
+        indices = torch.cat(flat_indices, dim=0)
+        return PackedSequenceMetadata(
+            cu_seqlens=torch.tensor(cumulative, dtype=torch.int32, device=device),
+            max_seqlen=max_seqlen,
+            indices=indices,
+            batch_size=hidden_states.shape[0],
+            seq_len=hidden_states.shape[1],
+        )
+    def forward(
+        self,
+        input_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[tuple[tuple[torch.Tensor, torch.Tensor], ...]] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        use_cache: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: bool = True,
+        cu_seqlens: Optional[torch.Tensor] = None,
+        num_sequences: Optional[torch.Tensor] = None,
+    ) -> BaseModelOutputWithPast:
+        del input_ids, output_attentions
+        if inputs_embeds is None:
+            raise ValueError("inputs_embeds must be provided.")
+        use_cache = bool(use_cache)
+        if use_cache and cu_seqlens is not None:
+            raise ValueError("use_cache=True is not supported together with cu_seqlens packing.")
+        hidden_states = inputs_embeds
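+        if attention_mask is None:
+            # Cover cached positions too, so position ids and key masks stay aligned with past_key_values.
+            past_length = past_key_values[0][0].shape[1] if past_key_values else 0
+            attention_mask = torch.ones(
+                (hidden_states.shape[0], past_length + hidden_states.shape[1]),
+                dtype=torch.bool,
+                device=hidden_states.device,
+            )
+        else:
+            attention_mask = attention_mask.to(dtype=torch.bool, device=hidden_states.device)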
+        query_attention_mask = attention_mask[:, -hidden_states.shape[1] :]
+        packed_metadata = None
+        if position_ids is None:
+            if cu_seqlens is not None:
+                position_ids = self.build_packed_position_ids(
+                    attention_mask=attention_mask,
+                    cu_seqlens=cu_seqlens.to(device=hidden_states.device),
+                    num_sequences=num_sequences.to(device=hidden_states.device) if num_sequences is not None else None,
+                )
+            elif attention_mask is not None:
+                position_ids = attention_mask.long().cumsum(dim=-1) - 1
+                position_ids = position_ids.masked_fill(~attention_mask, 0)
+                position_ids = position_ids[:, -hidden_states.shape[1] :]
+            else:
+                past_length = 0
+                if past_key_values is not None and len(past_key_values) > 0:
+                    past_length = past_key_values[0][0].shape[1]
+                position_ids = torch.arange(hidden_states.shape[1], device=hidden_states.device, dtype=torch.long)
+                position_ids = position_ids + past_length
+                position_ids = position_ids.unsqueeze(0).expand(hidden_states.shape[0], -1)
+        if cu_seqlens is not None and self.attn_implementation == "flash_attention_2":
+            packed_metadata = self.build_packed_metadata(
+                hidden_states=hidden_states,
+                cu_seqlens=cu_seqlens.to(device=hidden_states.device),
+                num_sequences=num_sequences.to(device=hidden_states.device) if num_sequences is not None else None,
+            )
+        if self.position_embedding_type == "absolute":
+            hidden_states = hidden_states + self.wpe(position_ids)
+        hidden_states = self.drop(hidden_states)
+        hidden_states = hidden_states * query_attention_mask.unsqueeze(-1).to(dtype=hidden_states.dtype)
+        all_hidden_states = () if output_hidden_states else None
+        presents = [] if use_cache else None
+        for layer_index, block in enumerate(self.h):
+            if output_hidden_states:
+                all_hidden_states = all_hidden_states + (hidden_states,)
+            if self.gradient_checkpointing and self.training:
+                if use_cache:
+                    raise ValueError("use_cache=True is not supported when gradient checkpointing is enabled during training.")
+                def custom_forward(*inputs):
+                    output, _ = block(
+                        inputs[0],
+                        attention_mask=inputs[1],
+                        position_ids=inputs[2],
+                        packed_metadata=packed_metadata,
+                        layer_past=None,
+                        use_cache=False,
+                    )
+                    return output
+                hidden_states = torch.utils.checkpoint.checkpoint(
+                    custom_forward,
+                    hidden_states,
+                    attention_mask,
+                    position_ids,
+                    use_reentrant=False,
+                )
+                present = None
+            else:
+                hidden_states, present = block(
+                    hidden_states,
+                    attention_mask=attention_mask,
+                    position_ids=position_ids,
+                    packed_metadata=packed_metadata,
+                    layer_past=None if past_key_values is None else past_key_values[layer_index],
+                    use_cache=use_cache,
+                )
+            hidden_states = hidden_states * query_attention_mask.unsqueeze(-1).to(dtype=hidden_states.dtype)
+            if presents is not None:
+                presents.append(present)
+        hidden_states = self.ln_f(hidden_states)
+        hidden_states = hidden_states * query_attention_mask.unsqueeze(-1).to(dtype=hidden_states.dtype)
+        if output_hidden_states:
+            all_hidden_states = all_hidden_states + (hidden_states,)
+        if not return_dict:
+            return (hidden_states, tuple(presents) if presents is not None else None, all_hidden_states, None)
+        return BaseModelOutputWithPast(
+            last_hidden_state=hidden_states,
+            past_key_values=tuple(presents) if presents is not None else None,
+            hidden_states=all_hidden_states,
+            attentions=None,
+        )
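
The packing helpers in `gpt2_decoder.py` (`build_packed_position_ids` and `build_packed_metadata`) can be hard to follow in tensor form. The following is a minimal sketch of the same idea using plain Python lists, not the repository's implementation. It assumes each row's boundaries are already trimmed to the valid segments (the original uses `num_sequences` for that), and the function names `packed_position_ids` and `flatten_segments` are hypothetical.

```python
from typing import List, Tuple


def packed_position_ids(boundaries: List[int], seq_len: int) -> List[int]:
    """Restart position ids at 0 for every packed segment in one row.

    `boundaries` holds cumulative segment ends, e.g. [0, 3, 5] describes two
    segments covering tokens 0..2 and 3..4. Slots outside every segment
    (padding) keep position 0.
    """
    positions = [0] * seq_len
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        if end > start:  # empty segments contribute nothing
            positions[start:end] = range(end - start)
    return positions


def flatten_segments(
    rows: List[List[int]], seq_len: int
) -> Tuple[List[int], List[int], int]:
    """Flatten a padded batch into the inputs a varlen attention kernel expects.

    Returns (flat token indices, cumulative segment lengths, longest segment).
    """
    indices: List[int] = []
    cumulative = [0]
    max_seqlen = 0
    for batch_index, boundaries in enumerate(rows):
        for start, end in zip(boundaries[:-1], boundaries[1:]):
            if end <= start:
                continue
            # Offset by the row so indices address the flattened (batch * seq_len) axis.
            indices.extend(batch_index * seq_len + t for t in range(start, end))
            cumulative.append(cumulative[-1] + (end - start))
            max_seqlen = max(max_seqlen, end - start)
    return indices, cumulative, max_seqlen


if __name__ == "__main__":
    # Row 0 packs two segments (3 and 2 tokens); row 1 holds one 4-token segment.
    rows = [[0, 3, 5], [0, 4]]
    print(packed_position_ids(rows[0], seq_len=6))
    print(flatten_segments(rows, seq_len=6))
```

Because positions restart per segment and the flattened cumulative lengths mark segment borders, causal attention inside each segment never sees tokens from a neighbouring segment packed into the same row.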

models/MOSS-TTS-Nano-100M/languages.txt ADDED

	@@ -0,0 +1,19 @@

+Chinese
+English
+German
+Spanish
+French
+Japanese
+Italian
+Hebrew
+Korean
+Russian
+Persian
+Arabic
+Polish
+Portuguese
+Czech
+Danish
+Swedish
+Hungarian
+Greek

models/MOSS-TTS-Nano-100M/modeling_moss_tts_nano.py ADDED

The diff for this file is too large to render.

models/MOSS-TTS-Nano-100M/prompting.py ADDED

	@@ -0,0 +1,92 @@

+from __future__ import annotations
+from typing import List, Sequence
+from .configuration_moss_tts_nano import MossTTSNanoConfig
+USER_ROLE_PREFIX = "user\n"
+USER_TEMPLATE_REFERENCE_PREFIX = (
+    "<user_inst>\n"
+    "- Reference(s):\n"
+)
+USER_TEMPLATE_AFTER_REFERENCE = (
+    "\n- Instruction:\nNone\n"
+    "- Tokens:\nNone\n"
+    "- Quality:\nNone\n"
+    "- Sound Event:\nNone\n"
+    "- Ambient Sound:\nNone\n"
+    "- Language:\nNone\n"
+    "- Text:\n"
+)
+USER_TEMPLATE_PREFIX = USER_TEMPLATE_REFERENCE_PREFIX + "None" + USER_TEMPLATE_AFTER_REFERENCE
+USER_TEMPLATE_SUFFIX = "\n</user_inst>"
+ASSISTANT_TURN_PREFIX = "\n"
+ASSISTANT_ROLE_PREFIX = "assistant\n"
+def encode_text(tokenizer, text: str) -> List[int]:
+    try:
+        return list(tokenizer.encode(text, add_special_tokens=False))
+    except TypeError:
+        return list(tokenizer.encode(text))
+def decode_text(tokenizer, token_ids: Sequence[int]) -> str:
+    try:
+        return str(
+            tokenizer.decode(
+                list(token_ids),
+                skip_special_tokens=False,
+                clean_up_tokenization_spaces=False,
+            )
+        )
+    except TypeError:
+        try:
+            return str(tokenizer.decode(list(token_ids), skip_special_tokens=False))
+        except TypeError:
+            return str(tokenizer.decode(list(token_ids)))
+def build_user_prompt_prefix(tokenizer, config: MossTTSNanoConfig) -> List[int]:
+    return [config.im_start_token_id] + encode_text(tokenizer, USER_ROLE_PREFIX) + encode_text(
+        tokenizer,
+        USER_TEMPLATE_REFERENCE_PREFIX,
+    )
+def build_user_prompt_after_reference(tokenizer) -> List[int]:
+    return encode_text(tokenizer, USER_TEMPLATE_AFTER_REFERENCE)
+def build_assistant_prompt_prefix(tokenizer, config: MossTTSNanoConfig) -> List[int]:
+    return encode_text(tokenizer, USER_TEMPLATE_SUFFIX) + [config.im_end_token_id] + encode_text(
+        tokenizer,
+        ASSISTANT_TURN_PREFIX,
+    ) + [config.im_start_token_id] + encode_text(
+        tokenizer,
+        ASSISTANT_ROLE_PREFIX,
+    )
+def build_prompt_prefix(tokenizer, config: MossTTSNanoConfig) -> List[int]:
+    return (
+        build_user_prompt_prefix(tokenizer, config)
+        + encode_text(tokenizer, "None")
+        + build_user_prompt_after_reference(tokenizer)
+    )
+def build_prompt_suffix(tokenizer, config: MossTTSNanoConfig) -> List[int]:
+    return build_assistant_prompt_prefix(tokenizer, config)
+def build_prompt_token_ids(
+    tokenizer,
+    config: MossTTSNanoConfig,
+    text_token_ids: Sequence[int],
+) -> List[int]:
+    return build_prompt_prefix(tokenizer, config) + [int(token_id) for token_id in text_token_ids] + build_prompt_suffix(
+        tokenizer,
+        config,
+    )
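
To make the prompt layout assembled by `build_prompt_token_ids` easier to see, here is a small sketch that rebuilds it with a toy character-level tokenizer. The template strings are copied from `prompting.py`; the token ids `IM_START` and `IM_END`, the `[IM_START]`/`[IM_END]` display markers, and the `encode`/`render` helpers are assumptions for illustration only (the real ids come from `MossTTSNanoConfig`, and the real tokenizer is SentencePiece-based).

```python
from typing import List

# Hypothetical ids, chosen above the ASCII range so they cannot collide with ord().
IM_START, IM_END = 1000, 1001

USER_ROLE_PREFIX = "user\n"
REFERENCE_PREFIX = "<user_inst>\n- Reference(s):\n"
AFTER_REFERENCE = (
    "\n- Instruction:\nNone\n"
    "- Tokens:\nNone\n"
    "- Quality:\nNone\n"
    "- Sound Event:\nNone\n"
    "- Ambient Sound:\nNone\n"
    "- Language:\nNone\n"
    "- Text:\n"
)
USER_SUFFIX = "\n</user_inst>"


def encode(text: str) -> List[int]:
    """Toy tokenizer: one id per character."""
    return [ord(ch) for ch in text]


def build_prompt(text_ids: List[int]) -> List[int]:
    """Same ordering as the diff: user turn, text, then the assistant header."""
    prefix = (
        [IM_START]
        + encode(USER_ROLE_PREFIX)
        + encode(REFERENCE_PREFIX)
        + encode("None")  # no reference audio supplied
        + encode(AFTER_REFERENCE)
    )
    suffix = (
        encode(USER_SUFFIX)
        + [IM_END]
        + encode("\n")
        + [IM_START]
        + encode("assistant\n")
    )
    return prefix + list(text_ids) + suffix


def render(ids: List[int]) -> str:
    """Turn ids back into text, showing the two control tokens as markers."""
    markers = {IM_START: "[IM_START]", IM_END: "[IM_END]"}
    return "".join(markers.get(i, chr(i)) if i in markers else chr(i) for i in ids)


if __name__ == "__main__":
    print(render(build_prompt(encode("Hello"))))
```

The rendered prompt ends with an open assistant turn, which is where generation would continue.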

models/MOSS-TTS-Nano-100M/pytorch_model.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:24003f2f11ac8a2cbf70514db2d8f1c02fb451aa6b3c0bffc9da09f31cd7caa5
+size 234693095

models/MOSS-TTS-Nano-100M/source.txt ADDED

	@@ -0,0 +1 @@


+https://huggingface.co/wittin/MOSS-TTS-Nano-100M

models/MOSS-TTS-Nano-100M/special_tokens_map.json ADDED

	@@ -0,0 +1,30 @@

+{
+  "bos_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<pad>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

models/MOSS-TTS-Nano-100M/tokenization_moss_tts_nano.py ADDED

	@@ -0,0 +1,106 @@

+from __future__ import annotations
+import shutil
+from pathlib import Path
+from typing import Any
+import sentencepiece as spm
+from transformers import PreTrainedTokenizer
+VOCAB_FILES_NAMES = {"vocab_file": "tokenizer.model"}
+class MossTTSNanoSentencePieceTokenizer(PreTrainedTokenizer):
+    vocab_files_names = VOCAB_FILES_NAMES
+    model_input_names = ["input_ids", "attention_mask"]
+    def __init__(
+        self,
+        vocab_file: str,
+        unk_token: str = "<unk>",
+        bos_token: str = "<s>",
+        eos_token: str = "</s>",
+        pad_token: str = "<pad>",
+        sp_model_kwargs: dict[str, Any] | None = None,
+        **kwargs,
+    ) -> None:
+        self.vocab_file = str(vocab_file)
+        self.sp_model_kwargs = {} if sp_model_kwargs is None else dict(sp_model_kwargs)
+        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
+        self.sp_model.Load(self.vocab_file)
+        super().__init__(
+            unk_token=unk_token,
+            bos_token=bos_token,
+            eos_token=eos_token,
+            pad_token=pad_token,
+            **kwargs,
+        )
+    @property
+    def vocab_size(self) -> int:
+        return int(self.sp_model.get_piece_size())
+    def get_vocab(self) -> dict[str, int]:
+        vocab = {self.sp_model.id_to_piece(i): i for i in range(self.vocab_size)}
+        vocab.update(self.added_tokens_encoder)
+        return vocab
+    def _tokenize(self, text: str) -> list[str]:
+        return list(self.sp_model.encode(text, out_type=str))
+    def _convert_token_to_id(self, token: str) -> int:
+        token_id = int(self.sp_model.piece_to_id(token))
+        return token_id
+    def _convert_id_to_token(self, index: int) -> str:
+        return str(self.sp_model.id_to_piece(int(index)))
+    def convert_tokens_to_string(self, tokens: list[str]) -> str:
+        return str(self.sp_model.decode(tokens))
+    def save_vocabulary(self, save_directory: str, filename_prefix: str | None = None) -> tuple[str]:
+        save_dir = Path(save_directory)
+        save_dir.mkdir(parents=True, exist_ok=True)
+        out_name = "tokenizer.model" if filename_prefix is None else f"{filename_prefix}-tokenizer.model"
+        out_path = save_dir / out_name
+        if Path(self.vocab_file).resolve() != out_path.resolve():
+            shutil.copyfile(self.vocab_file, out_path)
+        return (str(out_path),)
+    def build_inputs_with_special_tokens(
+        self,
+        token_ids_0: list[int],
+        token_ids_1: list[int] | None = None,
+    ) -> list[int]:
+        if token_ids_1 is None:
+            return list(token_ids_0)
+        return list(token_ids_0) + list(token_ids_1)
+    def get_special_tokens_mask(
+        self,
+        token_ids_0: list[int],
+        token_ids_1: list[int] | None = None,
+        already_has_special_tokens: bool = False,
+    ) -> list[int]:
+        if already_has_special_tokens:
+            return super().get_special_tokens_mask(
+                token_ids_0=token_ids_0,
+                token_ids_1=token_ids_1,
+                already_has_special_tokens=True,
+            )
+        if token_ids_1 is None:
+            return [0] * len(token_ids_0)
+        return [0] * (len(token_ids_0) + len(token_ids_1))
+    def create_token_type_ids_from_sequences(
+        self,
+        token_ids_0: list[int],
+        token_ids_1: list[int] | None = None,
+    ) -> list[int]:
+        if token_ids_1 is None:
+            return [0] * len(token_ids_0)
+        return [0] * (len(token_ids_0) + len(token_ids_1))
+__all__ = ["MossTTSNanoSentencePieceTokenizer"]
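
The three sequence helpers in this class add no BOS/EOS markers: a single input passes through unchanged, a pair is concatenated, and both the special-token mask and the token type IDs come back as all zeros. A minimal standalone sketch of that behavior follows. The free functions `build_inputs` and `special_tokens_mask` are hypothetical names that mirror only the method logic shown in the diff (the default branch, not the `already_has_special_tokens=True` delegation), so the sketch runs without `sentencepiece` or `transformers`.

```python
from __future__ import annotations


def build_inputs(ids_0: list[int], ids_1: list[int] | None = None) -> list[int]:
    """Mirror of build_inputs_with_special_tokens: plain concatenation, no BOS/EOS."""
    if ids_1 is None:
        return list(ids_0)
    return list(ids_0) + list(ids_1)


def special_tokens_mask(ids_0: list[int], ids_1: list[int] | None = None) -> list[int]:
    """Mirror of the default branch of get_special_tokens_mask: nothing is flagged."""
    return [0] * len(build_inputs(ids_0, ids_1))


# Hypothetical token IDs, chosen only for illustration.
first, second = [11, 12, 13], [21, 22]
pair = build_inputs(first, second)
mask = special_tokens_mask(first, second)
print(pair)
print(mask)
```

Since nothing is inserted automatically, a caller that needs `<s>` or `</s>` in the sequence would presumably have to add those IDs itself.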

models/MOSS-TTS-Nano-100M/tokenizer.model ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c353ee1479b536bf414c1b247f5542b6607fb8ae91320e5af1781fee200fddff
+size 470897
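
The three added lines are a Git LFS pointer, not the SentencePiece model itself: they record the spec version, the SHA-256 object ID, and the byte size of the real file. A small sketch of reading those fields, assuming the simple `key value` line layout shown above (`parse_lfs_pointer` is a hypothetical helper, not part of any library):

```python
def parse_lfs_pointer(text: str) -> dict[str, str]:
    """Split each non-empty 'key value' line of a Git LFS pointer into a dict entry."""
    fields = {}
    for line in text.splitlines():
        if line.strip():
            key, _, value = line.partition(" ")
            fields[key] = value
    return fields


# Pointer contents copied from the diff above.
pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:c353ee1479b536bf414c1b247f5542b6607fb8ae91320e5af1781fee200fddff
size 470897
"""

pointer = parse_lfs_pointer(pointer_text)
algo, _, digest = pointer["oid"].partition(":")  # hash algorithm and hex digest
size_bytes = int(pointer["size"])                # size of the real tokenizer.model
print(algo, len(digest), size_bytes)
```

A downloaded `tokenizer.model` could be checked against these values by comparing its length and its `hashlib.sha256` digest.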

models/MOSS-TTS-Nano-100M/tokenizer_config.json ADDED

	@@ -0,0 +1,52 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<unk>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "<pad>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "additional_special_tokens": [],
+  "auto_map": {
+    "AutoTokenizer": [
+      "tokenization_moss_tts_nano.MossTTSNanoSentencePieceTokenizer",
+      null
+    ]
+  },
+  "backend": "custom",
+  "bos_token": "<s>",
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "</s>",
+  "extra_special_tokens": {},
+  "model_max_length": 16384,
+  "pad_token": "<pad>",
+  "tokenizer_class": "MossTTSNanoSentencePieceTokenizer",
+  "unk_token": "<unk>"
+}
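
The `auto_map` entry is what ties this config to the custom class in `tokenization_moss_tts_nano.py`: the first slot names the slow tokenizer as `module.ClassName`, and the `null` second slot means no fast tokenizer is provided. A short sketch of how that reference splits apart, using only a fragment of the config above and the standard `json` module (the variable names are illustrative):

```python
import json

# Fragment of tokenizer_config.json, copied from the diff above.
config_fragment = json.loads("""
{
  "auto_map": {
    "AutoTokenizer": [
      "tokenization_moss_tts_nano.MossTTSNanoSentencePieceTokenizer",
      null
    ]
  },
  "tokenizer_class": "MossTTSNanoSentencePieceTokenizer",
  "model_max_length": 16384
}
""")

# The pair is (slow tokenizer reference, fast tokenizer reference).
slow_ref, fast_ref = config_fragment["auto_map"]["AutoTokenizer"]
module_name, _, class_name = slow_ref.rpartition(".")
print(module_name, class_name, fast_ref)
```

Because the class lives in a module shipped with the repository rather than in `transformers` itself, loading it through `AutoTokenizer.from_pretrained` would be expected to require `trust_remote_code=True`.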