---
license: apache-2.0
tags:
- text-to-audio
---
# MOSS-TTS Family

<br>

<p align="center">
&nbsp;&nbsp;&nbsp;&nbsp;
<img src="https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_imgaes_demo/openmoss_x_mosi" height="50" align="middle" />
</p>

<div align="center">
<a href="https://github.com/OpenMOSS/MOSS-TTS/tree/main"><img src="https://img.shields.io/badge/Project%20Page-GitHub-blue"></a>
<a href="https://modelscope.cn/collections/OpenMOSS-Team/MOSS-TTS"><img src="https://img.shields.io/badge/ModelScope-Models-lightgrey?logo=modelscope&amp"></a>
<a href="https://mosi.cn/#models"><img src="https://img.shields.io/badge/Blog-View-blue?logo=internet-explorer&amp"></a>
<a href="https://github.com/OpenMOSS/MOSS-TTS"><img src="https://img.shields.io/badge/Arxiv-Coming%20soon-red?logo=arxiv&amp"></a>

<a href="https://studio.mosi.cn"><img src="https://img.shields.io/badge/AIStudio-Try-green?logo=internet-explorer&amp"></a>
<a href="https://studio.mosi.cn/docs/moss-tts"><img src="https://img.shields.io/badge/API-Docs-00A3FF?logo=fastapi&amp"></a>
<a href="https://x.com/Open_MOSS"><img src="https://img.shields.io/badge/Twitter-Follow-black?logo=x&amp"></a>
<a href="https://discord.gg/fvm5TaWjU3"><img src="https://img.shields.io/badge/Discord-Join-5865F2?logo=discord&amp"></a>
</div>

## Overview

MOSS‑TTS Family is an open‑source **speech and sound generation model family** from [MOSI.AI](https://mosi.cn/#hero) and the [OpenMOSS team](https://www.open-moss.com/). It is designed for **high‑fidelity**, **high‑expressiveness**, and **complex real‑world scenarios**, covering stable long‑form speech, multi‑speaker dialogue, voice/character design, environmental sound effects, and real‑time streaming TTS.

## Introduction

<p align="center">
<img src="https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_imgaes_demo/moss_tts_family_arch.jpeg" width="85%" />
</p>

When a single piece of audio needs to **sound like a real person**, **pronounce every word accurately**, **switch speaking styles across content**, **remain stable over tens of minutes**, and **support dialogue, role‑play, and real‑time interaction**, a single TTS model is often not enough. The **MOSS‑TTS Family** breaks the workflow into five production‑ready models that can be used independently or composed into a complete pipeline.

- **MOSS‑TTS**: The flagship production TTS foundation model, centered on high-fidelity zero-shot voice cloning with controllable long-form synthesis, pronunciation, and multilingual/code-switched speech. It serves as the core engine for scalable narration, dubbing, and voice-driven products.
- **MOSS‑TTSD**: A production long-form dialogue model for expressive multi-speaker conversational audio at scale. It supports long-duration continuity, turn-taking control, and zero-shot voice cloning from short references for podcasts, audiobooks, commentary, dubbing, and entertainment dialogue.
- **MOSS‑VoiceGenerator**: An open-source voice design model that creates speaker timbres directly from free-form text, without reference audio. It unifies timbre design, style control, and content synthesis, and can be used standalone or as a voice-design layer for downstream TTS.
- **MOSS‑SoundEffect**: A high-fidelity text-to-sound model with broad category coverage and controllable duration for real content production. It generates stable audio from prompts across ambience, urban scenes, creatures, human actions, and music-like clips for film, games, interactive media, and data synthesis.
- **MOSS‑TTS‑Realtime**: A context-aware, multi-turn streaming TTS model for real-time voice agents. By conditioning on dialogue history across both text and prior user acoustics, it delivers low-latency synthesis with coherent, consistent voice responses across turns.

## Released Models

| Model | Architecture | Size | Model Card | Hugging Face |
|---|---|---:|---|---|
| **MOSS-TTS** | MossTTSDelay | 8B | [moss_tts_model_card.md](https://github.com/OpenMOSS/MOSS-TTS/blob/main/docs/moss_tts_model_card.md) | 🤗 [Hugging Face](https://huggingface.co/OpenMOSS-Team/MOSS-TTS) |
| | MossTTSLocal | 1.7B | [moss_tts_model_card.md](https://github.com/OpenMOSS/MOSS-TTS/blob/main/docs/moss_tts_model_card.md) | 🤗 [Hugging Face](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Local-Transformer) |
| **MOSS‑TTSD‑V1.0** | MossTTSDelay | 8B | [moss_ttsd_model_card.md](https://github.com/OpenMOSS/MOSS-TTS/blob/main/docs/moss_ttsd_model_card.md) | 🤗 [Hugging Face](https://huggingface.co/OpenMOSS-Team/MOSS-TTSD-v1.0) |
| **MOSS‑VoiceGenerator** | MossTTSDelay | 1.7B | [moss_voice_generator_model_card.md](https://github.com/OpenMOSS/MOSS-TTS/blob/main/docs/moss_voice_generator_model_card.md) | 🤗 [Hugging Face](https://huggingface.co/OpenMOSS-Team/MOSS-Voice-Generator) |
| **MOSS‑SoundEffect** | MossTTSDelay | 8B | [moss_sound_effect_model_card.md](https://github.com/OpenMOSS/MOSS-TTS/blob/main/docs/moss_sound_effect_model_card.md) | 🤗 [Hugging Face](https://huggingface.co/OpenMOSS-Team/MOSS-SoundEffect) |
| **MOSS‑TTS‑Realtime** | MossTTSRealtime | 1.7B | [moss_tts_realtime_model_card.md](https://github.com/OpenMOSS/MOSS-TTS/blob/main/docs/moss_tts_realtime_model_card.md) | 🤗 [Hugging Face](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Realtime) |

# MOSS-SoundEffect

**MOSS-SoundEffect** is the **environment sound & sound effect generation model** in the **MOSS‑TTS Family**. It generates ambient soundscapes and concrete sound effects directly from text descriptions, and is designed to complement speech content with immersive context in production workflows.

## 1. Overview

### 1.1 TTS Family Positioning

MOSS-SoundEffect is designed as an audio generation backbone for creating high-fidelity environmental and action sounds from text, serving both scalable content pipelines and as a strong research baseline for controllable audio generation.

**Design goals**

- **Coverage & richness**: broad sound taxonomy with layered ambience and realistic texture
- **Composability**: easy integration into creative pipelines (games/film/tools) and synthetic data generation setups

### 1.2 Key Capabilities

MOSS‑SoundEffect focuses on **contextual audio completion** beyond speech, enabling creators and systems to enrich scenes with believable acoustic environments and action‑level cues.

**What it can generate**

- **Natural environments**: e.g., “fresh snow crunching under footsteps.”
- **Urban environments**: e.g., “a sports car roaring past on the highway.”
- **Animals & creatures**: e.g., “early morning park with birds chirping in a quiet atmosphere.”
- **Human actions**: e.g., “clear footsteps echoing on concrete at a steady rhythm.”

**Why it matters**

- Completes **scene immersion** for narrative content, film/TV, documentaries, games, and podcasts.
- Supports **voice agents** and interactive systems that need ambient context, not just speech.
- Acts as the **sound‑design layer** of the MOSS‑TTS Family’s end‑to‑end workflow.

### 1.3 Model Architecture

**MOSS-SoundEffect** employs the **MossTTSDelay** architecture (see [moss_tts_delay/README.md](https://github.com/OpenMOSS/MOSS-TTS/blob/main/moss_tts_delay/README.md)), reusing the same discrete token generation backbone for audio synthesis. A text prompt (optionally with simple control tags such as **duration**) is tokenized and fed into the delay-pattern autoregressive model to predict **RVQ audio tokens** over time. The generated tokens are then decoded by the audio tokenizer/vocoder to produce high-fidelity sound effects, enabling consistent quality and controllable length across diverse SFX categories.
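The delay pattern can be illustrated with a small sketch: each RVQ codebook stream is shifted one step further than the previous one, so the model predicts a frame's coarse tokens before its finer ones. The offsets and padding value here are illustrative assumptions; MossTTSDelay's exact layout is defined in the linked README.

```python
def apply_delay_pattern(codes, pad_id=0):
    """Shift codebook k right by k steps (illustrative delay pattern).

    codes: list of num_codebooks token lists, each of length T.
    Returns num_codebooks lists of length T + num_codebooks - 1.
    """
    num_q, T = len(codes), len(codes[0])
    out = [[pad_id] * (T + num_q - 1) for _ in range(num_q)]
    for k, stream in enumerate(codes):
        for t, tok in enumerate(stream):
            out[k][t + k] = tok  # codebook k lags k steps behind codebook 0
    return out

# Two codebooks, three frames: the finer codebook starts one step later.
print(apply_delay_pattern([[1, 2, 3], [4, 5, 6]]))
# [[1, 2, 3, 0], [0, 4, 5, 6]]
```

Because each row only depends on earlier positions of the rows above it, all codebooks for a given output step can be sampled in a single autoregressive pass.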

### 1.4 Released Models

**Recommended decoding hyperparameters**

| Model | audio_temperature | audio_top_p | audio_top_k | audio_repetition_penalty |
|---|---:|---:|---:|---:|
| **MOSS-SoundEffect** | 1.5 | 0.6 | 50 | 1.2 |
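A convenient way to reuse these settings is to keep them in one place. The keyword names below simply mirror the table's column headers; whether `model.generate` accepts them directly depends on the MOSS-TTS implementation, so verify against the repository before relying on this sketch.

```python
# Recommended decoding hyperparameters from the table above.
# Assumption: these names are passed to generation as keyword arguments;
# check the MOSS-TTS repo for the authoritative interface.
SOUND_EFFECT_DECODING = {
    "audio_temperature": 1.5,
    "audio_top_p": 0.6,
    "audio_top_k": 50,
    "audio_repetition_penalty": 1.2,
}

# e.g. model.generate(input_ids=..., attention_mask=..., **SOUND_EFFECT_DECODING)
```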

## 2. Quick Start

### Environment Setup

We recommend a clean, isolated Python environment with **Transformers 5.0.0** to avoid dependency conflicts.

```bash
conda create -n moss-tts python=3.12 -y
conda activate moss-tts
```

Install all required dependencies:

```bash
git clone https://github.com/OpenMOSS/MOSS-TTS.git
cd MOSS-TTS
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e .
```

#### (Optional) Install FlashAttention 2

For better speed and lower GPU memory usage, you can install FlashAttention 2 if your hardware supports it.

```bash
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[flash-attn]"
```

If your machine has limited RAM and many CPU cores, you can cap build parallelism:

```bash
MAX_JOBS=4 pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[flash-attn]"
```

Notes:
- Dependencies are managed in `pyproject.toml`, which currently pins `torch==2.9.1+cu128` and `torchaudio==2.9.1+cu128`.
- If FlashAttention 2 fails to build on your machine, you can skip it and use the default attention backend.
- FlashAttention 2 is only available on supported GPUs and is typically used with `torch.float16` or `torch.bfloat16`.
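Given these notes, a small runtime check can pick the attention backend automatically: prefer FlashAttention 2 when the `flash_attn` package is importable, otherwise fall back to SDPA. The `"flash_attention_2"` / `"sdpa"` strings are standard Transformers `attn_implementation` values; the helper itself is just a sketch.

```python
import importlib.util

def pick_attn_implementation():
    # Use FlashAttention 2 only when the flash_attn package is installed;
    # otherwise fall back to PyTorch's SDPA backend.
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"

attn_impl = pick_attn_implementation()
# pass as attn_implementation=attn_impl to AutoModel.from_pretrained(...)
```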

### Basic Usage

```python
from pathlib import Path

import torch
import torchaudio
from transformers import AutoModel, AutoProcessor

# Disable the broken cuDNN SDPA backend
torch.backends.cuda.enable_cudnn_sdp(False)
# Keep these enabled as fallbacks
torch.backends.cuda.enable_flash_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(True)
torch.backends.cuda.enable_math_sdp(True)

pretrained_model_name_or_path = "OpenMOSS-Team/MOSS-SoundEffect"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32

processor = AutoProcessor.from_pretrained(
    pretrained_model_name_or_path,
    trust_remote_code=True,
)
processor.audio_tokenizer = processor.audio_tokenizer.to(device)

text_1 = "雷声隆隆,雨声淅沥。"  # "Thunder rumbles; rain patters."
text_2 = "清晰脚步声在水泥地面回响,节奏稳定。"  # "Clear footsteps echo on concrete at a steady rhythm."

conversations = [
    [processor.build_user_message(ambient_sound=text_1)],
    [processor.build_user_message(ambient_sound=text_2)],
]

model = AutoModel.from_pretrained(
    pretrained_model_name_or_path,
    trust_remote_code=True,
    attn_implementation="sdpa",
    torch_dtype=dtype,
).to(device)
model.eval()

batch_size = 1

save_dir = Path("inference_root")
save_dir.mkdir(exist_ok=True, parents=True)
sample_idx = 0
with torch.no_grad():
    for start in range(0, len(conversations), batch_size):
        batch_conversations = conversations[start : start + batch_size]
        batch = processor(batch_conversations, mode="generation")
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)

        outputs = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_new_tokens=4096,
        )

        for message in processor.decode(outputs):
            audio = message.audio_codes_list[0]
            out_path = save_dir / f"sample{sample_idx}.wav"
            sample_idx += 1
            torchaudio.save(out_path, audio.unsqueeze(0), processor.model_config.sampling_rate)
```

### Input Types

**UserMessage**

| Field | Type | Required | Description |
|---|---|---:|---|
| `ambient_sound` | `str` | Yes | Description of the environment sound or sound effect to generate |
| `tokens` | `int` | No | Expected number of audio tokens. **1 s ≈ 12.5 tokens**. |
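Since 1 s ≈ 12.5 tokens, a target duration can be converted into a token budget with a one-line helper (the helper name is hypothetical, shown for illustration):

```python
def duration_to_tokens(seconds, tokens_per_second=12.5):
    # 1 second of audio corresponds to roughly 12.5 audio tokens.
    return round(seconds * tokens_per_second)

print(duration_to_tokens(8))  # 100
```

The result can then be supplied as the `tokens` field, e.g. `processor.build_user_message(ambient_sound="...", tokens=duration_to_tokens(8))`.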