---
language:
- en
- zh
license: apache-2.0
library_name: diffusers
pipeline_tag: text-to-audio
base_model:
- Qwen/Qwen3-1.7B
tags:
- text-to-audio
- diffusion
- flow-matching
- sound-effects
- audio-generation
---

# MOSS-SoundEffect-V2.0

<p align="center">
  <img src="https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_imgaes_demo/openmoss_x_mosi" height="50" align="middle" />
</p>

<div align="center">
  <a href="https://github.com/OpenMOSS/MOSS-TTS/tree/main/moss_soundeffect_v2"><img src="https://img.shields.io/badge/Project%20Page-GitHub-blue"></a>
</div>

**MOSS-SoundEffect v2.0** is a text-to-audio model with a Diffusion Transformer (DiT) backbone trained with the Flow Matching objective, paired with a DAC VAE and a Qwen3 text encoder. It generates high-fidelity environmental, urban, creature, and human-action sound effects from natural-language prompts, with controllable duration up to 30 seconds at 48 kHz.

## 1. Overview

### 1.1 TTS Family Positioning

Within the MOSS-TTS Family, MOSS-SoundEffect is the dedicated **text-to-sound** model: it turns natural-language captions into non-speech audio (ambience, urban scenes, creatures, human actions, short music-like clips). v2.0 replaces the discrete-token autoregressive backbone of v1 (`MossTTSDelay`) with a continuous-latent **Diffusion Transformer + Flow Matching** design.

### 1.2 Key Capabilities

- **Broad SFX coverage**: natural environments, urban environments, animals & creatures, human actions, and short musical/percussive clips.
- **Long-form generation**: stable audio up to **30 seconds** per call. The model was trained with a duration tag prepended to each prompt, which is what makes the output length controllable.
- **Bilingual prompts**: trained with both **English and Chinese** captions.

### 1.3 Released Models

| Model | Architecture | DiT Variant | Parameters |
|---|---|---|---:|
| **MOSS-SoundEffect-V2.0** | DiT + Flow Matching | `1.3B` | 1.3B |

**Recommended inference hyperparameters**

| Parameter | Default | Description |
|---|---:|---|
| `num_inference_steps` | 100 | Number of flow-match solver steps. |
| `cfg_scale` | 4.0 | Classifier-free guidance weight. |
| `sigma_shift` | 5.0 | Flow-match scheduler shift applied per call. |
| `seconds` | 10.0 | Output duration in seconds (maximum 30). |
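
To make the roles of `num_inference_steps`, `cfg_scale`, and `sigma_shift` more concrete, the following is a minimal sketch of a flow-matching Euler sampler on a scalar toy problem. It assumes the commonly used shift formula `shift * s / (1 + (shift - 1) * s)` and the standard classifier-free guidance combination. The function names (`shifted_sigmas`, `cfg_combine`, `euler_sample`) and the toy velocity function are hypothetical and are not taken from the MOSS-SoundEffect codebase, whose scheduler may differ in detail.

```python
from typing import Callable, List


def shifted_sigmas(num_steps: int, shift: float) -> List[float]:
    """Return num_steps + 1 noise levels from 1.0 down to 0.0, warped by `shift`.

    Assumed formula (not confirmed for this model): a shift > 1 spends more
    of the solver steps at high noise levels.
    """
    sigmas = []
    for i in range(num_steps + 1):
        s = 1.0 - i / num_steps  # linear schedule: 1.0 -> 0.0
        sigmas.append(shift * s / (1.0 + (shift - 1.0) * s))
    return sigmas


def cfg_combine(v_cond: float, v_uncond: float, cfg_scale: float) -> float:
    """Classifier-free guidance: extrapolate from the unconditional prediction."""
    return v_uncond + cfg_scale * (v_cond - v_uncond)


def euler_sample(
    x: float,
    velocity_fn: Callable[[float, float, bool], float],
    sigmas: List[float],
    cfg_scale: float,
) -> float:
    """Integrate dx/dsigma = v from sigma = 1 (noise) to sigma = 0 (data)."""
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        v = cfg_combine(
            velocity_fn(x, s_cur, True),   # prompt-conditioned prediction
            velocity_fn(x, s_cur, False),  # unconditional prediction
            cfg_scale,
        )
        x = x + (s_next - s_cur) * v
    return x


# Toy setup: x_sigma = (1 - sigma) * data + sigma * noise, so v = noise - data.
data, noise = 3.0, 1.0
toy_velocity = lambda x, sigma, conditional: noise - data

sigmas = shifted_sigmas(num_steps=100, shift=5.0)
sample = euler_sample(noise, toy_velocity, sigmas, cfg_scale=4.0)
```

In this toy case the conditional and unconditional velocities coincide, so guidance has no effect and the sampler recovers `data` up to floating-point error. In a real model the two predictions differ and `cfg_scale` controls how strongly the prompt steers the result.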

## 2. Quick Start

### Environment Setup

We recommend a clean, isolated Python 3.12 environment to avoid dependency conflicts with the top-level MOSS-TTS environment.

```bash
conda create -n moss-soundeffect-v2 python=3.12 -y
conda activate moss-soundeffect-v2

git clone https://github.com/OpenMOSS/MOSS-TTS.git
cd MOSS-TTS/moss_soundeffect_v2
pip install --extra-index-url https://download.pytorch.org/whl/cu128 \
    -e ".[torch-cu128,finetune]"
```

For a minimal **inference-only** install (still ships the Gradio demo; skips the fine-tuning extras):

```bash
pip install --extra-index-url https://download.pytorch.org/whl/cu128 \
    -e ".[torch-cu128]"
```

### Basic Usage

```python
import torch
from moss_soundeffect_v2 import MossSoundEffectPipeline

pipe = MossSoundEffectPipeline.from_pretrained(
    "OpenMOSS-Team/MOSS-SoundEffect-v2.0",   # this repo, or a local dir
    torch_dtype=torch.bfloat16,
    device="cuda",
)

audio = pipe(
    prompt="A dog barking loudly in a park.",
    seconds=10,
    num_inference_steps=100,
    cfg_scale=4.0,
)                                            # (B, C, T) waveform tensor
pipe.save_audio(audio, "out.wav")
```
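
Since the model card caps `seconds` at 30, it can help to validate call arguments before a long generation starts. The helper below is a hypothetical sketch: `build_generation_kwargs` and `MAX_SECONDS` are not part of `moss_soundeffect_v2`, and passing `sigma_shift` as a call argument is an assumption based on the hyperparameter table above.

```python
MAX_SECONDS = 30.0  # upper bound stated in this model card


def build_generation_kwargs(
    prompt: str,
    seconds: float = 10.0,
    num_inference_steps: int = 100,
    cfg_scale: float = 4.0,
    sigma_shift: float = 5.0,
) -> dict:
    """Validate arguments and return them as a dict (hypothetical helper)."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    if not 0 < seconds <= MAX_SECONDS:
        raise ValueError(f"seconds must be in (0, {MAX_SECONDS}], got {seconds}")
    if num_inference_steps < 1:
        raise ValueError("num_inference_steps must be at least 1")
    return {
        "prompt": prompt,
        "seconds": seconds,
        "num_inference_steps": num_inference_steps,
        "cfg_scale": cfg_scale,
        "sigma_shift": sigma_shift,
    }


kwargs = build_generation_kwargs("Footsteps on a rainy street at night.", seconds=20)
# Assumed usage with the pipeline from the example above:
# audio = pipe(**kwargs)
```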

> The underlying DiT is wrapped with `torch.compile` + Triton CUDA Graph. The first call may take a few minutes to compile. If you hit `TorchDynamo` / Triton errors, set `TORCHDYNAMO_DISABLE=1` before launching Python.
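
If exporting the variable in the shell is inconvenient, a sketch of the same workaround from inside a script follows. It assumes the variable is read when PyTorch is first imported, so the assignment is placed before any `torch` import; the imports themselves are left as comments so the snippet runs without the package installed.

```python
import os

# Disable TorchDynamo (and therefore torch.compile) for this process.
# Set this before importing torch or moss_soundeffect_v2.
os.environ["TORCHDYNAMO_DISABLE"] = "1"

# import torch
# from moss_soundeffect_v2 import MossSoundEffectPipeline
```

Expect slower inference with compilation disabled, in exchange for skipping the compile step on the first call.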

For a Gradio demo and fine-tuning recipes, see the [GitHub README](https://github.com/OpenMOSS/MOSS-TTS/tree/main/moss_soundeffect_v2#readme).