MOSS-SoundEffect Model Card
MOSS-SoundEffect is the environment sound & sound effect generation model in the MOSS‑TTS Family. It generates ambient soundscapes and concrete sound effects directly from text descriptions, and is designed to complement speech content with immersive context in production workflows.
1. Overview
1.1 TTS Family Positioning
MOSS-SoundEffect is designed as an audio generation backbone for creating high-fidelity environmental and action sounds from text, serving both scalable content pipelines and a strong research baseline for controllable audio generation.
Design goals
- Coverage & richness: broad sound taxonomy with layered ambience and realistic texture
- Composability: easy integration into creative pipelines (games/film/tools) and synthetic data generation setups
1.2 Key Capabilities
MOSS‑SoundEffect focuses on contextual audio completion beyond speech, enabling creators and systems to enrich scenes with believable acoustic environments and action‑level cues.
What it can generate
- Natural environments: e.g., “fresh snow crunching under footsteps.”
- Urban environments: e.g., “a sports car roaring past on the highway.”
- Animals & creatures: e.g., “early morning park with birds chirping in a quiet atmosphere.”
- Human actions: e.g., “clear footsteps echoing on concrete at a steady rhythm.”
Why it matters
- Completes scene immersion for narrative content, film/TV, documentaries, games, and podcasts.
- Supports voice agents and interactive systems that need ambient context, not just speech.
- Acts as the sound‑design layer of the MOSS‑TTS Family’s end‑to‑end workflow.
1.3 Model Architecture
MOSS-SoundEffect employs the MossTTSDelay architecture (see moss_tts_delay/README.md), reusing the same discrete token generation backbone for audio synthesis. A text prompt (optionally with simple control tags such as duration) is tokenized and fed into the Delay-pattern autoregressive model to predict RVQ audio tokens over time. The generated tokens are then decoded by the audio tokenizer/vocoder to produce high-fidelity sound effects, enabling consistent quality and controllable length across diverse SFX categories.
1.4 Released Models
Recommended decoding hyperparameters
| Model | audio_temperature | audio_top_p | audio_top_k | audio_repetition_penalty |
|---|---|---|---|---|
| MOSS-SoundEffect | 1.5 | 0.6 | 50 | 1.2 |
2. Quick Start
from pathlib import Path
import importlib.util
import torch
import torchaudio
from transformers import AutoModel, AutoProcessor
# Disable the broken cuDNN SDPA backend
torch.backends.cuda.enable_cudnn_sdp(False)
# Keep these enabled as fallbacks
torch.backends.cuda.enable_flash_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(True)
torch.backends.cuda.enable_math_sdp(True)
pretrained_model_name_or_path = "OpenMOSS-Team/MOSS-SoundEffect"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32
def resolve_attn_implementation() -> str:
# Prefer FlashAttention 2 when package + device conditions are met.
if (
device == "cuda"
and importlib.util.find_spec("flash_attn") is not None
and dtype in {torch.float16, torch.bfloat16}
):
major, _ = torch.cuda.get_device_capability()
if major >= 8:
return "flash_attention_2"
# CUDA fallback: use PyTorch SDPA kernels.
if device == "cuda":
return "sdpa"
# CPU fallback.
return "eager"
attn_implementation = resolve_attn_implementation()
print(f"[INFO] Using attn_implementation={attn_implementation}")
processor = AutoProcessor.from_pretrained(
pretrained_model_name_or_path,
trust_remote_code=True,
)
processor.audio_tokenizer = processor.audio_tokenizer.to(device)
text_1 = "雷声隆隆,雨声淅沥。"
text_2 = "清晰脚步声在水泥地面回响,节奏稳定。"
conversations = [
[processor.build_user_message(ambient_sound=text_1)],
[processor.build_user_message(ambient_sound=text_2)]
]
model = AutoModel.from_pretrained(
pretrained_model_name_or_path,
trust_remote_code=True,
attn_implementation=attn_implementation,
torch_dtype=dtype,
).to(device)
model.eval()
batch_size = 1
save_dir = Path("inference_root")
save_dir.mkdir(exist_ok=True, parents=True)
sample_idx = 0
with torch.no_grad():
for start in range(0, len(conversations), batch_size):
batch_conversations = conversations[start : start + batch_size]
batch = processor(batch_conversations, mode="generation")
input_ids = batch["input_ids"].to(device)
attention_mask = batch["attention_mask"].to(device)
outputs = model.generate(
input_ids=input_ids,
attention_mask=attention_mask,
max_new_tokens=4096,
)
for message in processor.decode(outputs):
audio = message.audio_codes_list[0]
out_path = save_dir / f"sample{sample_idx}.wav"
sample_idx += 1
torchaudio.save(out_path, audio.unsqueeze(0), processor.model_config.sampling_rate)
Input Types
UserMessage
| Field | Type | Required | Description |
|---|---|---|---|
ambient_sound |
str |
Yes | Description of environment sound & sound effect |
tokens |
int |
No | Expected number of audio tokens. 1s ≈ 12.5 tokens. |