---
license: apache-2.0
pipeline_tag: text-to-audio
library_name: transformers
tags:
- text-to-audio
- audio-generation
- moss-tts
---

# MOSS-SoundEffect

MOSS-SoundEffect is a high-fidelity text-to-sound model from the **MOSS-TTS Family**, developed by the [OpenMOSS team](https://www.open-moss.com/) and [MOSI.AI](https://mosi.cn/#hero). It generates ambient soundscapes and concrete sound effects directly from text descriptions.

The model architecture and underlying tokenization are presented in the paper: **[MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models](https://huggingface.co/papers/2602.10934)**.

<div align="center">
  <a href="https://github.com/OpenMOSS/MOSS-TTS/tree/main"><img src="https://img.shields.io/badge/Project%20Page-GitHub-blue"></a>
  <a href="https://modelscope.cn/collections/OpenMOSS-Team/MOSS-TTS"><img src="https://img.shields.io/badge/ModelScope-Models-lightgrey?logo=modelscope"></a>
  <a href="https://mosi.cn/#models"><img src="https://img.shields.io/badge/Blog-View-blue?logo=internet-explorer"></a>
  <a href="https://huggingface.co/papers/2602.10934"><img src="https://img.shields.io/badge/Arxiv-2602.10934-red?logo=arxiv"></a>
  <a href="https://studio.mosi.cn"><img src="https://img.shields.io/badge/AIStudio-Try-green?logo=internet-explorer"></a>
  <a href="https://x.com/Open_MOSS"><img src="https://img.shields.io/badge/Twitter-Follow-black?logo=x"></a>
  <a href="https://discord.gg/fvm5TaWjU3"><img src="https://img.shields.io/badge/Discord-Join-5865F2?logo=discord"></a>
</div>


## Overview

MOSS-TTS Family is an open-source **speech and sound generation model family**. It is designed for **high fidelity**, **high expressiveness**, and **complex real-world scenarios**, covering stable long-form speech, multi-speaker dialogue, voice/character design, environmental sound effects, and real-time streaming TTS.

**MOSS-SoundEffect** specifically focuses on **contextual audio completion** beyond speech, enabling creators and systems to enrich scenes with believable acoustic environments and action-level cues.

### Key Capabilities

- **Natural environments**: e.g., “fresh snow crunching under footsteps.”
- **Urban environments**: e.g., “a sports car roaring past on the highway.”
- **Animals & creatures**: e.g., “early morning park with birds chirping in a quiet atmosphere.”
- **Human actions**: e.g., “clear footsteps echoing on concrete at a steady rhythm.”


## Model Architecture

MOSS-SoundEffect employs the **MossTTSDelay** architecture, reusing the family's discrete-token generation backbone for audio synthesis. A text prompt (optionally with simple control tags such as **duration**) is tokenized and fed into the delay-pattern autoregressive model, which predicts **RVQ audio tokens** over time. The generated tokens are then decoded by the CAT (Causal Audio Tokenizer) decoder to produce high-fidelity sound effects.
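The delay pattern staggers each RVQ codebook by one extra timestep, so a single autoregressive stream can predict all codebooks while keeping codebook `k` conditioned on codebooks `0..k-1` at the same audio frame. The sketch below is illustrative only: `-1` stands in for the model's actual padding token, and the real implementation operates on tensors inside MossTTSDelay.

```python
PAD = -1  # placeholder for the model's real delay-padding token (assumption)

def apply_delay_pattern(codes):
    """Shift codebook k right by k steps. `codes` is K rows of T tokens each."""
    K, T = len(codes), len(codes[0])
    out = [[PAD] * (T + K - 1) for _ in range(K)]
    for k in range(K):
        for t in range(T):
            out[k][t + k] = codes[k][t]
    return out

def revert_delay_pattern(delayed):
    """Undo the shift, recovering the original K x T token grid."""
    K = len(delayed)
    T = len(delayed[0]) - (K - 1)
    return [[delayed[k][t + k] for t in range(T)] for k in range(K)]

codes = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # 3 codebooks, 3 frames
delayed = apply_delay_pattern(codes)
print(delayed)  # row k starts with k padding tokens
assert revert_delay_pattern(delayed) == codes
```

After generation, reverting the pattern realigns the codebooks per frame before the tokens are handed to the CAT decoder.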

## Quick Start

### Environment Setup

We recommend a clean, isolated Python environment with **Transformers 5.0.0** to avoid dependency conflicts.

```bash
conda create -n moss-tts python=3.12 -y
conda activate moss-tts

git clone https://github.com/OpenMOSS/MOSS-TTS.git
cd MOSS-TTS
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e .
```
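After installation, a quick sanity check can confirm the interpreter version and that the core packages resolved. This is an illustrative stdlib-only helper, not part of the MOSS-TTS repository:

```python
import importlib.util
import sys

def environment_report() -> dict:
    """Basic checks for the MOSS-TTS setup (illustrative helper)."""
    return {
        "python_ok": sys.version_info >= (3, 12),
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "transformers_installed": importlib.util.find_spec("transformers") is not None,
    }

for name, ok in environment_report().items():
    print(f"{name}: {'OK' if ok else 'missing'}")
```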

### Basic Usage

```python
import torch
import torchaudio
from transformers import AutoModel, AutoProcessor

# Disable the broken cuDNN SDPA backend
torch.backends.cuda.enable_cudnn_sdp(False)
# Keep these enabled as fallbacks
torch.backends.cuda.enable_flash_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(True)
torch.backends.cuda.enable_math_sdp(True)

pretrained_model_name_or_path = "OpenMOSS-Team/MOSS-SoundEffect"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32

processor = AutoProcessor.from_pretrained(
    pretrained_model_name_or_path,
    trust_remote_code=True,
)
processor.audio_tokenizer = processor.audio_tokenizer.to(device)

text = "雷声隆隆,雨声淅沥。"  # Thunder rumbling, rain pattering.

conversations = [
    [processor.build_user_message(ambient_sound=text)]
]

model = AutoModel.from_pretrained(
    pretrained_model_name_or_path,
    trust_remote_code=True,
    torch_dtype=dtype,
).to(device)
model.eval()

with torch.no_grad():
    batch = processor(conversations, mode="generation")
    input_ids = batch["input_ids"].to(device)
    attention_mask = batch["attention_mask"].to(device)

    outputs = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_new_tokens=4096,
    )

for message in processor.decode(outputs):
    audio = message.audio_codes_list[0]
    torchaudio.save("sample.wav", audio.unsqueeze(0), processor.model_config.sampling_rate)
```

## Citation

If you use this model or the CAT architecture in your work, please cite:

```bibtex
@misc{gong2026mossaudiotokenizerscalingaudiotokenizers,
      title={MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models},
      author={Yitian Gong and Kuangwei Chen and Zhaoye Fei and Xiaogui Yang and Ke Chen and Yang Wang and Kexin Huang and Mingshu Chen and Ruixiao Li and Qingyuan Cheng and Shimin Li and Xipeng Qiu},
      year={2026},
      eprint={2602.10934},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2602.10934},
}
```