How to use mlx-community/MOSS-SoundEffect-v2.0-bf16 with MLX:
# Download the model from the Hub
pip install "huggingface_hub[hf_xet]"
huggingface-cli download --local-dir MOSS-SoundEffect-v2.0-bf16 mlx-community/MOSS-SoundEffect-v2.0-bf16
mlx-community/MOSS-SoundEffect-v2.0-bf16
This model, mlx-community/MOSS-SoundEffect-v2.0-bf16, was converted to MLX format from OpenMOSS-Team/MOSS-SoundEffect-v2.0, a text-to-sound-effect diffusion pipeline (foley / ambience / creature / action audio, 48 kHz, up to 30 s) with a 1.3B Wan-style flow-matching DiT, a continuous 128-d DAC VAE (50 Hz latents), and a frozen Qwen3-1.7B text encoder.
Precision: DiT bf16, DAC-VAE fp32 (the reference decodes under fp32 autocast), Qwen3 text encoder bf16.
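The rates quoted above fix the size of the latent sequence the DiT works on. The following is a minimal sketch of that arithmetic only; the function names and the (frames, channels) layout are illustrative assumptions, not part of the moss-sfx-mlx API, and the card does not say whether the pipeline rounds, pads, or always denoises the full 30 s latent.

```python
SAMPLE_RATE_HZ = 48_000  # output sample rate stated in the card
LATENT_RATE_HZ = 50      # DAC VAE latent frame rate stated in the card
LATENT_DIM = 128         # continuous latent width stated in the card
MAX_SECONDS = 30         # maximum clip length stated in the card


def latent_shape(seconds: float) -> tuple[int, int]:
    """Return an assumed (frames, channels) latent shape for a clip length."""
    if not 0 < seconds <= MAX_SECONDS:
        raise ValueError(f"seconds must be in (0, {MAX_SECONDS}], got {seconds}")
    frames = round(seconds * LATENT_RATE_HZ)
    return frames, LATENT_DIM


def samples_per_latent_frame() -> int:
    """Number of audio samples covered by one latent frame."""
    return SAMPLE_RATE_HZ // LATENT_RATE_HZ


print(latent_shape(30))            # (1500, 128)
print(latent_shape(5))             # (250, 128)
print(samples_per_latent_frame())  # 960
```

The 1500-frame figure for a 30 s clip matches the T=1500 production scale quoted in the Parity section.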
Use with mlx
pip install moss-sfx-mlx # https://github.com/xocialize/moss-soundeffect-mlx
from moss_sfx_mlx.pipeline_mlx import MossSoundEffectPipeline
pipe = MossSoundEffectPipeline.from_pretrained("mlx-community/MOSS-SoundEffect-v2.0-bf16")
audio = pipe(
    prompt="a heavy wooden door creaks open slowly",
    seconds=5, num_inference_steps=100, cfg_scale=4.0, seed=0,
)
# audio: (1, 1, samples) mx.array at 48 kHz
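The call above returns a waveform array rather than a file. One way to write it to disk is sketched below using only NumPy and the standard-library wave module. The helper save_wav is hypothetical (it is not part of moss-sfx-mlx), and it assumes the samples are floats in roughly [-1, 1]; a synthetic sine tone stands in for the pipeline output so the snippet runs without the model.

```python
import os
import tempfile
import wave

import numpy as np


def save_wav(audio, path, sample_rate=48_000):
    """Write a mono float waveform (assumed range [-1, 1]) as 16-bit PCM WAV."""
    # Flatten (1, 1, samples) to (samples,). For an mx.array, pass
    # np.array(audio); if it is bf16, cast to float32 in MLX first
    # (NumPy has no bfloat16 dtype).
    samples = np.asarray(audio, dtype=np.float32).reshape(-1)
    # Clip to avoid integer wrap-around, then scale to the int16 range.
    pcm = (np.clip(samples, -1.0, 1.0) * 32767.0).astype("<i2")
    with wave.open(path, "wb") as f:
        f.setnchannels(1)            # mono
        f.setsampwidth(2)            # 16-bit samples
        f.setframerate(sample_rate)  # 48 kHz per the model card
        f.writeframes(pcm.tobytes())
    return len(pcm)


# Stand-in for the pipeline output: 0.1 s of a 440 Hz tone, shape (1, 1, 4800).
t = np.arange(4800) / 48_000
demo = (0.5 * np.sin(2 * np.pi * 440 * t)).astype(np.float32).reshape(1, 1, -1)
out_path = os.path.join(tempfile.gettempdir(), "moss_sfx_demo.wav")
n_written = save_wav(demo, out_path)
```

With the real pipeline, something like save_wav(np.array(audio), "door.wav") should work, provided the output dtype converts to NumPy.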
Parity
Validated against the upstream PyTorch reference (fp32, CPU stream, per-module and end-to-end golden tensors; full suite in the GitHub repo):
End-to-end waveform vs PyTorch golden (10-step CFG denoise): max_abs < 1e-2 fp32
Full-DiT velocity field at production scale (T=1500): max_abs < 1e-2 fp32
DAC-VAE decode vs reference: max_abs < 1e-2 fp32 (no scale constant; the learned post_quant_conv is faithful)
Qwen3 hidden states: cosine 1.0, max_abs 4.4e-4 (fp32 accumulation floor)
10-prompt perceptual A/B at 100 steps: passed human review (correct content, duration, no tonal artifacts)
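The max_abs and cosine figures above are standard tensor-comparison metrics. A minimal NumPy sketch of how such a check can be computed follows; the function names are illustrative and the repo's actual test harness may differ. Random data stands in for the golden tensors.

```python
import numpy as np


def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two arrays."""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return float(np.max(np.abs(a - b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two arrays, treated as flat vectors."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Stand-in data: a "golden" tensor and a slightly perturbed candidate.
rng = np.random.default_rng(0)
golden = rng.standard_normal((4, 256)).astype(np.float32)
noise = 1e-4 * rng.standard_normal(golden.shape).astype(np.float32)
candidate = golden + noise

# Same style of threshold as the parity list: max_abs < 1e-2.
print(max_abs_diff(candidate, golden) < 1e-2)
print(cosine_similarity(candidate, golden))
```

max_abs bounds the worst single element, while cosine similarity summarizes overall direction, which is why the two are usually reported together.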
Performance (Apple M5 Max)
100 steps, cfg 4.0, full 30 s latent: 60 s wall clock, 14.2 GB peak memory.
License
Apache-2.0, matching the upstream model, code, and all components.
Model tree for mlx-community/MOSS-SoundEffect-v2.0-bf16
Base model
Qwen/Qwen3-1.7B-Base