mlx-community/MOSS-SoundEffect-v2.0-bf16

This model, mlx-community/MOSS-SoundEffect-v2.0-bf16, was converted to MLX format from OpenMOSS-Team/MOSS-SoundEffect-v2.0, a text-to-sound-effect diffusion pipeline (foley, ambience, creature, and action audio; 48 kHz; up to 30 s). The pipeline combines a 1.3B Wan-style flow-matching DiT, a continuous 128-d DAC VAE (50 Hz latents), and a frozen Qwen3-1.7B text encoder.

Precision: DiT bf16, DAC-VAE fp32 (the reference decodes under fp32 autocast), Qwen3 text encoder bf16.
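The rates quoted above fix the relationship between clip length, latent length, and output samples. The sketch below only does that arithmetic; the helper names are hypothetical and the rounding rule is an assumption, not something taken from the upstream code.

```python
# Constants taken from the model description above.
SAMPLE_RATE_HZ = 48_000  # output audio sample rate
LATENT_RATE_HZ = 50      # DAC-VAE latent frame rate
LATENT_DIM = 128         # width of each continuous latent frame
MAX_SECONDS = 30         # longest supported clip

# Audio samples produced per latent frame.
SAMPLES_PER_FRAME = SAMPLE_RATE_HZ // LATENT_RATE_HZ


def latent_frames(seconds: float) -> int:
    """Latent frames for a clip (hypothetical helper; rounding is an assumption)."""
    if not 0 < seconds <= MAX_SECONDS:
        raise ValueError(f"seconds must be in (0, {MAX_SECONDS}]")
    return round(seconds * LATENT_RATE_HZ)


def output_samples(seconds: float) -> int:
    """Output samples for a clip, derived from the latent length."""
    return latent_frames(seconds) * SAMPLES_PER_FRAME


print(SAMPLES_PER_FRAME)   # 960
print(latent_frames(30))   # 1500
print(output_samples(5))   # 240000
```

The 1500 frames for a 30 s clip match the T=1500 sequence length quoted in the parity section.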

Use with mlx

pip install moss-sfx-mlx  # https://github.com/xocialize/moss-soundeffect-mlx

from moss_sfx_mlx.pipeline_mlx import MossSoundEffectPipeline

pipe = MossSoundEffectPipeline.from_pretrained("mlx-community/MOSS-SoundEffect-v2.0-bf16")
audio = pipe(prompt="a heavy wooden door creaks open slowly",
             seconds=5, num_inference_steps=100, cfg_scale=4.0, seed=0)
# audio: (1, 1, samples) mx.array at 48 kHz
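The pipeline returns a raw array rather than a file, so a common next step is writing it to disk. The following is a minimal standard-library sketch that writes mono 16-bit PCM at 48 kHz. `write_wav_mono16` is a hypothetical helper, not part of moss-sfx-mlx, and a synthetic sine tone stands in for the model output. With the real pipeline, the (1, 1, samples) array would first need flattening to a 1-D sequence of floats, for example with something like `audio.reshape(-1).tolist()` (an assumption about the returned array).

```python
import math
import os
import struct
import tempfile
import wave


def write_wav_mono16(path, samples, sample_rate=48_000):
    """Write an iterable of floats in [-1, 1] as mono 16-bit PCM; return the frame count."""
    frames = bytearray()
    count = 0
    for x in samples:
        x = max(-1.0, min(1.0, float(x)))  # clip so out-of-range values cannot wrap
        frames += struct.pack("<h", round(x * 32767))  # little-endian signed 16-bit
        count += 1
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)            # mono
        wf.setsampwidth(2)            # 2 bytes = 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(bytes(frames))
    return count


# Stand-in for model output: 0.1 s of a 440 Hz sine at half amplitude.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 48_000) for n in range(4_800)]
demo_path = os.path.join(tempfile.gettempdir(), "moss_sfx_demo.wav")
n_frames = write_wav_mono16(demo_path, tone)
```

Clipping before the integer conversion is a deliberate choice here: a value slightly above 1.0 would otherwise fall outside the 16-bit range and fail to pack.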

Parity

Validated against the upstream PyTorch reference (fp32, CPU stream, per-module and end-to-end golden tensors; full suite in the GitHub repo):

End-to-end waveform vs PyTorch golden (10-step CFG denoise): max_abs < 1e-2 fp32
Full-DiT velocity field at production scale (T=1500): max_abs < 1e-2 fp32
DAC-VAE decode vs reference: max_abs < 1e-2 fp32 (no scale constant — the learned post_quant_conv is faithful)
Qwen3 hidden states: cosine 1.0, max_abs 4.4e-4 (fp32 accumulation floor)
10-prompt perceptual A/B at 100 steps: passed human review (correct content, duration, no tonal artifacts)
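For readers unfamiliar with the two metrics quoted in the list above, here is a minimal pure-Python sketch of maximum absolute difference and cosine similarity on flattened tensors. The function names and the sample values are illustrative only; they are not taken from the project's test suite.

```python
import math


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and the same length")
    return max(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up values standing in for a golden tensor and a ported output.
golden = [0.10, -0.25, 0.50, 0.00]
candidate = [0.101, -0.2495, 0.498, 0.0004]

assert max_abs_diff(golden, candidate) < 1e-2      # same style of bound as above
assert cosine_similarity(golden, candidate) > 0.999
```

The two metrics answer different questions: max_abs bounds the worst single element, while cosine similarity checks overall direction and ignores scale.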

Performance (Apple M5 Max)

100 steps, cfg 4.0, full 30 s latent: 60 s wall clock, 14.2 GB peak memory.
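The figures above can be turned into two derived numbers that are easier to compare across machines. This sketch assumes the 60 s covers the whole call (text encoding, denoising, and VAE decode), so the per-step value is an upper bound on the cost of one denoising step rather than a measurement of it.

```python
# Figures from the performance line above.
WALL_CLOCK_S = 60.0  # total wall-clock time
AUDIO_S = 30.0       # duration of generated audio
STEPS = 100          # denoising steps

# Average wall-clock time attributed to each step (upper bound, see the note above).
seconds_per_step = WALL_CLOCK_S / STEPS

# Real-time factor: wall-clock seconds spent per second of audio (above 1 is slower than real time).
real_time_factor = WALL_CLOCK_S / AUDIO_S

print(f"{seconds_per_step:.2f} s/step, RTF {real_time_factor:.1f}")  # 0.60 s/step, RTF 2.0
```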

License

Apache-2.0, matching the upstream model, code, and all components.


Model tree for mlx-community/MOSS-SoundEffect-v2.0-bf16

Base model: Qwen/Qwen3-1.7B-Base
Finetuned: Qwen/Qwen3-1.7B
Finetuned: OpenMOSS-Team/MOSS-SoundEffect-v2.0
Finetuned (2): this model