mlx-community/MOSS-SoundEffect-v2.0-4bit

This model, mlx-community/MOSS-SoundEffect-v2.0-4bit, was converted to MLX format from OpenMOSS-Team/MOSS-SoundEffect-v2.0, a text-to-sound-effect diffusion pipeline (foley / ambience / creature / action audio, 48 kHz, up to 30 s) with a 1.3B Wan-style flow-matching DiT, a continuous 128-d DAC VAE (50 Hz latents), and a frozen Qwen3-1.7B text encoder.

Precision: DiT int4 (group_size 64, transformer-block Linears only; embeddings, time/text projections, head, and norms stay bf16), DAC-VAE fp32, Qwen3 text encoder bf16.
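To make "int4, group_size 64" concrete, here is a minimal pure-Python sketch of group-wise affine quantization: each run of 64 weights gets its own scale and offset, and every weight is stored as a 4-bit code. It illustrates the general idea only. The function names are hypothetical, and the rounding scheme, bit packing, and 16-bit scale/offset storage are assumptions, not a description of the MLX kernels used for this checkpoint.

```python
def quantize_group(weights, bits=4):
    """Affine-quantize one group of floats to integer codes in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1                      # 15 for 4-bit codes
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo                       # lo acts as the per-group offset


def dequantize_group(codes, scale, offset):
    """Reconstruct approximate floats from codes, scale, and offset."""
    return [c * scale + offset for c in codes]


def quantize_row(row, group_size=64, bits=4):
    """Split one weight row into fixed-size groups and quantize each group."""
    if len(row) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    return [
        quantize_group(row[i:i + group_size], bits)
        for i in range(0, len(row), group_size)
    ]


def effective_bits_per_weight(bits=4, group_size=64, meta_bits=16):
    """Storage cost per weight, ASSUMING one scale and one offset of
    `meta_bits` bits each are stored per group."""
    return bits + 2 * meta_bits / group_size
```

Under those storage assumptions a quantized layer costs about 4.5 bits per weight rather than 4, which is one reason the file size does not drop by a full 4x relative to bf16.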

Use with mlx

pip install moss-sfx-mlx  # https://github.com/xocialize/moss-soundeffect-mlx

from moss_sfx_mlx.pipeline_mlx import MossSoundEffectPipeline

pipe = MossSoundEffectPipeline.from_pretrained("mlx-community/MOSS-SoundEffect-v2.0-4bit")
audio = pipe(prompt="a heavy wooden door creaks open slowly",
             seconds=5, num_inference_steps=100, cfg_scale=4.0, seed=0)
# audio: (1, 1, samples) mx.array at 48 kHz
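
The pipeline returns raw samples, so a typical next step is writing them to disk. The helper below is a sketch using only the Python standard library. `write_wav_mono16` is a hypothetical name, not part of `moss_sfx_mlx`, and it assumes the samples are floats in roughly [-1, 1] that have already been flattened to a plain Python list.

```python
import struct
import wave


def write_wav_mono16(path, samples, sample_rate=48_000):
    """Write a list of float samples as a 16-bit PCM mono WAV file.

    Values outside [-1, 1] are clipped before conversion.
    Returns the number of frames written.
    """
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))              # clip to the valid range
        frames += struct.pack("<h", int(round(s * 32767)))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)                            # mono
        wav.setsampwidth(2)                            # 2 bytes = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
    return len(samples)


# Assumed usage with the `audio` array from the example above
# (flattening the (1, 1, samples) mx.array to a Python list):
# write_wav_mono16("door.wav", audio.reshape(-1).tolist())
```

For long clips a vectorised conversion through NumPy or a dedicated audio library would be faster; the loop here is kept for clarity.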

Parity

Validated against the upstream PyTorch reference (fp32, CPU stream, per-module and end-to-end golden tensors; full suite in the GitHub repo):

  • End-to-end waveform vs PyTorch golden (10-step CFG denoise): max_abs < 1e-2 fp32

  • Full-DiT velocity field at production scale (T=1500): max_abs < 1e-2 fp32

  • DAC-VAE decode vs reference: max_abs < 1e-2 fp32 (no scale constant: the learned post_quant_conv is faithful)

  • Qwen3 hidden states: cosine 1.0, max_abs 4.4e-4 (fp32 accumulation floor)

  • int4 DiT per-pass cosine vs bf16 on identical injected inputs: 0.999425 (gate 0.99)

  • 10-prompt perceptual A/B at 100 steps: passed human review (correct content, duration, no tonal artifacts)
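
The two numeric metrics quoted above, maximum absolute difference and cosine similarity, can be computed as in the sketch below. This is an illustration of the metrics on flat Python lists, not the repository's actual test suite. The function names are hypothetical, and the default thresholds simply mirror the figures listed (1e-2 for max_abs, 0.99 for the cosine gate).

```python
import math


def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(reference) != len(candidate) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    return max(abs(r - c) for r, c in zip(reference, candidate))


def cosine_similarity(reference, candidate):
    """Cosine of the angle between two equal-length vectors."""
    if len(reference) != len(candidate) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    dot = sum(r * c for r, c in zip(reference, candidate))
    norm_r = math.sqrt(sum(r * r for r in reference))
    norm_c = math.sqrt(sum(c * c for c in candidate))
    return dot / (norm_r * norm_c)


def passes_gates(reference, candidate, max_abs_tol=1e-2, cosine_gate=0.99):
    """True when both the max-abs tolerance and the cosine gate are satisfied."""
    return (
        max_abs_diff(reference, candidate) < max_abs_tol
        and cosine_similarity(reference, candidate) >= cosine_gate
    )
```

The two metrics are complementary: max_abs catches a single badly wrong element, while cosine similarity tracks overall direction and is insensitive to a uniform rescaling, which makes it a natural choice for comparing quantized against unquantized outputs.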

Performance (Apple M5 Max)

100 steps, cfg 4.0, full 30 s latent: 45 s wall clock, 12.2 GB peak memory. Quantization shrinks the DiT from 2.83 GB to 0.83 GB.
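
A few derived quantities follow directly from the figures in this card. The arithmetic below is a sketch that uses only numbers stated above. The variable names are illustrative, and reading T=1500 in the parity section as the latent length of a full 30 s clip is an inference from 30 s × 50 Hz, not something the card states.

```python
SAMPLE_RATE_HZ = 48_000      # output audio rate
LATENT_RATE_HZ = 50          # DAC-VAE latent frame rate
CLIP_SECONDS = 30            # maximum clip length
WALL_CLOCK_SECONDS = 45      # reported generation time for a full clip
DIT_BEFORE_GB = 2.83         # reported DiT size before quantization
DIT_AFTER_GB = 0.83          # reported DiT size after quantization

latent_frames = CLIP_SECONDS * LATENT_RATE_HZ          # 1500 frames
samples_per_frame = SAMPLE_RATE_HZ // LATENT_RATE_HZ   # 960 samples per latent frame
output_samples = CLIP_SECONDS * SAMPLE_RATE_HZ         # 1,440,000 samples
speed_vs_realtime = CLIP_SECONDS / WALL_CLOCK_SECONDS  # below 1.0 means slower than real time
dit_shrink_factor = DIT_BEFORE_GB / DIT_AFTER_GB

print(f"latent frames:      {latent_frames}")
print(f"samples per frame:  {samples_per_frame}")
print(f"output samples:     {output_samples}")
print(f"speed vs real time: {speed_vs_realtime:.2f}x")
print(f"DiT shrink factor:  {dit_shrink_factor:.2f}x")
```

At these settings generation runs at roughly two thirds of real-time speed, and the DiT is about 3.4 times smaller after quantization.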

License

Apache-2.0, matching the upstream model, code, and all components.
