MOSS-Audio-4B-Thinking-MLX-4bit

Release Status

  • Current community release: Yes
  • Variant: Pure-MLX end-to-end (LLM + audio path), fully INT4
  • Canonical pair: Released alongside MOSS-Audio-8B-Thinking-MLX-hybrid

This model was converted to MLX format from OpenMOSS-Team/MOSS-Audio-4B-Thinking and released as the pure-MLX 4-bit variant.

  • LLM: Qwen3-4B INT4 (group size 64)
  • Audio path: encoder + adapter + DeepStack, all INT4 on MLX
  • Runtime: no PyTorch at inference time
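
Group-size-64 INT4 means that every group of 64 consecutive weights shares one scale and one bias, with each weight stored as a 4-bit code. MLX implements this natively (`mx.quantize(w, group_size=64, bits=4)`); the numpy sketch below is purely illustrative of the affine scheme, not the actual MLX kernel:

```python
import numpy as np

def quantize_int4(w, group_size=64):
    """Affine 4-bit quantization per group of `group_size` weights.
    Each group stores a float scale/bias plus 4-bit codes in [0, 15]."""
    g = w.reshape(-1, group_size)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    scale = (hi - lo) / 15.0
    scale = np.where(scale == 0, 1.0, scale)  # guard flat groups against div-by-zero
    codes = np.clip(np.round((g - lo) / scale), 0, 15).astype(np.uint8)
    return codes, scale, lo

def dequantize_int4(codes, scale, bias):
    """Reconstruct float weights; error is at most half a quantization step."""
    return (codes.astype(np.float32) * scale + bias).reshape(-1)

w = np.random.default_rng(0).normal(size=4096).astype(np.float32)
codes, scale, bias = quantize_int4(w)
w_hat = dequantize_int4(codes, scale, bias)
max_err = float(np.abs(w - w_hat).max())
```

The per-group scale/bias is why the on-disk footprint is slightly above a naive 4 bits per weight: each 64-weight group carries two extra scalars.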

Layout

mlx_llm/         Qwen3-4B INT4 weights + tokenizer      (2.3 GB)
mlx_audio/       Audio encoder + adapter + DeepStack    (442 MB)
                 mergers, all INT4
scripts/         Pure-MLX bridge source
inference.py     Standalone example

Measured on Apple M3 Ultra

                          This bundle   Full BF16
Decode speed              163 t/s       33 t/s
Decode steady-state peak  3.0 GB        26 GB
Transient prefill peak    3.8 GB        26 GB
Disk footprint            2.74 GB       10.4 GB

Audio quality varies by domain (speech vs non-speech); validate on your target data.

Mobile viability

This bundle targets iPhone Pro-class devices (8 GB RAM, ~4-5 GB app budget). The 3.8 GB transient prefill peak (a spike lasting roughly 150 ms) has not yet been validated on-device.

Usage

pip install mlx mlx-lm librosa numpy transformers safetensors
python inference.py --audio your_clip.wav

The inference script auto-detects bundle size from the adapter weights. No PyTorch is required at runtime: mel spectrogram computation and input_ids expansion are pure MLX via scripts/moss_audio_mel_mlx.py. For Swift/iOS integration, port scripts/moss_audio_mlx_bridge_v3.py + scripts/moss_audio_mel_mlx.py (~450 lines combined, no PyTorch deps).
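
A mel front end is just a triangular filterbank applied to an STFT power spectrum, which is why it ports cleanly to MLX or Swift. Below is a pure-numpy sketch of the standard HTK mel scale and filterbank construction; the actual constants (n_mels, n_fft, sample rate) live in scripts/moss_audio_mel_mlx.py and may differ from the illustrative values used here:

```python
import numpy as np

def hz_to_mel(f):
    """HTK mel scale: mel = 2595 * log10(1 + f/700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=np.float64) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=np.float64) / 2595.0) - 1.0)

def mel_filterbank(n_mels=128, n_fft=400, sr=16000):
    """Triangular filters mapping an FFT power spectrum to mel bands.
    Shape: (n_mels, n_fft // 2 + 1); apply as fb @ power_spectrum."""
    fft_freqs = np.linspace(0, sr / 2, n_fft // 2 + 1)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    hz_pts = mel_to_hz(mel_pts)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        up = (fft_freqs - left) / (center - left)      # rising edge of triangle
        down = (right - fft_freqs) / (right - center)  # falling edge
        fb[i] = np.clip(np.minimum(up, down), 0.0, None)
    return fb

fb = mel_filterbank()
```

On MLX the same filterbank becomes a single `mx.array` matmul per frame, which is the whole reason the audio path can drop PyTorch.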

Defaults and knobs

  • --repetition-penalty 1.02: suppresses decode loops on non-speech clips without starving genre descriptions. Raising it to 1.05 hurts clip_07 quality; lowering it to 1.00 regresses 0.10 on the composite score.
  • --max-tokens 2048: sufficient for the full <think>...</think> + answer output on all measured clips (max observed: 640 tokens).
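
The repetition penalty referenced above is the standard CTRL-style rule: logits of already-generated tokens are divided by the penalty when positive and multiplied when negative, so repeats always become less likely. A minimal sketch of that rule (illustrative; mlx-lm applies it inside its own sampler):

```python
import numpy as np

def apply_repetition_penalty(logits, generated_ids, penalty=1.02):
    """CTRL-style repetition penalty: damp logits of already-seen tokens.
    Positive logits are divided by `penalty`, negative ones multiplied."""
    out = np.asarray(logits, dtype=np.float64).copy()
    seen = np.unique(generated_ids)
    vals = out[seen]
    out[seen] = np.where(vals > 0, vals / penalty, vals * penalty)
    return out

logits = np.array([2.0, -1.0, 0.5, 3.0])
penalized = apply_repetition_penalty(logits, [0, 1], penalty=1.02)
# token 0: 2.0 / 1.02; token 1: -1.0 * 1.02; tokens 2 and 3 untouched
```

A value as mild as 1.02 barely shifts a single logit, but compounds over a long decode, which is why it breaks loops without flattening legitimate word reuse.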

Evaluation (BFCL v3, community-run)

Text-only tool-calling evaluation (not an audio-caption quality metric), run on a 600-sample subset (simple/multiple/parallel = 200 each), greedy decoding (repetition_penalty=0), measured 2026-05.

Category        This bundle   4B BF16   Qwen3-4B base
simple (200)    92.0%         93.5%     92.5%
multiple (200)  89.5%         93.0%     88.0%
parallel (200)  43.5%         71.5%     87.5%
3-cat avg       75.0%         86.0%     89.3%

Single-call categories (simple/multiple) stay within 1-4 pp of BF16. Parallel degrades by ~28 pp under INT4, a genuine decode regression rather than a parser artifact. If your use case doesn't require parallel multi-function calls in a single turn, this bundle is production-viable for tool calling.

Limitations

  • Fine-grained music genre (clip_07 holiday music): weakest clip. Hit rate on "sleigh bells/brass" is ~33% even with RP=1.02 dialed in.
  • CJK proper nouns: quality varies. Korean esports commentary preserves Hangul structure; Chinese scripted dialog is reliable; Chinese mixed-register (clip_04) sometimes paraphrases to English.
  • Thinking-block truncation: 1/21 runs hits max_tokens=2048 inside <think> on decode loops. RP=1.02 brings this down from the 2-4/21 seen in earlier configs.

Contributors

  • Rumilabs Inc (building the richest content knowledge base in the world to empower interactive media): quantization, MLX conversion, pure-MLX audio runtime, benchmarking, and release packaging.

License

Apache-2.0 (inherited from base model).

Citation

Base model: MOSS-Audio.
