# MOSS-Audio-4B-Thinking-MLX-4bit
## Release Status
- Current community release: Yes
- Variant: Pure-MLX end-to-end (LLM + audio path), fully INT4
- Canonical pair: Released alongside MOSS-Audio-8B-Thinking-MLX-hybrid
This model was converted to MLX format from OpenMOSS-Team/MOSS-Audio-4B-Thinking and released as the pure-MLX 4-bit variant.
- LLM: Qwen3-4B INT4 (group size 64; illustrated in the sketch after this list)
- Audio path: encoder + adapter + DeepStack, all INT4 on MLX
- Runtime: no PyTorch at inference time
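The INT4 / group-size-64 setting above maps directly onto MLX's built-in quantizer. A minimal illustration of what that setting means (this is not the repo's actual conversion pipeline, which also covers the audio modules):

```python
import mlx.nn as nn

# Illustration only: apply INT4 quantization with group size 64 to a toy module,
# the same setting listed above for the Qwen3-4B LLM weights.
toy = nn.Sequential(nn.Linear(512, 512), nn.Linear(512, 512))
nn.quantize(toy, group_size=64, bits=4)  # swaps Linear layers for QuantizedLinear in place
print(toy)
```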
## Layout
| Path | Contents |
|---|---|
| `mlx_llm/` | Qwen3-4B INT4 weights + tokenizer (2.3 GB) |
| `mlx_audio/` | Audio encoder + adapter + DeepStack mergers, all INT4 (442 MB) |
| `scripts/` | Pure-MLX bridge source |
| `inference.py` | Standalone example |
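The LLM half can be smoke-tested on its own (text-only), assuming `mlx_llm/` follows the standard mlx-lm bundle layout of config + INT4 weights + tokenizer. The audio path is loaded separately by the bridge code in `scripts/`, so this sketch does not exercise it:

```python
# Sketch only: text-only smoke test of the quantized LLM half.
# Assumes mlx_llm/ is a standard mlx-lm bundle; the audio encoder/adapter in
# mlx_audio/ are wired up by the bridge code in scripts/, not here.
from mlx_lm import load, generate

model, tokenizer = load("mlx_llm")  # path to the mlx_llm/ directory of this repo
print(generate(model, tokenizer, prompt="Say hello in one short sentence.", max_tokens=32))
```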
## Measured on Apple M3 Ultra

| Metric | This bundle | Full BF16 |
|---|---|---|
| Decode speed | 163 t/s | 33 t/s |
| Decode steady-state peak memory | 3.0 GB | 26 GB |
| Transient prefill peak memory | 3.8 GB | 26 GB |
| Disk footprint | 2.74 GB | 10.4 GB |
Quality varies by audio domain (speech vs. non-speech); validate on your target data.
## Mobile viability
This bundle targets iPhone Pro-class devices (8 GB RAM, ~4-5 GB app memory budget). The 3.8 GB transient prefill peak (a spike lasting ~150 ms) has not yet been validated on-device.
## Usage

```bash
pip install mlx mlx-lm librosa numpy transformers safetensors
python inference.py --audio your_clip.wav
```
The inference script auto-detects bundle size from the adapter weights.
No PyTorch is required at runtime: mel spectrogram computation and input_ids expansion are pure-MLX via `scripts/moss_audio_mel_mlx.py`. For Swift/iOS integration, port `scripts/moss_audio_mlx_bridge_v3.py` + `scripts/moss_audio_mel_mlx.py` (~450 lines combined, no PyTorch deps).
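For orientation, here is a minimal sketch of the pure-MLX mel idea. It is not the actual `scripts/moss_audio_mel_mlx.py`; the FFT size, hop, and mel count below are illustrative placeholders, and librosa is used only for audio loading and the mel filterbank matrix:

```python
import librosa
import mlx.core as mx
import numpy as np

def mel_spectrogram_mlx(path, sr=16000, n_fft=400, hop=160, n_mels=128):
    """Log-mel features computed with MLX ops after a CPU-side load/resample."""
    audio, _ = librosa.load(path, sr=sr)                 # mono float32 waveform
    window = np.hanning(n_fft).astype(np.float32)
    mel_fb = mx.array(librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels))

    # Slice the waveform into overlapping frames, then FFT on the MLX side.
    starts = range(0, len(audio) - n_fft + 1, hop)
    frames = np.stack([audio[s:s + n_fft] * window for s in starts])
    power = mx.abs(mx.fft.rfft(mx.array(frames))) ** 2   # (frames, n_fft // 2 + 1)
    mel = power @ mel_fb.T                                # (frames, n_mels)
    return mx.log(mx.maximum(mel, 1e-10))
```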
## Defaults and knobs

- `--repetition-penalty 1.02`: kills decode loops on non-speech clips without starving genre descriptions. Raising to 1.05 hurts clip_07 quality; lowering to 1.00 regresses 0.10 on the composite.
- `--max-tokens 2048`: sufficient for the full `<think>...</think>` + answer output on all measured clips (max observed: 640 tokens).
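For example, the invocation from the Usage section with these knobs spelled out explicitly:

```bash
python inference.py --audio your_clip.wav --repetition-penalty 1.02 --max-tokens 2048
```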
## Evaluation (BFCL v3, community-run)
Text-only tool-calling evaluation (not an audio-caption quality metric), run on a
600-sample subset (simple/multiple/parallel = 200 each), greedy decoding
(repetition_penalty=0), measured 2026-05.
| Category | This bundle | 4B BF16 | Qwen3-4B base |
|---|---|---|---|
| simple (200) | 92.0% | 93.5% | 92.5% |
| multiple (200) | 89.5% | 93.0% | 88.0% |
| parallel (200) | 43.5% | 71.5% | 87.5% |
| 3-cat avg | 75.0% | 86.0% | 89.3% |
Single-call categories (simple/multiple) are within 1-4 pp of BF16. Parallel degrades ~28 pp under INT4, a genuine decode regression rather than a parser artifact. If your use case doesn't require parallel multi-function calls in one turn, this bundle is production-viable for tool calling.
## Limitations
- Fine-grained music genre (clip_07 holiday music): weakest clip. Hit rate on "sleigh bells/brass" is ~33% even with RP=1.02 dialed in.
- CJK proper nouns: quality varies. Korean esports commentary preserves Hangul structure; Chinese scripted dialog is reliable; Chinese mixed-register (clip_04) sometimes paraphrases to English.
- Thinking-block truncation: 1/21 runs hits `max_tokens=2048` inside `<think>` on decode loops. RP=1.02 brings this down from the 2-4/21 seen in earlier configs.
## Contributors
- Rumilabs Inc: quantization, MLX conversion, pure-MLX audio runtime, benchmarking, and release packaging. We are building the richest content knowledge base in the world to empower interactive media.
## License
Apache-2.0 (inherited from base model).
## Citation
Base model: MOSS-Audio.