# MOSS-Audio-4B-Thinking-MLX-4bit
## Release Status
- Current community release: Yes
- Variant: Pure-MLX end-to-end (LLM + audio path), fully INT4
- Canonical pair: Released alongside MOSS-Audio-8B-Thinking-MLX-hybrid
This model was converted to MLX format from OpenMOSS-Team/MOSS-Audio-4B-Thinking and released as the pure-MLX 4-bit variant.
- LLM: Qwen3-4B INT4 (group size 64; illustrated in the sketch after this list)
- Audio path: encoder + adapter + DeepStack, all INT4 on MLX
- Runtime: no PyTorch at inference time
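The INT4 / group-size-64 setting above maps directly onto MLX's built-in quantizer. A minimal illustration of what that setting means (this is not the repo's actual conversion pipeline, which also covers the audio modules):

```python
import mlx.nn as nn

# Illustration only: apply INT4 quantization with group size 64 to a toy module,
# the same setting listed above for the Qwen3-4B LLM weights.
toy = nn.Sequential(nn.Linear(512, 512), nn.Linear(512, 512))
nn.quantize(toy, group_size=64, bits=4)  # swaps Linear layers for QuantizedLinear in place
print(toy)
```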
## Layout
| Path | Contents |
|---|---|
| `mlx_llm/` | Qwen3-4B INT4 weights + tokenizer (2.3 GB) |
| `mlx_audio/` | Audio encoder + adapter + DeepStack mergers, all INT4 (442 MB) |
| `scripts/` | Pure-MLX bridge source |
| `inference.py` | Standalone example |
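The LLM half can be smoke-tested on its own (text-only), assuming `mlx_llm/` follows the standard mlx-lm bundle layout of config + INT4 weights + tokenizer. The audio path is loaded separately by the bridge code in `scripts/`, so this sketch does not exercise it:

```python
# Sketch only: text-only smoke test of the quantized LLM half.
# Assumes mlx_llm/ is a standard mlx-lm bundle; the audio encoder/adapter in
# mlx_audio/ are wired up by the bridge code in scripts/, not here.
from mlx_lm import load, generate

model, tokenizer = load("mlx_llm")  # path to the mlx_llm/ directory of this repo
print(generate(model, tokenizer, prompt="Say hello in one short sentence.", max_tokens=32))
```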
## Measured on Apple M3 Ultra

| Metric | This bundle | Full BF16 |
|---|---|---|
| Decode speed | 163 t/s | 33 t/s |
| Decode steady-state peak memory | 3.0 GB | 26 GB |
| Transient prefill peak memory | 3.8 GB | 26 GB |
| Disk footprint | 2.74 GB | 10.4 GB |
Quality varies by audio domain (speech vs. non-speech); validate on your target data.
## Mobile viability
This bundle targets iPhone Pro-class devices (8 GB RAM, ~4-5 GB app memory budget). The 3.8 GB transient prefill peak (a spike lasting ~150 ms) has not yet been validated on-device.
## Usage

```bash
pip install mlx mlx-lm librosa numpy transformers safetensors
python inference.py --audio your_clip.wav
```
The inference script auto-detects bundle size from the adapter weights.
No PyTorch is required at runtime: mel spectrogram computation and input_ids expansion are pure-MLX via `scripts/moss_audio_mel_mlx.py`. For Swift/iOS integration, port `scripts/moss_audio_mlx_bridge_v3.py` + `scripts/moss_audio_mel_mlx.py` (~450 lines combined, no PyTorch deps).
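For orientation, here is a minimal sketch of the pure-MLX mel idea. It is not the actual `scripts/moss_audio_mel_mlx.py`; the FFT size, hop, and mel count below are illustrative placeholders, and librosa is used only for audio loading and the mel filterbank matrix:

```python
import librosa
import mlx.core as mx
import numpy as np

def mel_spectrogram_mlx(path, sr=16000, n_fft=400, hop=160, n_mels=128):
    """Log-mel features computed with MLX ops after a CPU-side load/resample."""
    audio, _ = librosa.load(path, sr=sr)                 # mono float32 waveform
    window = np.hanning(n_fft).astype(np.float32)
    mel_fb = mx.array(librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels))

    # Slice the waveform into overlapping frames, then FFT on the MLX side.
    starts = range(0, len(audio) - n_fft + 1, hop)
    frames = np.stack([audio[s:s + n_fft] * window for s in starts])
    power = mx.abs(mx.fft.rfft(mx.array(frames))) ** 2   # (frames, n_fft // 2 + 1)
    mel = power @ mel_fb.T                                # (frames, n_mels)
    return mx.log(mx.maximum(mel, 1e-10))
```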
## Defaults and knobs

- `--repetition-penalty 1.02`: kills decode loops on non-speech clips without starving genre descriptions. Raising to 1.05 hurts clip_07 quality; lowering to 1.00 regresses 0.10 on the composite.
- `--max-tokens 2048`: sufficient for the full `<think>...</think>` + answer output on all measured clips (max observed: 640 tokens).
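For example, the invocation from the Usage section with these knobs spelled out explicitly:

```bash
python inference.py --audio your_clip.wav --repetition-penalty 1.02 --max-tokens 2048
```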
## Evaluation (BFCL v3, community-run)
Text-only tool-calling evaluation (not an audio-caption quality metric), run on a
600-sample subset (simple/multiple/parallel = 200 each), greedy decoding
(repetition_penalty=0), measured 2026-05.
| Category | This bundle | 4B BF16 | Qwen3-4B base |
|---|---|---|---|
| simple (200) | 92.0% | 93.5% | 92.5% |
| multiple (200) | 89.5% | 93.0% | 88.0% |
| parallel (200) | 43.5% | 71.5% | 87.5% |
| 3-cat avg | 75.0% | 86.0% | 89.3% |
Single-call categories (simple/multiple) are within 1-4 pp of BF16. Parallel degrades ~28 pp under INT4, a genuine decode regression rather than a parser artifact. If your use case doesn't require parallel multi-function calls in one turn, this bundle is production-viable for tool calling.
## Limitations
- Fine-grained music genre (clip_07 holiday music): weakest clip. Hit rate on "sleigh bells/brass" is ~33% even with RP=1.02 dialed in.
- CJK proper nouns: quality varies. Korean esports commentary preserves Hangul structure; Chinese scripted dialog is reliable; Chinese mixed-register (clip_04) sometimes paraphrases to English.
- Thinking-block truncation: 1/21 runs hits `max_tokens=2048` inside `<think>` on decode loops. RP=1.02 brings this down from the 2-4/21 seen in earlier configs.
## Contributors
- Rumilabs Inc: quantization, MLX conversion, pure-MLX audio runtime, benchmarking, and release packaging. We are building the richest content knowledge base in the world to empower interactive media.
## License
Apache-2.0 (inherited from base model).
## Citation
Base model: MOSS-Audio.