MOSS-Music-8B-Thinking · MLX 4-bit

A 4-bit MLX quantization of OpenMOSS-Team/MOSS-Music-8B-Thinking for music understanding on Apple Silicon. It is the smallest build (~6 GB) and a good fit for 16 GB Macs.

Community conversion, not an official release. All model credit goes to the OpenMOSS Team.

Other sizes: 8-bit · 6-bit

Usage

MOSS-Music is a custom multimodal (audio + text) model, so it does not load with mlx_lm / mlx_vlm directly. Use the moss_music_mlx backend (code, PR):

from huggingface_hub import snapshot_download
from moss_music_mlx import load_pretrained, generate
from src.processing_moss_music import MossMusicProcessor

# Download the quantized weights from the Hugging Face Hub (cached after the first run).
path = snapshot_download("mlx-community/MOSS-Music-8B-Thinking-4bit")
# Load the MLX model and the processor from the same snapshot.
model = load_pretrained(path)
proc = MossMusicProcessor.from_pretrained(path, trust_remote_code=True, enable_time_marker=True)
# Ask a question about a local audio file and print the model's answer.
print(generate(model, proc, "Analyze this track: genre, key, BPM, structure.", audio_path="song.mp3"))

Conversion

4-bit, group size 64. The audio encoder is kept at bf16 to preserve audio fidelity; quantization is applied to the Qwen3 layers, token embeddings and lm_head.
Converted with mlx==0.31.2, mlx-lm==0.29.1.
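
To make "4-bit, group size 64" concrete, here is a minimal pure-Python sketch of group-wise affine quantization. It is an illustration only, not the conversion script used for this repository: the function names are hypothetical, and it assumes the usual affine scheme in which every group of 64 weights shares one scale and one bias, each stored in 16 bits.

```python
import random


def quantize_group(weights, bits=4):
    """Affine-quantize one group of floats to unsigned integer codes."""
    levels = (1 << bits) - 1              # 15 distinct steps for 4-bit codes
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels
    if scale == 0.0:                      # constant group: any non-zero scale works
        scale = 1.0
    codes = [round((w - w_min) / scale) for w in weights]
    return codes, scale, w_min            # w_min plays the role of the bias


def dequantize_group(codes, scale, bias):
    """Map integer codes back to approximate float weights."""
    return [c * scale + bias for c in codes]


def quantize(weights, group_size=64, bits=4):
    """Quantize a flat weight list, one (codes, scale, bias) triple per group."""
    if len(weights) % group_size:
        raise ValueError("number of weights must be a multiple of group_size")
    return [
        quantize_group(weights[i:i + group_size], bits)
        for i in range(0, len(weights), group_size)
    ]


def bits_per_weight(bits=4, group_size=64, param_bits=16):
    """Effective storage cost, counting one scale and one bias per group.

    param_bits=16 is an assumption about how scales and biases are stored.
    """
    return bits + 2 * param_bits / group_size


# Toy data: two groups of 64 pseudo-random weights.
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(128)]
groups = quantize(weights)
restored = [w for g in groups for w in dequantize_group(*g)]
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Under these assumptions the per-group overhead brings the effective cost of the quantized layers to slightly more than 4 bits per weight, while the audio encoder kept at bf16 still costs 16 bits per weight.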

Accuracy

Versus the fp32 PyTorch reference, the 4-bit model's prefill next-token argmax is identical and the logits match with a cosine similarity of 0.99889 (8-bit: 0.99999, 6-bit: 0.99989). 4-bit is the most aggressive recipe; if fidelity matters more than size, prefer 6-bit or 8-bit.
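
The two metrics quoted above can be computed from a pair of logit vectors as in the sketch below. This is an assumed reconstruction of the comparison, not the evaluation script behind the reported numbers; the helper names and the toy logits are hypothetical and stand in for the real last-position prefill logits of the reference and quantized models.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def compare_logits(reference, quantized):
    """Report logit agreement between a reference and a quantized model."""
    return {
        "cosine": cosine_similarity(reference, quantized),
        "argmax_match": argmax(reference) == argmax(quantized),
    }


# Toy logits over a four-token vocabulary (made-up values for illustration).
reference_logits = [2.0, -1.0, 0.5, 4.0]
quantized_logits = [1.9, -1.1, 0.6, 3.8]
report = compare_logits(reference_logits, quantized_logits)
```

A matching argmax means greedy decoding picks the same next token from the prompt, while the cosine value summarizes how closely the full logit distribution is preserved.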

License & credit

Apache-2.0, inherited from the base model. This repository provides only the MLX-quantized weights; all credit goes to the OpenMOSS Team.
