mlx-community/VibeThinker-3B-OptiQ-4bit

A 4-bit mixed-precision MLX quant produced by mlx-optiq, the sensitivity-aware quantization toolkit for Apple Silicon. It narrowly beats a stock uniform 4-bit quant on the six-metric Capability Score (57.29 vs 56.97) and stays markedly closer to the original bf16 model (mean KL 0.58 vs 1.44).

This build quantizes mlx-community/VibeThinker-3B, a reasoning model fine-tuned from Qwen2.5-Coder-3B. Per-layer bit-widths come from a KL-divergence sensitivity pass on a six-domain calibration mix (prose · reasoning · code · agent · tool-call · constraint-bearing instructions). Sensitive layers go to 8-bit; robust ones stay at 4-bit.
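The allocation rule described above can be sketched roughly as follows. This is a minimal illustration, not mlx-optiq's actual algorithm: the layer names, the sensitivity scores, and the fixed threshold are all hypothetical, and the real toolkit may pick bit-widths differently (for example, under a size budget).

```python
# Hypothetical per-layer sensitivity scores: the KL divergence from the bf16
# reference measured when only that layer is quantized to 4-bit.
# Names and values are made up for illustration.
sensitivity = {
    "layers.0.self_attn.q_proj": 0.031,
    "layers.0.mlp.down_proj": 0.004,
    "layers.1.self_attn.v_proj": 0.052,
    "layers.1.mlp.up_proj": 0.002,
}


def assign_bits(scores, threshold, low_bits=4, high_bits=8):
    """Give layers whose sensitivity exceeds `threshold` the higher bit-width."""
    return {
        name: high_bits if kl > threshold else low_bits
        for name, kl in scores.items()
    }


# The threshold is an assumed tuning knob, not a value from this model card.
bit_plan = assign_bits(sensitivity, threshold=0.01)
for name, bits in bit_plan.items():
    print(f"{name}: {bits}-bit")
```

A single threshold keeps the sketch short; a budget-driven variant would instead promote layers in order of sensitivity until a target size is reached.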

Quantization details

| Property | Value |
| --- | --- |
| Predominant precision | 4-bit |
| Layers at 8-bit (sensitive) | 141 |
| Layers at 4-bit (robust) | 111 |
| Total quantized layers | 252 |
| Achieved bits per weight | 5.12 |
| Group size | 64 |
| Calibration mix | six-domain mix (40 samples × 6 domains) |
| Reference for sensitivity | bf16 |

We follow the naming convention llama.cpp uses for Q4_K_M and similar mixed-precision quants: the "4-bit" label names the predominant precision, not the weighted average (5.12 bits per weight here). The sensitive layers held at 8-bit make this build 0.5 GB larger than a stock uniform 4-bit quant (2.1 GB vs 1.6 GB) while recovering quality the uniform quant loses.
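A parameter-weighted average shows how a build can have more 8-bit layers than 4-bit layers and still be mostly 4-bit by weight. The sketch below uses made-up layer sizes, not this model's real shapes, and assumes each group of 64 weights also stores a 16-bit scale and a 16-bit bias, which would add 0.5 bits per weight.

```python
GROUP_SIZE = 64
# Assumption: one 16-bit scale and one 16-bit bias per group of weights.
OVERHEAD_BITS = 2 * 16 / GROUP_SIZE  # 0.5 bits per weight


def average_bits_per_weight(layers):
    """Parameter-weighted mean of effective bits over (num_params, bits) pairs."""
    total_params = sum(n for n, _ in layers)
    total_bits = sum(n * (bits + OVERHEAD_BITS) for n, bits in layers)
    return total_bits / total_params


# Hypothetical mix: many small 8-bit layers, fewer but much larger 4-bit layers.
layers = [(1_000_000, 8)] * 141 + [(7_000_000, 4)] * 111
bpw = average_bits_per_weight(layers)
share_4bit = sum(n for n, b in layers if b == 4) / sum(n for n, _ in layers)
print(f"average bits per weight: {bpw:.2f}")
print(f"share of weights stored at 4-bit: {share_4bit:.0%}")
```

With these illustrative sizes most weights sit at 4-bit even though most layers are 8-bit, which is the situation the "predominant precision" label is meant to describe.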

Usage

Load it with mlx-lm and use it as usual:

pip install mlx-lm

from mlx_lm import load, generate

# Downloads the weights from the Hugging Face Hub on first use.
model, tokenizer = load("mlx-community/VibeThinker-3B-OptiQ-4bit")
response = generate(
    model, tokenizer,
    prompt="Explain quantum computing in simple terms.",
    max_tokens=512,
)
print(response)

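This is a reasoning model: it emits a <think>…</think> block before its final answer, so set max_tokens generously. Traces in the benchmarks below often run past 1,000 tokens, so a 512-token limit can cut a response off mid-trace.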
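If you only want the final answer, one option is to separate the trace from the rest of the output yourself. The helper below is a small sketch, not part of mlx-lm or mlx-optiq; `split_reasoning` is a hypothetical name, and it assumes both the opening and closing tags appear in the generated text.

```python
import re

# Matches one <think>...</think> block, including newlines inside the trace.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Return (trace, answer). trace is None if the text has no <think> tag."""
    match = THINK_BLOCK.search(text)
    if match is not None:
        answer = text[:match.start()] + text[match.end():]
        return match.group(1).strip(), answer.strip()
    if "<think>" in text:
        # Opening tag without a closing one: generation was probably cut off
        # mid-trace, so there is no final answer yet.
        return text.split("<think>", 1)[1].strip(), ""
    return None, text.strip()


sample = "<think>2 + 2 is 4.</think>\nThe answer is 4."
trace, answer = split_reasoning(sample)
print(answer)
```

An empty answer alongside a non-empty trace is a useful signal that max_tokens was too small for the prompt.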

For more (mixed-precision KV-cache serving, sensitivity-aware LoRA fine-tuning, an OpenAI- and Anthropic-compatible inference server with structured/JSON output, hot-swap mounted adapters), install mlx-optiq:

pip install mlx-optiq

Benchmarks

Scored with mlx-optiq's reasoning-aware evaluation (optiq eval --reasoning). The model emits its full <think> trace (budget 3072 tokens), and the trace is stripped before answer extraction. MMLU is scored generatively (the answer letter is parsed from the response) instead of by first-letter logit argmax, which collapses to chance for a model trained to reason before answering. The OptiQ quant and the stock uniform 4-bit baseline are scored identically, so the comparison is apples-to-apples.
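Generative scoring of a multiple-choice task might look something like the sketch below. It is an illustration under assumptions, not the optiq eval implementation: the function names are hypothetical, and taking the last standalone A-D letter as the answer is a deliberately simple rule that a real harness would likely tighten.

```python
import re


def extract_choice(response):
    """Return the last standalone A-D letter after the trace, or None."""
    # Drop the reasoning trace so letters mentioned while thinking are ignored.
    visible = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL)
    letters = re.findall(r"\b([A-D])\b", visible)
    return letters[-1] if letters else None


def accuracy(responses, gold):
    """Fraction of responses whose parsed letter matches the gold label."""
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)


responses = [
    "<think>Could be A or B... B fits.</think>The answer is B.",
    "<think>Hmm, D?</think>Answer: C",
    "<think>unsure</think>I cannot decide.",
]
gold = ["B", "C", "A"]
print(f"accuracy: {accuracy(responses, gold):.2f}")
```

Unparseable responses count as wrong here, so a truncated trace lowers the score instead of being silently skipped.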

| Metric | OptiQ | Uniform 4-bit | Δ |
| --- | --- | --- | --- |
| MMLU (generative, 250 samples) | 74.1% | 69.7% | +4.4 |
| GSM8K (250 samples, CoT) | 90.0% | 90.8% | -0.8 |
| IFEval (200 prompts, strict) | 55.0% | 54.0% | +1.0 |
| BFCL-V3 simple (200 calls) | 35.0% | 42.5% | -7.5 |
| HumanEval (164 problems, pass@1) | 89.6% | 84.8% | +4.9 |
| HashHop (60 instances, long-context) | 0.0% | 0.0% | +0.0 |
| Capability Score (mean of 6) | 57.29 | 56.97 | +0.33 |
| KL vs bf16 (mean / p95) | 0.5841 / 2.0397 | 1.4408 / 4.4788 | |
| On-disk size | 2.1 GB | 1.6 GB | +0.5 GB |
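For readers unfamiliar with the KL row, the sketch below shows one way a mean and p95 of per-token KL divergence could be computed. The card does not say how mlx-optiq does it, so the direction KL(reference || quantized), the use of nats, the nearest-rank percentile, and the toy distributions are all assumptions.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mean_and_p95(values):
    """Mean and nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))  # 1-based nearest-rank index
    return sum(ordered) / len(ordered), ordered[rank - 1]


# Toy next-token distributions at three positions (reference vs quantized).
reference = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1], [0.9, 0.05, 0.05]]
quantized = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.5, 0.3, 0.2]]

per_token = [kl_divergence(p, q) for p, q in zip(reference, quantized)]
mean_kl, p95_kl = mean_and_p95(per_token)
print(f"mean KL: {mean_kl:.4f}, p95 KL: {p95_kl:.4f}")
```

The mean summarizes typical drift from the reference, while the p95 highlights the positions where the quantized model diverges most.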

VibeThinker is a math/code reasoning specialist: strong on GSM8K and HumanEval, solid on MMLU. Its tool-calling is inconsistent (BFCL 35–42%) because it often reasons in prose instead of emitting a clean call. It has no long-context retrieval ability: HashHop is 0% for both quants, since the model echoes the context instead of resolving the chain. Inspecting the outputs confirms this is a model-capability gap, not a quantization artifact. The two quants trade the smaller tasks within noise (BFCL's ±7pp CI at n=200 roughly spans the 7.5pp gap there). OptiQ's clear wins are MMLU (+4.4) and HumanEval (+4.9), and above all fidelity to the bf16 original: mean KL 0.58 vs 1.44 (~2.5× closer).
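The ±7pp figure for BFCL is consistent with a standard binomial interval. The sketch below assumes a 95% normal-approximation (Wald) interval, which the card does not state explicitly, and plugs in the two BFCL scores from the table.

```python
import math


def ci_half_width(p, n, z=1.96):
    """Half-width of the normal-approximation (Wald) interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)


# BFCL-V3 simple: both scores from the table, n = 200 calls each.
for label, score in [("OptiQ", 0.350), ("Uniform 4-bit", 0.425)]:
    hw = ci_half_width(score, 200)
    print(f"{label}: {score:.1%} +/- {hw * 100:.1f} pp")
```

Because the half-width shrinks only with the square root of n, halving it would take roughly four times as many calls.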

Sample sizes are reduced from the 1000-sample default because each reasoning trace is ~1000+ tokens (≈50× slower per question); they're sized for the per-task deltas shown. Every metric gets one equal vote in the Capability Score; disk size is reported as a separate axis and is not folded into the score. See the eval-framework writeup.
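The equal-vote rule is a plain arithmetic mean, which can be checked against the table. The snippet below uses the rounded percentages shown above, so the result can differ from the published score in the second decimal place.

```python
# Rounded per-task scores from the benchmark table, in table order:
# MMLU, GSM8K, IFEval, BFCL-V3 simple, HumanEval, HashHop.
scores = {
    "OptiQ": [74.1, 90.0, 55.0, 35.0, 89.6, 0.0],
    "Uniform 4-bit": [69.7, 90.8, 54.0, 42.5, 84.8, 0.0],
}


def capability_score(metrics):
    """Unweighted mean: every metric gets one equal vote."""
    return sum(metrics) / len(metrics)


capability = {name: capability_score(vals) for name, vals in scores.items()}
for name, value in capability.items():
    print(f"{name}: {value:.2f}")
```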

License

MIT (inherits from the base model).
