# Qwen3.5-2B-OptiQ-4bit

*Mixed-precision quantized with OptiQ: sensitivity-driven quantization for Apple Silicon*

This is a mixed-precision quantized version of Qwen/Qwen3.5-2B in MLX format. Unlike uniform quantization (all layers at the same bit-width), OptiQ measures each layer's sensitivity and assigns optimal per-layer bit-widths, preserving model quality where it matters most.

## How OptiQ Works

OptiQ is an optimizing compiler that converts PyTorch models into hardware-optimized MLX versions using data-driven mixed-precision quantization:

1. **Sensitivity Analysis**: For each layer, OptiQ simulates quantization at each candidate bit-width and measures the KL divergence between the original and quantized output distributions. Layers that distort the distribution more are "sensitive."
2. **Greedy Knapsack Optimization**: Starting with all layers at the minimum bit-width, OptiQ greedily upgrades the most sensitive layers to higher precision until the target bits-per-weight budget is exhausted.
3. **Per-Layer Bit Allocation**: The result is a custom quantization config in which each layer gets the bit-width that maximizes quality within the size budget. Protected layers (embeddings, final layers) are always assigned high precision.
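The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not OptiQ's actual implementation: it collapses sensitivity to a single KL score per layer and considers only two candidate bit-widths, whereas OptiQ scores every candidate separately; the function names are made up for this example.

```python
from math import log

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) between original and quantized output distributions."""
    return sum(pi * log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def allocate_bits(sensitivity, sizes, candidates, target_bpw):
    """Greedy knapsack over two candidate bit-widths.

    sensitivity: layer name -> KL score at the floor bit-width (higher = more sensitive)
    sizes:       layer name -> parameter count
    Returns per-layer bits and the achieved bits-per-weight.
    """
    lo, hi = min(candidates), max(candidates)
    bits = {name: lo for name in sizes}          # start every layer at the floor
    total = sum(sizes.values())
    used, budget = lo * total, target_bpw * total
    # Upgrade the most sensitive layers first while the bit budget lasts.
    for name in sorted(sizes, key=lambda n: -sensitivity[n]):
        cost = (hi - lo) * sizes[name]
        if used + cost <= budget:
            bits[name] = hi
            used += cost
    return bits, used / total
```

With four equal-sized layers, candidates [4, 8], and a 5.0 BPW target, only the single most sensitive layer fits at 8-bit; the rest stay at the 4-bit floor.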

## Quantization Details

| Property | Value |
|---|---|
| Target BPW | 4.5 |
| Achieved BPW | 4.50 |
| Candidate bits | 4, 8 |
| Layers at 4-bit | 116 |
| Layers at 8-bit | 71 |
| Total quantized layers | 187 |
| Group size | 64 |
| Model size | 1365 MB |
| Uniform 4-bit size | 1010 MB |
| Calibration data | WikiText-2 (2 samples, 128 tokens) |
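Achieved BPW is the parameter-weighted average of the per-layer bit-widths. As a sanity check (using a hypothetical parameter split, since per-layer sizes aren't listed above), a 4.50 BPW result with candidates 4 and 8 implies the 8-bit layers hold about 12.5% of the quantized weights:

```python
def achieved_bpw(layers):
    """Parameter-weighted average bit-width: sum(bits * params) / sum(params).

    layers: iterable of (bits, param_count) pairs.
    """
    total_bits = sum(b * p for b, p in layers)
    total_params = sum(p for _, p in layers)
    return total_bits / total_params

# Hypothetical split: 87.5% of weights at 4-bit, 12.5% at 8-bit.
print(achieved_bpw([(4, 875), (8, 125)]))  # 4.5
```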

## Benchmark Results

GSM8K (200 samples, 3-shot chain-of-thought):

| Model | Candidates | GSM8K Accuracy | Size |
|---|---|---|---|
| OptiQ mixed (4.5 BPW) | 4, 8 | 48.0% | 1365 MB |
| Uniform 4-bit | — | 48.5% | 1010 MB |
| OptiQ 3.5 BPW | 3, 4, 8 | 14.0% | 1141 MB |
| Uniform 3-bit | — | 6.0% | 786 MB |
| OptiQ 3.0 BPW | 3, 4, 8 | 6.0% | 786 MB |
| OptiQ 3.0 BPW | 2, 4 | 2.0% | 786 MB |
| Uniform 2-bit | — | 0.5% | 562 MB |

## Key Findings

At the 2B scale there is a sharp quality cliff between 4-bit and 3-bit: accuracy drops from ~48% to single digits. This differs from the 0.8B model, where degradation is more gradual and OptiQ's mixed precision has room to recover quality.

- **At 4-bit (4.5 BPW):** The model is robust. Uniform and mixed precision perform equivalently; OptiQ provides a safety margin for sensitive layers at no measurable cost.
- **At 3.5 BPW (candidates 3, 4, 8):** OptiQ achieves 14.0%, 2.3x better than uniform 3-bit (6.0%), by keeping sensitive layers at 4- or 8-bit while pushing only the least sensitive down to 3-bit.
- **At 3.0 BPW:** The floor bit-width matters enormously. With a 3-bit floor (candidates 3, 4, 8), OptiQ matches uniform 3-bit (6.0%). With a 2-bit floor (candidates 2, 4), quality collapses to 2.0%; even mixed precision can't save layers quantized to 2-bit at this scale.
- **Uniform 2-bit** is essentially random (0.5%).
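A little arithmetic makes the floor effect concrete. Given a floor bit-width and a single upgrade width, the BPW budget caps the fraction of weights that can leave the floor (`max_upgrade_fraction` is an illustrative helper, not part of OptiQ):

```python
def max_upgrade_fraction(floor_bits, upgrade_bits, target_bpw):
    """Largest fraction f of weights that can sit at `upgrade_bits` while the
    rest stay at `floor_bits`, within the BPW budget:
        f * upgrade + (1 - f) * floor <= target
        =>  f <= (target - floor) / (upgrade - floor)
    """
    return max(0.0, (target_bpw - floor_bits) / (upgrade_bits - floor_bits))

print(max_upgrade_fraction(3, 4, 3.0))  # 0.0: nothing can leave the 3-bit floor
print(max_upgrade_fraction(2, 4, 3.0))  # 0.5: half the weights stay stuck at 2-bit
```

At a 3.0 BPW budget with a 3-bit floor there is zero slack to upgrade anything, so mixed precision degenerates to uniform 3-bit; with a 2-bit floor, half the weights must stay at 2-bit no matter how the sensitive layers are protected.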

**Recommendation:** Use this 4.5 BPW model for the best quality-size tradeoff at the 2B scale. For smaller models where mixed precision shows dramatic benefits, see Qwen3.5-0.8B-OptiQ-4bit, where OptiQ more than doubles uniform 4-bit accuracy (27% vs 11.5%).

## Usage

This model works with standard mlx-lm; no special code is needed:

```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3.5-2B-OptiQ-4bit")

prompt = "Explain quantum computing in simple terms:"
response = generate(model, tokenizer, prompt=prompt, max_tokens=200)
print(response)
```

**Requirements:** mlx-lm >= 0.30.7 (for Qwen3.5 architecture support)

```shell
pip install "mlx-lm>=0.30.7"
```

## Architecture

Qwen3.5 uses a hybrid attention architecture with alternating linear_attn and self_attn layers. OptiQ's sensitivity analysis identifies which layers are most sensitive to quantization error and assigns them higher precision, while less sensitive layers get 4-bit quantization to minimize model size.
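Mixed-precision MLX checkpoints typically record per-layer bit-width overrides in the `quantization` section of `config.json`. The excerpt below is a hypothetical illustration of that shape; the layer names and specific overrides are made up, not read from this model's actual config:

```json
{
  "quantization": {
    "group_size": 64,
    "bits": 4,
    "model.layers.0.self_attn.q_proj": {"group_size": 64, "bits": 8},
    "model.layers.1.linear_attn.in_proj": {"group_size": 64, "bits": 8}
  }
}
```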

## Article

For more details on the methodology and results, see: Not All Layers Are Equal: Mixed-Precision Quantization for Weights and KV Cache on Apple Silicon

## Credits

- **Quantization method:** OptiQ, an optimizing compiler for mixed-precision quantization on Apple Silicon
- **Base model:** Qwen/Qwen3.5-2B by the Qwen Team
- **Runtime:** MLX by Apple