# Qwen3.5-2B-OptiQ-4bit

*Mixed-precision quantized with OptiQ — sensitivity-driven quantization for Apple Silicon*
This is a mixed-precision quantized version of Qwen/Qwen3.5-2B in MLX format. Unlike uniform quantization (all layers at the same bit-width), OptiQ measures each layer's sensitivity and assigns optimal per-layer bit-widths, preserving model quality where it matters most.
## How OptiQ Works
OptiQ is an optimizing compiler that converts PyTorch models into hardware-optimized MLX versions using data-driven mixed-precision quantization:
- Sensitivity Analysis — For each layer, OptiQ simulates quantization at each candidate bit-width and measures the KL divergence between the original and quantized output distributions. Layers that distort the distribution more are "sensitive."
- Greedy Knapsack Optimization — Starting with all layers at the minimum bit-width, OptiQ greedily upgrades the most sensitive layers to higher precision until the target bits-per-weight budget is exhausted.
- Per-Layer Bit Allocation — The result is a custom quantization config where each layer gets the bit-width that maximizes quality within the size budget. Protected layers (embeddings, final layers) are always assigned high precision.
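The greedy knapsack step above can be sketched in a few lines. This is a toy illustration, not OptiQ's actual API: the function name, signature, and the two-candidate simplification are all assumptions made for clarity.

```python
def allocate_bits(sensitivity, n_params, candidates=(4, 8), target_bpw=4.5):
    """Toy sketch of the greedy knapsack step described above.

    sensitivity: {layer_name: KL divergence measured at the floor bit-width}
    n_params:    {layer_name: weight count for that layer}
    Names and signature are illustrative, not OptiQ's real interface.
    """
    lo, hi = min(candidates), max(candidates)
    total = sum(n_params.values())
    bits = {name: lo for name in sensitivity}   # start every layer at the floor
    spent = lo * total                          # bits consumed so far
    budget = target_bpw * total                 # total bit budget
    # Upgrade the most sensitive layers first until the budget is exhausted.
    for name in sorted(sensitivity, key=sensitivity.get, reverse=True):
        cost = (hi - lo) * n_params[name]
        if spent + cost <= budget:
            bits[name] = hi
            spent += cost
    return bits
```

With only two candidate bit-widths this is a single pass; with three (e.g. 3/4/8) the same idea applies, upgrading one step at a time.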
## Quantization Details
| Property | Value |
|---|---|
| Target BPW | 4.5 |
| Achieved BPW | 4.50 |
| Candidate bits | 4, 8 |
| Layers at 4-bit | 116 |
| Layers at 8-bit | 71 |
| Total quantized layers | 187 |
| Group size | 64 |
| Model size | 1365 MB |
| Uniform 4-bit size | 1010 MB |
| Calibration data | WikiText-2 (2 samples, 128 tokens) |
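A back-of-the-envelope check connects the table's BPW and size figures. The sketch below assumes roughly 2.0B quantized weights and an fp16 scale plus bias stored per 64-weight group, and ignores embeddings and any layers kept unquantized, so it gives a lower bound rather than the exact 1365 MB.

```python
# Rough size estimate from the table above (assumptions: ~2.0B quantized
# weights, fp16 scale + bias per 64-weight group; unquantized layers and
# embeddings are ignored, so this is a lower bound).
n_weights = 2.0e9
bpw = 4.5
group_size = 64

weight_bytes = n_weights * bpw / 8                 # packed quantized weights
meta_bytes = (n_weights / group_size) * 2 * 2      # fp16 scale + fp16 bias
total_mb = (weight_bytes + meta_bytes) / 1e6
print(f"~{total_mb:.0f} MB")                       # same ballpark as the table
```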
## Benchmark Results
GSM8K (200 samples, 3-shot chain-of-thought):
| Model | Candidates | GSM8K Accuracy | Size |
|---|---|---|---|
| OptiQ mixed (4.5 BPW) | 4, 8 | 48.0% | 1365 MB |
| Uniform 4-bit | — | 48.5% | 1010 MB |
| OptiQ 3.5 BPW | 3, 4, 8 | 14.0% | 1141 MB |
| Uniform 3-bit | — | 6.0% | 786 MB |
| OptiQ 3.0 BPW | 3, 4, 8 | 6.0% | 786 MB |
| OptiQ 3.0 BPW | 2, 4 | 2.0% | 786 MB |
| Uniform 2-bit | — | 0.5% | 562 MB |
## Key Findings
At the 2B scale, there is a sharp quality cliff between 4-bit and 3-bit — the model drops from ~48% to single digits. This differs from the 0.8B model where the degradation is more gradual and OptiQ's mixed-precision has room to recover quality.
- At 4-bit (4.5 BPW): The model is robust. Uniform and mixed-precision perform equivalently — OptiQ provides a safety margin for sensitive layers without measurable cost.
- At 3.5 BPW [3,4,8]: OptiQ achieves 14.0% — 2.3x better than uniform 3-bit (6.0%) — by keeping sensitive layers at 4 or 8-bit while only pushing the least sensitive to 3-bit.
- At 3.0 BPW: The floor bit-width matters enormously. With a 3-bit floor [3,4,8], OptiQ matches uniform 3-bit (6.0%). With a 2-bit floor [2,4], quality collapses to 2.0% — even mixed-precision can't save layers quantized to 2-bit at this scale.
- Uniform 2-bit is essentially random (0.5%).
Recommendation: Use this 4.5 BPW model for the best quality-size tradeoff at 2B scale. For smaller models where mixed-precision shows dramatic benefits, see Qwen3.5-0.8B-OptiQ-4bit where OptiQ more than doubles uniform 4-bit accuracy (27% vs 11.5%).
## Usage
This model works with standard mlx-lm — no special code needed:
```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3.5-2B-OptiQ-4bit")

prompt = "Explain quantum computing in simple terms:"
response = generate(model, tokenizer, prompt=prompt, max_tokens=200)
print(response)
```
**Requirements:** `mlx-lm >= 0.30.7` (for Qwen3.5 architecture support)

```shell
pip install "mlx-lm>=0.30.7"
```

(The quotes keep the shell from interpreting `>=` as a redirection.)
## Architecture
Qwen3.5 uses a hybrid attention architecture with alternating linear_attn and self_attn layers. OptiQ's sensitivity analysis identifies which layers are most sensitive to quantization error and assigns them higher precision, while less sensitive layers get 4-bit quantization to minimize model size.
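Because the per-layer assignment ships with the model, you can count the bit-width mix yourself. A minimal sketch, assuming mlx-lm's convention of per-layer override dicts under the `quantization` key in `config.json` — exact key names may vary between mlx-lm versions, and the example config below is purely illustrative:

```python
from collections import Counter

def bits_histogram(cfg):
    """Count how many quantized layers use each bit-width.
    Assumes per-layer override dicts inside the "quantization" section
    of an mlx-lm config.json; key names may differ by version.
    """
    quant = cfg.get("quantization", {})
    default = quant.get("bits")
    return Counter(v.get("bits", default)
                   for v in quant.values() if isinstance(v, dict))

# Toy config shaped like a mixed-precision mlx-lm export (illustrative only):
example = {"quantization": {
    "group_size": 64, "bits": 4,
    "model.layers.0.self_attn.q_proj": {"group_size": 64, "bits": 8},
    "model.layers.1.mlp.down_proj": {"group_size": 64, "bits": 4},
}}
print(bits_histogram(example))
```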
## Article
For more details on the methodology and results, see: Not All Layers Are Equal: Mixed-Precision Quantization for Weights and KV Cache on Apple Silicon
## Credits
- Quantization method: OptiQ — optimizing compiler for mixed-precision quantization on Apple Silicon
- Base model: Qwen/Qwen3.5-2B by Qwen Team
- Runtime: MLX by Apple