# Qwen3.5-0.8B-OptiQ-4bit

*Mixed-precision quantized with OptiQ — sensitivity-driven quantization for Apple Silicon*
This is a mixed-precision quantized version of Qwen/Qwen3.5-0.8B in MLX format. Unlike uniform quantization (all layers at the same bit-width), OptiQ measures each layer's sensitivity and assigns optimal per-layer bit-widths, preserving model quality where it matters most.
## How OptiQ Works
OptiQ is an optimizing compiler that converts PyTorch models into hardware-optimized MLX versions using data-driven mixed-precision quantization:
- **Sensitivity Analysis** — For each layer, OptiQ simulates quantization at each candidate bit-width and measures the KL divergence between the original and quantized output distributions. Layers that distort the distribution more are "sensitive."
- **Greedy Knapsack Optimization** — Starting with all layers at the minimum bit-width, OptiQ greedily upgrades the most sensitive layers to higher precision until the target bits-per-weight budget is exhausted.
- **Per-Layer Bit Allocation** — The result is a custom quantization config where each layer gets the bit-width that maximizes quality within the size budget. Protected layers (embeddings, final layers) are always assigned high precision.
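The greedy knapsack step above can be sketched in a few lines. This is a hypothetical illustration, not OptiQ's actual code: the `allocate_bits` function, the equal-size toy layers, and the simple BPW accounting are all assumptions.

```python
# Hypothetical sketch of OptiQ-style greedy bit allocation.
# Assumes each layer's sensitivity (e.g. KL divergence when quantized
# at the low bit-width) and parameter count are already measured.

def allocate_bits(sensitivities, param_counts, target_bpw,
                  low_bits=4, high_bits=8):
    """Greedily upgrade the most sensitive layers from low_bits to
    high_bits until the bits-per-weight budget is exhausted."""
    total_params = sum(param_counts)
    bits = [low_bits] * len(sensitivities)
    used = low_bits * total_params  # start with every layer at the minimum

    # Visit layers from most to least sensitive.
    order = sorted(range(len(sensitivities)),
                   key=lambda i: sensitivities[i], reverse=True)
    for i in order:
        cost = (high_bits - low_bits) * param_counts[i]
        if (used + cost) / total_params <= target_bpw:
            bits[i] = high_bits
            used += cost
    return bits

# Toy example: 4 equal-size layers, 5.0 BPW budget -> only the most
# sensitive layer fits at 8-bit.
bits = allocate_bits([0.9, 0.1, 0.5, 0.05], [100, 100, 100, 100], 5.0)
print(bits)  # [8, 4, 4, 4]
```

In a real model the layers differ in size, so a cheap-but-sensitive layer can be upgraded "for free" while a huge insensitive one stays at 4-bit — that is what makes the mixed allocation beat uniform quantization at a similar budget.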
## Quantization Details
| Property | Value |
|---|---|
| Target BPW | 4.5 |
| Achieved BPW | 4.50 |
| Candidate bits | 4, 8 |
| Layers at 4-bit | 111 |
| Layers at 8-bit | 76 |
| Total quantized layers | 187 |
| Group size | 64 |
| Model size | 570 MB |
| Uniform 4-bit size | 404 MB |
| Calibration data | WikiText-2 (2 samples, 128 tokens) |
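As a rough sanity check on the table, the raw weight payload at the achieved BPW can be estimated from the parameter count (assuming ~0.8B quantized parameters; the on-disk size also carries quantization scales/biases and any tensors kept unquantized):

```python
# Back-of-the-envelope size estimate; ~0.8B parameters is an
# assumption taken from the model name, not an exact count.
params = 0.8e9
bpw = 4.5
est_mb = params * bpw / 8 / 1e6  # bits -> bytes -> MB
print(f"~{est_mb:.0f} MB of raw quantized weights")
```

This lands in the right ballpark of the reported 570 MB, with the remainder attributable to per-group scales and other metadata.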
## Benchmark Results
GSM8K (200 samples, 3-shot chain-of-thought):
| Model | GSM8K Accuracy |
|---|---|
| OptiQ mixed (4.5 BPW) | 27.0% |
| Uniform 4-bit | 11.5% |
OptiQ more than doubles the accuracy of uniform 4-bit quantization (+15.5 percentage points, 2.3x improvement) at a modest size increase.
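The headline figures follow directly from the table:

```python
# Reproduce the reported improvement from the two accuracies.
optiq, uniform = 27.0, 11.5
delta_pp = optiq - uniform   # percentage points
ratio = optiq / uniform      # relative improvement
print(f"+{delta_pp:.1f} pp, {ratio:.1f}x")  # +15.5 pp, 2.3x
```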
## Usage
This model works with standard mlx-lm — no special code needed:
```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3.5-0.8B-OptiQ-4bit")

prompt = "Explain quantum computing in simple terms:"
response = generate(model, tokenizer, prompt=prompt, max_tokens=200)
print(response)
```
**Requirements:** `mlx-lm >= 0.30.7` (for Qwen3.5 architecture support)

```bash
pip install "mlx-lm>=0.30.7"
```

(The quotes keep the shell from treating `>=` as a redirection.)
## Architecture
Qwen3.5-0.8B uses a hybrid attention architecture with alternating linear_attn and self_attn layers across 24 transformer blocks. OptiQ's sensitivity analysis found that early layers (blocks 0-2), self-attention K/V projections, and the final block are the most sensitive — these receive 8-bit precision while less sensitive MLP and attention layers are quantized to 4-bit.
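One way to see which layers landed at which precision is to tally the per-layer entries in the model's quantization config. MLX quantized models typically record per-layer overrides in `config.json` under a `quantization` key; the exact schema shown here is an illustrative assumption, not confirmed for this model, and the sample dict stands in for the real file:

```python
# Hypothetical sketch: counting per-layer bit-widths from a
# quantization config. The key names in sample_config are
# illustrative; check the actual config.json of the model.
from collections import Counter

sample_config = {
    "quantization": {
        "group_size": 64,   # global defaults, not per-layer entries
        "bits": 4,
        "model.layers.0.self_attn.k_proj": {"bits": 8},
        "model.layers.0.self_attn.v_proj": {"bits": 8},
        "model.layers.5.mlp.gate_proj": {"bits": 4},
    }
}

def bit_histogram(config):
    """Tally how many layers were assigned each bit-width."""
    hist = Counter()
    for value in config["quantization"].values():
        if isinstance(value, dict) and "bits" in value:
            hist[value["bits"]] += 1
    return dict(hist)

print(bit_histogram(sample_config))  # {8: 2, 4: 1}
```

Run against the real config, a tally like this should reproduce the 111/76 split reported in the table above.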
## Article

For more details on the methodology and results, see: *Not All Layers Are Equal: Mixed-Precision Quantization for Weights and KV Cache on Apple Silicon*
## Credits
- Quantization method: OptiQ — optimizing compiler for mixed-precision quantization on Apple Silicon
- Base model: Qwen/Qwen3.5-0.8B by Qwen Team
- Runtime: MLX by Apple