Qwen3.5-0.8B-OptiQ-4bit

Mixed-precision quantized with OptiQ — sensitivity-driven quantization for Apple Silicon

This is a mixed-precision quantized version of Qwen/Qwen3.5-0.8B in MLX format. Unlike uniform quantization (all layers at the same bit-width), OptiQ measures each layer's sensitivity and assigns optimal per-layer bit-widths, preserving model quality where it matters most.

How OptiQ Works

OptiQ is an optimizing compiler that converts PyTorch models into hardware-optimized MLX versions using data-driven mixed-precision quantization:

  1. Sensitivity Analysis — For each layer, OptiQ simulates quantization at each candidate bit-width and measures the KL divergence between the original and quantized output distributions. Layers that distort the distribution more are "sensitive."
  2. Greedy Knapsack Optimization — Starting with all layers at the minimum bit-width, OptiQ greedily upgrades the most sensitive layers to higher precision until the target bits-per-weight budget is exhausted.
  3. Per-Layer Bit Allocation — The result is a custom quantization config where each layer gets the bit-width that maximizes quality within the size budget. Protected layers (embeddings, final layers) are always assigned high precision.
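The three steps above can be sketched in a few lines. This is a toy illustration with made-up layer names, sizes, and sensitivity values — not OptiQ's actual implementation — showing the greedy knapsack idea: start everything at the minimum bit-width, then spend the remaining budget on the most sensitive layers first.

```python
# Toy sketch of sensitivity-driven greedy bit allocation (hypothetical
# names and numbers; OptiQ's real code is not reproduced here).

def allocate_bits(layer_params, sensitivity, target_bpw, low=4, high=8):
    """Assign per-layer bit-widths under a bits-per-weight budget.

    layer_params: dict of layer name -> parameter count
    sensitivity:  dict of layer name -> KL divergence measured when the
                  layer is quantized to the low bit-width (higher = more
                  distortion, i.e. more sensitive)
    """
    total_params = sum(layer_params.values())
    budget_bits = target_bpw * total_params

    # Step 2: start all layers at the minimum bit-width.
    bits = {name: low for name in layer_params}
    used_bits = low * total_params

    # Upgrade the most sensitive layers first until the budget runs out.
    # (A real implementation would also force protected layers, such as
    # embeddings and the final block, to high precision up front.)
    for name in sorted(sensitivity, key=sensitivity.get, reverse=True):
        upgrade_cost = (high - low) * layer_params[name]
        if used_bits + upgrade_cost <= budget_bits:
            bits[name] = high
            used_bits += upgrade_cost
    return bits

# Example with invented sensitivities: attention projections are small
# but sensitive, MLP projections are large but tolerant.
params = {"layers.0.self_attn.q_proj": 1000, "layers.0.mlp.gate_proj": 4000,
          "layers.1.self_attn.q_proj": 1000, "layers.1.mlp.gate_proj": 4000}
sens = {"layers.0.self_attn.q_proj": 0.9, "layers.0.mlp.gate_proj": 0.1,
        "layers.1.self_attn.q_proj": 0.5, "layers.1.mlp.gate_proj": 0.05}
bits = allocate_bits(params, sens, target_bpw=4.5)
```

In this toy run, only the most sensitive layer fits within the 4.5 BPW budget at 8-bit; the rest stay at 4-bit, landing the achieved BPW at or under the target.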

Quantization Details

| Property | Value |
|---|---|
| Target BPW | 4.5 |
| Achieved BPW | 4.50 |
| Candidate bits | 4, 8 |
| Layers at 4-bit | 111 |
| Layers at 8-bit | 76 |
| Total quantized layers | 187 |
| Group size | 64 |
| Model size | 570 MB |
| Uniform 4-bit size | 404 MB |
| Calibration data | WikiText-2 (2 samples, 128 tokens) |

Benchmark Results

GSM8K (200 samples, 3-shot chain-of-thought):

| Model | GSM8K Accuracy |
|---|---|
| OptiQ mixed (4.5 BPW) | 27.0% |
| Uniform 4-bit | 11.5% |

OptiQ more than doubles the accuracy of uniform 4-bit quantization (+15.5 percentage points, a 2.3x improvement) in exchange for a ~41% larger model (570 MB vs. 404 MB).
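As a quick sanity check, the gain figures follow directly from the two accuracies in the table:

```python
optiq, uniform = 27.0, 11.5   # GSM8K accuracy (%) from the table above

gain_pp = optiq - uniform     # absolute gain in percentage points
ratio = optiq / uniform       # relative improvement
print(f"+{gain_pp} pp, {ratio:.1f}x")  # +15.5 pp, 2.3x
```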

Usage

This model works with standard mlx-lm — no special code needed:

```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3.5-0.8B-OptiQ-4bit")

prompt = "Explain quantum computing in simple terms:"
response = generate(model, tokenizer, prompt=prompt, max_tokens=200)
print(response)
```

Requirements: mlx-lm >= 0.30.7 (for Qwen3.5 architecture support)

```shell
pip install "mlx-lm>=0.30.7"
```

(The quotes matter: an unquoted `>=` is interpreted by the shell as output redirection.)

Architecture

Qwen3.5-0.8B uses a hybrid attention architecture with alternating linear_attn and self_attn layers across 24 transformer blocks. OptiQ's sensitivity analysis found that early layers (blocks 0-2), self-attention K/V projections, and the final block are the most sensitive — these receive 8-bit precision while less sensitive MLP and attention layers are quantized to 4-bit.
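To see which layers ended up at which precision, the per-layer assignment can be read from the model's `config.json`. The snippet below is a hedged sketch: it assumes the mlx-lm convention of a `quantization` dict holding a default `bits`/`group_size` plus per-module overrides keyed by layer path, and the config fragment shown is illustrative, not copied from this model — check the shipped file for the exact keys.

```python
import json
from collections import Counter

# Illustrative fragment of an MLX quantization config (assumed structure).
# In practice: config = json.load(open("config.json"))
config = {
    "quantization": {
        "group_size": 64,
        "bits": 4,  # default for layers without an override
        "model.layers.0.self_attn.k_proj": {"group_size": 64, "bits": 8},
        "model.layers.0.self_attn.v_proj": {"group_size": 64, "bits": 8},
    }
}

quant = config["quantization"]
# Per-module overrides are the dict-valued entries; scalars are defaults.
overrides = {k: v["bits"] for k, v in quant.items() if isinstance(v, dict)}
counts = Counter(overrides.values())
print(f"default: {quant['bits']}-bit, overrides: {dict(counts)}")
```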

Article

For more details on the methodology and results, see: Not All Layers Are Equal: Mixed-Precision Quantization for Weights and KV Cache on Apple Silicon

Credits

  • Quantization method: OptiQ — optimizing compiler for mixed-precision quantization on Apple Silicon
  • Base model: Qwen/Qwen3.5-0.8B by Qwen Team
  • Runtime: MLX by Apple