Qwen3.5-9B-OptiQ-4bit

Mixed-precision quantized with OptiQ — sensitivity-driven quantization for Apple Silicon

This is a mixed-precision quantized version of Qwen/Qwen3.5-9B in MLX format. Unlike uniform quantization (all layers at the same bit-width), OptiQ measures each layer's sensitivity and assigns optimal per-layer bit-widths, preserving model quality where it matters most.

How OptiQ Works

OptiQ is an optimizing compiler that converts PyTorch models into hardware-optimized MLX versions using data-driven mixed-precision quantization:

  1. Sensitivity Analysis — For each layer, OptiQ simulates quantization at each candidate bit-width and measures the KL divergence between the original and quantized output distributions. Layers that distort the distribution more are "sensitive."
  2. Greedy Knapsack Optimization — Starting with all layers at the minimum bit-width, OptiQ greedily upgrades the most sensitive layers to higher precision until the target bits-per-weight budget is exhausted.
  3. Per-Layer Bit Allocation — The result is a custom quantization config where each layer gets the bit-width that maximizes quality within the size budget. Protected layers (embeddings, final layers) are always assigned high precision.
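The greedy allocation loop in steps 2–3 can be sketched in Python. The function and variable names here are illustrative, not OptiQ's actual internals: every layer starts at the minimum bit-width, and the most sensitive layers are upgraded while the bits-per-weight budget allows.

```python
def allocate_bits(sensitivity, num_params, candidate_bits, target_bpw):
    """Greedy knapsack sketch: upgrade the most sensitive layers first.

    sensitivity[i]  -- KL-divergence cost of quantizing layer i at low bits
    num_params[i]   -- parameter count of layer i
    candidate_bits  -- candidate bit-widths, e.g. [4, 8]
    target_bpw      -- size budget in bits per weight
    """
    lo, hi = min(candidate_bits), max(candidate_bits)
    bits = [lo] * len(num_params)          # start everything at the minimum
    total_params = sum(num_params)
    budget = target_bpw * total_params     # total bit budget
    used = lo * total_params

    # Upgrade layers in order of decreasing sensitivity while budget remains.
    for i in sorted(range(len(num_params)), key=lambda i: -sensitivity[i]):
        extra = (hi - lo) * num_params[i]  # cost of upgrading layer i to 8-bit
        if used + extra <= budget:
            bits[i] = hi
            used += extra
    return bits

# Example: three layers, the small middle layer most sensitive, 4.5 BPW budget.
# Only the cheap, sensitive layer fits in the budget at 8-bit.
print(allocate_bits([0.1, 0.9, 0.2], [400, 50, 400], [4, 8], 4.5))
# → [4, 8, 4]
```

A real run would also respect the protected layers (embeddings, final layers) by pinning them to high precision before the greedy pass begins.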

Quantization Details

Property                Value
Target BPW              4.5
Achieved BPW            4.50
Candidate bits          4, 8
Layers at 4-bit         157
Layers at 8-bit         92
Total quantized layers  249
Group size              64
Model size              5763 MB
Uniform 4-bit size      4805 MB
Calibration data        WikiText-2 (2 samples, 128 tokens)
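Note that achieved BPW is averaged over parameters, not layers: with candidates of 4 and 8 bits, a 4.5 BPW average implies roughly one eighth of the quantized weights sit in 8-bit layers, even though those layers are 92 of 249 by count. A minimal sketch of the arithmetic (the layer sizes below are illustrative, not the actual Qwen3.5-9B shapes):

```python
def achieved_bpw(layers):
    """Parameter-weighted average bit-width.

    layers: list of (num_params, bits) pairs, one per quantized layer.
    """
    total_bits = sum(n * b for n, b in layers)
    total_params = sum(n for n, _ in layers)
    return total_bits / total_params

# 7/8 of the weights at 4-bit and 1/8 at 8-bit averages to 4.5 BPW:
print(achieved_bpw([(7_000_000, 4), (1_000_000, 8)]))  # → 4.5
```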

Benchmark Results

GSM8K (200 samples, 3-shot chain-of-thought):

Model                  GSM8K Accuracy
OptiQ mixed (4.5 BPW)  90.0%
Uniform 4-bit          90.0%

At this scale, OptiQ matches uniform 4-bit performance while providing a safety margin for sensitive layers. For smaller models where mixed-precision shows dramatic benefits, see Qwen3.5-0.8B-OptiQ-4bit where OptiQ more than doubles uniform accuracy (27% vs 11.5%).

Usage

This model works with standard mlx-lm — no special code needed:

from mlx_lm import load, generate

# Downloads the model from the Hub on first use, then loads from the local cache
model, tokenizer = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")

prompt = "Explain quantum computing in simple terms:"
response = generate(model, tokenizer, prompt=prompt, max_tokens=200)
print(response)

Requirements: mlx-lm >= 0.30.7 (for Qwen3.5 architecture support)

pip install "mlx-lm>=0.30.7"

Architecture

Qwen3.5 uses a hybrid attention architecture with alternating linear_attn and self_attn layers. OptiQ's sensitivity analysis identifies which layers are most sensitive to quantization error and assigns them higher precision, while less sensitive layers get 4-bit quantization to minimize model size.
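To inspect which layers ended up at which precision, you can tally the per-layer overrides that mlx-lm exports typically record inside the "quantization" section of the model's config.json. The helper and the layer names below are illustrative; the exact keys depend on the export.

```python
from collections import Counter

def count_bit_widths(quant_config):
    """Tally per-layer bit-widths from a quantization config dict.

    quant_config: the "quantization" section of config.json, where top-level
    scalars are the defaults and nested dicts are per-layer overrides.
    """
    return Counter(
        v["bits"]
        for v in quant_config.values()
        if isinstance(v, dict) and "bits" in v
    )

# Illustrative config fragment (layer names are hypothetical):
quant = {
    "group_size": 64,
    "bits": 4,
    "model.layers.0.self_attn.q_proj": {"group_size": 64, "bits": 8},
    "model.layers.1.mlp.down_proj": {"group_size": 64, "bits": 4},
}
print(count_bit_widths(quant))  # → Counter({8: 1, 4: 1})
```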

Article

For more details on the methodology and results, see: Not All Layers Are Equal: Mixed-Precision Quantization for Weights and KV Cache on Apple Silicon

Credits

  • Quantization method: OptiQ — optimizing compiler for mixed-precision quantization on Apple Silicon
  • Base model: Qwen/Qwen3.5-9B by Qwen Team
  • Runtime: MLX by Apple