Otilde/Qwen3-4B-Instruct-2507-MLX-Dynamic-5bit

The model Otilde/Qwen3-4B-Instruct-2507-MLX-Dynamic-5bit was converted to dynamic MLX format from Qwen/Qwen3-4B-Instruct-2507 using mlx-lm version 0.30.0.

Estimating sensitivities:  80%|████████████████████▉     | 45/56 [2:29:16<31:17, 170.70s/it]
Estimating sensitivities:  96%|█████████████████████████ | 54/56 [2:55:25<05:54, 177.01s/it]
Estimating sensitivities: 57it [3:03:25, 193.08s/it]
Original PPL: 8.204
[INFO] Quantized model with 5.005 bits per weight.
Quantized PPL: 8.506
Peak memory used: 47.648GB

What Are MLX Dynamic Quants?

MLX dynamic quantization is a compression technique that intelligently allocates different bit-widths to different layers based on their sensitivity to quantization errors. Unlike uniform quantization that applies the same precision to all layers, dynamic quantization analyzes each layer's impact on model output quality and assigns higher precision to more sensitive layers while using lower precision for robust layers.
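The idea can be sketched as a simple mapping from sensitivity scores to bit-widths. The thresholds and bit choices below are illustrative assumptions, not the actual policy used to produce this model:

```python
def bits_for_layer(sensitivity: float, low: float = 1e-4, high: float = 1e-3) -> int:
    """Map a layer's sensitivity score to a bit-width.

    Thresholds are illustrative only; the real quantizer chooses them
    to hit an overall bits-per-weight budget (here, ~5.005).
    """
    if sensitivity >= high:
        return 8   # very sensitive: keep high precision
    if sensitivity >= low:
        return 5   # moderately sensitive: use the target precision
    return 4       # robust layer: compress aggressively

# A robust layer versus a sensitive one:
bits_for_layer(9.8e-05)  # low-sensitivity layer -> 4 bits
bits_for_layer(2.5e-03)  # high-sensitivity layer -> 8 bits
```

The averaged result across all layers is what yields a fractional figure like "5.005 bits per weight" in the log above.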

You may notice that the repository also includes a sensitivites.json file. It contains per-layer sensitivity scores calculated during a comprehensive analysis phase. Each score indicates how sensitive a particular layer is to quantization: higher scores mark layers that suffer greater quality loss if quantized aggressively.

The sensitivity analysis is the time-consuming part of creating dynamic quants (as the multi-hour log above shows), because the system must test each layer individually against sample data to measure its impact on accuracy.

You can download just this file and create your own dynamic quants without redoing the lengthy sensitivity analysis.
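As a minimal sketch of reusing the saved scores, the snippet below builds a per-layer bit plan from a JSON file mapping layer names to sensitivity values. Both the file schema and the quartile-based policy are assumptions for illustration; the actual format and allocation strategy used by mlx-lm may differ:

```python
import json

def load_bit_plan(path: str, target_bits: int = 5) -> dict:
    """Build a per-layer bit plan from saved sensitivity scores.

    Assumed schema: {layer_name: sensitivity_score}. The policy here
    (top quarter -> 6 bits, bottom quarter -> 4 bits, rest -> target)
    is a hypothetical example, not the one used for this model.
    """
    with open(path) as f:
        scores = json.load(f)
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = len(ranked)
    plan = {}
    for i, name in enumerate(ranked):
        if i < n // 4:
            plan[name] = 6            # most sensitive layers
        elif i >= n - n // 4:
            plan[name] = 4            # most robust layers
        else:
            plan[name] = target_bits  # everything else
    return plan
```

Starting from precomputed scores, building such a plan takes seconds instead of the hours the analysis itself requires.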

Perplexity Results Explained

Perplexity (PPL) measures how well a language model predicts text, with lower scores indicating better performance. The evaluation results demonstrate the effectiveness of this approach:

Original Model PPL: 8.204

Quantized Model PPL: 8.506

Despite shrinking the model by roughly 2 GB, the dynamic quantized version achieves nearly identical perplexity to the original. The small increase (+0.302 PPL, about 3.7% relative) means virtually no loss in language understanding, which is especially impressive given the memory savings.
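For reference, perplexity is the exponential of the mean per-token negative log-likelihood over the evaluation text. A minimal sketch of the computation:

```python
import math

def perplexity(token_nlls: list[float]) -> float:
    """PPL = exp(mean negative log-likelihood over evaluated tokens)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Sanity check: a uniform per-token NLL of ln(8.204)
# reproduces the original model's reported PPL of 8.204.
perplexity([math.log(8.204)] * 100)
```

Because the score is exponentiated, small PPL gaps correspond to very small differences in average per-token log-likelihood.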

Why Accuracy Remains High

The secret lies in the sensitivity-guided precision allocation. Rather than uniformly degrading all parameters, the system protects the most critical neural pathways (e.g., model.layers.12.self_attn.v_proj) with higher precision while compressing less sensitive components more aggressively (e.g., model.layers.0.self_attn.o_proj, sensitivity 9.8e-05). This selective preservation strategy lets the model maintain near-original accuracy, making high-performance inference accessible on hardware with limited memory.

