Otilde/Ministral-3-3B-Instruct-2512-Q7-MLX-Dynamic

The model Otilde/Ministral-3-3B-Instruct-2512-Q7-MLX-Dynamic was converted to Dynamic MLX format from mistralai/Ministral-3-3B-Instruct-2512 using mlx-lm version 0.30.0.

```bash
mlx_lm.dynamic_quant --model mistralai/Ministral-3-3B-Instruct-2512 \
  --sensitivities mistralai_Ministral-3-3B-Instruct-2512_sensitivities.json \
  --low-bits 6 --high-bits 8 --low-group-size 32 --high-group-size 32 \
  --report-ppl

Original PPL: 7.733
[INFO] Quantized model with 7.017 bits per weight.
Quantized PPL: 7.752
```

What Are MLX Dynamic Quants?

MLX dynamic quantization is a compression technique that intelligently allocates different bit-widths to different layers based on their sensitivity to quantization errors. Unlike uniform quantization that applies the same precision to all layers, dynamic quantization analyzes each layer's impact on model output quality and assigns higher precision to more sensitive layers while using lower precision for robust layers.
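As a rough illustration, the allocation rule can be thought of as ranking layers by sensitivity and granting the top fraction the higher bit-width. The sketch below is illustrative only; the function and dictionary layout are assumptions, not mlx-lm's actual API:

```python
# Toy sketch of sensitivity-guided bit allocation (not mlx-lm's API).
# Assumes `sensitivities` maps layer names to sensitivity scores.
def allocate_bits(sensitivities, low_bits=6, high_bits=8, high_fraction=0.5):
    """Assign high_bits to the most sensitive layers, low_bits to the rest."""
    ranked = sorted(sensitivities, key=sensitivities.get, reverse=True)
    keep_high = set(ranked[: int(len(ranked) * high_fraction)])
    return {name: high_bits if name in keep_high else low_bits
            for name in sensitivities}

scores = {
    "model.layers.0.self_attn.q_proj": 0.91,  # hypothetical scores
    "model.layers.0.mlp.down_proj": 0.12,
    "model.layers.1.self_attn.q_proj": 0.55,
    "model.layers.1.mlp.down_proj": 0.07,
}
for name, bits in allocate_bits(scores).items():
    print(f"{name}: {bits}-bit")
```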

You may notice in the files that there is also a sensitivities.json available. This file contains per-layer sensitivity scores that were calculated during a comprehensive analysis phase. Each score indicates how sensitive a particular layer is to quantization: higher scores indicate layers that suffer greater quality loss if quantized aggressively.
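To peek at the scores yourself, something like the following works, assuming the JSON is a flat mapping from layer name to score (the actual schema may nest things differently):

```python
import json

# Assumes a flat {layer_name: score} layout; adjust if the real
# file nests the scores under other keys.
with open("mistralai_Ministral-3-3B-Instruct-2512_sensitivities.json") as f:
    sensitivities = json.load(f)

# The ten layers that tolerate aggressive quantization the worst.
worst = sorted(sensitivities.items(), key=lambda kv: kv[1], reverse=True)[:10]
for name, score in worst:
    print(f"{score:10.6f}  {name}")
```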

The sensitivity analysis process is the time-consuming part of creating dynamic quants because the system must test each layer individually with sample data to measure accuracy impact.
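The measurement loop itself is conceptually simple; the slow part is running it over every layer of a full model. Here is a toy NumPy version of the idea (not mlx-lm's actual procedure): quantize one layer at a time and compare the output against the unquantized baseline:

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_quantize(w, bits=4):
    """Round weights to 2**bits uniform levels over their range."""
    lo, hi = w.min(), w.max()
    levels = 2**bits - 1
    return np.round((w - lo) / (hi - lo) * levels) / levels * (hi - lo) + lo

# A toy two-layer network standing in for the real model.
weights = {"fc1": rng.normal(size=(64, 64)), "fc2": rng.normal(size=(64, 64))}
x = rng.normal(size=(8, 64))  # stand-in for calibration data

def forward(ws):
    return np.tanh(x @ ws["fc1"]) @ ws["fc2"]

baseline = forward(weights)

# Per-layer sensitivity: output error when only that layer is quantized.
for name in weights:
    trial = {**weights, name: fake_quantize(weights[name])}
    mse = float(np.mean((forward(trial) - baseline) ** 2))
    print(f"{name}: output MSE {mse:.6f}")
```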

You can download just this file and create your own dynamic quants without redoing the lengthy sensitivity analysis.
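For example, pointing the same command at the downloaded file and choosing a more aggressive bit split (the 4/6 split below is just one possible choice) skips straight to the quantization step:

```bash
mlx_lm.dynamic_quant --model mistralai/Ministral-3-3B-Instruct-2512 \
  --sensitivities mistralai_Ministral-3-3B-Instruct-2512_sensitivities.json \
  --low-bits 4 --high-bits 6 --low-group-size 32 --high-group-size 32 \
  --report-ppl
```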

Perplexity Results Explained

Perplexity (PPL) measures how well a language model predicts text, with lower scores indicating better performance. The evaluation results demonstrate the effectiveness of this approach:

Original Model PPL: 7.733

Quantized Model PPL: 7.752
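For reference, the PPL reported above is the exponentiated average negative log-likelihood of the evaluation tokens:

$$\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p\left(x_i \mid x_{<i}\right)\right)$$

Equivalently, the +0.019 gap is a shift of only about 0.0025 nats in average per-token log-likelihood (ln 7.752 - ln 7.733).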

The Q7 dynamically quantized model occupies about 3.02 GB, compared with roughly 3.4 GB for the Q8 version. Despite shaving off ~400 MB, it achieves nearly identical perplexity to the original: the tiny increase (+0.019 PPL) amounts to virtually no loss in language modeling quality, which is notable given the memory savings.
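As a quick sanity check on those figures, the reported 7.017 bits per weight roughly reproduces the on-disk size for a ~3.3B-parameter model, which is in the right ballpark of the observed ~3.02 GB (the parameter count is inferred from the model name; group scales and other metadata account for the remainder):

```python
params = 3.3e9           # approximate weight count, inferred from "3-3B"
bits_per_weight = 7.017  # reported by the quantizer above
print(f"~{params * bits_per_weight / 8 / 1e9:.2f} GB")  # ~2.89 GB
```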

Why Accuracy Remains High

The secret lies in the sensitivity-guided precision allocation. Rather than uniformly degrading all parameters, the system protects the most critical neural pathways with higher precision while compressing less sensitive components more aggressively. This selective preservation strategy lets the model maintain near-original accuracy, making high-performance inference accessible on hardware with limited memory.
