- mlx
base_model: Qwen/Qwen3-4B-Instruct-2507
---

# Otilde/Qwen3-4B-Instruct-2507-MLX-Dynamic-5bit

The model [Otilde/Qwen3-4B-Instruct-2507-MLX-Dynamic-5bit](https://huggingface.co/Otilde/Qwen3-4B-Instruct-2507-MLX-Dynamic-5bit/) was converted to Dynamic MLX format from [Qwen/Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) using mlx-lm version 0.30.0.

```
Estimating sensitivities: 80%|████████████████████▉ | 45/56 [2:29:16<31:17, 170.70s/it]
Estimating sensitivities: 96%|█████████████████████████ | 54/56 [2:55:25<05:54, 177.01s/it]
Estimating sensitivities: 57it [3:03:25, 193.08s/it]
Original PPL: 8.204
[INFO] Quantized model with 5.005 bits per weight.
Quantized PPL: 8.506
Peak memory used: 47.648GB
```

# What are MLX Dynamic Quants?

MLX dynamic quantization is a compression technique that intelligently allocates different bit-widths to different layers based on their sensitivity to quantization error. Unlike uniform quantization, which applies the same precision to all layers, dynamic quantization analyzes each layer's impact on model output quality and assigns higher precision to more sensitive layers while using lower precision for robust layers.
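
As a minimal sketch of the idea (not mlx-lm's actual policy; the layer names, scores, and thresholds below are invented for illustration), sensitivity-guided allocation can be as simple as ranking layers by score and giving the top fraction a higher bit-width so the average lands near the target budget:

```python
def allocate_bits(sensitivities, budget_bits=5.0, low=4, high=6):
    """Toy sensitivity-guided allocation: the most sensitive layers get
    `high` bits, the rest get `low`, keeping the average near `budget_bits`.
    Illustrative only; mlx-lm's real allocation strategy differs."""
    ranked = sorted(sensitivities, key=sensitivities.get, reverse=True)
    # Fraction of layers that must take the high bit-width to hit the budget.
    frac_high = (budget_bits - low) / (high - low)
    n_high = round(frac_high * len(ranked))
    return {name: (high if i < n_high else low) for i, name in enumerate(ranked)}

# Hypothetical per-layer sensitivity scores.
scores = {
    "model.layers.0.self_attn.o_proj": 9.8e-05,   # robust -> low precision
    "model.layers.12.self_attn.v_proj": 3.2e-02,  # sensitive -> high precision
    "model.layers.20.mlp.down_proj": 1.1e-03,
    "model.layers.35.mlp.gate_proj": 5.0e-04,
}
bits = allocate_bits(scores)
```

With this toy budget of 5.0 bits, the two most sensitive layers get 6-bit precision and the two least sensitive get 4-bit.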

You may notice in the files that there is also a [sensitivities.json](https://huggingface.co/Otilde/Qwen3-4B-Instruct-2507-MLX-Dynamic-5bit/blob/main/Qwen_Qwen3-4B-Instruct-2507_sensitivities.json) available. This file contains per-layer sensitivity scores that were calculated during a comprehensive analysis phase. Each score indicates how sensitive a particular layer is to quantization: higher scores mark layers that suffer greater quality loss when quantized aggressively.

The sensitivity analysis is the time-consuming part of creating dynamic quants (as the multi-hour progress bars above show), because the system must test each layer individually with sample data to measure its impact on accuracy.

You can download just this file and create your own dynamic quants without redoing the lengthy sensitivity analysis.
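
For illustration, assuming the file is a flat JSON object mapping layer names to float scores (check the downloaded file for its exact schema), reusing it is just a matter of loading the JSON and feeding the scores into your own allocation step. The snippet below writes a tiny stand-in file so it is self-contained; in practice you would download the linked sensitivities.json instead:

```python
import json
import os
import tempfile

# Stand-in for the downloaded file; assumed schema: layer name -> float score.
data = {
    "model.layers.0.self_attn.o_proj": 9.8e-05,
    "model.layers.12.self_attn.v_proj": 3.2e-02,
}
path = os.path.join(tempfile.mkdtemp(), "sensitivities.json")
with open(path, "w") as f:
    json.dump(data, f)

# Later (or on another machine): reuse the scores without re-running
# the hours-long sensitivity analysis.
with open(path) as f:
    scores = json.load(f)

most_sensitive = max(scores, key=scores.get)
```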

# Perplexity Results Explained

Perplexity (PPL) measures how well a language model predicts text, with lower scores indicating better performance. The evaluation results demonstrate the effectiveness of this approach:
37
+
38
+ Original Model PPL: **8.204**
39
+
40
+ Quantized Model PPL: **8.506**
41
+
42
+ Despite shrinking the model by ~2 GB, the dynamic quantized version achieves nearly identical perplexity to the original. The tiny increase (+0.302 PPL) means virtually no loss in language understanding, especially impressive given the drastic memory savings.
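
Perplexity is simply the exponential of the average negative log-likelihood per token, which makes small sanity checks like the following easy to do by hand (the log-probabilities below are made up for illustration):

```python
import math

def perplexity(token_logprobs):
    """PPL = exp(mean negative log-likelihood of the observed tokens)."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A model that assigns every token probability 1/8 has PPL 8 --
# intuitively, "as uncertain as choosing among 8 equally likely tokens".
lp = [math.log(1 / 8)] * 10

# The gap between the two reported scores above.
delta = 8.506 - 8.204
```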

## Why Accuracy Remains High

The secret lies in sensitivity-guided precision allocation. Rather than uniformly degrading all parameters, the system protects the most critical neural pathways (e.g., model.layers.12.self_attn.v_proj) with higher precision while compressing less sensitive components more aggressively (e.g., model.layers.0.self_attn.o_proj, sensitivity 9.8e-05). This selective preservation strategy enables the model to maintain near-original accuracy, making high-performance inference accessible on hardware with limited memory resources.
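
As a hypothetical back-of-the-envelope (the layer sizes and bit-widths below are invented, not read from this model), mixing per-layer precisions is also why the reported bits-per-weight comes out as a fractional number like 5.005 rather than a whole number:

```python
# Invented (name, weight_count, bits) triples for illustration only.
layers = [
    ("model.layers.12.self_attn.v_proj", 2_621_440, 6),   # sensitive -> 6-bit
    ("model.layers.0.self_attn.o_proj", 6_553_600, 4),    # robust    -> 4-bit
    ("model.layers.20.mlp.down_proj", 24_772_608, 5),
]

# Weighted average precision across all parameters.
total_bits = sum(count * bits for _, count, bits in layers)
total_weights = sum(count for _, count, _ in layers)
avg_bits = total_bits / total_weights  # fractional bits per weight
```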