Commit 289dd45 by Nekochu (verified) · 1 parent: 8ee84b3

Append optimization guide to README

Files changed (1)
  1. README.md +93 -0
README.md CHANGED
@@ -61,3 +61,96 @@ Note: Gemma-3 is dense (no MoE), so `*.shared_experts.*` is not applicable.
61
  - Base model: [VIDraft/Gemma-3-R1984-27B](https://huggingface.co/VIDraft/Gemma-3-R1984-27B)
62
  - Quantization: [exllamav3](https://github.com/turboderp-org/exllamav3) v0.0.34
63
  - Optimization method: [ArtusDev](https://huggingface.co/ArtusDev)
64
+
65
+ ---
66
+
67
+ # EXL3 Optimization Guide
68
+
69
+ ## Targets
70
+ `2.5bpw_H6 3.0bpw_H6 3.5bpw_H6 4.0bpw_H6 4.5bpw_H6 5.0bpw_H6 6.0bpw_H6`
71
+
72
+ | Target | Action |
73
+ |---|---|
74
+ | 2.5bpw_H6 | optimized |
75
+ | 3.0bpw_H6 | direct convert |
76
+ | 3.5bpw_H6 | optimized |
77
+ | 4.0bpw_H6 | optimized |
78
+ | 4.5bpw_H6 | optimized |
79
+ | 5.0bpw_H6 | direct convert |
80
+ | 6.0bpw_H6 | direct convert |
81
+
82
+ ## Overview
83
+ Dynamic EXL3 quants mix tensors of different precision within one model, similar to mixed-precision GGUFs. There are two methods:
84
+
85
+ - **Optimization**
86
+ - **Recompilation**
87
+
88
+ Usually, optimization and recompilation are used together: create a mixed quant through optimization, then run recompilation on top of it.
89
+
90
+ ## Optimization
91
+ 1. Start with two quants at different bpw, for example 2bpw and 3bpw.
92
+ 2. `measure.py` swaps layer groups in the lower-bpw quant for the corresponding groups from the higher-bpw quant and measures the resulting change in KL divergence (KLD), using the standard EXL3 calibration data.
93
+ 3. The resulting `measurement.json` can be reused. You only have to create it once, no matter how many mixed quants you make.
94
+ 4. `optimize.py` uses that `measurement.json` to create a third quant from two source quants, replacing the tensors that matter most with higher-bpw tensors.
95
+
96
+ Measurement takes about 20 minutes to an hour for large models. Optimization takes about 30 seconds to 1 minute.
97
+
98
+ ```bash
99
+ python util/measure.py -i /path/to/model-2bpw /path/to/model-3bpw -r /path/to/hf-model -o measurement.json -cr 10 -cc 1024 -d 0
100
+ python util/optimize.py -i /path/to/model-2bpw /path/to/model-3bpw -m measurement.json -o /path/to/model-optimized -b 2.5 -ss 8192
101
+ ```
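Because `measurement.json` only has to be created once (step 3), several targets can be produced from the same pair of source quants. A minimal sketch, assuming the flags shown in the commands above (`-i`, `-m`, `-o`, `-b`) are sufficient; the helper `build_optimize_cmd`, the output paths, and the target list are hypothetical:

```python
import shlex


def build_optimize_cmd(low_dir, high_dir, measurement, out_dir, bpw):
    """Return the argv list for one util/optimize.py run.

    Only flags that appear in the commands above are used.
    """
    return [
        "python", "util/optimize.py",
        "-i", low_dir, high_dir,
        "-m", measurement,
        "-o", out_dir,
        "-b", str(bpw),
    ]


# Hypothetical targets between the two source quants.
targets = [2.25, 2.5, 2.75]
commands = [
    build_optimize_cmd(
        "/path/to/model-2bpw",
        "/path/to/model-3bpw",
        "measurement.json",
        f"/path/to/model-{bpw}bpw",
        bpw,
    )
    for bpw in targets
]

for cmd in commands:
    # Printed rather than executed; subprocess.run(cmd, check=True) would run it.
    print(shlex.join(cmd))
```

Printing the commands first makes it easy to check the paths before starting any run.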
102
+
103
+ Alternative measure form with `-ms`:
104
+ ```bash
105
+ python util/measure.py -r /workspace/models/original-model -ms 128 -i /workspace/models/quant-2.5bpw /workspace/models/quant-3.5bpw -o /workspace/measurement.json
106
+ ```
107
+
108
+ Optimize example:
109
+ ```bash
110
+ python util/optimize.py -i /workspace/models/quant-2.5bpw /workspace/models/quant-3.0bpw -o /workspace/models/new-quant-2.75bpw -m /workspace/measurement.json -b 2.75
111
+ ```
112
+
113
+ ## Recompilation
114
+ Recompilation uses an `override.yaml` file to replace tensors in one quant with tensors from other quants; it is a manual form of optimization. The source notes that attention and shared-expert tensors were replaced with 8bpw tensors for all optimized quants. Recompilation takes about 30 seconds to 1 minute, and the actual bpw of the result is only known once recompilation finishes.
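A rough estimate of the resulting bpw can still help when planning overrides. The sketch below assumes the overall bpw is the parameter-weighted mean of the per-group bpw and ignores any fixed overhead; the function `estimate_bpw` and the parameter counts are made-up illustration values, not measurements of any real model:

```python
def estimate_bpw(groups):
    """Parameter-weighted mean bits per weight.

    groups: list of (parameter_count, bpw) pairs.
    """
    total_params = sum(n for n, _ in groups)
    total_bits = sum(n * b for n, b in groups)
    return total_bits / total_params


# Hypothetical split: 10% of the weights at 8bpw, the rest at 2.75bpw.
groups = [
    (1_000_000_000, 8.0),   # tensors overridden from an 8bpw quant
    (9_000_000_000, 2.75),  # remaining tensors from the optimized quant
]
print(round(estimate_bpw(groups), 3))  # 3.275
```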
115
+
116
+ ### Artus multi-source example
117
+ ```yaml
118
+ sources:
119
+   - id: 6
120
+     model_dir: /path/to/6bpw
121
+   - id: 8
122
+     model_dir: /path/to/8bpw
123
+ overrides:
124
+   - key: "*.self_attn.*"
125
+     source: 6
126
+   - key: "*.shared_experts.*"
127
+     source: 8
128
+ ```
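To see which tensors a set of override keys would touch, the patterns can be tried against tensor names offline. This sketch assumes the keys behave like shell-style globs as implemented by Python's `fnmatch`; that is an assumption about `recompile.py`, and the tensor names are hypothetical examples in the usual Hugging Face naming style:

```python
from fnmatch import fnmatchcase

# Keys and source ids copied from the override file above.
overrides = [
    ("*.self_attn.*", 6),
    ("*.shared_experts.*", 8),
]

# Hypothetical tensor names, for illustration only.
tensor_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.shared_experts.up_proj",
    "model.layers.0.mlp.down_proj",
]


def matching_sources(name, overrides):
    """Return the source ids of every override whose key matches name."""
    return [source for key, source in overrides if fnmatchcase(name, key)]


for name in tensor_names:
    print(name, matching_sources(name, overrides))
```

On a dense model such as Gemma-3, no tensor name contains `shared_experts`, so under this assumption that key would match nothing.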
129
+
130
+ ### GLM-Air example
131
+ The GLM-Air example replaces attention and shared-expert tensors with 8bpw tensors, and layers 2, 43, 1, and 29 with 5bpw tensors, because `measurement.json` showed that those layers had the worst KLD.
132
+
133
+ ```yaml
134
+ sources:
135
+   - id: 8
136
+     model_dir: /workspace/models/quants-8.0bpw
137
+   - id: 5
138
+     model_dir: /workspace/models/quants-5.0bpw
139
+ overrides:
140
+   - key: "*.self_attn.*"
141
+     source: 8
142
+   - key: "*.shared_experts.*"
143
+     source: 8
144
+   - key: "model.layers.2.*"
145
+     source: 5
146
+   - key: "model.layers.43.*"
147
+     source: 5
148
+   - key: "model.layers.1.*"
149
+     source: 5
150
+   - key: "model.layers.29.*"
151
+     source: 5
152
+ ```
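An override file like the one above can be generated from a list of layer indices. A minimal sketch using only the standard library; the function name `make_override_yaml` is hypothetical, and the layer list has to be picked by hand from `measurement.json`, whose format is not described here:

```python
def make_override_yaml(sources, pattern_overrides, layer_overrides):
    """Render an override.yaml string.

    sources: dict of source id -> model directory
    pattern_overrides: list of (key pattern, source id)
    layer_overrides: list of (layer index, source id)
    """
    lines = ["sources:"]
    for source_id, model_dir in sources.items():
        lines.append(f"  - id: {source_id}")
        lines.append(f"    model_dir: {model_dir}")
    lines.append("overrides:")
    keys = list(pattern_overrides)
    # One whole-layer key per selected layer index.
    keys += [(f"model.layers.{i}.*", src) for i, src in layer_overrides]
    for key, source_id in keys:
        lines.append(f'  - key: "{key}"')
        lines.append(f"    source: {source_id}")
    return "\n".join(lines) + "\n"


text = make_override_yaml(
    sources={
        8: "/workspace/models/quants-8.0bpw",
        5: "/workspace/models/quants-5.0bpw",
    },
    pattern_overrides=[("*.self_attn.*", 8), ("*.shared_experts.*", 8)],
    layer_overrides=[(2, 5), (43, 5), (1, 5), (29, 5)],
)
print(text)
```

Writing `text` to `override.yaml` gives a file that can be passed to `recompile.py` with `-or`.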
153
+
154
+ ```bash
155
+ python util/recompile.py -i /workspace/models/quant-2.75bpw -o /workspace/models/quant-recompiled -or override.yaml
156
+ ```