Nekochu committed on
Commit
f92fc02
·
verified ·
1 Parent(s): 1c16ac1

add architecture compatibility check step

Files changed (1)
  1. exl3-quant/SKILL.md +196 -189
exl3-quant/SKILL.md CHANGED
@@ -1,189 +1,196 @@
1
- ---
2
- name: exl3-quant
3
- description: Autonomous EXL3 quantization pipeline. Converts any HuggingFace model to optimized EXL3 quants with KLD evaluation, dark plot, and HF upload. Handles base quants, KLD-guided optimization, attention recompilation, and branch-per-bpw repo structure.
4
- ---
5
-
6
- # EXL3 Quantization Pipeline
7
-
8
- When invoked, autonomously execute this full pipeline end-to-end:
9
-
10
- 0. Ensure exllamav3 is installed from source. If not: `git clone https://github.com/turboderp-org/exllamav3.git` and install.
11
- 1. Fetch the latest official docs: `curl -sL https://raw.githubusercontent.com/turboderp-org/exllamav3/refs/heads/master/doc/convert.md -o convert_docs.md` and read for any API changes.
12
- 2. Download model, create ALL possible quants (bases + optimized), run KLD eval, generate dark plot.
13
- 3. If user wants push: upload as single repo with branch-per-bpw structure.
14
-
15
- ---
16
-
17
- # EXL3 Optimization Guide
18
-
19
- ## How Quants Work
20
-
21
- There are two types of quants with different naming:
22
-
23
- - **Base quants** (direct convert): exact round bpw (2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0). Ship as-is, NEVER recompiled.
24
- - **Optimized quants** (optimize + recompile): non-round actual bpw (e.g. 2.91, 3.35, 3.49, 4.13). The `-b` target is a budget, not a guaranteed final bitrate.
25
-
26
- ### Actual bpw: always verify the emitted artifact
27
-
28
- `-b` is a target budget, not a guaranteed final bitrate.
29
-
30
- - `util/optimize.py` can already produce a non-round actual bpw because it mixes whole tensor groups under the requested budget. The optimizer packs upgrades in indivisible groups — if the next useful group would exceed the remaining budget, it stops below the target. The resulting storage average is rounded to 2 decimals. That is how a target like `3.5` can legitimately emit `3.35`.
31
- - `util/recompile.py` may change bpw again after manual overrides, because it recomputes storage-derived bitrate and recompiles a second artifact. So `3.35` from optimize can become `3.49` after recompile.
32
- - Before naming branches or uploading, always read `quantization_config.json -> bits` from the **final** artifact you will publish.
33
-
34
- Branch naming uses the ACTUAL measured bpw, not the target:
35
- - `2.0bpw_H6` (base, exact)
36
- - `3.35bpw_H6` (optimized, non-round — this is normal and expected)
37
-
38
- Create ALL bases at round numbers (2.0 through 8.0), then create optimized quants between each adjacent pair of bases.
39
-
40
- ## Overview
41
- Dynamic EXL3 quants mix tensor precision, similar to mixed-precision GGUFs. There are two frameworks:
42
-
43
- - **Optimization**
44
- - **Recompilation**
45
-
46
- Usually, optimization and recompilation are used together: create a mixed quant through optimization, then run recompilation on top of it.
47
-
48
- ## Optimization
49
- 1. Start with two quants at different bpw, for example 2bpw and 3bpw.
50
- 2. `measure.py` measures KLD differences by replacing layer groups in the lower-bpw quant with groups from the higher-bpw quant; standard EXL3 calibration data is used.
51
- 3. The resulting `measurement.json` can be reused. You only have to create it once, no matter how many mixed quants you make.
52
- 4. `optimize.py` uses that `measurement.json` to create a third quant from two source quants, replacing the tensors that matter most with higher-bpw tensors.
53
-
54
- Measurement takes about 20min to an hour for big models. Optimization takes about 30s-1m.
55
-
56
- ```bash
57
- python util/measure.py -i /path/to/model-2bpw /path/to/model-3bpw -r /path/to/hf-model -o measurement.json -cr 10 -cc 1024 -d 0
58
- python util/optimize.py -i /path/to/model-2bpw /path/to/model-3bpw -m measurement.json -o /path/to/model-optimized -b 2.5 -ss 8192
59
- ```
60
-
61
- Alternative measure form with `-ms`:
62
- ```bash
63
- python util/measure.py -r /workspace/models/original-model -ms 128 -i /workspace/models/quant-2.5bpw /workspace/models/quant-3.5bpw -o /workspace/measurement.json
64
- ```
65
-
66
- Optimize example:
67
- ```bash
68
- python util/optimize.py -i /workspace/models/quant-2.5bpw /workspace/models/quant-3.0bpw -o /workspace/models/new-quant-2.75bpw -m /workspace/measurement.json -b 2.75
69
- ```
70
-
71
- ## Recompilation
72
- `override.yaml` replaces tensors in one quant with tensors from another quant. It is manual optimization. Recompilation takes about 30s-1m, and recomputes the storage-derived bitrate (may differ from post-optimize bpw).
73
-
74
- ### Multi-source example
75
- ```yaml
76
- sources:
77
- - id: 6
78
- model_dir: /path/to/6bpw
79
- - id: 8
80
- model_dir: /path/to/8bpw
81
- overrides:
82
- - key: "*.self_attn.*"
83
- source: 6
84
- - key: "*.shared_experts.*"
85
- source: 8
86
- ```
87
-
88
- ### GLM-Air example
89
- The GLM-Air example replaces attention and shared experts with 8bpw tensors, and layers 2, 43, 1, 29 with 5bpw tensors because `measurement.json` showed those layers had the worst KLD.
90
-
91
- ```yaml
92
- sources:
93
- - id: 8
94
- model_dir: /workspace/models/quants-8.0bpw
95
- - id: 5
96
- model_dir: /workspace/models/quants-5.0bpw
97
- overrides:
98
- - key: "*.self_attn.*"
99
- source: 8
100
- - key: "*.shared_experts.*"
101
- source: 8
102
- - key: "model.layers.2.*"
103
- source: 5
104
- - key: "model.layers.43.*"
105
- source: 5
106
- - key: "model.layers.1.*"
107
- source: 5
108
- - key: "model.layers.29.*"
109
- source: 5
110
- ```
111
-
112
- ```bash
113
- python util/recompile.py -i /workspace/models/quant-2.75bpw -o /workspace/models/quant-recompiled -or override.yaml
114
- ```
115
-
116
- ---
117
-
118
- ## Autonomous Pipeline Steps
119
-
120
- ### Step 0: Prerequisites
121
- - exllamav3 from source
122
- - Python venv with: torch (CUDA), exllamav3, flash-attn, safetensors, huggingface_hub, matplotlib, adjustText, datasets
123
-
124
- ### Step 1: Download Model
125
- ```bash
126
- huggingface-cli download <repo_id> --local-dir <model_dir>
127
- ```
128
- If model has .bin files only: convert shard by shard to BF16 safetensors.
129
-
130
- ### Step 2: Base Quants
131
- ```bash
132
- python convert.py -i <model_dir> -o <out_dir>/<name>-<bpw>bpw -w <work_dir> -b <bpw>
133
- ```
134
- Create bases at: 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0 bpw. Bases ship as-is (exact round bpw, NO recompile).
135
-
136
- ### Step 3: KLD Measurement
137
- ```bash
138
- python util/measure.py -r <model_dir> -ms 128 -i <2.0bpw_dir> <8.0bpw_dir> -o measurement.json
139
- ```
140
-
141
- ### Step 4: Optimized Quants
142
- For each gap between adjacent bases:
143
- ```bash
144
- python util/optimize.py -i <lo_bpw_dir> <hi_bpw_dir> -m measurement.json -o <out_dir> -b <target_bpw>
145
- ```
146
-
147
- ### Step 5: Recompile (ONLY Optimized, NOT Bases)
148
- Use 5bpw attention override for dense models (in our testing, 8bpw was too aggressive and caused bpw convergence on Gemma-3-27B). For MoE models, 6-8bpw may be appropriate. Adjust per model:
149
- ```yaml
150
- sources:
151
- - id: 5
152
- model_dir: <path_to_5.0bpw>
153
- overrides:
154
- - key: "*.self_attn.*"
155
- source: 5
156
- ```
157
- ```bash
158
- python util/recompile.py -i <optimized_dir> -o <recompiled_dir> -or override.yaml
159
- ```
160
- Read `quantization_config.json` -> `bits` for ACTUAL bpw after recompile.
161
-
162
- ### Step 6: KLD Eval + Dark Plot
163
- ```bash
164
- python eval/compare_q.py -d dataspec.json -m modelspec.json -kld -p -v -dark -t "<Model Name> EXL3" -pf kld_plot.png
165
- ```
166
- Critical rules:
167
- - Reference model MUST have `"out_logits"` field in modelspec (or use `-lf` to load pre-saved logits), otherwise KLD is never computed
168
- - MUST use `-p` flag (not just `-pf`)
169
- - Use highest bpw that fits VRAM as reference (6.0 if 8.0 overflows)
170
- - Use forward slashes in JSON paths even on Windows
171
-
172
- ### Step 7: Upload (conditional)
173
- If user wants to push to HuggingFace:
174
- - ONE repo with a separate branch for each quant variant (not one branch with all quants)
175
- - `main` branch: measurement.json + kld_plot.png + README.md only
176
- - Each quant gets its own branch named by actual bpw: `2.0bpw_H6`, `3.35bpw_H6`, `5.0bpw_H6`, etc.
177
- - Each bpw branch contains: only .safetensors, .json, tokenizer files (NO app.py, .css, .vscode, kld_plot.png or other junk from the source model)
178
- - Branch naming: bases = exact round bpw (e.g. `2.0bpw_H6`), optimized = actual bpw from `quantization_config.json` (e.g. `3.35bpw_H6`)
179
- - README on main branch: short and concise, use CSS dark-themed cards. Title: "EXL3 quants of [original model] using exllamav3 [version]". Include: KLD plot image, branch table (branch name, actual bpw, type), download command example. No walls of text. No em dash.
180
-
181
- ## Key Lessons
182
- - Base quants stay at exact round bpw (when using integer `-b` without `--hq`). Only optimized quants get recompiled.
183
- - Non-round bpw can appear as early as `optimize.py` (not just after recompile). Always verify `quantization_config.json -> bits` from the final artifact.
184
- - attn@8bpw on dense models caused bpw convergence in our testing (2.0 and 2.5 both became ~2.96). We used 5bpw instead. This is model-specific adjust per architecture.
185
- - `*.shared_experts.*` only applies to MoE models. Dense models omit it.
186
- - compare_q.py requires `-p` flag and either `"out_logits"` or `-lf` to compute and plot KLD.
187
- - Convert .bin to safetensors one shard at a time to avoid OOM.
188
- - Very low bpw (1.0, 1.5) may fail with GPU assert on newer architectures.
189
- - Before starting, check if quants already exist for the model (search HF for existing EXL3 repos).
 
 
 
 
 
 
 
 
1
+ ---
2
+ name: exl3-quant
3
+ description: Autonomous EXL3 quantization pipeline. Converts any HuggingFace model to optimized EXL3 quants with KLD evaluation, dark plot, and HF upload. Handles base quants, KLD-guided optimization, attention recompilation, and branch-per-bpw repo structure.
4
+ ---
5
+
6
+ # EXL3 Quantization Pipeline
7
+
8
+ When invoked, autonomously execute this full pipeline end-to-end:
9
+
10
+ 0. Ensure exllamav3 is installed from source. If not: `git clone https://github.com/turboderp-org/exllamav3.git` and install.
11
+ 1. Fetch the latest official docs: `curl -sL https://raw.githubusercontent.com/turboderp-org/exllamav3/refs/heads/master/doc/convert.md -o convert_docs.md` and read for any API changes.
12
+ 2. Download model, create ALL possible quants (bases + optimized), run KLD eval, generate dark plot.
13
+ 3. If user wants push: upload as single repo with branch-per-bpw structure.
14
+
15
+ ---
16
+
17
+ # EXL3 Optimization Guide
18
+
19
+ ## How Quants Work
20
+
21
+ There are two types of quants with different naming:
22
+
23
+ - **Base quants** (direct convert): exact round bpw (2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0). Ship as-is, NEVER recompiled.
24
+ - **Optimized quants** (optimize + recompile): non-round actual bpw (e.g. 2.91, 3.35, 3.49, 4.13). The `-b` target is a budget, not a guaranteed final bitrate.
25
+
26
+ ### Actual bpw: always verify the emitted artifact
27
+
28
+ `-b` is a target budget, not a guaranteed final bitrate.
29
+
30
+ - `util/optimize.py` can already produce a non-round actual bpw because it mixes whole tensor groups under the requested budget. The optimizer packs upgrades in indivisible groups — if the next useful group would exceed the remaining budget, it stops below the target. The resulting storage average is rounded to 2 decimals. That is how a target like `3.5` can legitimately emit `3.35`.
31
+ - `util/recompile.py` may change bpw again after manual overrides, because it recomputes storage-derived bitrate and recompiles a second artifact. So `3.35` from optimize can become `3.49` after recompile.
32
+ - Before naming branches or uploading, always read `quantization_config.json -> bits` from the **final** artifact you will publish.
33
+
34
+ Branch naming uses the ACTUAL measured bpw, not the target:
35
+ - `2.0bpw_H6` (base, exact)
36
+ - `3.35bpw_H6` (optimized, non-round — this is normal and expected)
37
+
38
+ Create ALL bases at round numbers (2.0 through 8.0), then create optimized quants between each adjacent pair of bases.
39
+
40
+ ## Overview
41
+ Dynamic EXL3 quants mix tensor precision, similar to mixed-precision GGUFs. There are two frameworks:
42
+
43
+ - **Optimization**
44
+ - **Recompilation**
45
+
46
+ Usually, optimization and recompilation are used together: create a mixed quant through optimization, then run recompilation on top of it.
47
+
48
+ ## Optimization
49
+ 1. Start with two quants at different bpw, for example 2bpw and 3bpw.
50
+ 2. `measure.py` measures KLD differences by replacing layer groups in the lower-bpw quant with groups from the higher-bpw quant; standard EXL3 calibration data is used.
51
+ 3. The resulting `measurement.json` can be reused. You only have to create it once, no matter how many mixed quants you make.
52
+ 4. `optimize.py` uses that `measurement.json` to create a third quant from two source quants, replacing the tensors that matter most with higher-bpw tensors.
53
+
54
+ Measurement takes about 20min to an hour for big models. Optimization takes about 30s-1m.
55
+
56
+ ```bash
57
+ python util/measure.py -i /path/to/model-2bpw /path/to/model-3bpw -r /path/to/hf-model -o measurement.json -cr 10 -cc 1024 -d 0
58
+ python util/optimize.py -i /path/to/model-2bpw /path/to/model-3bpw -m measurement.json -o /path/to/model-optimized -b 2.5 -ss 8192
59
+ ```
60
+
61
+ Alternative measure form with `-ms`:
62
+ ```bash
63
+ python util/measure.py -r /workspace/models/original-model -ms 128 -i /workspace/models/quant-2.5bpw /workspace/models/quant-3.5bpw -o /workspace/measurement.json
64
+ ```
65
+
66
+ Optimize example:
67
+ ```bash
68
+ python util/optimize.py -i /workspace/models/quant-2.5bpw /workspace/models/quant-3.0bpw -o /workspace/models/new-quant-2.75bpw -m /workspace/measurement.json -b 2.75
69
+ ```
70
+
71
+ ## Recompilation
72
+ `override.yaml` replaces tensors in one quant with tensors from another quant. It is manual optimization. Recompilation takes about 30s-1m, and recomputes the storage-derived bitrate (may differ from post-optimize bpw).
73
+
74
+ ### Multi-source example
75
+ ```yaml
76
+ sources:
77
+ - id: 6
78
+ model_dir: /path/to/6bpw
79
+ - id: 8
80
+ model_dir: /path/to/8bpw
81
+ overrides:
82
+ - key: "*.self_attn.*"
83
+ source: 6
84
+ - key: "*.shared_experts.*"
85
+ source: 8
86
+ ```
87
+
88
+ ### GLM-Air example
89
+ The GLM-Air example replaces attention and shared experts with 8bpw tensors, and layers 2, 43, 1, 29 with 5bpw tensors because `measurement.json` showed those layers had the worst KLD.
90
+
91
+ ```yaml
92
+ sources:
93
+ - id: 8
94
+ model_dir: /workspace/models/quants-8.0bpw
95
+ - id: 5
96
+ model_dir: /workspace/models/quants-5.0bpw
97
+ overrides:
98
+ - key: "*.self_attn.*"
99
+ source: 8
100
+ - key: "*.shared_experts.*"
101
+ source: 8
102
+ - key: "model.layers.2.*"
103
+ source: 5
104
+ - key: "model.layers.43.*"
105
+ source: 5
106
+ - key: "model.layers.1.*"
107
+ source: 5
108
+ - key: "model.layers.29.*"
109
+ source: 5
110
+ ```
111
+
112
+ ```bash
113
+ python util/recompile.py -i /workspace/models/quant-2.75bpw -o /workspace/models/quant-recompiled -or override.yaml
114
+ ```
115
+
116
+ ---
117
+
118
+ ## Autonomous Pipeline Steps
119
+
120
+ ### Step 0: Prerequisites
121
+ - exllamav3 from source
122
+ - Python venv with: torch (CUDA), exllamav3, flash-attn, safetensors, huggingface_hub, matplotlib, adjustText, datasets
123
+
124
+ ### Step 0.5: Verify Architecture Compatibility
125
+ Check model's `config.json` for its `"architectures"` field, then grep the exllamav3 repo to confirm support:
126
+ ```bash
127
+ grep -r "arch_string" exllamav3/exllamav3/architecture/ | grep "<ArchitectureName>"
128
+ ```
129
+ If no match, the model is not yet supported. Check open issues on the exllamav3 repo before proceeding.
130
+
131
+ ### Step 1: Download Model
132
+ ```bash
133
+ huggingface-cli download <repo_id> --local-dir <model_dir>
134
+ ```
135
+ If model has .bin files only: convert shard by shard to BF16 safetensors.
136
+
137
+ ### Step 2: Base Quants
138
+ ```bash
139
+ python convert.py -i <model_dir> -o <out_dir>/<name>-<bpw>bpw -w <work_dir> -b <bpw>
140
+ ```
141
+ Create bases at: 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0 bpw. Bases ship as-is (exact round bpw, NO recompile).
142
+
143
+ ### Step 3: KLD Measurement
144
+ ```bash
145
+ python util/measure.py -r <model_dir> -ms 128 -i <2.0bpw_dir> <8.0bpw_dir> -o measurement.json
146
+ ```
147
+
148
+ ### Step 4: Optimized Quants
149
+ For each gap between adjacent bases:
150
+ ```bash
151
+ python util/optimize.py -i <lo_bpw_dir> <hi_bpw_dir> -m measurement.json -o <out_dir> -b <target_bpw>
152
+ ```
153
+
154
+ ### Step 5: Recompile (ONLY Optimized, NOT Bases)
155
+ Use 5bpw attention override for dense models (in our testing, 8bpw was too aggressive and caused bpw convergence on Gemma-3-27B). For MoE models, 6-8bpw may be appropriate. Adjust per model:
156
+ ```yaml
157
+ sources:
158
+ - id: 5
159
+ model_dir: <path_to_5.0bpw>
160
+ overrides:
161
+ - key: "*.self_attn.*"
162
+ source: 5
163
+ ```
164
+ ```bash
165
+ python util/recompile.py -i <optimized_dir> -o <recompiled_dir> -or override.yaml
166
+ ```
167
+ Read `quantization_config.json` -> `bits` for ACTUAL bpw after recompile.
168
+
169
+ ### Step 6: KLD Eval + Dark Plot
170
+ ```bash
171
+ python eval/compare_q.py -d dataspec.json -m modelspec.json -kld -p -v -dark -t "<Model Name> EXL3" -pf kld_plot.png
172
+ ```
173
+ Critical rules:
174
+ - Reference model MUST have `"out_logits"` field in modelspec (or use `-lf` to load pre-saved logits), otherwise KLD is never computed
175
+ - MUST use `-p` flag (not just `-pf`)
176
+ - Use highest bpw that fits VRAM as reference (6.0 if 8.0 overflows)
177
+ - Use forward slashes in JSON paths even on Windows
178
+
179
+ ### Step 7: Upload (conditional)
180
+ If user wants to push to HuggingFace:
181
+ - ONE repo with a separate branch for each quant variant (not one branch with all quants)
182
+ - `main` branch: measurement.json + kld_plot.png + README.md only
183
+ - Each quant gets its own branch named by actual bpw: `2.0bpw_H6`, `3.35bpw_H6`, `5.0bpw_H6`, etc.
184
+ - Each bpw branch contains: only .safetensors, .json, tokenizer files (NO app.py, .css, .vscode, kld_plot.png or other junk from the source model)
185
+ - Branch naming: bases = exact round bpw (e.g. `2.0bpw_H6`), optimized = actual bpw from `quantization_config.json` (e.g. `3.35bpw_H6`)
186
+ - README on main branch: short and concise, use CSS dark-themed cards. Title: "EXL3 quants of [original model] using exllamav3 [version]". Include: KLD plot image, branch table (branch name, actual bpw, type), download command example. No walls of text. No em dash.
187
+
188
+ ## Key Lessons
189
+ - Base quants stay at exact round bpw (when using integer `-b` without `--hq`). Only optimized quants get recompiled.
190
+ - Non-round bpw can appear as early as `optimize.py` (not just after recompile). Always verify `quantization_config.json -> bits` from the final artifact.
191
+ - attn@8bpw on dense models caused bpw convergence in our testing (2.0 and 2.5 both became ~2.96). We used 5bpw instead. This is model-specific — adjust per architecture.
192
+ - `*.shared_experts.*` only applies to MoE models. Dense models omit it.
193
+ - compare_q.py requires `-p` flag and either `"out_logits"` or `-lf` to compute and plot KLD.
194
+ - Convert .bin to safetensors one shard at a time to avoid OOM.
195
+ - Very low bpw (1.0, 1.5) may fail with GPU assert on newer architectures.
196
+ - Before starting, check if quants already exist for the model (search HF for existing EXL3 repos).
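
The Step 0.5 check added in this commit can also be scripted instead of run by hand. The following is a minimal sketch, not part of the commit: it reads the `"architectures"` field from the model's `config.json` and reports source lines that contain both `arch_string` and the architecture name, mirroring the `grep` pipeline in the step. The function names are hypothetical, the source directory is whatever path the step's `grep` command points at, and restricting the search to `.py` files is an assumption (`grep -r` searches every file).

```python
import json
from pathlib import Path


def declared_architectures(config_path):
    """Return the list under "architectures" in a config.json (empty if absent)."""
    with open(config_path, encoding="utf-8") as fh:
        return json.load(fh).get("architectures") or []


def find_arch_matches(arch_name, source_dir):
    """Collect lines that contain both "arch_string" and arch_name.

    Assumption: only .py files are scanned, unlike `grep -r`.
    """
    hits = []
    for path in sorted(Path(source_dir).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), 1):
            if "arch_string" in line and arch_name in line:
                hits.append((str(path), lineno, line.strip()))
    return hits


def check_model(config_path, source_dir):
    """Map each declared architecture to its matching source lines."""
    return {
        arch: find_arch_matches(arch, source_dir)
        for arch in declared_architectures(config_path)
    }


if __name__ == "__main__":
    import sys

    # Usage: python check_arch.py <model_dir>/config.json <architecture_source_dir>
    for arch, hits in check_model(sys.argv[1], sys.argv[2]).items():
        status = "match found" if hits else "no match"
        print(f"{arch}: {status}")
        for path, lineno, line in hits:
            print(f"  {path}:{lineno}: {line}")
```

An empty result for an architecture corresponds to the "no match" case in Step 0.5, where the step says to check open issues on the exllamav3 repo before proceeding.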