Spaces: Luminia/README

Nekochu committed 6 days ago

Commit f92fc02 (verified) · 1 parent: 1c16ac1

add architecture compatibility check step


Files changed (1)

exl3-quant/SKILL.md +196 -189


@@ -1,189 +1,196 @@
----
-name: exl3-quant
-description: Autonomous EXL3 quantization pipeline. Converts any HuggingFace model to optimized EXL3 quants with KLD evaluation, dark plot, and HF upload. Handles base quants, KLD-guided optimization, attention recompilation, and branch-per-bpw repo structure.
----
-# EXL3 Quantization Pipeline
-When invoked, autonomously execute this full pipeline end-to-end:
-0. Ensure exllamav3 is installed from source. If not: `git clone https://github.com/turboderp-org/exllamav3.git` and install.
-1. Fetch the latest official docs: `curl -sL https://raw.githubusercontent.com/turboderp-org/exllamav3/refs/heads/master/doc/convert.md -o convert_docs.md` and read for any API changes.
-2. Download model, create ALL possible quants (bases + optimized), run KLD eval, generate dark plot.
-3. If user wants push: upload as single repo with branch-per-bpw structure.
----
-# EXL3 Optimization Guide
-## How Quants Work
-There are two types of quants with different naming:
-- **Base quants** (direct convert): exact round bpw (2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0). Ship as-is, NEVER recompiled.
-- **Optimized quants** (optimize + recompile): non-round actual bpw (e.g. 2.91, 3.35, 3.49, 4.13). The `-b` target is a budget, not a guaranteed final bitrate.
-### Actual bpw: always verify the emitted artifact
-`-b` is a target budget, not a guaranteed final bitrate.
-- `util/optimize.py` can already produce a non-round actual bpw because it mixes whole tensor groups under the requested budget. The optimizer packs upgrades in indivisible groups — if the next useful group would exceed the remaining budget, it stops below the target. The resulting storage average is rounded to 2 decimals. That is how a target like `3.5` can legitimately emit `3.35`.
-- `util/recompile.py` may change bpw again after manual overrides, because it recomputes storage-derived bitrate and recompiles a second artifact. So `3.35` from optimize can become `3.49` after recompile.
-- Before naming branches or uploading, always read `quantization_config.json -> bits` from the **final** artifact you will publish.
-Branch naming uses the ACTUAL measured bpw, not the target:
-- `2.0bpw_H6` (base, exact)
-- `3.35bpw_H6` (optimized, non-round — this is normal and expected)
-Create ALL bases at round numbers (2.0 through 8.0), then create optimized quants between each adjacent pair of bases.
-## Overview
-Dynamic EXL3 quants mix tensor precision, similar to mixed-precision GGUFs. There are two frameworks:
-- **Optimization**
-- **Recompilation**
-Usually, optimization and recompilation are used together: create a mixed quant through optimization, then run recompilation on top of it.
-## Optimization
-1. Start with two quants at different bpw, for example 2bpw and 3bpw.
-2. `measure.py` measures KLD differences by replacing layer groups in the lower-bpw quant with groups from the higher-bpw quant; standard EXL3 calibration data is used.
-3. The resulting `measurement.json` can be reused. You only have to create it once, no matter how many mixed quants you make.
-4. `optimize.py` uses that `measurement.json` to create a third quant from two source quants, replacing the tensors that matter most with higher-bpw tensors.
-Measurement takes about 20min to an hour for big models. Optimization takes about 30s-1m.
-```bash
-python util/measure.py -i /path/to/model-2bpw /path/to/model-3bpw -r /path/to/hf-model -o measurement.json -cr 10 -cc 1024 -d 0
-python util/optimize.py -i /path/to/model-2bpw /path/to/model-3bpw -m measurement.json -o /path/to/model-optimized -b 2.5 -ss 8192
-```
-Alternative measure form with `-ms`:
-```bash
-python util/measure.py -r /workspace/models/original-model -ms 128 -i /workspace/models/quant-2.5bpw /workspace/models/quant-3.5bpw -o /workspace/measurement.json
-```
-Optimize example:
-```bash
-python util/optimize.py -i /workspace/models/quant-2.5bpw /workspace/models/quant-3.0bpw -o /workspace/models/new-quant-2.75bpw -m /workspace/measurement.json -b 2.75
-```
-## Recompilation
-`override.yaml` replaces tensors in one quant with tensors from another quant. It is manual optimization. Recompilation takes about 30s-1m, and recomputes the storage-derived bitrate (may differ from post-optimize bpw).
-### Multi-source example
-```yaml
-sources:
-  - id: 6
-    model_dir: /path/to/6bpw
-  - id: 8
-    model_dir: /path/to/8bpw
-overrides:
-  - key: "*.self_attn.*"
-    source: 6
-  - key: "*.shared_experts.*"
-    source: 8
-```
-### GLM-Air example
-The GLM-Air example replaces attention and shared experts with 8bpw tensors, and layers 2, 43, 1, 29 with 5bpw tensors because `measurement.json` showed those layers had the worst KLD.
-```yaml
-sources:
-  - id: 8
-    model_dir: /workspace/models/quants-8.0bpw
-  - id: 5
-    model_dir: /workspace/models/quants-5.0bpw
-overrides:
-  - key: "*.self_attn.*"
-    source: 8
-  - key: "*.shared_experts.*"
-    source: 8
-  - key: "model.layers.2.*"
-    source: 5
-  - key: "model.layers.43.*"
-    source: 5
-  - key: "model.layers.1.*"
-    source: 5
-  - key: "model.layers.29.*"
-    source: 5
-```
-```bash
-python util/recompile.py -i /workspace/models/quant-2.75bpw -o /workspace/models/quant-recompiled -or override.yaml
-```
----
-## Autonomous Pipeline Steps
-### Step 0: Prerequisites
-- exllamav3 from source
-- Python venv with: torch (CUDA), exllamav3, flash-attn, safetensors, huggingface_hub, matplotlib, adjustText, datasets
-### Step 1: Download Model
-```bash
-huggingface-cli download <repo_id> --local-dir <model_dir>
-```
-If model has .bin files only: convert shard by shard to BF16 safetensors.
-### Step 2: Base Quants
-```bash
-python convert.py -i <model_dir> -o <out_dir>/<name>-<bpw>bpw -w <work_dir> -b <bpw>
-```
-Create bases at: 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0 bpw. Bases ship as-is (exact round bpw, NO recompile).
-### Step 3: KLD Measurement
-```bash
-python util/measure.py -r <model_dir> -ms 128 -i <2.0bpw_dir> <8.0bpw_dir> -o measurement.json
-```
-### Step 4: Optimized Quants
-For each gap between adjacent bases:
-```bash
-python util/optimize.py -i <lo_bpw_dir> <hi_bpw_dir> -m measurement.json -o <out_dir> -b <target_bpw>
-```
-### Step 5: Recompile (ONLY Optimized, NOT Bases)
-Use 5bpw attention override for dense models (in our testing, 8bpw was too aggressive and caused bpw convergence on Gemma-3-27B). For MoE models, 6-8bpw may be appropriate. Adjust per model:
-```yaml
-sources:
-  - id: 5
-    model_dir: <path_to_5.0bpw>
-overrides:
-  - key: "*.self_attn.*"
-    source: 5
-```
-```bash
-python util/recompile.py -i <optimized_dir> -o <recompiled_dir> -or override.yaml
-```
-Read `quantization_config.json` -> `bits` for ACTUAL bpw after recompile.
-### Step 6: KLD Eval + Dark Plot
-```bash
-python eval/compare_q.py -d dataspec.json -m modelspec.json -kld -p -v -dark -t "<Model Name> EXL3" -pf kld_plot.png
-```
-Critical rules:
-- Reference model MUST have `"out_logits"` field in modelspec (or use `-lf` to load pre-saved logits), otherwise KLD is never computed
-- MUST use `-p` flag (not just `-pf`)
-- Use highest bpw that fits VRAM as reference (6.0 if 8.0 overflows)
-- Use forward slashes in JSON paths even on Windows
-### Step 7: Upload (conditional)
-If user wants to push to HuggingFace:
-- ONE repo with a separate branch for each quant variant (not one branch with all quants)
-- `main` branch: measurement.json + kld_plot.png + README.md only
-- Each quant gets its own branch named by actual bpw: `2.0bpw_H6`, `3.35bpw_H6`, `5.0bpw_H6`, etc.
-- Each bpw branch contains: only .safetensors, .json, tokenizer files (NO app.py, .css, .vscode, kld_plot.png or other junk from the source model)
-- Branch naming: bases = exact round bpw (e.g. `2.0bpw_H6`), optimized = actual bpw from `quantization_config.json` (e.g. `3.35bpw_H6`)
-- README on main branch: short and concise, use CSS dark-themed cards. Title: "EXL3 quants of [original model] using exllamav3 [version]". Include: KLD plot image, branch table (branch name, actual bpw, type), download command example. No walls of text. No em dash.
-## Key Lessons
-- Base quants stay at exact round bpw (when using integer `-b` without `--hq`). Only optimized quants get recompiled.
-- Non-round bpw can appear as early as `optimize.py` (not just after recompile). Always verify `quantization_config.json -> bits` from the final artifact.
-- attn@8bpw on dense models caused bpw convergence in our testing (2.0 and 2.5 both became ~2.96). We used 5bpw instead. This is model-specific — adjust per architecture.
-- `*.shared_experts.*` only applies to MoE models. Dense models omit it.
-- compare_q.py requires `-p` flag and either `"out_logits"` or `-lf` to compute and plot KLD.
-- Convert .bin to safetensors one shard at a time to avoid OOM.
-- Very low bpw (1.0, 1.5) may fail with GPU assert on newer architectures.
-- Before starting, check if quants already exist for the model (search HF for existing EXL3 repos).

+---
+name: exl3-quant
+description: Autonomous EXL3 quantization pipeline. Converts any HuggingFace model to optimized EXL3 quants with KLD evaluation, dark plot, and HF upload. Handles base quants, KLD-guided optimization, attention recompilation, and branch-per-bpw repo structure.
+---
+# EXL3 Quantization Pipeline
+When invoked, autonomously execute this full pipeline end-to-end:
+0. Ensure exllamav3 is installed from source. If not: `git clone https://github.com/turboderp-org/exllamav3.git` and install.
+1. Fetch the latest official docs: `curl -sL https://raw.githubusercontent.com/turboderp-org/exllamav3/refs/heads/master/doc/convert.md -o convert_docs.md` and read for any API changes.
+2. Download model, create ALL possible quants (bases + optimized), run KLD eval, generate dark plot.
+3. If user wants push: upload as single repo with branch-per-bpw structure.
+---
+# EXL3 Optimization Guide
+## How Quants Work
+There are two types of quants with different naming:
+- **Base quants** (direct convert): exact round bpw (2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0). Ship as-is, NEVER recompiled.
+- **Optimized quants** (optimize + recompile): non-round actual bpw (e.g. 2.91, 3.35, 3.49, 4.13). The `-b` target is a budget, not a guaranteed final bitrate.
+### Actual bpw: always verify the emitted artifact
+`-b` is a target budget, not a guaranteed final bitrate.
+- `util/optimize.py` can already produce a non-round actual bpw because it mixes whole tensor groups under the requested budget. The optimizer packs upgrades in indivisible groups — if the next useful group would exceed the remaining budget, it stops below the target. The resulting storage average is rounded to 2 decimals. That is how a target like `3.5` can legitimately emit `3.35`.
+- `util/recompile.py` may change bpw again after manual overrides, because it recomputes storage-derived bitrate and recompiles a second artifact. So `3.35` from optimize can become `3.49` after recompile.
+- Before naming branches or uploading, always read `quantization_config.json -> bits` from the **final** artifact you will publish.
+Branch naming uses the ACTUAL measured bpw, not the target:
+- `2.0bpw_H6` (base, exact)
+- `3.35bpw_H6` (optimized, non-round — this is normal and expected)
+Create ALL bases at round numbers (2.0 through 8.0), then create optimized quants between each adjacent pair of bases.
+## Overview
+Dynamic EXL3 quants mix tensor precision, similar to mixed-precision GGUFs. There are two frameworks:
+- **Optimization**
+- **Recompilation**
+Usually, optimization and recompilation are used together: create a mixed quant through optimization, then run recompilation on top of it.
+## Optimization
+1. Start with two quants at different bpw, for example 2bpw and 3bpw.
+2. `measure.py` measures KLD differences by replacing layer groups in the lower-bpw quant with groups from the higher-bpw quant; standard EXL3 calibration data is used.
+3. The resulting `measurement.json` can be reused. You only have to create it once, no matter how many mixed quants you make.
+4. `optimize.py` uses that `measurement.json` to create a third quant from two source quants, replacing the tensors that matter most with higher-bpw tensors.
+Measurement takes about 20min to an hour for big models. Optimization takes about 30s-1m.
+```bash
+python util/measure.py -i /path/to/model-2bpw /path/to/model-3bpw -r /path/to/hf-model -o measurement.json -cr 10 -cc 1024 -d 0
+python util/optimize.py -i /path/to/model-2bpw /path/to/model-3bpw -m measurement.json -o /path/to/model-optimized -b 2.5 -ss 8192
+```
+Alternative measure form with `-ms`:
+```bash
+python util/measure.py -r /workspace/models/original-model -ms 128 -i /workspace/models/quant-2.5bpw /workspace/models/quant-3.5bpw -o /workspace/measurement.json
+```
+Optimize example:
+```bash
+python util/optimize.py -i /workspace/models/quant-2.5bpw /workspace/models/quant-3.0bpw -o /workspace/models/new-quant-2.75bpw -m /workspace/measurement.json -b 2.75
+```
+## Recompilation
+`override.yaml` replaces tensors in one quant with tensors from another quant. It is manual optimization. Recompilation takes about 30s-1m, and recomputes the storage-derived bitrate (may differ from post-optimize bpw).
+### Multi-source example
+```yaml
+sources:
+  - id: 6
+    model_dir: /path/to/6bpw
+  - id: 8
+    model_dir: /path/to/8bpw
+overrides:
+  - key: "*.self_attn.*"
+    source: 6
+  - key: "*.shared_experts.*"
+    source: 8
+```
+### GLM-Air example
+The GLM-Air example replaces attention and shared experts with 8bpw tensors, and layers 2, 43, 1, 29 with 5bpw tensors because `measurement.json` showed those layers had the worst KLD.
+```yaml
+sources:
+  - id: 8
+    model_dir: /workspace/models/quants-8.0bpw
+  - id: 5
+    model_dir: /workspace/models/quants-5.0bpw
+overrides:
+  - key: "*.self_attn.*"
+    source: 8
+  - key: "*.shared_experts.*"
+    source: 8
+  - key: "model.layers.2.*"
+    source: 5
+  - key: "model.layers.43.*"
+    source: 5
+  - key: "model.layers.1.*"
+    source: 5
+  - key: "model.layers.29.*"
+    source: 5
+```
+```bash
+python util/recompile.py -i /workspace/models/quant-2.75bpw -o /workspace/models/quant-recompiled -or override.yaml
+```
+---
+## Autonomous Pipeline Steps
+### Step 0: Prerequisites
+- exllamav3 from source
+- Python venv with: torch (CUDA), exllamav3, flash-attn, safetensors, huggingface_hub, matplotlib, adjustText, datasets
+### Step 0.5: Verify Architecture Compatibility
+Check model's `config.json` for its `"architectures"` field, then grep the exllamav3 repo to confirm support:
+```bash
+grep -r "arch_string" exllamav3/exllamav3/architecture/ | grep "<ArchitectureName>"
+```
+If no match, the model is not yet supported. Check open issues on the exllamav3 repo before proceeding.
+### Step 1: Download Model
+```bash
+huggingface-cli download <repo_id> --local-dir <model_dir>
+```
+If model has .bin files only: convert shard by shard to BF16 safetensors.
+### Step 2: Base Quants
+```bash
+python convert.py -i <model_dir> -o <out_dir>/<name>-<bpw>bpw -w <work_dir> -b <bpw>
+```
+Create bases at: 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0 bpw. Bases ship as-is (exact round bpw, NO recompile).
+### Step 3: KLD Measurement
+```bash
+python util/measure.py -r <model_dir> -ms 128 -i <2.0bpw_dir> <8.0bpw_dir> -o measurement.json
+```
+### Step 4: Optimized Quants
+For each gap between adjacent bases:
+```bash
+python util/optimize.py -i <lo_bpw_dir> <hi_bpw_dir> -m measurement.json -o <out_dir> -b <target_bpw>
+```
+### Step 5: Recompile (ONLY Optimized, NOT Bases)
+Use 5bpw attention override for dense models (in our testing, 8bpw was too aggressive and caused bpw convergence on Gemma-3-27B). For MoE models, 6-8bpw may be appropriate. Adjust per model:
+```yaml
+sources:
+  - id: 5
+    model_dir: <path_to_5.0bpw>
+overrides:
+  - key: "*.self_attn.*"
+    source: 5
+```
+```bash
+python util/recompile.py -i <optimized_dir> -o <recompiled_dir> -or override.yaml
+```
+Read `quantization_config.json` -> `bits` for ACTUAL bpw after recompile.
+### Step 6: KLD Eval + Dark Plot
+```bash
+python eval/compare_q.py -d dataspec.json -m modelspec.json -kld -p -v -dark -t "<Model Name> EXL3" -pf kld_plot.png
+```
+Critical rules:
+- Reference model MUST have `"out_logits"` field in modelspec (or use `-lf` to load pre-saved logits), otherwise KLD is never computed
+- MUST use `-p` flag (not just `-pf`)
+- Use highest bpw that fits VRAM as reference (6.0 if 8.0 overflows)
+- Use forward slashes in JSON paths even on Windows
+### Step 7: Upload (conditional)
+If user wants to push to HuggingFace:
+- ONE repo with a separate branch for each quant variant (not one branch with all quants)
+- `main` branch: measurement.json + kld_plot.png + README.md only
+- Each quant gets its own branch named by actual bpw: `2.0bpw_H6`, `3.35bpw_H6`, `5.0bpw_H6`, etc.
+- Each bpw branch contains: only .safetensors, .json, tokenizer files (NO app.py, .css, .vscode, kld_plot.png or other junk from the source model)
+- Branch naming: bases = exact round bpw (e.g. `2.0bpw_H6`), optimized = actual bpw from `quantization_config.json` (e.g. `3.35bpw_H6`)
+- README on main branch: short and concise, use CSS dark-themed cards. Title: "EXL3 quants of [original model] using exllamav3 [version]". Include: KLD plot image, branch table (branch name, actual bpw, type), download command example. No walls of text. No em dash.
+## Key Lessons
+- Base quants stay at exact round bpw (when using integer `-b` without `--hq`). Only optimized quants get recompiled.
+- Non-round bpw can appear as early as `optimize.py` (not just after recompile). Always verify `quantization_config.json -> bits` from the final artifact.
+- attn@8bpw on dense models caused bpw convergence in our testing (2.0 and 2.5 both became ~2.96). We used 5bpw instead. This is model-specific — adjust per architecture.
+- `*.shared_experts.*` only applies to MoE models. Dense models omit it.
+- compare_q.py requires `-p` flag and either `"out_logits"` or `-lf` to compute and plot KLD.
+- Convert .bin to safetensors one shard at a time to avoid OOM.
+- Very low bpw (1.0, 1.5) may fail with GPU assert on newer architectures.
+- Before starting, check if quants already exist for the model (search HF for existing EXL3 repos).
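
The Step 0.5 check added in this commit (read `"architectures"` from the model's `config.json`, then look for a matching `arch_string` line in the exllamav3 checkout) could be scripted roughly as follows. This is a minimal sketch, not part of the commit: the function names `read_architectures`, `find_arch_support`, and `check_model` are hypothetical, the search directory is passed in by the caller rather than assumed, and the scan is limited to `*.py` files, whereas `grep -r` looks at every file.

```python
import json
from pathlib import Path


def read_architectures(model_dir):
    """Return the "architectures" list from <model_dir>/config.json (empty if absent)."""
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return list(config.get("architectures") or [])


def find_arch_support(arch_name, search_dir):
    """Return the .py files under search_dir that have a line containing both
    "arch_string" and arch_name (the same substring match the two greps perform)."""
    hits = []
    for path in sorted(Path(search_dir).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for line in text.splitlines():
            if "arch_string" in line and arch_name in line:
                hits.append(path)
                break  # one matching line is enough for this file
    return hits


def check_model(model_dir, search_dir):
    """Map each architecture declared by the model to the files that mention it.
    An empty list means no match was found, i.e. the model is likely unsupported."""
    return {
        arch: find_arch_support(arch, search_dir)
        for arch in read_architectures(model_dir)
    }
```

Calling `check_model("<model_dir>", "exllamav3/exllamav3/architecture")` with the directory from the grep command in Step 0.5 returns a dict keyed by architecture name. Any key that maps to an empty list corresponds to the "no match" case, where the step says to check open issues on the exllamav3 repo before proceeding.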