Append optimization guide to README
README.md
CHANGED
|
@@ -61,3 +61,96 @@ Note: Gemma-3 is dense (no MoE), so `*.shared_experts.*` is not applicable.
|
|
| 61 |
- Base model: [VIDraft/Gemma-3-R1984-27B](https://huggingface.co/VIDraft/Gemma-3-R1984-27B)
|
| 62 |
- Quantization: [exllamav3](https://github.com/turboderp-org/exllamav3) v0.0.34
|
| 63 |
- Optimization method: [ArtusDev](https://huggingface.co/ArtusDev)
|
| 64 |
+
|
| 65 |
+
---
|
| 66 |
+
|
| 67 |
+
# EXL3 Optimization Guide
|
| 68 |
+
|
| 69 |
+
## Targets
|
| 70 |
+
`2.5bpw_H6 3.0bpw_H6 3.5bpw_H6 4.0bpw_H6 4.5bpw_H6 5.0bpw_H6 6.0bpw_H6`
|
| 71 |
+
|
| 72 |
+
| Target | Action |
|---|---|
| 2.5bpw_H6 | optimized |
| 3.0bpw_H6 | direct convert |
| 3.5bpw_H6 | optimized |
| 4.0bpw_H6 | optimized |
| 4.5bpw_H6 | optimized |
| 5.0bpw_H6 | direct convert |
| 6.0bpw_H6 | direct convert |
|
| 81 |
+
|
| 82 |
+
## Overview
|
| 83 |
+
Dynamic EXL3 quants mix tensors of different precision within one model, similar to mixed-precision GGUFs. There are two approaches:
|
| 84 |
+
|
| 85 |
+
- **Optimization**
- **Recompilation**
|
| 87 |
+
|
| 88 |
+
Usually, optimization and recompilation are used together: create a mixed quant through optimization, then run recompilation on top of it.
|
| 89 |
+
|
| 90 |
+
## Optimization
|
| 91 |
+
1. Start with two quants at different bpw, for example 2bpw and 3bpw.
2. `measure.py` measures the change in KLD (KL divergence) when layer groups in the lower-bpw quant are replaced with the same groups from the higher-bpw quant. It uses the standard EXL3 calibration data.
3. The resulting `measurement.json` can be reused. You only need to create it once, no matter how many mixed quants you make.
4. `optimize.py` uses that `measurement.json` to create a third quant from the two source quants, replacing the tensors that matter most with their higher-bpw versions.
|
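The selection idea behind steps 2 to 4 can be pictured with a small sketch. This is not the algorithm used by `optimize.py`, and it does not read the real `measurement.json` (its schema is not described here). It is a plain greedy heuristic under stated assumptions: the `Group` class, the group names, the weight counts, and the KLD gains are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Group:
    name: str        # hypothetical layer-group label
    weights: int     # number of weights in the group (made-up values below)
    kld_gain: float  # assumed KLD reduction when the group is taken from the higher-bpw quant


def pick_upgrades(groups, low_bpw, high_bpw, target_bpw):
    """Greedily upgrade the groups with the best KLD gain per extra bit
    until the bit budget implied by target_bpw is used up."""
    total_weights = sum(g.weights for g in groups)
    budget = (target_bpw - low_bpw) * total_weights  # extra bits we may spend
    extra_bits_per_weight = high_bpw - low_bpw

    def gain_per_bit(g):
        return g.kld_gain / (g.weights * extra_bits_per_weight)

    chosen = []
    for g in sorted(groups, key=gain_per_bit, reverse=True):
        cost = g.weights * extra_bits_per_weight
        if cost <= budget:
            chosen.append(g.name)
            budget -= cost
    return chosen


groups = [
    Group("layers.0-3", 100, 0.40),
    Group("layers.4-7", 100, 0.10),
    Group("layers.8-11", 100, 0.30),
    Group("layers.12-15", 100, 0.05),
]
print(pick_upgrades(groups, low_bpw=2.0, high_bpw=3.0, target_bpw=2.5))
```

With a 2.5bpw target between 2bpw and 3bpw sources, half of the equally sized groups can be upgraded, so the two groups with the largest assumed gain are picked.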
| 95 |
+
|
| 96 |
+
Measurement takes about 20 minutes to an hour for big models. Optimization takes about 30 seconds to a minute.
|
| 97 |
+
|
| 98 |
+
```bash
python util/measure.py -i /path/to/model-2bpw /path/to/model-3bpw -r /path/to/hf-model -o measurement.json -cr 10 -cc 1024 -d 0
python util/optimize.py -i /path/to/model-2bpw /path/to/model-3bpw -m measurement.json -o /path/to/model-optimized -b 2.5 -ss 8192
```
|
| 102 |
+
|
| 103 |
+
Alternative measure form with `-ms`:
|
| 104 |
+
```bash
python util/measure.py -r /workspace/models/original-model -ms 128 -i /workspace/models/quant-2.5bpw /workspace/models/quant-3.5bpw -o /workspace/measurement.json
```
|
| 107 |
+
|
| 108 |
+
Optimize example:
|
| 109 |
+
```bash
python util/optimize.py -i /workspace/models/quant-2.5bpw /workspace/models/quant-3.0bpw -o /workspace/models/new-quant-2.75bpw -m /workspace/measurement.json -b 2.75
```
|
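Because `measurement.json` is reusable, producing several mixed quants from the same pair of sources comes down to one measure run followed by one optimize run per target. The sketch below only builds and prints the command lines; it does not run them. It uses only flags that appear in the commands above (`-i`, `-r`, `-o`, `-m`, `-b`). The helper names, the placeholder paths, and the list of targets are assumptions made for illustration.

```python
import shlex


def measure_cmd(quants, reference, out_json):
    """Command line for a single measurement pass over two source quants."""
    return ["python", "util/measure.py", "-i", *quants, "-r", reference, "-o", out_json]


def optimize_cmd(quants, measurement, out_dir, bpw):
    """Command line for one optimized quant at the requested bpw."""
    return ["python", "util/optimize.py", "-i", *quants,
            "-m", measurement, "-o", out_dir, "-b", str(bpw)]


# Placeholder paths in the style of the examples above
quants = ["/path/to/model-2bpw", "/path/to/model-3bpw"]
measurement = "measurement.json"

# Measure once ...
print(shlex.join(measure_cmd(quants, "/path/to/hf-model", measurement)))

# ... then reuse the same measurement for every target (hypothetical targets)
for bpw in (2.25, 2.5, 2.75):
    out_dir = f"/path/to/model-optimized-{bpw}bpw"
    print(shlex.join(optimize_cmd(quants, measurement, out_dir, bpw)))
```

Keeping the commands as argument lists makes it easy to pass them to `subprocess.run` later without worrying about shell quoting.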
| 112 |
+
|
| 113 |
+
## Recompilation
|
| 114 |
+
Recompilation is manual optimization: an `override.yaml` file specifies which tensors in one quant are replaced with tensors from another quant. The source notes that attention and shared expert tensors were replaced with 8bpw tensors for all optimized quants. Recompilation takes about 30 seconds to a minute, and the actual bpw of the result is only known once recompilation is done.
|
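A rough estimate of the resulting bpw can still be made beforehand as a weighted average over tensor sizes. The sketch below shows only that arithmetic. The tensor split is hypothetical, and the figure reported after recompilation may differ, since this ignores any per-tensor overhead and the output head.

```python
def effective_bpw(tensors):
    """tensors: iterable of (num_weights, bits_per_weight) pairs.
    Returns the size-weighted average bits per weight."""
    tensors = list(tensors)
    total_bits = sum(n * b for n, b in tensors)
    total_weights = sum(n for n, _ in tensors)
    return total_bits / total_weights


# Hypothetical model: 20% of the weights overridden to 8bpw, the rest left at 3bpw
mix = [(20, 8.0), (80, 3.0)]
print(effective_bpw(mix))  # 4.0
```

Even a small share of 8bpw tensors moves the average noticeably, which is why the final bpw can land well above the base quant's nominal value.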
| 115 |
+
|
| 116 |
+
### Artus multi-source example
|
| 117 |
+
```yaml
sources:
  - id: 6
    model_dir: /path/to/6bpw
  - id: 8
    model_dir: /path/to/8bpw
overrides:
  - key: "*.self_attn.*"
    source: 6
  - key: "*.shared_experts.*"
    source: 8
```
|
| 129 |
+
|
| 130 |
+
### GLM-Air example
|
| 131 |
+
The GLM-Air example replaces attention and shared expert tensors with 8bpw tensors, and layers 2, 43, 1, and 29 with 5bpw tensors, because `measurement.json` showed that those layers had the worst KLD.
|
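Picking the worst layers and turning them into override entries is mechanical, so it can be scripted. The following is a minimal sketch, assuming the per-layer KLD values have already been pulled out of `measurement.json` into a plain dict; the file's real schema is not covered here. The KLD numbers and helper names are made up, and only the `key` / `source` entry shape comes from the examples in this guide.

```python
def worst_layers(kld_by_layer, k):
    """Return the k layer indices with the highest KLD, worst first."""
    return sorted(kld_by_layer, key=kld_by_layer.get, reverse=True)[:k]


def override_entries(layers, source_id):
    """Render override.yaml entries that point whole layers at one source."""
    lines = []
    for layer in layers:
        lines.append(f'  - key: "model.layers.{layer}.*"')
        lines.append(f"    source: {source_id}")
    return "\n".join(lines)


# Hypothetical per-layer KLD values (not real measurements)
kld = {1: 0.031, 2: 0.052, 10: 0.004, 29: 0.027, 43: 0.045}

layers = worst_layers(kld, 4)
print(layers)
print("overrides:")
print(override_entries(layers, source_id=5))
```

The printed entries can be pasted under the `overrides:` key of an `override.yaml`, next to the attention and shared expert patterns.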
| 132 |
+
|
| 133 |
+
```yaml
sources:
  - id: 8
    model_dir: /workspace/models/quants-8.0bpw
  - id: 5
    model_dir: /workspace/models/quants-5.0bpw
overrides:
  - key: "*.self_attn.*"
    source: 8
  - key: "*.shared_experts.*"
    source: 8
  - key: "model.layers.2.*"
    source: 5
  - key: "model.layers.43.*"
    source: 5
  - key: "model.layers.1.*"
    source: 5
  - key: "model.layers.29.*"
    source: 5
```
|
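To check which tensors a set of patterns would touch before recompiling, the keys can be tried against tensor names with shell-style globbing. This sketch assumes the keys behave like `fnmatch` patterns and that the first matching entry wins. Neither assumption is confirmed here for `recompile.py`, and the tensor names are hypothetical examples.

```python
from fnmatch import fnmatchcase

# (pattern, source id) pairs mirroring the GLM-Air override list above
OVERRIDES = [
    ("*.self_attn.*", 8),
    ("*.shared_experts.*", 8),
    ("model.layers.2.*", 5),
    ("model.layers.43.*", 5),
    ("model.layers.1.*", 5),
    ("model.layers.29.*", 5),
]


def resolve(tensor_name, overrides=OVERRIDES, default="base"):
    """Return the source id of the first matching pattern (assumed rule),
    or `default` when the tensor stays in the base quant."""
    for pattern, source in overrides:
        if fnmatchcase(tensor_name, pattern):
            return source
    return default


# Hypothetical tensor names
for name in (
    "model.layers.2.self_attn.q_proj",
    "model.layers.2.mlp.down_proj",
    "model.layers.20.mlp.down_proj",
    "model.layers.7.mlp.shared_experts.up_proj",
):
    print(name, "->", resolve(name))
```

Note that `model.layers.2.*` does not match layer 20 or 29, because the pattern requires a literal dot right after the `2`.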
| 153 |
+
|
| 154 |
+
```bash
python util/recompile.py -i /workspace/models/quant-2.75bpw -o /workspace/models/quant-recompiled -or override.yaml
```
|