Actual bpw branches, clean README + guide appended
README.md
CHANGED
@@ -1,43 +1,47 @@
# Gemma-3-R1984-27B EXL3

-EXL3 quants of [VIDraft/Gemma-3-R1984-27B](https://huggingface.co/VIDraft/Gemma-3-R1984-27B) (27B).

## Branches

-| Branch |
-|---|---|---|
-| `2.
-| `2.
-| `3.
-| `3.
-| `
-| `4.
-| `5.
-| `6.

H6 = head_bits 6. All variants recompiled with `*.self_attn.*` boosted to 8bpw.
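The gap between a target bpw and the actual bpw of a branch can be illustrated with a weighted average. The sketch below is plain arithmetic under stated assumptions: `effective_bpw` is a hypothetical helper, and the 20% attention share is an illustrative number, not a measurement of Gemma-3 or of any branch in this repo.

```python
def effective_bpw(groups):
    """Weighted-average bits per weight.

    groups: list of (num_weights, bpw) pairs, one per tensor group.
    """
    total_weights = sum(n for n, _ in groups)
    total_bits = sum(n * b for n, b in groups)
    return total_bits / total_weights


# Hypothetical model with 1000 weights, 200 of them in attention tensors.
ATTN_WEIGHTS = 200
OTHER_WEIGHTS = 800

# Uniform 3.0 bpw quant, before any override.
before = effective_bpw([(ATTN_WEIGHTS, 3.0), (OTHER_WEIGHTS, 3.0)])

# Same quant after the attention group is swapped for 8.0 bpw tensors.
after = effective_bpw([(ATTN_WEIGHTS, 8.0), (OTHER_WEIGHTS, 3.0)])

print(f"before: {before:.2f} bpw, after: {after:.2f} bpw")
```

With these made-up numbers the average rises from 3.0 to 4.0 bpw. Real figures depend on the model's actual tensor sizes and on head_bits, which is why the actual bpw is only known after recompilation.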

-##

-### Base quants
```bash
python convert.py -i <hf-model> -o <out> -w <work> -b <bpw>
```
5 base quants: 2.0, 3.0, 5.0, 6.0, 8.0 bpw.

-### KLD measurement
```bash
python util/measure.py -r <hf-model> -ms 128 -i <2.0bpw> <8.0bpw> -o measurement.json
```
Reusable across all optimized targets. Included in main branch.
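For readers unfamiliar with the metric: assuming KLD here means the Kullback-Leibler divergence between the next-token distributions of the reference model and a quantized model, the quantity can be sketched in plain Python as follows. This is not the code in `util/measure.py`; the function names and the example logits are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(ref_logits, quant_logits):
    """KL(P_ref || P_quant) in nats for a single token position."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Made-up logits over a 4-token vocabulary.
ref = [2.0, 1.0, 0.5, -1.0]
quant = [1.8, 1.1, 0.4, -0.7]

print(f"KLD: {kl_divergence(ref, quant):.6f} nats")
```

Identical logits give a divergence of zero, and larger values indicate that quantization changed the output distribution more.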

-### Optimization (mixed-precision)
```bash
python util/optimize.py -i <lo-bpw> <hi-bpw> -m measurement.json -o <out> -b <target>
```
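The idea of mixing a low-bpw and a high-bpw quant toward a target can be sketched as a greedy selection under a bit budget. This is an assumed simplification for illustration only, not the algorithm in `util/optimize.py`; the tensor names, sizes, and `kld_gain` values are made up.

```python
def pick_upgrades(tensors, target_bpw):
    """Greedily upgrade tensors from lo_bpw to hi_bpw within a bit budget.

    tensors: list of dicts with keys name, size, lo_bpw, hi_bpw, kld_gain,
    where kld_gain is a hypothetical KLD reduction from using the
    high-bpw version of that tensor.
    Returns (names of upgraded tensors, resulting average bpw).
    """
    total = sum(t["size"] for t in tensors)
    bits = sum(t["size"] * t["lo_bpw"] for t in tensors)
    budget = target_bpw * total

    def extra_bits(t):
        return t["size"] * (t["hi_bpw"] - t["lo_bpw"])

    # Only tensors whose high-bpw version actually costs more bits.
    candidates = [t for t in tensors if extra_bits(t) > 0]
    # Best KLD reduction per extra bit first.
    candidates.sort(key=lambda t: t["kld_gain"] / extra_bits(t), reverse=True)

    chosen = []
    for t in candidates:
        if bits + extra_bits(t) <= budget:
            bits += extra_bits(t)
            chosen.append(t["name"])
    return chosen, bits / total


tensors = [
    {"name": "layer_a", "size": 100, "lo_bpw": 3.0, "hi_bpw": 5.0, "kld_gain": 0.9},
    {"name": "layer_b", "size": 100, "lo_bpw": 3.0, "hi_bpw": 5.0, "kld_gain": 0.1},
    {"name": "layer_c", "size": 100, "lo_bpw": 3.0, "hi_bpw": 5.0, "kld_gain": 0.5},
]
chosen, bpw = pick_upgrades(tensors, target_bpw=4.4)
print(chosen, round(bpw, 2))
```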
-

-### Recompilation (attn override)
```yaml
sources:
- id: 8
@@ -47,20 +51,19 @@ overrides:
source: 8
```
```bash
-python util/recompile.py -i <
```
-

## Files

-- `main` branch: `measurement.json` (KLD map
- Each bpw branch: quantized model shards + config + tokenizer

## Credits

- Base model: [VIDraft/Gemma-3-R1984-27B](https://huggingface.co/VIDraft/Gemma-3-R1984-27B)
- Quantization: [exllamav3](https://github.com/turboderp-org/exllamav3) v0.0.34
-- Optimization method: [ArtusDev](https://huggingface.co/ArtusDev)

---

@@ -113,7 +116,7 @@ python util/optimize.py -i /workspace/models/quant-2.5bpw /workspace/models/quan
## Recompilation
`override.yaml` replaces tensors in one quant with tensors from another quant. It is manual optimization. The source notes that attention and shared expert tensors were replaced with 8bpw tensors for all optimized quants. Recompilation takes about 30s-1m, and the actual new bpw is known after recompilation is done.
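How a pattern such as `*.self_attn.*` could select tensors for replacement can be sketched with glob matching. This assumes the override keys behave like shell-style globs and that the first matching override wins; neither is confirmed here for `util/recompile.py`. The tensor names, the `resolve_sources` helper, and the default source id 3 are hypothetical.

```python
from fnmatch import fnmatchcase


def resolve_sources(tensor_keys, overrides, default_source):
    """Map each tensor key to a source id.

    overrides: list of (glob_pattern, source_id) pairs, checked in order.
    Keys that match no pattern keep default_source.
    """
    plan = {}
    for key in tensor_keys:
        plan[key] = default_source
        for pattern, source in overrides:
            if fnmatchcase(key, pattern):
                plan[key] = source
                break
    return plan


# Hypothetical tensor names in a transformer checkpoint.
keys = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.mlp.down_proj",
]

# Attention tensors come from source 8, everything else stays on source 3.
plan = resolve_sources(keys, [("*.self_attn.*", 8)], default_source=3)
for key, source in plan.items():
    print(f"{key} <- source {source}")
```

In this sketch only the two `self_attn` tensors are taken from the 8bpw source, which mirrors the attention boost described above.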

-###
```yaml
sources:
- id: 6
@@ -1,43 +1,47 @@
# Gemma-3-R1984-27B EXL3

+EXL3 quants of [VIDraft/Gemma-3-R1984-27B](https://huggingface.co/VIDraft/Gemma-3-R1984-27B) (27B).
+Each bpw variant is a separate branch. Attention tensors boosted to 8bpw via recompilation.
+
+Docs: [exllamav3 convert.md](https://github.com/turboderp-org/exllamav3/blob/master/doc/convert.md)

## Branches

+| Branch | Target | Actual bpw | Method |
+|---|---|---|---|
+| `2.96bpw_H6` | 2.0 | 2.96 | base + recompile |
+| `2.98bpw_H6` | 2.5 | 2.98 | optimized (2.0+3.0) + recompile |
+| `3.80bpw_H6` | 3.0 | 3.80 | base + recompile |
+| `3.83bpw_H6` | 3.5 | 3.83 | optimized (3.0+5.0) + recompile |
+| `3.97bpw_H6` | 4.0 | 3.97 | optimized (3.0+5.0) + recompile |
+| `4.13bpw_H6` | 4.5 | 4.13 | optimized (3.0+5.0) + recompile |
+| `5.48bpw_H6` | 5.0 | 5.48 | base + recompile |
+| `6.32bpw_H6` | 6.0 | 6.32 | base + recompile |

H6 = head_bits 6. All variants recompiled with `*.self_attn.*` boosted to 8bpw.
+Gemma-3 is dense (no MoE), so `*.shared_experts.*` is not applicable.

+## Build recipe

+### 1. Base quants
```bash
python convert.py -i <hf-model> -o <out> -w <work> -b <bpw>
```
5 base quants: 2.0, 3.0, 5.0, 6.0, 8.0 bpw.

+### 2. KLD measurement
```bash
python util/measure.py -r <hf-model> -ms 128 -i <2.0bpw> <8.0bpw> -o measurement.json
```
Reusable across all optimized targets. Included in main branch.

+### 3. Optimization (mixed-precision)
```bash
python util/optimize.py -i <lo-bpw> <hi-bpw> -m measurement.json -o <out> -b <target>
```
+KLD-guided tensor replacement: tensors that matter most get higher-bpw versions.

+### 4. Recompilation (attn override)
```yaml
sources:
- id: 8
@@ -47,20 +51,19 @@ overrides:
source: 8
```
```bash
+python util/recompile.py -i <input> -o <final> -or override.yaml
```
+Actual bpw is determined after recompile (attn@8bpw shifts average up).

## Files

+- `main` branch: `measurement.json` (KLD map)
- Each bpw branch: quantized model shards + config + tokenizer

## Credits

- Base model: [VIDraft/Gemma-3-R1984-27B](https://huggingface.co/VIDraft/Gemma-3-R1984-27B)
- Quantization: [exllamav3](https://github.com/turboderp-org/exllamav3) v0.0.34

---

@@ -113,7 +116,7 @@ python util/optimize.py -i /workspace/models/quant-2.5bpw /workspace/models/quan
## Recompilation
`override.yaml` replaces tensors in one quant with tensors from another quant. It is manual optimization. The source notes that attention and shared expert tensors were replaced with 8bpw tensors for all optimized quants. Recompilation takes about 30s-1m, and the actual new bpw is known after recompilation is done.

+### Multi-source example
```yaml
sources:
- id: 6