Nekochu committed on
Commit
be5202e
·
verified ·
1 Parent(s): 289dd45

Actual bpw branches, clean README + guide appended

  1. README.md +25 -22
README.md CHANGED
@@ -1,43 +1,47 @@
1
  # Gemma-3-R1984-27B EXL3
2
 
3
- EXL3 quants of [VIDraft/Gemma-3-R1984-27B](https://huggingface.co/VIDraft/Gemma-3-R1984-27B) (27B). Each bpw variant is a separate branch. Attention tensors boosted to 8bpw via recompilation.
 
 
 
4
 
5
  ## Branches
6
 
7
- | Branch | Action | Description |
8
- |---|---|---|
9
- | `2.0bpw-H6` | base quant | Lowest size |
10
- | `2.5bpw-H6` | optimized (2.0+3.0) | KLD-mixed |
11
- | `3.0bpw-H6` | base quant | Direct convert |
12
- | `3.5bpw-H6` | optimized (3.0+5.0) | KLD-mixed |
13
- | `4.0bpw-H6` | optimized (3.0+5.0) | KLD-mixed |
14
- | `4.5bpw-H6` | optimized (3.0+5.0) | KLD-mixed |
15
- | `5.0bpw-H6` | base quant | Direct convert |
16
- | `6.0bpw-H6` | base quant | Direct convert |
17
 
18
  H6 = head_bits 6. All variants recompiled with `*.self_attn.*` boosted to 8bpw.
 
19
 
20
- ## How these were made
21
 
22
- ### Base quants
23
  ```bash
24
  python convert.py -i <hf-model> -o <out> -w <work> -b <bpw>
25
  ```
26
  5 base quants: 2.0, 3.0, 5.0, 6.0, 8.0 bpw.
27
 
28
- ### KLD measurement
29
  ```bash
30
  python util/measure.py -r <hf-model> -ms 128 -i <2.0bpw> <8.0bpw> -o measurement.json
31
  ```
32
  Reusable across all optimized targets. Included in main branch.
33
 
34
- ### Optimization (mixed-precision)
35
  ```bash
36
  python util/optimize.py -i <lo-bpw> <hi-bpw> -m measurement.json -o <out> -b <target>
37
  ```
38
- Replaces tensors that matter most (by KLD) with higher-bpw versions.
39
 
40
- ### Recompilation (attn override)
41
  ```yaml
42
  sources:
43
  - id: 8
@@ -47,20 +51,19 @@ overrides:
47
  source: 8
48
  ```
49
  ```bash
50
- python util/recompile.py -i <optimized> -o <final> -or override.yaml
51
  ```
52
- Note: Gemma-3 is dense (no MoE), so `*.shared_experts.*` is not applicable.
53
 
54
  ## Files
55
 
56
- - `main` branch: `measurement.json` (KLD map, reusable)
57
  - Each bpw branch: quantized model shards + config + tokenizer
58
 
59
  ## Credits
60
 
61
  - Base model: [VIDraft/Gemma-3-R1984-27B](https://huggingface.co/VIDraft/Gemma-3-R1984-27B)
62
  - Quantization: [exllamav3](https://github.com/turboderp-org/exllamav3) v0.0.34
63
- - Optimization method: [ArtusDev](https://huggingface.co/ArtusDev)
64
 
65
  ---
66
 
@@ -113,7 +116,7 @@ python util/optimize.py -i /workspace/models/quant-2.5bpw /workspace/models/quan
113
  ## Recompilation
114
  `override.yaml` replaces tensors in one quant with tensors from another quant. It is manual optimization. The source notes that attention and shared expert tensors were replaced with 8bpw tensors for all optimized quants. Recompilation takes about 30s-1m, and the actual new bpw is known after recompilation is done.
115
 
116
- ### Artus multi-source example
117
  ```yaml
118
  sources:
119
  - id: 6
 
1
  # Gemma-3-R1984-27B EXL3
2
 
3
+ EXL3 quants of [VIDraft/Gemma-3-R1984-27B](https://huggingface.co/VIDraft/Gemma-3-R1984-27B) (27B).
4
+ Each bpw variant is a separate branch. Attention tensors boosted to 8bpw via recompilation.
5
+
6
+ Docs: [exllamav3 convert.md](https://github.com/turboderp-org/exllamav3/blob/master/doc/convert.md)
7
 
8
  ## Branches
9
 
10
+ | Branch | Target | Actual bpw | Method |
11
+ |---|---|---|---|
12
+ | `2.96bpw_H6` | 2.0 | 2.96 | base + recompile |
13
+ | `2.98bpw_H6` | 2.5 | 2.98 | optimized (2.0+3.0) + recompile |
14
+ | `3.80bpw_H6` | 3.0 | 3.80 | base + recompile |
15
+ | `3.83bpw_H6` | 3.5 | 3.83 | optimized (3.0+5.0) + recompile |
16
+ | `3.97bpw_H6` | 4.0 | 3.97 | optimized (3.0+5.0) + recompile |
17
+ | `4.13bpw_H6` | 4.5 | 4.13 | optimized (3.0+5.0) + recompile |
18
+ | `5.48bpw_H6` | 5.0 | 5.48 | base + recompile |
19
+ | `6.32bpw_H6` | 6.0 | 6.32 | base + recompile |
20
 
21
  H6 = head_bits 6. All variants recompiled with `*.self_attn.*` boosted to 8bpw.
22
+ Gemma-3 is dense (no MoE), so `*.shared_experts.*` is not applicable.
23
 
24
+ ## Build recipe
25
 
26
+ ### 1. Base quants
27
  ```bash
28
  python convert.py -i <hf-model> -o <out> -w <work> -b <bpw>
29
  ```
30
  5 base quants: 2.0, 3.0, 5.0, 6.0, 8.0 bpw.
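
As a rough sketch of step 1, the five base conversions could be generated in a loop. The flags (`-i`, `-o`, `-w`, `-b`) are the ones shown in the command above; the directory layout (`quant-<bpw>bpw`, `work-<bpw>bpw`), the example paths, and the helper name `convert_commands` are assumptions made for illustration, not part of the original recipe.

```python
import shlex

# Base bpw targets listed in the README.
BASE_BPWS = [2.0, 3.0, 5.0, 6.0, 8.0]


def convert_commands(hf_model, out_root, work_root):
    """Build one convert.py argv list per base bpw (hypothetical helper)."""
    cmds = []
    for bpw in BASE_BPWS:
        cmds.append([
            "python", "convert.py",
            "-i", hf_model,                       # input HF model directory
            "-o", f"{out_root}/quant-{bpw}bpw",   # assumed output naming
            "-w", f"{work_root}/work-{bpw}bpw",   # assumed work-dir naming
            "-b", str(bpw),                       # target bits per weight
        ])
    return cmds


# Print the commands instead of running them, so they can be reviewed first.
for cmd in convert_commands("/path/to/hf-model", "/path/to/out", "/path/to/work"):
    print(shlex.join(cmd))
```

Printing rather than executing keeps the sketch safe to run anywhere; each line could be passed to `subprocess.run` from the exllamav3 checkout once the paths are real.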
31
 
32
+ ### 2. KLD measurement
33
  ```bash
34
  python util/measure.py -r <hf-model> -ms 128 -i <2.0bpw> <8.0bpw> -o measurement.json
35
  ```
36
  Reusable across all optimized targets. Included in main branch.
37
 
38
+ ### 3. Optimization (mixed-precision)
39
  ```bash
40
  python util/optimize.py -i <lo-bpw> <hi-bpw> -m measurement.json -o <out> -b <target>
41
  ```
42
+ KLD-guided tensor replacement: tensors that matter most get higher-bpw versions.
43
 
44
+ ### 4. Recompilation (attn override)
45
  ```yaml
46
  sources:
47
  - id: 8
 
51
  source: 8
52
  ```
53
  ```bash
54
+ python util/recompile.py -i <input> -o <final> -or override.yaml
55
  ```
56
+ Actual bpw is determined after recompile (attn@8bpw shifts average up).
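
The upward shift can be approximated as a weighted average. The sketch below assumes that attention tensors make up about 16% of the quantized weights; that fraction is not stated in the README and is inferred here only because it reproduces the four "base + recompile" rows of the branch table. It does not model the optimized (mixed) branches, whose pre-recompile bpw is not uniform across tensors.

```python
ATTN_BPW = 8.0        # attention tensors are overridden to 8bpw (from the README)
ATTN_FRACTION = 0.16  # assumption: share of weights in *.self_attn.*, inferred from the table


def estimate_actual_bpw(base_bpw, attn_fraction=ATTN_FRACTION, attn_bpw=ATTN_BPW):
    """Weighted average of non-attention weights at base_bpw and attention at attn_bpw."""
    return (1.0 - attn_fraction) * base_bpw + attn_fraction * attn_bpw


# Compare against the base-quant rows of the branch table.
for base in (2.0, 3.0, 5.0, 6.0):
    print(f"target {base:.1f} -> estimated {estimate_actual_bpw(base):.2f} bpw")
```

Under that assumed fraction the estimates come out to 2.96, 3.80, 5.48 and 6.32 bpw, matching the base branches listed above. The value reported after recompilation remains the authoritative one.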
57
 
58
  ## Files
59
 
60
+ - `main` branch: `measurement.json` (KLD map)
61
  - Each bpw branch: quantized model shards + config + tokenizer
62
 
63
  ## Credits
64
 
65
  - Base model: [VIDraft/Gemma-3-R1984-27B](https://huggingface.co/VIDraft/Gemma-3-R1984-27B)
66
  - Quantization: [exllamav3](https://github.com/turboderp-org/exllamav3) v0.0.34
 
67
 
68
  ---
69
 
 
116
  ## Recompilation
117
  `override.yaml` replaces tensors in one quant with tensors from another quant. It is manual optimization. The source notes that attention and shared expert tensors were replaced with 8bpw tensors for all optimized quants. Recompilation takes about 30s-1m, and the actual new bpw is known after recompilation is done.
118
 
119
+ ### Multi-source example
120
  ```yaml
121
  sources:
122
  - id: 6