WeReCooking/Gemma-3-R1984-27B-EXL3

Text Generation

exl3


Nekochu committed 13 days ago

Commit

be5202e

verified ·

1 Parent(s): 289dd45

Actual bpw branches, clean README + guide appended


README.md +25 -22


@@ -1,43 +1,47 @@
 # Gemma-3-R1984-27B EXL3
-EXL3 quants of [VIDraft/Gemma-3-R1984-27B](https://huggingface.co/VIDraft/Gemma-3-R1984-27B) (27B). Each bpw variant is a separate branch. Attention tensors boosted to 8bpw via recompilation.
 ## Branches
-| Branch | Action | Description |
-|---|---|---|
-| `2.0bpw-H6` | base quant | Lowest size |
-| `2.5bpw-H6` | optimized (2.0+3.0) | KLD-mixed |
-| `3.0bpw-H6` | base quant | Direct convert |
-| `3.5bpw-H6` | optimized (3.0+5.0) | KLD-mixed |
-| `4.0bpw-H6` | optimized (3.0+5.0) | KLD-mixed |
-| `4.5bpw-H6` | optimized (3.0+5.0) | KLD-mixed |
-| `5.0bpw-H6` | base quant | Direct convert |
-| `6.0bpw-H6` | base quant | Direct convert |
 H6 = head_bits 6. All variants recompiled with `*.self_attn.*` boosted to 8bpw.
-## How these were made
-### Base quants
 ```bash
 python convert.py -i <hf-model> -o <out> -w <work> -b <bpw>
 ```
 5 base quants: 2.0, 3.0, 5.0, 6.0, 8.0 bpw.
-### KLD measurement
 ```bash
 python util/measure.py -r <hf-model> -ms 128 -i <2.0bpw> <8.0bpw> -o measurement.json
 ```
 Reusable across all optimized targets. Included in main branch.
-### Optimization (mixed-precision)
 ```bash
 python util/optimize.py -i <lo-bpw> <hi-bpw> -m measurement.json -o <out> -b <target>
 ```
-Replaces tensors that matter most (by KLD) with higher-bpw versions.
-### Recompilation (attn override)
 ```yaml
 sources:
   - id: 8
@@ -47,20 +51,19 @@ overrides:
     source: 8
 ```
 ```bash
-python util/recompile.py -i <optimized> -o <final> -or override.yaml
 ```
-Note: Gemma-3 is dense (no MoE), so `*.shared_experts.*` is not applicable.
 ## Files
-- `main` branch: `measurement.json` (KLD map, reusable)
 - Each bpw branch: quantized model shards + config + tokenizer
 ## Credits
 - Base model: [VIDraft/Gemma-3-R1984-27B](https://huggingface.co/VIDraft/Gemma-3-R1984-27B)
 - Quantization: [exllamav3](https://github.com/turboderp-org/exllamav3) v0.0.34
-- Optimization method: [ArtusDev](https://huggingface.co/ArtusDev)
 ---
@@ -113,7 +116,7 @@ python util/optimize.py -i /workspace/models/quant-2.5bpw /workspace/models/quan
 ## Recompilation
 `override.yaml` replaces tensors in one quant with tensors from another quant. It is manual optimization. The source notes that attention and shared expert tensors were replaced with 8bpw tensors for all optimized quants. Recompilation takes about 30s-1m, and the actual new bpw is known after recompilation is done.
-### Artus multi-source example
 ```yaml
 sources:
   - id: 6

 # Gemma-3-R1984-27B EXL3
+EXL3 quants of [VIDraft/Gemma-3-R1984-27B](https://huggingface.co/VIDraft/Gemma-3-R1984-27B) (27B).
+Each bpw variant is a separate branch. Attention tensors boosted to 8bpw via recompilation.
+Docs: [exllamav3 convert.md](https://github.com/turboderp-org/exllamav3/blob/master/doc/convert.md)
 ## Branches
+| Branch | Target | Actual bpw | Method |
+|---|---|---|---|
+| `2.96bpw_H6` | 2.0 | 2.96 | base + recompile |
+| `2.98bpw_H6` | 2.5 | 2.98 | optimized (2.0+3.0) + recompile |
+| `3.80bpw_H6` | 3.0 | 3.80 | base + recompile |
+| `3.83bpw_H6` | 3.5 | 3.83 | optimized (3.0+5.0) + recompile |
+| `3.97bpw_H6` | 4.0 | 3.97 | optimized (3.0+5.0) + recompile |
+| `4.13bpw_H6` | 4.5 | 4.13 | optimized (3.0+5.0) + recompile |
+| `5.48bpw_H6` | 5.0 | 5.48 | base + recompile |
+| `6.32bpw_H6` | 6.0 | 6.32 | base + recompile |
 H6 = head_bits 6. All variants recompiled with `*.self_attn.*` boosted to 8bpw.
+Gemma-3 is dense (no MoE), so `*.shared_experts.*` is not applicable.
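To compare branches against a memory budget, a rough weight-size estimate can be derived from the "Actual bpw" column. The sketch below is illustrative only: `PARAM_COUNT` is an assumed nominal 27B, the helper names are hypothetical, and the estimate ignores the KV cache, activations, and any tensors not covered by the average bpw.

```python
# Rough weight-size estimate per branch: parameters * bits-per-weight / 8.
# PARAM_COUNT is an assumption (nominal 27B); the real checkpoint may differ.
PARAM_COUNT = 27e9

# Actual bpw values copied from the branch table above.
BRANCH_BPW = {
    "2.96bpw_H6": 2.96,
    "2.98bpw_H6": 2.98,
    "3.80bpw_H6": 3.80,
    "3.83bpw_H6": 3.83,
    "3.97bpw_H6": 3.97,
    "4.13bpw_H6": 4.13,
    "5.48bpw_H6": 5.48,
    "6.32bpw_H6": 6.32,
}


def weight_gb(bpw, params=PARAM_COUNT):
    """Approximate weight size in GB (10^9 bytes) at the given bpw."""
    return params * bpw / 8 / 1e9


def largest_fitting(budget_gb):
    """Return the highest-bpw branch whose estimated weights fit, or None."""
    fitting = [
        (bpw, name)
        for name, bpw in BRANCH_BPW.items()
        if weight_gb(bpw) <= budget_gb
    ]
    return max(fitting)[1] if fitting else None


if __name__ == "__main__":
    for name, bpw in BRANCH_BPW.items():
        print(f"{name}: ~{weight_gb(bpw):.1f} GB of weights")
```

Leave headroom on top of the estimate for context length and runtime overhead; the figure covers weights only.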
+## Build recipe
+### 1. Base quants
 ```bash
 python convert.py -i <hf-model> -o <out> -w <work> -b <bpw>
 ```
 5 base quants: 2.0, 3.0, 5.0, 6.0, 8.0 bpw.
+### 2. KLD measurement
 ```bash
 python util/measure.py -r <hf-model> -ms 128 -i <2.0bpw> <8.0bpw> -o measurement.json
 ```
 Reusable across all optimized targets. Included in main branch.
+### 3. Optimization (mixed-precision)
 ```bash
 python util/optimize.py -i <lo-bpw> <hi-bpw> -m measurement.json -o <out> -b <target>
 ```
+KLD-guided tensor replacement: tensors that matter most get higher-bpw versions.
+### 4. Recompilation (attn override)
 ```yaml
 sources:
   - id: 8
     source: 8
 ```
 ```bash
+python util/recompile.py -i <input> -o <final> -or override.yaml
 ```
+Actual bpw is determined after recompile (attn@8bpw shifts average up).
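Why the average shifts up can be seen with a weighted mean over tensor groups. This is a minimal sketch: `average_bpw` is a hypothetical helper, and the 20% attention share is an illustrative assumption, not a measured property of Gemma-3.

```python
def average_bpw(groups):
    """Weighted mean bpw over (weight_fraction, bpw) pairs.

    Fractions are each group's share of total weights and must sum to 1.
    """
    total = sum(frac for frac, _ in groups)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("weight fractions must sum to 1")
    return sum(frac * bpw for frac, bpw in groups)


# Hypothetical split: 20% of weights in attention at 8 bpw,
# the remaining 80% left at the 3.0 bpw base quant.
boosted = average_bpw([(0.2, 8.0), (0.8, 3.0)])
print(f"average after attention boost: {boosted:.2f} bpw")
```

Under these assumed shares, a 3.0 bpw target lands near 4.0 bpw after the override. The branch names in this repo use the value reported after recompilation rather than an estimate like this.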
 ## Files
+- `main` branch: `measurement.json` (KLD map)
 - Each bpw branch: quantized model shards + config + tokenizer
 ## Credits
 - Base model: [VIDraft/Gemma-3-R1984-27B](https://huggingface.co/VIDraft/Gemma-3-R1984-27B)
 - Quantization: [exllamav3](https://github.com/turboderp-org/exllamav3) v0.0.34
 ---
 ## Recompilation
 `override.yaml` replaces tensors in one quant with tensors from another quant. It is manual optimization. The source notes that attention and shared expert tensors were replaced with 8bpw tensors for all optimized quants. Recompilation takes about 30s-1m, and the actual new bpw is known after recompilation is done.
+### Multi-source example
 ```yaml
 sources:
   - id: 6