# Gemma-3-R1984-27B EXL3

EXL3 quants of VIDraft/Gemma-3-R1984-27B (27B). Each bpw variant is a separate branch. Attention tensors (`*.self_attn.*`) are boosted to 8bpw via recompilation.

Docs: exllamav3 `convert.md`
## Branches

| Branch | Target bpw | Actual bpw | Method |
|---|---|---|---|
| 2.96bpw_H6 | 2.0 | 2.96 | base + recompile |
| 2.98bpw_H6 | 2.5 | 2.98 | optimized (2.0+3.0) + recompile |
| 3.80bpw_H6 | 3.0 | 3.80 | base + recompile |
| 3.83bpw_H6 | 3.5 | 3.83 | optimized (3.0+5.0) + recompile |
| 3.97bpw_H6 | 4.0 | 3.97 | optimized (3.0+5.0) + recompile |
| 4.13bpw_H6 | 4.5 | 4.13 | optimized (3.0+5.0) + recompile |
| 5.48bpw_H6 | 5.0 | 5.48 | base + recompile |
| 6.32bpw_H6 | 6.0 | 6.32 | base + recompile |
H6 = `head_bits` 6. All variants are recompiled with `*.self_attn.*` boosted to 8bpw.
Gemma-3 is dense (no MoE), so `*.shared_experts.*` is not applicable.
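
To fetch a single variant, download just its branch. A minimal sketch using `huggingface-cli` (the repo id below is a placeholder; substitute this repo's actual id):

```bash
# Download only the 4.13bpw_H6 branch into a local directory.
# <user>/Gemma-3-R1984-27B-EXL3 is a placeholder repo id.
huggingface-cli download <user>/Gemma-3-R1984-27B-EXL3 \
  --revision 4.13bpw_H6 \
  --local-dir ./Gemma-3-R1984-27B-exl3-4.13bpw
```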
## Build recipe

### 1. Base quants

```bash
python convert.py -i <hf-model> -o <out> -w <work> -b <bpw>
```

Five base quants: 2.0, 3.0, 5.0, 6.0, 8.0 bpw.
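
All five base conversions can be scripted in one pass; a minimal sketch, assuming placeholder paths:

```bash
# Produce the five base quants (placeholder paths;
# each conversion gets its own working directory).
for b in 2.0 3.0 5.0 6.0 8.0; do
  python convert.py \
    -i /models/Gemma-3-R1984-27B \
    -o /models/quants-${b}bpw \
    -w /tmp/exl3-work-${b} \
    -b "$b"
done
```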
### 2. KLD measurement

```bash
python util/measure.py -r <hf-model> -ms 128 -i <2.0bpw> <8.0bpw> -o measurement.json
```

Reusable across all optimized targets. Included in the `main` branch.
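
For example, against the base quants from step 1 (same placeholder paths):

```bash
# Measure KLD between the 2.0bpw and 8.0bpw base quants,
# with the original HF model as reference (placeholder paths).
python util/measure.py \
  -r /models/Gemma-3-R1984-27B \
  -ms 128 \
  -i /models/quants-2.0bpw /models/quants-8.0bpw \
  -o measurement.json
```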
### 3. Optimization (mixed-precision)

```bash
python util/optimize.py -i <lo-bpw> <hi-bpw> -m measurement.json -o <out> -b <target>
```

KLD-guided tensor replacement: the tensors that matter most get higher-bpw versions.
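
For example, the 2.98bpw_H6 branch starts from the 2.0 and 3.0 base quants with a 2.5 bpw target; a sketch with the same placeholder paths:

```bash
# Build the 2.5bpw mixed quant from the 2.0/3.0 base pair (placeholder paths).
python util/optimize.py \
  -i /models/quants-2.0bpw /models/quants-3.0bpw \
  -m measurement.json \
  -o /models/quant-2.5-optimized \
  -b 2.5
```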
### 4. Recompilation (attn override)

```yaml
sources:
  - id: 8
    model_dir: /path/to/8.0bpw
overrides:
  - key: "*.self_attn.*"
    source: 8
```

```bash
python util/recompile.py -i <input> -o <final> -or override.yaml
```

Actual bpw is determined after recompilation (attention at 8bpw shifts the average up).
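
Putting steps 3 and 4 together for one branch; a minimal end-to-end sketch with the placeholder paths used above (`override.yaml` is the file shown in step 4):

```bash
# Optimize 3.0+5.0 toward a 4.0bpw target, then boost attention to 8bpw.
python util/optimize.py \
  -i /models/quants-3.0bpw /models/quants-5.0bpw \
  -m measurement.json -o /models/quant-4.0-optimized -b 4.0
python util/recompile.py \
  -i /models/quant-4.0-optimized \
  -o /models/Gemma-3-R1984-27B-3.97bpw_H6 \
  -or override.yaml
```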
## Files

- `main` branch: `measurement.json` (KLD map)
- Each bpw branch: quantized model shards + config + tokenizer
## Credits
- Base model: VIDraft/Gemma-3-R1984-27B
- Quantization: exllamav3 v0.0.34
# EXL3 Optimization Guide

## Targets

`2.5bpw_H6` `3.0bpw_H6` `3.5bpw_H6` `4.0bpw_H6` `4.5bpw_H6` `5.0bpw_H6` `6.0bpw_H6`
| Target | Action |
|---|---|
| 2.5bpw_H6 | optimized |
| 3.0bpw_H6 | direct convert |
| 3.5bpw_H6 | optimized |
| 4.0bpw_H6 | optimized |
| 4.5bpw_H6 | optimized |
| 5.0bpw_H6 | direct convert |
| 6.0bpw_H6 | direct convert |
## Overview

Dynamic EXL3 quants mix tensor precision, similar to mixed-precision GGUFs. There are two mechanisms:

- Optimization
- Recompilation

Usually the two are used together: create a mixed quant through optimization, then run recompilation on top of it.
## Optimization

- Start with two quants at different bpw, for example 2bpw and 3bpw.
- `measure.py` measures KLD differences by replacing layer groups in the lower-bpw quant with groups from the higher-bpw quant; standard EXL3 calibration data is used.
- The resulting `measurement.json` can be reused: you only have to create it once, no matter how many mixed quants you make.
- `optimize.py` uses that `measurement.json` to create a third quant from the two source quants, replacing the tensors that matter most with higher-bpw tensors.

Measurement takes about 20 minutes to an hour for big models; optimization takes about 30 seconds to a minute.
```bash
python util/measure.py -i /path/to/model-2bpw /path/to/model-3bpw -r /path/to/hf-model -o measurement.json -cr 10 -cc 1024 -d 0
python util/optimize.py -i /path/to/model-2bpw /path/to/model-3bpw -m measurement.json -o /path/to/model-optimized -b 2.5 -ss 8192
```

Alternative measure form with `-ms`:

```bash
python util/measure.py -r /workspace/models/original-model -ms 128 -i /workspace/models/quant-2.5bpw /workspace/models/quant-3.5bpw -o /workspace/measurement.json
```

Optimize example:

```bash
python util/optimize.py -i /workspace/models/quant-2.5bpw /workspace/models/quant-3.0bpw -o /workspace/models/new-quant-2.75bpw -m /workspace/measurement.json -b 2.75
```
## Recompilation

`override.yaml` replaces tensors in one quant with tensors from another quant; it is, in effect, manual optimization. For all optimized quants here, attention and shared-expert tensors were replaced with 8bpw tensors (for Gemma-3, only attention applies, since there are no shared experts). Recompilation takes about 30 seconds to a minute, and the actual new bpw is known only after recompilation finishes.
### Multi-source example

```yaml
sources:
  - id: 6
    model_dir: /path/to/6bpw
  - id: 8
    model_dir: /path/to/8bpw
overrides:
  - key: "*.self_attn.*"
    source: 6
  - key: "*.shared_experts.*"
    source: 8
```
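
Applying an override file is the same regardless of how many sources it lists; a sketch with placeholder paths:

```bash
# Apply the multi-source override file above (placeholder paths).
python util/recompile.py \
  -i /path/to/base-quant \
  -o /path/to/quant-recompiled \
  -or override.yaml
```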
### GLM-Air example

This example replaces attention and shared experts with 8bpw tensors, and layers 2, 43, 1, and 29 with 5bpw tensors, because `measurement.json` showed those layers had the worst KLD.
```yaml
sources:
  - id: 8
    model_dir: /workspace/models/quants-8.0bpw
  - id: 5
    model_dir: /workspace/models/quants-5.0bpw
overrides:
  - key: "*.self_attn.*"
    source: 8
  - key: "*.shared_experts.*"
    source: 8
  - key: "model.layers.2.*"
    source: 5
  - key: "model.layers.43.*"
    source: 5
  - key: "model.layers.1.*"
    source: 5
  - key: "model.layers.29.*"
    source: 5
```

```bash
python util/recompile.py -i /workspace/models/quant-2.75bpw -o /workspace/models/quant-recompiled -or override.yaml
```