
Gemma-3-R1984-27B EXL3

EXL3 quants of VIDraft/Gemma-3-R1984-27B (27B). Each bpw variant is a separate branch. Attention tensors boosted to 8bpw via recompilation.

Docs: exllamav3 convert.md

Branches

Branch        Target bpw   Actual bpw   Method
2.96bpw_H6    2.0          2.96         base + recompile
2.98bpw_H6    2.5          2.98         optimized (2.0+3.0) + recompile
3.80bpw_H6    3.0          3.80         base + recompile
3.83bpw_H6    3.5          3.83         optimized (3.0+5.0) + recompile
3.97bpw_H6    4.0          3.97         optimized (3.0+5.0) + recompile
4.13bpw_H6    4.5          4.13         optimized (3.0+5.0) + recompile
5.48bpw_H6    5.0          5.48         base + recompile
6.32bpw_H6    6.0          6.32         base + recompile

H6 = head_bits 6. All variants recompiled with *.self_attn.* boosted to 8bpw. Gemma-3 is dense (no MoE), so *.shared_experts.* is not applicable.

Build recipe

1. Base quants

python convert.py -i <hf-model> -o <out> -w <work> -b <bpw>

5 base quants: 2.0, 3.0, 5.0, 6.0, 8.0 bpw.
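Expanded for this card, step 1 produces all five base quants in one loop. A sketch only; the /models and /quants paths are placeholders, the flags are the documented convert.py flags:

# assumption: the HF model is downloaded to /models/Gemma-3-R1984-27B
for bpw in 2.0 3.0 5.0 6.0 8.0; do
  python convert.py -i /models/Gemma-3-R1984-27B -o /quants/gemma3-${bpw}bpw -w /work/${bpw} -b ${bpw}
done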

2. KLD measurement

python util/measure.py -r <hf-model> -ms 128 -i <2.0bpw> <8.0bpw> -o measurement.json

Reusable across all optimized targets. Included in main branch.
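Filled in with the same placeholder paths, the measurement pass runs once between the 2.0 and 8.0 base quants:

python util/measure.py -r /models/Gemma-3-R1984-27B -ms 128 -i /quants/gemma3-2.0bpw /quants/gemma3-8.0bpw -o measurement.json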

3. Optimization (mixed-precision)

python util/optimize.py -i <lo-bpw> <hi-bpw> -m measurement.json -o <out> -b <target>

KLD-guided tensor replacement: tensors that matter most get higher-bpw versions.
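A concrete instance (placeholder paths again): the 4.0 target from the branch table, mixed from the 3.0 and 5.0 base quants:

python util/optimize.py -i /quants/gemma3-3.0bpw /quants/gemma3-5.0bpw -m measurement.json -o /quants/gemma3-4.0bpw-opt -b 4.0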

4. Recompilation (attn override)

override.yaml:

sources:
  - id: 8
    model_dir: /path/to/8.0bpw
overrides:
  - key: "*.self_attn.*"
    source: 8

python util/recompile.py -i <input> -o <final> -or override.yaml

Actual bpw is determined after recompile (attn@8bpw shifts average up).
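Tying steps 3 and 4 together for that same branch, with the override written inline (placeholder paths; the output name uses the post-recompile bpw from the table):

cat > override.yaml <<'EOF'
sources:
  - id: 8
    model_dir: /quants/gemma3-8.0bpw
overrides:
  - key: "*.self_attn.*"
    source: 8
EOF
python util/recompile.py -i /quants/gemma3-4.0bpw-opt -o /quants/gemma3-3.97bpw_H6 -or override.yaml

As a rough consistency check on the "shifts average up" note: for the base + recompile rows of the branch table, actual ≈ 8·f + target·(1−f), where f is the fraction of weights sitting in attention. All four base rows fit f ≈ 0.16 (e.g. 8·0.16 + 2.0·0.84 ≈ 2.96); that figure is inferred from the table, not stated by the tooling.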

Files

  • main branch: measurement.json (KLD map)
  • Each bpw branch: quantized model shards + config + tokenizer

Credits


EXL3 Optimization Guide

Targets


Target        Action
2.5bpw_H6     optimized
3.0bpw_H6     direct convert
3.5bpw_H6     optimized
4.0bpw_H6     optimized
4.5bpw_H6     optimized
5.0bpw_H6     direct convert
6.0bpw_H6     direct convert

Overview

Dynamic EXL3 quants mix tensor precision within one model, similar to mixed-precision GGUFs. Two techniques are involved:

  • Optimization
  • Recompilation

Usually, optimization and recompilation are used together: create a mixed quant through optimization, then run recompilation on top of it.

Optimization

  1. Start with two quants at different bpw, for example 2bpw and 3bpw.
  2. measure.py measures KLD differences by replacing layer groups in the lower-bpw quant with groups from the higher-bpw quant; standard EXL3 calibration data is used.
  3. The resulting measurement.json can be reused. You only have to create it once, no matter how many mixed quants you make.
  4. optimize.py uses that measurement.json to create a third quant from two source quants, replacing the tensors that matter most with higher-bpw tensors.

Measurement takes roughly 20 minutes to an hour for big models; optimization takes about 30 seconds to a minute.

python util/measure.py -i /path/to/model-2bpw /path/to/model-3bpw -r /path/to/hf-model -o measurement.json -cr 10 -cc 1024 -d 0
python util/optimize.py -i /path/to/model-2bpw /path/to/model-3bpw -m measurement.json -o /path/to/model-optimized -b 2.5 -ss 8192

Alternative measure form with -ms:

python util/measure.py -r /workspace/models/original-model -ms 128 -i /workspace/models/quant-2.5bpw /workspace/models/quant-3.5bpw -o /workspace/measurement.json

Optimize example:

python util/optimize.py -i /workspace/models/quant-2.5bpw /workspace/models/quant-3.0bpw -o /workspace/models/new-quant-2.75bpw -m /workspace/measurement.json -b 2.75

Recompilation

An override.yaml replaces tensors in one quant with tensors from another quant; it is manual optimization. For the quants in this repo, attention tensors (and shared-expert tensors, where the model has them) were replaced with 8bpw tensors in all optimized quants. Recompilation takes about 30 seconds to a minute, and the actual new bpw is known only after it finishes.

Multi-source example

sources:
  - id: 6
    model_dir: /path/to/6bpw
  - id: 8
    model_dir: /path/to/8bpw
overrides:
  - key: "*.self_attn.*"
    source: 6
  - key: "*.shared_experts.*"
    source: 8

GLM-Air example

The GLM-Air example replaces attention and shared experts with 8bpw tensors, and layers 2, 43, 1, 29 with 5bpw tensors because measurement.json showed those layers had the worst KLD.

sources:
  - id: 8
    model_dir: /workspace/models/quants-8.0bpw
  - id: 5
    model_dir: /workspace/models/quants-5.0bpw
overrides:
  - key: "*.self_attn.*"
    source: 8
  - key: "*.shared_experts.*"
    source: 8
  - key: "model.layers.2.*"
    source: 5
  - key: "model.layers.43.*"
    source: 5
  - key: "model.layers.1.*"
    source: 5
  - key: "model.layers.29.*"
    source: 5

python util/recompile.py -i /workspace/models/quant-2.75bpw -o /workspace/models/quant-recompiled -or override.yaml