gemma-4-26B-A4B-it — Cactus CQ (calibrated)
2-, 3-, and 4-bit quantizations of google/gemma-4-26B-A4B-it in the Cactus .weights format for on-device (ARM) inference.
Method
- CQ (Cactus codebook quantization): per-group Hadamard rotation, plus AWQ-style activation scaling and routing-aware per-expert GPTQ for the 2- and 3-bit variants; every MoE expert is calibrated with its own routed-token Hessian.
- Embeddings: CQ4 (orthogonal). Norms / router / biases: FP16.
- Calibration: ~2M tokens of WildChat + AceCode trajectories generated by the model with thinking enabled.
- 4-bit uses plain RTN (round-to-nearest, no GPTQ/AWQ): at 4-bit the activation-scaling/GPTQ calibration is net-harmful (a known AWQ effect at higher bit-widths), so RTN is the best-performing 4-bit variant and keeps quality monotonic with bit-width.
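
As a rough illustration of the two uncalibrated building blocks named above (per-group Hadamard rotation followed by round-to-nearest), here is a minimal sketch. It is not the Cactus implementation: CQ uses a codebook, while this sketch swaps in a plain uniform min/max grid, and the helper names (`hadamard`, `rotate`, `rtn`, `quantize_group`) and the group size of 16 are assumptions chosen for illustration only.

```python
import math

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    assert n > 0 and n & (n - 1) == 0, "group size must be a power of two"
    H = [[1]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-x for x in row] for row in H]
    return H

def rotate(group, H):
    """Apply the orthonormal rotation H / sqrt(n) to one weight group."""
    n = len(group)
    s = 1.0 / math.sqrt(n)
    return [s * sum(H[i][j] * group[j] for j in range(n)) for i in range(n)]

def rtn(group, bits):
    """Round-to-nearest onto a uniform grid spanning [min, max] of the group."""
    levels = (1 << bits) - 1
    lo, hi = min(group), max(group)
    scale = ((hi - lo) / levels) or 1.0  # avoid division by zero for constant groups
    codes = [min(levels, max(0, round((x - lo) / scale))) for x in group]
    return codes, scale, lo

def dequantize(codes, scale, lo):
    return [c * scale + lo for c in codes]

def quantize_group(group, bits):
    """Rotate, quantize with RTN, dequantize, and rotate back."""
    H = hadamard(len(group))
    codes, scale, lo = rtn(rotate(group, H), bits)
    # The Sylvester matrix is symmetric and H / sqrt(n) is orthonormal,
    # so the same rotation is its own inverse.
    return rotate(dequantize(codes, scale, lo), H)

# Hypothetical 16-weight group, used only to exercise the functions.
weights = [math.sin(i) for i in range(16)]
for bits in (2, 3, 4):
    approx = quantize_group(weights, bits)
    err = math.sqrt(sum((a - b) ** 2 for a, b in zip(weights, approx)))
    print(f"{bits}-bit L2 reconstruction error: {err:.4f}")
```

The rotation spreads outlier weights across the whole group before rounding, which is the usual motivation for applying it ahead of a low-bit grid. The calibrated steps (AWQ-style scaling, GPTQ with per-expert Hessians) are deliberately left out of this sketch.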
Quality — held-out completion perplexity (56k answer tokens, ±~2 PPL noise)
| Variant | PPL |
|---|---|
| bf16 baseline | 7.25 |
| 2-bit RTN (uncalibrated) | 33,827 |
| 2-bit calibrated | 6.81 |
| 3-bit RTN (uncalibrated) | 32.56 |
| 3-bit calibrated | 6.32 |
| 4-bit (RTN) | 6.00 |
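
The card does not spell out how the perplexity figures are computed. Assuming the standard definition (the exponential of the mean negative log-likelihood, taken over the answer tokens only), a minimal sketch looks like this; the function name `perplexity` and the example log-probabilities are hypothetical.

```python
import math

def perplexity(token_logprobs):
    """exp(mean negative log-likelihood) over natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

# Hypothetical example: four answer tokens, each assigned probability 0.5.
example = [math.log(0.5)] * 4
print(perplexity(example))
```

Under this definition, only the log-probabilities of completion tokens would be passed in; prompt tokens would be excluded before the call.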
Files
- weights/gemma-4-26b-a4b-it-cq2.zip: 2-bit calibrated (~2.36 bits/weight overall)
- weights/gemma-4-26b-a4b-it-cq3.zip: 3-bit calibrated
- weights/gemma-4-26b-a4b-it-cq4.zip: 4-bit RTN
Runs on-device via the Cactus runtime (ARM).