gemma-4-26B-A4B-it — Cactus CQ (calibrated)

2-, 3-, and 4-bit quantizations of google/gemma-4-26B-A4B-it in the Cactus .weights format for on-device (ARM) inference.

Method

CQ (Cactus codebook quantization): per-group Hadamard rotation, AWQ-style activation scaling (2- and 3-bit only), and routing-aware per-expert GPTQ, in which every MoE expert is calibrated with the Hessian of its own routed tokens.
Embeddings: CQ4 (orthogonal). Norms / router / biases: FP16.
Calibration: ~2M tokens of WildChat + AceCode trajectories generated by the model with thinking enabled.
4-bit uses plain round-to-nearest (RTN) with no GPTQ or AWQ. At 4 bits the activation-scaling/GPTQ calibration is net-harmful (a known effect of AWQ at higher bit-widths), so RTN is the best-performing 4-bit variant and keeps quality monotonic in bit-width.
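
As a rough illustration of the two calibration-free pieces named above (per-group Hadamard rotation followed by RTN), here is a minimal pure-Python sketch. It is not the Cactus implementation: the group size of 8, the asymmetric min/max grid, the uniform levels (rather than a learned codebook), and all function names are assumptions made for the example.

```python
import math

def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]

def rtn_quantize(group, bits):
    """Round-to-nearest onto a uniform grid of 2**bits levels spanning [min, max]."""
    levels = (1 << bits) - 1
    lo, hi = min(group), max(group)
    step = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((v - lo) / step))) for v in group]
    return codes, lo, step

def rtn_dequantize(codes, lo, step):
    """Map integer codes back to grid values."""
    return [lo + c * step for c in codes]

def quantize_group(group, bits):
    """Rotate, quantize, dequantize, and rotate back one weight group."""
    rotated = fwht(group)  # spreads an outlier across the whole group
    codes, lo, step = rtn_quantize(rotated, bits)
    # The normalized Hadamard matrix is symmetric and orthogonal,
    # so applying the same transform again undoes the rotation.
    restored = fwht(rtn_dequantize(codes, lo, step))
    return codes, restored

# Hypothetical weight group with a single large outlier.
weights = [0.02, -0.01, 0.90, 0.03, -0.04, 0.01, 0.00, -0.02]
codes, restored = quantize_group(weights, bits=4)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, max_err)
```

The rotation is the point of interest: without it, one outlier stretches the grid and the small weights collapse onto a few levels; after rotation the outlier's energy is shared across the group, so the grid is used more evenly.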

weights/gemma-4-26b-a4b-it-cq2.zip — 2-bit calibrated (~2.36 bits/weight overall)
weights/gemma-4-26b-a4b-it-cq3.zip — 3-bit calibrated
weights/gemma-4-26b-a4b-it-cq4.zip — 4-bit RTN
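
For a sense of scale, the bits-per-weight figure can be turned into an approximate weight size. The sketch below assumes the nominal 26B parameter count implied by the model name and ignores file metadata and zip compression, so treat the result as an estimate rather than the actual download size. The function name is hypothetical.

```python
def approx_weight_size_gb(num_params, bits_per_weight):
    """Approximate storage in decimal gigabytes: params * bits / 8 bytes."""
    return num_params * bits_per_weight / 8 / 1e9

NOMINAL_PARAMS = 26e9  # assumption: nominal count taken from the "26B" in the model name

# 2-bit variant, using the ~2.36 bits/weight overall figure stated above.
cq2_gb = approx_weight_size_gb(NOMINAL_PARAMS, 2.36)
print(f"cq2: about {cq2_gb:.1f} GB")
```

The overall bits/weight for the 3-bit and 4-bit variants is not stated here; plugging in the nominal 3 or 4 bits would give only a lower bound, since scales, embeddings, and FP16 tensors add overhead.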

Runs on-device via the Cactus runtime (ARM).
