Gemma 4 26B-A4B-it -- TQ3 Compressed (FP16 weights with TQ3 noise)

TurboQuant TQ3-compressed version of google/gemma-4-26B-A4B-it.

Compression: 3-bit TurboQuant with group size 128. Zero calibration data. Weights are decompressed back to FP16 for maximum compatibility -- any framework that loads HuggingFace models can use this checkpoint directly.

For smaller GPUs, use the native TQ3 checkpoint instead (12 GB on disk, loads on L40S 48GB).

Results

Metric	Original (BF16)	This checkpoint (FP16 with TQ3 noise)
Quality (20 scenarios)	baseline	4.79/5
Serving speed (A100)	9-16 tok/s	14-17 tok/s (with runtime re-compression)
Checkpoint on disk	~52 GB	~52 GB (FP16, same parameter count)
Runtime GPU memory	~52 GB	~12 GB (with `enable_weight_quantization(bits=3)`)

Quality validated on 20 multi-turn conversation scenarios scored by Llama-3.3-70B in an LLM-as-a-judge setup.

Usage

This checkpoint stores standard FP16 weights. It loads like any HuggingFace model but requires transformers >= 5.5.0 for Gemma 4 support.

With runtime TQ3 re-compression (~12 GB GPU memory)

The weights already carry TQ3 quantization noise, so re-compressing them at runtime introduces near-zero additional error while reducing GPU memory from ~52 GB to ~12 GB.

from turboquant_vllm import enable_weight_quantization

enable_weight_quantization(bits=3)
# Then: vllm serve varjosoft/gemma-4-26B-A4B-it-TQ3

Requires turboquant-plus-vllm and an A100 80GB or larger GPU (the full checkpoint must fit in GPU memory during loading before compression).

Standard loading (~52 GB GPU memory)

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("varjosoft/gemma-4-26B-A4B-it-TQ3")
tokenizer = AutoTokenizer.from_pretrained("varjosoft/gemma-4-26B-A4B-it-TQ3")

For smaller GPUs

Use the native TQ3 checkpoint instead. It stores packed 3-bit indices directly (12 GB on disk). Tested on L40S 48GB and H100 80GB using a custom loader that keeps weights compressed in GPU memory.

How It Was Made

Compressed using turboquant-plus-vllm:

Loaded original BF16 checkpoint to CPU
Applied TurboQuant TQ3 compression per weight group (Walsh-Hadamard rotation + Gaussian Lloyd-Max codebook + norm correction)
Decompressed back to FP16 and saved as standard safetensors

The weights carry TQ3 quantization noise but are stored as standard FP16. Any framework that loads HuggingFace models can use this checkpoint directly.

Algorithm

Inspired by TurboQuant (Zandieh, Daliri, Hadian, Mirrokni; ICLR 2026). Our implementation uses a Gaussian Lloyd-Max codebook as an approximation of the paper's distortion-rate framework. Norm correction stores original_norm / reconstruction_norm per group to fix magnitude shrinkage at 3-bit.

Citation

@inproceedings{zandieh2026turboquant,
  title={TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
  author={Zandieh, Amir and Daliri, Majid and Hadian, Majid and Mirrokni, Vahab},
  booktitle={International Conference on Learning Representations},
  year={2026}
}

Compressed by Varjosoft Oy using turboquant-plus-vllm.

Downloads last month: 894

Safetensors

Model size

26B params

Tensor type

F16

Model tree for varjosoft/gemma-4-26B-A4B-it-TQ3

Base model

google/gemma-4-26B-A4B-it

Finetuned

(53)

this model

Paper for varjosoft/gemma-4-26B-A4B-it-TQ3

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

Paper • 2504.19874 • Published Apr 28, 2025 • 32