Gemma4-E2B-ternary

Gemma 4 E2B (2.3B effective / 5.1B with embeddings) with both text and vision backbones quantized to ternary weights. 357 transformer layers quantized across language model and vision encoder; weights stored as 2-bit packed ternary codes with per-group float16 scales. Supports image+text inputs (Gemma 4 multimodal). Load with ternary-quant for inference on GPU, CPU (auto BF16), or Apple Silicon.

Compression: 1.8× (text + vision backbone) vs FP16 | Quality: coherent output ✓ (text + image verified)
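The storage format above (2-bit packed ternary codes, four values per byte, with a float16 scale per group of 32 weights) can be sketched as follows. This is an illustrative reimplementation of the packing idea, not the library's actual on-disk layout:

```python
import numpy as np

def pack_ternary(codes: np.ndarray) -> np.ndarray:
    """Pack ternary codes {-1, 0, +1} into bytes, 4 values per byte.
    Each code occupies 2 bits: -1 -> 0b00, 0 -> 0b01, +1 -> 0b10."""
    assert codes.size % 4 == 0
    u = (codes + 1).astype(np.uint8)            # map {-1, 0, 1} -> {0, 1, 2}
    u = u.reshape(-1, 4)
    return u[:, 0] | (u[:, 1] << 2) | (u[:, 2] << 4) | (u[:, 3] << 6)

def unpack_ternary(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_ternary."""
    u = np.stack([(packed >> s) & 0b11 for s in (0, 2, 4, 6)], axis=1)
    return u.astype(np.int8).reshape(-1) - 1    # map {0, 1, 2} -> {-1, 0, 1}

w = np.random.default_rng(0).integers(-1, 2, size=64)  # 64 ternary weights
packed = pack_ternary(w)                               # 16 bytes: 4x smaller
assert packed.nbytes == 16
assert np.array_equal(unpack_ternary(packed), w)
```

At inference time, each unpacked group of 32 codes is multiplied by its float16 scale (plus offset) to recover approximate weights.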

Usage

pip install ternary-quant

from ternary_quant.inference import load_ternary_model

# GPU - fastest (dequantize once, use cuBLAS)
model, tokenizer = load_ternary_model(
    "AsadIsmail/Gemma4-E2B-ternary", device="cuda", runtime_mode="cached"
)

# GPU - lowest VRAM (~2.3 GB, runtime_mode='gemlite' or 'triton')
model, tokenizer = load_ternary_model(
    "AsadIsmail/Gemma4-E2B-ternary", device="cuda", runtime_mode="gemlite"
)

# CPU - auto BF16 on Intel AVX2+
model, tokenizer = load_ternary_model(
    "AsadIsmail/Gemma4-E2B-ternary", device="cpu", runtime_mode="cached"
)

# Apple Silicon
model, tokenizer = load_ternary_model(
    "AsadIsmail/Gemma4-E2B-ternary", device="mps", runtime_mode="metal"
)

# Generate
from ternary_quant.inference import generate_text
output = generate_text(model, tokenizer, "Ternary quantization works by", max_new_tokens=100)
print(output)
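The runtime_mode='cached' setting used above trades memory for speed: packed weights are dequantized to FP16 once at load, and every subsequent forward pass is a plain dense matmul. A minimal sketch of the idea, using hypothetical helper names rather than the library's internals:

```python
import numpy as np

def dequantize_group(codes, scale, offset):
    """Reconstruct one weight group: w ~= codes * scale + offset."""
    return codes.astype(np.float16) * scale + offset

# One 32-weight group stored as ternary codes + per-group fp16 scale/offset.
codes = np.array([1, -1, 0, 1] * 8, dtype=np.int8)  # 32 ternary codes
scale = np.float16(0.042)
offset = np.float16(0.0)

# "cached" mode: pay the dequantize cost once at load time...
w_fp16 = dequantize_group(codes, scale, offset)

# ...then reuse the dense FP16 weights for every matmul thereafter.
x = np.ones(32, dtype=np.float16)
y = x @ w_fp16  # standard dense dot product, no per-step unpacking
```

The 'gemlite'/'triton' modes instead keep weights packed and dequantize inside the kernel, which is why they hit the ~2.3 GB VRAM floor at some speed cost.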

Performance

| Hardware | Speed | Mode |
|---|---|---|
| NVIDIA GPU (CUDA) | 15.6 tok/s | runtime_mode='cached' (dequantize to FP16 once) |
| Intel CPU (BF16 auto-detected) | 6.3 tok/s | runtime_mode='cached' (auto BF16) |
| Apple Silicon | - | runtime_mode='metal' |

Quantization details

| Property | Value |
|---|---|
| Base model | google/gemma-4-E2B-it |
| Quantized components | text_backbone + vision_backbone |
| Scheme | tritplane3 (sum of ternary planes) |
| Weight values | {-1, 0, +1}, packed 2-bit (4 values per byte) |
| Group size | 32 (per-group float16 scale + offset) |
| FP16 VRAM | 4.0 GB |
| Packed VRAM | ~2.3 GB (use runtime_mode='gemlite' or 'triton') |
| Compression | 1.8× (text + vision backbone) |
| Quality | coherent output ✓ (text + image verified) |
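The compression figure follows from the storage format: 2 bits per code plus a float16 scale and offset per group of 32 costs 2 + 32/32 = 3 bits per weight, i.e. ~5.3x versus FP16 on the quantized tensors; the overall ratio is lower because embeddings and other components stay in full precision. A back-of-envelope check against the table above (the implied quantized fraction is an inference for illustration, not a stated figure):

```python
# Back-of-envelope check of the table's figures (illustrative only).
GROUP = 32
code_bits = 2                 # 2-bit packed ternary codes
meta_bits = 16 + 16           # per-group float16 scale + offset
bits_per_weight = code_bits + meta_bits / GROUP     # = 3.0
tensor_compression = 16 / bits_per_weight           # ~5.33x on quantized tensors

# Overall ratio is lower: some bytes (embeddings etc.) remain FP16.
fp16_gb, packed_gb = 4.0, 2.3
overall = fp16_gb / packed_gb                       # ~1.74x, reported as 1.8x

# Implied fraction p of the FP16 footprint that was quantized:
# packed = fp16*(1-p) + fp16*p*(3/16)  =>  p = (1 - packed/fp16) / (1 - 3/16)
p = (1 - packed_gb / fp16_gb) / (1 - bits_per_weight / 16)
print(f"{bits_per_weight} bits/weight, {tensor_compression:.2f}x per tensor, "
      f"{overall:.2f}x overall, ~{p:.0%} of bytes quantized")
```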

Quantized with ternary-quant, the only PTQ tool that quantizes VLMs and seq2seq models.
