Gemma 4 31B-it β€” Ternary Quantized (tritplane3)

Ternary-quantized version of google/gemma-4-31B-it using ternary-quant with component-aware tritplane3 quantization.

Model Specifications

| Property | Value |
|---|---|
| Base Model | google/gemma-4-31B-it |
| Parameters | 31B |
| Architecture | Dense transformer, multimodal (image + text) |
| Quantization | tritplane3 (3-plane progressive ternary) |
| Quantized Components | text_backbone + multimodal_connector (410 layers) |
| Vision Encoder | FP16 (unquantized, preserves image quality) |
| Effective Bits | ~8-10 bits/weight (quantized layers) |
| License | Gemma |

Size Comparison

| Method | Format | Size | Bits/Weight | VLM Support |
|---|---|---|---|---|
| FP16 (original) | safetensors | 62.6 GB | 16 | Yes |
| GGUF Q8_0 | GGUF | ~33 GB | 8.5 | Text only* |
| Ternary tritplane3 | ternary-quant | 31 GB | ~8-10 | Yes (vision + text) |
| GGUF Q4_K_M | GGUF | ~18 GB | 4.5 | Text only* |
| MLX 4-bit | MLX | ~17 GB | 4 | Yes (MLX only) |
| GGUF Q2_K | GGUF | ~12 GB | 2.5 | Text only* |

*GGUF quantizations of Gemma 4 typically strip or break the vision pipeline. Our ternary quantization preserves the full multimodal capability by keeping the vision encoder in FP16.
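The Bits/Weight column above can be sanity-checked from the file sizes alone. A minimal sketch, assuming all 31B parameters are stored in the checkpoint and using decimal gigabytes (the exact figures depend on GB vs GiB and on metadata overhead):

```python
# Sanity-check the Bits/Weight column from checkpoint size and parameter count.
# Illustrative only: assumes all 31e9 parameters are stored and 1 GB = 1e9 bytes.
PARAMS = 31e9  # total parameter count

def avg_bits_per_weight(size_gb: float) -> float:
    """Average storage bits per weight for a checkpoint of `size_gb` gigabytes."""
    return size_gb * 8e9 / PARAMS

print(round(avg_bits_per_weight(62.6), 1))  # FP16 original: ~16.2 bits/weight
print(round(avg_bits_per_weight(31.0), 1))  # ternary tritplane3: ~8.0 bits/weight
```

The 31 GB ternary checkpoint works out to roughly 8 bits/weight on average, consistent with the ~8-10 bits/weight figure for the quantized layers once the FP16 vision encoder is factored in.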

Quality Comparison (FP16 vs Ternary)

Side-by-side with greedy decoding, same prompts, chat template applied:

| Prompt | FP16 Original | Ternary (ours) |
|---|---|---|
| "What is the capital of France?" | The capital of France is Paris. | The capital of France is Paris. |
| "Explain photosynthesis in 2 sentences." | ...convert light energy into chemical energy in the form of glucose. This vital process consumes CO2 and water while releasing oxygen. | ...convert light energy into chemical energy, using CO2 and water to produce glucose. This process releases oxygen, essential for life on Earth. |
| "Write a Python function to reverse a string." | The Pythonic Way (Slicing) - Recommended | Using Slicing (The most Pythonic way) |

Result: Near-identical output. Same facts, same reasoning, same code β€” minor phrasing differences only.

Memory Requirements

| Runtime | Min Memory | Speed | Hardware |
|---|---|---|---|
| cached (CPU) | ~35 GB RAM | Moderate | Any x86/ARM CPU |
| cached (CUDA) | ~32 GB VRAM | Fast | A100, H100, RTX 4090 |
| metal (Apple Silicon) | ~35 GB unified | Moderate | M2 Pro 48GB+, M4 Pro 48GB+ |
| triton_memory (CUDA) | ~24 GB VRAM | Slower | RTX 3090, RTX 4090 |
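The table above maps directly onto the `runtime_mode` argument of `load_ternary_model`. A hypothetical helper that encodes those thresholds (the function is not part of ternary-quant; the cutoffs are taken from the table):

```python
# Hypothetical helper mirroring the memory table: pick a runtime_mode for
# load_ternary_model() from the accelerator type and available memory in GB.
# Not part of the ternary-quant API -- purely illustrative.
def pick_runtime_mode(accelerator: str, free_memory_gb: float) -> str:
    if accelerator == "cuda":
        if free_memory_gb >= 32:
            return "cached"         # fast path, ~32 GB VRAM
        if free_memory_gb >= 24:
            return "triton_memory"  # slower, fits in ~24 GB VRAM
        raise MemoryError("need at least ~24 GB of VRAM")
    if accelerator == "mps":
        if free_memory_gb >= 35:
            return "metal"          # Apple Silicon unified memory
        raise MemoryError("need ~35 GB of unified memory")
    if free_memory_gb >= 35:
        return "cached"             # CPU fallback, ~35 GB RAM
    raise MemoryError("need ~35 GB of RAM on CPU")

print(pick_runtime_mode("cuda", 24))  # -> triton_memory
```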

Quickstart

```shell
pip install ternary-quant
```

```python
from ternary_quant.inference import load_ternary_model

model, processor = load_ternary_model(
    "AsadIsmail/gemma-4-31B-it-ternary",
    runtime_mode="cached",  # "metal" for Apple Silicon, "triton_memory" for low-VRAM GPUs
    device="auto",
)

# Text generation with the chat template
messages = [{"role": "user", "content": [{"type": "text", "text": "Explain quantum computing"}]}]
formatted = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=formatted, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(outputs[0], skip_special_tokens=True))
```

Why ternary-quant for VLMs?

GGUF and GPTQ typically quantize all weights uniformly. For multimodal models, which combine a vision encoder, a text decoder, and a multimodal connector, this often breaks the vision pipeline.

ternary-quant quantizes each component independently:

  • Text backbone β†’ ternary (compressed)
  • Vision encoder β†’ FP16 (preserved)
  • Multimodal connector β†’ ternary (compressed)

This preserves image understanding while still compressing the text generation layers.
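The per-component split above amounts to routing each parameter by the module it belongs to. A minimal sketch of that routing, with hypothetical name prefixes (the real prefixes depend on the model's module tree, and this is not ternary-quant's actual implementation):

```python
# Illustrative sketch of component-aware routing: decide per parameter name
# whether to ternarize or keep FP16. Prefixes are hypothetical examples.
QUANTIZE_PREFIXES = ("language_model.", "multi_modal_projector.")  # text backbone + connector
PRESERVE_PREFIXES = ("vision_tower.",)                             # vision encoder stays FP16

def quantization_plan(param_names):
    """Map each parameter name to 'ternary' or 'fp16'."""
    plan = {}
    for name in param_names:
        if name.startswith(PRESERVE_PREFIXES):
            plan[name] = "fp16"
        elif name.startswith(QUANTIZE_PREFIXES):
            plan[name] = "ternary"
        else:
            plan[name] = "fp16"  # default: leave unrecognized components untouched
    return plan

plan = quantization_plan([
    "language_model.layers.0.mlp.down_proj.weight",
    "vision_tower.encoder.layers.0.attn.qkv.weight",
    "multi_modal_projector.linear.weight",
])
```

Defaulting unrecognized parameters to FP16 is the conservative choice: compression is opt-in per component, so an unexpected module can degrade size but never quality.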

Reproduce

```shell
pip install ternary-quant

ternary-quant quantize-broad google/gemma-4-31B-it \
    --output ./gemma-4-31B-it-ternary \
    --components text_backbone multimodal_connector \
    --scheme tritplane3 --dtype float16 --device cpu \
    --calibration-batch-size 1
```

Collection

Part of ternary-models β€” a collection of ternary-quantized vision-language, multimodal, and audio models.

GitHub: github.com/Asad-Ismail/ternary-models | Library: github.com/Asad-Ismail/ternary-quant
