# Gemma 4 31B-it - Ternary Quantized (tritplane3)

Ternary-quantized version of google/gemma-4-31B-it, produced with ternary-quant using component-aware tritplane3 quantization.

## Model Specifications
| Property | Value |
|---|---|
| Base Model | google/gemma-4-31B-it |
| Parameters | 31B |
| Architecture | Dense transformer, multimodal (image + text) |
| Quantization | tritplane3 (3-plane progressive ternary) |
| Quantized Components | text_backbone + multimodal_connector (410 layers) |
| Vision Encoder | FP16 (unquantized, preserves image quality) |
| Effective Bits | ~8-10 bits/weight (quantized layers) |
| License | Gemma |
## Size Comparison
| Method | Format | Size | Bits/Weight | VLM Support |
|---|---|---|---|---|
| FP16 (original) | safetensors | 62.6 GB | 16 | Yes |
| GGUF Q8_0 | GGUF | ~33 GB | 8.5 | Text only* |
| Ternary tritplane3 | ternary-quant | 31 GB | ~8-10 | Yes (vision+text) |
| GGUF Q4_K_M | GGUF | ~18 GB | 4.5 | Text only* |
| MLX 4-bit | MLX | ~17 GB | 4 | Yes (MLX only) |
| GGUF Q2_K | GGUF | ~12 GB | 2.5 | Text only* |
*GGUF quantizations of Gemma 4 typically strip or break the vision pipeline. Our ternary quantization preserves the full multimodal capability by keeping the vision encoder in FP16.
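As a rough sanity check on the table, the overall bits-per-weight figure can be recovered from the on-disk sizes (a back-of-the-envelope sketch; it averages over the unquantized FP16 vision encoder as well as the ternary layers):

```python
# File sizes from the comparison table above.
fp16_gb, ternary_gb = 62.6, 31.0

# Scale the 16 bits/weight of the FP16 checkpoint by the size ratio.
effective_bits = 16 * ternary_gb / fp16_gb
print(round(effective_bits, 1))  # 7.9
```

This is the whole-checkpoint average; the per-layer figure for the quantized components depends on how much of the 31 GB the FP16 vision encoder occupies.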
## Quality Comparison (FP16 vs Ternary)
Side-by-side with greedy decoding, same prompts, chat template applied:
| Prompt | FP16 Original | Ternary (ours) |
|---|---|---|
| "What is the capital of France?" | The capital of France is Paris. | The capital of France is Paris. |
| "Explain photosynthesis in 2 sentences." | ...convert light energy into chemical energy in the form of glucose. This vital process consumes CO2 and water while releasing oxygen. | ...convert light energy into chemical energy, using CO2 and water to produce glucose. This process releases oxygen, essential for life on Earth. |
| "Write a Python function to reverse a string." | The Pythonic Way (Slicing) - Recommended | Using Slicing (The most Pythonic way) |
Result: near-identical output. Same facts, same reasoning, same code, with only minor phrasing differences.
## Memory Requirements
| Runtime | Min Memory | Speed | Hardware |
|---|---|---|---|
| `cached` (CPU) | ~35 GB RAM | Moderate | Any x86/ARM CPU |
| `cached` (CUDA) | ~32 GB VRAM | Fast | A100, H100, RTX 4090 |
| `metal` (Apple Silicon) | ~35 GB unified | Moderate | M2 Pro 48GB+, M4 Pro 48GB+ |
| `triton_memory` (CUDA) | ~24 GB VRAM | Slower | RTX 3090, RTX 4090 |
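The table above maps hardware to a `runtime_mode` value; a tiny hypothetical helper (not part of ternary-quant) makes the decision logic explicit, with thresholds taken from the Min Memory column:

```python
def pick_runtime_mode(cuda_vram_gb=None, apple_silicon=False):
    """Pick a runtime_mode string per the memory table.

    Hypothetical helper for illustration only; the 32 GB cutoff
    mirrors the "~32 GB VRAM" row for cached (CUDA).
    """
    if apple_silicon:
        return "metal"
    if cuda_vram_gb is not None:
        # Enough VRAM for the fast cached runtime, else the low-VRAM one.
        return "cached" if cuda_vram_gb >= 32 else "triton_memory"
    return "cached"  # CPU fallback, needs ~35 GB system RAM

print(pick_runtime_mode(cuda_vram_gb=24))  # triton_memory
```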
## Quickstart

```bash
pip install ternary-quant
```

```python
from ternary_quant.inference import load_ternary_model

model, processor = load_ternary_model(
    "AsadIsmail/gemma-4-31B-it-ternary",
    runtime_mode="cached",  # "metal" for Apple Silicon, "triton_memory" for low-VRAM GPUs
    device="auto",
)

# Text generation with chat template
messages = [{"role": "user", "content": [{"type": "text", "text": "Explain quantum computing"}]}]
formatted = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=formatted, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(outputs[0], skip_special_tokens=True))
```
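The quickstart above is text-only; since the vision encoder is preserved, the same chat-template flow also handles image + text prompts. A sketch of the message structure, assuming the processor follows the standard Hugging Face multimodal content format (the `"type": "image"` entry and `images=` argument are assumptions, not confirmed ternary-quant API):

```python
# Multimodal message: one image plus a text instruction.
# Pass `messages` through processor.apply_chat_template(...) as in the
# text-only example, and supply the image itself to the processor, e.g.
# processor(text=formatted, images=[img], return_tensors="pt").
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/cat.jpg"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]

types = [part["type"] for part in messages[0]["content"]]
print(types)  # ['image', 'text']
```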
## Why ternary-quant for VLMs?

GGUF and GPTQ quantize all weights uniformly. For multimodal models, which combine a vision encoder, a text decoder, and a multimodal connector, uniform quantization often breaks the vision pipeline.
ternary-quant quantizes each component independently:
- Text backbone → ternary (compressed)
- Vision encoder → FP16 (preserved)
- Multimodal connector → ternary (compressed)
This preserves image understanding while still compressing the text generation layers.
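At the tensor level, the "ternary (compressed)" step maps each weight to {-1, 0, +1} plus one per-tensor scale. A minimal pure-Python illustration — the magnitude threshold and `sparsity` knob here are assumptions for clarity, and tritplane3's actual 3-plane progressive scheme is more involved:

```python
import statistics

def ternary_quantize(weights, sparsity=0.3):
    """Quantize a flat list of weights to {-1, 0, +1} with one scale.

    `sparsity` (fraction of weights zeroed) is an illustrative knob,
    not the ternary-quant API.
    """
    mags = sorted(abs(w) for w in weights)
    thr = mags[int(sparsity * len(mags))]  # magnitude threshold for zeroing
    q = [0 if abs(w) <= thr else (1 if w > 0 else -1) for w in weights]
    # One scale per tensor: mean magnitude of the surviving weights.
    kept = [abs(w) for w, t in zip(weights, q) if t != 0]
    scale = statistics.fmean(kept) if kept else 0.0
    return q, scale

w = [0.9, -0.05, 0.4, -1.2, 0.02, 0.7, -0.6, 0.1]
q, scale = ternary_quantize(w)
# Each weight is reconstructed as t * scale for t in q.
```

Storing a trit plus a shared scale, instead of 16 bits per weight, is where the compression comes from; the vision encoder skips this step entirely and stays at full FP16 fidelity.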
## Reproduce

```bash
pip install ternary-quant
ternary-quant quantize-broad google/gemma-4-31B-it \
  --output ./gemma-4-31B-it-ternary \
  --components text_backbone multimodal_connector \
  --scheme tritplane3 --dtype float16 --device cpu \
  --calibration-batch-size 1
```
## Collection

Part of ternary-models: ternary-quantized VLM, multimodal, and audio models.

GitHub: github.com/Asad-Ismail/ternary-models | Library: github.com/Asad-Ismail/ternary-quant