# Gemma 4 31B-it - Ternary Quantized (tritplane3)

Ternary-quantized version of google/gemma-4-31B-it, produced with ternary-quant using component-aware tritplane3 quantization.

## Model Specifications
| Property | Value |
|---|---|
| Base Model | google/gemma-4-31B-it |
| Parameters | 31B |
| Architecture | Dense transformer, multimodal (image + text) |
| Quantization | tritplane3 (3-plane progressive ternary) |
| Quantized Components | text_backbone + multimodal_connector (410 layers) |
| Vision Encoder | FP16 (unquantized, preserves image quality) |
| Effective Bits | ~8-10 bits/weight (quantized layers) |
| License | Gemma |
## Size Comparison
| Method | Format | Size | Bits/Weight | VLM Support |
|---|---|---|---|---|
| FP16 (original) | safetensors | 62.6 GB | 16 | Yes |
| GGUF Q8_0 | GGUF | ~33 GB | 8.5 | Text only* |
| Ternary tritplane3 | ternary-quant | 31 GB | ~8-10 | Yes (vision+text) |
| GGUF Q4_K_M | GGUF | ~18 GB | 4.5 | Text only* |
| MLX 4-bit | MLX | ~17 GB | 4 | Yes (MLX only) |
| GGUF Q2_K | GGUF | ~12 GB | 2.5 | Text only* |
*GGUF quantizations of Gemma 4 typically strip or break the vision pipeline. Our ternary quantization preserves the full multimodal capability by keeping the vision encoder in FP16.
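As a rough sanity check on the table, the overall bits-per-weight figure can be recovered from the on-disk sizes (a back-of-the-envelope sketch; it averages over the unquantized FP16 vision encoder as well as the ternary layers):

```python
# File sizes from the comparison table above.
fp16_gb, ternary_gb = 62.6, 31.0

# Scale the 16 bits/weight of the FP16 checkpoint by the size ratio.
effective_bits = 16 * ternary_gb / fp16_gb
print(round(effective_bits, 1))  # 7.9
```

This is the whole-checkpoint average; the per-layer figure for the quantized components depends on how much of the 31 GB the FP16 vision encoder occupies.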
## Quality Comparison (FP16 vs Ternary)
Side-by-side with greedy decoding, same prompts, chat template applied:
| Prompt | FP16 Original | Ternary (ours) |
|---|---|---|
| "What is the capital of France?" | The capital of France is Paris. | The capital of France is Paris. |
| "Explain photosynthesis in 2 sentences." | ...convert light energy into chemical energy in the form of glucose. This vital process consumes CO2 and water while releasing oxygen. | ...convert light energy into chemical energy, using CO2 and water to produce glucose. This process releases oxygen, essential for life on Earth. |
| "Write a Python function to reverse a string." | The Pythonic Way (Slicing) - Recommended | Using Slicing (The most Pythonic way) |
Result: near-identical output. Same facts, same reasoning, same code, with only minor phrasing differences.
## Memory Requirements
| Runtime | Min Memory | Speed | Hardware |
|---|---|---|---|
| `cached` (CPU) | ~35 GB RAM | Moderate | Any x86/ARM CPU |
| `cached` (CUDA) | ~32 GB VRAM | Fast | A100, H100, RTX 4090 |
| `metal` (Apple Silicon) | ~35 GB unified | Moderate | M2 Pro 48GB+, M4 Pro 48GB+ |
| `triton_memory` (CUDA) | ~24 GB VRAM | Slower | RTX 3090, RTX 4090 |
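The table above maps hardware to a `runtime_mode` value; a tiny hypothetical helper (not part of ternary-quant) makes the decision logic explicit, with thresholds taken from the Min Memory column:

```python
def pick_runtime_mode(cuda_vram_gb=None, apple_silicon=False):
    """Pick a runtime_mode string per the memory table.

    Hypothetical helper for illustration only; the 32 GB cutoff
    mirrors the "~32 GB VRAM" row for cached (CUDA).
    """
    if apple_silicon:
        return "metal"
    if cuda_vram_gb is not None:
        # Enough VRAM for the fast cached runtime, else the low-VRAM one.
        return "cached" if cuda_vram_gb >= 32 else "triton_memory"
    return "cached"  # CPU fallback, needs ~35 GB system RAM

print(pick_runtime_mode(cuda_vram_gb=24))  # triton_memory
```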
## Quickstart

```bash
pip install ternary-quant
```

```python
from ternary_quant.inference import load_ternary_model

model, processor = load_ternary_model(
    "AsadIsmail/gemma-4-31B-it-ternary",
    runtime_mode="cached",  # "metal" for Apple Silicon, "triton_memory" for low-VRAM GPUs
    device="auto",
)

# Text generation with chat template
messages = [{"role": "user", "content": [{"type": "text", "text": "Explain quantum computing"}]}]
formatted = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=formatted, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(outputs[0], skip_special_tokens=True))
```
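The quickstart above is text-only; since the vision encoder is preserved, the same chat-template flow also handles image + text prompts. A sketch of the message structure, assuming the processor follows the standard Hugging Face multimodal content format (the `"type": "image"` entry and `images=` argument are assumptions, not confirmed ternary-quant API):

```python
# Multimodal message: one image plus a text instruction.
# Pass `messages` through processor.apply_chat_template(...) as in the
# text-only example, and supply the image itself to the processor, e.g.
# processor(text=formatted, images=[img], return_tensors="pt").
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/cat.jpg"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]

types = [part["type"] for part in messages[0]["content"]]
print(types)  # ['image', 'text']
```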
## Why ternary-quant for VLMs?

GGUF and GPTQ quantize all weights uniformly. For multimodal models, which combine a vision encoder, a text decoder, and a multimodal connector, uniform quantization often breaks the vision pipeline.
ternary-quant quantizes each component independently:
- Text backbone → ternary (compressed)
- Vision encoder → FP16 (preserved)
- Multimodal connector → ternary (compressed)
This preserves image understanding while still compressing the text generation layers.
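At the tensor level, the "ternary (compressed)" step maps each weight to {-1, 0, +1} plus one per-tensor scale. A minimal pure-Python illustration — the magnitude threshold and `sparsity` knob here are assumptions for clarity, and tritplane3's actual 3-plane progressive scheme is more involved:

```python
import statistics

def ternary_quantize(weights, sparsity=0.3):
    """Quantize a flat list of weights to {-1, 0, +1} with one scale.

    `sparsity` (fraction of weights zeroed) is an illustrative knob,
    not the ternary-quant API.
    """
    mags = sorted(abs(w) for w in weights)
    thr = mags[int(sparsity * len(mags))]  # magnitude threshold for zeroing
    q = [0 if abs(w) <= thr else (1 if w > 0 else -1) for w in weights]
    # One scale per tensor: mean magnitude of the surviving weights.
    kept = [abs(w) for w, t in zip(weights, q) if t != 0]
    scale = statistics.fmean(kept) if kept else 0.0
    return q, scale

w = [0.9, -0.05, 0.4, -1.2, 0.02, 0.7, -0.6, 0.1]
q, scale = ternary_quantize(w)
# Each weight is reconstructed as t * scale for t in q.
```

Storing a trit plus a shared scale, instead of 16 bits per weight, is where the compression comes from; the vision encoder skips this step entirely and stays at full FP16 fidelity.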
## Reproduce

```bash
pip install ternary-quant
ternary-quant quantize-broad google/gemma-4-31B-it \
  --output ./gemma-4-31B-it-ternary \
  --components text_backbone multimodal_connector \
  --scheme tritplane3 --dtype float16 --device cpu \
  --calibration-batch-size 1
```
## Collection

Part of ternary-models: ternary-quantized VLM, multimodal, and audio models.

GitHub: github.com/Asad-Ismail/ternary-models | Library: github.com/Asad-Ismail/ternary-quant