Gemma 4 Text-Only
If all you need is text, these are the Gemma 4 models for you.
Trimmed checkpoints of the Gemma 4 model family with vision and audio encoder weights removed — smaller files, lower VRAM, drop-in text-only replacement.
⚠️ Disclaimer: These models were tested exclusively with HuggingFace Transformers. vLLM, SGLang, llama.cpp, Ollama, and other inference engines are not supported yet — partly because Transformers support for Gemma 4 is still cooking in those projects, and partly because we just threw these checkpoints on the Hub while messing around in the lab. If you get any of these running on other engines, we'd love to hear about it — open a discussion or drop a community post. We didn't set out to build a production-ready model zoo; we just left the oven door open. Use accordingly.
For official details on the Gemma 4 model family — architecture, benchmarks, training data, and intended use — see the original Gemma 4 E2B-it model card.
How It Works
The Gemma 4 E2B architecture consists of a vision encoder, an audio encoder, and a language model sharing a single checkpoint. During text-only inference the vision and audio encoders are never called, but their weights are still loaded into memory. By loading the checkpoint with the causal LM class instead of the full conditional generation class, HuggingFace Transformers instantiates only the language model component. Re-saving that model produces a checkpoint with no vision or audio weights, which can subsequently be loaded with the standard AutoModelForCausalLM interface.
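A minimal sketch of that procedure, assuming nothing beyond the standard Transformers API (the function name and its arguments are illustrative, not an exact reproduction of our conversion script):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer


def trim_to_text_only(src: str, dst: str) -> None:
    """Load a multimodal checkpoint as a plain causal LM and re-save it
    without the vision/audio encoder weights."""
    # AutoModelForCausalLM instantiates only the language-model component,
    # so the encoder weights never make it into the new state dict.
    model = AutoModelForCausalLM.from_pretrained(src)
    tokenizer = AutoTokenizer.from_pretrained(src)

    # Re-saving writes a text-only checkpoint that loads with the same class.
    model.save_pretrained(dst)
    tokenizer.save_pretrained(dst)
```

The re-saved checkpoint keeps the tokenizer and chat template untouched, which is what makes it a drop-in replacement for text workloads.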
Why bother?
- Lower VRAM — vision and audio encoder weights are freed, reducing peak memory usage

- Smaller checkpoints — faster downloads and storage savings
- Simpler loading — standard AutoModelForCausalLM, no multimodal dependencies
- Drop-in replacement — identical tokenizer, same chat template, same text generation behavior as the original Gemma 4 models
Available Models
| Model | HuggingFace Hub |
|---|---|
| Gemma-4-E2B-it-text-only | principled-intelligence/gemma-4-E2B-it-text-only |
| Gemma-4-E4B-it-text-only | principled-intelligence/gemma-4-E4B-it-text-only |
Size Reduction
We compared the text-only checkpoint against the original Gemma 4 E2B-it across three metrics: peak VRAM usage when loaded in bfloat16 with device_map="auto", total parameter count, and on-disk checkpoint size.
Gemma 4 E2B-it vs. Gemma 4 E2B-it-text-only
| Metric | Gemma 4 E2B-it | Text-Only | Reduction |
|---|---|---|---|
| VRAM (MiB) | 10,504 | 9,596 | ~9% |
| Parameters (B) | 5.12 | 4.65 | ~9% |
| File size (GB) | 10.20 | 9.29 | ~9% |
Note: The "E" in E2B stands for "effective" parameters. The Gemma 4 E2B architecture uses Per-Layer Embeddings (PLE) to maximize parameter efficiency on-device — the total parameter count is higher than the effective size. The text-only variant removes the vision and audio encoder weights while preserving the full language model, including all PLE parameters.
Quickstart
The latest transformers is required (quote the requirement so the shell doesn't treat `>=` as a redirect):

```shell
uv pip install "transformers>=5.5.0"
```
Load and run inference exactly like any causal LM:
```python
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="principled-intelligence/gemma-4-E2B-it-text-only",
    device_map="auto",
)
messages = [{"role": "user", "content": "What is the capital of Italy?"}]
print(pipe(messages, max_new_tokens=512))
```

Or load the model and tokenizer directly:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "principled-intelligence/gemma-4-E2B-it-text-only"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

messages = [
    {"role": "user", "content": "What is the capital of Italy?"},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512)
response = tokenizer.decode(output_ids[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```
Gemma 4 thinks by default, generating internal reasoning content before the final response. Thinking is enabled by including the <|think|> token at the start of the system prompt; to disable thinking, remove the token. Many libraries, including Transformers, handle this for you via the chat template.
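If you are building the prompt string by hand rather than through the chat template, the toggle amounts to adding or stripping the token. A small illustrative helper (`set_thinking` is our own name, not part of any library; in practice, prefer tokenizer.apply_chat_template):

```python
# The <|think|> token follows the description above; this is string
# manipulation only and never touches the model.
THINK_TOKEN = "<|think|>"


def set_thinking(system_prompt: str, enabled: bool) -> str:
    """Add or remove the thinking token at the start of a system prompt."""
    stripped = system_prompt.removeprefix(THINK_TOKEN).lstrip()
    return f"{THINK_TOKEN}{stripped}" if enabled else stripped


print(set_thinking("You are a helpful assistant.", True))
# <|think|>You are a helpful assistant.
print(set_thinking("<|think|>You are a helpful assistant.", False))
# You are a helpful assistant.
```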
Contributing
Contributions are welcome! Whether it's getting these checkpoints running on vLLM, SGLang, llama.cpp, Ollama, or something else entirely — we'd love your help. Bug reports, compatibility notes, and PRs are all appreciated. Open a discussion or community post and let us know what you find.
License
These checkpoints are released under the Apache 2.0 License, consistent with the original Gemma 4 models.
Made with love from Principled Intelligence ❤️
Learn more about what we build at Principled Intelligence on our website.