Gemma 4 Text-Only
If all you need is text, these are the Gemma 4 models for you.
Trimmed checkpoints of the Gemma 4 model family with vision and audio encoder weights removed — smaller files, lower VRAM, drop-in text-only replacement.
⚠️ Disclaimer: These models were tested exclusively with HuggingFace Transformers. vLLM, SGLang, llama.cpp, Ollama, and other inference engines are not supported yet — partly because Transformers support for Gemma 4 is still cooking in those projects, and partly because we just threw these checkpoints on the Hub while messing around in the lab. If you get any of these running on other engines, we'd love to hear about it — open a discussion or drop a community post. We didn't set out to build a production-ready model zoo; we just left the oven door open. Use accordingly.
For official details on the Gemma 4 model family — architecture, benchmarks, training data, and intended use — see the original Gemma 4 E2B-it model card.
How It Works
The Gemma 4 E2B architecture consists of a vision encoder, an audio encoder, and a language model sharing a single checkpoint. During text-only inference the vision and audio encoders are never called, but their weights are still loaded into memory. By loading the checkpoint with the causal LM class instead of the full conditional generation class, HuggingFace Transformers instantiates only the language model component. Re-saving that model produces a checkpoint with no vision or audio weights, which can subsequently be loaded with the standard AutoModelForCausalLM interface.
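A minimal sketch of that procedure, assuming nothing beyond the standard Transformers API (the function name and its arguments are illustrative, not an exact reproduction of our conversion script):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer


def trim_to_text_only(src: str, dst: str) -> None:
    """Load a multimodal checkpoint as a plain causal LM and re-save it
    without the vision/audio encoder weights."""
    # AutoModelForCausalLM instantiates only the language-model component,
    # so the encoder weights never make it into the new state dict.
    model = AutoModelForCausalLM.from_pretrained(src)
    tokenizer = AutoTokenizer.from_pretrained(src)

    # Re-saving writes a text-only checkpoint that loads with the same class.
    model.save_pretrained(dst)
    tokenizer.save_pretrained(dst)
```

The re-saved checkpoint keeps the tokenizer and chat template untouched, which is what makes it a drop-in replacement for text workloads.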
Why bother?
- Lower VRAM — vision and audio encoder weights are freed, reducing peak memory usage

- Smaller checkpoints — faster downloads and storage savings
- Simpler loading — standard AutoModelForCausalLM, no multimodal dependencies
- Drop-in replacement — identical tokenizer, same chat template, same text generation behavior as the original Gemma 4 models
Available Models
| Model | HuggingFace Hub |
|---|---|
| Gemma-4-E2B-it-text-only | principled-intelligence/gemma-4-E2B-it-text-only |
| Gemma-4-E4B-it-text-only | principled-intelligence/gemma-4-E4B-it-text-only |
Size Reduction
We compared the text-only checkpoint against the original Gemma 4 E2B-it across three metrics: peak VRAM usage when loaded in bfloat16 with device_map="auto", total parameter count, and on-disk checkpoint size.
Gemma 4 E2B-it vs. Gemma 4 E2B-it-text-only
| Metric | Gemma 4 E2B-it | Text-Only | Reduction |
|---|---|---|---|
| VRAM (MiB) | 10,504 | 9,596 | ~9% |
| Parameters (B) | 5.12 | 4.65 | ~9% |
| File size (GB) | 10.20 | 9.29 | ~9% |
Note: The "E" in E2B stands for "effective" parameters. The Gemma 4 E2B architecture uses Per-Layer Embeddings (PLE) to maximize parameter efficiency on-device — the total parameter count is higher than the effective size. The text-only variant removes the vision and audio encoder weights while preserving the full language model, including all PLE parameters.
Quickstart
The latest transformers is required (quote the requirement so the shell doesn't treat `>=` as a redirect):

```shell
uv pip install "transformers>=5.5.0"
```
Load and run inference exactly like any causal LM:
```python
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="principled-intelligence/gemma-4-E2B-it-text-only",
    device_map="auto",
)
messages = [{"role": "user", "content": "What is the capital of Italy?"}]
print(pipe(messages, max_new_tokens=512))
```

Or load the model and tokenizer directly:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "principled-intelligence/gemma-4-E2B-it-text-only"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

messages = [
    {"role": "user", "content": "What is the capital of Italy?"},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512)
response = tokenizer.decode(output_ids[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```
Gemma 4 thinks by default, generating internal reasoning content before the final response. Thinking is enabled by including the <|think|> token at the start of the system prompt; to disable thinking, remove the token. Many libraries, including Transformers, handle this for you via the chat template.
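If you are building the prompt string by hand rather than through the chat template, the toggle amounts to adding or stripping the token. A small illustrative helper (`set_thinking` is our own name, not part of any library; in practice, prefer tokenizer.apply_chat_template):

```python
# The <|think|> token follows the description above; this is string
# manipulation only and never touches the model.
THINK_TOKEN = "<|think|>"


def set_thinking(system_prompt: str, enabled: bool) -> str:
    """Add or remove the thinking token at the start of a system prompt."""
    stripped = system_prompt.removeprefix(THINK_TOKEN).lstrip()
    return f"{THINK_TOKEN}{stripped}" if enabled else stripped


print(set_thinking("You are a helpful assistant.", True))
# <|think|>You are a helpful assistant.
print(set_thinking("<|think|>You are a helpful assistant.", False))
# You are a helpful assistant.
```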
Contributing
Contributions are welcome! Whether it's getting these checkpoints running on vLLM, SGLang, llama.cpp, Ollama, or something else entirely — we'd love your help. Bug reports, compatibility notes, and PRs are all appreciated. Open a discussion or community post and let us know what you find.
License
These checkpoints are released under the Apache 2.0 License, consistent with the original Gemma 4 models.
Made with love from Principled Intelligence ❤️
Learn more about what we build at Principled Intelligence on our website.