Gemma 4 E4B Classifier (vision/audio-stripped)

A modality-stripped variant of google/gemma-4-E4B-it for text-only classification, entity extraction, and structured-memory extraction. The vision encoder (150M params) and audio encoder (300M params) are removed; the text path is unchanged.

Headline: Same instruction-tuned text behavior as the official Gemma 4 E4B-it — including its multilingual coverage — but at 6.5 GB resident VRAM instead of 10.6 GB (Ollama Q4_K_M, RTX 3090, Linux). All safety alignment is preserved — this is not an abliterated or uncensored variant.

Fits comfortably on 8 GB GPUs at Q4_K_M with realistic context lengths (5.85 GB resident at ctx=4096, 5.96 GB at ctx=8192). The official multimodal Q4_K_M sits at 10.2 GB resident even at ctx=8192 and won't load on 8 GB cards.

Why this exists

Gemma 4 E4B is the local leader on small-model classification tasks (room classification, entity/memory extraction), but the official Q4_K_M locks out users below roughly 12 GB of VRAM: it sits at 10.6 GB resident, with the vision + audio encoders loaded whether you use them or not. For text-only workloads, those modality encoders are dead weight.

This variant strips them via clean re-instantiation: load the multimodal checkpoint, copy text-path tensors into a fresh Gemma4ForCausalLM(text_config), save. No safety-alignment changes. No retraining. No surgery on safetensors files.
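A minimal sketch of that re-instantiation is below. The multimodal wrapper class (Gemma4ForConditionalGeneration) and the language_model tensor prefix are assumptions inferred from the module names listed under "What was actually dropped"; the actual conversion script may name things differently.

import torch
from transformers import Gemma4ForCausalLM, Gemma4ForConditionalGeneration

# Load the full multimodal checkpoint (text + vision + audio towers).
mm = Gemma4ForConditionalGeneration.from_pretrained(
    "google/gemma-4-E4B-it", torch_dtype=torch.bfloat16
)

# Fresh text-only model built from the text sub-config; no modality towers exist here.
text_only = Gemma4ForCausalLM(mm.config.text_config)

# Keep only text-path tensors; the towers and their soft-token projectors are dropped.
DROP_PREFIXES = (
    "model.vision_tower.", "model.audio_tower.",
    "model.embed_vision.", "model.embed_audio.",
)
kept = {
    # Prefix remapping is an assumption; adjust to the checkpoint's actual names.
    k.replace("model.language_model.", "model."): v
    for k, v in mm.state_dict().items()
    if not k.startswith(DROP_PREFIXES)
}
missing, unexpected = text_only.load_state_dict(kept, strict=False)
print("missing:", missing, "unexpected:", unexpected)  # sanity check: ideally both empty

text_only.save_pretrained("gemma4-e4b-classifier")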

How it compares

Measured on RTX 3090, Ollama 0.x, against the MemPalace small-model benchmark harness (n=100 per task):

| Task | Official gemma4:e4b-it-q4_K_M | This model (Q4_K_M) | Δ |
|---|---|---|---|
| Calibration | 1.0000 | 1.0000 | 0.0000 |
| Room classification (closed-set) | 0.6200 | 0.6200 | 0.0000 (exact tie) |
| Room classification (open-set) | 0.6556 | 0.6526 | -0.0030 |
| Entity extraction (F1) | 0.7519 | 0.7318 | -0.0201 |
| Memory coverage | 0.9125 | 0.9375 | +0.0250 (higher) |
| VRAM resident | 10626 MB | 6517 MB | -4109 MB |
| e2e p50 (closed-set room) | 230.9 ms | 232.4 ms | +1.5 ms (noise) |

All accuracy deltas are within statistical noise at n=100. The 4.1 GB VRAM win is real and reproducible.

Multilingual robustness

The strip preserves the base model's multilingual capability. The same classification and extraction tasks were run with inputs translated into Portuguese (pt-BR), Spanish (es), and Chinese (zh), with labels and the slug taxonomy kept in English to test the realistic cross-lingual mapping case. Scoring uses embeddinggemma for semantic similarity, so answers phrased in another language aren't artificially penalized.

| Task | en | pt-BR | es | zh |
|---|---|---|---|---|
| Calibration | 1.000 | 0.950 | 0.950 | 0.950 |
| Room classification (closed-set) | 0.624 | 0.584 | 0.584 | 0.584 |
| Room classification (open-set) | 0.676 | 0.636 | 0.641 | 0.639 |
| Entity extraction (F1) | 0.732 | 0.747 | 0.747 | 0.694 |
| Memory coverage | 0.912 | 0.850 | 0.850 | 0.912 |

Closed- and open-set room classification vary by at most 0.04 across the four languages; entity F1 by at most ~0.05; memory coverage by at most ~0.06. The strip did not introduce a multilingual regression. The model still emits responses in the input language by default, so if your application needs same-language extraction (e.g. memories phrased in Portuguese for Portuguese conversations), it does that natively.
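For reference, a sketch of the kind of embedding-based semantic scoring described above. The exact model ID (google/embeddinggemma-300m here) and the 0.7 threshold are assumptions, not the harness's actual values.

from sentence_transformers import SentenceTransformer, util

# Multilingual embedding model; cross-lingual paraphrases land close in cosine space.
scorer = SentenceTransformer("google/embeddinggemma-300m")

def semantic_match(pred: str, gold: str, threshold: float = 0.7) -> bool:
    # A prediction counts as correct if its cosine similarity to the gold answer
    # clears the threshold, regardless of which language it is phrased in.
    emb = scorer.encode([pred, gold], normalize_embeddings=True)
    return float(util.cos_sim(emb[0], emb[1])) >= threshold

# Cross-lingual example: a Portuguese prediction scored against an English gold label.
print(semantic_match("quarto das crianças", "kids' bedroom"))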

What was actually dropped

From the 7996.2M-parameter multimodal checkpoint:

| Module | Params dropped |
|---|---|
| model.audio_tower.* (USM-style conformer) | 304.8M |
| model.vision_tower.* (MobileNet-v5 lineage) | 167.4M |
| model.embed_audio.* (audio→text soft-token projector) | 3.9M |
| model.embed_vision.* (vision→text soft-token projector) | 2.0M |
| Total dropped | 478.1M (6.0%) |
| Total kept (text path) | 7518.1M (94.0%) |

The VRAM saving (4.1 GB) is significantly larger than the dropped weights account for (~250 MB at Q4_K_M). The remainder comes from: modality encoders kept at higher precision than Q4 inside the GGUF, activation buffers sized for image-token sequences (up to 1120 tokens/image), and the multimodal embedders' vocab-offset tables.
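As a rough sanity check on that figure, assuming an effective ~4.2 bits per weight for Q4_K_M blocks: 478.1M dropped params × 4.2 bits ÷ 8 ≈ 250 MB, so the dropped weights alone cannot explain a 4.1 GB saving.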

Quantization variants

  • Q4_K_M (5.3 GB on disk, 6517 MB resident) — recommended default.
  • Q8_0 (8.0 GB on disk) — precision comparator; minimal accuracy lift on classification.
  • Source safetensors (this repo at bf16, 13.92 GB).

Usage

Hugging Face Transformers

from transformers import AutoTokenizer, Gemma4ForCausalLM
import torch

tok = AutoTokenizer.from_pretrained("igorls/gemma4-e4b-classifier")
model = Gemma4ForCausalLM.from_pretrained(
    "igorls/gemma4-e4b-classifier",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

# Standard Gemma chat template; this variant accepts text-only messages.
messages = [{"role": "user", "content": "What is the capital of France? One word."}]
chat = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
ids = tok(chat, return_tensors="pt").input_ids.to("cuda")
out = model.generate(ids, max_new_tokens=10, do_sample=False)
# Decode only the newly generated tokens.
print(tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True))

Ollama

ollama pull igorls/gemma4-e4b-classifier:Q4_K_M
ollama run igorls/gemma4-e4b-classifier:Q4_K_M "What is the capital of France?"

For classification workloads, pass "think": false at the top level of the /api/generate request to disable Gemma 4's CoT mode (which otherwise consumes the num_predict budget):

curl http://localhost:11434/api/generate -d '{
  "model": "igorls/gemma4-e4b-classifier:Q4_K_M",
  "prompt": "Classify into one word (indoor, outdoor): The kids are playing in the backyard.",
  "think": false,
  "stream": false,
  "options": {"temperature": 0, "num_predict": 16}
}'

Safety surface

This variant is safety-aligned identically to the official gemma-4-E4B-it. The strip does not touch the text-path weights where alignment lives; it only removes the unused modality encoders.

Validated on 18 raw NSFW classification samples (closed-set room, open-set slug invention, entity extraction with named entities, structured memory extraction with decisions/preferences/facts/commitments):

  • Zero refusals on any sample.
  • JSON validity 100% on the structured extraction tasks.
  • Open-set slugs are functional rather than euphemistic.

This confirms the finding from prior research: safety alignment simply doesn't trigger on these classification surfaces, so there is no reason to ship an uncensored variant for these workloads.

Limitations

  • Text-only. No vision input. No audio input. The encoders are gone. Passing image or audio tokens will produce undefined behavior.
  • Same context window as base (128k tokens).
  • Same tokenizer. The vocab includes vision/audio special tokens (<image>, <audio>, etc.) for compatibility with the official tokenizer; these tokens won't activate any modality processing in this variant.
  • No MTP drafter support on Ollama yet. Upstream llama.cpp doesn't recognize the Gemma4AssistantForCausalLM architecture as of May 2026, so Ollama on Linux/CUDA can't pair this target with the official MTP drafter. For MTP-accelerated inference, use Transformers directly (vLLM support is pending); see the MTP acceleration section below.

MTP acceleration

The official MTP drafter google/gemma-4-E4B-it-assistant (78M params, activation-aware) pairs cleanly with this stripped target. Output is lossless (byte-identical at deterministic decode). Measured on RTX 3090 via HF Transformers:

| Prompt shape | Tokens generated | Baseline | + MTP drafter | Speedup |
|---|---|---|---|---|
| MCQ single letter | 5 | 394 ms | 363 ms | 1.09x |
| Open Q one-word | 5 | 395 ms | 249 ms | 1.59x |
| Slug classification | 5 | 462 ms | 224 ms | 2.07x |
| JSON entity list (128 tok) | 128 | 12291 ms | 6712 ms | 1.83x |
| JSON memories (114 tok) | 114 | 8425 ms | 2771 ms | 3.04x |

Speedup tracks output predictability — structured JSON outputs land at the high end (3x), short slug/letter classifications around 1.5-2x, free-form continuations near 1x.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Stripped text-only target model.
target = AutoModelForCausalLM.from_pretrained(
    "igorls/gemma4-e4b-classifier",
    dtype=torch.bfloat16,
    device_map="cuda",
)
# Official 78M-parameter MTP drafter.
drafter = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-E4B-it-assistant",
    dtype=torch.bfloat16,
    device_map="cuda",
)
tok = AutoTokenizer.from_pretrained("igorls/gemma4-e4b-classifier")

messages = [{"role": "user", "content": "Classify into one word (indoor, outdoor): The kids are playing in the backyard."}]
chat = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
ids = tok(chat, return_tensors="pt").input_ids.to("cuda")

# assistant_model enables speculative decoding: the drafter proposes tokens and the
# target verifies them, so greedy output is identical to the unassisted baseline.
out = target.generate(
    ids,
    assistant_model=drafter,
    max_new_tokens=20,
    do_sample=False,
)
print(tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True))

For a self-hosted OpenAI-compatible HTTP endpoint, wrap the pair in a small FastAPI server that holds both models resident and exposes /v1/chat/completions. Example: scripts/08_mtp_server.py in the source repo, callable as:

curl http://localhost:8765/v1/chat/completions -d '{
  "model": "igorls/gemma4-e4b-classifier",
  "messages": [{"role":"user","content":"What is the capital of France?"}],
  "max_tokens": 16,
  "use_mtp": true
}'
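A minimal sketch of such a server is below. This is not the repo's scripts/08_mtp_server.py; the request fields beyond the curl example above, and all defaults, are assumptions. Run with uvicorn mtp_server:app --port 8765.

import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

TARGET_ID = "igorls/gemma4-e4b-classifier"
DRAFTER_ID = "google/gemma-4-E4B-it-assistant"

# Both models stay resident for the lifetime of the process.
tok = AutoTokenizer.from_pretrained(TARGET_ID)
target = AutoModelForCausalLM.from_pretrained(TARGET_ID, dtype=torch.bfloat16, device_map="cuda")
drafter = AutoModelForCausalLM.from_pretrained(DRAFTER_ID, dtype=torch.bfloat16, device_map="cuda")

app = FastAPI()

class ChatRequest(BaseModel):
    model: str
    messages: list[dict]
    max_tokens: int = 64
    use_mtp: bool = True

@app.post("/v1/chat/completions")
def chat_completions(req: ChatRequest):
    prompt = tok.apply_chat_template(req.messages, tokenize=False, add_generation_prompt=True)
    ids = tok(prompt, return_tensors="pt").input_ids.to("cuda")
    out = target.generate(
        ids,
        assistant_model=drafter if req.use_mtp else None,  # MTP speculative decode toggle
        max_new_tokens=req.max_tokens,
        do_sample=False,
    )
    text = tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True)
    # Minimal OpenAI-style response envelope.
    return {
        "object": "chat.completion",
        "model": req.model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": text},
            "finish_reason": "stop",
        }],
    }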

vLLM (future)

vLLM is the right inference stack for production throughput — it implements the drafter's centroid-masking optimization (sparse lm_head over ~4K candidates instead of ~262K vocab, ~45x reduction in lm_head compute):

vllm serve igorls/gemma4-e4b-classifier \
  --speculative-config '{"model": "google/gemma-4-E4B-it-assistant", "num_speculative_tokens": 4}'

However, as of May 2026 (vLLM 0.20.2, latest on PyPI), this fails: the drafter's Gemma4AssistantConfig is not yet registered in vLLM's AutoModel mapping. The vLLM Gemma 4 recipes page documents the feature but it's ahead of the released version. Track vllm-project/vllm for the release that lands Gemma4Assistant support; once available, the command above should work as-is against this model.

License

Inherited from the base model: Gemma Terms of Use. By using this model you agree to those terms.

Citation

This is a derivative work of Google's Gemma 4 E4B. If you use it, please also credit:

@misc{gemma_2026,
  title={Gemma 4 Technical Report},
  author={Google DeepMind},
  year={2026},
  url={https://huggingface.co/google/gemma-4-E4B-it},
}

Acknowledgments

  • Google DeepMind for Gemma 4 and the open-weight release.
  • The MemPalace small-model benchmark research (PR #1447) that surfaced the VRAM gap and motivated this work.
  • The igorls/gemma-4-E4B-it-heretic-GGUF (author's prior abliteration experiment) for accidentally demonstrating the architectural VRAM win that this artifact reproduces through a clean, safety-aligned path.