Gemma-4-E4B-it W4A16 (Quantized)
This is a 4-bit W4A16 quantized version of Google's Gemma-4-E4B-it, optimized for high-speed text generation with significantly reduced VRAM requirements.
It was compressed with llmcompressor using 512 high-quality conversational calibration samples. The model weighs just ~2.5 GB, allowing it to run comfortably on consumer GPUs (e.g., 8 GB VRAM) while retaining the instruction-following capabilities of the base Gemma 4 architecture.
🚀 Quick Start (vLLM)
pip install vllm
vllm serve ./gemma4-e4b-w4a16-it --dtype bfloat16 --max-model-len 4096
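Once the server is up, it exposes vLLM's OpenAI-compatible API (port 8000 by default). The snippet below is a minimal sketch of a chat request against that endpoint; the model name must match the path passed to vllm serve unless you override it with --served-model-name.
import requests
# Query the OpenAI-compatible endpoint started by `vllm serve` above.
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "./gemma4-e4b-w4a16-it",  # must match the served model name
        "messages": [{"role": "user", "content": "Explain quantum computing in simple terms."}],
        "max_tokens": 256,
        "temperature": 0.7,
    },
)
print(response.json()["choices"][0]["message"]["content"])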
Python Inference:
from vllm import LLM, SamplingParams
llm = LLM(model="./gemma4-e4b-w4a16-it", dtype="bfloat16")
params = SamplingParams(max_tokens=512, temperature=0.7)
outputs = llm.generate(["Explain quantum computing in simple terms."], params)
print(outputs[0].outputs[0].text)
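Because Gemma's instruction tuning expects its chat format (see the apply_chat_template note further down), you may get better results from the chat interface than from raw generate. LLM.chat is available in recent vLLM releases, so treat this as a version-dependent sketch.
messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
# LLM.chat applies the model's chat template before generation.
chat_outputs = llm.chat(messages, params)
print(chat_outputs[0].outputs[0].text)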
⚙️ Quantization Configuration
The quantization settings were chosen to balance compression ratio against output quality degradation.
| Setting | Value |
|---|---|
| Scheme | W4A16 (4-bit weights, 16-bit activations) |
| Algorithm | QuantizationModifier (Static / GPTQ-based) |
| Calibration Dataset | Ultrachat 200k (512 samples) |
| Sequence Length | 2048 |
🛡️ Architecture Exclusions (Why some layers are FP16)
To prevent the "garbage output" and tracing errors common in modern hybrid models, the following modules were explicitly ignored and kept in their native bfloat16 precision:
- lm_head & embed_tokens: Quantizing the output projection and embedding layers of Gemma models typically causes catastrophic token distribution collapse. Keeping them in FP16 costs ~100 MB of VRAM but preserves model coherence.
- Vision & audio towers: Because calibration was text-only, the multimodal encoders were excluded to prevent cross-modal interference. (Note: these untouched towers are preserved in the weights if you wish to build a multimodal pipeline around this text backbone.) A sketch of the corresponding ignore list follows below.
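To make the exclusions concrete, here is a minimal sketch of how such an ignore list can be expressed with llmcompressor's QuantizationModifier. The exact module names and regex patterns are illustrative assumptions, not a dump of the recipe actually used for this checkpoint.
from llmcompressor.modifiers.quantization import QuantizationModifier
# Hypothetical recipe: quantize Linear layers to W4A16, but keep the output head,
# embeddings, and multimodal towers in their native precision.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=[
        "lm_head",
        "re:.*embed_tokens.*",
        "re:.*vision.*",   # vision tower (excluded: text-only calibration)
        "re:.*audio.*",    # audio tower (excluded: text-only calibration)
    ],
)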
🛠️ The torch.fx Tracing Patch (Important!)
Gemma 4 introduces a get_per_layer_inputs_embedding method that crashes llmcompressor and torch.fx graph tracing out-of-the-box.
If you are trying to replicate this quantization on other Gemma 4 variants, you must apply this monkey patch before calling oneshot():
import torch.fx as fx
from transformers.models.gemma4.modeling_gemma4 import Gemma4Model
# 1. Save original method
_orig_method = Gemma4Model.get_per_layer_inputs_embedding
# 2. Wrap it to bypass torch.fx tracing errors
@fx.wrap
def _patched_method(self, *args, **kwargs):
return _orig_method(self, *args, **kwargs)
# 3. Apply to class BEFORE loading/tracing
Gemma4Model.get_per_layer_inputs_embedding = _patched_method
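With the patch applied, a oneshot run matching the configuration table above could look roughly like the sketch below. It reuses the recipe sketched in the exclusions section; the loading arguments are assumptions and exact argument names may differ between llmcompressor releases.
from llmcompressor import oneshot
from transformers import AutoModelForCausalLM
# Load the base model in bfloat16 (patch above must already be applied).
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-E4B-it", torch_dtype="bfloat16", trust_remote_code=True
)
oneshot(
    model=model,
    dataset="ultrachat_200k",        # calibration set from the table above
    recipe=recipe,                   # QuantizationModifier sketched earlier
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="./gemma4-e4b-w4a16-it",
)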
Prerequisite
Make sure you have the compressed-tensors library installed; it is required to reconstruct the custom W4A16 linear layers as standard nn.Module layers when the checkpoint is loaded:
pip install compressed-tensors transformers torch pillow requests
Full Testing Script (test_image_to_text.py)
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM
# ============ CONFIG ============
MODEL_PATH = "./gemma4-e4b-w4a16-it" # Path to your compressed model
IMAGE_URL = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
PROMPT = "Describe what is happening in this image in detail."
print("="*60)
print("👀 GEMMA 4 IMAGE-TO-TEXT TESTING (W4A16 Text / BF16 Vision)")
print("="*60)
# 1. Load Processor and Model
print("\n[1/3] Loading processor and quantized model...")
processor = AutoProcessor.from_pretrained(MODEL_PATH, trust_remote_code=True)
# Load model.
# compressed-tensors will automatically intercept this and reconstruct the
# W4A16 layers for the text tower, while keeping the vision tower in BF16.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
).eval()
print(f"✅ Model loaded successfully.")
# (Optional) Quick VRAM check to prove text compression worked
if torch.cuda.is_available():
    mem_used = torch.cuda.memory_allocated() / (1024**3)
    print(f"📊 VRAM Used (Weights + Overhead): {mem_used:.2f} GB")
# 2. Load and Process Image
print("\n[2/3] Fetching and processing image...")
try:
    image = Image.open(requests.get(IMAGE_URL, stream=True).raw).convert("RGB")
except Exception as e:
    print(f"Failed to download image: {e}")
    exit()
# Use the processor's chat template to format the input correctly for Gemma 4
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": PROMPT}
        ]
    }
]
# Apply chat template and tokenize
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)
print(f"✅ Input processed. Input shape: {inputs['input_ids'].shape}")
# 3. Generate Response
print("\n[3/3] Generating response...")
print("-" * 60)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,  # Use greedy decoding for deterministic testing
        temperature=0.0,
    )
# Decode only the newly generated tokens (skip the prompt)
input_len = inputs['input_ids'].shape[-1]
generated_tokens = outputs[0][input_len:]
response_text = processor.decode(generated_tokens, skip_special_tokens=True)
print("🤖 Model Output:")
print(response_text)
print("-" * 60)
print("\n✅ TEST COMPLETE!")
Why this specific setup is required:
- compressed-tensors backend: When llmcompressor saves the model, it replaces standard nn.Linear layers with custom compressed modules in the safetensors metadata. Standard PyTorch cannot read this natively; the compressed-tensors library (which hooks into transformers automatically) handles the dynamic un-wrapping during from_pretrained. (A quick verification sketch follows after this list.)
- dtype=torch.bfloat16: Even though the text layers are W4A16 (4-bit), the activations and the vision tower operate entirely in BF16. If you force float32, you will get dtype mismatch errors inside the forward pass where the vision outputs meet the text inputs.
- apply_chat_template: Gemma 4 relies on strict <start_of_turn> formatting and specific image token placements. Never pass raw text/images directly to the model; always let the processor construct the prompt.
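If you want to confirm that the checkpoint was loaded through the compressed-tensors path, a quick check is to inspect the quantization configuration that transformers reads from config.json. This is a minimal sketch; the attribute name reflects current transformers behaviour and may vary between versions, and MODEL_PATH is the variable from the testing script above.
from transformers import AutoConfig
config = AutoConfig.from_pretrained(MODEL_PATH, trust_remote_code=True)
# For compressed-tensors checkpoints, config.json carries a quantization_config block
# describing the W4A16 scheme and the ignored (full-precision) modules.
print(getattr(config, "quantization_config", "No quantization_config found"))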
💻 Hardware Requirements
Because the active language weights are compressed to 4-bit, the VRAM footprint is small; a quick way to verify the weights-only figure is sketched after this list:
- GPU Memory (Weights only): ~2.5 GB
- Recommended VRAM: 8 GB+ (e.g., RTX 3060/4060) to leave room for KV Cache and context windows.
- Optimal VRAM: 16 GB - 24 GB (e.g., RTX 4090, L4) for massive concurrent batching via vLLM.
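As a rough sanity check of the weights-only figure above, you can sum the in-memory size of the loaded model's parameters and buffers. This is a minimal sketch assuming the model object from the testing script; it only approximates the number reported by torch.cuda.memory_allocated().
def weight_footprint_gb(model):
    # Sum the storage of all parameters and buffers (packed 4-bit weights show up
    # as small integer tensors, so this reflects the compression).
    total_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    total_bytes += sum(b.numel() * b.element_size() for b in model.buffers())
    return total_bytes / (1024 ** 3)

print(f"Approximate weight footprint: {weight_footprint_gb(model):.2f} GB")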
⚠️ Limitations
- Text-Only Calibration: This specific checkpoint was calibrated purely on text instructions. While the multimodal weights are preserved in the safetensors, standard text-generation pipelines will not route audio/vision inputs through them without custom pipeline modifications.
- Not for Fine-Tuning: This repository is intended for inference only. Attempting to run LoRA/QLoRA directly on 4-bit compressed weights without a quantization-aware training framework (such as AWQ-LoRA-specific implementations) may yield unstable gradients.
📄 License
This model inherits the Gemma Terms of Use from the base model.