Gemma-4-E4B-it W4A16 (Quantized)
This is a 4-bit W4A16 quantized version of Google's Gemma-4-E4B-it, optimized for high-speed text generation with significantly reduced VRAM requirements.
It was compressed with llmcompressor using 512 high-quality conversational calibration samples. The model weighs just ~2.5 GB, allowing it to run comfortably on consumer GPUs (e.g., 8 GB VRAM) while retaining the instruction-following capabilities of the base Gemma 4 architecture.
🚀 Quick Start (vLLM)
pip install vllm
vllm serve ./gemma4-e4b-w4a16-it --dtype bfloat16 --max-model-len 4096
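Once the server is up, it exposes vLLM's OpenAI-compatible API (port 8000 by default). The snippet below is a minimal sketch of a chat request against that endpoint; the model name must match the path passed to vllm serve unless you override it with --served-model-name.
import requests
# Query the OpenAI-compatible endpoint started by `vllm serve` above.
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "./gemma4-e4b-w4a16-it",  # must match the served model name
        "messages": [{"role": "user", "content": "Explain quantum computing in simple terms."}],
        "max_tokens": 256,
        "temperature": 0.7,
    },
)
print(response.json()["choices"][0]["message"]["content"])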
Python Inference:
from vllm import LLM, SamplingParams
llm = LLM(model="./gemma4-e4b-w4a16-it", dtype="bfloat16")
params = SamplingParams(max_tokens=512, temperature=0.7)
outputs = llm.generate(["Explain quantum computing in simple terms."], params)
print(outputs[0].outputs[0].text)
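Because Gemma's instruction tuning expects its chat format (see the apply_chat_template note further down), you may get better results from the chat interface than from raw generate. LLM.chat is available in recent vLLM releases, so treat this as a version-dependent sketch.
messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
# LLM.chat applies the model's chat template before generation.
chat_outputs = llm.chat(messages, params)
print(chat_outputs[0].outputs[0].text)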
⚙️ Quantization Configuration
The quantization settings were chosen to balance compression ratio against output quality degradation.
| Setting | Value |
|---|---|
| Scheme | W4A16 (4-bit weights, 16-bit activations) |
| Algorithm | QuantizationModifier (Static / GPTQ-based) |
| Calibration Dataset | Ultrachat 200k (512 samples) |
| Sequence Length | 2048 |
🛡️ Architecture Exclusions (Why some layers are FP16)
To prevent the "garbage output" and tracing errors common in modern hybrid models, the following modules were explicitly ignored and kept in their native bfloat16 precision:
- lm_head & embed_tokens: Quantizing the output projection and embedding layers of Gemma models typically causes catastrophic token distribution collapse. Keeping them in FP16 costs ~100 MB of VRAM but preserves model coherence.
- Vision & audio towers: Because calibration was text-only, the multimodal encoders were excluded to prevent cross-modal interference. (Note: these untouched towers are preserved in the weights if you wish to build a multimodal pipeline around this text backbone.) A sketch of the corresponding ignore list follows below.
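To make the exclusions concrete, here is a minimal sketch of how such an ignore list can be expressed with llmcompressor's QuantizationModifier. The exact module names and regex patterns are illustrative assumptions, not a dump of the recipe actually used for this checkpoint.
from llmcompressor.modifiers.quantization import QuantizationModifier
# Hypothetical recipe: quantize Linear layers to W4A16, but keep the output head,
# embeddings, and multimodal towers in their native precision.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=[
        "lm_head",
        "re:.*embed_tokens.*",
        "re:.*vision.*",   # vision tower (excluded: text-only calibration)
        "re:.*audio.*",    # audio tower (excluded: text-only calibration)
    ],
)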
🛠️ The torch.fx Tracing Patch (Important!)
Gemma 4 introduces a get_per_layer_inputs_embedding method that crashes llmcompressor and torch.fx graph tracing out-of-the-box.
If you are trying to replicate this quantization on other Gemma 4 variants, you must apply this monkey patch before calling oneshot():
import torch.fx as fx
from transformers.models.gemma4.modeling_gemma4 import Gemma4Model
# 1. Save original method
_orig_method = Gemma4Model.get_per_layer_inputs_embedding
# 2. Wrap it to bypass torch.fx tracing errors
@fx.wrap
def _patched_method(self, *args, **kwargs):
return _orig_method(self, *args, **kwargs)
# 3. Apply to class BEFORE loading/tracing
Gemma4Model.get_per_layer_inputs_embedding = _patched_method
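With the patch applied, a oneshot run matching the configuration table above could look roughly like the sketch below. It reuses the recipe sketched in the exclusions section; the loading arguments are assumptions and exact argument names may differ between llmcompressor releases.
from llmcompressor import oneshot
from transformers import AutoModelForCausalLM
# Load the base model in bfloat16 (patch above must already be applied).
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-E4B-it", torch_dtype="bfloat16", trust_remote_code=True
)
oneshot(
    model=model,
    dataset="ultrachat_200k",        # calibration set from the table above
    recipe=recipe,                   # QuantizationModifier sketched earlier
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="./gemma4-e4b-w4a16-it",
)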
Prerequisite
Make sure you have the compressed-tensors library installed; it is required to reconstruct the custom W4A16 linear layers as standard nn.Module layers when the checkpoint is loaded:
pip install compressed-tensors transformers torch pillow requests
Full Testing Script (test_image_to_text.py)
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM
# ============ CONFIG ============
MODEL_PATH = "./gemma4-e4b-w4a16-it" # Path to your compressed model
IMAGE_URL = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
PROMPT = "Describe what is happening in this image in detail."
print("="*60)
print("👀 GEMMA 4 IMAGE-TO-TEXT TESTING (W4A16 Text / BF16 Vision)")
print("="*60)
# 1. Load Processor and Model
print("\n[1/3] Loading processor and quantized model...")
processor = AutoProcessor.from_pretrained(MODEL_PATH, trust_remote_code=True)
# Load model.
# compressed-tensors will automatically intercept this and reconstruct the
# W4A16 layers for the text tower, while keeping the vision tower in BF16.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
).eval()
print(f"✅ Model loaded successfully.")
# (Optional) Quick VRAM check to prove text compression worked
if torch.cuda.is_available():
    mem_used = torch.cuda.memory_allocated() / (1024**3)
    print(f"📊 VRAM Used (Weights + Overhead): {mem_used:.2f} GB")
# 2. Load and Process Image
print("\n[2/3] Fetching and processing image...")
try:
    image = Image.open(requests.get(IMAGE_URL, stream=True).raw).convert("RGB")
except Exception as e:
    print(f"Failed to download image: {e}")
    exit()
# Use the processor's chat template to format the input correctly for Gemma 4
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": PROMPT}
        ]
    }
]
# Apply chat template and tokenize
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)
print(f"✅ Input processed. Input shape: {inputs['input_ids'].shape}")
# 3. Generate Response
print("\n[3/3] Generating response...")
print("-" * 60)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,  # Use greedy decoding for deterministic testing
        temperature=0.0,
    )
# Decode only the newly generated tokens (skip the prompt)
input_len = inputs['input_ids'].shape[-1]
generated_tokens = outputs[0][input_len:]
response_text = processor.decode(generated_tokens, skip_special_tokens=True)
print("🤖 Model Output:")
print(response_text)
print("-" * 60)
print("\n✅ TEST COMPLETE!")
Why this specific setup is required:
- compressed-tensors backend: When llmcompressor saves the model, it replaces standard nn.Linear layers with custom compressed modules in the safetensors metadata. Standard PyTorch cannot read this natively; the compressed-tensors library (which hooks into transformers automatically) handles the dynamic un-wrapping during from_pretrained. (A quick verification sketch follows after this list.)
- dtype=torch.bfloat16: Even though the text layers are W4A16 (4-bit), the activations and the vision tower operate entirely in BF16. If you force float32, you will get dtype mismatch errors inside the forward pass where the vision outputs meet the text inputs.
- apply_chat_template: Gemma 4 relies on strict <start_of_turn> formatting and specific image token placements. Never pass raw text/images directly to the model; always let the processor construct the prompt.
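If you want to confirm that the checkpoint was loaded through the compressed-tensors path, a quick check is to inspect the quantization configuration that transformers reads from config.json. This is a minimal sketch; the attribute name reflects current transformers behaviour and may vary between versions, and MODEL_PATH is the variable from the testing script above.
from transformers import AutoConfig
config = AutoConfig.from_pretrained(MODEL_PATH, trust_remote_code=True)
# For compressed-tensors checkpoints, config.json carries a quantization_config block
# describing the W4A16 scheme and the ignored (full-precision) modules.
print(getattr(config, "quantization_config", "No quantization_config found"))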
💻 Hardware Requirements
Because the active language weights are compressed to 4-bit, the VRAM footprint is small; a quick way to verify the weights-only figure is sketched after this list:
- GPU Memory (Weights only): ~2.5 GB
- Recommended VRAM: 8 GB+ (e.g., RTX 3060/4060) to leave room for KV Cache and context windows.
- Optimal VRAM: 16 GB - 24 GB (e.g., RTX 4090, L4) for massive concurrent batching via vLLM.
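As a rough sanity check of the weights-only figure above, you can sum the in-memory size of the loaded model's parameters and buffers. This is a minimal sketch assuming the model object from the testing script; it only approximates the number reported by torch.cuda.memory_allocated().
def weight_footprint_gb(model):
    # Sum the storage of all parameters and buffers (packed 4-bit weights show up
    # as small integer tensors, so this reflects the compression).
    total_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    total_bytes += sum(b.numel() * b.element_size() for b in model.buffers())
    return total_bytes / (1024 ** 3)

print(f"Approximate weight footprint: {weight_footprint_gb(model):.2f} GB")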
⚠️ Limitations
- Text-Only Calibration: This specific checkpoint was calibrated purely on text instructions. While the multimodal weights are preserved in the safetensors, standard text-generation pipelines will not route audio/vision inputs through them without custom pipeline modifications.
- Not for Fine-Tuning: This repository is intended for inference only. Attempting to run LoRA/QLoRA directly on 4-bit compressed weights without a quantization-aware training framework (such as AWQ-LoRA-specific implementations) may yield unstable gradients.
📄 License
This model inherits the Gemma Terms of Use from the base model.