Gemma-4-E2B-NoAudio (Optimized Vision-Language)

This is an optimized version of Google's Gemma-4-E2B-it, re-architected to be more lightweight and efficient for local inference. Streamlining the architecture and dropping non-essential modalities significantly reduces resource requirements while preserving strong performance on vision and text tasks.

Key Highlights

  • Modality: Vision and Text (Audio components removed for efficiency).
  • Optimized Size: Approximately 9.6 GB, offering a much smaller footprint for storage and memory.
  • Precision: Weights are saved in bfloat16.
  • Target Hardware: Designed for Local GPUs, Mac M-Series, and memory-constrained edge devices.
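As a rough sanity check on the size figure above: bfloat16 stores 2 bytes per parameter, so a ~5B-parameter checkpoint lands near the quoted ~9.6 GB (the exact number depends on the precise parameter count and file metadata). A back-of-the-envelope sketch:

```python
# Back-of-the-envelope checkpoint size for bfloat16 weights (2 bytes/param).
def bf16_size_gib(n_params: float) -> float:
    """Approximate weight size in GiB for bfloat16 storage."""
    return n_params * 2 / (1024 ** 3)

print(round(bf16_size_gib(5e9), 1))  # ~9.3 GiB for 5B parameters
```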

Performance and Capability

Despite the architectural streamlining, this model retains the full native Vision (image understanding) and Language (reasoning and text generation) capabilities of the base model. Users can expect faster loading times and lower VRAM overhead during inference.

Usage (Vision + Text)

This model is compatible with the standard Transformers library. Using processor.apply_chat_template is recommended to ensure correct image-token injection.

import torch
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image

model_id = "bombman/Gemma-4-E2B-NoAudio"

# Load Processor and Model
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
).eval()

# Prepare Input
image = Image.open("your_image.jpg").convert("RGB")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

# Generate
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=False,
        repetition_penalty=1.2
    )

response = processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(response)
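The final decode slices outputs[0] by the prompt length so that only newly generated tokens are returned: generate() emits the prompt ids followed by the completion in one sequence. A toy illustration with plain lists (the token ids here are made up):

```python
# generate() returns [prompt tokens...][new tokens...] as one sequence,
# so slicing by the prompt length isolates the model's reply.
prompt_ids = [101, 5, 7, 9]              # stand-in for inputs["input_ids"][0]
full_output = prompt_ids + [42, 43, 44]  # what outputs[0] looks like
new_tokens = full_output[len(prompt_ids):]
print(new_tokens)  # [42, 43, 44]
```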

License

This model is a derivative work based on Google's Gemma-4. Users must comply with the original Gemma Terms of Use.

Model Details

  • Format: Safetensors
  • Model size: 5B parameters
  • Tensor type: BF16