How to use from the llama-cpp-python library
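# Gated model: log in first (in a shell) with an HF token that has gated-access permission:
#   hf auth login
# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="meghanamakkapati/Gemma-4_quantization",
	filename="gemma4-E4B-Q4_K_M.gguf",  # language-model file listed under Files
)
# messages must be a list of role/content dicts. This is a text-only prompt;
# image input additionally requires the mmproj vision projector.
response = llm.create_chat_completion(
	messages=[{"role": "user", "content": "Describe a typical house cat."}]
)
print(response["choices"][0]["message"]["content"])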

You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.


Gemma-4-E4B-IT — GGUF Q4_K_M Quantized

Quantized version of google/gemma-4-E4B-IT submitted to the Resilient AI Challenge — Image to Text category.

Compression

| Parameter | Value |
|---|---|
| Method | Post-training quantization via llama.cpp |
| Format | GGUF Q4_K_M (4-bit, K-quant) |
| Original size | 16.02 GB (FP16) |
| Quantized size | 5.34 GB |
| Size reduction | 67% |
| Vision projector | 0.99 GB (BF16 GGUF) |
| VRAM required | ~6-8 GB (fits on L4 24GB) |
| Hardware tested | NVIDIA A100-SXM4-40GB |

The model was converted from Hugging Face format with convert_hf_to_gguf.py, then quantized to Q4_K_M with llama-quantize. The vision projector (mmproj) is extracted separately and is required for multimodal image+text inference.
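
The two steps named above can be sketched as a small Python helper that assembles the command lines. This is a minimal sketch, not the author's exact invocation: the directory and intermediate file names are hypothetical, and the `--outfile`/`--outtype` flags and the positional arguments to `llama-quantize` are assumed from common llama.cpp usage. The separate mmproj extraction step is not shown.

```python
from pathlib import Path


def build_quantization_commands(hf_model_dir: str, out_dir: str) -> list[list[str]]:
    """Return the argv lists for converting an HF checkpoint to Q4_K_M GGUF.

    Nothing is executed here; pass each list to subprocess.run(cmd, check=True)
    from inside a llama.cpp checkout to actually run the steps.
    """
    out = Path(out_dir)
    f16_gguf = out / "gemma4-E4B-F16.gguf"      # hypothetical intermediate file
    q4_gguf = out / "gemma4-E4B-Q4_K_M.gguf"    # file name from the Files table

    # Step 1: Hugging Face checkpoint -> FP16 GGUF (flags are assumptions).
    convert = [
        "python", "convert_hf_to_gguf.py", hf_model_dir,
        "--outfile", str(f16_gguf),
        "--outtype", "f16",
    ]
    # Step 2: FP16 GGUF -> Q4_K_M GGUF.
    quantize = ["llama-quantize", str(f16_gguf), str(q4_gguf), "Q4_K_M"]
    return [convert, quantize]
```

Returning the commands instead of running them makes it easy to inspect or log each step before committing to a long conversion.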

Files

| File | Size | Description |
|---|---|---|
| gemma4-E4B-Q4_K_M.gguf | 5.34 GB | Q4_K_M quantized language model |
| mmproj-BF16.gguf | 0.99 GB | Vision projector (required for images) |
| llama_server_config.json | | Server configuration |

Inference — llama-server

llama-server \
    -m gemma4-E4B-Q4_K_M.gguf \
    --mmproj mmproj-BF16.gguf \
    --host 0.0.0.0 --port 8080 \
    --n-gpu-layers 99 \
    --ctx-size 8192 \
    --temp 1.0 --top-p 0.95 --top-k 64

API Usage

curl http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gemma4",
    "messages": [{"role": "user", "content": [
      {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,<b64>"}},
      {"type": "text", "text": "Describe this image."}
    ]}],
    "max_tokens": 256
  }'
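
The same request can be sent from Python, which also takes care of filling in the `<b64>` placeholder. The sketch below uses only the standard library; the function names are hypothetical, and the response parsing assumes the OpenAI-style layout (`choices[0].message.content`) that this endpoint is expected to return.

```python
import base64
import json
import urllib.request


def build_payload(image_bytes: bytes, prompt: str, max_tokens: int = 256) -> dict:
    """Build the same chat-completions request body as the curl example."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "gemma4",
        "messages": [{"role": "user", "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            {"type": "text", "text": prompt},
        ]}],
        "max_tokens": max_tokens,
    }


def describe_image(image_path: str,
                   url: str = "http://localhost:8080/v1/chat/completions") -> str:
    """POST a JPEG to a running llama-server and return the reply text."""
    with open(image_path, "rb") as f:
        payload = build_payload(f.read(), "Describe this image.")
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumes the OpenAI-style response layout.
    return body["choices"][0]["message"]["content"]
```

With the server from the previous section running, `describe_image("cats.jpg")` would return the generated description for a local JPEG file.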

Evaluation Parameters

| Parameter | Value |
|---|---|
| temperature | 1.0 |
| top_p | 0.95 |
| top_k | 64 |
| max_tokens | 256 |

Hardware

Quantized on NVIDIA A100-SXM4-40GB. Compatible with NVIDIA L4 24GB (~6-8 GB VRAM used).
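
As a quick sanity check on these figures, the reported sizes can be plugged into a few lines of Python. This only reproduces arithmetic from the numbers stated in this card. The gap between the combined file size and the upper end of the ~6-8 GB range is presumably KV cache and runtime buffers, which is an assumption; the card does not break it down.

```python
# Sizes in GB, as reported in this model card.
ORIGINAL_FP16_GB = 16.02
QUANTIZED_GB = 5.34
MMPROJ_GB = 0.99

# Fraction of the original FP16 size removed by Q4_K_M quantization.
reduction_pct = (1 - QUANTIZED_GB / ORIGINAL_FP16_GB) * 100

# Quantized language model plus vision projector, both loaded for image input.
weights_on_gpu_gb = QUANTIZED_GB + MMPROJ_GB

print(f"Size reduction: {reduction_pct:.0f}%")
print(f"Model + projector: {weights_on_gpu_gb:.2f} GB")
```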

License

Apache 2.0 — same as original google/gemma-4-E4B-IT.

GGUF metadata: 8B params, gemma4 architecture.