Gemma-4-E4B-IT — GGUF Q4_K_M Quantized

A compressed version of google/gemma-4-E4B-IT, submitted to the Resilient AI Challenge in the Image to Text category.

Compression Technique

Method: Post-training quantization using llama.cpp (build b9216)
Format: GGUF
Quantization type: Q4_K_M (4-bit, k-quant mixed precision)
Original size: 16.02 GB (FP16)
Compressed size: 5.34 GB
Size reduction: ~67%

The model was converted from Hugging Face format to GGUF with convert_hf_to_gguf.py from llama.cpp build b9216, then quantized to Q4_K_M with llama-quantize. Q4_K_M stores most weights in 4-bit k-quant blocks and keeps selected quality-sensitive tensors at higher precision, which limits quality loss while shrinking the file to about one third of its FP16 size.
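The size figures above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the FP16 file stores every weight in 16 bits and that file metadata is negligible; the function names are illustrative and not part of llama.cpp.

```python
# Sizes as reported in this model card, in GB.
ORIGINAL_GB = 16.02   # FP16 checkpoint
QUANTIZED_GB = 5.34   # Q4_K_M GGUF


def size_reduction(original_gb: float, quantized_gb: float) -> float:
    """Fraction of the original file size removed by quantization."""
    return 1.0 - quantized_gb / original_gb


def effective_bits_per_weight(original_gb: float, quantized_gb: float,
                              original_bits: float = 16.0) -> float:
    """Average bits per weight in the quantized file.

    Assumption: the original file uses `original_bits` for every weight
    and metadata overhead is negligible in both files.
    """
    return original_bits * quantized_gb / original_gb


reduction = size_reduction(ORIGINAL_GB, QUANTIZED_GB)
bits = effective_bits_per_weight(ORIGINAL_GB, QUANTIZED_GB)
print(f"size reduction: {reduction:.1%}")          # size reduction: 66.7%
print(f"effective bits per weight: {bits:.2f}")    # effective bits per weight: 5.33
```

An average above 4 bits per weight is what a mixed-precision scheme would be expected to give, since some tensors are kept at more than 4 bits.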

Model Weights

File | Size | Format | Recommended
gemma4-4b-it-Q4_K_M.gguf | 5.34 GB | GGUF Q4_K_M | Yes

Configuration

vllm_config.yaml is included in the root of this repo and contains:

gpu-memory-utilization: 0.85
max-model-len: 20000

Inference Instructions

llama.cpp (primary)

llama-server \
  --hf-repo meghanamakkapati/Gemma-4_quantization \
  -m gemma4-4b-it-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 \
  --ctx-size 8192 \
  --n-gpu-layers 99

vLLM

vllm serve meghanamakkapati/Gemma-4_quantization --config vllm_config.yaml

Evaluation Parameters

temperature: 1.0
top_p:       0.95
top_k:       64

Hardware

Tested on NVIDIA A100 80GB. Compatible with NVIDIA L4 24GB: the 5.34 GB weight file fits within VRAM, with the remainder available for the KV cache and activations.
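
As a rough check on the VRAM claim, the sketch below estimates how much memory remains after loading the weights under the vLLM setting gpu-memory-utilization: 0.85. It is a back-of-the-envelope estimate under stated assumptions: it treats the advertised 24 GB and 80 GB as usable totals, and it ignores activation memory and CUDA context overhead. The function name is hypothetical.

```python
WEIGHTS_GB = 5.34               # Q4_K_M file size from this model card
GPU_MEMORY_UTILIZATION = 0.85   # value from vllm_config.yaml


def headroom_gb(total_vram_gb: float,
                weights_gb: float = WEIGHTS_GB,
                utilization: float = GPU_MEMORY_UTILIZATION) -> float:
    """Memory left for KV cache and activations inside the utilization budget."""
    return total_vram_gb * utilization - weights_gb


for name, vram_gb in [("L4 24GB", 24.0), ("A100 80GB", 80.0)]:
    print(f"{name}: about {headroom_gb(vram_gb):.2f} GB left after weights")
```

Under these assumptions the L4 keeps roughly 15 GB for the KV cache and the A100 roughly 63 GB. Actual headroom depends on the runtime and on the context length in use.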

License

Apache 2.0 — same as original model google/gemma-4-E4B-IT.
