How to use from the llama-cpp-python library
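# !pip install llama-cpp-python huggingface-hub

from llama_cpp import Llama

# Download the GGUF file from the Hugging Face Hub (cached after the first call) and load it.
llm = Llama.from_pretrained(
    repo_id="sunil-pathak/gemma-4-E2B-it-Q4_K_M",
    filename="gemma-4-E2B-it-Q4_K_M.gguf",
)

# The response is a dict in the OpenAI chat-completion layout.
response = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is the capital of France?"},
    ]
)
print(response["choices"][0]["message"]["content"])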

gemma-4-E2B-it — GGUF (Q4_K_M)


📊 Performance Metrics

  • Size: 3.19 GB
  • Speed: 5.42 tokens/sec
  • Format: GGUF (llama.cpp optimized)
  • Quantization: Q4_K_M
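
The card does not say how the tokens/sec figure was measured or on what hardware. As a rough sketch, throughput can be estimated by timing one completion and dividing the generated token count by the elapsed time. The helper names `tokens_per_second` and `benchmark` are hypothetical, and `llm` is assumed to be a loaded `llama_cpp.Llama` instance. The measured time includes prompt processing, so the result slightly understates pure generation speed.

```python
import time


def tokens_per_second(completion_tokens: int, elapsed_seconds: float) -> float:
    """Return generated tokens divided by wall-clock seconds."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completion_tokens / elapsed_seconds


def benchmark(llm, prompt: str, max_tokens: int = 128) -> float:
    """Time one chat completion and return an approximate tokens/sec.

    `llm` is assumed to be a llama_cpp.Llama instance (hypothetical helper,
    not part of the original model card).
    """
    start = time.perf_counter()
    response = llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    elapsed = time.perf_counter() - start
    # The OpenAI-style response reports how many tokens were generated.
    generated = response["usage"]["completion_tokens"]
    return tokens_per_second(generated, elapsed)
```

Calling `benchmark(llm, "Explain AI simply")` on the target machine gives a number comparable to the figure above only if the hardware and settings match.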

🔷 Model Overview

This repository contains a GGUF quantized version of:

  • Base Model: gemma-4-E2B-it
  • Format: GGUF (optimized for llama.cpp inference)
  • Quantization: Q4_K_M (4-bit)
  • Purpose: Efficient local inference on CPU/GPU

GGUF format provides:

  • Fast loading via memory mapping
  • Single-file model distribution
  • Cross-platform compatibility
  • Efficient inference with llama.cpp
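
Because everything ships in a single file, a download can be sanity-checked by reading the fixed-size header at the start of the file. The sketch below is not from the original card and assumes a little-endian GGUF file of version 2 or later, where the header is the 4-byte magic `GGUF`, a 32-bit version, and two 64-bit counts. The function name `read_gguf_header` is hypothetical.

```python
import struct

GGUF_MAGIC = b"GGUF"
# Assumed layout (little-endian, GGUF v2+): magic, version, tensor count, metadata KV count.
HEADER_FORMAT = "<4sIQQ"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 24 bytes


def read_gguf_header(path: str) -> dict:
    """Read the fixed-size GGUF header and return its fields as a dict."""
    with open(path, "rb") as f:
        raw = f.read(HEADER_SIZE)
    if len(raw) < HEADER_SIZE:
        raise ValueError("file is too short to contain a GGUF header")
    magic, version, tensor_count, kv_count = struct.unpack(HEADER_FORMAT, raw)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file (magic={magic!r})")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }
```

For example, `read_gguf_header("gemma-4-E2B-it-Q4_K_M.gguf")` should raise `ValueError` on a truncated or mislabeled download before any time is spent loading the model.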

📦 Files

File                          Description
gemma-4-E2B-it-Q4_K_M.gguf    Quantized GGUF model file

⚙️ Technical Details

Parameter       Value
Base model      gemma-4-E2B-it
Architecture    gemma4
Format          GGUF
Quantization    Q4_K_M
Runtime         llama.cpp
Use case        Local inference / deployment

⚡ Why GGUF?

GGUF is designed for efficient inference:

  • Optimized for llama.cpp
  • Supports CPU and GPU inference
  • Single-file deployment
  • Memory-mapped loading for speed
  • Ideal for edge / local environments
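
With llama-cpp-python, switching between CPU and GPU inference is typically a matter of load-time options. The sketch below is illustrative only: `build_load_kwargs` and `load_model` are hypothetical helpers, the 4096-token context is an arbitrary assumption, and GPU offload only takes effect if llama-cpp-python was built with GPU support.

```python
def build_load_kwargs(use_gpu: bool, context_tokens: int = 4096) -> dict:
    """Return keyword arguments for the Llama constructor (hypothetical helper)."""
    return {
        "n_ctx": context_tokens,               # context window size in tokens
        "n_gpu_layers": -1 if use_gpu else 0,  # -1 offloads all layers, 0 keeps them on CPU
        "verbose": False,                      # silence llama.cpp load logs
    }


def load_model(use_gpu: bool = False):
    """Download (if needed) and load the quantized model with the chosen options."""
    # Imported lazily so this module can be imported without llama-cpp-python installed.
    from llama_cpp import Llama

    return Llama.from_pretrained(
        repo_id="sunil-pathak/gemma-4-E2B-it-Q4_K_M",
        filename="gemma-4-E2B-it-Q4_K_M.gguf",
        **build_load_kwargs(use_gpu),
    )
```

Keeping the options in a plain dict makes it easy to log or compare the settings used for a given run.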

⚠️ License & Usage

This is a converted derivative model.

  • You must comply with the original model license of gemma-4-E2B-it
  • This is not an official release
  • No additional rights are granted
  • Original ownership remains with the base model creator

🚀 Quick Start (llama.cpp)

./llama-cli -m gemma-4-E2B-it-Q4_K_M.gguf -p "Explain AI simply"
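
The same command can be driven from a script. This is a minimal sketch, assuming the `llama-cli` binary and the model file are in the current directory. The helper names are hypothetical, and `-n` (maximum number of tokens to generate) is added to bound the output. Some llama.cpp builds may start an interactive chat session for models that ship a chat template, in which case the call would not return on its own.

```python
import subprocess


def build_llama_cli_command(
    model_path: str,
    prompt: str,
    max_new_tokens: int = 128,
    binary: str = "./llama-cli",
) -> list:
    """Build the argv list for a one-shot llama-cli run (hypothetical helper)."""
    return [binary, "-m", model_path, "-p", prompt, "-n", str(max_new_tokens)]


def run_llama_cli(model_path: str, prompt: str, max_new_tokens: int = 128) -> str:
    """Run llama-cli and return its standard output as text."""
    cmd = build_llama_cli_command(model_path, prompt, max_new_tokens)
    # check=True raises CalledProcessError if llama-cli exits with a non-zero status.
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems when the prompt contains spaces or quotes.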

GGUF metadata: model size 5B params, architecture gemma4