Instructions to use rwiecekgmailcom/gemma4-e4b-claude-coder-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use rwiecekgmailcom/gemma4-e4b-claude-coder-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="rwiecekgmailcom/gemma4-e4b-claude-coder-GGUF",
	filename="gemma4-e4b-claude-coder.Q4_K_M.gguf",
)

output = llm(
	"Once upon a time,",
	max_tokens=512,
	echo=True
)
print(output)

Notebooks
Google Colab
Kaggle
Local Apps

llama.cpp

How to use rwiecekgmailcom/gemma4-e4b-claude-coder-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf rwiecekgmailcom/gemma4-e4b-claude-coder-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf rwiecekgmailcom/gemma4-e4b-claude-coder-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf rwiecekgmailcom/gemma4-e4b-claude-coder-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf rwiecekgmailcom/gemma4-e4b-claude-coder-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf rwiecekgmailcom/gemma4-e4b-claude-coder-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf rwiecekgmailcom/gemma4-e4b-claude-coder-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf rwiecekgmailcom/gemma4-e4b-claude-coder-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf rwiecekgmailcom/gemma4-e4b-claude-coder-GGUF:Q4_K_M

Use Docker

docker model run hf.co/rwiecekgmailcom/gemma4-e4b-claude-coder-GGUF:Q4_K_M

LM Studio
Jan

vLLM

How to use rwiecekgmailcom/gemma4-e4b-claude-coder-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "rwiecekgmailcom/gemma4-e4b-claude-coder-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "rwiecekgmailcom/gemma4-e4b-claude-coder-GGUF",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker

docker model run hf.co/rwiecekgmailcom/gemma4-e4b-claude-coder-GGUF:Q4_K_M

Ollama
How to use rwiecekgmailcom/gemma4-e4b-claude-coder-GGUF with Ollama:
```
ollama run hf.co/rwiecekgmailcom/gemma4-e4b-claude-coder-GGUF:Q4_K_M
```

Unsloth Studio

How to use rwiecekgmailcom/gemma4-e4b-claude-coder-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for rwiecekgmailcom/gemma4-e4b-claude-coder-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for rwiecekgmailcom/gemma4-e4b-claude-coder-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for rwiecekgmailcom/gemma4-e4b-claude-coder-GGUF to start chatting

Docker Model Runner
How to use rwiecekgmailcom/gemma4-e4b-claude-coder-GGUF with Docker Model Runner:
```
docker model run hf.co/rwiecekgmailcom/gemma4-e4b-claude-coder-GGUF:Q4_K_M
```

Lemonade

How to use rwiecekgmailcom/gemma4-e4b-claude-coder-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull rwiecekgmailcom/gemma4-e4b-claude-coder-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.gemma4-e4b-claude-coder-GGUF-Q4_K_M

List all available models

lemonade list

Gemma 4 Claude Coder — local model family

A family of custom models built on Gemma 4 (edge variants E2B and E4B), tuned to act as autonomous coding and administration agents. The models speak the Anthropic-compatible API, so they drive Claude Code fully locally — your code never leaves your machine and cloud token cost drops to zero.

Each model ships with a system prompt focused on real work inside a codebase: use tools instead of guessing, make minimal and precise code changes, return complete and runnable output, and verify after acting. Sampling follows Google's official Gemma 4 recommendation (temperature 1.0, top_k 64, top_p 0.95), with thinking mode enabled for better planning before a tool call.

The idea

The whole point of this family is to run Claude Code on small, popular, consumer-grade hardware. No datacenter GPU, no cloud bill — just an everyday Mac Mini (or similar 16 GB machine) acting as a fully local, agentic coding assistant. These models make that practical: light enough to fit, smart enough to drive real tool-calling agent loops.

In a time of RAM shortages and the big tech giants tightening usage limits and quotas, owning a capable agent that runs entirely on your own modest hardware stops being a hobby and becomes leverage: no rate limits, no surprise pricing, no dependency on someone else's quota.

Models in the family

Model	Base	Context	Purpose
gemma4-e2b-claude-coder	Gemma 4 E2B (eff. 2B / 5.1B with embeddings)	64K	Fast everyday coding agent — edits, autocomplete, short agent loops. Lightest on memory.
gemma4-e4b-claude-coder	Gemma 4 E4B (eff. 4B / 8B with embeddings)	64K	Stronger coding agent — better reasoning and tool use on larger tasks.
gemma4-e4b-claude-coder-admin	Gemma 4 E4B	32K	Administration and system tasks (scripts, shell, devops). Smaller context fits 100% in GPU for higher, stable throughput.

What it's for

Driving Claude Code locally (ollama launch claude --model <name>).
Agentic code writing and editing with native function calling / tool use.
Administration and devops tasks on a server (the admin variant).
Full privacy and offline operation — no code sent to the cloud.

Context

Coders (E2B / E4B): 64K tokens — matching Claude Code's recommendation (64K minimum).
Admin (E4B): 32K tokens — a deliberate trade-off for 16 GB hardware that keeps the model entirely on the GPU.
Base Gemma 4 E2B/E4B natively supports up to 128K, so context can be raised on stronger hardware.

Test hardware

The models were built and tested on:

Mac Mini (Apple Silicon, M-series), 16 GB RAM, macOS 15.6
Ollama 0.24, GPU (Metal) inference

Measured performance (16 GB RAM)

Model	Placement	Speed	Tool calling
gemma4-e2b-claude-coder	100% GPU	~55 tok/s	✅ valid JSON
gemma4-e4b-claude-coder (64K)	39% GPU / 61% CPU	~27 tok/s (drops under load)	✅
gemma4-e4b-claude-coder-admin (32K)	100% GPU	~30 tok/s (stable)	✅

All three passed an end-to-end test through Claude Code: real turns with tool calls and correct responses (HTTP 200 on /v1/messages).

How they were made

These models were designed, built and tested with the help of Claude Opus 4.8 — the best coding model in the world. Their system prompts, parameter choices and context configuration draw directly on its knowledge. In other words: the world's best coding model prepared local models that take that work over right on your desk.

License

Apache 2.0 (inherited from the base Gemma 4).

Downloads last month: 321

GGUF

Model size

8B params

Architecture

gemma4

Hardware compatibility

4-bit