Instructions to use nphearum/Gemma-4-e2b-CodeX-Distill-v1-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use nphearum/Gemma-4-e2b-CodeX-Distill-v1-GGUF with llama-cpp-python:
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="nphearum/Gemma-4-e2b-CodeX-Distill-v1-GGUF",
    # Load the main model weights; the BF16-mmproj file is only the multimodal projector.
    filename="gemma-4-e2b-it.Q8_0.gguf",
)
llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use nphearum/Gemma-4-e2b-CodeX-Distill-v1-GGUF with llama.cpp:
Install from brew
brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf nphearum/Gemma-4-e2b-CodeX-Distill-v1-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf nphearum/Gemma-4-e2b-CodeX-Distill-v1-GGUF:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf nphearum/Gemma-4-e2b-CodeX-Distill-v1-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf nphearum/Gemma-4-e2b-CodeX-Distill-v1-GGUF:Q4_K_M
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf nphearum/Gemma-4-e2b-CodeX-Distill-v1-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf nphearum/Gemma-4-e2b-CodeX-Distill-v1-GGUF:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf nphearum/Gemma-4-e2b-CodeX-Distill-v1-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf nphearum/Gemma-4-e2b-CodeX-Distill-v1-GGUF:Q4_K_M
Use Docker
docker model run hf.co/nphearum/Gemma-4-e2b-CodeX-Distill-v1-GGUF:Q4_K_M
- LM Studio
- Jan
- vLLM
How to use nphearum/Gemma-4-e2b-CodeX-Distill-v1-GGUF with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "nphearum/Gemma-4-e2b-CodeX-Distill-v1-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "nphearum/Gemma-4-e2b-CodeX-Distill-v1-GGUF",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
Use Docker
docker model run hf.co/nphearum/Gemma-4-e2b-CodeX-Distill-v1-GGUF:Q4_K_M
- Ollama
How to use nphearum/Gemma-4-e2b-CodeX-Distill-v1-GGUF with Ollama:
ollama run hf.co/nphearum/Gemma-4-e2b-CodeX-Distill-v1-GGUF:Q4_K_M
- Unsloth Studio
How to use nphearum/Gemma-4-e2b-CodeX-Distill-v1-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for nphearum/Gemma-4-e2b-CodeX-Distill-v1-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for nphearum/Gemma-4-e2b-CodeX-Distill-v1-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for nphearum/Gemma-4-e2b-CodeX-Distill-v1-GGUF to start chatting
- Pi
How to use nphearum/Gemma-4-e2b-CodeX-Distill-v1-GGUF with Pi:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf nphearum/Gemma-4-e2b-CodeX-Distill-v1-GGUF:Q4_K_M
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "nphearum/Gemma-4-e2b-CodeX-Distill-v1-GGUF:Q4_K_M" }
      ]
    }
  }
}
Run Pi
# Start Pi in your project directory:
pi
- Hermes Agent
How to use nphearum/Gemma-4-e2b-CodeX-Distill-v1-GGUF with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf nphearum/Gemma-4-e2b-CodeX-Distill-v1-GGUF:Q4_K_M
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default nphearum/Gemma-4-e2b-CodeX-Distill-v1-GGUF:Q4_K_M
Run Hermes
hermes
- Docker Model Runner
How to use nphearum/Gemma-4-e2b-CodeX-Distill-v1-GGUF with Docker Model Runner:
docker model run hf.co/nphearum/Gemma-4-e2b-CodeX-Distill-v1-GGUF:Q4_K_M
- Lemonade
How to use nphearum/Gemma-4-e2b-CodeX-Distill-v1-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull nphearum/Gemma-4-e2b-CodeX-Distill-v1-GGUF:Q4_K_M
Run and chat with the model
lemonade run user.Gemma-4-e2b-CodeX-Distill-v1-GGUF-Q4_K_M
List all available models
lemonade list
Gemma-4-e2b-CodeX-Distill-v1-GGUF
A distilled code-focused variant of Gemma-4 e2b, optimized for efficient local inference using GGUF format. This model is designed for coding assistance, reasoning, and structured generation tasks, with optional “thinking” mode enabled via chat templates.
Example usage:
- For text only LLMs:
llama-cli -hf nphearum/Gemma-4-e2b-CodeX-Distill-v1-GGUF --jinja
- For multimodal models:
llama-mtmd-cli -hf nphearum/Gemma-4-e2b-CodeX-Distill-v1-GGUF --jinja
📦 Available Model Files
- gemma-4-e2b-it.Q8_0.gguf: Quantized model (Q8_0 for high quality)
- gemma-4-e2b-it.BF16-mmproj.gguf: Multimodal projection (required for full functionality)
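
To fetch these files from a script rather than by hand, one option is the `huggingface_hub` package. The sketch below assumes that package is installed (`pip install huggingface_hub`) and that the file names match the list above; `files_to_fetch` and `download_model_files` are hypothetical helper names, not part of this repository.

```python
REPO_ID = "nphearum/Gemma-4-e2b-CodeX-Distill-v1-GGUF"

# File names as given in the list of available model files.
MODEL_FILE = "gemma-4-e2b-it.Q8_0.gguf"
MMPROJ_FILE = "gemma-4-e2b-it.BF16-mmproj.gguf"


def files_to_fetch(with_mmproj: bool = True) -> list:
    """Return the file names needed for text-only or multimodal use."""
    return [MODEL_FILE, MMPROJ_FILE] if with_mmproj else [MODEL_FILE]


def download_model_files(with_mmproj: bool = True) -> dict:
    """Download the files into the local Hugging Face cache; map name -> local path."""
    # Imported lazily so this module still loads when huggingface_hub is absent.
    from huggingface_hub import hf_hub_download

    return {
        name: hf_hub_download(repo_id=REPO_ID, filename=name)
        for name in files_to_fetch(with_mmproj)
    }
```

Calling `download_model_files()` returns local paths that can be passed to `-m` and `--mmproj`; pass `with_mmproj=False` for text-only use.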
🚀 Features
- Strong code generation & reasoning (CodeX-style distillation)
- Long context support (tested up to 131k tokens)
- Optimized for llama.cpp
- Supports structured chat templates (Jinja-based)
- Optional “thinking mode” for better reasoning traces
🖥️ Running with llama.cpp
Make sure you’re using a recent build of llama.cpp with:
- Flash Attention enabled
- Jinja/chat template support compiled
Start Server
llama-server \
-m gemma-4-e2b-it.Q8_0.gguf \
--port 53281 \
-c 131072 \
--parallel 1 \
--flash-attn on \
--no-context-shift \
-ngl -1 \
--jinja \
--chat-template-kwargs "{\"enable_thinking\": true}" \
--mmproj gemma-4-e2b-it.BF16-mmproj.gguf
Key Flags Explained
- `-c 131072` → Enables long context (131k tokens)
- `--flash-attn on` → Faster attention (requires compatible GPU)
- `-ngl -1` → Offload all layers to GPU
- `--jinja` → Enables chat template rendering
- `--chat-template-kwargs` → Activates thinking mode
- `--mmproj` → Required for multimodal projection
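
The escaped JSON in `--chat-template-kwargs` is easy to get wrong in a shell. As a rough sketch, the same command can be assembled in Python, where `json.dumps` produces the JSON string and no manual quote escaping is needed. `build_server_command` is a hypothetical helper; it only uses the flags shown in the Start Server command.

```python
import json
import shlex
from typing import Optional


def build_server_command(
    model_path: str,
    mmproj_path: Optional[str] = None,
    port: int = 53281,
    context: int = 131072,
    enable_thinking: bool = True,
) -> list:
    """Assemble the llama-server argument list from the Start Server example."""
    cmd = [
        "llama-server",
        "-m", model_path,
        "--port", str(port),
        "-c", str(context),
        "--parallel", "1",
        "--flash-attn", "on",
        "--no-context-shift",
        "-ngl", "-1",
        "--jinja",
        # json.dumps yields valid JSON, e.g. {"enable_thinking": true}
        "--chat-template-kwargs", json.dumps({"enable_thinking": enable_thinking}),
    ]
    if mmproj_path is not None:
        # Only needed for multimodal use.
        cmd += ["--mmproj", mmproj_path]
    return cmd


cmd = build_server_command(
    "gemma-4-e2b-it.Q8_0.gguf", "gemma-4-e2b-it.BF16-mmproj.gguf"
)
print(shlex.join(cmd))  # shell-quoted form for copy and paste
```

The returned list can be handed directly to `subprocess.Popen(cmd)`, which bypasses shell quoting entirely.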
Test Request
curl http://localhost:53281/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "Write a Python function to reverse a linked list"}
]
}'
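
The same test request can be sent from Python using only the standard library. This is a minimal sketch that assumes the server from the Start Server step is listening on port 53281 and returns an OpenAI-style response; `build_chat_request`, `extract_reply`, and `ask` are illustrative names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:53281/v1"  # port from the Start Server command


def build_chat_request(prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST request that mirrors the curl example."""
    payload = {"messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response: dict) -> str:
    """Return the assistant text from an OpenAI-style chat completion."""
    return response["choices"][0]["message"]["content"]


def ask(prompt: str, timeout: float = 600.0) -> str:
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        return extract_reply(json.load(resp))
```

With the server running, `print(ask("Write a Python function to reverse a linked list"))` prints the model's answer. The generous timeout is a guess to accommodate long generations.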
🧠 Notes on Thinking Mode
When enable_thinking=true, the model may:
- Produce intermediate reasoning steps
- Improve structured problem solving
- Slightly increase latency
Disable it if you need faster responses.
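
Thinking mode is set at launch through `--chat-template-kwargs`. Recent llama-server builds may also accept a `chat_template_kwargs` field in the request body, which would allow toggling it per request; treat that as an assumption and check the server documentation for your build. A sketch of such a payload, with `chat_payload` as a hypothetical helper:

```python
import json
from typing import Optional


def chat_payload(prompt: str, enable_thinking: Optional[bool] = None) -> str:
    """Build a chat completion body, optionally overriding thinking mode."""
    body = {"messages": [{"role": "user", "content": prompt}]}
    if enable_thinking is not None:
        # Assumption: the server honours a per-request override. If it does not,
        # the launch-time --chat-template-kwargs value still applies.
        body["chat_template_kwargs"] = {"enable_thinking": enable_thinking}
    return json.dumps(body)


# Fast path: ask for an answer without reasoning traces.
print(chat_payload("Summarize this function in one line.", enable_thinking=False))
```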
🦙 Running with Ollama
⚠️ Important: Ollama currently does not support separate mmproj files for vision models, so the Modelfile below loads only the main GGUF file.
Create a Modelfile:
FROM ./gemma-4-e2b-it.Q8_0.gguf
PARAMETER num_ctx 131072
PARAMETER num_gpu -1
PARAMETER stop "<end_of_turn>"
TEMPLATE """{{ if .System }}<start_of_turn>system
{{ .System }}<end_of_turn>
{{ end }}{{ if .Prompt }}<start_of_turn>user
{{ .Prompt }}<end_of_turn>
<start_of_turn>model
{{ end }}"""
# Optional: enable reasoning-style outputs
SYSTEM "You are a highly capable coding assistant with strong reasoning ability."
Build & Run
ollama create gemma-4-codex -f Modelfile
ollama run gemma-4-codex
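
Once the model has been created, it can also be queried over Ollama's local REST API. The sketch below assumes Ollama is running on its default address (`http://localhost:11434`) and that the model was created under the name `gemma-4-codex` as above; the helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local address
MODEL_NAME = "gemma-4-codex"  # name passed to `ollama create`


def build_ollama_request(prompt: str) -> urllib.request.Request:
    """Build a non-streaming chat request for the Ollama API."""
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # return one JSON object instead of a chunk stream
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_ollama(prompt: str, timeout: float = 600.0) -> str:
    """Send the prompt to Ollama and return the assistant's reply text."""
    with urllib.request.urlopen(build_ollama_request(prompt), timeout=timeout) as resp:
        return json.load(resp)["message"]["content"]
```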
⚙️ Recommended Settings
| Use Case | Context | GPU Layers | Notes |
|---|---|---|---|
| Coding assistant | 32k–64k | Full (-1) | Best balance |
| Long reasoning | 131k | Full | Needs high VRAM |
| Low VRAM setup | 8k–16k | Partial | Disable flash-attn |
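
The table can be turned into llama-server flags with a small lookup. This is only a sketch: the presets take the lower end of each context range, the table does not say how many layers "Partial" means (so `partial_layers` must be supplied by the caller), and keeping flash attention on for the first two rows is an assumption carried over from the Start Server command.

```python
from typing import Optional

# Presets transcribed from the table; context is the lower end of each range.
PRESETS = {
    "coding": {"context": 32768, "full_offload": True, "flash_attn": True},
    "long_reasoning": {"context": 131072, "full_offload": True, "flash_attn": True},
    "low_vram": {"context": 8192, "full_offload": False, "flash_attn": False},
}


def preset_flags(use_case: str, partial_layers: Optional[int] = None) -> list:
    """Return llama-server flags for one of the recommended setups."""
    preset = PRESETS[use_case]
    if preset["full_offload"]:
        ngl = "-1"
    elif partial_layers is None:
        # The table only says "Partial"; the right number depends on your GPU.
        raise ValueError("low-VRAM preset needs an explicit partial_layers value")
    else:
        ngl = str(partial_layers)
    return [
        "-c", str(preset["context"]),
        "-ngl", ngl,
        "--flash-attn", "on" if preset["flash_attn"] else "off",
    ]


print(preset_flags("coding"))
```

For the low-VRAM row, a call such as `preset_flags("low_vram", partial_layers=12)` works, where 12 is an arbitrary placeholder to tune against available memory.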
⚠️ Limitations
- Requires significant VRAM for full 131k context
- Thinking mode increases latency
- Multimodal projection file must match model variant
📜 License
Follow the original Gemma license and any additional terms from this distillation.
🙌 Credits
- Base model: Google Gemma family
- Distillation: Code-focused adaptation
- Runtime: llama.cpp ecosystem