Instructions to use CompressedGemma/gemma-4-31B-it-compressed with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use CompressedGemma/gemma-4-31B-it-compressed with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="CompressedGemma/gemma-4-31B-it-compressed", filename="Gemma-31B-it-quant.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use CompressedGemma/gemma-4-31B-it-compressed with llama.cpp:
Install (macOS, Linux)
curl -LsSf https://llama.app/install.sh | sh # Start a local OpenAI-compatible server with a web UI: llama serve -hf CompressedGemma/gemma-4-31B-it-compressed # Run inference directly in the terminal: llama cli -hf CompressedGemma/gemma-4-31B-it-compressed
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama serve -hf CompressedGemma/gemma-4-31B-it-compressed # Run inference directly in the terminal: llama cli -hf CompressedGemma/gemma-4-31B-it-compressed
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf CompressedGemma/gemma-4-31B-it-compressed # Run inference directly in the terminal: ./llama-cli -hf CompressedGemma/gemma-4-31B-it-compressed
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf CompressedGemma/gemma-4-31B-it-compressed # Run inference directly in the terminal: ./build/bin/llama-cli -hf CompressedGemma/gemma-4-31B-it-compressed
Use Docker
docker model run hf.co/CompressedGemma/gemma-4-31B-it-compressed
- LM Studio
- Jan
- Ollama
How to use CompressedGemma/gemma-4-31B-it-compressed with Ollama:
ollama run hf.co/CompressedGemma/gemma-4-31B-it-compressed
- Unsloth Studio
How to use CompressedGemma/gemma-4-31B-it-compressed with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for CompressedGemma/gemma-4-31B-it-compressed to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for CompressedGemma/gemma-4-31B-it-compressed to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for CompressedGemma/gemma-4-31B-it-compressed to start chatting
- Pi
How to use CompressedGemma/gemma-4-31B-it-compressed with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama serve -hf CompressedGemma/gemma-4-31B-it-compressed
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "CompressedGemma/gemma-4-31B-it-compressed" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use CompressedGemma/gemma-4-31B-it-compressed with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama serve -hf CompressedGemma/gemma-4-31B-it-compressed
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default CompressedGemma/gemma-4-31B-it-compressed
Run Hermes
hermes
- Atomic Chat new
- Docker Model Runner
How to use CompressedGemma/gemma-4-31B-it-compressed with Docker Model Runner:
docker model run hf.co/CompressedGemma/gemma-4-31B-it-compressed
- Lemonade
How to use CompressedGemma/gemma-4-31B-it-compressed with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull CompressedGemma/gemma-4-31B-it-compressed
Run and chat with the model
lemonade run user.gemma-4-31B-it-compressed-{{QUANT_TAG}}List all available models
lemonade list
YAML Metadata Warning:empty or missing yaml metadata in repo card
Check out the documentation for more information.
Gemma-4-31B-it — Q2_K + Q4_0·HPC
The smallest functional quantization of Gemma 4 31B. 12 GB. Runs on 16 GB hardware. Reasons like Q5-Q6.
No other public quantization of this model fits in 16 GB. The smallest community quant (Q4_K_M) is ~18–20 GB and requires 24+ GB at runtime. This fits and runs with headroom to spare.
This is not a typical Q2 quant. Standard Q2 quantization destroys reasoning capability.
HPC uses anisotropic error optimization — D₆ vesica gate error shaping + global belief propagation — to push quantization noise into dimensions orthogonal to the computation flow. The reasoning substrate survives intact.
Note: In this version, there is no need to use penalty for repeating tokens but I am lazy and don't feel like changing the readme.
For those who have used this and want to use HPC, I don't care if you do that to make your own quants and I do not require citation, I do not develop for the recognition or whatever..
Model Details
| Base Model | google/gemma-4-31B-it |
| Architecture | Gemma 4 Dense — 31B params, 60 layers, 5376 hidden, sliding window + full attention |
| Quantization | Mixed Q2_K (2.63 bpw) + Q4_0·HPC (4.5 bpw) |
| File Size | 12 GB |
| Format | GGUF v3 — compatible with llama.cpp, LM Studio, Ollama |
| Quantizer | HPC |
Precision Tiers
| Layer Type | Quantization | BPW | Method |
|---|---|---|---|
| Attention Q/K/V/O | Q4_0·HPC | 4.5 | 24-beam Hensel search + triality BP (16 candidates) |
| FFN gate/up/down | Q2_K·HPC | 2.63 | 24-beam Hensel search + triality BP (16×16 = 256 candidates) |
| Embeddings / Norms | F32 | 32 | Preserved |
Tensor Distribution
| Type | Count | Purpose |
|---|---|---|
| Q4_0·HPC | 230 | Attention projections |
| Q2_K·HPC | 181 | FFN / MLP weights |
| F32 | 422 | Embeddings, norms, biases |
| Total | 833 |
Size Comparison
| Quantization | Size | Fits 16 GB? | Source |
|---|---|---|---|
| BF16 | 58 GB | ❌ | |
| Q8_0 | ~33 GB | ❌ | Community |
| Q6_K | ~25 GB | ❌ | Community |
| Q4_K_M | ~19 GB | ❌ | LM Studio / bartowski |
| This | 12 GB | ✅ | HPC |
Quick Start
LM Studio
- Download the GGUF
- Place in your LM Studio models directory
- Load and chat — LM Studio auto-detects the Gemma 4 template
llama.cpp Server
# Download the updated Gemma 4 chat template (required for correct output)
curl -L -o gemma4_chat_template.jinja \
"https://huggingface.co/google/gemma-4-31B-it/raw/main/chat_template.jinja"
# Launch the server
llama-server \
-m Gemma-4-31B-it-Q2_K.gguf \
-ngl 0 \
-c 4096 \
--host 0.0.0.0 --port 8989 \
--jinja \
--chat-template-file gemma4_chat_template.jinja \
--cache-ram 0 \
-ctxcp 1
Important flags:
--jinja --chat-template-file— Uses Google's latest Gemma 4 template. The template embedded in older GGUFs is broken. Without this, you get garbage output.--cache-ram 0 -ctxcp 1— Prevents the sliding window attention checkpoint RAM explosion that affects all Gemma 4 models.-ngl 0— CPU-only. Increase for GPU offload (e.g.,-ngl 30for partial offload on 16 GB VRAM).
llama.cpp CLI
llama-cli \
-m Gemma-4-31B-it-Q2_K.gguf \
--jinja \
--chat-template-file gemma4_chat_template.jinja \
-p "Implement a concurrent hash map in C" \
-n 512 --temp 0 --repeat-penalty 1.5
Ollama
⚠️ Ollama has known issues with Gemma 4. If you get garbage output, switch to llama.cpp server or LM Studio. This is an Ollama-side problem, not a model issue.
Chat Template: Thought Resumption
Note: The Gemma 4 chat template supports thought resumption — if the model's thinking chain is interrupted by a bad token or truncation, the template allows the model to pick up and continue the broken thought rather than restarting or hallucinating. This is particularly useful at low BPW where rare token noise can occasionally disrupt a reasoning chain mid-thought. The model will close the broken
<think>block and resume coherently.
FROM ./Gemma-4-31B-it-Q2_K.gguf
PARAMETER temperature 0
PARAMETER num_ctx 4096
PARAMETER repeat_penalty 1.5
PARAMETER top_k 1
PARAMETER mlock true
API Usage
Once the server is running, use the OpenAI-compatible API:
curl http://localhost:8989/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": "Write a Python LRU cache"}],
"temperature": 0,
"max_tokens": 512,
"repeat_penalty": 1.5
}'
Recommended Settings
| Parameter | Value | Why |
|---|---|---|
temperature |
0 | Deterministic — eliminates sampling noise at low BPW, prevents token repetition loops |
repeat_penalty |
1.5 | High penalty aggressively suppresses repeating tokens, critical for coherent output at 2-bit |
top_k |
1 | Greedy decoding — always pick the highest-probability token |
top_p |
1.0 | Disabled when temp=0 (no effect with greedy decoding) |
context |
2048–4096 | Higher contexts increase RAM usage significantly |
Why temp=0 with high repeat penalty? At 2-bit quantization, the probability distribution over tokens is noisier than the original model. Non-zero temperature amplifies this noise, causing the model to sample low-confidence tokens that trigger self-correction loops. Setting
temperature 0forces greedy decoding — always picking the most likely token — which keeps output on the model's strongest signal. The highrepeat_penalty(1.5) prevents the degenerate case where greedy decoding gets stuck in a loop, penalizing any token the model has already emitted.
How It Works
Standard quantizers use round-to-nearest: for each weight block, compute a scale and round. This uses HPC beam search with triality-enhanced belief propagation — a fundamentally different approach.
The Pipeline
┌─────────────────────────────────────────────────────────────┐
│ For each weight tensor: │
│ │
│ 1. Compute greedy reference scales per block │
│ 2. Generate candidate grid (16×16 = 256 scale variants) │
│ 3. Encode candidates as Z₆ complex amplitudes │
│ 4. Build constraint graph (inter-block coupling) │
│ 5. Run belief propagation in 3 simultaneous views: │
│ Edge × Vertex × Diagonal (triality) │
│ 6. Combine via geometric mean: │
│ marginal[v] = ∛(edge × vertex × diagonal) │
│ 7. 24-beam Hensel search using combined marginals │
│ (6,144 extensions evaluated per block) │
│ 8. Pack into GGUF blocks with optimal scales │
└─────────────────────────────────────────────────────────────┘
Why Attention Gets Q4_0
Quantization noise in attention projections cascades through softmax(Q·K^T/√d)·V. A single bad scale in a Q block shifts dot products enough to promote wrong tokens — manifesting as:
- Korean/Arabic character injection
- Word substitutions
- Self-correction loops
Promoting Q/K/V/O to Q4_0 (16 levels vs 4) eliminates these artifacts at a cost of only ~1.5 GB.
RMSE Quality (v2.1 — 24-beam, 256-candidate)
| Metric | v2.0 (12-beam, 100-cand) | v2.1 (24-beam, 256-cand) |
|---|---|---|
| Q4_0·HPC attention RMSE | 3.0e-03 | 1.2–1.5e-03 |
| Q2_K·HPC FFN RMSE | 3.0e-03 | 1.5–1.8e-03 |
| RMSE spread | ±0.5e-03 | ±0.3e-03 |
| e-02 outliers | 0 | 0 |
| MSE reduction vs v2.0 | — | ~4× lower |
The 2× RMSE improvement (4× MSE) comes from denser scale sampling (0.022 grid gap vs 0.035) and wider beam survival (24 vs 12 joint configurations). Every tensor in the model — attention and FFN — achieves RMSE in the 1.2–1.8e-03 range with no outliers. This is reconstruction fidelity typically associated with Q4_K, achieved at Q2_K file size.
Reasoning Verification
All tests run at --temp 0 --repeat-penalty 1.15 on a single RTX 3060 12GB. Zero cherry-picking — every result shown is from the first attempt.
Combinatorial Logic
| Test | Result |
|---|---|
| 25 Horses Problem — "Find the 3 fastest horses with a 5-lane track, no stopwatch. Minimum races?" | ✅ Correct answer (7) with complete elimination proof. Correctly identified all 5 candidates {W2, W3, W1_2nd, W1_3rd, W2_2nd}. Self-verified by checking every possible missed candidate. |
| Arto Inkala's "World's Hardest Sudoku" — AC-3 constraint propagation + MRV backtracking | ✅ Solved on first attempt. Correct implementation of arc consistency with minimum remaining values heuristic. |
Algorithm Implementation
| Test | Difficulty | Result |
|---|---|---|
| LRU Cache (Python) — O(1) get/put with doubly-linked list + hashmap | Medium | ✅ Perfect architecture, correct sentinel nodes, correct eviction. 9/10 |
| Interval Merge with Query (C) — sort, merge touching intervals, point query | Hard | ✅ Correct algorithm, touching boundary logic (<= not <), all edge cases. 9/10 |
| Compiler Constant Folding (C) — AST optimization pass with partial evaluation | Expert | ✅ Correct post-order traversal, correct partial evaluation, div-by-zero handling, ((2+3)*x) + (10/2 - 3*1) → (5*x) + 2. 9/10 |
Type Theory
| Test | Result |
|---|---|
| Hindley-Milner Type Inference (Python) — complete implementation from scratch | ✅ Correct Robinson's unification with occurs check. Correct substitution composition. Correct let-polymorphism — let id = λx. x in id id infers (T8 -> T8) while lambda args remain monomorphic. 8/10 |
Self-Diagnosis
| Test | Result |
|---|---|
| Factoring Engine Bug Analysis — given 2500 lines of unfamiliar C code, identify bugs | ✅ Identified 3 non-obvious bugs: (1) destructive accumulator reset wiping cross-base period convergence, (2) successful overshoot value not committed to global state, (3) noise lock logic gap. All diagnoses correct. Fixes compiled and ran. |
What This Means
Standard Q2 quantization produces models that can barely maintain coherent conversation. This Q2 quant:
- Solves combinatorial proofs requiring 7-step elimination chains
- Implements graduate-level programming language theory from scratch
- Correctly analyzes and debugs unfamiliar production C codebases
- Achieves Q5-equivalent reasoning at Q2 file size
The quantization noise is still there — the RMSE proves it — but the D₆ vesica gate has rotated it into dimensions the transformer doesn't use for reasoning.
Gemma 4 31B Architecture
Unlike the 26B MoE variant, the 31B is a dense transformer — every parameter is active on every token. This means:
| Property | Value |
|---|---|
| Total parameters | 31B |
| Active parameters | 31B (all, every token) |
| Hidden size | 5376 |
| Layers | 60 |
| Attention heads | 32 |
| KV heads (sliding) | 16 |
| KV heads (full/global) | 4 |
| Head dim (sliding) | 256 |
| Head dim (full/global) | 512 |
| FFN intermediate | 21504 |
| Sliding window | 1024 tokens |
| Full attention | Every 6th layer |
| Max context | 262,144 tokens |
| Vocab size | 262,144 |
| Activation | GeLU (tanh approx) |
| Logit softcapping | 30.0 |
| Weight tying | Yes (embed = output) |
Attention Pattern
The model alternates between sliding window attention (1024 tokens) and full attention across its 60 layers:
Layers 0–4: sliding (local context, 1024 window)
Layer 5: full (global context, all tokens)
Layers 6–10: sliding
Layer 11: full
...repeating every 6 layers...
Layer 59: full
This hybrid pattern gives the model local efficiency with periodic global context integration — 50 sliding layers + 10 full attention layers.
Known Limitations
Safety alignment degradation — extreme quantization (< 3 BPW) can weaken RLHF guardrails. The model may comply with requests the original would refuse. Evaluate safety properties before deployment.
Ollama compatibility — Ollama's Gemma 4 support is unreliable as of April 2026. Use llama.cpp or LM Studio.
Higher memory at runtime than 26B MoE — despite the smaller GGUF, the dense architecture activates all 31B params per token vs the 26B MoE's 4B active. KV cache and compute overhead are higher.
Long-context (8K+) stress testing — verification suite covers < 5K tokens. Long-context coherence is expected to hold but has not been formally benchmarked.
Technical Details
Q2_K Block Layout (84 bytes / 256 weights)
Offset Size Field
0 16 scales[16] 4-bit scale | 4-bit min per sub-block
16 64 qs[64] packed 2-bit quants (4 per byte)
80 2 d fp16 super-block scale
82 2 dmin fp16 super-block min scale
Q4_0 Block Layout (18 bytes / 32 weights)
Offset Size Field
0 2 d fp16 block scale
2 16 qs[16] packed 4-bit quants (2 per byte)
nibble order: qs[j] = w[j] | (w[j+16] << 4)
Gemma 4 31B Dense Handling
| Challenge | Solution |
|---|---|
| 60-layer deep network | Layer-by-layer streaming quantization |
| Hybrid sliding/full attention | Uniform Q4_0·HPC for all attention types |
| Tied embeddings (embed = lm_head) | Shared F32 tensor, single copy |
| 262K vocab embeddings | Preserved at F32 (routing-critical) |
| Sliding window (1024) | Metadata passthrough |
| Logit softcapping (30.0) | Metadata passthrough |
License
This quantization inherits the Gemma license from the base model.
HPC is MIT licensed
Credits
Quantized with HPC
- Downloads last month
- 537
We're not able to determine the quantization variants.