Gemma-4-31B-it — Q2_K + Q4_0·HPC

The smallest functional quantization of Gemma 4 31B. 12 GB. Runs on 16 GB hardware. Reasons like Q5-Q6.

No other public quantization of this model fits in 16 GB. The smallest community quant (Q4_K_M) is ~18–20 GB and requires 24+ GB at runtime. This fits and runs with headroom to spare.

This is not a typical Q2 quant. Standard Q2 quantization destroys reasoning capability.

HPC uses anisotropic error optimization — D₆ vesica gate error shaping + global belief propagation — to push quantization noise into dimensions orthogonal to the computation flow. The reasoning substrate survives intact.

Note: This version does not need a repeat penalty. The commands and settings below still show one because the README has not been updated yet.

If you have tried this quant and want to use HPC to make your own, go ahead. No citation is required; I don't develop for recognition.


Model Details

| Property | Value |
|---|---|
| Base Model | google/gemma-4-31B-it |
| Architecture | Gemma 4 Dense: 31B params, 60 layers, 5376 hidden, sliding window + full attention |
| Quantization | Mixed Q2_K (2.63 bpw) + Q4_0·HPC (4.5 bpw) |
| File Size | 12 GB |
| Format | GGUF v3, compatible with llama.cpp, LM Studio, Ollama |
| Quantizer | HPC |

Precision Tiers

| Layer Type | Quantization | BPW | Method |
|---|---|---|---|
| Attention Q/K/V/O | Q4_0·HPC | 4.5 | 24-beam Hensel search + triality BP (16 candidates) |
| FFN gate/up/down | Q2_K·HPC | 2.63 | 24-beam Hensel search + triality BP (16×16 = 256 candidates) |
| Embeddings / Norms | F32 | 32 | Preserved |

Tensor Distribution

| Type | Count | Purpose |
|---|---|---|
| Q4_0·HPC | 230 | Attention projections |
| Q2_K·HPC | 181 | FFN / MLP weights |
| F32 | 422 | Embeddings, norms, biases |
| Total | 833 | |

Size Comparison

| Quantization | Size | Fits 16 GB? | Source |
|---|---|---|---|
| BF16 | 58 GB | No | Google |
| Q8_0 | ~33 GB | No | Community |
| Q6_K | ~25 GB | No | Community |
| Q4_K_M | ~19 GB | No | LM Studio / bartowski |
| This | 12 GB | Yes | HPC |

Quick Start

LM Studio

  1. Download the GGUF
  2. Place in your LM Studio models directory
  3. Load and chat — LM Studio auto-detects the Gemma 4 template

llama.cpp Server

# Download the updated Gemma 4 chat template (required for correct output)
curl -L -o gemma4_chat_template.jinja \
  "https://huggingface.co/google/gemma-4-31B-it/raw/main/chat_template.jinja"

# Launch the server
llama-server \
  -m Gemma-4-31B-it-Q2_K.gguf \
  -ngl 0 \
  -c 4096 \
  --host 0.0.0.0 --port 8989 \
  --jinja \
  --chat-template-file gemma4_chat_template.jinja \
  --cache-ram 0 \
  -ctxcp 1

Important flags:

  • --jinja --chat-template-file — Uses Google's latest Gemma 4 template. The template embedded in older GGUFs is broken. Without this, you get garbage output.
  • --cache-ram 0 -ctxcp 1 — Prevents the sliding window attention checkpoint RAM explosion that affects all Gemma 4 models.
  • -ngl 0 — CPU-only. Increase for GPU offload (e.g., -ngl 30 for partial offload on 16 GB VRAM).

llama.cpp CLI

llama-cli \
  -m Gemma-4-31B-it-Q2_K.gguf \
  --jinja \
  --chat-template-file gemma4_chat_template.jinja \
  -p "Implement a concurrent hash map in C" \
  -n 512 --temp 0 --repeat-penalty 1.5

Ollama

⚠️ Ollama has known issues with Gemma 4. If you get garbage output, switch to llama.cpp server or LM Studio. This is an Ollama-side problem, not a model issue.

Example Modelfile:

FROM ./Gemma-4-31B-it-Q2_K.gguf

PARAMETER temperature 0
PARAMETER num_ctx 4096
PARAMETER repeat_penalty 1.5
PARAMETER top_k 1
PARAMETER mlock true

Chat Template: Thought Resumption

Note: The Gemma 4 chat template supports thought resumption. If the model's thinking chain is interrupted by a bad token or truncation, the template lets the model pick up and continue the broken thought rather than restarting or hallucinating. This is particularly useful at low BPW, where rare token noise can occasionally disrupt a reasoning chain mid-thought. The model will close the broken <think> block and resume coherently.

API Usage

Once the server is running, use the OpenAI-compatible API:

curl http://localhost:8989/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Write a Python LRU cache"}],
    "temperature": 0,
    "max_tokens": 512,
    "repeat_penalty": 1.5
  }'
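The same request can be sent from Python. This is a minimal sketch using only the standard library; it assumes the server from the llama.cpp Server section is listening on port 8989. The helper names `build_chat_request` and `chat` are illustrative, not part of any library.

```python
import json
import urllib.request

# Assumed endpoint: matches the llama-server example (port 8989).
API_URL = "http://localhost:8989/v1/chat/completions"


def build_chat_request(prompt, max_tokens=512):
    """Build (but do not send) a chat completion request."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        "max_tokens": max_tokens,
        "repeat_penalty": 1.5,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, timeout=600):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Write a Python LRU cache"))` prints the reply. The generous timeout is deliberate, since CPU-only generation of 512 tokens can take a while.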

Recommended Settings

| Parameter | Value | Why |
|---|---|---|
| temperature | 0 | Deterministic. Eliminates sampling noise at low BPW, prevents token repetition loops |
| repeat_penalty | 1.5 | High penalty aggressively suppresses repeating tokens, critical for coherent output at 2-bit |
| top_k | 1 | Greedy decoding: always pick the highest-probability token |
| top_p | 1.0 | Disabled when temp=0 (no effect with greedy decoding) |
| context | 2048–4096 | Higher contexts increase RAM usage significantly |

Why temp=0 with high repeat penalty? At 2-bit quantization, the probability distribution over tokens is noisier than the original model. Non-zero temperature amplifies this noise, causing the model to sample low-confidence tokens that trigger self-correction loops. Setting temperature 0 forces greedy decoding — always picking the most likely token — which keeps output on the model's strongest signal. The high repeat_penalty (1.5) prevents the degenerate case where greedy decoding gets stuck in a loop, penalizing any token the model has already emitted.
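The interaction between greedy decoding and the repeat penalty can be sketched in a few lines. This is a simplified illustration, not the sampler's actual code: it assumes the common penalty rule where a positive logit is divided by the penalty and a negative logit is multiplied by it, and it ignores the look-back window real samplers apply. The function names are hypothetical.

```python
def apply_repeat_penalty(logits, previous_tokens, penalty=1.5):
    """Return a copy of logits with already-emitted tokens penalized."""
    adjusted = list(logits)
    for tok in set(previous_tokens):
        # Positive logits shrink; negative logits move further from zero.
        if adjusted[tok] > 0:
            adjusted[tok] /= penalty
        else:
            adjusted[tok] *= penalty
    return adjusted


def greedy_pick(logits, previous_tokens, penalty=1.5):
    """One greedy decoding step: argmax over the penalized logits."""
    adjusted = apply_repeat_penalty(logits, previous_tokens, penalty)
    return max(range(len(adjusted)), key=adjusted.__getitem__)


logits = [2.0, 1.9, -1.0]
print(greedy_pick(logits, previous_tokens=[]))   # token 0 wins on raw logits
print(greedy_pick(logits, previous_tokens=[0]))  # token 0 falls to ~1.33, token 1 wins
```

With a penalty of 1.5, a token that was just emitted needs a logit at least 1.5 times larger than the runner-up to be picked again, which is what breaks a greedy loop.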


How It Works

Standard quantizers use round-to-nearest: for each weight block, compute a scale and round. This quant instead uses HPC beam search with triality-enhanced belief propagation, a fundamentally different approach.
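To make the contrast concrete, here is a toy comparison between a single round-to-nearest scale and a small per-block candidate search. It is not HPC: there is no belief propagation, no inter-block coupling, and no beam search. It also assumes a simplified symmetric 4-bit format (integer levels in [-8, 7] times one scale) rather than the exact GGUF encodings. All function names are illustrative.

```python
import math
import random


def quantize_block(weights, scale):
    """Symmetric 4-bit quantization: integer levels in [-8, 7] times one scale."""
    if scale == 0.0:
        return [0.0] * len(weights)
    return [max(-8, min(7, round(w / scale))) * scale for w in weights]


def rmse(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def rtn_scale(weights):
    """Round-to-nearest baseline: map the largest magnitude onto level 7."""
    return max(abs(w) for w in weights) / 7.0


def search_scale(weights, num_candidates=16, spread=0.25):
    """Try scales around the baseline and keep the one with the lowest RMSE."""
    base = rtn_scale(weights)
    best_scale, best_err = base, rmse(weights, quantize_block(weights, base))
    for i in range(num_candidates):
        factor = 1.0 - spread + 2.0 * spread * i / (num_candidates - 1)
        candidate = base * factor
        err = rmse(weights, quantize_block(weights, candidate))
        if err < best_err:
            best_scale, best_err = candidate, err
    return best_scale, best_err


random.seed(0)
block = [random.gauss(0.0, 0.02) for _ in range(32)]
baseline_err = rmse(block, quantize_block(block, rtn_scale(block)))
_, searched_err = search_scale(block)
print(f"round-to-nearest RMSE: {baseline_err:.6f}")
print(f"candidate-search RMSE: {searched_err:.6f}")
```

Because the search starts from the baseline scale, it can never do worse than round-to-nearest on a block. Any gain comes from accepting some clipping at the extremes in exchange for finer steps near zero, where most weights sit.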

The Pipeline

┌─────────────────────────────────────────────────────────────┐
│  For each weight tensor:                                     │
│                                                              │
│  1. Compute greedy reference scales per block                │
│  2. Generate candidate grid (16×16 = 256 scale variants)     │
│  3. Encode candidates as Z₆ complex amplitudes               │
│  4. Build constraint graph (inter-block coupling)            │
│  5. Run belief propagation in 3 simultaneous views:          │
│       Edge × Vertex × Diagonal (triality)                    │
│  6. Combine via geometric mean:                              │
│       marginal[v] = ∛(edge × vertex × diagonal)             │
│  7. 24-beam Hensel search using combined marginals           │
│       (6,144 extensions evaluated per block)                 │
│  8. Pack into GGUF blocks with optimal scales                │
└─────────────────────────────────────────────────────────────┘
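Step 6 of the pipeline is simple enough to sketch on its own. The snippet below combines three per-candidate belief vectors with a geometric mean and then ranks candidates for a beam. The belief values are invented for illustration, the renormalization step is an assumption, and the function names are hypothetical; how HPC actually produces the three views is not shown here.

```python
def combine_marginals(edge, vertex, diagonal):
    """Cube root of the elementwise product, renormalized to sum to 1."""
    combined = [
        (e * v * d) ** (1.0 / 3.0) for e, v, d in zip(edge, vertex, diagonal)
    ]
    total = sum(combined)
    if total == 0.0:
        raise ValueError("every candidate has zero belief in at least one view")
    return [c / total for c in combined]


def top_candidates(marginal, beam_width):
    """Indices of the beam_width highest-scoring candidates, best first."""
    ranked = sorted(range(len(marginal)), key=lambda i: marginal[i], reverse=True)
    return ranked[:beam_width]


# Hypothetical beliefs over 4 scale candidates from the three views.
edge = [0.1, 0.6, 0.2, 0.1]
vertex = [0.2, 0.5, 0.2, 0.1]
diagonal = [0.0, 0.4, 0.5, 0.1]

marginal = combine_marginals(edge, vertex, diagonal)
print(top_candidates(marginal, beam_width=2))
```

A geometric mean acts as a veto: candidate 0 scores zero in the diagonal view, so its combined marginal is zero regardless of the other two. The figure in step 7 is also consistent with 24 beams each extended by 256 candidates, since 24 × 256 = 6,144.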

Why Attention Gets Q4_0

Quantization noise in attention projections cascades through softmax(Q·K^T/√d)·V. A single bad scale in a Q block shifts dot products enough to promote wrong tokens — manifesting as:

  • Korean/Arabic character injection
  • Word substitutions
  • Self-correction loops

Promoting Q/K/V/O to Q4_0 (16 levels vs 4) eliminates these artifacts at a cost of only ~1.5 GB.

RMSE Quality (v2.1 — 24-beam, 256-candidate)

| Metric | v2.0 (12-beam, 100-cand) | v2.1 (24-beam, 256-cand) |
|---|---|---|
| Q4_0·HPC attention RMSE | 3.0e-03 | 1.2–1.5e-03 |
| Q2_K·HPC FFN RMSE | 3.0e-03 | 1.5–1.8e-03 |
| RMSE spread | ±0.5e-03 | ±0.3e-03 |
| e-02 outliers | 0 | 0 |
| MSE reduction vs v2.0 | | ~4× lower |

The 2× RMSE improvement (4× MSE) comes from denser scale sampling (0.022 grid gap vs 0.035) and wider beam survival (24 vs 12 joint configurations). Every tensor in the model — attention and FFN — achieves RMSE in the 1.2–1.8e-03 range with no outliers. This is reconstruction fidelity typically associated with Q4_K, achieved at Q2_K file size.
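The link between the RMSE and MSE figures is just squaring. A short sketch, using the table's v2.0 value and the midpoint of the v2.1 range (the function names are illustrative):

```python
import math


def rmse(original, reconstructed):
    """Root-mean-square error between two equal-length weight lists."""
    n = len(original)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / n)


def mse_ratio(rmse_old, rmse_new):
    """MSE is RMSE squared, so the MSE ratio is the squared RMSE ratio."""
    return (rmse_old / rmse_new) ** 2


# v2.0 at 3.0e-03 versus v2.1 at roughly 1.5e-03.
print(mse_ratio(3.0e-03, 1.5e-03))  # 2x lower RMSE means 4x lower MSE
```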


Reasoning Verification

All tests run at --temp 0 --repeat-penalty 1.15 on a single RTX 3060 12GB. Zero cherry-picking — every result shown is from the first attempt.

Combinatorial Logic

| Test | Result |
|---|---|
| 25 Horses Problem: "Find the 3 fastest horses with a 5-lane track, no stopwatch. Minimum races?" | ✅ Correct answer (7) with complete elimination proof. Correctly identified all 5 candidates {W2, W3, W1_2nd, W1_3rd, W2_2nd}. Self-verified by checking every possible missed candidate. |
| Arto Inkala's "World's Hardest Sudoku": AC-3 constraint propagation + MRV backtracking | ✅ Solved on first attempt. Correct implementation of arc consistency with minimum remaining values heuristic. |

Algorithm Implementation

| Test | Difficulty | Result |
|---|---|---|
| LRU Cache (Python): O(1) get/put with doubly-linked list + hashmap | Medium | ✅ Perfect architecture, correct sentinel nodes, correct eviction. 9/10 |
| Interval Merge with Query (C): sort, merge touching intervals, point query | Hard | ✅ Correct algorithm, touching boundary logic (<= not <), all edge cases. 9/10 |
| Compiler Constant Folding (C): AST optimization pass with partial evaluation | Expert | ✅ Correct post-order traversal, correct partial evaluation, div-by-zero handling, ((2+3)*x) + (10/2 - 3*1) → (5*x) + 2. 9/10 |

Type Theory

| Test | Result |
|---|---|
| Hindley-Milner Type Inference (Python): complete implementation from scratch | ✅ Correct Robinson's unification with occurs check. Correct substitution composition. Correct let-polymorphism: let id = λx. x in id id infers (T8 -> T8) while lambda args remain monomorphic. 8/10 |

Self-Diagnosis

| Test | Result |
|---|---|
| Factoring Engine Bug Analysis: given 2500 lines of unfamiliar C code, identify bugs | ✅ Identified 3 non-obvious bugs: (1) destructive accumulator reset wiping cross-base period convergence, (2) successful overshoot value not committed to global state, (3) noise lock logic gap. All diagnoses correct. Fixes compiled and ran. |

What This Means

Standard Q2 quantization produces models that can barely maintain coherent conversation. This Q2 quant:

  • Solves combinatorial proofs requiring 7-step elimination chains
  • Implements graduate-level programming language theory from scratch
  • Correctly analyzes and debugs unfamiliar production C codebases
  • Achieves Q5-equivalent reasoning at Q2 file size

The quantization noise is still there — the RMSE proves it — but the D₆ vesica gate has rotated it into dimensions the transformer doesn't use for reasoning.


Gemma 4 31B Architecture

Unlike the 26B MoE variant, the 31B is a dense transformer — every parameter is active on every token. This means:

| Property | Value |
|---|---|
| Total parameters | 31B |
| Active parameters | 31B (all, every token) |
| Hidden size | 5376 |
| Layers | 60 |
| Attention heads | 32 |
| KV heads (sliding) | 16 |
| KV heads (full/global) | 4 |
| Head dim (sliding) | 256 |
| Head dim (full/global) | 512 |
| FFN intermediate | 21504 |
| Sliding window | 1024 tokens |
| Full attention | Every 6th layer |
| Max context | 262,144 tokens |
| Vocab size | 262,144 |
| Activation | GeLU (tanh approx) |
| Logit softcapping | 30.0 |
| Weight tying | Yes (embed = output) |

Attention Pattern

The model interleaves sliding window attention (1024 tokens) with full attention across its 60 layers, five sliding layers followed by one full layer:

Layers  0–4:   sliding  (local context, 1024 window)
Layer   5:     full     (global context, all tokens)
Layers  6–10:  sliding
Layer   11:    full
  ...repeating every 6 layers...
Layer   59:    full

This hybrid pattern gives the model local efficiency with periodic global context integration — 50 sliding layers + 10 full attention layers.
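The layer schedule can be reproduced with a one-line rule. This sketch assumes zero-based layer indices, as in the listing above, and uses an illustrative `layer_type` helper; it is a restatement of the pattern, not code from the model's configuration.

```python
NUM_LAYERS = 60
FULL_ATTENTION_PERIOD = 6  # every 6th layer is full attention


def layer_type(index):
    """Return 'full' or 'sliding' for a zero-based layer index."""
    return "full" if (index + 1) % FULL_ATTENTION_PERIOD == 0 else "sliding"


pattern = [layer_type(i) for i in range(NUM_LAYERS)]
full_layers = [i for i, kind in enumerate(pattern) if kind == "full"]

print(full_layers)               # indices of the full-attention layers
print(pattern.count("sliding"))  # 50
print(pattern.count("full"))     # 10
```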


Known Limitations

  1. Safety alignment degradation — extreme quantization (< 3 BPW) can weaken RLHF guardrails. The model may comply with requests the original would refuse. Evaluate safety properties before deployment.

  2. Ollama compatibility — Ollama's Gemma 4 support is unreliable as of April 2026. Use llama.cpp or LM Studio.

  3. Higher memory at runtime than 26B MoE — despite the smaller GGUF, the dense architecture activates all 31B params per token vs the 26B MoE's 4B active. KV cache and compute overhead are higher.

  4. Long-context (8K+) stress testing — verification suite covers < 5K tokens. Long-context coherence is expected to hold but has not been formally benchmarked.
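For a rough sense of how the KV cache grows with context, the architecture numbers in this card can be plugged into a back-of-the-envelope estimate. This is a sketch under stated assumptions: a 16-bit cache (2 bytes per value), sliding layers that store at most the 1024-token window, and a single sequence. Actual llama.cpp allocations can differ, so treat the output as an order-of-magnitude guide. The function name is illustrative.

```python
def kv_cache_bytes(context_tokens, bytes_per_value=2):
    """Estimate K+V cache size for one sequence (assumes a 16-bit cache)."""
    sliding_layers, full_layers = 50, 10
    sliding_kv_heads, sliding_head_dim, window = 16, 256, 1024
    full_kv_heads, full_head_dim = 4, 512

    # Assumption: sliding layers keep at most `window` tokens.
    sliding_tokens = min(context_tokens, window)
    sliding = sliding_layers * 2 * sliding_kv_heads * sliding_head_dim * sliding_tokens
    full = full_layers * 2 * full_kv_heads * full_head_dim * context_tokens
    return (sliding + full) * bytes_per_value


for ctx in (2048, 4096, 32768):
    print(f"{ctx:>6} tokens: {kv_cache_bytes(ctx) / 2**20:.0f} MiB")
```

Under these assumptions the sliding layers contribute a fixed 800 MiB once the context exceeds the window, while the ten full-attention layers add about 80 MiB per 1024 tokens, so only the full layers grow with context.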


Technical Details

Q2_K Block Layout (84 bytes / 256 weights)

Offset  Size  Field
  0      16   scales[16]    4-bit scale | 4-bit min per sub-block
 16      64   qs[64]        packed 2-bit quants (4 per byte)
 80       2   d             fp16 super-block scale
 82       2   dmin          fp16 super-block min scale

Q4_0 Block Layout (18 bytes / 32 weights)

Offset  Size  Field
  0       2   d             fp16 block scale
  2      16   qs[16]        packed 4-bit quants (2 per byte)
                             nibble order: qs[j] = w[j] | (w[j+16] << 4)
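The bits-per-weight figures quoted in this card follow directly from the block sizes, and the Q4_0 nibble order can be checked with a round trip. A minimal sketch, with illustrative function names; it handles only the packed 4-bit levels, not the fp16 scale.

```python
def bits_per_weight(block_bytes, weights_per_block):
    return block_bytes * 8 / weights_per_block


def pack_q4_0(levels):
    """Pack 32 unsigned 4-bit levels into 16 bytes: qs[j] = w[j] | (w[j+16] << 4)."""
    if len(levels) != 32 or any(not 0 <= v <= 15 for v in levels):
        raise ValueError("expected 32 values in 0..15")
    return bytes(levels[j] | (levels[j + 16] << 4) for j in range(16))


def unpack_q4_0(qs):
    """Inverse of pack_q4_0: low nibbles are w[0..15], high nibbles are w[16..31]."""
    low = [b & 0x0F for b in qs]
    high = [b >> 4 for b in qs]
    return low + high


print(bits_per_weight(84, 256))  # Q2_K: 2.625
print(bits_per_weight(18, 32))   # Q4_0: 4.5

levels = [i % 16 for i in range(32)]
assert unpack_q4_0(pack_q4_0(levels)) == levels
```

The 2.625 result is the exact value behind the 2.63 bpw quoted for Q2_K.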

Gemma 4 31B Dense Handling

| Challenge | Solution |
|---|---|
| 60-layer deep network | Layer-by-layer streaming quantization |
| Hybrid sliding/full attention | Uniform Q4_0·HPC for all attention types |
| Tied embeddings (embed = lm_head) | Shared F32 tensor, single copy |
| 262K vocab embeddings | Preserved at F32 (routing-critical) |
| Sliding window (1024) | Metadata passthrough |
| Logit softcapping (30.0) | Metadata passthrough |

License

This quantization inherits the Gemma license from the base model.

HPC is MIT licensed.

Credits

Quantized with HPC
