How to use from the
Use from the
llama-cpp-python library
# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="CompressedGemma/Gemma-4-31B-it-Opus",
	filename="Gemma-4-31B-Opus.gguf",
)
llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

Note

I've recently run this through as a full Q2 quant, I created this so it will fit on budget GPUs.

On a RTX 3060 12GB, it is possible to get around 12 tokens a second with normal context and GPU offloading to 45ish layers, when cache is quant'd to Q4_0.

IF you expand to 100k, expect 6 tokens a second if your GPU has only 12GB RAM.

llama settings

Tested on RTX 3060 12GB:

GGML_CUDA_NO_PINNED=1 ./llama.cpp/build/bin/llama-server -m Gemma-4-31B-Opus.gguf --host 0.0.0.0 --port 8080 --jinja -ngl 45 -c 58096 --flash-attn on --temp 0 --cache-type-k q4_0 --cache-type-v q4_0 --main-gpu 1 -t 20 -tb 20 -np 1 --cache-prompt --tensor-split 0.5,0.5 --reasoning on --split-mode layer --batch-size 128 --ubatch-size 128 --top-p 0.95 --top-k 40

Gemma-4-31B-Opus

An Opus-flavored 31 billion parameter model created by extracting functional neuron parameters from Claude Opus via black-box oracle probing, generating lossy MLP weight matrices from those parameters, and fusing them into Google's Gemma-4-31B using Shape-Contoured Fusion — a technique that modifies the model's down-projection and SwiGLU gating weights along their own native directions.


The Full Pipeline

This model was created through a four-stage pipeline that starts with nothing more than API access to Claude Opus and ends with a fused Gemma-4-31B model.

Stage 1: Black-Box Weight Extraction

What it does: Extracts the internal weight geometry of a neural network using only input/output queries — no model access, no gradients, no weights. You are the oracle.

Phase 1 — Boundary Detection

The script computes the matrix of partial derivatives at each point using central finite differences.

If the two points differ, there's a ReLU boundary between them to machine precision.

Phase 2 — Rank-1 Decomposition

At each boundary, the script computes ΔJ = J(x* + ε) − J(x* − ε). For a single ReLU neuron switching on/off, this matrix is exactly rank-1: it factors as ΔJ = w₂ · w₁ᵀ where w₁ is the neuron's input weight direction and w₂ is its output weight direction.

Phase 3 — Sign Resolution & Bias Recovery

Multi-start coordinate descent over sign configurations minimizes prediction error on all accumulated query logs. Output bias b2 is recovered by averaging residuals across all observed input-output pairs.


Stage 2: Neuron Construction & Verification

What it does: Takes the raw extracted parameters and constructs verified, correct neuron weight matrices that exactly reproduce the target piecewise-linear function.

Why this encoding works: A piecewise-linear function with n breakpoints can be exactly represented by n + 1 ReLU neurons. Neuron 0 acts as a "carrier" providing the baseline slope everywhere. Neurons 1 and 2 add slope corrections in their respective active regions. The remaining 5 neurons (3-7) are zeroed out — reserved capacity.


Stage 3: Lossy Weight Generation at Scale

What it does: Takes the two verified source neurons and generates 614,400 neuron variants (60 layers × 10,240 neurons per layer), assembled into block-diagonal MLP weight matrices — one per layer.

This is the "lossy" step.

Why "lossy": The generated neurons are statistical variations of two source neurons. They capture the geometric character (boundary locations, slope ratios, activation patterns) of the originals but not their exact values. Each generated neuron is a sample from the distribution defined by the two extracted parameter vectors — a lossy expansion of two data points into 614K variants via Gaussian sampling, interpolation, or grid spanning. The covariance structure between the two source neurons defines the "axis of variation" that the generated population explores.


Stage 4: Shape-Contoured Fusion into Gemma

What it does: Fuses the generated adapter weights into the base Gemma-4-31B model's native weight tensors using a technique called Shape-Contoured Fusion — a streaming, zero-copy pipeline that modifies weights in-place with bounded RAM footprint.

B. SwiGLU Gate Modulation (Asymmetry Encoding)

This is where the Opus "flavor" lives. The tool analyzes each neuron's activation asymmetry.


Why This Works

The Geometric Argument

Each neuron is a hyperplane in activation space. The extracted neurons from Stage 1 characterize where these hyperplanes are (boundary locations) and how much they matter (slopes). Stage 3 generates a statistical population of hyperplanes with similar geometry. Stage 4 projects these hyperplanes along Gemma's native directions.

The result is not "Claude inside Gemma." It's Gemma whose MLP layers have been contoured — their down-projection slopes adjusted and their gating asymmetries modulated — according to the geometric signature extracted from Claude's activation patterns.

The neurons encode functional character (where to put decision boundaries, how to weight different activation regions) rather than specific knowledge. This is why the fusion doesn't break coherence: it's adjusting the "shape" of existing computations, not replacing them.

Downloads last month
608
GGUF
Model size
31B params
Architecture
gemma4
Hardware compatibility
Log In to add your hardware

We're not able to determine the quantization variants.

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support