# Gemma-4 E2B with grouped expert routing (K=96, d=0.50)
This model is a sparsified variant of google/gemma-4-e2b-it in which each MLP's
FFN dimension is partitioned into 96 groups via k-means on activation
profiles. At inference, the top 48 of the 96 groups fire per token (50% density),
so entire groups of neurons can be skipped, along with the corresponding rows
of `up_proj`/`gate_proj` and columns of `down_proj`, which never need to be read
from memory.
Group sizes (D_FFN / 96): 64 in early layers (D_FFN=6144) and 128 in late layers (D_FFN=12288). A 64-element group in bf16 occupies exactly 128 B, one L2 cache line on NVIDIA GPUs, so routing decisions are memory-bandwidth-aligned.
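The alignment arithmetic above can be checked directly. A minimal sketch; the constants for bf16 width and NVIDIA L2 line size are standard values, not taken from this repo:

```python
# Check the claimed cache-line alignment: a 64-neuron group in bf16
# is exactly one 128 B NVIDIA L2 cache line.
BYTES_PER_BF16 = 2    # bf16 is a 16-bit float
L2_LINE_BYTES = 128   # L2 cache line size on current NVIDIA GPUs
K = 96                # number of groups per layer

for d_ffn in (6144, 12288):
    group_size = d_ffn // K
    group_bytes = group_size * BYTES_PER_BF16
    print(f"D_FFN={d_ffn}: {group_size} neurons/group, "
          f"{group_bytes} B = {group_bytes // L2_LINE_BYTES} L2 line(s)")
```

For the 12288-wide layers each 128-neuron group spans two full cache lines, so reads stay line-aligned in both regimes.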
## Training recipe
- Train `Sw_gate_only_50` baseline: 5000 steps gate-only, density=0.50, lr=1e-4, tau annealed 1.0→0.01, int4 QAT, on gemma4_e2b_it_bulk_50k.jsonl.
- Cluster activations: run 4 sequences × 2048 tokens through the baseline; k-means at K=96 per layer on the |gate*up| activation profile.
- Polish: install `GroupedMaskedMLP` (K=96, K_active=48), add LoRA r=128 on
  `up_proj`/`down_proj`, train 1000 steps of CE on gemma4_e2b_it_final_50k.jsonl.
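The routed module installed in the polish step can be sketched as below. This is a hypothetical reconstruction under two assumptions not confirmed by the recipe: a Gemma-style gated MLP (gate/up/down projections), and a per-token group score taken as the mean |gate*up| over each group's neurons. It simulates the sparsity with a dense mask; a production kernel would skip the inactive rows and columns outright.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedMaskedMLP(nn.Module):
    """Sketch of grouped top-k routing over contiguous FFN neuron groups.

    Illustrative only: the trained router in this repo may score groups
    differently, and dense masking here only emulates the memory savings.
    """
    def __init__(self, d_model, d_ffn, k=96, k_active=48):
        super().__init__()
        assert d_ffn % k == 0, "FFN width must divide evenly into groups"
        self.k, self.k_active, self.group = k, k_active, d_ffn // k
        self.gate_proj = nn.Linear(d_model, d_ffn, bias=False)
        self.up_proj = nn.Linear(d_model, d_ffn, bias=False)
        self.down_proj = nn.Linear(d_ffn, d_model, bias=False)

    def forward(self, x):
        h = F.gelu(self.gate_proj(x)) * self.up_proj(x)        # (..., d_ffn)
        # Per-group score: mean |gate*up| over the group's neurons.
        scores = h.abs().unflatten(-1, (self.k, self.group)).mean(-1)  # (..., k)
        top = scores.topk(self.k_active, dim=-1).indices
        mask = torch.zeros_like(scores).scatter_(-1, top, 1.0)
        h = h * mask.repeat_interleave(self.group, dim=-1)     # zero inactive groups
        return self.down_proj(h)
```

With K=96 and K_active=48, exactly half of each layer's FFN groups contribute to every token, matching the d=0.50 density in the model name.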
## Results (eval on gemma4_e2b_it_final_50k.jsonl)
| Metric | Value | vs Base | vs polish_50_per_neuron |
|---|---|---|---|
| PPL | 2.95 | 1.31× | 0.85× |
| MMLU | 30.6 | +0.2 | +1.0 |
| HellaSwag | 54.0 | +2.6 | -1.8 |
| ARC-Challenge | 35.2 | -2.2 | +1.6 |
| ARC-Easy | 40.6 | +1.2 | -0.6 |
| BoolQ | 74.2 | -2.4 | +3.2 |
| 5-task avg | 46.9 | -0.1 | +0.7 |
| World-knowledge probe (172q) | 58.1 | -2.9 | +5.2 |
## Usage

```bash
git lfs install
git clone https://huggingface.co/Cactus-Compute/gemma4-e2b-grouped-k96
cd gemma4-e2b-grouped-k96
pip install torch transformers
python inference_k96.py --prompt "The capital of France is"
```
The inference script loads the base Gemma-4 weights from Hugging Face, installs
the grouped routing wrappers (loading cluster assignments from `groups/`), then
applies int4 QAT + LoRA + the trained state dict. `verify_grouped_routing()`
asserts that every layer is genuinely a `GroupedMaskedMLP` with K=96 before generation.
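The verification step might look like the following. This is a hypothetical reimplementation of the check described above (the real helper lives in `inference_k96.py`), assuming the usual `model.model.layers[i].mlp` attribute layout of Hugging Face decoder models:

```python
def verify_grouped_routing(model, k=96):
    """Assert every decoder layer's MLP was actually replaced by a
    GroupedMaskedMLP with the expected number of groups K.

    Sketch only: attribute paths follow the standard transformers
    decoder layout and may differ in the actual script.
    """
    for i, layer in enumerate(model.model.layers):
        name = type(layer.mlp).__name__
        assert name == "GroupedMaskedMLP", f"layer {i}: got {name}"
        assert layer.mlp.k == k, f"layer {i}: K={layer.mlp.k}, expected {k}"
    return len(model.model.layers)
```

Checking the concrete class name, rather than just loading the state dict, catches the common failure mode where sparse weights load cleanly onto an unmodified dense model and silently run at full density.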
## Citation
Internal research (Anthropic Claude + Noah Cylich, Cactus Compute), 2026-04.