Gemma 4 A4B 98-Expert v7-coderx — code-maximal prune (~20.8B)

Eval complete (Q6_K / llama.cpp, greedy, same host). Every cell in the scoreboard is read from summary.json under the cohort-pinned greedy recipe (temperature 0.0, top_p 1.0, top_k 0). The 128e and v6-coder columns are the matching same-host Q6_K runs. The GGUF and NVFP4A16 formats are deployment targets and are not separately benchmarked (cohort policy) — the Q6_K column is representative.

Headline — the strongest coder in the cohort. v7-coderx spends its whole prune budget on code: LCB-medium-55 98.18% and LCB-medium-100 99.0% — the highest of any Gemma-4 prune to date and +1.8pp / +2.0pp past the unpruned 128e (96.36 / 97.0 on the same Q6_K run) — plus MultiPL-E 90.0%, HumanEval+ 92.68%, IFEval 95%. The trade is graduate science: GPQA-diamond sits at 48.48% (this recipe carries no targeted_gpqa term). If you need the science back without giving up the code profile, use the sibling v7-coder (GPQA 70.71%, LCB-55 96.36%).

A research checkpoint that prunes the unpruned Gemma 4 26B-A4B-it (128 experts/layer, top-8 + shared, 30 layers) down to 98 experts per layer. The fs2440 drop map concentrates protection on generic-code (3×) and LiveCodeBench-medium (2×) on a [24,40] per-layer floor, with no science or multilingual targeting — the code-maximal member of the v7-coder cohort. Same 98e shape, same router, same attention, same norms as the rest of the cohort, plus the mandatory shared-FFN α=1.2 upweight all coder variants carry.

Quantized formats

Format Repo Notes
bf16 (this repo) …-v7-coderx-it 9 shards. fs2440 drop map + shared α=1.2.
GGUF (llama.cpp / ollama) …-v7-coderx-it-GGUF Bartowski tier sweep (imatrix K-quants) + ContribDynamic CD-* per-layer quants + F16 + imatrix.dat + mmproj.
NVFP4A16 (vLLM) …-v7-coderx-NVFP4A16 Native vLLM 4-bit + FP8 block scales, via NVIDIA modelopt main (0.45.0.dev, _QuantFusedExperts). ~13 GB. Deployment format — not separately benchmarked.
Ollama mannix/gemma4-98e-v7-coderx ollama pull mannix/gemma4-98e-v7-coderx:<tier> (:latest = Q4_K_M; :vision-<tier> adds the SigLIP vision tower).

Benchmarks

Q6_K · llama.cpp · greedy (temperature 0.0, top_p 1.0, top_k 0), all four models scored on the same host from summary.json. Row-max in bold. This repo = v7-coderx.

Benchmark 128e (unpruned) v6-coder v7-coder v7-coderx
GPQA-diamond (198q) 67.17 61.11 70.71 48.48
AIME (30q) 73.33 56.67 76.67 70.00
MATH500 (100q) 92.00 89.00 92.00 89.00
GSM8K (100q) 89.00 88.00 93.00 91.00
ARC-Challenge (full) 96.50 95.39 94.80 94.28
IFEval (100q, strict) 97.00 92.00 95.00 95.00
HumanEval (164) 97.56 98.17 98.78 95.73
HumanEval+ (164) 92.07 92.68 92.68 92.68
LCB-medium-55 96.36 92.73 96.36 98.18
LCB-medium-100 97.00 94.00 97.00 99.00
MultiPL-E (100) 90.00 89.00 88.67 90.00

Metrics: GPQA & GSM8K = exact_match flexible-extract · MATH500 = math_verify · ARC & AIME = exact_match · IFEval = prompt_level_strict_acc · HumanEval/+ = pass@1 chat-extract · LCB-55/100 & MultiPL-E = pass@1. 128e uses the lcb_medium_55/100 templates; the prunes use lcb_medium_*_v4 (corrected harness, equivalent task).

Every code/instruction axis is at the top of the cohort; the budget is paid almost entirely on graduate science, which carries no protection term in this recipe.

Coder-field comparison — v7-coderx vs Qwen2.5-Coder-14B / 7B + Qwen3.5-9B (Q6_K, llama.cpp, greedy)

The 9 canonical benches + MultiPL-E-100, all on the identical llama.cpp Q6_K / greedy recipe (reasoning models served with --reasoning-format deepseek --reasoning-budget 12288 --parallel 2). Architectures differ — this is a same-harness comparison, not a same-class one:

  • v7-coderx — Gemma-4 26B-A4B MoE pruned to 98 experts (~20.8B total, ~A4B active), reasoning.
  • Qwen2.5-Coder-14B / 7B-Instruct — dense, non-reasoning code specialists (bartowski Q6_K).
  • Qwen3.5-9B — dense reasoning model (bartowski Q6_K).
Bench (n) v7-coderx Q6_K Qwen2.5-Coder-14B Qwen2.5-Coder-7B Qwen3.5-9B
ARC-Challenge-chat (1172) 94.28% 90.53% 85.58% 96.76%
GPQA Diamond flex (198) 48.48% 34.85% 26.26% 73.74%
GSM8K-100 flex 91.00% 89.00% 80.00% 79.00%
MATH-500-100 math_verify 89.00% 62.00% 66.00% 59.00%
AIME 2024 (30) 70.00% 10.00% 10.00% 56.67%
IFEval-100 (prompt_strict) 95.00% 68.00% 54.00% 93.00%
HumanEval-164 chat 95.73% 90.85% 87.20% 89.02%
HumanEval+-164 chat 92.68% 84.76% † 83.54% 80.49%
LCB-medium-55 v4 98.18% 18.18% † 12.73% 58.18%
MultiPL-E-100 (macro) 90.00% 84.67% 80.67% 80.33%

† Qwen2.5-Coder-14B HumanEval+ / LCB-medium-55 are the same-stack GGUF HE+ sweep numbers (not re-run in this chain). All Qwen cells are the same-host reference runs used on the v6-coder card — Qwen is a fixed reference, so the columns are identical across the cohort; only the Gemma column changes.

Note on Qwen3.5-9B. Qwen3.5-9B is a verbose, slow thinking model: it emits long <think> reasoning chains (often ≥1900 tokens even on a trivial GSM8K question), so it runs several× slower per question than the non-reasoning Qwen2.5-Coder models — well beyond what its 9B size would suggest. Its GSM8K / MATH-500 / GPQA cells were re-run after a harness fix (under batched, reasoning-parsed serving the verbose thinking intermittently left the final answer inside the reasoning block, mis-scored as empty content).

Answer-length analysis (anti-rumination)

The pruned reasoning model thinks with a bounded thinking_token_budget=12288; the question is whether that length is productive (long thinking that PASSes) or rumination (long thinking that fails). Per-problem completion length is measured from omk_eval token_stats (characters from the raw completion; tokens via the 128e tokenizer) on the real-n benches, against 128e and v6-coder on the same problems, same greedy Q6_K / llama.cpp stack.

Per-problem completion length — characters (p50 / p90 / max):

Bench (n) 128e v6-coder v7-coderx
GPQA Diamond (198) 2571/16136/27811 2582/16100/25243 2627/19582/40946
AIME 2024 (30) 1963/7748/8680 2141/7469/9433 2095/8987/12815
LCB-medium-55 3734/16430/36462 31015/36260/43278 30193/36297/41168
LCB-medium-100 2056/15467/48569 29384/35389/43633 29429/35973/41168
MultiPL-E-100 (300) 245/566/3353 245/573/2725 246/617/2933
MATH-500 (100) 1083/1873/7899 1089/2025/9236 1113/1953/8548
GSM8K (100) 294/746/25989 283/780/11378 274/779/13867
IFEval (100) 877/3755/8263 855/3489/20908 732/3210/6633
HumanEval (164) 698/1284/5354 711/1438/5954 743/1412/16967
HumanEval+ (164) 714/1461/3289 694/1390/5282 743/1359/3150
ARC-Challenge (1172) 1210/1633/6254 1221/1674/48886 1234/1720/54956

Per-problem completion length — tokens (p50 / p90 / max):

Bench (n) 128e v6-coder v7-coderx
GPQA Diamond (198) 843/8189/8189 879/8189/8189 890/8189/8189
AIME 2024 (30) 933/3994/4021 946/3993/4011 954/3997/4021
LCB-medium-55 1005/5622/16022 12818/13318/15976 12820/13163/15667
LCB-medium-100 542/5353/16022 12740/13212/15976 12735/13016/15667
MultiPL-E-100 (300) 84/171/1013 85/184/965 84/188/871
MATH-500 (100) 431/895/3377 424/863/3377 443/929/3337
GSM8K (100) 131/271/8853 129/276/4687 119/266/5128
IFEval (100) 219/850/1561 222/797/3898 177/768/1231
HumanEval (164) 226/431/1611 226/448/2084 236/440/5520
HumanEval+ (164) 226/455/996 224/437/2040 233/443/1332
ARC-Challenge (1172) 258/355/1417 259/365/16266 263/374/16276

Budget-saturation incidence — share of problems whose completion reached ≥12k tokens (at/near the thinking_token_budget=12288 cap). Saturation by itself is not rumination — a saturated output that PASSes is productive use of the budget; the pruned reasoning model saturates on nearly every LCB problem, 128e almost never does.

Bench (n) 128e v6-coder v7-coderx
LCB-medium-55 1 / 55 (1.8%) 54 / 55 (98.2%) 54 / 55 (98.2%)
LCB-medium-100 2 / 100 (2.0%) 98 / 100 (98.0%) 96 / 100 (96.0%)

Rumination — long thinking that fails to PASS. The right metric is not median length (128e looks short only because it answers easy problems fast). It is the share of the model's budget-saturated outputs that still fail — tokens burned without a correct answer:

Bench (n) 128e v6-coder v7-coderx
LCB-medium-55 — saturated-and-failed 1 / 1 (100.0%) 4 / 54 (7.4%) 1 / 54 (1.9%)
LCB-medium-100 — saturated-and-failed 2 / 2 (100.0%) 6 / 98 (6.1%) 1 / 96 (1.0%)
LCB-100 — mean completion tokens, PASS vs FAIL 1392 vs 13782 12698 vs 15051 12649 vs 13289

Key findings:

  • 128e only thinks long when it is lost. Every 128e output that reaches the budget cap is a failure (1/1 on LCB-55, 2/2 on LCB-100), and its failed problems run several× longer than its passed ones (mean 13782 vs 1392 tok on LCB-100).
  • v7-coderx's long thinking is overwhelmingly productive. It saturates on ~96% of LCB-100 problems but only 1/96 of those saturated outputs fail (1.0%); its PASS and FAIL completions are nearly the same length (mean 12649 vs 13289 tok), so failures are not driven by extra rumination. On LCB-55 it is 1/54 saturated-and-failed.
  • At or below v6-coder's rumination rate. v6-coder ran 4/54 (LCB-55) and 6/98 (LCB-100) saturated-and-failed; v7-coderx matches or improves on both.
  • Non-LCB benches stay tight. On the short-answer benches (GSM8K / MATH-500 / HE / HE+ / MultiPL-E) p50/p90 length tracks 128e and v6-coder within a few tokens — the targeted prune did not trade length for accuracy on the everyday benches.

Methodology. Per-problem lengths come from omk_eval token_stats over each bench's samples_*.jsonl / lcb_result.samples.jsonl; saturation/PASS-FAIL is computed per problem from completion_tokens + passed. MultiPL-E measures code length, not reasoning (its samples store only the final code block, no <think> trace), so it is a code-conciseness reference rather than a thinking-length signal.

At a glance

128e (base) v7-coderx v7-coder (sibling)
Total params ~26B ~20.8B ~20.8B
Active / token ~4B (top-8 + shared) ~4B ~4B
Experts / layer 128 98 (30 dropped) 98 (30 dropped)
Per-layer floor [24, 40] [24, 40]
Science targeting off targeted_gpqa 1.5×
Shared FFN α 1.0 1.2 (mlp.down_proj) 1.2
Built from 128e original (fresh prune) 128e original

Recipe

The drop map is produced by generate_drop_map_v5.py (omnimergekit) from per-expert, per-class contribution scores on the rebuilt v7 competence maps (expert_neuron_v7_code_gpqa.json — 10 classes, audited producers, multilingual category included), then applied with expert_drop.py, then the shared expert is upweighted.

1. fs2440 base recipe

target        = 98          # 30 experts/layer dropped
protect_top   = 16          # 16 highest-scoring experts/layer never dropped
alpha         = 2.0         # contribution sharpening exponent
strategy      = max         # per-expert score = MAX over classes (not mean/geomean)
normalize     = rank        # rank-normalize within each (layer, class)
breadth_bonus = 0.5         # reward experts useful across many classes (anti-overfit)
v4_floor_map  = v4_layer_floor_map_v7.json     # per-layer keep floor
v4_floor_data = expert_neuron_base_v7.json
v4_floor_clamp = [24, 40]   # floor bounded into this band per layer
outlier_mode  = median      # clamp bf16 weight-norm artifacts to layer median
outlier_wnorm_thresh = 1e4
baseline      = teacher_force_98e_p16_clean.json   # tie-break anchor

strategy=max + breadth_bonus is the load-bearing pair — it favours experts strongly useful to at least one class and broadly useful across classes, the optimizer-off-manifold lesson encoded as a recipe. The [24,40] floor is the 98e-scaled analogue of the 62e [15,25] band that won the loop-floor study, and beats [20,35] by ~3.6pp LCB-55 for a coder.

2. Calibration class weights — code only

Ten contribution classes are scored; the weights steer which specialists survive. v7-coderx zeroes every non-code targeting term:

Class v7-coderx v7-coder
generic_math 1 1
generic_logic 1 1
generic_code 3 3
generic_science 1 1
generic_creative 1 1
generic_multilingual 0 0
targeted_humaneval 0 0
targeted_humanevalplus 0 0
targeted_lcb_medium_55 2 2
targeted_gpqa 0 1.5

HE/HE+ targeting is off because both already sit at/above the un-targeted baseline; the protection budget goes to LiveCodeBench-medium, the bench where pruning hurt most on earlier variants. v7-coderx is exactly v7-coder minus the targeted_gpqa term — the two share most of their keep set, which is why HE+ and IFEval match to the point and only LCB-55 (up) and GPQA (down) move.

3. Mandatory shared-FFN α=1.2 (cohort rule)

After expert drop, router_shared_upweight.py --alpha 1.2 --target mlp.down_proj.weight upweights Gemma 4's always-on shared FFN. Every coder variant carries this; omitting it yields the "weak / ruminating" pre-shared baseline and makes cross-variant comparison unfair. A .shared_applied marker records it.


Intended use

A compact (~12–13 GB at Q4_K_M / NVFP4A16, fits a single 12–16 GB GPU) Gemma 4 checkpoint for maximal coding throughput and instruction-following — the code-extreme (x) member of the v7-coder cohort. If your workload also needs strong graduate science, use v7-coder, which trades ~1.8pp LCB-55 for ~+22pp GPQA.

Inherits Gemma 4's thinking format — serve with the reasoning parser enabled (--reasoning-parser gemma4 on vLLM; --reasoning-format deepseek --reasoning-budget 8192 on llama-server).

Limitations

A research prune, not an official Google release. Expert pruning trades breadth for size: generic_multilingual is de-weighted (0×) and graduate science (GPQA) is the explicit budget axis — at 48.48% it is well below the unpruned 128e (67.17% on the same Q6_K run). Choose v7-coder if science matters. Quality below ~Q3 / 3-bit degrades on the Gemma 4 MoE — prefer Q4_K_M or higher for production. The GGUF and NVFP4A16 formats are provided for deployment but are not separately benchmarked.

Lineage

128e → (v4 → v5 → v6-coder code line) → v7 competence-map rebuildfs2440 code floor = v7-coderx. Built and evaluated on the omnimergekit toolchain.

Downloads last month
62
Safetensors
Model size
20B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for ManniX-ITA/gemma-4-A4B-98e-v7-coderx-it

Finetuned
(107)
this model
Quantizations
2 models