Gemma 4 A4B 98-Expert v7-coder — science-augmented code prune (~20.8B)

Eval complete (Q6_K / llama.cpp, greedy, same host). Every cell in the scoreboard is read from summary.json under the cohort-pinned greedy recipe (temperature 0.0, top_p 1.0, top_k 0). The 128e and v6-coder columns are the matching same-host Q6_K runs. The GGUF and NVFP4A16 formats are deployment targets and are not separately benchmarked (cohort policy) — the Q6_K column is representative.

Headline — a coder that kept all its science. v7-coder holds v6-coder's top-tier code profile (LCB-medium-55 96.36%, LCB-medium-100 97.0%, HumanEval 98.78%, HumanEval+ 92.68%, IFEval 95%) while pulling GPQA-diamond to 70.71%+9.6pp over v6-coder and at parity with the unpruned 128e on the same Q6_K run (67.17%; GPQA-diamond carries ≈±3pp run-to-run noise at 198 questions). On this same-host run it also edges 128e on AIME (76.67 vs 73.33) and GSM8K (93 vs 89). The science is bought by a dedicated targeted_gpqa calibration class; the code axes are left untouched. If you want the same prune with code pushed even harder (and science left at baseline), see the sibling v7-coderx.

A research checkpoint that prunes the unpruned Gemma 4 26B-A4B-it (128 experts/layer, top-8 + shared, 30 layers) down to 98 experts per layer. The drop map is the fs2440 code recipe (generic-code 3× + LiveCodeBench-medium 2×, on a [24,40] per-layer floor) plus a targeted_gpqa class at weight 1.5 — a science-specialist protection term derived from a GPQA-diamond pass-trace calibration set. Same 98e shape, same router, same attention, same norms as the rest of the cohort, plus the mandatory shared-FFN α=1.2 upweight all coder variants carry.

Quantized formats

Format Repo Notes
bf16 (this repo) …-v7-coder-it 9 shards. fs2440 + targeted_gpqa 1.5 drop map + shared α=1.2.
GGUF (llama.cpp / ollama) …-v7-coder-it-GGUF Bartowski tier sweep (imatrix K-quants) + ContribDynamic CD-* per-layer quants + F16 + imatrix.dat + mmproj.
NVFP4A16 (vLLM) …-v7-coder-NVFP4A16 Native vLLM 4-bit + FP8 block scales, via NVIDIA modelopt main (0.45.0.dev, _QuantFusedExperts). ~13 GB. Deployment format — not separately benchmarked.
Ollama mannix/gemma4-98e-v7-coder ollama pull mannix/gemma4-98e-v7-coder:<tier> (:latest = Q4_K_M; :vision-<tier> adds the SigLIP vision tower).

Benchmarks

Q6_K · llama.cpp · greedy (temperature 0.0, top_p 1.0, top_k 0), all four models scored on the same host from summary.json. Row-max in bold. This repo = v7-coder.

Benchmark 128e (unpruned) v6-coder v7-coder v7-coderx
GPQA-diamond (198q) 67.17 61.11 70.71 48.48
AIME (30q) 73.33 56.67 76.67 70.00
MATH500 (100q) 92.00 89.00 92.00 89.00
GSM8K (100q) 89.00 88.00 93.00 91.00
ARC-Challenge (full) 96.50 95.39 94.80 94.28
IFEval (100q, strict) 97.00 92.00 95.00 95.00
HumanEval (164) 97.56 98.17 98.78 95.73
HumanEval+ (164) 92.07 92.68 92.68 92.68
LCB-medium-55 96.36 92.73 96.36 98.18
LCB-medium-100 97.00 94.00 97.00 99.00
MultiPL-E (100) 90.00 89.00 88.67 90.00

Metrics: GPQA & GSM8K = exact_match flexible-extract · MATH500 = math_verify · ARC & AIME = exact_match · IFEval = prompt_level_strict_acc · HumanEval/+ = pass@1 chat-extract · LCB-55/100 & MultiPL-E = pass@1. 128e uses the lcb_medium_55/100 templates; the prunes use lcb_medium_*_v4 (corrected harness, equivalent task).

The code profile is held at or above v6-coder on every axis, and the entire recipe delta lands where it was aimed — +9.6pp GPQA — with no code regression.

Coder-field comparison — v7-coder vs Qwen2.5-Coder-14B / 7B + Qwen3.5-9B (Q6_K, llama.cpp, greedy)

The 9 canonical benches + MultiPL-E-100, all on the identical llama.cpp Q6_K / greedy recipe (reasoning models served with --reasoning-format deepseek --reasoning-budget 12288 --parallel 2). Architectures differ — this is a same-harness comparison, not a same-class one:

  • v7-coder — Gemma-4 26B-A4B MoE pruned to 98 experts (~20.8B total, ~A4B active), reasoning.
  • Qwen2.5-Coder-14B / 7B-Instruct — dense, non-reasoning code specialists (bartowski Q6_K).
  • Qwen3.5-9B — dense reasoning model (bartowski Q6_K).
Bench (n) v7-coder Q6_K Qwen2.5-Coder-14B Qwen2.5-Coder-7B Qwen3.5-9B
ARC-Challenge-chat (1172) 94.80% 90.53% 85.58% 96.76%
GPQA Diamond flex (198) 70.71% 34.85% 26.26% 73.74%
GSM8K-100 flex 93.00% 89.00% 80.00% 79.00%
MATH-500-100 math_verify 92.00% 62.00% 66.00% 59.00%
AIME 2024 (30) 76.67% 10.00% 10.00% 56.67%
IFEval-100 (prompt_strict) 95.00% 68.00% 54.00% 93.00%
HumanEval-164 chat 98.78% 90.85% 87.20% 89.02%
HumanEval+-164 chat 92.68% 84.76% † 83.54% 80.49%
LCB-medium-55 v4 96.36% 18.18% † 12.73% 58.18%
MultiPL-E-100 (macro) 88.67% 84.67% 80.67% 80.33%

† Qwen2.5-Coder-14B HumanEval+ / LCB-medium-55 are the same-stack GGUF HE+ sweep numbers (not re-run in this chain). All Qwen cells are the same-host reference runs used on the v6-coder card — Qwen is a fixed reference, so the columns are identical across the cohort; only the Gemma column changes.

Note on Qwen3.5-9B. Qwen3.5-9B is a verbose, slow thinking model: it emits long <think> reasoning chains (often ≥1900 tokens even on a trivial GSM8K question), so it runs several× slower per question than the non-reasoning Qwen2.5-Coder models — well beyond what its 9B size would suggest. Its GSM8K / MATH-500 / GPQA cells were re-run after a harness fix (under batched, reasoning-parsed serving the verbose thinking intermittently left the final answer inside the reasoning block, mis-scored as empty content).

Answer-length analysis (anti-rumination)

The pruned reasoning model thinks with a bounded thinking_token_budget=12288; the question is whether that length is productive (long thinking that PASSes) or rumination (long thinking that fails). Per-problem completion length is measured from omk_eval token_stats (characters from the raw completion; tokens via the 128e tokenizer) on the real-n benches, against 128e and v6-coder on the same problems, same greedy Q6_K / llama.cpp stack.

Per-problem completion length — characters (p50 / p90 / max):

Bench (n) 128e v6-coder v7-coder
GPQA Diamond (198) 2571/16136/27811 2582/16100/25243 2608/16779/27097
AIME 2024 (30) 1963/7748/8680 2141/7469/9433 2016/6859/8272
LCB-medium-55 3734/16430/36462 31015/36260/43278 31164/39302/52930
LCB-medium-100 2056/15467/48569 29384/35389/43633 29670/36443/52930
MultiPL-E-100 (300) 245/566/3353 245/573/2725 246/641/3578
MATH-500 (100) 1083/1873/7899 1089/2025/9236 1064/1849/8921
GSM8K (100) 294/746/25989 283/780/11378 270/668/1527
IFEval (100) 877/3755/8263 855/3489/20908 775/3263/18052
HumanEval (164) 698/1284/5354 711/1438/5954 759/1426/3801
HumanEval+ (164) 714/1461/3289 694/1390/5282 745/1360/3822
ARC-Challenge (1172) 1210/1633/6254 1221/1674/48886 1222/1691/8344

Per-problem completion length — tokens (p50 / p90 / max):

Bench (n) 128e v6-coder v7-coder
GPQA Diamond (198) 843/8189/8189 879/8189/8189 853/8189/8190
AIME 2024 (30) 933/3994/4021 946/3993/4011 889/3955/4001
LCB-medium-55 1005/5622/16022 12818/13318/15976 12724/13134/15667
LCB-medium-100 542/5353/16022 12740/13212/15976 12709/13115/15976
MultiPL-E-100 (300) 84/171/1013 85/184/965 84/188/1013
MATH-500 (100) 431/895/3377 424/863/3377 407/844/3352
GSM8K (100) 131/271/8853 129/276/4687 121/247/485
IFEval (100) 219/850/1561 222/797/3898 190/741/4008
HumanEval (164) 226/431/1611 226/448/2084 239/450/1501
HumanEval+ (164) 226/455/996 224/437/2040 238/440/1382
ARC-Challenge (1172) 258/355/1417 259/365/16266 261/370/1960

Budget-saturation incidence — share of problems whose completion reached ≥12k tokens (at/near the thinking_token_budget=12288 cap). Saturation by itself is not rumination — a saturated output that PASSes is productive use of the budget; the pruned reasoning model saturates on nearly every LCB problem, 128e almost never does.

Bench (n) 128e v6-coder v7-coder
LCB-medium-55 1 / 55 (1.8%) 54 / 55 (98.2%) 54 / 55 (98.2%)
LCB-medium-100 2 / 100 (2.0%) 98 / 100 (98.0%) 96 / 98 (98.0%)

Rumination — long thinking that fails to PASS. The right metric is not median length (128e looks short only because it answers easy problems fast). It is the share of the model's budget-saturated outputs that still fail — tokens burned without a correct answer:

Bench (n) 128e v6-coder v7-coder
LCB-medium-55 — saturated-and-failed 1 / 1 (100.0%) 4 / 54 (7.4%) 2 / 54 (3.7%)
LCB-medium-100 — saturated-and-failed 2 / 2 (100.0%) 6 / 98 (6.1%) 1 / 96 (1.0%)
LCB-100 — mean completion tokens, PASS vs FAIL 1392 vs 13782 12698 vs 15051 12651 vs 12844

Key findings:

  • 128e only thinks long when it is lost. Every 128e output that reaches the budget cap is a failure (1/1 on LCB-55, 2/2 on LCB-100), and its failed problems run several× longer than its passed ones (mean 13782 vs 1392 tok on LCB-100).
  • v7-coder's long thinking is overwhelmingly productive. It saturates on ~98% of LCB-100 problems but only 1/96 of those saturated outputs fail (1.0%); its PASS and FAIL completions are nearly the same length (mean 12651 vs 12844 tok), so failures are not driven by extra rumination. On LCB-55 it is 2/54 saturated-and-failed.
  • At or below v6-coder's rumination rate. v6-coder ran 4/54 (LCB-55) and 6/98 (LCB-100) saturated-and-failed; v7-coder matches or improves on both.
  • Non-LCB benches stay tight. On the short-answer benches (GSM8K / MATH-500 / HE / HE+ / MultiPL-E) p50/p90 length tracks 128e and v6-coder within a few tokens — the targeted prune did not trade length for accuracy on the everyday benches.

Methodology. Per-problem lengths come from omk_eval token_stats over each bench's samples_*.jsonl / lcb_result.samples.jsonl; saturation/PASS-FAIL is computed per problem from completion_tokens + passed. MultiPL-E measures code length, not reasoning (its samples store only the final code block, no <think> trace), so it is a code-conciseness reference rather than a thinking-length signal.

At a glance

128e (base) v7-coder v7-coderx (sibling)
Total params ~26B ~20.8B ~20.8B
Active / token ~4B (top-8 + shared) ~4B ~4B
Experts / layer 128 98 (30 dropped) 98 (30 dropped)
Per-layer floor [24, 40] [24, 40]
Science targeting targeted_gpqa 1.5× off
Shared FFN α 1.0 1.2 (mlp.down_proj) 1.2
Built from 128e original (fresh prune) 128e original

Recipe

The drop map is produced by generate_drop_map_v5.py (omnimergekit) from per-expert, per-class contribution scores on the rebuilt v7 competence maps (expert_neuron_v7_code_gpqa.json — 10 classes, audited producers, multilingual category included), then applied with expert_drop.py, then the shared expert is upweighted.

1. fs2440 base (shared with v7-coderx)

target        = 98          # 30 experts/layer dropped
protect_top   = 16          # 16 highest-scoring experts/layer never dropped
alpha         = 2.0         # contribution sharpening exponent
strategy      = max         # per-expert score = MAX over classes (not mean/geomean)
normalize     = rank        # rank-normalize within each (layer, class)
breadth_bonus = 0.5         # reward experts useful across many classes (anti-overfit)
v4_floor_map  = v4_layer_floor_map_v7.json     # per-layer keep floor
v4_floor_data = expert_neuron_base_v7.json
v4_floor_clamp = [24, 40]   # floor bounded into this band per layer
outlier_mode  = median      # clamp bf16 weight-norm artifacts to layer median
outlier_wnorm_thresh = 1e4
baseline      = teacher_force_98e_p16_clean.json   # tie-break anchor

strategy=max + breadth_bonus is the load-bearing pair — it favours experts strongly useful to at least one class and broadly useful across classes, the optimizer-off-manifold lesson encoded as a recipe. The [24,40] floor is the 98e-scaled analogue of the 62e [15,25] band that won the loop-floor study.

2. Calibration class weights — the +gpqa term

Ten contribution classes are scored; the weights steer which specialists survive. v7-coder concentrates the targeted budget on code and adds a science term:

Class v7-coder v7-coderx
generic_math 1 1
generic_logic 1 1
generic_code 3 3
generic_science 1 1
generic_creative 1 1
generic_multilingual 0 0
targeted_humaneval 0 0
targeted_humanevalplus 0 0
targeted_lcb_medium_55 2 2
targeted_gpqa 1.5 0

A sweep over the targeted_gpqa weight × floor band located weight 1.5 at floor [24,40] as the frontier point: weight 1.0 under-recovers GPQA (≈53%, heavy rumination), weight 2.0 overshoots into a small code regression, and the [24,40] floor beats [20,35] by ~3.6pp LCB-55 for a coder. 1.5 is the knee that buys ~+22pp GPQA over the un-targeted prune (v7-coderx) for ≈1.8pp LCB-55.

3. Mandatory shared-FFN α=1.2 (cohort rule)

After expert drop, router_shared_upweight.py --alpha 1.2 --target mlp.down_proj.weight upweights Gemma 4's always-on shared FFN. Every coder variant carries this; omitting it yields the "weak / ruminating" pre-shared baseline and makes cross-variant comparison unfair. A .shared_applied marker records it.


Intended use

A compact (~12–13 GB at Q4_K_M / NVFP4A16, fits a single 12–16 GB GPU) Gemma 4 checkpoint for agentic coding and code reasoning that also needs solid graduate science — the science-augmented member of the v7-coder cohort. For maximal raw coding throughput with science at baseline, use v7-coderx.

Inherits Gemma 4's thinking format — serve with the reasoning parser enabled (--reasoning-parser gemma4 on vLLM; --reasoning-format deepseek --reasoning-budget 8192 on llama-server).

Limitations

A research prune, not an official Google release. Expert pruning trades a slice of breadth for size: generic_multilingual is de-weighted (0×), and the non-code generalist axes carry the prune budget. GPQA/AIME/GSM8K parity-with-128e is measured on small benches (30–198 questions) at greedy and carries run-to-run variance — read the scoreboard as "recovered the science gap", not a robust win over the base. Quality below ~Q3 / 3-bit degrades on the Gemma 4 MoE — prefer Q4_K_M or higher for production. The GGUF and NVFP4A16 formats are provided for deployment but are not separately benchmarked.

Lineage

128e → (v4 → v5 → v6-coder code line) → v7 competence-map rebuild → fs2440 code floor → +targeted_gpqa 1.5× = v7-coder. Built and evaluated on the omnimergekit toolchain.

Downloads last month
333
Safetensors
Model size
20B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for ManniX-ITA/gemma-4-A4B-98e-v7-coder-it

Finetuned
(104)
this model
Quantizations
2 models