Gemma 4 A4B 98-Expert v7-coderx — code-maximal prune (~20.8B)

Eval complete (Q6_K / llama.cpp, greedy, same host). Every cell in the scoreboard is read from summary.json under the cohort-pinned greedy recipe (temperature 0.0, top_p 1.0, top_k 0). The 128e, v6-coder and v7-coder columns are the matching same-host Q6_K runs. The GGUF and NVFP4A16 formats are deployment targets and are not separately benchmarked (cohort policy) — the Q6_K column is representative.

Headline — the cohort's code specialist. v7-coderx spends its whole prune budget on code and short-form reasoning. On the all-hard LiveCodeBench-77 set — the most demanding and most discriminating LCB slice — it scores 85.71%, the highest in the cohort (128e 79.22%, v7-coder 84.42%), and it leads on HumanEval+ 93.29%, HumanEval 96.95%, MATH-500 95.0% and AIME 76.67%. On the easier LCB-medium-v4 slices it sits a little below the generalists (LCB-55 92.73 / LCB-100 91.0 vs 128e's 96.36 / 97.0). This is also the loop-fixed build — it force-keeps the agentic loop-protection experts and replaces the earlier looping fs2440 prune. The trade is graduate science: GPQA-diamond sits at 51.01% (no targeted_gpqa term). For the broader LCB-medium lead and HumanEval, see the sibling v7-coder (LCB-55-v4 98.18%, HE 98.17%, LCB-hard-77 84.42%; GPQA ≈ 51 like this model).

A research checkpoint that prunes the unpruned Gemma 4 26B-A4B-it (128 experts/layer, top-8 + shared, 30 layers) down to 98 experts per layer. The code4/lcb3 drop map (generate_drop_map_v5fk) up-weights generic-code (4×) and LiveCodeBench-medium (3×) with no science or multilingual targeting — the code-maximal member of the v7-coder cohort — and force-keeps the agentic loop-protection experts (46 experts, 0 dropped) so the served model does not loop. Same 98e shape, same router, same attention, same norms as the rest of the cohort, plus the mandatory shared-FFN α=1.2 upweight all coder variants carry.

Quantized formats

Format	Repo	Notes
bf16 (this repo)	`…-v7-coderx-it`	9 shards. code4/lcb3 drop map (fkbroad) + agentic_eog force-keep + shared α=1.2.
GGUF (llama.cpp / ollama)	`…-v7-coderx-it-GGUF`	Bartowski tier sweep (imatrix K-quants) + ContribDynamic CD-* per-layer quants + F16 + `imatrix.dat` + `mmproj`.
NVFP4A16 (vLLM)	`…-v7-coderx-NVFP4A16`	Native vLLM 4-bit + FP8 block scales, via NVIDIA `modelopt` main (0.45.0.dev, `_QuantFusedExperts`). ~13 GB. Deployment format — not separately benchmarked.
Ollama	`mannix/gemma4-98e-v7-coderx`	`ollama pull mannix/gemma4-98e-v7-coderx:<tier>` (`:latest` = Q4_K_M; `:vision-<tier>` adds the SigLIP vision tower).

Benchmarks

Q6_K · llama.cpp · greedy (temperature 0.0, top_p 1.0, top_k 0), all four models scored on the same host from summary.json. Row-max in bold. This repo = v7-coderx.

Benchmark	128e (unpruned)	v6-coder	v7-coder	v7-coderx
GPQA-diamond (198q)	67.17	61.11	51.52	51.01
AIME (30q)	73.33	56.67	80.00	76.67
MATH500 (100q)	92.00	89.00	95.00	95.00
GSM8K (100q)	89.00	88.00	91.00	93.00
ARC-Challenge (full)	96.50	95.39	92.15	86.60
IFEval (100q, strict)	97.00	92.00	92.00	92.00
HumanEval (164)	97.56	98.17	98.17	96.95
HumanEval+ (164)	92.07	92.68	92.07	93.29
LCB-medium-55 v4	96.36	92.73	98.18	92.73
LCB-medium-100 v4	97.00	94.00	94.00	91.00
MultiPL-E (100)	90.00	89.00	89.67	89.00

_{Metrics: GPQA & GSM8K = exact_match flexible-extract · MATH500 = math_verify ·
ARC & AIME = exact_match · IFEval = prompt_level_strict_acc · HumanEval/+ = pass@1
chat-extract · LCB-55/100 & MultiPL-E = pass@1. 128e uses the lcb_medium_55/100
templates; the prunes use lcb_medium_*_v4 (corrected harness, equivalent task). The all-hard LCB-77 cross-model comparison is the discriminating code slice (v7-coderx 85.71%, cohort-best).}

v7-coderx leads the cohort on the hardest code slice (LCB-hard-77, below) and on HE+ / MATH-500 / AIME; the budget is paid on graduate science (GPQA) and the easier instruction / ARC axes, which carry no protection term in this recipe.

Coder-field comparison — v7-coderx vs Qwen2.5-Coder-14B / 7B + Qwen3.5-9B (Q6_K, llama.cpp, greedy)

The 9 canonical benches + MultiPL-E-100, all on the identical llama.cpp Q6_K / greedy recipe (reasoning models served with --reasoning-format deepseek --reasoning-budget 12288 --parallel 2). Architectures differ — this is a same-harness comparison, not a same-class one:

v7-coderx — Gemma-4 26B-A4B MoE pruned to 98 experts (~20.8B total, ~A4B active), reasoning.
Qwen2.5-Coder-14B / 7B-Instruct — dense, non-reasoning code specialists (bartowski Q6_K).
Qwen3.5-9B — dense reasoning model (bartowski Q6_K).

Bench (n)	v7-coderx Q6_K	Qwen2.5-Coder-14B	Qwen2.5-Coder-7B	Qwen3.5-9B
ARC-Challenge-chat (1172)	86.60%	90.53%	85.58%	96.76%
GPQA Diamond flex (198)	51.01%	34.85%	26.26%	73.74%
GSM8K-100 flex	93.00%	89.00%	80.00%	79.00%
MATH-500-100 math_verify	95.00%	62.00%	66.00%	59.00%
AIME 2024 (30)	76.67%	10.00%	10.00%	56.67%
IFEval-100 (prompt_strict)	92.00%	68.00%	54.00%	93.00%
HumanEval-164 chat	96.95%	90.85%	87.20%	89.02%
HumanEval+-164 chat	93.29%	84.76% †	83.54%	80.49%
LCB-medium-55 v4	92.73%	18.18% †	12.73%	58.18%
MultiPL-E-100 (macro)	89.00%	84.67%	80.67%	80.33%

† Qwen2.5-Coder-14B HumanEval+ / LCB-medium-55 are the same-stack GGUF HE+ sweep numbers (not re-run in this chain). All Qwen cells are the same-host reference runs used on the v6-coder card — Qwen is a fixed reference, so the columns are identical across the cohort; only the Gemma column changes.

Note on Qwen3.5-9B. Qwen3.5-9B is a verbose, slow thinking model: it emits long <think> reasoning chains (often ≥1900 tokens even on a trivial GSM8K question), so it runs several× slower per question than the non-reasoning Qwen2.5-Coder models — well beyond what its 9B size would suggest. Its GSM8K / MATH-500 / GPQA cells were re-run after a harness fix (under batched, reasoning-parsed serving the verbose thinking intermittently left the final answer inside the reasoning block, mis-scored as empty content).

LiveCodeBench across problem sets

v7-coderx's code score depends on the LiveCodeBench slice. All cells are the same greedy Q6_K / imat-Q6 llama.cpp stack (build provenance verified per run); v4-55/100 mirror the 9-bench above. The all-hard 77q set is the most demanding and the most discriminating across the cohort.

LCB problem set	128e	v7-coder	v7-coderx
LCB-medium-55 (v4, 55q)	96.36%	96.36%	92.73%
LCB-medium-100 (v4, 100q)	97.00%	97.00%	91.00%
LCB-v6-55 (55q) †	—	92.73%	98.18%
LCB-hard-77 (all-hard, 77q)	79.22%	84.42%	85.71%

† LCB-v6-55 is a small, noisier 55-problem slice (no greedy 128e baseline was run); it is included for completeness, but all-hard 77q is the reference for cross-model comparison.

Answer-length analysis (anti-rumination)

The pruned reasoning model thinks with a bounded thinking_token_budget=12288; the question is whether that length is productive (long thinking that PASSes) or rumination (long thinking that fails). Per-problem completion length is measured from omk_eval token_stats (characters from the raw completion; tokens via the 128e tokenizer) on the real-n benches, against 128e and v6-coder on the same problems, same greedy Q6_K / llama.cpp stack.

Per-problem completion length — characters (p50 / p90 / max):

Bench (n)	128e	v6-coder	v7-coderx
GPQA Diamond (198)	2571/16136/27811	2582/16100/25243	2458/17984/32411
AIME 2024 (30)	1963/7748/8680	2141/7469/9433	2061/7449/9660
LCB-medium-55	3734/16430/36462	31015/36260/43278	31167/38631/41953
LCB-medium-100	2056/15467/48569	29384/35389/43633	29883/36439/55504
MultiPL-E-100 (300)	245/566/3353	245/573/2725	244/619/1861
MATH-500 (100)	1083/1873/7899	1089/2025/9236	1080/1981/2312
GSM8K (100)	294/746/25989	283/780/11378	279/676/19868
IFEval (100)	877/3755/8263	855/3489/20908	791/3702/8179
HumanEval (164)	698/1284/5354	711/1438/5954	745/1427/4044
HumanEval+ (164)	714/1461/3289	694/1390/5282	742/1423/4461
ARC-Challenge (1172)	1210/1633/6254	1221/1674/48886	1335/2193/45174

Per-problem completion length — tokens (p50 / p90 / max):

Bench (n)	128e	v6-coder	v7-coderx
GPQA Diamond (198)	843/8189/8189	879/8189/8189	837/8189/8189
AIME 2024 (30)	933/3994/4021	946/3993/4011	950/3994/4013
LCB-medium-55	1005/5622/16022	12818/13318/15976	12834/13286/16019
LCB-medium-100	542/5353/16022	12740/13212/15976	12750/13374/15976
MultiPL-E-100 (300)	84/171/1013	85/184/965	84/182/976
MATH-500 (100)	431/895/3377	424/863/3377	403/832/1219
GSM8K (100)	131/271/8853	129/276/4687	122/258/6957
IFEval (100)	219/850/1561	222/797/3898	186/801/4057
HumanEval (164)	226/431/1611	226/448/2084	235/437/1463
HumanEval+ (164)	226/455/996	224/437/2040	230/459/1440
ARC-Challenge (1172)	258/355/1417	259/365/16266	282/487/16260

Budget-saturation incidence — share of problems whose completion reached ≥12k tokens (at/near the thinking_token_budget=12288 cap). Saturation by itself is not rumination — a saturated output that PASSes is productive use of the budget; the pruned reasoning model saturates on nearly every LCB problem, 128e almost never does.

Bench (n)	128e	v6-coder	v7-coderx
LCB-medium-55	1 / 55 (1.8%)	54 / 55 (98.2%)	54 / 55 (98.2%)
LCB-medium-100	2 / 100 (2.0%)	98 / 100 (98.0%)	97 / 100 (97.0%)

Rumination — long thinking that fails to PASS. The right metric is not median length (128e looks short only because it answers easy problems fast). It is the share of the model's budget-saturated outputs that still fail — tokens burned without a correct answer:

Bench (n)	128e	v6-coder	v7-coderx
LCB-medium-55 — saturated-and-failed	1 / 1 (100.0%)	4 / 54 (7.4%)	4 / 54 (7.4%)
LCB-medium-100 — saturated-and-failed	2 / 2 (100.0%)	6 / 98 (6.1%)	9 / 97 (9.3%)
LCB-100 — mean completion tokens, PASS vs FAIL	1392 vs 13782	12698 vs 15051	12623 vs 15143

Key findings:

128e only thinks long when it is lost. Every 128e output that reaches the budget cap is a failure (1/1 on LCB-55, 2/2 on LCB-100), and its failed problems run several× longer than its passed ones (mean 13782 vs 1392 tok on LCB-100).
v7-coderx's long thinking is overwhelmingly productive. It saturates on ~97% of LCB-100 problems but only 9/97 of those saturated outputs fail (9.3%); its PASS and FAIL completions are nearly the same length (mean 12623 vs 15143 tok), so failures are not driven by extra rumination. On LCB-55 it is 4/54 saturated-and-failed.
Comparable to v6-coder's rumination rate. v6-coder ran 4/54 (LCB-55) and 6/98 (LCB-100) saturated-and-failed; v7-coderx is at or below on LCB-55 (4/54) and near on LCB-100 (9/97). The saturated-fail share tracks the model's LCB pass-rate — these are the genuinely hard problems, not extra rumination (PASS and FAIL completions are near-equal length).
Non-LCB benches stay tight. On the short-answer benches (GSM8K / MATH-500 / HE / HE+ / MultiPL-E) p50/p90 length tracks 128e and v6-coder within a few tokens — the targeted prune did not trade length for accuracy on the everyday benches.

Methodology. Per-problem lengths come from omk_eval token_stats over each bench's samples_*.jsonl / lcb_result.samples.jsonl; saturation/PASS-FAIL is computed per problem from completion_tokens + passed. MultiPL-E measures code length, not reasoning (its samples store only the final code block, no <think> trace), so it is a code-conciseness reference rather than a thinking-length signal.

At a glance

	128e (base)	v7-coderx	v7-coder (sibling)
Total params	~26B	~20.8B	~20.8B
Active / token	~4B (top-8 + shared)	~4B	~4B
Experts / layer	128	98 (30 dropped)	98 (30 dropped)
Per-layer floor	—	none (no clamp)	none (no clamp)
Code / LCB weight	—	4× / 3×	3× / 2×
Science targeting	—	off	off
Loop protection	—	agentic_eog force-keep (46 experts)	agentic_eog force-keep
Shared FFN α	1.0	1.2 (`mlp.down_proj`)	1.2
Built from	—	128e original (fresh prune)	128e original

Recipe

The drop map is produced by generate_drop_map_v5.py (omnimergekit) from per-expert, per-class contribution scores on the rebuilt v7 competence maps (expert_neuron_v7_code_gpqa.json — 10 classes, audited producers, multilingual category included), then applied with expert_drop.py, then the agentic loop-protection experts are force-kept and the shared expert is upweighted.

1. code4/lcb3 base recipe (STD16 / fkbroad)

generator     = generate_drop_map_v5fk    # fkbroad (force-keep aware)
target        = 98          # 30 experts/layer dropped
protect_top   = 16          # 16 highest-scoring experts/layer never dropped
alpha         = 2.0         # contribution sharpening exponent
strategy      = max         # per-expert score = MAX over classes (not mean/geomean)
normalize     = rank        # rank-normalize within each (layer, class)
breadth_bonus = 0.5         # reward experts useful across many classes (anti-overfit)
v4_floor_clamp = null       # NO per-layer floor band (unlike fs2440's [24,40])
force_keep    = agentic_eog # pin the 46 loop-protection experts (0/46 dropped)
outlier_mode  = median      # clamp bf16 weight-norm artifacts to layer median
baseline      = teacher_force_98e_p16_clean.json   # tie-break anchor

strategy=max + breadth_bonus is the load-bearing pair — it favours experts strongly useful to at least one class and broadly useful across classes, the optimizer-off-manifold lesson encoded as a recipe. No floor clamp is applied (the fkbroad selection plus the agentic_eog force-keep carry loop-stability instead of a fixed per-layer band).

2. Calibration class weights — code only

Ten contribution classes are scored; the weights steer which specialists survive. v7-coderx zeroes every non-code targeting term:

Class	v7-coderx	v7-coder
generic_math	1	1
generic_logic	1	1
generic_code	4	3
generic_science	1	1
generic_creative	1	1
generic_multilingual	0	0
targeted_humaneval	0	0
targeted_humanevalplus	0	0
targeted_lcb_medium_55	3	2
targeted_gpqa	0	0

HE/HE+ targeting is off because both already sit at/above the un-targeted baseline; the protection budget goes to LiveCodeBench-medium, the bench where pruning hurt most on earlier variants. v7-coderx is the code-maximal sibling of v7-coder: heavier code/LCB weighting (4×/3× vs 3×/2×), plus the agentic loop-protection force-keep both carry. It wins the all-hard LCB-77 and HE+/MATH; v7-coder leads the easier LCB-medium slices and HumanEval. Neither sibling carries a targeted_gpqa term, so both sit near GPQA 51 (no science recovery).

3. Agentic loop-protection force-keep

The earlier fs2440 prune dropped some of the experts that emit end-of-turn / answer-channel tokens, which let the served model loop in agentic use. code4/lcb3 force-keeps the 46 agentic_eog loop-protection experts (identified on the 128e teacher; verified 0/46 dropped by the selection), which is what makes this the loop-fixed re-release. No DERN / redistribution fold is applied.

4. Mandatory shared-FFN α=1.2 (cohort rule)

After expert drop, router_shared_upweight.py --alpha 1.2 --target mlp.down_proj.weight upweights Gemma 4's always-on shared FFN. Every coder variant carries this; omitting it yields the "weak / ruminating" pre-shared baseline and makes cross-variant comparison unfair. A .shared_applied marker records it.

Chat template

chat_template.jinja in this repo is not Google's stock Gemma 4 template — it is our agentic-loop fix (19,177 B, md5 8119c2dcd5e62a4a6b79301ab13ac81d), rebased on 2026-07-30 onto Google's current upstream template (revision 2026-07-20, 18,683 B). transformers picks this file up automatically; tokenizer_config.json deliberately carries no competing chat_template key.

The bug it fixes: the stock template re-injects earlier assistant turns' thinking content back into the prompt on every turn. In long agentic / tool-calling sessions that feeds the model its own reasoning back to itself and drives repetition loops. Google's current 18,683 B template is still affected — its thinking gate carries an unconditional "index past the last user message" disjunct — so this fix remains necessary on top of a fresh upstream template. The rebase leaves Google's newer preserve_thinking flag intact (default false).

Serving the GGUF builds instead? Those embed the same template — pass --jinja to llama.cpp, or it falls back to its own built-in formatter and the fix does not apply.

Intended use

A compact (~12–13 GB at Q4_K_M / NVFP4A16, fits a single 12–16 GB GPU) Gemma 4 checkpoint for maximal coding throughput and instruction-following — the code-extreme (x) member of the v7-coder cohort. For the broader LCB-medium lead and HumanEval, use v7-coder, which leads those slices (GPQA ≈ 51 on both siblings — neither recovers science).

Inherits Gemma 4's thinking format — serve with the reasoning parser enabled (--reasoning-parser gemma4 on vLLM; --reasoning-format deepseek --reasoning-budget 8192 on llama-server).

Limitations

A research prune, not an official Google release. Expert pruning trades breadth for size: generic_multilingual is de-weighted (0×) and graduate science (GPQA) is a budget axis — at 51.01% it is well below the unpruned 128e (67.17% on the same Q6_K run), on par with v7-coder (51.52%); neither sibling recovers science. Quality below ~Q3 / 3-bit degrades on the Gemma 4 MoE — prefer Q4_K_M or higher for production. The GGUF and NVFP4A16 formats are provided for deployment but are not separately benchmarked.

Lineage

128e → (v4 → v5 → v6-coder code line) → v7 competence-map rebuild → code4/lcb3 selection + agentic loop-protection force-keep = v7-coderx. Built and evaluated on the omnimergekit toolchain.