Gemma 4 A4B 98-Expert v7-coder — loop-fixed code prune (~20.8B)

Eval complete (Q6_K / llama.cpp, greedy, same host). Every cell in the scoreboard is read from summary.json under the cohort-pinned greedy recipe (temperature 0.0, top_p 1.0, top_k 0). The 128e, v6-coder and v7-coderx columns are the matching same-host Q6_K runs. The GGUF and NVFP4A16 formats are deployment targets and are not separately benchmarked (cohort policy) — the Q6_K column is representative.

Headline — the cohort's balanced code build. v7-coder leads the cohort on the broad LiveCodeBench-medium slices (LCB-55-v4 98.18%, LCB-100-v4 94.0%) and on HumanEval (98.17%), ties the cohort top on MATH-500 (95.0%), and takes AIME (80.0%). On the all-hard LiveCodeBench-77 set — the most demanding and most discriminating LCB slice — it scores 84.42% (128e 79.22%, v7-coderx 85.71%), just behind the code-maximal sibling. This is the loop-fixed build: it force-keeps the agentic loop-protection experts and replaces the earlier looping fs2440 prune. Like its sibling it spends the prune budget on graduate science — GPQA-diamond sits at 51.52% (no targeted_gpqa term; ≈ v7-coderx 51.01%). For the hardest-code lean (all-hard LCB + HE+), see the sibling v7-coderx.

A research checkpoint that prunes the unpruned Gemma 4 26B-A4B-it (128 experts/layer, top-8 + shared, 30 layers) down to 98 experts per layer. The fkbroad drop map (generate_drop_map_v5) up-weights generic-code (3×) and LiveCodeBench-medium (2×) with no science or multilingual targeting, and force-keeps the agentic loop-protection experts (agentic_eog, 46 experts, 0/46 dropped) so the served model does not loop. Same 98e shape, same router, same attention, same norms as the rest of the cohort, plus the mandatory shared-FFN α=1.2 upweight all coder variants carry. No per-layer floor clamp and no DERN fold.

Quantized formats

Format	Repo	Notes
bf16 (this repo)	`…-v7-coder-it`	9 shards. fkbroad code3/lcb2 drop map + agentic_eog force-keep + shared α=1.2.
GGUF (llama.cpp / ollama)	`…-v7-coder-it-GGUF`	Bartowski tier sweep (imatrix K-quants) + ContribDynamic CD-* per-layer quant + F16 + `imatrix.dat` + `mmproj`.
NVFP4A16 (vLLM)	`…-v7-coder-NVFP4A16`	Native vLLM 4-bit + FP8 block scales, via NVIDIA `modelopt` main (0.45.0.dev, `_QuantFusedExperts`). ~13 GB. Deployment format — not separately benchmarked.
Ollama	`mannix/gemma4-98e-v7-coder`	`ollama pull mannix/gemma4-98e-v7-coder:<tier>` (`:latest` = Q4_K_M; `:vision-<tier>` adds the SigLIP vision tower).

Benchmarks

Q6_K · llama.cpp · greedy (temperature 0.0, top_p 1.0, top_k 0), all four models scored on the same host from summary.json. Row-max in bold. This repo = v7-coder.

Benchmark	128e (unpruned)	v6-coder	v7-coder	v7-coderx
GPQA-diamond (198q)	67.17	61.11	51.52	51.01
AIME (30q)	73.33	56.67	80.00	76.67
MATH500 (100q)	92.00	89.00	95.00	95.00
GSM8K (100q)	89.00	88.00	91.00	93.00
ARC-Challenge (full)	96.50	95.39	92.15	86.60
IFEval (100q, strict)	97.00	92.00	92.00	92.00
HumanEval (164)	97.56	98.17	98.17	96.95
HumanEval+ (164)	92.07	92.68	92.07	93.29
LCB-medium-55 v4	96.36	92.73	98.18	92.73
LCB-medium-100 v4	97.00	94.00	94.00	91.00
MultiPL-E (100)	90.00	89.00	89.67	89.00

_{Metrics: GPQA & GSM8K = exact_match flexible-extract · MATH500 = math_verify ·
ARC & AIME = exact_match · IFEval = prompt_level_strict_acc · HumanEval/+ = pass@1
chat-extract · LCB-55/100 & MultiPL-E = pass@1. 128e uses the lcb_medium_55/100
templates; the prunes use lcb_medium_*_v4 (corrected harness, equivalent task).}

v7-coder is the balanced code sibling: it tops the cohort on LCB-medium and HumanEval and ties on MATH/AIME, while v7-coderx leads the all-hard LCB-77 and HE+. Both pay the budget on graduate science (GPQA) and the easier ARC / instruction axes.

LiveCodeBench across problem sets

The code score depends on the LiveCodeBench slice. All cells are the same greedy Q6_K / imat-Q6 llama.cpp stack (build provenance verified per run); v4-55/100 mirror the 9-bench above. The all-hard 77q set is the most demanding and the most discriminating across the cohort.

LCB problem set	128e	v7-coder	v7-coderx
LCB-medium-55 (v4, 55q)	96.36%	98.18%	92.73%
LCB-medium-100 (v4, 100q)	97.00%	94.00%	91.00%
LCB-hard-77 (all-hard, 77q)	79.22%	84.42%	85.71%

Coder-field comparison — v7-coder vs Qwen2.5-Coder-14B / 7B + Qwen3.5-9B (Q6_K, llama.cpp, greedy)

The 9 canonical benches + MultiPL-E-100, all on the identical llama.cpp Q6_K / greedy recipe (reasoning models served with --reasoning-format deepseek --reasoning-budget 12288 --parallel 2). Architectures differ — this is a same-harness comparison, not a same-class one:

v7-coder — Gemma-4 26B-A4B MoE pruned to 98 experts (~20.8B total, ~A4B active), reasoning.
Qwen2.5-Coder-14B / 7B-Instruct — dense, non-reasoning code specialists (bartowski Q6_K).
Qwen3.5-9B — dense reasoning model (bartowski Q6_K).

Bench (n)	v7-coder Q6_K	Qwen2.5-Coder-14B	Qwen2.5-Coder-7B	Qwen3.5-9B
ARC-Challenge-chat (1172)	92.15%	90.53%	85.58%	96.76%
GPQA Diamond flex (198)	51.52%	34.85%	26.26%	73.74%
GSM8K-100 flex	91.00%	89.00%	80.00%	79.00%
MATH-500-100 math_verify	95.00%	62.00%	66.00%	59.00%
AIME 2024 (30)	80.00%	10.00%	10.00%	56.67%
IFEval-100 (prompt_strict)	92.00%	68.00%	54.00%	93.00%
HumanEval-164 chat	98.17%	90.85%	87.20%	89.02%
HumanEval+-164 chat	92.07%	84.76% †	83.54%	80.49%
LCB-medium-55 v4	98.18%	18.18% †	12.73%	58.18%
MultiPL-E-100 (macro)	89.67%	84.67%	80.67%	80.33%

† Qwen2.5-Coder-14B HumanEval+ / LCB-medium-55 are the same-stack GGUF HE+ sweep numbers (not re-run in this chain). All Qwen cells are the same-host reference runs used on the v6-coder card — Qwen is a fixed reference, so the columns are identical across the cohort; only the Gemma column changes.

Note on Qwen3.5-9B. Qwen3.5-9B is a verbose, slow thinking model: it emits long <think> reasoning chains (often ≥1900 tokens even on a trivial GSM8K question), so it runs several× slower per question than the non-reasoning Qwen2.5-Coder models — well beyond what its 9B size would suggest. Its GSM8K / MATH-500 / GPQA cells were re-run after a harness fix (under batched, reasoning-parsed serving the verbose thinking intermittently left the final answer inside the reasoning block, mis-scored as empty content).

At a glance

	128e (base)	v7-coder	v7-coderx (sibling)
Total params	~26B	~20.8B	~20.8B
Active / token	~4B (top-8 + shared)	~4B	~4B
Experts / layer	128	98 (30 dropped)	98 (30 dropped)
Per-layer floor	—	none (no clamp)	none (no clamp)
Code / LCB weight	—	3× / 2×	4× / 3×
Science targeting	—	off	off
Loop protection	—	agentic_eog force-keep (46 experts)	agentic_eog force-keep (46 experts)
Shared FFN α	1.0	1.2 (`mlp.down_proj`)	1.2
Built from	—	128e original (fresh prune)	128e original

Recipe

The drop map is produced by generate_drop_map_v5.py (omnimergekit) from per-expert, per-class contribution scores on the rebuilt v7 competence maps (expert_neuron_v7_code.json — 10 classes, audited producers, multilingual category included), then applied with expert_drop.py, then the agentic loop-protection experts are force-kept and the shared expert is upweighted.

1. fkbroad base recipe (STD16)

generator     = generate_drop_map_v5    # fkbroad (force-keep aware)
target        = 98          # 30 experts/layer dropped
protect_top   = 16          # 16 highest-scoring experts/layer never dropped
alpha         = 2.0         # contribution sharpening exponent
strategy      = max         # per-expert score = MAX over classes (not mean/geomean)
normalize     = rank        # rank-normalize within each (layer, class)
breadth_bonus = 0.5         # reward experts useful across many classes (anti-overfit)
v4_floor_clamp = null       # NO per-layer floor band (unlike the retired fs2440's [24,40])
force_keep    = agentic_eog # pin the 46 loop-protection experts (0/46 dropped)
outlier_mode  = median      # clamp bf16 weight-norm artifacts to layer median
baseline      = teacher_force_98e_p16_clean.json   # tie-break anchor

strategy=max + breadth_bonus is the load-bearing pair — it favours experts strongly useful to at least one class and broadly useful across classes, the optimizer-off-manifold lesson encoded as a recipe. No floor clamp is applied (the fkbroad selection plus the agentic_eog force-keep carry loop-stability instead of a fixed per-layer band).

2. Calibration class weights — code only

Ten contribution classes are scored; the weights steer which specialists survive. v7-coder zeroes every non-code targeting term (no targeted_gpqa):

Class	v7-coder	v7-coderx
generic_math	1	1
generic_logic	1	1
generic_code	3	4
generic_science	1	1
generic_creative	1	1
generic_multilingual	0	0
targeted_humaneval	0	0
targeted_humanevalplus	0	0
targeted_lcb_medium_55	2	3
targeted_gpqa	0	0

v7-coder is the balanced code sibling of v7-coderx: lighter code/LCB weighting (3×/2× vs coderx's 4×/3×), no science or multilingual targeting, plus the agentic loop-protection force-keep. It leads the cohort on LCB-medium and HumanEval; v7-coderx spends more of the budget on the all-hard LCB-77 and HE+. Neither carries a targeted_gpqa term, so both sit near GPQA 51 (no science recovery).

3. Agentic loop-protection force-keep

The earlier fs2440 prune dropped some of the experts that emit end-of-turn / answer-channel tokens, which let the served model loop in agentic use. The fkbroad selection force-keeps the 46 agentic_eog loop-protection experts (identified on the 128e teacher; verified 0/46 dropped by the selection — the same loop-protection set the sibling v7-coderx carries), which is what makes this the loop-fixed re-release. No DERN / redistribution fold is applied.

4. Mandatory shared-FFN α=1.2 (cohort rule)

After expert drop, router_shared_upweight.py --alpha 1.2 --target mlp.down_proj.weight upweights Gemma 4's always-on shared FFN. Every coder variant carries this; omitting it yields the "weak / ruminating" pre-shared baseline and makes cross-variant comparison unfair. A .shared_applied marker records it.

Chat template

chat_template.jinja in this repo is not Google's stock Gemma 4 template — it is our agentic-loop fix (19,177 B, md5 8119c2dcd5e62a4a6b79301ab13ac81d), rebased on 2026-07-30 onto Google's current upstream template (revision 2026-07-20, 18,683 B). transformers picks this file up automatically; tokenizer_config.json deliberately carries no competing chat_template key.

The bug it fixes: the stock template re-injects earlier assistant turns' thinking content back into the prompt on every turn. In long agentic / tool-calling sessions that feeds the model its own reasoning back to itself and drives repetition loops. Google's current 18,683 B template is still affected — its thinking gate carries an unconditional "index past the last user message" disjunct — so this fix remains necessary on top of a fresh upstream template. The rebase leaves Google's newer preserve_thinking flag intact (default false).

Serving the GGUF builds instead? Those embed the same template — pass --jinja to llama.cpp, or it falls back to its own built-in formatter and the fix does not apply.

Intended use

A compact (~13 GB at Q4_K_M / NVFP4A16, fits a single 12–16 GB GPU) Gemma 4 checkpoint for agentic coding and code reasoning — the balanced code member of the v7-coder cohort (leads LCB-medium + HumanEval). For the hardest-code lean (all-hard LCB-77 + HE+), use v7-coderx.

Inherits Gemma 4's thinking format — serve with the reasoning parser enabled (--reasoning-parser gemma4 on vLLM; --reasoning-format deepseek --reasoning-budget 8192 on llama-server).

Limitations

A research prune, not an official Google release. Expert pruning trades breadth for size: generic_multilingual is de-weighted (0×) and graduate science (GPQA) is a budget axis — at 51.52% it is well below the unpruned 128e (67.17% on the same Q6_K run), on par with v7-coderx (51.01%). Neither sibling recovers science; both are code specialists. Quality below ~Q3 / 3-bit degrades on the Gemma 4 MoE — prefer Q4_K_M or higher for production. The GGUF and NVFP4A16 formats are provided for deployment but are not separately benchmarked.

Lineage

128e → (v4 → v5 → v6-coder code line) → v7 competence-map rebuild → fkbroad code3/lcb2 selection + agentic loop-protection force-keep = v7-coder (loop-fixed; supersedes fs2440). Built and evaluated on the omnimergekit toolchain.