Gemma 4 A4B 98-Expert v7-coder — science-augmented code prune (~20.8B)
Eval complete (Q6_K / llama.cpp, greedy, same host). Every cell in the scoreboard is read from
summary.jsonunder the cohort-pinned greedy recipe (temperature 0.0,top_p 1.0,top_k 0). The 128e and v6-coder columns are the matching same-host Q6_K runs. The GGUF and NVFP4A16 formats are deployment targets and are not separately benchmarked (cohort policy) — the Q6_K column is representative.Headline — a coder that kept all its science. v7-coder holds v6-coder's top-tier code profile (LCB-medium-55 96.36%, LCB-medium-100 97.0%, HumanEval 98.78%, HumanEval+ 92.68%, IFEval 95%) while pulling GPQA-diamond to 70.71% — +9.6pp over v6-coder and at parity with the unpruned 128e on the same Q6_K run (67.17%; GPQA-diamond carries ≈±3pp run-to-run noise at 198 questions). On this same-host run it also edges 128e on AIME (76.67 vs 73.33) and GSM8K (93 vs 89). The science is bought by a dedicated
targeted_gpqacalibration class; the code axes are left untouched. If you want the same prune with code pushed even harder (and science left at baseline), see the sibling v7-coderx.
A research checkpoint that prunes the unpruned
Gemma 4 26B-A4B-it
(128 experts/layer, top-8 + shared, 30 layers) down to 98 experts per layer.
The drop map is the fs2440 code recipe (generic-code 3× + LiveCodeBench-medium
2×, on a [24,40] per-layer floor) plus a targeted_gpqa class at weight 1.5 —
a science-specialist protection term derived from a GPQA-diamond pass-trace
calibration set. Same 98e shape, same router, same attention, same norms as the
rest of the cohort, plus the mandatory shared-FFN α=1.2 upweight all coder
variants carry.
Quantized formats
| Format | Repo | Notes |
|---|---|---|
| bf16 (this repo) | …-v7-coder-it |
9 shards. fs2440 + targeted_gpqa 1.5 drop map + shared α=1.2. |
| GGUF (llama.cpp / ollama) | …-v7-coder-it-GGUF |
Bartowski tier sweep (imatrix K-quants) + ContribDynamic CD-* per-layer quants + F16 + imatrix.dat + mmproj. |
| NVFP4A16 (vLLM) | …-v7-coder-NVFP4A16 |
Native vLLM 4-bit + FP8 block scales, via NVIDIA modelopt main (0.45.0.dev, _QuantFusedExperts). ~13 GB. Deployment format — not separately benchmarked. |
| Ollama | mannix/gemma4-98e-v7-coder |
ollama pull mannix/gemma4-98e-v7-coder:<tier> (:latest = Q4_K_M; :vision-<tier> adds the SigLIP vision tower). |
Benchmarks
Q6_K · llama.cpp · greedy (temperature 0.0, top_p 1.0, top_k 0), all four
models scored on the same host from summary.json. Row-max in bold.
This repo = v7-coder.
| Benchmark | 128e (unpruned) | v6-coder | v7-coder | v7-coderx |
|---|---|---|---|---|
| GPQA-diamond (198q) | 67.17 | 61.11 | 70.71 | 48.48 |
| AIME (30q) | 73.33 | 56.67 | 76.67 | 70.00 |
| MATH500 (100q) | 92.00 | 89.00 | 92.00 | 89.00 |
| GSM8K (100q) | 89.00 | 88.00 | 93.00 | 91.00 |
| ARC-Challenge (full) | 96.50 | 95.39 | 94.80 | 94.28 |
| IFEval (100q, strict) | 97.00 | 92.00 | 95.00 | 95.00 |
| HumanEval (164) | 97.56 | 98.17 | 98.78 | 95.73 |
| HumanEval+ (164) | 92.07 | 92.68 | 92.68 | 92.68 |
| LCB-medium-55 | 96.36 | 92.73 | 96.36 | 98.18 |
| LCB-medium-100 | 97.00 | 94.00 | 97.00 | 99.00 |
| MultiPL-E (100) | 90.00 | 89.00 | 88.67 | 90.00 |
Metrics: GPQA & GSM8K = exact_match flexible-extract · MATH500 = math_verify ·
ARC & AIME = exact_match · IFEval = prompt_level_strict_acc · HumanEval/+ = pass@1
chat-extract · LCB-55/100 & MultiPL-E = pass@1. 128e uses the lcb_medium_55/100
templates; the prunes use lcb_medium_*_v4 (corrected harness, equivalent task).
The code profile is held at or above v6-coder on every axis, and the entire recipe delta lands where it was aimed — +9.6pp GPQA — with no code regression.
Coder-field comparison — v7-coder vs Qwen2.5-Coder-14B / 7B + Qwen3.5-9B (Q6_K, llama.cpp, greedy)
The 9 canonical benches + MultiPL-E-100, all on the identical llama.cpp Q6_K / greedy
recipe (reasoning models served with --reasoning-format deepseek --reasoning-budget 12288 --parallel 2). Architectures differ — this is a same-harness comparison, not a same-class one:
- v7-coder — Gemma-4 26B-A4B MoE pruned to 98 experts (~20.8B total, ~A4B active), reasoning.
- Qwen2.5-Coder-14B / 7B-Instruct — dense, non-reasoning code specialists (bartowski Q6_K).
- Qwen3.5-9B — dense reasoning model (bartowski Q6_K).
| Bench (n) | v7-coder Q6_K | Qwen2.5-Coder-14B | Qwen2.5-Coder-7B | Qwen3.5-9B |
|---|---|---|---|---|
| ARC-Challenge-chat (1172) | 94.80% | 90.53% | 85.58% | 96.76% |
| GPQA Diamond flex (198) | 70.71% | 34.85% | 26.26% | 73.74% |
| GSM8K-100 flex | 93.00% | 89.00% | 80.00% | 79.00% |
| MATH-500-100 math_verify | 92.00% | 62.00% | 66.00% | 59.00% |
| AIME 2024 (30) | 76.67% | 10.00% | 10.00% | 56.67% |
| IFEval-100 (prompt_strict) | 95.00% | 68.00% | 54.00% | 93.00% |
| HumanEval-164 chat | 98.78% | 90.85% | 87.20% | 89.02% |
| HumanEval+-164 chat | 92.68% | 84.76% † | 83.54% | 80.49% |
| LCB-medium-55 v4 | 96.36% | 18.18% † | 12.73% | 58.18% |
| MultiPL-E-100 (macro) | 88.67% | 84.67% | 80.67% | 80.33% |
† Qwen2.5-Coder-14B HumanEval+ / LCB-medium-55 are the same-stack GGUF HE+ sweep numbers (not re-run in this chain). All Qwen cells are the same-host reference runs used on the v6-coder card — Qwen is a fixed reference, so the columns are identical across the cohort; only the Gemma column changes.
Note on Qwen3.5-9B. Qwen3.5-9B is a verbose, slow thinking model: it emits long
<think>reasoning chains (often ≥1900 tokens even on a trivial GSM8K question), so it runs several× slower per question than the non-reasoning Qwen2.5-Coder models — well beyond what its 9B size would suggest. Its GSM8K / MATH-500 / GPQA cells were re-run after a harness fix (under batched, reasoning-parsed serving the verbose thinking intermittently left the final answer inside the reasoning block, mis-scored as empty content).
Answer-length analysis (anti-rumination)
The pruned reasoning model thinks with a bounded thinking_token_budget=12288; the
question is whether that length is productive (long thinking that PASSes) or
rumination (long thinking that fails). Per-problem completion length is measured from
omk_eval token_stats (characters from the raw completion; tokens via the 128e tokenizer)
on the real-n benches, against 128e and v6-coder on the same problems, same greedy
Q6_K / llama.cpp stack.
Per-problem completion length — characters (p50 / p90 / max):
| Bench (n) | 128e | v6-coder | v7-coder |
|---|---|---|---|
| GPQA Diamond (198) | 2571/16136/27811 | 2582/16100/25243 | 2608/16779/27097 |
| AIME 2024 (30) | 1963/7748/8680 | 2141/7469/9433 | 2016/6859/8272 |
| LCB-medium-55 | 3734/16430/36462 | 31015/36260/43278 | 31164/39302/52930 |
| LCB-medium-100 | 2056/15467/48569 | 29384/35389/43633 | 29670/36443/52930 |
| MultiPL-E-100 (300) | 245/566/3353 | 245/573/2725 | 246/641/3578 |
| MATH-500 (100) | 1083/1873/7899 | 1089/2025/9236 | 1064/1849/8921 |
| GSM8K (100) | 294/746/25989 | 283/780/11378 | 270/668/1527 |
| IFEval (100) | 877/3755/8263 | 855/3489/20908 | 775/3263/18052 |
| HumanEval (164) | 698/1284/5354 | 711/1438/5954 | 759/1426/3801 |
| HumanEval+ (164) | 714/1461/3289 | 694/1390/5282 | 745/1360/3822 |
| ARC-Challenge (1172) | 1210/1633/6254 | 1221/1674/48886 | 1222/1691/8344 |
Per-problem completion length — tokens (p50 / p90 / max):
| Bench (n) | 128e | v6-coder | v7-coder |
|---|---|---|---|
| GPQA Diamond (198) | 843/8189/8189 | 879/8189/8189 | 853/8189/8190 |
| AIME 2024 (30) | 933/3994/4021 | 946/3993/4011 | 889/3955/4001 |
| LCB-medium-55 | 1005/5622/16022 | 12818/13318/15976 | 12724/13134/15667 |
| LCB-medium-100 | 542/5353/16022 | 12740/13212/15976 | 12709/13115/15976 |
| MultiPL-E-100 (300) | 84/171/1013 | 85/184/965 | 84/188/1013 |
| MATH-500 (100) | 431/895/3377 | 424/863/3377 | 407/844/3352 |
| GSM8K (100) | 131/271/8853 | 129/276/4687 | 121/247/485 |
| IFEval (100) | 219/850/1561 | 222/797/3898 | 190/741/4008 |
| HumanEval (164) | 226/431/1611 | 226/448/2084 | 239/450/1501 |
| HumanEval+ (164) | 226/455/996 | 224/437/2040 | 238/440/1382 |
| ARC-Challenge (1172) | 258/355/1417 | 259/365/16266 | 261/370/1960 |
Budget-saturation incidence — share of problems whose completion reached ≥12k tokens
(at/near the thinking_token_budget=12288 cap). Saturation by itself is not rumination —
a saturated output that PASSes is productive use of the budget; the pruned reasoning model
saturates on nearly every LCB problem, 128e almost never does.
| Bench (n) | 128e | v6-coder | v7-coder |
|---|---|---|---|
| LCB-medium-55 | 1 / 55 (1.8%) | 54 / 55 (98.2%) | 54 / 55 (98.2%) |
| LCB-medium-100 | 2 / 100 (2.0%) | 98 / 100 (98.0%) | 96 / 98 (98.0%) |
Rumination — long thinking that fails to PASS. The right metric is not median length (128e looks short only because it answers easy problems fast). It is the share of the model's budget-saturated outputs that still fail — tokens burned without a correct answer:
| Bench (n) | 128e | v6-coder | v7-coder |
|---|---|---|---|
| LCB-medium-55 — saturated-and-failed | 1 / 1 (100.0%) | 4 / 54 (7.4%) | 2 / 54 (3.7%) |
| LCB-medium-100 — saturated-and-failed | 2 / 2 (100.0%) | 6 / 98 (6.1%) | 1 / 96 (1.0%) |
| LCB-100 — mean completion tokens, PASS vs FAIL | 1392 vs 13782 | 12698 vs 15051 | 12651 vs 12844 |
Key findings:
- 128e only thinks long when it is lost. Every 128e output that reaches the budget cap is a failure (1/1 on LCB-55, 2/2 on LCB-100), and its failed problems run several× longer than its passed ones (mean 13782 vs 1392 tok on LCB-100).
- v7-coder's long thinking is overwhelmingly productive. It saturates on ~98% of LCB-100 problems but only 1/96 of those saturated outputs fail (1.0%); its PASS and FAIL completions are nearly the same length (mean 12651 vs 12844 tok), so failures are not driven by extra rumination. On LCB-55 it is 2/54 saturated-and-failed.
- At or below v6-coder's rumination rate. v6-coder ran 4/54 (LCB-55) and 6/98 (LCB-100) saturated-and-failed; v7-coder matches or improves on both.
- Non-LCB benches stay tight. On the short-answer benches (GSM8K / MATH-500 / HE / HE+ / MultiPL-E) p50/p90 length tracks 128e and v6-coder within a few tokens — the targeted prune did not trade length for accuracy on the everyday benches.
Methodology. Per-problem lengths come from omk_eval
token_statsover each bench'ssamples_*.jsonl/lcb_result.samples.jsonl; saturation/PASS-FAIL is computed per problem fromcompletion_tokens+passed. MultiPL-E measures code length, not reasoning (its samples store only the final code block, no<think>trace), so it is a code-conciseness reference rather than a thinking-length signal.
At a glance
| 128e (base) | v7-coder | v7-coderx (sibling) | |
|---|---|---|---|
| Total params | ~26B | ~20.8B | ~20.8B |
| Active / token | ~4B (top-8 + shared) | ~4B | ~4B |
| Experts / layer | 128 | 98 (30 dropped) | 98 (30 dropped) |
| Per-layer floor | — | [24, 40] | [24, 40] |
| Science targeting | — | targeted_gpqa 1.5× |
off |
| Shared FFN α | 1.0 | 1.2 (mlp.down_proj) |
1.2 |
| Built from | — | 128e original (fresh prune) | 128e original |
Recipe
The drop map is produced by generate_drop_map_v5.py (omnimergekit) from
per-expert, per-class contribution scores on the rebuilt v7 competence maps
(expert_neuron_v7_code_gpqa.json — 10 classes, audited producers, multilingual
category included), then applied with expert_drop.py, then the shared expert is
upweighted.
1. fs2440 base (shared with v7-coderx)
target = 98 # 30 experts/layer dropped
protect_top = 16 # 16 highest-scoring experts/layer never dropped
alpha = 2.0 # contribution sharpening exponent
strategy = max # per-expert score = MAX over classes (not mean/geomean)
normalize = rank # rank-normalize within each (layer, class)
breadth_bonus = 0.5 # reward experts useful across many classes (anti-overfit)
v4_floor_map = v4_layer_floor_map_v7.json # per-layer keep floor
v4_floor_data = expert_neuron_base_v7.json
v4_floor_clamp = [24, 40] # floor bounded into this band per layer
outlier_mode = median # clamp bf16 weight-norm artifacts to layer median
outlier_wnorm_thresh = 1e4
baseline = teacher_force_98e_p16_clean.json # tie-break anchor
strategy=max + breadth_bonus is the load-bearing pair — it favours experts
strongly useful to at least one class and broadly useful across classes,
the optimizer-off-manifold
lesson encoded as a recipe. The [24,40] floor is the 98e-scaled analogue of the
62e [15,25] band that won the loop-floor study.
2. Calibration class weights — the +gpqa term
Ten contribution classes are scored; the weights steer which specialists survive. v7-coder concentrates the targeted budget on code and adds a science term:
| Class | v7-coder | v7-coderx |
|---|---|---|
| generic_math | 1 | 1 |
| generic_logic | 1 | 1 |
| generic_code | 3 | 3 |
| generic_science | 1 | 1 |
| generic_creative | 1 | 1 |
| generic_multilingual | 0 | 0 |
| targeted_humaneval | 0 | 0 |
| targeted_humanevalplus | 0 | 0 |
| targeted_lcb_medium_55 | 2 | 2 |
| targeted_gpqa | 1.5 | 0 |
A sweep over the targeted_gpqa weight × floor band located weight 1.5 at floor
[24,40] as the frontier point: weight 1.0 under-recovers GPQA (≈53%, heavy
rumination), weight 2.0 overshoots into a small code regression, and the
[24,40] floor beats [20,35] by ~3.6pp LCB-55 for a coder. 1.5 is the knee
that buys ~+22pp GPQA over the un-targeted prune (v7-coderx) for ≈1.8pp LCB-55.
3. Mandatory shared-FFN α=1.2 (cohort rule)
After expert drop, router_shared_upweight.py --alpha 1.2 --target mlp.down_proj.weight
upweights Gemma 4's always-on shared FFN. Every coder variant carries this; omitting
it yields the "weak / ruminating" pre-shared baseline and makes cross-variant
comparison unfair. A .shared_applied marker records it.
Intended use
A compact (~12–13 GB at Q4_K_M / NVFP4A16, fits a single 12–16 GB GPU) Gemma 4 checkpoint for agentic coding and code reasoning that also needs solid graduate science — the science-augmented member of the v7-coder cohort. For maximal raw coding throughput with science at baseline, use v7-coderx.
Inherits Gemma 4's thinking format — serve with the reasoning parser enabled
(--reasoning-parser gemma4 on vLLM; --reasoning-format deepseek --reasoning-budget 8192
on llama-server).
Limitations
A research prune, not an official Google release. Expert pruning trades a slice of
breadth for size: generic_multilingual is de-weighted (0×), and the non-code
generalist axes carry the prune budget. GPQA/AIME/GSM8K parity-with-128e is measured
on small benches (30–198 questions) at greedy and carries run-to-run variance —
read the scoreboard as "recovered the science gap", not a robust win over the base.
Quality below ~Q3 / 3-bit degrades on the Gemma 4 MoE — prefer Q4_K_M or higher for
production. The GGUF and NVFP4A16 formats are provided for deployment but are not
separately benchmarked.
Lineage
128e → (v4 → v5 → v6-coder code line) → v7 competence-map rebuild → fs2440 code
floor → +targeted_gpqa 1.5× = v7-coder. Built and evaluated on the
omnimergekit toolchain.
- Downloads last month
- 333