Bottom Line
Full 330 GEPA mean: 0.7350 (v10 0.7307, delta +0.0043)
Full 330 Micro-F1: 0.8206 (v10 0.8231, delta -0.0025)
Precision / Recall: 0.8246 / 0.8167 (precision down, recall up vs. v10)
FP / FN: 110 / 116 (v10 102 / 119)
Heldout Micro-F1: 0.8417 (v10 0.8296, delta +0.0121)
Best Pareto score: 0.6979 (seed 0.5742)
GEPA-best is not a clear replacement for v10. On the full 330 rows it improves the GEPA objective (+0.0043) and exact match (+0.0182), but false positives rise from 102 to 110 and micro-F1 is slightly lower (-0.0025).
Whole 330 Score Graph
[Chart: v10 seed vs. GEPA best]
Open The Detailed Graphs
Proposal Graphs: Every proposal attempt, accepted/rejected status, subsample deltas, and best-so-far Pareto score.
Whole Dataset Comparison: Full 330-row GEPA-best versus v10 charts and metric table.
Iteration Graph: Original GEPA score report for the run.
Prompt Diff Picker: Dropdown-to-dropdown comparison for individual candidate prompts.
Candidate Tree: GEPA-native candidate tree visualization.
Final Report: Run settings, heldout comparison, whole-330 check, and artifact links.
Metric Table
| Metric | v10 seed | GEPA best | Delta |
|---|---|---|---|
| GEPA mean score | 0.7307 | 0.7350 | +0.0043 |
| Micro-F1 | 0.8231 | 0.8206 | -0.0025 |
| Precision | 0.8344 | 0.8246 | -0.0098 |
| Recall | 0.8120 | 0.8167 | +0.0047 |
| Exact match | 0.5242 | 0.5424 | +0.0182 |
| False positives | 102 | 110 | +8 |
| False negatives | 119 | 116 | -3 |
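
The micro-averaged rows above can be cross-checked against the FP/FN counts. The sketch below is a sanity check, not part of the run's tooling. It assumes the standard micro-averaged definitions, precision = TP / (TP + FP) and F1 = 2TP / (2TP + FP + FN). The report does not list true-positive counts, so the helper `implied_tp` (a hypothetical name) infers them from the rounded precision and the FP count. Treat the inferred TP values as an assumption.

```python
# Sanity check: recompute micro precision, recall, and F1 from counts.
# TP is NOT reported in the table; it is inferred here (assumption).

def implied_tp(precision: float, fp: int) -> int:
    """Solve precision = TP / (TP + FP) for TP and round to an integer."""
    return round(precision * fp / (1.0 - precision))

def micro_metrics(tp: int, fp: int, fn: int) -> dict:
    """Micro-averaged precision, recall, and F1 from pooled counts."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }

# Values copied from the metric table (precision, FP, FN).
reported = {
    "v10 seed": {"precision": 0.8344, "fp": 102, "fn": 119},
    "GEPA best": {"precision": 0.8246, "fp": 110, "fn": 116},
}

results = {}
for name, row in reported.items():
    tp = implied_tp(row["precision"], row["fp"])
    results[name] = {"tp": tp, **micro_metrics(tp, row["fp"], row["fn"])}
    m = results[name]
    print(f"{name}: TP={tp} P={m['precision']:.4f} "
          f"R={m['recall']:.4f} F1={m['f1']:.4f}")
```

Under these assumptions the recomputed precision, recall, and micro-F1 agree with the table to four decimals, and both prompts imply the same number of gold labels (TP + FN), as expected when they are scored on the same 330 rows.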
Archive
2026-06-14 12B cardinality report
2026-06-14 12B score graph
2026-06-14 prompt diffs