Upload localpager GEPA report outputs
This view is limited to 50 files because it contains too many changes.
- .gitattributes +2 -0
- README.md +9 -5
- compare-12b-e4b-smoke-20260612T1850/e4b-smoke-best-12b.json +49 -0
- compare-12b-e4b-smoke-20260612T1850/seed-12b.json +47 -0
- compare-12b-e4b-smoke-20260612T1856/e4b-smoke-best-12b.json +181 -0
- compare-12b-e4b-smoke-20260612T1856/seed-12b.json +174 -0
- compare-12b-gepa-six-holdout-20260612T1914/gepa-12b-six-best-12b-offset6-limit6.json +172 -0
- compare-12b-gepa-six-holdout-20260612T1914/seed-12b-offset6-limit6.json +170 -0
- compare-12b-gepa-twelve-holdout-20260612T1944/gepa-12b-twelve-best-12b-offset12-limit6.json +164 -0
- compare-12b-gepa-twelve-holdout-20260612T1944/seed-12b-offset12-limit6.json +166 -0
- dashboard-20260613-gepa/artifacts/2026-06-13-gepa-12b-six-best.prompt.md +100 -0
- dashboard-20260613-gepa/artifacts/2026-06-13-gepa-12b-six-best.routing_policy.md +45 -0
- dashboard-20260613-gepa/index.html +23 -0
- dashboard-20260613-gepa/iteration-score-data.json +193 -0
- dashboard-20260613-gepa/iteration-score-summary.csv +16 -0
- dashboard-20260613-gepa/iterations.html +236 -0
- dashboard-20260613-gepa/legacy-gepa-six-dashboard.html +838 -0
- dashboard-20260613-gepa/live-gepa-iteration-data.json +29 -0
- dashboard-20260613-gepa/live-iterations.html +121 -0
- dashboard-20260613-gepa/summary.json +51 -0
- final-cardinality-report.html +107 -0
- gepa-12b-multi-from-six-20260613T051216Z/best.prompt.md +165 -0
- gepa-12b-multi-from-six-20260613T051216Z/best.routing_policy.md +110 -0
- gepa-12b-multi-from-six-20260613T051216Z/candidate_tree.html +179 -0
- gepa-12b-multi-from-six-20260613T051216Z/candidates.json +14 -0
- gepa-12b-multi-from-six-20260613T051216Z/generated_best_outputs_valset/task_0/iter_0_prog_0.json +1 -0
- gepa-12b-multi-from-six-20260613T051216Z/generated_best_outputs_valset/task_1/iter_0_prog_0.json +1 -0
- gepa-12b-multi-from-six-20260613T051216Z/generated_best_outputs_valset/task_10/iter_0_prog_0.json +1 -0
- gepa-12b-multi-from-six-20260613T051216Z/generated_best_outputs_valset/task_11/iter_0_prog_0.json +1 -0
- gepa-12b-multi-from-six-20260613T051216Z/generated_best_outputs_valset/task_12/iter_0_prog_0.json +1 -0
- gepa-12b-multi-from-six-20260613T051216Z/generated_best_outputs_valset/task_13/iter_0_prog_0.json +1 -0
- gepa-12b-multi-from-six-20260613T051216Z/generated_best_outputs_valset/task_14/iter_0_prog_0.json +1 -0
- gepa-12b-multi-from-six-20260613T051216Z/generated_best_outputs_valset/task_15/iter_0_prog_0.json +1 -0
- gepa-12b-multi-from-six-20260613T051216Z/generated_best_outputs_valset/task_16/iter_0_prog_0.json +1 -0
- gepa-12b-multi-from-six-20260613T051216Z/generated_best_outputs_valset/task_17/iter_0_prog_0.json +1 -0
- gepa-12b-multi-from-six-20260613T051216Z/generated_best_outputs_valset/task_2/iter_0_prog_0.json +1 -0
- gepa-12b-multi-from-six-20260613T051216Z/generated_best_outputs_valset/task_3/iter_0_prog_0.json +1 -0
- gepa-12b-multi-from-six-20260613T051216Z/generated_best_outputs_valset/task_4/iter_0_prog_0.json +1 -0
- gepa-12b-multi-from-six-20260613T051216Z/generated_best_outputs_valset/task_5/iter_0_prog_0.json +1 -0
- gepa-12b-multi-from-six-20260613T051216Z/generated_best_outputs_valset/task_6/iter_0_prog_0.json +1 -0
- gepa-12b-multi-from-six-20260613T051216Z/generated_best_outputs_valset/task_7/iter_0_prog_0.json +1 -0
- gepa-12b-multi-from-six-20260613T051216Z/generated_best_outputs_valset/task_8/iter_0_prog_0.json +1 -0
- gepa-12b-multi-from-six-20260613T051216Z/generated_best_outputs_valset/task_9/iter_0_prog_0.json +1 -0
- gepa-12b-multi-from-six-20260613T051216Z/gepa-result.json +223 -0
- gepa-12b-multi-from-six-20260613T051216Z/gepa_state.bin +3 -0
- gepa-12b-multi-from-six-20260613T051216Z/optimize.stderr.log +0 -0
- gepa-12b-multi-from-six-20260613T051216Z/optimize.stdout.json +275 -0
- gepa-12b-multi-from-six-20260613T051216Z/run_log.json +131 -0
- gepa-12b-multi-from-six-20260613T051216Z/run_log.txt +246 -0
- gepa-12b-multi-from-six-20260613T051216Z/run_log_stderr.txt +0 -0
.gitattributes
CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+screenshots/prompt-diffs-after.png filter=lfs diff=lfs merge=lfs -text
+screenshots/prompt-diffs-before.png filter=lfs diff=lfs merge=lfs -text
README.md
CHANGED
@@ -1,10 +1,14 @@
 ---
-title:
-emoji:
-colorFrom:
-colorTo:
+title: Localpager GEPA Reports
+emoji: 📊
+colorFrom: blue
+colorTo: green
 sdk: static
 pinned: false
 ---
 
-
+# Localpager GEPA Reports
+
+Static report bundle for the Localpager prompt-optimizer GEPA run.
+
+Open the Space app to view the final cardinality-safe report, score graphs, and prompt diff explorer.
compare-12b-e4b-smoke-20260612T1850/e4b-smoke-best-12b.json
ADDED
|
@@ -0,0 +1,49 @@
+{
+  "candidate": "e4b-smoke-best",
+  "concurrency": 1,
+  "harness": "localpager-agent",
+  "mean_score": 0.625,
+  "routing_policy_path": "prompt-optimizer/out/gepa-e4b-smoke-20260612T184748Z/best.routing_policy.md",
+  "routing_policy_sha256": "7858c420430d3620b96ea66fc6c84d6f54d99dd0b2865c7f986751d317e7bd6a",
+  "row_reports": [
+    {
+      "error": null,
+      "gold_topics": [
+        "acp",
+        "gateway",
+        "agent_runtime"
+      ],
+      "id": "openclaw-openclaw-48940",
+      "predicted_topics": [
+        "acp",
+        "gateway",
+        "agent_runtime"
+      ],
+      "score": 1.0,
+      "target": "https://github.com/openclaw/openclaw/pull/48940",
+      "title": "ACP: add gateway-owned node-backed runtime"
+    },
+    {
+      "error": null,
+      "gold_topics": [
+        "mcp_tooling",
+        "config",
+        "security"
+      ],
+      "id": "openclaw-openclaw-80783",
+      "predicted_topics": [
+        "mcp_tooling",
+        "local_model_providers",
+        "security"
+      ],
+      "score": 0.25,
+      "target": "https://github.com/openclaw/openclaw/pull/80783",
+      "title": "Policy: add model, network, and MCP conformance checks"
+    }
+  ],
+  "rows": 2,
+  "scores": [
+    1.0,
+    0.25
+  ]
+}
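In this report, `mean_score` appears to be the arithmetic mean of the `scores` array, which mirrors the per-row `score` values. A minimal Python sketch to check that reading against a report file; the helper name `check_mean_score` is hypothetical and the field semantics are an assumption drawn from the values shown here, not from the harness source.

```python
import json
import math


def check_mean_score(report: dict) -> bool:
    """Return True if mean_score equals the arithmetic mean of scores.

    Assumes (not confirmed by the harness) that mean_score is a plain
    unweighted average over the scores array.
    """
    scores = report["scores"]
    recomputed = sum(scores) / len(scores)
    return math.isclose(recomputed, report["mean_score"], rel_tol=1e-9)


# Values copied from e4b-smoke-best-12b.json in this diff.
report = json.loads('{"mean_score": 0.625, "rows": 2, "scores": [1.0, 0.25]}')
print(check_mean_score(report))
```

The same check can be pointed at any of the other report files by loading them with `json.load` instead of the inline string.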
compare-12b-e4b-smoke-20260612T1850/seed-12b.json
ADDED
|
@@ -0,0 +1,47 @@
+{
+  "candidate": "seed",
+  "concurrency": 1,
+  "harness": "localpager-agent",
+  "mean_score": 0.375,
+  "routing_policy_sha256": "3d84a551239a9c26865a44758f0602548c1361482eef087c194d878a2d92ad37",
+  "row_reports": [
+    {
+      "error": null,
+      "gold_topics": [
+        "acp",
+        "gateway",
+        "agent_runtime"
+      ],
+      "id": "openclaw-openclaw-48940",
+      "predicted_topics": [
+        "acp",
+        "gateway"
+      ],
+      "score": 0.5,
+      "target": "https://github.com/openclaw/openclaw/pull/48940",
+      "title": "ACP: add gateway-owned node-backed runtime"
+    },
+    {
+      "error": null,
+      "gold_topics": [
+        "mcp_tooling",
+        "config",
+        "security"
+      ],
+      "id": "openclaw-openclaw-80783",
+      "predicted_topics": [
+        "mcp_tooling",
+        "security",
+        "model_serving"
+      ],
+      "score": 0.25,
+      "target": "https://github.com/openclaw/openclaw/pull/80783",
+      "title": "Policy: add model, network, and MCP conformance checks"
+    }
+  ],
+  "rows": 2,
+  "scores": [
+    0.5,
+    0.25
+  ]
+}
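Each report carries a `routing_policy_sha256` field, which presumably identifies the routing policy file the run used. A minimal Python sketch for checking a local policy file against that field, assuming the value is the lowercase hex SHA-256 digest of the file's raw bytes (the source does not say how it was computed); `sha256_of_file` and `policy_matches` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path) -> str:
    """Hex SHA-256 digest of a file's raw bytes, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def policy_matches(path: Path, expected_sha256: str) -> bool:
    """Compare a policy file against a report's routing_policy_sha256.

    Assumption: the report stores a hex digest of the unmodified file bytes.
    """
    return sha256_of_file(path) == expected_sha256.strip().lower()
```

For example, `policy_matches(Path("best.routing_policy.md"), report["routing_policy_sha256"])` would return `False` if the file on disk had been edited since the run, provided the assumption about the digest holds.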
compare-12b-e4b-smoke-20260612T1856/e4b-smoke-best-12b.json
ADDED
|
@@ -0,0 +1,181 @@
+{
+  "candidate": "e4b-smoke-best",
+  "concurrency": 2,
+  "harness": "localpager-agent",
+  "mean_score": 0.3607142857142857,
+  "routing_policy_path": "prompt-optimizer/out/gepa-e4b-smoke-20260612T184748Z/best.routing_policy.md",
+  "routing_policy_sha256": "7858c420430d3620b96ea66fc6c84d6f54d99dd0b2865c7f986751d317e7bd6a",
+  "row_reports": [
+    {
+      "error": null,
+      "false_negatives": [],
+      "false_positives": [],
+      "gold_topics": [
+        "acp",
+        "gateway",
+        "agent_runtime"
+      ],
+      "id": "openclaw-openclaw-48940",
+      "loss": 0.0,
+      "over_label_count": 0,
+      "predicted_topics": [
+        "acp",
+        "gateway",
+        "agent_runtime"
+      ],
+      "score": 1.0,
+      "target": "https://github.com/openclaw/openclaw/pull/48940",
+      "title": "ACP: add gateway-owned node-backed runtime",
+      "true_positives": [
+        "acp",
+        "gateway",
+        "agent_runtime"
+      ]
+    },
+    {
+      "error": null,
+      "false_negatives": [
+        "config",
+        "security"
+      ],
+      "false_positives": [
+        "local_model_providers"
+      ],
+      "gold_topics": [
+        "mcp_tooling",
+        "config",
+        "security"
+      ],
+      "id": "openclaw-openclaw-80783",
+      "loss": 4.0,
+      "over_label_count": 0,
+      "predicted_topics": [
+        "mcp_tooling",
+        "local_model_providers"
+      ],
+      "score": 0.2,
+      "target": "https://github.com/openclaw/openclaw/pull/80783",
+      "title": "Policy: add model, network, and MCP conformance checks",
+      "true_positives": [
+        "mcp_tooling"
+      ]
+    },
+    {
+      "error": null,
+      "false_negatives": [
+        "browser_automation",
+        "cron_automation"
+      ],
+      "false_positives": [
+        "ui_tui",
+        "gateway"
+      ],
+      "gold_topics": [
+        "exec_tools",
+        "browser_automation",
+        "cron_automation"
+      ],
+      "id": "openclaw-openclaw-42027",
+      "loss": 6.0,
+      "over_label_count": 0,
+      "predicted_topics": [
+        "exec_tools",
+        "ui_tui",
+        "gateway"
+      ],
+      "score": 0.14285714285714285,
+      "target": "https://github.com/openclaw/openclaw/pull/42027",
+      "title": "fix: resolve exec PATH fallback, layered browser diagnostics, and cron force-run deadlock",
+      "true_positives": [
+        "exec_tools"
+      ]
+    },
+    {
+      "error": null,
+      "false_negatives": [],
+      "false_positives": [
+        "gateway"
+      ],
+      "gold_topics": [
+        "codex",
+        "chat_integrations"
+      ],
+      "id": "openclaw-openclaw-77748",
+      "loss": 2.5,
+      "over_label_count": 1,
+      "predicted_topics": [
+        "codex",
+        "gateway",
+        "chat_integrations"
+      ],
+      "score": 0.2857142857142857,
+      "target": "https://github.com/openclaw/openclaw/pull/77748",
+      "title": "fix: Codex startup plugins + WhatsApp history & Docker Codex OAuth",
+      "true_positives": [
+        "codex",
+        "chat_integrations"
+      ]
+    },
+    {
+      "error": null,
+      "false_negatives": [],
+      "false_positives": [
+        "telemetry_usage"
+      ],
+      "gold_topics": [
+        "model_serving"
+      ],
+      "id": "openclaw-openclaw-79897",
+      "loss": 2.5,
+      "over_label_count": 1,
+      "predicted_topics": [
+        "model_serving",
+        "telemetry_usage"
+      ],
+      "score": 0.2857142857142857,
+      "target": "https://github.com/openclaw/openclaw/issues/79897",
+      "title": "OpenAI-compatible streaming with llama.cpp saves zero usage (stream closed before final usage chunk)",
+      "true_positives": [
+        "model_serving"
+      ]
+    },
+    {
+      "error": null,
+      "false_negatives": [
+        "acpx"
+      ],
+      "false_positives": [
+        "sessions"
+      ],
+      "gold_topics": [
+        "acp",
+        "approvals",
+        "acpx"
+      ],
+      "id": "openclaw-openclaw-40332",
+      "loss": 3.0,
+      "over_label_count": 0,
+      "predicted_topics": [
+        "acp",
+        "sessions",
+        "approvals"
+      ],
+      "score": 0.25,
+      "target": "https://github.com/openclaw/openclaw/issues/40332",
+      "title": "[Feature]: Per-binding and per-agent permissionMode for ACP sessions",
+      "true_positives": [
+        "acp",
+        "approvals"
+      ]
+    }
+  ],
+  "rows": 6,
+  "scores": [
+    1.0,
+    0.2,
+    0.14285714285714285,
+    0.2857142857142857,
+    0.2857142857142857,
+    0.25
+  ]
+}
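The row reports in this file are consistent with a simple scoring rule: false negatives are gold topics missing from the prediction, false positives are predicted topics outside the gold set, `over_label_count` is how many more topics were predicted than gold, and `score = 1 / (1 + loss)`. The sketch below reproduces those fields in Python. The loss weights (1 per false negative, 2 per false positive, 0.5 per over-label) are inferred from the six rows shown here and are an assumption, not the harness's confirmed formula; `score_row` is a hypothetical helper name.

```python
def score_row(gold: list[str], predicted: list[str]) -> dict:
    """Recompute the per-row fields seen in the report for one example."""
    gold_set = set(gold)
    predicted_set = set(predicted)

    # The report lists true/false positives in predicted order
    # and false negatives in gold order.
    true_positives = [t for t in predicted if t in gold_set]
    false_positives = [t for t in predicted if t not in gold_set]
    false_negatives = [t for t in gold if t not in predicted_set]
    over_label_count = max(0, len(predicted) - len(gold))

    # ASSUMED weights, fitted to the rows in this diff only.
    loss = (
        1.0 * len(false_negatives)
        + 2.0 * len(false_positives)
        + 0.5 * over_label_count
    )
    return {
        "true_positives": true_positives,
        "false_positives": false_positives,
        "false_negatives": false_negatives,
        "over_label_count": over_label_count,
        "loss": loss,
        "score": 1.0 / (1.0 + loss),
    }


row = score_row(
    gold=["codex", "chat_integrations"],
    predicted=["codex", "gateway", "chat_integrations"],
)
print(row["loss"], row["score"])
```

Under these assumed weights, the example row (openclaw-openclaw-77748) gets one false positive and one over-label, which matches the reported `loss` of 2.5.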
compare-12b-e4b-smoke-20260612T1856/seed-12b.json
ADDED
|
@@ -0,0 +1,174 @@
+{
+  "candidate": "seed",
+  "concurrency": 2,
+  "harness": "localpager-agent",
+  "mean_score": 0.5214285714285715,
+  "routing_policy_sha256": "3d84a551239a9c26865a44758f0602548c1361482eef087c194d878a2d92ad37",
+  "row_reports": [
+    {
+      "error": null,
+      "false_negatives": [],
+      "false_positives": [],
+      "gold_topics": [
+        "acp",
+        "gateway",
+        "agent_runtime"
+      ],
+      "id": "openclaw-openclaw-48940",
+      "loss": 0.0,
+      "over_label_count": 0,
+      "predicted_topics": [
+        "acp",
+        "gateway",
+        "agent_runtime"
+      ],
+      "score": 1.0,
+      "target": "https://github.com/openclaw/openclaw/pull/48940",
+      "title": "ACP: add gateway-owned node-backed runtime",
+      "true_positives": [
+        "acp",
+        "gateway",
+        "agent_runtime"
+      ]
+    },
+    {
+      "error": null,
+      "false_negatives": [
+        "config",
+        "security"
+      ],
+      "false_positives": [
+        "local_model_providers"
+      ],
+      "gold_topics": [
+        "mcp_tooling",
+        "config",
+        "security"
+      ],
+      "id": "openclaw-openclaw-80783",
+      "loss": 4.0,
+      "over_label_count": 0,
+      "predicted_topics": [
+        "mcp_tooling",
+        "local_model_providers"
+      ],
+      "score": 0.2,
+      "target": "https://github.com/openclaw/openclaw/pull/80783",
+      "title": "Policy: add model, network, and MCP conformance checks",
+      "true_positives": [
+        "mcp_tooling"
+      ]
+    },
+    {
+      "error": null,
+      "false_negatives": [
+        "browser_automation",
+        "cron_automation"
+      ],
+      "false_positives": [
+        "gateway",
+        "ui_tui"
+      ],
+      "gold_topics": [
+        "exec_tools",
+        "browser_automation",
+        "cron_automation"
+      ],
+      "id": "openclaw-openclaw-42027",
+      "loss": 6.0,
+      "over_label_count": 0,
+      "predicted_topics": [
+        "exec_tools",
+        "gateway",
+        "ui_tui"
+      ],
+      "score": 0.14285714285714285,
+      "target": "https://github.com/openclaw/openclaw/pull/42027",
+      "title": "fix: resolve exec PATH fallback, layered browser diagnostics, and cron force-run deadlock",
+      "true_positives": [
+        "exec_tools"
+      ]
+    },
+    {
+      "error": null,
+      "false_negatives": [],
+      "false_positives": [],
+      "gold_topics": [
+        "codex",
+        "chat_integrations"
+      ],
+      "id": "openclaw-openclaw-77748",
+      "loss": 0.0,
+      "over_label_count": 0,
+      "predicted_topics": [
+        "chat_integrations",
+        "codex"
+      ],
+      "score": 1.0,
+      "target": "https://github.com/openclaw/openclaw/pull/77748",
+      "title": "fix: Codex startup plugins + WhatsApp history & Docker Codex OAuth",
+      "true_positives": [
+        "chat_integrations",
+        "codex"
+      ]
+    },
+    {
+      "error": null,
+      "false_negatives": [],
+      "false_positives": [
+        "telemetry_usage"
+      ],
+      "gold_topics": [
+        "model_serving"
+      ],
+      "id": "openclaw-openclaw-79897",
+      "loss": 2.5,
+      "over_label_count": 1,
+      "predicted_topics": [
+        "telemetry_usage",
+        "model_serving"
+      ],
+      "score": 0.2857142857142857,
+      "target": "https://github.com/openclaw/openclaw/issues/79897",
+      "title": "OpenAI-compatible streaming with llama.cpp saves zero usage (stream closed before final usage chunk)",
+      "true_positives": [
+        "model_serving"
+      ]
+    },
+    {
+      "error": null,
+      "false_negatives": [
+        "acpx"
+      ],
+      "false_positives": [],
+      "gold_topics": [
+        "acp",
+        "approvals",
+        "acpx"
+      ],
+      "id": "openclaw-openclaw-40332",
+      "loss": 1.0,
+      "over_label_count": 0,
+      "predicted_topics": [
+        "approvals",
+        "acp"
+      ],
+      "score": 0.5,
+      "target": "https://github.com/openclaw/openclaw/issues/40332",
+      "title": "[Feature]: Per-binding and per-agent permissionMode for ACP sessions",
+      "true_positives": [
+        "approvals",
+        "acp"
+      ]
+    }
+  ],
+  "rows": 6,
+  "scores": [
+    1.0,
+    0.2,
+    0.14285714285714285,
+    1.0,
+    0.2857142857142857,
+    0.5
+  ]
+}
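The two reports in `compare-12b-e4b-smoke-20260612T1856` cover the same six row ids, so they can be compared row by row. A minimal Python sketch of that comparison, using the `id` and `score` values copied from `e4b-smoke-best-12b.json` and `seed-12b.json`; `score_deltas` is a hypothetical helper name, not part of the Localpager tooling.

```python
# (id -> score) pairs copied from the two T1856 reports in this diff.
CANDIDATE = {
    "openclaw-openclaw-48940": 1.0,
    "openclaw-openclaw-80783": 0.2,
    "openclaw-openclaw-42027": 0.14285714285714285,
    "openclaw-openclaw-77748": 0.2857142857142857,
    "openclaw-openclaw-79897": 0.2857142857142857,
    "openclaw-openclaw-40332": 0.25,
}
SEED = {
    "openclaw-openclaw-48940": 1.0,
    "openclaw-openclaw-80783": 0.2,
    "openclaw-openclaw-42027": 0.14285714285714285,
    "openclaw-openclaw-77748": 1.0,
    "openclaw-openclaw-79897": 0.2857142857142857,
    "openclaw-openclaw-40332": 0.5,
}


def score_deltas(candidate: dict[str, float], baseline: dict[str, float]) -> dict[str, float]:
    """Candidate score minus baseline score for every row id present in both."""
    return {rid: candidate[rid] - baseline[rid] for rid in candidate if rid in baseline}


deltas = score_deltas(CANDIDATE, SEED)
regressions = sorted(rid for rid, delta in deltas.items() if delta < 0)
mean_delta = sum(deltas.values()) / len(deltas)
print(regressions)
print(round(mean_delta, 4))
```

On these values the candidate scores lower than the seed on two rows (77748 and 40332) and ties on the other four, which accounts for the gap between the two `mean_score` fields (0.3607 versus 0.5214).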
compare-12b-gepa-six-holdout-20260612T1914/gepa-12b-six-best-12b-offset6-limit6.json
ADDED
|
@@ -0,0 +1,172 @@
+{
+  "candidate": "gepa-12b-six-best",
+  "concurrency": 2,
+  "harness": "localpager-agent",
+  "mean_score": 0.4916666666666667,
+  "offset": 6,
+  "routing_policy_path": "prompt-optimizer/out/gepa-12b-six-20260612T190217Z/best.routing_policy.md",
+  "routing_policy_sha256": "f4b161bb9bbaf366f1d4f1841243d73544bbd3c553ca6be5eb2818e757007187",
+  "row_reports": [
+    {
+      "error": null,
+      "false_negatives": [
+        "sessions"
+      ],
+      "false_positives": [
+        "hooks"
+      ],
+      "gold_topics": [
+        "gateway",
+        "sessions"
+      ],
+      "id": "openclaw-openclaw-63007",
+      "loss": 3.0,
+      "over_label_count": 0,
+      "predicted_topics": [
+        "gateway",
+        "hooks"
+      ],
+      "score": 0.25,
+      "target": "https://github.com/openclaw/openclaw/pull/63007",
+      "title": "Pass outbound session identity into message_sending and surface guarded gateway send denial",
+      "true_positives": [
+        "gateway"
+      ]
+    },
+    {
+      "error": null,
+      "false_negatives": [
+        "reliability"
+      ],
+      "false_positives": [],
+      "gold_topics": [
+        "memory",
+        "reliability"
+      ],
+      "id": "openclaw-openclaw-80255",
+      "loss": 1.0,
+      "over_label_count": 0,
+      "predicted_topics": [
+        "memory"
+      ],
+      "score": 0.5,
+      "target": "https://github.com/openclaw/openclaw/pull/80255",
+      "title": "fix #79026: active-memory recall subagent can deadlock on the main lane inside before_prompt_build",
+      "true_positives": [
+        "memory"
+      ]
+    },
+    {
+      "error": null,
+      "false_negatives": [
+        "api_surface"
+      ],
+      "false_positives": [],
+      "gold_topics": [
+        "gateway",
+        "api_surface",
+        "ui_tui"
+      ],
+      "id": "openclaw-openclaw-84670",
+      "loss": 1.0,
+      "over_label_count": 0,
+      "predicted_topics": [
+        "gateway",
+        "ui_tui"
+      ],
+      "score": 0.5,
+      "target": "https://github.com/openclaw/openclaw/pull/84670",
+      "title": "[codex] fix webchat full-message reader for truncated history",
+      "true_positives": [
+        "gateway",
+        "ui_tui"
+      ]
+    },
+    {
+      "error": null,
+      "false_negatives": [],
+      "false_positives": [],
+      "gold_topics": [
+        "queueing",
+        "docs"
+      ],
+      "id": "openclaw-openclaw-46552",
+      "loss": 0.0,
+      "over_label_count": 0,
+      "predicted_topics": [
+        "docs",
+        "queueing"
+      ],
+      "score": 1.0,
+      "target": "https://github.com/openclaw/openclaw/pull/46552",
+      "title": "docs(queue): clarify steer behavior with partial streaming and tool boundaries",
+      "true_positives": [
+        "docs",
+        "queueing"
+      ]
+    },
+    {
+      "error": null,
+      "false_negatives": [
+        "sandboxing",
+        "approvals"
+      ],
+      "false_positives": [
+        "security"
+      ],
+      "gold_topics": [
+        "exec_tools",
+        "sandboxing",
+        "approvals"
+      ],
+      "id": "openclaw-openclaw-62428",
+      "loss": 4.0,
+      "over_label_count": 0,
+      "predicted_topics": [
+        "exec_tools",
+        "security"
+      ],
+      "score": 0.2,
+      "target": "https://github.com/openclaw/openclaw/pull/62428",
+      "title": "test(exec): land exec v2 contract follow-through",
+      "true_positives": [
+        "exec_tools"
+      ]
+    },
+    {
+      "error": null,
+      "false_negatives": [
+        "codex"
+      ],
+      "false_positives": [],
+      "gold_topics": [
+        "acpx",
+        "codex",
+        "skills_plugins"
+      ],
+      "id": "openclaw-openclaw-82507",
+      "loss": 1.0,
+      "over_label_count": 0,
+      "predicted_topics": [
+        "acpx",
+        "skills_plugins"
+      ],
+      "score": 0.5,
+      "target": "https://github.com/openclaw/openclaw/issues/82507",
+      "title": "[Feature]: ACPX Codex sandbox should inherit user-installed plugins (e.g. Superpowers)",
+      "true_positives": [
+        "acpx",
+        "skills_plugins"
+      ]
+    }
+  ],
+  "rows": 6,
+  "scores": [
+    0.25,
+    0.5,
+    0.5,
+    1.0,
+    0.2,
+    0.5
+  ]
+}
compare-12b-gepa-six-holdout-20260612T1914/seed-12b-offset6-limit6.json
ADDED
|
@@ -0,0 +1,170 @@
| 1 |
+
{
|
| 2 |
+
"candidate": "seed",
|
| 3 |
+
"concurrency": 2,
|
| 4 |
+
"harness": "localpager-agent",
|
| 5 |
+
"mean_score": 0.48333333333333334,
|
| 6 |
+
"offset": 6,
|
| 7 |
+
"routing_policy_sha256": "3d84a551239a9c26865a44758f0602548c1361482eef087c194d878a2d92ad37",
|
| 8 |
+
"row_reports": [
|
| 9 |
+
{
|
| 10 |
+
"error": null,
|
| 11 |
+
"false_negatives": [],
|
| 12 |
+
"false_positives": [],
|
| 13 |
+
"gold_topics": [
|
| 14 |
+
"gateway",
|
| 15 |
+
"sessions"
|
| 16 |
+
],
|
| 17 |
+
"id": "openclaw-openclaw-63007",
|
| 18 |
+
"loss": 0.0,
|
| 19 |
+
"over_label_count": 0,
|
| 20 |
+
"predicted_topics": [
|
| 21 |
+
"gateway",
|
| 22 |
+
"sessions"
|
| 23 |
+
],
|
| 24 |
+
"score": 1.0,
|
| 25 |
+
"target": "https://github.com/openclaw/openclaw/pull/63007",
|
| 26 |
+
"title": "Pass outbound session identity into message_sending and surface guarded gateway send denial",
+      "true_positives": [
+        "gateway",
+        "sessions"
+      ]
+    },
+    {
+      "error": null,
+      "false_negatives": [
+        "reliability"
+      ],
+      "false_positives": [],
+      "gold_topics": [
+        "memory",
+        "reliability"
+      ],
+      "id": "openclaw-openclaw-80255",
+      "loss": 1.0,
+      "over_label_count": 0,
+      "predicted_topics": [
+        "memory"
+      ],
+      "score": 0.5,
+      "target": "https://github.com/openclaw/openclaw/pull/80255",
+      "title": "fix #79026: active-memory recall subagent can deadlock on the main lane inside before_prompt_build",
+      "true_positives": [
+        "memory"
+      ]
+    },
+    {
+      "error": null,
+      "false_negatives": [
+        "api_surface"
+      ],
+      "false_positives": [],
+      "gold_topics": [
+        "gateway",
+        "api_surface",
+        "ui_tui"
+      ],
+      "id": "openclaw-openclaw-84670",
+      "loss": 1.0,
+      "over_label_count": 0,
+      "predicted_topics": [
+        "gateway",
+        "ui_tui"
+      ],
+      "score": 0.5,
+      "target": "https://github.com/openclaw/openclaw/pull/84670",
+      "title": "[codex] fix webchat full-message reader for truncated history",
+      "true_positives": [
+        "gateway",
+        "ui_tui"
+      ]
+    },
+    {
+      "error": null,
+      "false_negatives": [
+        "queueing"
+      ],
+      "false_positives": [],
+      "gold_topics": [
+        "queueing",
+        "docs"
+      ],
+      "id": "openclaw-openclaw-46552",
+      "loss": 1.0,
+      "over_label_count": 0,
+      "predicted_topics": [
+        "docs"
+      ],
+      "score": 0.5,
+      "target": "https://github.com/openclaw/openclaw/pull/46552",
+      "title": "docs(queue): clarify steer behavior with partial streaming and tool boundaries",
+      "true_positives": [
+        "docs"
+      ]
+    },
+    {
+      "error": null,
+      "false_negatives": [
+        "sandboxing",
+        "approvals"
+      ],
+      "false_positives": [
+        "security"
+      ],
+      "gold_topics": [
+        "exec_tools",
+        "sandboxing",
+        "approvals"
+      ],
+      "id": "openclaw-openclaw-62428",
+      "loss": 4.0,
+      "over_label_count": 0,
+      "predicted_topics": [
+        "exec_tools",
+        "security"
+      ],
+      "score": 0.2,
+      "target": "https://github.com/openclaw/openclaw/pull/62428",
+      "title": "test(exec): land exec v2 contract follow-through",
+      "true_positives": [
+        "exec_tools"
+      ]
+    },
+    {
+      "error": null,
+      "false_negatives": [
+        "codex",
+        "skills_plugins"
+      ],
+      "false_positives": [
+        "security"
+      ],
+      "gold_topics": [
+        "acpx",
+        "codex",
+        "skills_plugins"
+      ],
+      "id": "openclaw-openclaw-82507",
+      "loss": 4.0,
+      "over_label_count": 0,
+      "predicted_topics": [
+        "acpx",
+        "security"
+      ],
+      "score": 0.2,
+      "target": "https://github.com/openclaw/openclaw/issues/82507",
+      "title": "[Feature]: ACPX Codex sandbox should inherit user-installed plugins (e.g. Superpowers)",
+      "true_positives": [
+        "acpx"
+      ]
+    }
+  ],
+  "rows": 6,
+  "scores": [
+    1.0,
+    0.5,
+    0.5,
+    0.5,
+    0.2,
+    0.2
+  ]
+}
|
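The `loss` and `score` fields in the row reports above are consistent with a simple rule, sketched here as an inference from the numbers rather than as the localpager source: each false negative adds 1 to the loss, each false positive adds 2, and the score is `1 / (1 + loss)`. The function name and both weights are assumptions. Any over-label penalty is left out, because every row shown has `over_label_count` equal to 0 and so does not reveal its weight.

```python
from typing import Dict, Iterable, List, Union

# Assumed weights, inferred from the rows shown (not taken from the scorer's code).
FN_WEIGHT = 1.0  # cost of a gold topic that was not predicted
FP_WEIGHT = 2.0  # cost of a predicted topic that is not in the gold set


def score_row(gold: Iterable[str], predicted: Iterable[str]) -> Dict[str, Union[float, List[str]]]:
    """Recompute the per-row report fields from gold and predicted topic ids."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = sorted(gold_set & pred_set)
    false_negatives = sorted(gold_set - pred_set)
    false_positives = sorted(pred_set - gold_set)
    loss = FN_WEIGHT * len(false_negatives) + FP_WEIGHT * len(false_positives)
    return {
        "true_positives": true_positives,
        "false_negatives": false_negatives,
        "false_positives": false_positives,
        "loss": loss,
        "score": 1.0 / (1.0 + loss),
    }


if __name__ == "__main__":
    # Row openclaw-openclaw-62428: two misses and one spurious topic.
    report = score_row(
        gold=["exec_tools", "sandboxing", "approvals"],
        predicted=["exec_tools", "security"],
    )
    print(report["loss"], report["score"])
```

Under these assumed weights a row with one miss and one spurious topic gets loss 3 and score 0.25, while a row with only one miss gets loss 1 and score 0.5, so a prediction that swaps a topic costs more than one that omits it.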
compare-12b-gepa-twelve-holdout-20260612T1944/gepa-12b-twelve-best-12b-offset12-limit6.json
ADDED
|
@@ -0,0 +1,164 @@
+{
+  "candidate": "gepa-12b-twelve-best",
+  "concurrency": 2,
+  "harness": "localpager-agent",
+  "mean_score": 0.5416666666666666,
+  "offset": 12,
+  "routing_policy_path": "prompt-optimizer/out/gepa-12b-twelve-from-six-iter-20260612T192815Z/best.routing_policy.md",
+  "routing_policy_sha256": "6ab4227828618436d7f81662b5cc4993fb5b30557e3e56616801dbec6d2da34a",
+  "row_reports": [
+    {
+      "error": null,
+      "false_negatives": [],
+      "false_positives": [],
+      "gold_topics": [
+        "self_hosted_inference",
+        "memory"
+      ],
+      "id": "openclaw-openclaw-80479",
+      "loss": 0.0,
+      "over_label_count": 0,
+      "predicted_topics": [
+        "memory",
+        "self_hosted_inference"
+      ],
+      "score": 1.0,
+      "target": "https://github.com/openclaw/openclaw/pull/80479",
+      "title": "feat(memory/embeddings): add openai-compatible provider for self-hosted servers (llama.cpp, Ollama, vLLM, TGI, LocalAI)",
+      "true_positives": [
+        "memory",
+        "self_hosted_inference"
+      ]
+    },
+    {
+      "error": null,
+      "false_negatives": [
+        "local_model_providers"
+      ],
+      "false_positives": [
+        "model_serving"
+      ],
+      "gold_topics": [
+        "local_model_providers",
+        "reliability"
+      ],
+      "id": "openclaw-openclaw-90146",
+      "loss": 3.0,
+      "over_label_count": 0,
+      "predicted_topics": [
+        "model_serving",
+        "reliability"
+      ],
+      "score": 0.25,
+      "target": "https://github.com/openclaw/openclaw/issues/90146",
+      "title": "google-vertex: Missing gemini-3.1-flash-lite in provider catalog causes silent failure instead of error",
+      "true_positives": [
+        "reliability"
+      ]
+    },
+    {
+      "error": null,
+      "false_negatives": [],
+      "false_positives": [],
+      "gold_topics": [
+        "docs"
+      ],
+      "id": "openclaw-openclaw-51849",
+      "loss": 0.0,
+      "over_label_count": 0,
+      "predicted_topics": [
+        "docs"
+      ],
+      "score": 1.0,
+      "target": "https://github.com/openclaw/openclaw/pull/51849",
+      "title": "Docs: add freeCodeCamp OpenClaw full tutorial to showcase",
+      "true_positives": [
+        "docs"
+      ]
+    },
+    {
+      "error": null,
+      "false_negatives": [
+        "local_model_providers"
+      ],
+      "false_positives": [
+        "model_serving"
+      ],
+      "gold_topics": [
+        "open_weight_models",
+        "local_model_providers"
+      ],
+      "id": "openclaw-openclaw-68725",
+      "loss": 3.0,
+      "over_label_count": 0,
+      "predicted_topics": [
+        "open_weight_models",
+        "model_serving"
+      ],
+      "score": 0.25,
+      "target": "https://github.com/openclaw/openclaw/pull/68725",
+      "title": "feat(amazon-bedrock-mantle): add known context windows for open-weight Mantle models",
+      "true_positives": [
+        "open_weight_models"
+      ]
+    },
+    {
+      "error": null,
+      "false_negatives": [
+        "chat_integrations"
+      ],
+      "false_positives": [
+        "cron_automation"
+      ],
+      "gold_topics": [
+        "notifications",
+        "chat_integrations"
+      ],
+      "id": "openclaw-openclaw-84297",
+      "loss": 3.0,
+      "over_label_count": 0,
+      "predicted_topics": [
+        "cron_automation",
+        "notifications"
+      ],
+      "score": 0.25,
+      "target": "https://github.com/openclaw/openclaw/issues/84297",
+      "title": "[Bug]: Per-agent identity overlay dropped on cron --announce and heartbeat target-channel Slack pushes (announce path; reply path was fixed in #38235)",
+      "true_positives": [
+        "notifications"
+      ]
+    },
+    {
+      "error": null,
+      "false_negatives": [
+        "local_models"
+      ],
+      "false_positives": [],
+      "gold_topics": [
+        "model_serving",
+        "local_models"
+      ],
+      "id": "openclaw-openclaw-77827",
+      "loss": 1.0,
+      "over_label_count": 0,
+      "predicted_topics": [
+        "model_serving"
+      ],
+      "score": 0.5,
+      "target": "https://github.com/openclaw/openclaw/pull/77827",
+      "title": "fix: LM Studio thinking blocks invisible with Responses API",
+      "true_positives": [
+        "model_serving"
+      ]
+    }
+  ],
+  "rows": 6,
+  "scores": [
+    1.0,
+    0.25,
+    1.0,
+    0.25,
+    0.25,
+    0.5
+  ]
+}
|
compare-12b-gepa-twelve-holdout-20260612T1944/seed-12b-offset12-limit6.json
ADDED
|
@@ -0,0 +1,166 @@
+{
+  "candidate": "seed",
+  "concurrency": 2,
+  "harness": "localpager-agent",
+  "mean_score": 0.5,
+  "offset": 12,
+  "routing_policy_sha256": "3d84a551239a9c26865a44758f0602548c1361482eef087c194d878a2d92ad37",
+  "row_reports": [
+    {
+      "error": null,
+      "false_negatives": [
+        "memory"
+      ],
+      "false_positives": [
+        "local_models"
+      ],
+      "gold_topics": [
+        "self_hosted_inference",
+        "memory"
+      ],
+      "id": "openclaw-openclaw-80479",
+      "loss": 3.0,
+      "over_label_count": 0,
+      "predicted_topics": [
+        "local_models",
+        "self_hosted_inference"
+      ],
+      "score": 0.25,
+      "target": "https://github.com/openclaw/openclaw/pull/80479",
+      "title": "feat(memory/embeddings): add openai-compatible provider for self-hosted servers (llama.cpp, Ollama, vLLM, TGI, LocalAI)",
+      "true_positives": [
+        "self_hosted_inference"
+      ]
+    },
+    {
+      "error": null,
+      "false_negatives": [
+        "local_model_providers"
+      ],
+      "false_positives": [
+        "model_serving"
+      ],
+      "gold_topics": [
+        "local_model_providers",
+        "reliability"
+      ],
+      "id": "openclaw-openclaw-90146",
+      "loss": 3.0,
+      "over_label_count": 0,
+      "predicted_topics": [
+        "model_serving",
+        "reliability"
+      ],
+      "score": 0.25,
+      "target": "https://github.com/openclaw/openclaw/issues/90146",
+      "title": "google-vertex: Missing gemini-3.1-flash-lite in provider catalog causes silent failure instead of error",
+      "true_positives": [
+        "reliability"
+      ]
+    },
+    {
+      "error": null,
+      "false_negatives": [],
+      "false_positives": [],
+      "gold_topics": [
+        "docs"
+      ],
+      "id": "openclaw-openclaw-51849",
+      "loss": 0.0,
+      "over_label_count": 0,
+      "predicted_topics": [
+        "docs"
+      ],
+      "score": 1.0,
+      "target": "https://github.com/openclaw/openclaw/pull/51849",
+      "title": "Docs: add freeCodeCamp OpenClaw full tutorial to showcase",
+      "true_positives": [
+        "docs"
+      ]
+    },
+    {
+      "error": null,
+      "false_negatives": [
+        "local_model_providers"
+      ],
+      "false_positives": [
+        "model_serving"
+      ],
+      "gold_topics": [
+        "open_weight_models",
+        "local_model_providers"
+      ],
+      "id": "openclaw-openclaw-68725",
+      "loss": 3.0,
+      "over_label_count": 0,
+      "predicted_topics": [
+        "open_weight_models",
+        "model_serving"
+      ],
+      "score": 0.25,
+      "target": "https://github.com/openclaw/openclaw/pull/68725",
+      "title": "feat(amazon-bedrock-mantle): add known context windows for open-weight Mantle models",
+      "true_positives": [
+        "open_weight_models"
+      ]
+    },
+    {
+      "error": null,
+      "false_negatives": [
+        "notifications"
+      ],
+      "false_positives": [
+        "cron_automation"
+      ],
+      "gold_topics": [
+        "notifications",
+        "chat_integrations"
+      ],
+      "id": "openclaw-openclaw-84297",
+      "loss": 3.0,
+      "over_label_count": 0,
+      "predicted_topics": [
+        "chat_integrations",
+        "cron_automation"
+      ],
+      "score": 0.25,
+      "target": "https://github.com/openclaw/openclaw/issues/84297",
+      "title": "[Bug]: Per-agent identity overlay dropped on cron --announce and heartbeat target-channel Slack pushes (announce path; reply path was fixed in #38235)",
+      "true_positives": [
+        "chat_integrations"
+      ]
+    },
+    {
+      "error": null,
+      "false_negatives": [],
+      "false_positives": [],
+      "gold_topics": [
+        "model_serving",
+        "local_models"
+      ],
+      "id": "openclaw-openclaw-77827",
+      "loss": 0.0,
+      "over_label_count": 0,
+      "predicted_topics": [
+        "local_models",
+        "model_serving"
+      ],
+      "score": 1.0,
+      "target": "https://github.com/openclaw/openclaw/pull/77827",
+      "title": "fix: LM Studio thinking blocks invisible with Responses API",
+      "true_positives": [
+        "local_models",
+        "model_serving"
+      ]
+    }
+  ],
+  "rows": 6,
+  "scores": [
+    0.25,
+    0.25,
+    1.0,
+    0.25,
+    0.25,
+    1.0
+  ]
+}
|
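The two holdout reports above (`gepa-12b-twelve-best` and `seed`, both at offset 12 over the same six rows) can be compared row by row with a few lines of Python. This is only a sketch: the score lists are copied from the `scores` arrays of the two files, and the helper name `compare_runs` is illustrative, not part of the localpager harness.

```python
from statistics import mean
from typing import Dict, List

# Copied from the "scores" arrays of the two holdout reports, in row order.
GEPA_SCORES = [1.0, 0.25, 1.0, 0.25, 0.25, 0.5]
SEED_SCORES = [0.25, 0.25, 1.0, 0.25, 0.25, 1.0]


def compare_runs(candidate: List[float], baseline: List[float]) -> Dict[str, float]:
    """Summarise per-row score changes between two runs over the same rows."""
    if len(candidate) != len(baseline):
        raise ValueError("both runs must cover the same number of rows")
    deltas = [c - b for c, b in zip(candidate, baseline)]
    return {
        "candidate_mean": mean(candidate),
        "baseline_mean": mean(baseline),
        "mean_delta": mean(candidate) - mean(baseline),
        "improved": sum(1 for d in deltas if d > 0),
        "regressed": sum(1 for d in deltas if d < 0),
        "unchanged": sum(1 for d in deltas if d == 0),
    }


if __name__ == "__main__":
    for key, value in compare_runs(GEPA_SCORES, SEED_SCORES).items():
        print(f"{key}: {value}")
```

On these six rows the GEPA candidate improves one row (`openclaw-openclaw-80479`, 0.25 to 1.0), regresses one (`openclaw-openclaw-77827`, 1.0 to 0.5), and ties the other four. The difference in `mean_score` (about 0.54 against 0.5) therefore rests on two rows.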
dashboard-20260613-gepa/artifacts/2026-06-13-gepa-12b-six-best.prompt.md
ADDED
|
@@ -0,0 +1,100 @@
+# OpenClaw Routing Classifier
+
+Classify one OpenClaw GitHub issue or pull request for maintainer notification
+routing, not code search. Return only the final structured JSON required by the
+schema. No prose, markdown, analysis, or extra fields.
+
+Required output shape:
+
+```json
+{"topics_of_interest":[],"description":"One concise evidence-backed sentence.","caveats":[]}
+```
+
+## Inner Monologue
+
+You MUST keep your inner monologue, your thought process, your Chain of Thought restricted to 2 short paragraphs maximum. Do not deliberate topic by topic; weigh only the strongest candidates, then call final_json. It is ABSOLUTELY IMPERATIVE that you DO NOT EXCEED 50 WORDS and reply as soon as possible.
+
+## Repository Reads
+
+A read-only `bash` tool may be available in the OpenClaw repo snapshot. Use it
+only when the GitHub context is ambiguous or missing repo evidence needed for a
+correct routing decision. Prefer short commands such as `pwd`, `ls`, `find`,
+`rg`, `grep`, `sed -n`, `cat`, `head`, `tail`, `wc -l`,
+`git show --name-only`, `git ls-files`, or `git grep`.
+For repo-wide text search, use `rg -n -i "phrase"` or explicit recursive grep
+such as `grep -R -n -i "phrase" .`. For file discovery, use
+`rg --files -g "*.ts"` or `git ls-files src`.
+Do not call `bash` when the provided GitHub context is enough.
+
+## Allowed Topics
+
+```json
+__ALLOWED_TOPICS_JSON__
+```
+
+Topic definitions and cue words:
+
+__TOPIC_DESCRIPTIONS__
+
+You are classifying GitHub issues or pull requests into the smallest complete set of allowed topic ids.
+
+This is a fuzzy multi-label routing task. Your goal is not to mention every related area. Your goal is to choose the minimum topic set that sends the item to the right maintainer bucket without dropping an explicit central second concern.
+
+Process:
+
+1. Read the title first.
+2. Identify the main user-visible problem, feature, or policy change.
+3. Pick one primary topic.
+4. Read only the first clear body summary if needed to disambiguate.
+5. Add a secondary topic only when it is explicitly central and removing it would route the item away from a maintainer who must see it.
+6. Remove topics that come only from symptoms, implementation details, tests, examples, files changed, broad impact, or incidental words.
+7. Return only exact allowed topic ids.
+
+Do not over-label from keywords.
+
+Important domain rules:
+
+- OpenAI-compatible streaming, final usage chunks, stream lifecycle, endpoint compatibility, base URL behavior, vLLM/TGI/LocalAI/llama.cpp serving behavior, and request routing are `model_serving`.
+- Do not add `telemetry_usage` merely because the title mentions usage, tokens, counts, cost, or chunks when those are symptoms of a model-serving protocol bug.
+- Example: “OpenAI-compatible streaming with llama.cpp saves zero usage (stream closed before final usage chunk)” is only `model_serving`. The central issue is the OpenAI-compatible streaming/final usage chunk behavior, not telemetry reporting.
+- Use `telemetry_usage` only when the metric, usage accounting/reporting, cost display, diagnostic count, trace, or status reporting surface is itself the feature or bug.
+
+Policy/config rules:
+
+- Items about policy rules, conformance checks, quality gates, allowed behavior, or configuration-governed enforcement usually include `config` when the policy/checking behavior is central.
+- Do not map the word “model” in “model policy”, “model conformance”, or “model checks” to `model_serving` unless the item is actually about serving endpoints, streaming, endpoint lifecycle, routing, or model-server compatibility.
+- Network policy, network conformance, access restrictions, outbound rules, or boundary checks can be `security` when they concern allowed/blocked network behavior.
+- MCP conformance, MCP policy, MCP tool behavior, or MCP protocol checks route to `mcp_tooling`.
+- Example: “Policy: add model, network, and MCP conformance checks” should be `mcp_tooling`, `config`, and `security`, not `model_serving`.
+
+Cardinality guidance:
+
+- Use 0 topics when no allowed topic is central.
+- Use 1 topic for a single-focus item.
+- Use 2 topics for normal cross-topic items.
+- Use 3 topics only when the title or first clear summary explicitly has three central facets.
+- Use 4+ topics only for explicit multi-system coordination.
+
+Final suppression checks before output:
+
+- If a topic was added only because of a word like “usage”, “model”, “network”, “test”, “policy”, “status”, or “chunk”, verify that the topic is actually the subject, not just context.
+- Prefer the narrower central topic over a broad fallback.
+- Never invent topic ids.
+- Output only the final JSON with the selected topic ids.## Target
+
+`__TARGET__`
+
+## GitHub Context
+
+__GITHUB_CONTEXT__
+
+Use this context as source of truth. If important sections are missing,
+unavailable, selected, or truncated, classify from what is available and mention
+material limits in `caveats`.
+
+
+You MUST keep your inner monologue, your thought process, your Chain of Thought restricted to 2 short paragraphs maximum. Do not deliberate topic by topic; weigh only the strongest candidates, then call final_json. It is ABSOLUTELY IMPERATIVE that you DO NOT EXCEED 50 WORDS and reply as soon as possible.
+
+You MUST keep your inner monologue, your thought process, your Chain of Thought restricted to 2 short paragraphs maximum. Do not deliberate topic by topic; weigh only the strongest candidates, then call final_json. It is ABSOLUTELY IMPERATIVE that you DO NOT EXCEED 50 WORDS and reply as soon as possible.
+
+You MUST keep your inner monologue, your thought process, your Chain of Thought restricted to 2 short paragraphs maximum. Do not deliberate topic by topic; weigh only the strongest candidates, then call final_json. It is ABSOLUTELY IMPERATIVE that you DO NOT EXCEED 50 WORDS and reply as soon as possible.
|
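The prompt file above is a template with four placeholders: `__ALLOWED_TOPICS_JSON__`, `__TOPIC_DESCRIPTIONS__`, `__TARGET__`, and `__GITHUB_CONTEXT__`. How the localpager harness fills them is not shown in this diff, so the following is only a sketch of one reasonable approach. The function name `render_prompt`, its parameters, and the bullet format used for topic descriptions are all assumptions. It substitutes in a single pass, so text inside the GitHub context that happens to look like a placeholder is not expanded again.

```python
import json
import re
from typing import Dict, List

# Matches tokens such as __TARGET__ or __ALLOWED_TOPICS_JSON__.
PLACEHOLDER = re.compile(r"__[A-Z]+(?:_[A-Z]+)*__")


def render_prompt(
    template: str,
    allowed_topics: List[str],
    topic_descriptions: Dict[str, str],
    target: str,
    github_context: str,
) -> str:
    """Fill the four template placeholders; fail loudly on an unknown one."""
    values = {
        "__ALLOWED_TOPICS_JSON__": json.dumps(allowed_topics, indent=2),
        # Hypothetical layout: one markdown bullet per topic id.
        "__TOPIC_DESCRIPTIONS__": "\n".join(
            f"- `{topic}`: {text}" for topic, text in topic_descriptions.items()
        ),
        "__TARGET__": target,
        "__GITHUB_CONTEXT__": github_context,
    }

    def fill(match: "re.Match[str]") -> str:
        key = match.group(0)
        if key not in values:
            raise KeyError(f"unknown placeholder in template: {key}")
        return values[key]

    # re.sub scans the template once, so substituted text is never rescanned.
    return PLACEHOLDER.sub(fill, template)


if __name__ == "__main__":
    demo_template = "Topics:\n__ALLOWED_TOPICS_JSON__\n\n__TOPIC_DESCRIPTIONS__\n\n`__TARGET__`\n\n__GITHUB_CONTEXT__"
    print(
        render_prompt(
            demo_template,
            allowed_topics=["docs", "model_serving"],
            topic_descriptions={"docs": "Documentation changes.", "model_serving": "Serving endpoints."},
            target="https://github.com/openclaw/openclaw/pull/51849",
            github_context="Title: Docs: add freeCodeCamp OpenClaw full tutorial to showcase",
        )
    )
```

Raising on an unknown placeholder is a deliberate choice in this sketch: a template edited by an optimizer could otherwise ship with an unfilled token and silently degrade the prompt.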
dashboard-20260613-gepa/artifacts/2026-06-13-gepa-12b-six-best.routing_policy.md
ADDED
|
@@ -0,0 +1,45 @@
+You are classifying GitHub issues or pull requests into the smallest complete set of allowed topic ids.
+
+This is a fuzzy multi-label routing task. Your goal is not to mention every related area. Your goal is to choose the minimum topic set that sends the item to the right maintainer bucket without dropping an explicit central second concern.
+
+Process:
+
+1. Read the title first.
+2. Identify the main user-visible problem, feature, or policy change.
+3. Pick one primary topic.
+4. Read only the first clear body summary if needed to disambiguate.
+5. Add a secondary topic only when it is explicitly central and removing it would route the item away from a maintainer who must see it.
+6. Remove topics that come only from symptoms, implementation details, tests, examples, files changed, broad impact, or incidental words.
+7. Return only exact allowed topic ids.
+
+Do not over-label from keywords.
+
+Important domain rules:
+
+- OpenAI-compatible streaming, final usage chunks, stream lifecycle, endpoint compatibility, base URL behavior, vLLM/TGI/LocalAI/llama.cpp serving behavior, and request routing are `model_serving`.
+- Do not add `telemetry_usage` merely because the title mentions usage, tokens, counts, cost, or chunks when those are symptoms of a model-serving protocol bug.
+- Example: “OpenAI-compatible streaming with llama.cpp saves zero usage (stream closed before final usage chunk)” is only `model_serving`. The central issue is the OpenAI-compatible streaming/final usage chunk behavior, not telemetry reporting.
+- Use `telemetry_usage` only when the metric, usage accounting/reporting, cost display, diagnostic count, trace, or status reporting surface is itself the feature or bug.
+
+Policy/config rules:
+
+- Items about policy rules, conformance checks, quality gates, allowed behavior, or configuration-governed enforcement usually include `config` when the policy/checking behavior is central.
+- Do not map the word “model” in “model policy”, “model conformance”, or “model checks” to `model_serving` unless the item is actually about serving endpoints, streaming, endpoint lifecycle, routing, or model-server compatibility.
+- Network policy, network conformance, access restrictions, outbound rules, or boundary checks can be `security` when they concern allowed/blocked network behavior.
+- MCP conformance, MCP policy, MCP tool behavior, or MCP protocol checks route to `mcp_tooling`.
+- Example: “Policy: add model, network, and MCP conformance checks” should be `mcp_tooling`, `config`, and `security`, not `model_serving`.
+
+Cardinality guidance:
+
+- Use 0 topics when no allowed topic is central.
+- Use 1 topic for a single-focus item.
+- Use 2 topics for normal cross-topic items.
+- Use 3 topics only when the title or first clear summary explicitly has three central facets.
+- Use 4+ topics only for explicit multi-system coordination.
+
+Final suppression checks before output:
+
+- If a topic was added only because of a word like “usage”, “model”, “network”, “test”, “policy”, “status”, or “chunk”, verify that the topic is actually the subject, not just context.
+- Prefer the narrower central topic over a broad fallback.
+- Never invent topic ids.
+- Output only the final JSON with the selected topic ids.
|
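Two of the rules in the routing policy above are mechanical enough to check after the model responds: "Return only exact allowed topic ids" / "Never invent topic ids", and the cardinality guidance that reserves 4+ topics for explicit multi-system coordination. The sketch below shows one way such a post-check might look. The function name `check_prediction`, the returned field names, and the review threshold are assumptions, and nothing in this diff indicates that the harness performs this check.

```python
from typing import Dict, Iterable, List, Set, Union


def check_prediction(
    topics: Iterable[str],
    allowed: Set[str],
    review_threshold: int = 3,  # assumed: more than 3 distinct topics gets flagged
) -> Dict[str, Union[List[str], bool]]:
    """Flag invented ids, duplicates, and unusually large topic sets."""
    topic_list = list(topics)
    seen: Set[str] = set()
    duplicates: List[str] = []
    for topic in topic_list:
        if topic in seen and topic not in duplicates:
            duplicates.append(topic)
        seen.add(topic)
    unknown = [t for t in topic_list if t not in allowed]
    return {
        "unknown": unknown,        # ids that are not in the allowed set
        "duplicates": duplicates,  # ids listed more than once
        "needs_review": len(seen) > review_threshold,
    }


if __name__ == "__main__":
    allowed_ids = {"config", "docs", "mcp_tooling", "model_serving", "security"}
    # The policy's own example: three central facets, all of them allowed ids.
    print(check_prediction(["mcp_tooling", "config", "security"], allowed_ids))
    # An invented id is reported rather than silently dropped.
    print(check_prediction(["docs", "made_up_topic"], allowed_ids))
```

The check reports problems instead of repairing them, which keeps the scoring inputs unchanged; whether to drop unknown ids or reject the whole response is left to the caller.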
dashboard-20260613-gepa/index.html
ADDED
|
@@ -0,0 +1,23 @@
+<!doctype html>
+<html lang="en">
+<head>
+<meta charset="utf-8">
+<meta name="viewport" content="width=device-width, initial-scale=1">
+<meta http-equiv="refresh" content="0; url=../final-cardinality-report.html">
+<title>Localpager GEPA Dashboard Updated</title>
+<style>
+body{margin:0;background:#f6f8fb;color:#172033;font-family:-apple-system,BlinkMacSystemFont,"Segoe UI",Roboto,Arial,sans-serif;line-height:1.45}
+main{max-width:760px;margin:0 auto;padding:48px 24px}
+section{background:#fff;border:1px solid #d8dee9;border-radius:8px;padding:22px}
+a{color:#0b57d0;text-decoration:none}.button{display:inline-block;border:1px solid #c9dcff;background:#eef4ff;border-radius:7px;padding:8px 11px;margin:4px 8px 4px 0}.muted{color:#667085}code{font-family:ui-monospace,SFMono-Regular,Consolas,monospace;font-size:12px}
+</style>
+</head>
+<body>
+<main><section>
+<h1>Dashboard Updated</h1>
+<p>This earlier GEPA dashboard now points to the final cardinality-safe report.</p>
+<p><a class="button" href="../final-cardinality-report.html">Open final report</a><a class="button" href="legacy-gepa-six-dashboard.html">Open legacy GEPA-six dashboard</a><a class="button" href="../gepa-12b-row30-prop20-continuation-20260614T021448Z/score_report.html">Open prop20 iteration graph</a><a class="button" href="../prompt-diffs/index.html">Open prompt diffs</a></p>
+<p class="muted">Promoted artifact: <code>2026-06-14-gepa-12b-prop20-cardinality-repair</code>.</p>
+</section></main>
+</body>
+</html>
|
dashboard-20260613-gepa/iteration-score-data.json
ADDED
|
@@ -0,0 +1,193 @@
+{
+  "generated_at": "2026-06-13T04:20:49.182286+00:00",
+  "source": "GEPA gepa-result.json, run_log.json, and external 12B validation JSON artifacts.",
+  "notes": [
+    "Scores use the localpager classifier scoring function, including penalties for false positives, false negatives, and over-labeling.",
+    "False positives are treated as worse than false negatives by the scoring function used for these runs.",
+    "The available GEPA runs each completed at most one proposal step, so candidate 0 -> candidate 1 is the observed optimizer trajectory."
+  ],
+  "internal_candidate_scores": [
+    {
+      "run_key": "e4b-smoke",
+      "run": "E4B smoke GEPA",
+      "candidate_index": 0,
+      "score": 0.25,
+      "is_best": false,
+      "discovery_eval_count": 0
+    },
+    {
+      "run_key": "e4b-smoke",
+      "run": "E4B smoke GEPA",
+      "candidate_index": 1,
+      "score": 0.625,
+      "is_best": true,
+      "discovery_eval_count": 4
+    },
+    {
+      "run_key": "12b-six",
+      "run": "12B six-row GEPA (selected)",
+      "candidate_index": 0,
+      "score": 0.2857142857142857,
+      "is_best": false,
+      "discovery_eval_count": 0
+    },
+    {
+      "run_key": "12b-six",
+      "run": "12B six-row GEPA (selected)",
+      "candidate_index": 1,
+      "score": 0.5555555555555556,
+      "is_best": true,
+      "discovery_eval_count": 10
+    },
+    {
+      "run_key": "12b-twelve-init",
+      "run": "12B twelve-row init-only",
+      "candidate_index": 0,
+      "score": 0.4244047619047619,
+      "is_best": true,
+      "discovery_eval_count": 0
+    },
+    {
+      "run_key": "12b-twelve-iter",
+      "run": "12B twelve-row continuation GEPA",
+      "candidate_index": 0,
+      "score": 0.5375,
+      "is_best": false,
+      "discovery_eval_count": 0
+    },
+    {
+      "run_key": "12b-twelve-iter",
+      "run": "12B twelve-row continuation GEPA",
+      "candidate_index": 1,
+      "score": 0.6101190476190476,
+      "is_best": true,
+      "discovery_eval_count": 20
+    }
+  ],
+  "gepa_step_scores": [
+    {
+      "run_key": "e4b-smoke",
+      "run": "E4B smoke GEPA",
+      "short": "E4B smoke",
+      "iteration": 0,
+      "selected_candidate": 0,
+      "new_candidate": 1,
+      "subsample_ids": [
+        0
+      ],
+      "old_scores": [
+        0.14285714285714285
+      ],
+      "new_scores": [
+        1.0
+      ],
+      "old_mean": 0.14285714285714285,
+      "new_mean": 1.0,
+      "delta": 0.8571428571428572
+    },
+    {
+      "run_key": "12b-six",
+      "run": "12B six-row GEPA (selected)",
+      "short": "12B six selected",
+      "iteration": 0,
+      "selected_candidate": 0,
+      "new_candidate": 1,
+      "subsample_ids": [
+        4,
+        1
+      ],
+      "old_scores": [
+        0.2857142857142857,
+        0.25
+      ],
+      "new_scores": [
+        1.0,
+        1.0
+      ],
+      "old_mean": 0.26785714285714285,
+      "new_mean": 1.0,
+      "delta": 0.7321428571428572
+    },
+    {
+      "run_key": "12b-twelve-iter",
+      "run": "12B twelve-row continuation GEPA",
+      "short": "12B twelve continuation",
+      "iteration": 0,
+      "selected_candidate": 0,
+      "new_candidate": 1,
+      "subsample_ids": [
+        1,
+        10,
+        9,
+        5
+      ],
+      "old_scores": [
+        1.0,
+        0.2,
+        1.0,
+        0.5
+      ],
+      "new_scores": [
+        1.0,
+        1.0,
+        0.5,
+        1.0
+      ],
+      "old_mean": 0.675,
+      "new_mean": 0.875,
+      "delta": 0.19999999999999996
+    }
+  ],
+  "external_cumulative_scores": [
+    {
+      "rows": "1-6",
+      "checkpoint": 6,
+      "seed_segment_mean": 0.5214285714285715,
+      "six_segment_mean": 0.5833333333333334,
+      "segment_delta": 0.06190476190476191,
+      "seed_cumulative_mean": 0.5214285714285715,
|
| 149 |
+
"six_cumulative_mean": 0.5833333333333334,
|
| 150 |
+
"cumulative_delta": 0.06190476190476191
|
| 151 |
+
},
|
| 152 |
+
{
|
| 153 |
+
"rows": "7-12",
|
| 154 |
+
"checkpoint": 12,
|
| 155 |
+
"seed_segment_mean": 0.48333333333333334,
|
| 156 |
+
"six_segment_mean": 0.4916666666666667,
|
| 157 |
+
"segment_delta": 0.00833333333333336,
|
| 158 |
+
"seed_cumulative_mean": 0.5023809523809524,
|
| 159 |
+
"six_cumulative_mean": 0.5375,
|
| 160 |
+
"cumulative_delta": 0.035119047619047605
|
| 161 |
+
},
|
| 162 |
+
{
|
| 163 |
+
"rows": "13-18",
|
| 164 |
+
"checkpoint": 18,
|
| 165 |
+
"seed_segment_mean": 0.5,
|
| 166 |
+
"six_segment_mean": 0.5833333333333334,
|
| 167 |
+
"segment_delta": 0.08333333333333337,
|
| 168 |
+
"seed_cumulative_mean": 0.5015873015873016,
|
| 169 |
+
"six_cumulative_mean": 0.5527777777777777,
|
| 170 |
+
"cumulative_delta": 0.05119047619047612
|
| 171 |
+
},
|
| 172 |
+
{
|
| 173 |
+
"rows": "19-30",
|
| 174 |
+
"checkpoint": 30,
|
| 175 |
+
"seed_segment_mean": 0.39464285714285713,
|
| 176 |
+
"six_segment_mean": 0.45515873015873015,
|
| 177 |
+
"segment_delta": 0.06051587301587302,
|
| 178 |
+
"seed_cumulative_mean": 0.45880952380952383,
|
| 179 |
+
"six_cumulative_mean": 0.5137301587301587,
|
| 180 |
+
"cumulative_delta": 0.05492063492063487
|
| 181 |
+
},
|
| 182 |
+
{
|
| 183 |
+
"rows": "31-60",
|
| 184 |
+
"checkpoint": 60,
|
| 185 |
+
"seed_segment_mean": 0.4056349206349206,
|
| 186 |
+
"six_segment_mean": 0.46865079365079365,
|
| 187 |
+
"segment_delta": 0.06301587301587303,
|
| 188 |
+
"seed_cumulative_mean": 0.43222222222222223,
|
| 189 |
+
"six_cumulative_mean": 0.4911904761904762,
|
| 190 |
+
"cumulative_delta": 0.05896825396825395
|
| 191 |
+
}
|
| 192 |
+
]
|
| 193 |
+
}
|
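Reviewer note on `external_cumulative_scores`: the cumulative means above appear to be row-weighted running averages of the segment means, with each segment weighted by the number of rows in its `rows` range. That weighting is an assumption inferred from the numbers, not something the file states. The sketch below recomputes the cumulative columns from the segment values copied out of the JSON; `SEGMENTS` and `cumulative_means` are illustrative names, not part of the repository.

```python
# Segment values copied from external_cumulative_scores above:
# (rows in segment, seed_segment_mean, six_segment_mean).
SEGMENTS = [
    (6, 0.5214285714285715, 0.5833333333333334),     # rows 1-6
    (6, 0.48333333333333334, 0.4916666666666667),    # rows 7-12
    (6, 0.5, 0.5833333333333334),                    # rows 13-18
    (12, 0.39464285714285713, 0.45515873015873015),  # rows 19-30
    (30, 0.4056349206349206, 0.46865079365079365),   # rows 31-60
]


def cumulative_means(segments):
    """Return (checkpoint, seed_cumulative, six_cumulative, delta) per segment.

    Assumes each segment mean is weighted by its row count (hypothetical,
    but consistent with the values recorded in the JSON).
    """
    out = []
    rows_seen = 0
    seed_total = 0.0
    six_total = 0.0
    for n_rows, seed_mean, six_mean in segments:
        rows_seen += n_rows
        seed_total += n_rows * seed_mean
        six_total += n_rows * six_mean
        seed_cum = seed_total / rows_seen
        six_cum = six_total / rows_seen
        out.append((rows_seen, seed_cum, six_cum, six_cum - seed_cum))
    return out


for checkpoint, seed_cum, six_cum, delta in cumulative_means(SEGMENTS):
    print(f"{checkpoint:>2} rows: seed={seed_cum:.4f} six={six_cum:.4f} delta={delta:+.4f}")
```

If the recomputed values drift from the recorded `seed_cumulative_mean` and `six_cumulative_mean` fields in a later report, the weighting assumption (or the segment boundaries) is the first thing to recheck.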
dashboard-20260613-gepa/iteration-score-summary.csv
ADDED
|
@@ -0,0 +1,16 @@
|
| 1 |
+
kind,run_or_checkpoint,x,score_a,score_b,delta,notes
|
| 2 |
+
internal_candidate,E4B smoke GEPA,0,0.25,,,
|
| 3 |
+
internal_candidate,E4B smoke GEPA,1,0.625,,,best
|
| 4 |
+
internal_candidate,12B six-row GEPA (selected),0,0.2857142857142857,,,
|
| 5 |
+
internal_candidate,12B six-row GEPA (selected),1,0.5555555555555556,,,best
|
| 6 |
+
internal_candidate,12B twelve-row init-only,0,0.4244047619047619,,,best
|
| 7 |
+
internal_candidate,12B twelve-row continuation GEPA,0,0.5375,,,
|
| 8 |
+
internal_candidate,12B twelve-row continuation GEPA,1,0.6101190476190476,,,best
|
| 9 |
+
gepa_step_subsample,E4B smoke GEPA,0,0.14285714285714285,1.0,0.8571428571428572,candidate 0 -> 1; subsample [0]
|
| 10 |
+
gepa_step_subsample,12B six-row GEPA (selected),0,0.26785714285714285,1.0,0.7321428571428572,"candidate 0 -> 1; subsample [4, 1]"
|
| 11 |
+
gepa_step_subsample,12B twelve-row continuation GEPA,0,0.675,0.875,0.19999999999999996,"candidate 0 -> 1; subsample [1, 10, 9, 5]"
|
| 12 |
+
external_cumulative,1-6,6,0.5214285714285715,0.5833333333333334,0.06190476190476191,seed vs selected 12B six GEPA
|
| 13 |
+
external_cumulative,7-12,12,0.5023809523809524,0.5375,0.035119047619047605,seed vs selected 12B six GEPA
|
| 14 |
+
external_cumulative,13-18,18,0.5015873015873016,0.5527777777777777,0.05119047619047612,seed vs selected 12B six GEPA
|
| 15 |
+
external_cumulative,19-30,30,0.45880952380952383,0.5137301587301587,0.05492063492063487,seed vs selected 12B six GEPA
|
| 16 |
+
external_cumulative,31-60,60,0.43222222222222223,0.4911904761904762,0.05896825396825395,seed vs selected 12B six GEPA
|
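Reviewer note on `iteration-score-summary.csv`: the file mixes three row kinds in one table, leaves `score_b` and `delta` empty for `internal_candidate` rows, and quotes `notes` values that contain commas. The sketch below is one way to load it with the standard `csv` module and check that `delta` equals `score_b - score_a` wherever both are present. `SAMPLE` holds a few rows copied from the diff above; `load_rows` and `check_deltas` are illustrative helper names, not code from the repository.

```python
import csv
import io

# A few rows copied verbatim from iteration-score-summary.csv.
SAMPLE = '''kind,run_or_checkpoint,x,score_a,score_b,delta,notes
internal_candidate,E4B smoke GEPA,0,0.25,,,
internal_candidate,E4B smoke GEPA,1,0.625,,,best
gepa_step_subsample,12B six-row GEPA (selected),0,0.26785714285714285,1.0,0.7321428571428572,"candidate 0 -> 1; subsample [4, 1]"
external_cumulative,31-60,60,0.43222222222222223,0.4911904761904762,0.05896825396825395,seed vs selected 12B six GEPA
'''


def to_float(cell):
    """Convert a CSV cell to float, mapping empty cells to None."""
    return float(cell) if cell else None


def load_rows(text):
    """Parse the summary CSV text into dicts with numeric score fields."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        rows.append({
            "kind": raw["kind"],
            "run_or_checkpoint": raw["run_or_checkpoint"],
            "x": int(raw["x"]),
            "score_a": to_float(raw["score_a"]),
            "score_b": to_float(raw["score_b"]),
            "delta": to_float(raw["delta"]),
            "notes": raw["notes"],
        })
    return rows


def check_deltas(rows, tol=1e-9):
    """Return the rows whose recorded delta disagrees with score_b - score_a."""
    bad = []
    for row in rows:
        if row["score_b"] is None or row["delta"] is None:
            continue  # internal_candidate rows carry a single score
        if abs((row["score_b"] - row["score_a"]) - row["delta"]) > tol:
            bad.append(row)
    return bad


rows = load_rows(SAMPLE)
print(len(rows), "rows parsed;", len(check_deltas(rows)), "delta mismatches")
```

Using `csv.DictReader` rather than splitting on commas matters here because the `gepa_step_subsample` notes contain commas inside the bracketed subsample list.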
dashboard-20260613-gepa/iterations.html
ADDED
|
@@ -0,0 +1,236 @@
|
| 1 |
+
<!doctype html>
|
| 2 |
+
<html lang="en">
|
| 3 |
+
<head>
|
| 4 |
+
<meta charset="utf-8">
|
| 5 |
+
<meta name="viewport" content="width=device-width,initial-scale=1">
|
| 6 |
+
<title>GEPA Iteration Score Graphs</title>
|
| 7 |
+
<style>
|
| 8 |
+
:root { --bg:#f5f6f8; --panel:#fff; --ink:#1f2733; --muted:#677180; --line:#d8dee8; --blue:#1463d9; --good:#0f7a45; --bad:#b42318; }
|
| 9 |
+
* { box-sizing:border-box; }
|
| 10 |
+
body { margin:0; background:var(--bg); color:var(--ink); font-family:-apple-system,BlinkMacSystemFont,"Segoe UI",Roboto,Arial,sans-serif; line-height:1.45; }
|
| 11 |
+
header { background:#fff; border-bottom:1px solid var(--line); padding:26px 32px 20px; }
|
| 12 |
+
h1 { margin:0 0 8px; font-size:28px; }
|
| 13 |
+
main { max-width:1180px; margin:0 auto; padding:24px 32px 48px; }
|
| 14 |
+
section { background:var(--panel); border:1px solid var(--line); border-radius:8px; padding:20px; margin:0 0 20px; }
|
| 15 |
+
h2 { margin:0 0 12px; font-size:20px; }
|
| 16 |
+
h3 { margin:20px 0 10px; font-size:16px; }
|
| 17 |
+
p { margin:8px 0 12px; }
|
| 18 |
+
a { color:#124fba; }
|
| 19 |
+
nav { display:flex; flex-wrap:wrap; gap:10px; margin-top:16px; }
|
| 20 |
+
nav a, .button { display:inline-block; color:#123c7c; text-decoration:none; background:#eef4ff; border:1px solid #c9dcff; padding:7px 10px; border-radius:6px; }
|
| 21 |
+
.grid { display:grid; grid-template-columns:repeat(auto-fit,minmax(210px,1fr)); gap:12px; }
|
| 22 |
+
.card { border:1px solid var(--line); background:#fbfcfe; border-radius:8px; padding:14px; min-height:96px; }
|
| 23 |
+
.label { color:var(--muted); font-size:12px; text-transform:uppercase; }
|
| 24 |
+
.value { margin-top:8px; font-size:24px; font-weight:700; }
|
| 25 |
+
.sub, .muted { color:var(--muted); }
|
| 26 |
+
.good { color:var(--good); font-weight:650; }
|
| 27 |
+
.bad { color:var(--bad); font-weight:650; }
|
| 28 |
+
.chart-wrap { overflow-x:auto; border:1px solid var(--line); border-radius:8px; background:#fff; }
|
| 29 |
+
.chart { display:block; min-width:920px; width:100%; height:auto; }
|
| 30 |
+
.axis, .legend, .point-label { font-family:-apple-system,BlinkMacSystemFont,"Segoe UI",Roboto,Arial,sans-serif; fill:#5d6675; font-size:12px; }
|
| 31 |
+
.axis.label { fill:#303847; font-size:13px; }
|
| 32 |
+
.point-label { fill:#1f2733; font-size:11px; font-weight:650; }
|
| 33 |
+
table { border-collapse:collapse; width:100%; font-size:13px; }
|
| 34 |
+
th, td { border-bottom:1px solid var(--line); padding:8px 9px; text-align:left; vertical-align:top; }
|
| 35 |
+
th { background:#f0f3f8; }
|
| 36 |
+
.num { font-variant-numeric:tabular-nums; text-align:right; }
|
| 37 |
+
.scroll { overflow:auto; border:1px solid var(--line); border-radius:8px; }
|
| 38 |
+
.heat td, .heat th { text-align:center; }
|
| 39 |
+
code { font-family:"SFMono-Regular",Consolas,"Liberation Mono",monospace; font-size:12px; }
|
| 40 |
+
.note { border-left:4px solid #d26b00; padding:10px 12px; background:#fff8ed; }
|
| 41 |
+
@media (max-width:760px) { header, main { padding-left:16px; padding-right:16px; } h1 { font-size:23px; } }
|
| 42 |
+
</style>
|
| 43 |
+
</head>
|
| 44 |
+
<body>
|
| 45 |
+
<header>
|
| 46 |
+
<h1>GEPA Iteration Score Graphs</h1>
|
| 47 |
+
<div class="muted">Generated 2026-06-13T04:20:49.182286+00:00. Data source: local GEPA artifacts and 12B external validation runs.</div>
|
| 48 |
+
<nav><a href="#internal">Internal Candidate Scores</a><a href="#step">GEPA Step Scores</a><a href="#external">External Checkpoints</a><a href="#heatmap">Per-Row Heatmaps</a><a href="index.html">Back to summary dashboard</a></nav>
|
| 49 |
+
</header>
|
| 50 |
+
<main>
|
| 51 |
+
<section>
|
| 52 |
+
<h2>What This Shows</h2>
|
| 53 |
+
<p>The optimizer artifacts contain one completed proposal step per GEPA run. The visible trajectory is therefore candidate 0, the seed/inherited prompt, to candidate 1, the reflected prompt produced by GEPA.</p>
|
| 54 |
+
<p class="note">These are not label-count reward graphs. The score is the localpager classifier scoring function: it penalizes false positives, false negatives, and over-labeling. Extra random labels cannot improve the score unless they match the gold labels in <code>ds4.jsonl</code>, and false positives are weighted as the more costly error.</p>
|
| 55 |
+
<div class="grid"><div class="card"><div class="label">E4B smoke</div><div class="value good">0.250 -> 0.625</div><div class="sub">best candidate 1; delta 0.375</div></div><div class="card"><div class="label">12B six selected</div><div class="value good">0.286 -> 0.556</div><div class="sub">best candidate 1; delta 0.270</div></div><div class="card"><div class="label">12B twelve init</div><div class="value good">0.424 -> 0.424</div><div class="sub">best candidate 0; delta 0.000</div></div><div class="card"><div class="label">12B twelve continuation</div><div class="value good">0.537 -> 0.610</div><div class="sub">best candidate 1; delta 0.073</div></div></div>
|
| 56 |
+
</section>
|
| 57 |
+
<section id="internal">
|
| 58 |
+
<h2>Internal GEPA Validation Score By Candidate</h2>
|
| 59 |
+
<p>This is the direct score-over-iteration view from each run's <code>gepa-result.json</code>. Candidate 0 is the baseline/inherited prompt; candidate 1 is the GEPA-reflected candidate when a proposal completed.</p>
|
| 60 |
+
<div class="chart-wrap"><svg class="chart" viewBox="0 0 920 360" role="img" aria-label="Internal GEPA validation score by candidate index">
|
| 61 |
+
<title>Internal GEPA validation score by candidate index</title>
|
| 62 |
+
<rect x="0" y="0" width="920" height="360" fill="#ffffff"/>
|
| 63 |
+
<line x1="62" y1="306.00" x2="742" y2="306.00" stroke="#e3e8f0" stroke-width="1"/>
|
| 64 |
+
<text x="52" y="310.00" text-anchor="end" class="axis">0.0</text>
|
| 65 |
+
<line x1="62" y1="250.40" x2="742" y2="250.40" stroke="#e3e8f0" stroke-width="1"/>
|
| 66 |
+
<text x="52" y="254.40" text-anchor="end" class="axis">0.2</text>
|
| 67 |
+
<line x1="62" y1="194.80" x2="742" y2="194.80" stroke="#e3e8f0" stroke-width="1"/>
|
| 68 |
+
<text x="52" y="198.80" text-anchor="end" class="axis">0.4</text>
|
| 69 |
+
<line x1="62" y1="139.20" x2="742" y2="139.20" stroke="#e3e8f0" stroke-width="1"/>
|
| 70 |
+
<text x="52" y="143.20" text-anchor="end" class="axis">0.6</text>
|
| 71 |
+
<line x1="62" y1="83.60" x2="742" y2="83.60" stroke="#e3e8f0" stroke-width="1"/>
|
| 72 |
+
<text x="52" y="87.60" text-anchor="end" class="axis">0.8</text>
|
| 73 |
+
<line x1="62" y1="28.00" x2="742" y2="28.00" stroke="#e3e8f0" stroke-width="1"/>
|
| 74 |
+
<text x="52" y="32.00" text-anchor="end" class="axis">1.0</text>
|
| 75 |
+
<line x1="62.00" y1="28" x2="62.00" y2="306" stroke="#f1f4f8" stroke-width="1"/>
|
| 76 |
+
<text x="62.00" y="331" text-anchor="middle" class="axis">0</text>
|
| 77 |
+
<line x1="742.00" y1="28" x2="742.00" y2="306" stroke="#f1f4f8" stroke-width="1"/>
|
| 78 |
+
<text x="742.00" y="331" text-anchor="middle" class="axis">1</text>
|
| 79 |
+
<line x1="62" y1="306" x2="742" y2="306" stroke="#9aa4b2"/>
|
| 80 |
+
<line x1="62" y1="28" x2="62" y2="306" stroke="#9aa4b2"/>
|
| 81 |
+
<text x="402.00" y="348" text-anchor="middle" class="axis label">candidate index / GEPA proposal step</text>
|
| 82 |
+
<text x="17" y="167.00" transform="rotate(-90 17 167.00)" text-anchor="middle" class="axis label">score</text>
|
| 83 |
+
<path d="M62.00,236.50 L742.00,132.25" fill="none" stroke="#8a5cf6" stroke-width="3" stroke-linecap="round" stroke-linejoin="round"/>
|
| 84 |
+
<circle cx="62.00" cy="236.50" r="5.5" fill="#8a5cf6" stroke="#fff" stroke-width="2"><title>E4B smoke candidate/checkpoint 0: 0.2500</title></circle>
|
| 85 |
+
<text x="62.00" y="226.50" text-anchor="middle" class="point-label">0.250</text>
|
| 86 |
+
<circle cx="742.00" cy="132.25" r="5.5" fill="#8a5cf6" stroke="#fff" stroke-width="2"><title>E4B smoke candidate/checkpoint 1: 0.6250</title></circle>
|
| 87 |
+
<text x="742.00" y="122.25" text-anchor="middle" class="point-label">0.625</text>
|
| 88 |
+
<path d="M62.00,226.57 L742.00,151.56" fill="none" stroke="#1463d9" stroke-width="3" stroke-linecap="round" stroke-linejoin="round"/>
|
| 89 |
+
<circle cx="62.00" cy="226.57" r="5.5" fill="#1463d9" stroke="#fff" stroke-width="2"><title>12B six selected candidate/checkpoint 0: 0.2857</title></circle>
|
| 90 |
+
<text x="62.00" y="216.57" text-anchor="middle" class="point-label">0.286</text>
|
| 91 |
+
<circle cx="742.00" cy="151.56" r="5.5" fill="#1463d9" stroke="#fff" stroke-width="2"><title>12B six selected candidate/checkpoint 1: 0.5556</title></circle>
|
| 92 |
+
<text x="742.00" y="141.56" text-anchor="middle" class="point-label">0.556</text>
|
| 93 |
+
<circle cx="62.00" cy="188.02" r="5.5" fill="#596579" stroke="#fff" stroke-width="2"><title>12B twelve init candidate/checkpoint 0: 0.4244</title></circle>
|
| 94 |
+
<text x="62.00" y="178.02" text-anchor="middle" class="point-label">0.424</text>
|
| 95 |
+
<path d="M62.00,156.58 L742.00,136.39" fill="none" stroke="#d26b00" stroke-width="3" stroke-linecap="round" stroke-linejoin="round"/>
|
| 96 |
+
<circle cx="62.00" cy="156.58" r="5.5" fill="#d26b00" stroke="#fff" stroke-width="2"><title>12B twelve continuation candidate/checkpoint 0: 0.5375</title></circle>
|
| 97 |
+
<text x="62.00" y="146.58" text-anchor="middle" class="point-label">0.537</text>
|
| 98 |
+
<circle cx="742.00" cy="136.39" r="5.5" fill="#d26b00" stroke="#fff" stroke-width="2"><title>12B twelve continuation candidate/checkpoint 1: 0.6101</title></circle>
|
| 99 |
+
<text x="742.00" y="126.39" text-anchor="middle" class="point-label">0.610</text>
|
| 100 |
+
<line x1="764" y1="36" x2="786" y2="36" stroke="#8a5cf6" stroke-width="3"/>
|
| 101 |
+
<circle cx="775" cy="36" r="4" fill="#8a5cf6"/>
|
| 102 |
+
<text x="795" y="40" class="legend">E4B smoke </text>
|
| 103 |
+
<line x1="764" y1="60" x2="786" y2="60" stroke="#1463d9" stroke-width="3"/>
|
| 104 |
+
<circle cx="775" cy="60" r="4" fill="#1463d9"/>
|
| 105 |
+
<text x="795" y="64" class="legend">12B six selected </text>
|
| 106 |
+
<line x1="764" y1="84" x2="786" y2="84" stroke="#596579" stroke-width="3"/>
|
| 107 |
+
<circle cx="775" cy="84" r="4" fill="#596579"/>
|
| 108 |
+
<text x="795" y="88" class="legend">12B twelve init </text>
|
| 109 |
+
<line x1="764" y1="108" x2="786" y2="108" stroke="#d26b00" stroke-width="3"/>
|
| 110 |
+
<circle cx="775" cy="108" r="4" fill="#d26b00"/>
|
| 111 |
+
<text x="795" y="112" class="legend">12B twelve continuation </text>
|
| 112 |
+
</svg></div>
|
| 113 |
+
<h3>Numbers</h3>
|
| 114 |
+
<div class="scroll"><table><thead><tr><th>Run</th><th>Candidate</th><th class="num">Validation score</th><th>Best?</th><th>Discovery eval count</th></tr></thead><tbody><tr><td>E4B smoke GEPA</td><td>0</td><td class="num">0.2500</td><td></td><td>0</td></tr><tr><td>E4B smoke GEPA</td><td>1</td><td class="num">0.6250</td><td>yes</td><td>4</td></tr><tr><td>12B six-row GEPA (selected)</td><td>0</td><td class="num">0.2857</td><td></td><td>0</td></tr><tr><td>12B six-row GEPA (selected)</td><td>1</td><td class="num">0.5556</td><td>yes</td><td>10</td></tr><tr><td>12B twelve-row init-only</td><td>0</td><td class="num">0.4244</td><td>yes</td><td>0</td></tr><tr><td>12B twelve-row continuation GEPA</td><td>0</td><td class="num">0.5375</td><td></td><td>0</td></tr><tr><td>12B twelve-row continuation GEPA</td><td>1</td><td class="num">0.6101</td><td>yes</td><td>20</td></tr></tbody></table></div>
|
| 115 |
+
</section>
|
| 116 |
+
<section id="step">
|
| 117 |
+
<h2>GEPA Subsample Before/After Scores</h2>
|
| 118 |
+
<p>This uses <code>run_log.json</code>. It shows the score on the exact subsample GEPA reflected on, before and after the proposed prompt.</p>
|
| 119 |
+
<div class="chart-wrap"><svg class="chart" viewBox="0 0 920 360" role="img" aria-label="GEPA subsample before/after score by optimizer step">
|
| 120 |
+
<title>GEPA subsample before/after score by optimizer step</title>
|
| 121 |
+
<rect x="0" y="0" width="920" height="360" fill="#ffffff"/>
|
| 122 |
+
<line x1="62" y1="268.00" x2="896" y2="268.00" stroke="#e3e8f0"/>
|
| 123 |
+
<text x="52" y="272.00" text-anchor="end" class="axis">0.0</text>
|
| 124 |
+
<line x1="62" y1="220.00" x2="896" y2="220.00" stroke="#e3e8f0"/>
|
| 125 |
+
<text x="52" y="224.00" text-anchor="end" class="axis">0.2</text>
|
| 126 |
+
<line x1="62" y1="172.00" x2="896" y2="172.00" stroke="#e3e8f0"/>
|
| 127 |
+
<text x="52" y="176.00" text-anchor="end" class="axis">0.4</text>
|
| 128 |
+
<line x1="62" y1="124.00" x2="896" y2="124.00" stroke="#e3e8f0"/>
|
| 129 |
+
<text x="52" y="128.00" text-anchor="end" class="axis">0.6</text>
|
| 130 |
+
<line x1="62" y1="76.00" x2="896" y2="76.00" stroke="#e3e8f0"/>
|
| 131 |
+
<text x="52" y="80.00" text-anchor="end" class="axis">0.8</text>
|
| 132 |
+
<line x1="62" y1="28.00" x2="896" y2="28.00" stroke="#e3e8f0"/>
|
| 133 |
+
<text x="52" y="32.00" text-anchor="end" class="axis">1.0</text>
|
| 134 |
+
<rect x="137.20" y="233.71" width="58.00" height="34.29" rx="4" fill="#7b8798"><title>E4B smoke before: 0.1429</title></rect>
|
| 135 |
+
<text x="166.20" y="225.71" text-anchor="middle" class="point-label">0.143</text>
|
| 136 |
+
<rect x="206.80" y="28.00" width="58.00" height="240.00" rx="4" fill="#148553"><title>E4B smoke after: 1.0000</title></rect>
|
| 137 |
+
<text x="235.80" y="20.00" text-anchor="middle" class="point-label">1.000</text>
|
| 138 |
+
<text x="201.00" y="295" text-anchor="middle" class="axis label">E4B smoke</text>
|
| 139 |
+
<text x="201.00" y="316" text-anchor="middle" class="axis">delta 0.857</text>
|
| 140 |
+
<rect x="415.20" y="203.71" width="58.00" height="64.29" rx="4" fill="#7b8798"><title>12B six selected before: 0.2679</title></rect>
|
| 141 |
+
<text x="444.20" y="195.71" text-anchor="middle" class="point-label">0.268</text>
|
| 142 |
+
<rect x="484.80" y="28.00" width="58.00" height="240.00" rx="4" fill="#148553"><title>12B six selected after: 1.0000</title></rect>
|
| 143 |
+
<text x="513.80" y="20.00" text-anchor="middle" class="point-label">1.000</text>
|
| 144 |
+
<text x="479.00" y="295" text-anchor="middle" class="axis label">12B six selected</text>
|
| 145 |
+
<text x="479.00" y="316" text-anchor="middle" class="axis">delta 0.732</text>
|
| 146 |
+
<rect x="693.20" y="106.00" width="58.00" height="162.00" rx="4" fill="#7b8798"><title>12B twelve continuation before: 0.6750</title></rect>
|
| 147 |
+
<text x="722.20" y="98.00" text-anchor="middle" class="point-label">0.675</text>
|
| 148 |
+
<rect x="762.80" y="58.00" width="58.00" height="210.00" rx="4" fill="#148553"><title>12B twelve continuation after: 0.8750</title></rect>
|
| 149 |
+
<text x="791.80" y="50.00" text-anchor="middle" class="point-label">0.875</text>
|
| 150 |
+
<text x="757.00" y="295" text-anchor="middle" class="axis label">12B twelve continuation</text>
|
| 151 |
+
<text x="757.00" y="316" text-anchor="middle" class="axis">delta 0.200</text>
|
| 152 |
+
<line x1="62" y1="268" x2="896" y2="268" stroke="#9aa4b2"/>
|
| 153 |
+
<line x1="62" y1="28" x2="62" y2="268" stroke="#9aa4b2"/>
|
| 154 |
+
<rect x="715" y="24" width="12" height="12" fill="#7b8798"/><text x="734" y="35" class="legend">selected candidate</text>
|
| 155 |
+
<rect x="715" y="48" width="12" height="12" fill="#148553"/><text x="734" y="59" class="legend">new reflected candidate</text>
|
| 156 |
+
</svg></div>
|
| 157 |
+
<h3>Numbers</h3>
|
| 158 |
+
<div class="scroll"><table><thead><tr><th>Run</th><th>Iteration</th><th>Candidate transition</th><th>Subsample ids</th><th class="num">Before mean</th><th class="num">After mean</th><th class="num">Delta</th></tr></thead><tbody><tr><td>E4B smoke GEPA</td><td>0</td><td>0 -> 1</td><td>[0]</td><td class="num">0.1429</td><td class="num">1.0000</td><td class="num good">0.8571</td></tr><tr><td>12B six-row GEPA (selected)</td><td>0</td><td>0 -> 1</td><td>[4, 1]</td><td class="num">0.2679</td><td class="num">1.0000</td><td class="num good">0.7321</td></tr><tr><td>12B twelve-row continuation GEPA</td><td>0</td><td>0 -> 1</td><td>[1, 10, 9, 5]</td><td class="num">0.6750</td><td class="num">0.8750</td><td class="num good">0.2000</td></tr></tbody></table></div>
|
| 159 |
+
</section>
|
| 160 |
+
<section id="external">
|
| 161 |
+
<h2>External Cumulative Validation Checkpoints</h2>
|
| 162 |
+
<p>This is the scientific sanity check: the selected 12B-six candidate compared against the v9.1 seed as more of Shaun's 60-row set is accumulated. This graph is outside the GEPA training loop.</p>
|
| 163 |
+
<div class="chart-wrap"><svg class="chart" viewBox="0 0 920 360" role="img" aria-label="External cumulative validation score checkpoints">
|
| 164 |
+
<title>External cumulative validation score checkpoints</title>
|
| 165 |
+
<rect x="0" y="0" width="920" height="360" fill="#ffffff"/>
|
| 166 |
+
<line x1="62" y1="306.00" x2="742" y2="306.00" stroke="#e3e8f0" stroke-width="1"/>
|
| 167 |
+
<text x="52" y="310.00" text-anchor="end" class="axis">0.0</text>
|
| 168 |
+
<line x1="62" y1="250.40" x2="742" y2="250.40" stroke="#e3e8f0" stroke-width="1"/>
|
| 169 |
+
<text x="52" y="254.40" text-anchor="end" class="axis">0.2</text>
|
| 170 |
+
<line x1="62" y1="194.80" x2="742" y2="194.80" stroke="#e3e8f0" stroke-width="1"/>
|
| 171 |
+
<text x="52" y="198.80" text-anchor="end" class="axis">0.4</text>
|
| 172 |
+
<line x1="62" y1="139.20" x2="742" y2="139.20" stroke="#e3e8f0" stroke-width="1"/>
|
| 173 |
+
<text x="52" y="143.20" text-anchor="end" class="axis">0.6</text>
|
| 174 |
+
<line x1="62" y1="83.60" x2="742" y2="83.60" stroke="#e3e8f0" stroke-width="1"/>
|
| 175 |
+
<text x="52" y="87.60" text-anchor="end" class="axis">0.8</text>
|
| 176 |
+
<line x1="62" y1="28.00" x2="742" y2="28.00" stroke="#e3e8f0" stroke-width="1"/>
|
| 177 |
+
<text x="52" y="32.00" text-anchor="end" class="axis">1.0</text>
|
| 178 |
+
<line x1="62.00" y1="28" x2="62.00" y2="306" stroke="#f1f4f8" stroke-width="1"/>
|
| 179 |
+
<text x="62.00" y="331" text-anchor="middle" class="axis">6</text>
|
| 180 |
+
<line x1="137.56" y1="28" x2="137.56" y2="306" stroke="#f1f4f8" stroke-width="1"/>
|
| 181 |
+
<text x="137.56" y="331" text-anchor="middle" class="axis">12</text>
|
| 182 |
+
<line x1="213.11" y1="28" x2="213.11" y2="306" stroke="#f1f4f8" stroke-width="1"/>
|
| 183 |
+
<text x="213.11" y="331" text-anchor="middle" class="axis">18</text>
|
| 184 |
+
<line x1="364.22" y1="28" x2="364.22" y2="306" stroke="#f1f4f8" stroke-width="1"/>
|
| 185 |
+
<text x="364.22" y="331" text-anchor="middle" class="axis">30</text>
|
| 186 |
+
<line x1="742.00" y1="28" x2="742.00" y2="306" stroke="#f1f4f8" stroke-width="1"/>
|
| 187 |
+
<text x="742.00" y="331" text-anchor="middle" class="axis">60</text>
|
| 188 |
+
<line x1="62" y1="306" x2="742" y2="306" stroke="#9aa4b2"/>
|
| 189 |
+
<line x1="62" y1="28" x2="62" y2="306" stroke="#9aa4b2"/>
|
| 190 |
+
<text x="402.00" y="348" text-anchor="middle" class="axis label">cumulative validation rows</text>
|
| 191 |
+
<text x="17" y="167.00" transform="rotate(-90 17 167.00)" text-anchor="middle" class="axis label">score</text>
|
| 192 |
+
<path d="M62.00,161.04 L137.56,166.34 L213.11,166.56 L364.22,178.45 L742.00,185.84" fill="none" stroke="#6b7280" stroke-width="3" stroke-linecap="round" stroke-linejoin="round"/>
|
| 193 |
+
<circle cx="62.00" cy="161.04" r="5.5" fill="#6b7280" stroke="#fff" stroke-width="2"><title>v9.1 seed candidate/checkpoint 6: 0.5214</title></circle>
|
| 194 |
+
<text x="62.00" y="151.04" text-anchor="middle" class="point-label">0.521</text>
|
| 195 |
+
<circle cx="137.56" cy="166.34" r="5.5" fill="#6b7280" stroke="#fff" stroke-width="2"><title>v9.1 seed candidate/checkpoint 12: 0.5024</title></circle>
|
| 196 |
+
<text x="137.56" y="156.34" text-anchor="middle" class="point-label">0.502</text>
|
| 197 |
+
<circle cx="213.11" cy="166.56" r="5.5" fill="#6b7280" stroke="#fff" stroke-width="2"><title>v9.1 seed candidate/checkpoint 18: 0.5016</title></circle>
|
| 198 |
+
<text x="213.11" y="156.56" text-anchor="middle" class="point-label">0.502</text>
|
| 199 |
+
<circle cx="364.22" cy="178.45" r="5.5" fill="#6b7280" stroke="#fff" stroke-width="2"><title>v9.1 seed candidate/checkpoint 30: 0.4588</title></circle>
|
| 200 |
+
<text x="364.22" y="168.45" text-anchor="middle" class="point-label">0.459</text>
|
| 201 |
+
<circle cx="742.00" cy="185.84" r="5.5" fill="#6b7280" stroke="#fff" stroke-width="2"><title>v9.1 seed candidate/checkpoint 60: 0.4322</title></circle>
|
| 202 |
+
<text x="742.00" y="175.84" text-anchor="middle" class="point-label">0.432</text>
|
| 203 |
+
<path d="M62.00,143.83 L137.56,156.58 L213.11,152.33 L364.22,163.18 L742.00,169.45" fill="none" stroke="#1463d9" stroke-width="3" stroke-linecap="round" stroke-linejoin="round"/>
|
| 204 |
+
<circle cx="62.00" cy="143.83" r="5.5" fill="#1463d9" stroke="#fff" stroke-width="2"><title>GEPA 12B six selected candidate/checkpoint 6: 0.5833</title></circle>
|
| 205 |
+
<text x="62.00" y="133.83" text-anchor="middle" class="point-label">0.583</text>
|
| 206 |
+
<circle cx="137.56" cy="156.58" r="5.5" fill="#1463d9" stroke="#fff" stroke-width="2"><title>GEPA 12B six selected candidate/checkpoint 12: 0.5375</title></circle>
|
| 207 |
+
<text x="137.56" y="146.58" text-anchor="middle" class="point-label">0.537</text>
|
| 208 |
+
<circle cx="213.11" cy="152.33" r="5.5" fill="#1463d9" stroke="#fff" stroke-width="2"><title>GEPA 12B six selected candidate/checkpoint 18: 0.5528</title></circle>
|
| 209 |
+
<text x="213.11" y="142.33" text-anchor="middle" class="point-label">0.553</text>
|
| 210 |
+
<circle cx="364.22" cy="163.18" r="5.5" fill="#1463d9" stroke="#fff" stroke-width="2"><title>GEPA 12B six selected candidate/checkpoint 30: 0.5137</title></circle>
|
| 211 |
+
<text x="364.22" y="153.18" text-anchor="middle" class="point-label">0.514</text>
|
| 212 |
+
<circle cx="742.00" cy="169.45" r="5.5" fill="#1463d9" stroke="#fff" stroke-width="2"><title>GEPA 12B six selected candidate/checkpoint 60: 0.4912</title></circle>
|
| 213 |
+
<text x="742.00" y="159.45" text-anchor="middle" class="point-label">0.491</text>
|
| 214 |
+
<line x1="764" y1="36" x2="786" y2="36" stroke="#6b7280" stroke-width="3"/>
|
| 215 |
+
<circle cx="775" cy="36" r="4" fill="#6b7280"/>
|
| 216 |
+
<text x="795" y="40" class="legend">v9.1 seed </text>
|
| 217 |
+
<line x1="764" y1="60" x2="786" y2="60" stroke="#1463d9" stroke-width="3"/>
|
| 218 |
+
<circle cx="775" cy="60" r="4" fill="#1463d9"/>
|
| 219 |
+
<text x="795" y="64" class="legend">GEPA 12B six selected </text>
|
| 220 |
+
</svg></div>
|
| 221 |
+
<h3>Numbers</h3>
|
| 222 |
+
<div class="scroll"><table><thead><tr><th>Rows added</th><th>Checkpoint rows</th><th class="num">Seed segment</th><th class="num">GEPA segment</th><th class="num">Segment delta</th><th class="num">Seed cumulative</th><th class="num">GEPA cumulative</th><th class="num">Cumulative delta</th></tr></thead><tbody><tr><td>1-6</td><td>6</td><td class="num">0.5214</td><td class="num">0.5833</td><td class="num good">0.0619</td><td class="num">0.5214</td><td class="num">0.5833</td><td class="num good">0.0619</td></tr><tr><td>7-12</td><td>12</td><td class="num">0.4833</td><td class="num">0.4917</td><td class="num good">0.0083</td><td class="num">0.5024</td><td class="num">0.5375</td><td class="num good">0.0351</td></tr><tr><td>13-18</td><td>18</td><td class="num">0.5000</td><td class="num">0.5833</td><td class="num good">0.0833</td><td class="num">0.5016</td><td class="num">0.5528</td><td class="num good">0.0512</td></tr><tr><td>19-30</td><td>30</td><td class="num">0.3946</td><td class="num">0.4552</td><td class="num good">0.0605</td><td class="num">0.4588</td><td class="num">0.5137</td><td class="num good">0.0549</td></tr><tr><td>31-60</td><td>60</td><td class="num">0.4056</td><td class="num">0.4687</td><td class="num good">0.0630</td><td class="num">0.4322</td><td class="num">0.4912</td><td class="num good">0.0590</td></tr></tbody></table></div>
|
| 223 |
+
</section>
|
| 224 |
+
<section id="heatmap">
|
| 225 |
+
<h2>Per-Validation-Row Candidate Heatmaps</h2>
|
| 226 |
+
<p>These show which validation rows changed inside the GEPA run, not just the aggregate mean.</p>
|
| 227 |
+
<h3>E4B smoke GEPA per-instance candidate scores</h3><div class="scroll"><table class="heat"><thead><tr><th>Validation row</th><th>candidate 0</th><th>candidate 1 best</th></tr></thead><tbody><tr><td>0</td><td style="background:#f8d8d2">0.2500</td><td style="background:#d9f2e3">1.0000</td></tr><tr><td>1</td><td style="background:#f8d8d2">0.2500</td><td style="background:#f8d8d2">0.2500</td></tr></tbody></table></div><h3>12B six-row GEPA (selected) per-instance candidate scores</h3><div class="scroll"><table class="heat"><thead><tr><th>Validation row</th><th>candidate 0</th><th>candidate 1 best</th></tr></thead><tbody><tr><td>0</td><td style="background:#f8d8d2">0.2500</td><td style="background:#f8d8d2">0.3333</td></tr><tr><td>1</td><td style="background:#f8d8d2">0.2500</td><td style="background:#d9f2e3">1.0000</td></tr><tr><td>2</td><td style="background:#f8d8d2">0.1429</td><td style="background:#f8d8d2">0.2500</td></tr><tr><td>3</td><td style="background:#f8d8d2">0.2857</td><td style="background:#f8d8d2">0.2500</td></tr><tr><td>4</td><td style="background:#f8d8d2">0.2857</td><td style="background:#d9f2e3">1.0000</td></tr><tr><td>5</td><td style="background:#fff0bf">0.5000</td><td style="background:#fff0bf">0.5000</td></tr></tbody></table></div><h3>12B twelve-row init-only per-instance candidate scores</h3><div class="scroll"><table class="heat"><thead><tr><th>Validation row</th><th>candidate 0 best</th></tr></thead><tbody><tr><td>0</td><td style="background:#fff0bf">0.5000</td></tr><tr><td>1</td><td style="background:#f8d8d2">0.2500</td></tr><tr><td>2</td><td style="background:#f8d8d2">0.1429</td></tr><tr><td>3</td><td style="background:#f8d8d2">0.0000</td></tr><tr><td>4</td><td style="background:#d9f2e3">1.0000</td></tr><tr><td>5</td><td style="background:#fff0bf">0.5000</td></tr><tr><td>6</td><td style="background:#f8d8d2">0.2500</td></tr><tr><td>7</td><td style="background:#fff0bf">0.5000</td></tr><tr><td>8</td><td 
style="background:#fff0bf">0.5000</td></tr><tr><td>9</td><td style="background:#d9f2e3">1.0000</td></tr><tr><td>10</td><td style="background:#f8d8d2">0.2000</td></tr><tr><td>11</td><td style="background:#f8d8d2">0.2500</td></tr></tbody></table></div><h3>12B twelve-row continuation GEPA per-instance candidate scores</h3><div class="scroll"><table class="heat"><thead><tr><th>Validation row</th><th>candidate 0</th><th>candidate 1 best</th></tr></thead><tbody><tr><td>0</td><td style="background:#fff0bf">0.5000</td><td style="background:#f8d8d2">0.2500</td></tr><tr><td>1</td><td style="background:#fff0bf">0.5000</td><td style="background:#d9f2e3">1.0000</td></tr><tr><td>2</td><td style="background:#f8d8d2">0.2500</td><td style="background:#f8d8d2">0.2500</td></tr><tr><td>3</td><td style="background:#f8d8d2">0.2500</td><td style="background:#f8d8d2">0.2857</td></tr><tr><td>4</td><td style="background:#d9f2e3">1.0000</td><td style="background:#d9f2e3">1.0000</td></tr><tr><td>5</td><td style="background:#fff0bf">0.5000</td><td style="background:#d9f2e3">1.0000</td></tr><tr><td>6</td><td style="background:#d9f2e3">1.0000</td><td style="background:#f8d8d2">0.2500</td></tr><tr><td>7</td><td style="background:#fff0bf">0.5000</td><td style="background:#fff0bf">0.5000</td></tr><tr><td>8</td><td style="background:#fff0bf">0.5000</td><td style="background:#fff0bf">0.5000</td></tr><tr><td>9</td><td style="background:#d9f2e3">1.0000</td><td style="background:#d9f2e3">1.0000</td></tr><tr><td>10</td><td style="background:#f8d8d2">0.2000</td><td style="background:#d9f2e3">1.0000</td></tr><tr><td>11</td><td style="background:#f8d8d2">0.2500</td><td style="background:#f8d8d2">0.2857</td></tr></tbody></table></div>
|
| 228 |
+
</section>
|
| 229 |
+
<section id="artifacts">
|
| 230 |
+
<h2>Raw Artifacts</h2>
|
| 231 |
+
<p><a class="button" href="iteration-score-data.json">iteration-score-data.json</a> <a class="button" href="iteration-score-summary.csv">iteration-score-summary.csv</a></p>
|
| 232 |
+
<p>Candidate trees from GEPA's built-in visualization: <a href="../gepa-12b-six-20260612T190217Z/candidate_tree.html">12B six</a>, <a href="../gepa-12b-twelve-from-six-iter-20260612T192815Z/candidate_tree.html">12B twelve continuation</a>, <a href="../gepa-e4b-smoke-20260612T184748Z/candidate_tree.html">E4B smoke</a>.</p>
|
| 233 |
+
</section>
|
| 234 |
+
</main>
|
| 235 |
+
</body>
|
| 236 |
+
</html>
|
dashboard-20260613-gepa/legacy-gepa-six-dashboard.html
ADDED
|
@@ -0,0 +1,838 @@
|
| 1 |
+
<!doctype html>
|
| 2 |
+
<html lang="en">
|
| 3 |
+
<head>
|
| 4 |
+
<meta charset="utf-8">
|
| 5 |
+
<meta name="viewport" content="width=device-width,initial-scale=1">
|
| 6 |
+
<title>Localpager GEPA Prompt Optimization Dashboard</title>
|
| 7 |
+
<style>
|
| 8 |
+
:root { --bg:#f6f7f9; --panel:#fff; --ink:#20242a; --muted:#6b7280; --line:#d9dee7; --blue:#2067d1; --green:#137a3a; --red:#b42318; --orange:#b76b00; --gray:#9aa3af; }
|
| 9 |
+
* { box-sizing:border-box; }
|
| 10 |
+
body { margin:0; background:var(--bg); color:var(--ink); font-family:-apple-system,BlinkMacSystemFont,"Segoe UI",Roboto,Arial,sans-serif; line-height:1.4; }
|
| 11 |
+
header { padding:28px 32px 20px; border-bottom:1px solid var(--line); background:#fff; position:sticky; top:0; z-index:2; }
|
| 12 |
+
h1 { margin:0 0 8px; font-size:26px; }
|
| 13 |
+
nav { display:flex; flex-wrap:wrap; gap:10px; margin-top:14px; }
|
| 14 |
+
nav a, .button { color:#123c7c; text-decoration:none; background:#eef4ff; border:1px solid #c9dcff; padding:7px 10px; border-radius:6px; }
|
| 15 |
+
main { padding:24px 32px 48px; max-width:1600px; margin:0 auto; }
|
| 16 |
+
section { background:var(--panel); border:1px solid var(--line); border-radius:8px; padding:20px; margin:0 0 22px; }
|
| 17 |
+
h2 { margin:0 0 14px; font-size:20px; }
|
| 18 |
+
h3 { margin:18px 0 10px; font-size:16px; }
|
| 19 |
+
.grid { display:grid; grid-template-columns:repeat(auto-fit,minmax(180px,1fr)); gap:12px; }
|
| 20 |
+
.card { border:1px solid var(--line); border-radius:8px; padding:14px; background:#fbfcfe; min-height:105px; }
|
| 21 |
+
.card .label { color:var(--muted); font-size:12px; text-transform:uppercase; letter-spacing:.04em; }
|
| 22 |
+
.card .value { margin-top:8px; font-size:28px; font-weight:700; }
|
| 23 |
+
.card .sub { margin-top:5px; color:var(--muted); font-size:13px; }
|
| 24 |
+
table { border-collapse:collapse; width:100%; font-size:13px; }
|
| 25 |
+
th,td { border-bottom:1px solid var(--line); padding:8px 9px; vertical-align:top; text-align:left; }
|
| 26 |
+
th { background:#f0f3f8; position:sticky; top:92px; z-index:1; }
|
| 27 |
+
.good { color:var(--green); font-weight:650; } .bad { color:var(--red); font-weight:650; } .neutral { color:var(--muted); } .muted { color:var(--muted); }
|
| 28 |
+
.pill { display:inline-block; border:1px solid #ccd4df; background:#f7f9fc; border-radius:999px; padding:1px 7px; margin:1px 2px 1px 0; white-space:nowrap; }
|
| 29 |
+
.miss { color:#7a2530; }
|
| 30 |
+
.bar { width:120px; height:8px; background:#e8edf4; border-radius:99px; overflow:hidden; display:inline-block; margin-left:8px; vertical-align:middle; }
|
| 31 |
+
.bar span { display:block; height:100%; } .bar .blue { background:var(--blue); } .bar .gray { background:var(--gray); } .bar .orange { background:var(--orange); }
|
| 32 |
+
tr.win { background:#f2fbf5; } tr.loss { background:#fff6f5; } tr.tie { background:#fff; }
|
| 33 |
+
.scroll { max-height:760px; overflow:auto; border:1px solid var(--line); border-radius:8px; }
|
| 34 |
+
.cols { display:grid; grid-template-columns:1fr 1fr; gap:18px; }
|
| 35 |
+
pre, code { font-family:"SFMono-Regular",Consolas,"Liberation Mono",monospace; font-size:12px; }
|
| 36 |
+
ul.links { columns:2; } ul.links li { break-inside:avoid; margin-bottom:6px; }
|
| 37 |
+
@media (max-width:900px) { header { position:static; } th { position:static; } .cols { grid-template-columns:1fr; } ul.links { columns:1; } }
|
| 38 |
+
</style>
|
| 39 |
+
</head>
|
| 40 |
+
<body>
|
| 41 |
+
<header>
|
| 42 |
+
<h1>Localpager GEPA Prompt Optimization Dashboard</h1>
|
| 43 |
+
<div class="muted">Generated 2026-06-13T03:00:34+00:00. Dataset: Shaun ordered 60-row set. Gold labels: canonical <code>ds4.jsonl</code>. Evaluator: local 12B, concurrency 2.</div>
|
| 44 |
+
<nav><a href="iterations.html">Iteration Graphs</a><a href="#summary">Summary</a><a href="#slices">Slices</a><a href="#rows">Row Deltas</a><a href="#topics">Topics</a><a href="#failures">Failures</a><a href="#gepa">GEPA Artifacts</a><a href="#raw">Raw JSON</a></nav>
|
| 45 |
+
</header>
|
| 46 |
+
<main>
|
| 47 |
+
<section id="iteration-score-graphs">
|
| 48 |
+
<h2>Iteration Score Graphs</h2>
|
| 49 |
+
<p>This summary dashboard is the final validation view. For GEPA score changes across candidates/iterations, open the focused graph page.</p>
|
| 50 |
+
<p><a class="button" href="iterations.html">Open GEPA iteration score graphs</a></p>
|
| 51 |
+
</section>
|
| 52 |
+
<section id="summary">
|
| 53 |
+
<h2>Headline Result</h2>
|
| 54 |
+
<div class="grid"><div class="card "><div class="label">Mean score delta</div><div class="value"><span class="good">+0.0590</span></div><div class="sub">0.4322 -> 0.4912</div></div><div class="card "><div class="label">Micro F1 delta</div><div class="value"><span class="good">+0.0308</span></div><div class="sub">0.6510 -> 0.6818</div></div><div class="card "><div class="label">False positives</div><div class="value"><span class="good">29 -> 23</span></div><div class="sub">lower is better</div></div><div class="card "><div class="label">False negatives</div><div class="value"><span class="bad">60 -> 61</span></div><div class="sub">one extra missed gold label</div></div><div class="card "><div class="label">Over-label events</div><div class="value"><span class="good">2 -> 1</span></div><div class="sub">random extra labels penalized</div></div><div class="card "><div class="label">Structural failures</div><div class="value"><span class="good">3 -> 0</span></div><div class="sub">final_json failures</div></div><div class="card "><div class="label">Wins / ties / losses</div><div class="value">21 / 26 / 13</div><div class="sub">GEPA six vs v9.1 seed</div></div><div class="card "><div class="label">Exact matches</div><div class="value">10 -> 12</div><div class="sub">gold set exactly matched</div></div></div>
|
| 55 |
+
<h3>Metric table</h3>
|
| 56 |
+
<div class="scroll"><table><thead><tr><th>Candidate</th><th>Rows</th><th>Mean</th><th>Precision</th><th>Recall</th><th>F1</th><th>TP</th><th>FP</th><th>FN</th><th>Over</th><th>Failures</th></tr></thead><tbody><tr>
|
| 57 |
+
<td>v9.1 Seed</td><td>60</td><td>0.4322</td><td>0.7411</td><td>0.5804</td><td>0.6510</td><td>83</td><td>29</td><td>60</td><td>2</td><td>3</td>
|
| 58 |
+
</tr><tr>
|
| 59 |
+
<td>GEPA 12B Six Candidate</td><td>60</td><td>0.4912</td><td>0.7965</td><td>0.5960</td><td>0.6818</td><td>90</td><td>23</td><td>61</td><td>1</td><td>0</td>
|
| 60 |
+
</tr><tr>
|
| 61 |
+
<td>GEPA 12B Twelve Candidate (rows 1-30 only)</td><td>30</td><td>0.4800</td><td>0.7377</td><td>0.7258</td><td>0.7317</td><td>45</td><td>16</td><td>17</td><td>6</td><td>2</td>
|
| 62 |
+
</tr></tbody></table></div>
|
| 63 |
+
<p class="muted">The selected artifact is <code>GEPA 12B Six Candidate</code>. Against the v9.1 seed on all 60 rows, it is more precise (0.7411 -> 0.7965), has fewer false positives (29 -> 23), and eliminates structural failures (3 -> 0). The main caveat is one additional false negative (60 -> 61).</p>
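<p>The Precision, Recall, and F1 columns in the metric table can be re-derived from the TP/FP/FN counts in the same rows. The sketch below assumes the standard micro-averaged definitions; the names <code>Counts</code> and <code>micro_prf</code> are hypothetical helpers for illustration, not part of the dashboard generator.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Label-level totals summed over all evaluated rows."""
    tp: int  # predicted labels that are in the gold set
    fp: int  # predicted labels that are not in the gold set
    fn: int  # gold labels that were not predicted


def micro_prf(c: Counts) -> tuple[float, float, float]:
    """Return (precision, recall, f1) using micro-averaged definitions (assumed)."""
    precision = c.tp / (c.tp + c.fp) if c.tp + c.fp else 0.0
    recall = c.tp / (c.tp + c.fn) if c.tp + c.fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# TP / FP / FN values copied from the metric table.
ROWS = {
    "v9.1 Seed": Counts(83, 29, 60),
    "GEPA 12B Six Candidate": Counts(90, 23, 61),
    "GEPA 12B Twelve Candidate (rows 1-30 only)": Counts(45, 16, 17),
}

for name, counts in ROWS.items():
    p, r, f1 = micro_prf(counts)
    print(f"{name}: precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```

<p>Under these definitions the printed values agree with the table to four decimals. The Mean column is a per-row score average and is not reproduced here, since the per-row scoring rule is not shown on this page.</p>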
|
| 64 |
+
</section>
|
| 65 |
+
<section id="slices">
|
| 66 |
+
<h2>Score By Evaluation Slice</h2>
|
| 67 |
+
<div class="scroll"><table><thead><tr><th>Rows</th><th>v9.1 seed</th><th>GEPA six</th><th>Delta</th><th>GEPA twelve</th><th>E4B smoke</th></tr></thead><tbody><tr><td>1-6</td><td>0.5214 <div class="bar"><span class="gray" style="width:52.1%"></span></div></td><td>0.5833 <div class="bar"><span class="blue" style="width:58.3%"></span></div></td><td class="good">+0.0619</td><td>0.5158 <div class="bar"><span class="blue" style="width:51.6%"></span></div></td><td>0.3607 <div class="bar"><span class="orange" style="width:36.1%"></span></div> </td></tr><tr><td>7-12</td><td>0.4833 <div class="bar"><span class="gray" style="width:48.3%"></span></div></td><td>0.4917 <div class="bar"><span class="blue" style="width:49.2%"></span></div></td><td class="good">+0.0083</td><td><span class="muted">n/a</span></td><td><span class="muted">n/a</span></td></tr><tr><td>13-18</td><td>0.5000 <div class="bar"><span class="gray" style="width:50.0%"></span></div></td><td>0.5833 <div class="bar"><span class="blue" style="width:58.3%"></span></div></td><td class="good">+0.0833</td><td>0.5417 <div class="bar"><span class="blue" style="width:54.2%"></span></div></td><td><span class="muted">n/a</span></td></tr><tr><td>19-30</td><td>0.3946 <div class="bar"><span class="gray" style="width:39.5%"></span></div></td><td>0.4552 <div class="bar"><span class="blue" style="width:45.5%"></span></div></td><td class="good">+0.0605</td><td>0.4134 <div class="bar"><span class="blue" style="width:41.3%"></span></div></td><td><span class="muted">n/a</span></td></tr><tr><td>31-60</td><td>0.4056 <div class="bar"><span class="gray" style="width:40.6%"></span></div></td><td>0.4687 <div class="bar"><span class="blue" style="width:46.9%"></span></div></td><td class="good">+0.0630</td><td><span class="muted">n/a</span></td><td><span class="muted">n/a</span></td></tr></tbody></table></div>
|
| 68 |
+
</section>
|
| 69 |
+
<section id="rows">
|
| 70 |
+
<h2>Per-Row Deltas: GEPA Six vs v9.1 Seed</h2>
|
| 71 |
+
<div class="cols"><div><h3>Largest losses to scrutinize</h3><ol><li><a href="https://github.com/openclaw/openclaw/pull/77748">#77748</a> <b class="bad">-0.7500</b> fix: Codex startup plugins + WhatsApp history & Docker Codex OAuth<br><span class="muted">gold codex, chat_integrations; seed chat_integrations, codex; GEPA codex, gateway</span></li>
|
| 72 |
+
<li><a href="https://github.com/openclaw/openclaw/pull/63007">#63007</a> <b class="bad">-0.7500</b> Pass outbound session identity into message_sending and surface guarded gateway send denial<br><span class="muted">gold gateway, sessions; seed gateway, sessions; GEPA gateway, hooks</span></li>
|
| 73 |
+
<li><a href="https://github.com/openclaw/openclaw/issues/59878">#59878</a> <b class="bad">-0.7500</b> Session lane stuck in 'running' after run dies — sessions.abort + gateway restart fail to clear stale state<br><span class="muted">gold sessions, reliability; seed sessions, reliability; GEPA sessions, gateway</span></li>
|
| 74 |
+
<li><a href="https://github.com/openclaw/openclaw/pull/48940">#48940</a> <b class="bad">-0.5000</b> ACP: add gateway-owned node-backed runtime<br><span class="muted">gold acp, gateway, agent_runtime; seed acp, gateway, agent_runtime; GEPA acp, gateway</span></li>
|
| 75 |
+
<li><a href="https://github.com/openclaw/openclaw/pull/77827">#77827</a> <b class="bad">-0.5000</b> fix: LM Studio thinking blocks invisible with Responses API<br><span class="muted">gold model_serving, local_models; seed local_models, model_serving; GEPA model_serving</span></li>
|
| 76 |
+
<li><a href="https://github.com/openclaw/openclaw/pull/70882">#70882</a> <b class="bad">-0.5000</b> fix(bundle-mcp): coerce stringified object/array params before MCP tool calls<br><span class="muted">gold mcp_tooling, tool_calling; seed mcp_tooling, tool_calling; GEPA mcp_tooling</span></li>
|
| 77 |
+
<li><a href="https://github.com/openclaw/openclaw/issues/84746">#84746</a> <b class="bad">-0.5000</b> [Bug]: Auto-compaction crashes active responses after 5.18 transcript lock scope change (#13744)<br><span class="muted">gold reliability, sessions; seed sessions, reliability; GEPA sessions</span></li>
|
| 78 |
+
<li><a href="https://github.com/openclaw/openclaw/pull/47083">#47083</a> <b class="bad">-0.3000</b> fix: respect totalTokensFresh flag to avoid showing stale token counts<br><span class="muted">gold sessions, telemetry_usage; seed telemetry_usage; GEPA ui_tui</span></li>
|
| 79 |
+
<li><a href="https://github.com/openclaw/openclaw/pull/65242">#65242</a> <b class="bad">-0.3000</b> fix: CompletionDeliveryGate to prevent duplicate ACP completion delivery<br><span class="muted">gold acp, coding_agents, reliability; seed acp, reliability; GEPA acp, notifications</span></li></ol></div><div><h3>Largest wins</h3><ol><li><a href="https://github.com/openclaw/openclaw/pull/80783">#80783</a> <b class="good">+0.8000</b> Policy: add model, network, and MCP conformance checks<br><span class="muted">gold mcp_tooling, config, security; seed mcp_tooling, local_model_providers; GEPA mcp_tooling, security, config</span></li>
|
| 80 |
+
<li><a href="https://github.com/openclaw/openclaw/issues/84789">#84789</a> <b class="good">+0.7500</b> Active memory crashes on Telegram forum topic sessions (dirName validation)<br><span class="muted">gold memory, sessions; seed memory, chat_integrations; GEPA memory, sessions</span></li>
|
| 81 |
+
<li><a href="https://github.com/openclaw/openclaw/issues/71216">#71216</a> <b class="good">+0.7500</b> Config schema: add `sandbox`, `routing.rules`, `instances`, and `gateway.nodes.denyPaths`<br><span class="muted">gold config, sandboxing, gateway; seed sandboxing, model_serving, gateway; GEPA config, sandboxing, gateway</span></li>
|
| 82 |
+
<li><a href="https://github.com/openclaw/openclaw/pull/80479">#80479</a> <b class="good">+0.7500</b> feat(memory/embeddings): add openai-compatible provider for self-hosted servers (llama.cpp, Ollama, vLLM, TGI, LocalAI)<br><span class="muted">gold self_hosted_inference, memory; seed local_models, self_hosted_inference; GEPA memory, self_hosted_inference</span></li>
|
| 83 |
+
<li><a href="https://github.com/openclaw/openclaw/issues/79897">#79897</a> <b class="good">+0.7143</b> OpenAI-compatible streaming with llama.cpp saves zero usage (stream closed before final usage chunk)<br><span class="muted">gold model_serving; seed telemetry_usage, model_serving; GEPA model_serving</span></li>
|
| 84 |
+
<li><a href="https://github.com/openclaw/openclaw/issues/74305">#74305</a> <b class="good">+0.5000</b> [Bug]: ACPX Codex worker fails when model/thinking overrides are configured<br><span class="muted">gold acpx, codex; seed acpx; GEPA acpx, codex</span></li>
|
| 85 |
+
<li><a href="https://github.com/openclaw/openclaw/issues/78528">#78528</a> <b class="good">+0.5000</b> Security: skill SecretRef API keys still leak into exec child environments<br><span class="muted">gold security, exec_tools, skills_plugins; seed security, exec_tools; GEPA security, exec_tools, skills_plugins</span></li>
|
| 86 |
+
<li><a href="https://github.com/openclaw/openclaw/pull/84752">#84752</a> <b class="good">+0.5000</b> fix: self-heal lane wedges + restore openai-codex OAuth on embedded path<br><span class="muted">gold reliability, auth_identity, sessions; seed none; GEPA reliability, auth_identity</span></li>
|
| 87 |
+
<li><a href="https://github.com/openclaw/openclaw/issues/70529">#70529</a> <b class="good">+0.5000</b> [Bug]: Desktop cannot use existing Chrome sessions: EasyClaw Google sign-in fails, and user profile attach fails with spawn npx ENOENT<br><span class="muted">gold browser_automation, packaging_deployment; seed browser_automation; GEPA browser_automation, packaging_deployment</span></li>
|
| 88 |
+
<li><a href="https://github.com/openclaw/openclaw/pull/63826">#63826</a> <b class="good">+0.5000</b> security: fix HIGH/CRITICAL vulns in skill scanner, SSRF, hook priority, and token verification<br><span class="muted">gold security, hooks, skills_plugins; seed none; GEPA security, hooks</span></li>
|
| 89 |
+
<li><a href="https://github.com/openclaw/openclaw/pull/46552">#46552</a> <b class="good">+0.5000</b> docs(queue): clarify steer behavior with partial streaming and tool boundaries<br><span class="muted">gold queueing, docs; seed docs; GEPA docs, queueing</span></li>
|
| 90 |
+
<li><a href="https://github.com/openclaw/openclaw/issues/48580">#48580</a> <b class="good">+0.3000</b> Bug: sessions created by acpx codex sessions exit immediately - stdin is not a terminal<br><span class="muted">gold acpx, codex, sessions; seed acp, sessions; GEPA codex, sessions</span></li>
|
| 91 |
+
<li><a href="https://github.com/openclaw/openclaw/issues/83863">#83863</a> <b class="good">+0.3000</b> ACP/Codex child tasks can be marked succeeded with progress-only output and no final deliverable<br><span class="muted">gold acp, codex, agent_runtime; seed acp, reliability; GEPA acp, codex</span></li>
|
| 92 |
+
<li><a href="https://github.com/openclaw/openclaw/issues/82507">#82507</a> <b class="good">+0.3000</b> [Feature]: ACPX Codex sandbox should inherit user-installed plugins (e.g. Superpowers)<br><span class="muted">gold acpx, codex, skills_plugins; seed acpx, security; GEPA acpx, skills_plugins</span></li>
|
| 93 |
+
<li><a href="https://github.com/openclaw/openclaw/pull/43246">#43246</a> <b class="good">+0.2500</b> fix(message): deny same-provider cross-context sends by default [AI-assisted]<br><span class="muted">gold tool_calling, security; seed none; GEPA security, config</span></li>
|
| 94 |
+
<li><a href="https://github.com/openclaw/openclaw/pull/68725">#68725</a> <b class="good">+0.2500</b> feat(amazon-bedrock-mantle): add known context windows for open-weight Mantle models<br><span class="muted">gold open_weight_models, local_model_providers; seed open_weight_models, model_serving; GEPA open_weight_models</span></li></ol></div></div>
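<p>The win / tie / loss split in the headline cards and the <code>win</code>, <code>loss</code>, and <code>tie</code> row classes follow from the sign of each per-row delta (GEPA score minus seed score). A minimal sketch of that tally, assuming a small tolerance for treating a delta as a tie; <code>classify</code>, <code>tally</code>, and the tolerance value are hypothetical and are not taken from the dashboard generator:</p>

```python
from collections import Counter


def classify(delta: float, tol: float = 1e-9) -> str:
    """Map a per-row delta (GEPA minus seed) to win, loss, or tie."""
    if delta > tol:
        return "win"
    if -delta > tol:
        return "loss"
    return "tie"  # assumed: deltas within the tolerance count as ties


def tally(deltas: list[float]) -> Counter:
    """Count wins, ties, and losses over a list of per-row deltas."""
    return Counter(classify(d) for d in deltas)


# A few deltas copied from the lists above, plus one illustrative 0.0 tie.
SAMPLE = [-0.7500, -0.5000, -0.3000, 0.8000, 0.7500, 0.2500, 0.0]

counts = tally(SAMPLE)
print(counts["win"], counts["tie"], counts["loss"])
```

<p>Running the same tally over all 60 per-row deltas should give the 21 / 26 / 13 split reported in the summary, provided the generator uses the same sign rule.</p>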
|
| 95 |
+
<h3>All rows</h3>
|
| 96 |
+
<div class="scroll"><table><thead><tr><th>Target</th><th>Title</th><th>Seed</th><th>GEPA</th><th>Delta</th><th>Gold</th><th>Seed pred</th><th>GEPA pred</th><th>Seed FP/FN</th><th>GEPA FP/FN</th><th>Error</th></tr></thead><tbody><tr class="loss">
|
| 97 |
+
<td><a href="https://github.com/openclaw/openclaw/pull/77748">#77748</a></td>
|
| 98 |
+
<td>fix: Codex startup plugins + WhatsApp history & Docker Codex OAuth</td>
|
| 99 |
+
<td>1.0000</td><td>0.2500</td><td class="bad">-0.7500</td>
|
| 100 |
+
<td><span class="pill">codex</span> <span class="pill">chat_integrations</span></td>
|
| 101 |
+
<td><span class="pill">chat_integrations</span> <span class="pill">codex</span></td>
|
| 102 |
+
<td><span class="pill">codex</span> <span class="pill">gateway</span></td>
|
| 103 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="muted">none</span></span></td>
|
| 104 |
+
<td><span class="pill">gateway</span><br><span class="miss">FN <span class="pill">chat_integrations</span></span></td>
|
| 105 |
+
<td></td>
|
| 106 |
+
</tr><tr class="loss">
|
| 107 |
+
<td><a href="https://github.com/openclaw/openclaw/pull/63007">#63007</a></td>
|
| 108 |
+
<td>Pass outbound session identity into message_sending and surface guarded gateway send denial</td>
|
| 109 |
+
<td>1.0000</td><td>0.2500</td><td class="bad">-0.7500</td>
|
| 110 |
+
<td><span class="pill">gateway</span> <span class="pill">sessions</span></td>
|
| 111 |
+
<td><span class="pill">gateway</span> <span class="pill">sessions</span></td>
|
| 112 |
+
<td><span class="pill">gateway</span> <span class="pill">hooks</span></td>
|
| 113 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="muted">none</span></span></td>
|
| 114 |
+
<td><span class="pill">hooks</span><br><span class="miss">FN <span class="pill">sessions</span></span></td>
|
| 115 |
+
<td></td>
|
| 116 |
+
</tr><tr class="loss">
|
| 117 |
+
<td><a href="https://github.com/openclaw/openclaw/issues/59878">#59878</a></td>
|
| 118 |
+
<td>Session lane stuck in 'running' after run dies — sessions.abort + gateway restart fail to clear stale state</td>
|
| 119 |
+
<td>1.0000</td><td>0.2500</td><td class="bad">-0.7500</td>
|
| 120 |
+
<td><span class="pill">sessions</span> <span class="pill">reliability</span></td>
|
| 121 |
+
<td><span class="pill">sessions</span> <span class="pill">reliability</span></td>
|
| 122 |
+
<td><span class="pill">sessions</span> <span class="pill">gateway</span></td>
|
| 123 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="muted">none</span></span></td>
|
| 124 |
+
<td><span class="pill">gateway</span><br><span class="miss">FN <span class="pill">reliability</span></span></td>
|
| 125 |
+
<td></td>
|
| 126 |
+
</tr><tr class="loss">
|
| 127 |
+
<td><a href="https://github.com/openclaw/openclaw/pull/48940">#48940</a></td>
|
| 128 |
+
<td>ACP: add gateway-owned node-backed runtime</td>
|
| 129 |
+
<td>1.0000</td><td>0.5000</td><td class="bad">-0.5000</td>
|
| 130 |
+
<td><span class="pill">acp</span> <span class="pill">gateway</span> <span class="pill">agent_runtime</span></td>
|
| 131 |
+
<td><span class="pill">acp</span> <span class="pill">gateway</span> <span class="pill">agent_runtime</span></td>
|
| 132 |
+
<td><span class="pill">acp</span> <span class="pill">gateway</span></td>
|
| 133 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="muted">none</span></span></td>
|
| 134 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">agent_runtime</span></span></td>
|
| 135 |
+
<td></td>
|
| 136 |
+
</tr><tr class="loss">
|
| 137 |
+
<td><a href="https://github.com/openclaw/openclaw/pull/77827">#77827</a></td>
|
| 138 |
+
<td>fix: LM Studio thinking blocks invisible with Responses API</td>
|
| 139 |
+
<td>1.0000</td><td>0.5000</td><td class="bad">-0.5000</td>
|
| 140 |
+
<td><span class="pill">model_serving</span> <span class="pill">local_models</span></td>
|
| 141 |
+
<td><span class="pill">local_models</span> <span class="pill">model_serving</span></td>
|
| 142 |
+
<td><span class="pill">model_serving</span></td>
|
| 143 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="muted">none</span></span></td>
|
| 144 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">local_models</span></span></td>
|
| 145 |
+
<td></td>
|
| 146 |
+
</tr><tr class="loss">
|
| 147 |
+
<td><a href="https://github.com/openclaw/openclaw/pull/70882">#70882</a></td>
|
| 148 |
+
<td>fix(bundle-mcp): coerce stringified object/array params before MCP tool calls</td>
|
| 149 |
+
<td>1.0000</td><td>0.5000</td><td class="bad">-0.5000</td>
|
| 150 |
+
<td><span class="pill">mcp_tooling</span> <span class="pill">tool_calling</span></td>
|
| 151 |
+
<td><span class="pill">mcp_tooling</span> <span class="pill">tool_calling</span></td>
|
| 152 |
+
<td><span class="pill">mcp_tooling</span></td>
|
| 153 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="muted">none</span></span></td>
|
| 154 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">tool_calling</span></span></td>
|
| 155 |
+
<td></td>
|
| 156 |
+
</tr><tr class="loss">
|
| 157 |
+
<td><a href="https://github.com/openclaw/openclaw/issues/84746">#84746</a></td>
|
| 158 |
+
<td>[Bug]: Auto-compaction crashes active responses after 5.18 transcript lock scope change (#13744)</td>
|
| 159 |
+
<td>1.0000</td><td>0.5000</td><td class="bad">-0.5000</td>
|
| 160 |
+
<td><span class="pill">reliability</span> <span class="pill">sessions</span></td>
|
| 161 |
+
<td><span class="pill">sessions</span> <span class="pill">reliability</span></td>
|
| 162 |
+
<td><span class="pill">sessions</span></td>
|
| 163 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="muted">none</span></span></td>
|
| 164 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">reliability</span></span></td>
|
| 165 |
+
<td></td>
|
| 166 |
+
</tr><tr class="loss">
|
| 167 |
+
<td><a href="https://github.com/openclaw/openclaw/pull/47083">#47083</a></td>
|
| 168 |
+
<td>fix: respect totalTokensFresh flag to avoid showing stale token counts</td>
|
| 169 |
+
<td>0.5000</td><td>0.2000</td><td class="bad">-0.3000</td>
|
| 170 |
+
<td><span class="pill">sessions</span> <span class="pill">telemetry_usage</span></td>
|
| 171 |
+
<td><span class="pill">telemetry_usage</span></td>
|
| 172 |
+
<td><span class="pill">ui_tui</span></td>
|
| 173 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">sessions</span></span></td>
|
| 174 |
+
<td><span class="pill">ui_tui</span><br><span class="miss">FN <span class="pill">sessions</span> <span class="pill">telemetry_usage</span></span></td>
|
| 175 |
+
<td></td>
|
| 176 |
+
</tr><tr class="loss">
|
| 177 |
+
<td><a href="https://github.com/openclaw/openclaw/pull/65242">#65242</a></td>
|
| 178 |
+
<td>fix: CompletionDeliveryGate to prevent duplicate ACP completion delivery</td>
|
| 179 |
+
<td>0.5000</td><td>0.2000</td><td class="bad">-0.3000</td>
|
| 180 |
+
<td><span class="pill">acp</span> <span class="pill">coding_agents</span> <span class="pill">reliability</span></td>
|
| 181 |
+
<td><span class="pill">acp</span> <span class="pill">reliability</span></td>
|
| 182 |
+
<td><span class="pill">acp</span> <span class="pill">notifications</span></td>
|
| 183 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">coding_agents</span></span></td>
|
| 184 |
+
<td><span class="pill">notifications</span><br><span class="miss">FN <span class="pill">coding_agents</span> <span class="pill">reliability</span></span></td>
|
| 185 |
+
<td></td>
|
| 186 |
+
</tr><tr class="loss">
|
| 187 |
+
<td><a href="https://github.com/openclaw/openclaw/pull/69256">#69256</a></td>
|
| 188 |
+
<td>fix(cron): prevent premature session cleanup when subagents are running</td>
|
| 189 |
+
<td>0.5000</td><td>0.3333</td><td class="bad">-0.1667</td>
|
| 190 |
+
<td><span class="pill">cron_automation</span> <span class="pill">sessions</span> <span class="pill">reliability</span></td>
|
| 191 |
+
<td><span class="pill">cron_automation</span> <span class="pill">sessions</span></td>
|
| 192 |
+
<td><span class="pill">cron_automation</span></td>
|
| 193 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">reliability</span></span></td>
|
| 194 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">sessions</span> <span class="pill">reliability</span></span></td>
|
| 195 |
+
<td></td>
|
| 196 |
+
</tr><tr class="loss">
|
| 197 |
+
<td><a href="https://github.com/openclaw/openclaw/issues/81249">#81249</a></td>
|
| 198 |
+
<td>[Feature/Bug]: Local Ollama embeddings fail when proxy is enabled (SSRF defenses ignore NO_PROXY)</td>
|
| 199 |
+
<td>0.2500</td><td>0.1429</td><td class="bad">-0.1071</td>
|
| 200 |
+
<td><span class="pill">local_models</span> <span class="pill">self_hosted_inference</span></td>
|
| 201 |
+
<td><span class="pill">local_models</span> <span class="pill">security</span></td>
|
| 202 |
+
<td><span class="pill">security</span> <span class="pill">config</span></td>
|
| 203 |
+
<td><span class="pill">security</span><br><span class="miss">FN <span class="pill">self_hosted_inference</span></span></td>
|
| 204 |
+
<td><span class="pill">security</span> <span class="pill">config</span><br><span class="miss">FN <span class="pill">local_models</span> <span class="pill">self_hosted_inference</span></span></td>
|
| 205 |
+
<td></td>
|
| 206 |
+
</tr><tr class="loss">
|
| 207 |
+
<td><a href="https://github.com/openclaw/openclaw/issues/84477">#84477</a></td>
|
| 208 |
+
<td>Discord embedded-run prep wedge before strict-agentic, recovery skips sessionId=unknown lanes</td>
|
| 209 |
+
<td>0.2500</td><td>0.2000</td><td class="bad">-0.0500</td>
|
| 210 |
+
<td><span class="pill">sessions</span> <span class="pill">agent_runtime</span> <span class="pill">reliability</span></td>
|
| 211 |
+
<td><span class="pill">chat_integrations</span> <span class="pill">sessions</span> <span class="pill">reliability</span></td>
|
| 212 |
+
<td><span class="pill">gateway</span> <span class="pill">sessions</span></td>
|
| 213 |
+
<td><span class="pill">chat_integrations</span><br><span class="miss">FN <span class="pill">agent_runtime</span></span></td>
|
| 214 |
+
<td><span class="pill">gateway</span><br><span class="miss">FN <span class="pill">agent_runtime</span> <span class="pill">reliability</span></span></td>
|
| 215 |
+
<td></td>
|
| 216 |
+
</tr><tr class="loss">
|
| 217 |
+
<td><a href="https://github.com/openclaw/openclaw/pull/44379">#44379</a></td>
|
| 218 |
+
<td>fix(pi-runner): harden context-overflow recovery with one suppress-hook retry</td>
|
| 219 |
+
<td>0.1667</td><td>0.1429</td><td class="bad">-0.0238</td>
|
| 220 |
+
<td><span class="pill">coding_agents</span> <span class="pill">memory</span> <span class="pill">hooks</span> <span class="pill">reliability</span></td>
|
| 221 |
+
<td><span class="pill">agent_runtime</span> <span class="pill">reliability</span></td>
|
| 222 |
+
<td><span class="pill">agent_runtime</span></td>
|
| 223 |
+
<td><span class="pill">agent_runtime</span><br><span class="miss">FN <span class="pill">coding_agents</span> <span class="pill">memory</span> <span class="pill">hooks</span></span></td>
|
| 224 |
+
<td><span class="pill">agent_runtime</span><br><span class="miss">FN <span class="pill">coding_agents</span> <span class="pill">memory</span> <span class="pill">hooks</span> <span class="pill">reliability</span></span></td>
|
| 225 |
+
<td></td>
|
| 226 |
+
</tr><tr class="tie">
|
| 227 |
+
<td><a href="https://github.com/openclaw/openclaw/issues/40332">#40332</a></td>
|
| 228 |
+
<td>[Feature]: Per-binding and per-agent permissionMode for ACP sessions</td>
|
| 229 |
+
<td>0.5000</td><td>0.5000</td><td class="neutral">+0.0000</td>
|
| 230 |
+
<td><span class="pill">acp</span> <span class="pill">approvals</span> <span class="pill">acpx</span></td>
|
| 231 |
+
<td><span class="pill">approvals</span> <span class="pill">acp</span></td>
|
| 232 |
+
<td><span class="pill">acp</span> <span class="pill">approvals</span></td>
|
| 233 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">acpx</span></span></td>
|
| 234 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">acpx</span></span></td>
|
| 235 |
+
<td></td>
|
| 236 |
+
</tr><tr class="tie">
|
| 237 |
+
<td><a href="https://github.com/openclaw/openclaw/pull/80255">#80255</a></td>
|
| 238 |
+
<td>fix #79026: active-memory recall subagent can deadlock on the main lane inside before_prompt_build</td>
|
| 239 |
+
<td>0.5000</td><td>0.5000</td><td class="neutral">+0.0000</td>
|
| 240 |
+
<td><span class="pill">memory</span> <span class="pill">reliability</span></td>
|
| 241 |
+
<td><span class="pill">memory</span></td>
|
| 242 |
+
<td><span class="pill">memory</span></td>
|
| 243 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">reliability</span></span></td>
|
| 244 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">reliability</span></span></td>
|
| 245 |
+
<td></td>
|
| 246 |
+
</tr><tr class="tie">
|
| 247 |
+
<td><a href="https://github.com/openclaw/openclaw/pull/84670">#84670</a></td>
|
| 248 |
+
<td>[codex] fix webchat full-message reader for truncated history</td>
|
| 249 |
+
<td>0.5000</td><td>0.5000</td><td class="neutral">+0.0000</td>
|
| 250 |
+
<td><span class="pill">gateway</span> <span class="pill">api_surface</span> <span class="pill">ui_tui</span></td>
|
| 251 |
+
<td><span class="pill">gateway</span> <span class="pill">ui_tui</span></td>
|
| 252 |
+
<td><span class="pill">gateway</span> <span class="pill">ui_tui</span></td>
|
| 253 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">api_surface</span></span></td>
|
| 254 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">api_surface</span></span></td>
|
| 255 |
+
<td></td>
|
| 256 |
+
</tr><tr class="tie">
|
| 257 |
+
<td><a href="https://github.com/openclaw/openclaw/pull/62428">#62428</a></td>
|
| 258 |
+
<td>test(exec): land exec v2 contract follow-through</td>
|
| 259 |
+
<td>0.2000</td><td>0.2000</td><td class="neutral">+0.0000</td>
|
| 260 |
+
<td><span class="pill">exec_tools</span> <span class="pill">sandboxing</span> <span class="pill">approvals</span></td>
|
| 261 |
+
<td><span class="pill">exec_tools</span> <span class="pill">security</span></td>
|
| 262 |
+
<td><span class="pill">exec_tools</span> <span class="pill">security</span></td>
|
| 263 |
+
<td><span class="pill">security</span><br><span class="miss">FN <span class="pill">sandboxing</span> <span class="pill">approvals</span></span></td>
|
| 264 |
+
<td><span class="pill">security</span><br><span class="miss">FN <span class="pill">sandboxing</span> <span class="pill">approvals</span></span></td>
|
| 265 |
+
<td></td>
|
| 266 |
+
</tr><tr class="tie">
|
| 267 |
+
<td><a href="https://github.com/openclaw/openclaw/issues/90146">#90146</a></td>
|
| 268 |
+
<td>google-vertex: Missing gemini-3.1-flash-lite in provider catalog causes silent failure instead of error</td>
|
| 269 |
+
<td>0.2500</td><td>0.2500</td><td class="neutral">+0.0000</td>
|
| 270 |
+
<td><span class="pill">local_model_providers</span> <span class="pill">reliability</span></td>
|
| 271 |
+
<td><span class="pill">model_serving</span> <span class="pill">reliability</span></td>
|
| 272 |
+
<td><span class="pill">config</span> <span class="pill">reliability</span></td>
|
| 273 |
+
<td><span class="pill">model_serving</span><br><span class="miss">FN <span class="pill">local_model_providers</span></span></td>
|
| 274 |
+
<td><span class="pill">config</span><br><span class="miss">FN <span class="pill">local_model_providers</span></span></td>
|
| 275 |
+
<td></td>
|
| 276 |
+
</tr><tr class="tie">
|
| 277 |
+
<td><a href="https://github.com/openclaw/openclaw/pull/51849">#51849</a></td>
|
| 278 |
+
<td>Docs: add freeCodeCamp OpenClaw full tutorial to showcase</td>
|
| 279 |
+
<td>1.0000</td><td>1.0000</td><td class="neutral">+0.0000</td>
|
| 280 |
+
<td><span class="pill">docs</span></td>
|
| 281 |
+
<td><span class="pill">docs</span></td>
|
| 282 |
+
<td><span class="pill">docs</span></td>
|
| 283 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="muted">none</span></span></td>
|
| 284 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="muted">none</span></span></td>
|
| 285 |
+
<td></td>
|
| 286 |
+
</tr><tr class="tie">
|
| 287 |
+
<td><a href="https://github.com/openclaw/openclaw/issues/84297">#84297</a></td>
|
| 288 |
+
<td>[Bug]: Per-agent identity overlay dropped on cron --announce and heartbeat target-channel Slack pushes (announce path; reply path was fixed in #38235)</td>
|
| 289 |
+
<td>0.2500</td><td>0.2500</td><td class="neutral">+0.0000</td>
|
| 290 |
+
<td><span class="pill">notifications</span> <span class="pill">chat_integrations</span></td>
|
| 291 |
+
<td><span class="pill">chat_integrations</span> <span class="pill">cron_automation</span></td>
|
| 292 |
+
<td><span class="pill">cron_automation</span> <span class="pill">chat_integrations</span></td>
|
| 293 |
+
<td><span class="pill">cron_automation</span><br><span class="miss">FN <span class="pill">notifications</span></span></td>
|
| 294 |
+
<td><span class="pill">cron_automation</span><br><span class="miss">FN <span class="pill">notifications</span></span></td>
|
| 295 |
+
<td></td>
|
| 296 |
+
</tr><tr class="tie">
|
| 297 |
+
<td><a href="https://github.com/openclaw/openclaw/pull/81957">#81957</a></td>
|
| 298 |
+
<td>ci: harden GitHub Actions supply-chain boundaries</td>
|
| 299 |
+
<td>0.2857</td><td>0.2857</td><td class="neutral">+0.0000</td>
|
| 300 |
+
<td><span class="pill">security</span></td>
|
| 301 |
+
<td><span class="pill">security</span> <span class="pill">packaging_deployment</span></td>
|
| 302 |
+
<td><span class="pill">security</span> <span class="pill">packaging_deployment</span></td>
|
| 303 |
+
<td><span class="pill">packaging_deployment</span><br><span class="miss">FN <span class="muted">none</span></span></td>
|
| 304 |
+
<td><span class="pill">packaging_deployment</span><br><span class="miss">FN <span class="muted">none</span></span></td>
|
| 305 |
+
<td></td>
|
| 306 |
+
</tr><tr class="tie">
|
| 307 |
+
<td><a href="https://github.com/openclaw/openclaw/issues/87277">#87277</a></td>
|
| 308 |
+
<td>[Feature] Add MiMo-V2.5 to Xiaomi catalog + automatic multimodal routing when DeepSeek V4-Pro is primary model</td>
|
| 309 |
+
<td>0.2500</td><td>0.2500</td><td class="neutral">+0.0000</td>
|
| 310 |
+
<td><span class="pill">local_model_providers</span> <span class="pill">model_serving</span></td>
|
| 311 |
+
<td><span class="pill">model_releases</span> <span class="pill">model_serving</span></td>
|
| 312 |
+
<td><span class="pill">config</span> <span class="pill">model_serving</span></td>
|
| 313 |
+
<td><span class="pill">model_releases</span><br><span class="miss">FN <span class="pill">local_model_providers</span></span></td>
|
| 314 |
+
<td><span class="pill">config</span><br><span class="miss">FN <span class="pill">local_model_providers</span></span></td>
|
| 315 |
+
<td></td>
|
| 316 |
+
</tr><tr class="tie">
|
| 317 |
+
<td><a href="https://github.com/openclaw/openclaw/issues/64199">#64199</a></td>
|
| 318 |
+
<td>[Bug]: ACP configured binding uses parent channel ID for session key — all threads under same channel share one persistent Claude Code process</td>
|
| 319 |
+
<td>1.0000</td><td>1.0000</td><td class="neutral">+0.0000</td>
|
| 320 |
+
<td><span class="pill">acp</span> <span class="pill">sessions</span></td>
|
| 321 |
+
<td><span class="pill">acp</span> <span class="pill">sessions</span></td>
|
| 322 |
+
<td><span class="pill">acp</span> <span class="pill">sessions</span></td>
|
| 323 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="muted">none</span></span></td>
|
| 324 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="muted">none</span></span></td>
|
| 325 |
+
<td></td>
|
| 326 |
+
</tr><tr class="tie">
|
| 327 |
+
<td><a href="https://github.com/openclaw/openclaw/issues/84583">#84583</a></td>
|
| 328 |
+
<td>cron announce delivery triggers EmbeddedAttemptSessionTakeoverError when user is actively chatting</td>
|
| 329 |
+
<td>0.5000</td><td>0.5000</td><td class="neutral">+0.0000</td>
|
| 330 |
+
<td><span class="pill">cron_automation</span> <span class="pill">sessions</span> <span class="pill">reliability</span></td>
|
| 331 |
+
<td><span class="pill">sessions</span> <span class="pill">reliability</span></td>
|
| 332 |
+
<td><span class="pill">cron_automation</span> <span class="pill">sessions</span></td>
|
| 333 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">cron_automation</span></span></td>
|
| 334 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">reliability</span></span></td>
|
| 335 |
+
<td></td>
|
| 336 |
+
</tr><tr class="tie">
|
| 337 |
+
<td><a href="https://github.com/openclaw/openclaw/issues/67244">#67244</a></td>
|
| 338 |
+
<td>Explicit ACP agent runs: embedded backend visibility failure and stale final JSON state after sessions_yield</td>
|
| 339 |
+
<td>0.2500</td><td>0.2500</td><td class="neutral">+0.0000</td>
|
| 340 |
+
<td><span class="pill">acpx</span> <span class="pill">acp</span></td>
|
| 341 |
+
<td><span class="pill">acp</span> <span class="pill">sessions</span></td>
|
| 342 |
+
<td><span class="pill">acp</span> <span class="pill">sessions</span></td>
|
| 343 |
+
<td><span class="pill">sessions</span><br><span class="miss">FN <span class="pill">acpx</span></span></td>
|
| 344 |
+
<td><span class="pill">sessions</span><br><span class="miss">FN <span class="pill">acpx</span></span></td>
|
| 345 |
+
<td></td>
|
| 346 |
+
</tr><tr class="tie">
|
| 347 |
+
<td><a href="https://github.com/openclaw/openclaw/issues/73910">#73910</a></td>
|
| 348 |
+
<td>BUG: OpenClaw-managed Codex ACP uses isolated CODEX_HOME without auth bridge and sends unsupported timeout config</td>
|
| 349 |
+
<td>0.3333</td><td>0.3333</td><td class="neutral">+0.0000</td>
|
| 350 |
+
<td><span class="pill">codex</span> <span class="pill">acp</span> <span class="pill">acpx</span> <span class="pill">auth_identity</span></td>
|
| 351 |
+
<td><span class="pill">acp</span> <span class="pill">codex</span></td>
|
| 352 |
+
<td><span class="pill">codex</span> <span class="pill">auth_identity</span></td>
|
| 353 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">acpx</span> <span class="pill">auth_identity</span></span></td>
|
| 354 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">acp</span> <span class="pill">acpx</span></span></td>
|
| 355 |
+
<td></td>
|
| 356 |
+
</tr><tr class="tie">
|
| 357 |
+
<td><a href="https://github.com/openclaw/openclaw/pull/80008">#80008</a></td>
|
| 358 |
+
<td>feat(plugins): expose ACP spawn and prompt in plugin runtime</td>
|
| 359 |
+
<td>0.2500</td><td>0.2500</td><td class="neutral">+0.0000</td>
|
| 360 |
+
<td><span class="pill">acp</span> <span class="pill">coding_agents</span></td>
|
| 361 |
+
<td><span class="pill">acp</span> <span class="pill">skills_plugins</span></td>
|
| 362 |
+
<td><span class="pill">acp</span> <span class="pill">skills_plugins</span></td>
|
| 363 |
+
<td><span class="pill">skills_plugins</span><br><span class="miss">FN <span class="pill">coding_agents</span></span></td>
|
| 364 |
+
<td><span class="pill">skills_plugins</span><br><span class="miss">FN <span class="pill">coding_agents</span></span></td>
|
| 365 |
+
<td></td>
|
| 366 |
+
</tr><tr class="tie">
|
| 367 |
+
<td><a href="https://github.com/openclaw/openclaw/issues/60979">#60979</a></td>
|
| 368 |
+
<td>feature: sessions_spawn ACP delivery to channel (stream output to Zulip/Discord topic)</td>
|
| 369 |
+
<td>0.5000</td><td>0.5000</td><td class="neutral">+0.0000</td>
|
| 370 |
+
<td><span class="pill">acp</span> <span class="pill">chat_integrations</span> <span class="pill">sessions</span></td>
|
| 371 |
+
<td><span class="pill">acp</span> <span class="pill">chat_integrations</span></td>
|
| 372 |
+
<td><span class="pill">acp</span> <span class="pill">chat_integrations</span></td>
|
| 373 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">sessions</span></span></td>
|
| 374 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">sessions</span></span></td>
|
| 375 |
+
<td></td>
|
| 376 |
+
</tr><tr class="tie">
|
| 377 |
+
<td><a href="https://github.com/openclaw/openclaw/issues/84715">#84715</a></td>
|
| 378 |
+
<td>[Bug]: @openclaw/codex peer link failure reproduced on 2026.5.19 after update</td>
|
| 379 |
+
<td>0.5000</td><td>0.5000</td><td class="neutral">+0.0000</td>
|
| 380 |
+
<td><span class="pill">codex</span> <span class="pill">packaging_deployment</span></td>
|
| 381 |
+
<td><span class="pill">codex</span></td>
|
| 382 |
+
<td><span class="pill">codex</span></td>
|
| 383 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">packaging_deployment</span></span></td>
|
| 384 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">packaging_deployment</span></span></td>
|
| 385 |
+
<td></td>
|
| 386 |
+
</tr><tr class="tie">
|
| 387 |
+
<td><a href="https://github.com/openclaw/openclaw/issues/84757">#84757</a></td>
|
| 388 |
+
<td>[Bug]: Telegram session can get stuck after compaction when encrypted reasoning content fails verification</td>
|
| 389 |
+
<td>0.5000</td><td>0.5000</td><td class="neutral">+0.0000</td>
|
| 390 |
+
<td><span class="pill">sessions</span> <span class="pill">chat_integrations</span> <span class="pill">reliability</span></td>
|
| 391 |
+
<td><span class="pill">chat_integrations</span> <span class="pill">sessions</span></td>
|
| 392 |
+
<td><span class="pill">chat_integrations</span> <span class="pill">sessions</span></td>
|
| 393 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">reliability</span></span></td>
|
| 394 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">reliability</span></span></td>
|
| 395 |
+
<td></td>
|
| 396 |
+
</tr><tr class="tie">
|
| 397 |
+
<td><a href="https://github.com/openclaw/openclaw/pull/56442">#56442</a></td>
|
| 398 |
+
<td>feat: Add opt-in ACP parent completion notify for sessions_spawn</td>
|
| 399 |
+
<td>0.5000</td><td>0.5000</td><td class="neutral">+0.0000</td>
|
| 400 |
+
<td><span class="pill">acp</span> <span class="pill">sessions</span> <span class="pill">agent_runtime</span></td>
|
| 401 |
+
<td><span class="pill">acp</span> <span class="pill">sessions</span></td>
|
| 402 |
+
<td><span class="pill">acp</span> <span class="pill">sessions</span></td>
|
| 403 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">agent_runtime</span></span></td>
|
| 404 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">agent_runtime</span></span></td>
|
| 405 |
+
<td></td>
|
| 406 |
+
</tr><tr class="tie">
|
| 407 |
+
<td><a href="https://github.com/openclaw/openclaw/pull/84763">#84763</a></td>
|
| 408 |
+
<td>fix(acpx): scrub provider credential env from ACP harness spawns</td>
|
| 409 |
+
<td>0.5000</td><td>0.5000</td><td class="neutral">+0.0000</td>
|
| 410 |
+
<td><span class="pill">acpx</span> <span class="pill">acp</span> <span class="pill">security</span></td>
|
| 411 |
+
<td><span class="pill">acp</span> <span class="pill">security</span></td>
|
| 412 |
+
<td><span class="pill">acpx</span> <span class="pill">security</span></td>
|
| 413 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">acpx</span></span></td>
|
| 414 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">acp</span></span></td>
|
| 415 |
+
<td></td>
|
| 416 |
+
</tr><tr class="tie">
|
| 417 |
+
<td><a href="https://github.com/openclaw/openclaw/pull/65364">#65364</a></td>
|
| 418 |
+
<td>feat(plugins): add registerProviderRuntimeAuthOverride API</td>
|
| 419 |
+
<td>0.2500</td><td>0.2500</td><td class="neutral">+0.0000</td>
|
| 420 |
+
<td><span class="pill">auth_identity</span> <span class="pill">api_surface</span></td>
|
| 421 |
+
<td><span class="pill">skills_plugins</span> <span class="pill">auth_identity</span></td>
|
| 422 |
+
<td><span class="pill">skills_plugins</span> <span class="pill">auth_identity</span></td>
|
| 423 |
+
<td><span class="pill">skills_plugins</span><br><span class="miss">FN <span class="pill">api_surface</span></span></td>
|
| 424 |
+
<td><span class="pill">skills_plugins</span><br><span class="miss">FN <span class="pill">api_surface</span></span></td>
|
| 425 |
+
<td></td>
|
| 426 |
+
</tr><tr class="tie">
|
| 427 |
+
<td><a href="https://github.com/openclaw/openclaw/pull/52747">#52747</a></td>
|
| 428 |
+
<td>fix(acp): time out stuck session lane tasks</td>
|
| 429 |
+
<td>0.5000</td><td>0.5000</td><td class="neutral">+0.0000</td>
|
| 430 |
+
<td><span class="pill">acp</span> <span class="pill">sessions</span> <span class="pill">reliability</span></td>
|
| 431 |
+
<td><span class="pill">acp</span> <span class="pill">sessions</span></td>
|
| 432 |
+
<td><span class="pill">acp</span> <span class="pill">reliability</span></td>
|
| 433 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">reliability</span></span></td>
|
| 434 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">sessions</span></span></td>
|
| 435 |
+
<td></td>
|
| 436 |
+
</tr><tr class="tie">
|
| 437 |
+
<td><a href="https://github.com/openclaw/openclaw/issues/10467">#10467</a></td>
|
| 438 |
+
<td>[Feature Request]: Multi-lane concurrency support for sub-agents via sessions_spawn</td>
|
| 439 |
+
<td>0.2000</td><td>0.2000</td><td class="neutral">+0.0000</td>
|
| 440 |
+
<td><span class="pill">queueing</span> <span class="pill">sessions</span> <span class="pill">coding_agents</span></td>
|
| 441 |
+
<td><span class="pill">queueing</span> <span class="pill">agent_runtime</span></td>
|
| 442 |
+
<td><span class="pill">queueing</span> <span class="pill">agent_runtime</span></td>
|
| 443 |
+
<td><span class="pill">agent_runtime</span><br><span class="miss">FN <span class="pill">sessions</span> <span class="pill">coding_agents</span></span></td>
|
| 444 |
+
<td><span class="pill">agent_runtime</span><br><span class="miss">FN <span class="pill">sessions</span> <span class="pill">coding_agents</span></span></td>
|
| 445 |
+
<td></td>
|
| 446 |
+
</tr><tr class="tie">
|
| 447 |
+
<td><a href="https://github.com/openclaw/openclaw/pull/45393">#45393</a></td>
|
| 448 |
+
<td>fix(errors): friendly message and last-message repair for tool_use/tool_result mismatch (#45385)</td>
|
| 449 |
+
<td>0.2000</td><td>0.2000</td><td class="neutral">+0.0000</td>
|
| 450 |
+
<td><span class="pill">tool_calling</span> <span class="pill">coding_agents</span> <span class="pill">reliability</span></td>
|
| 451 |
+
<td><span class="pill">tool_calling</span> <span class="pill">security</span></td>
|
| 452 |
+
<td><span class="pill">tool_calling</span> <span class="pill">security</span></td>
|
| 453 |
+
<td><span class="pill">security</span><br><span class="miss">FN <span class="pill">coding_agents</span> <span class="pill">reliability</span></span></td>
|
| 454 |
+
<td><span class="pill">security</span><br><span class="miss">FN <span class="pill">coding_agents</span> <span class="pill">reliability</span></span></td>
|
| 455 |
+
<td></td>
|
| 456 |
+
</tr><tr class="tie">
|
| 457 |
+
<td><a href="https://github.com/openclaw/openclaw/issues/84771">#84771</a></td>
|
| 458 |
+
<td>Event loop saturation during startup: synchronous model-prewarm and session-locks block event loop for 28-64 seconds</td>
|
| 459 |
+
<td>1.0000</td><td>1.0000</td><td class="neutral">+0.0000</td>
|
| 460 |
+
<td><span class="pill">reliability</span> <span class="pill">sessions</span></td>
|
| 461 |
+
<td><span class="pill">reliability</span> <span class="pill">sessions</span></td>
|
| 462 |
+
<td><span class="pill">sessions</span> <span class="pill">reliability</span></td>
|
| 463 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="muted">none</span></span></td>
|
| 464 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="muted">none</span></span></td>
|
| 465 |
+
<td></td>
|
| 466 |
+
</tr><tr class="tie">
|
| 467 |
+
<td><a href="https://github.com/openclaw/openclaw/issues/68187">#68187</a></td>
|
| 468 |
+
<td>SSE-backed MCP sessions can stay stale after server restart and fail with 'Session not found'</td>
|
| 469 |
+
<td>0.5000</td><td>0.5000</td><td class="neutral">+0.0000</td>
|
| 470 |
+
<td><span class="pill">mcp_tooling</span> <span class="pill">sessions</span> <span class="pill">gateway</span></td>
|
| 471 |
+
<td><span class="pill">mcp_tooling</span> <span class="pill">sessions</span></td>
|
| 472 |
+
<td><span class="pill">mcp_tooling</span> <span class="pill">sessions</span></td>
|
| 473 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">gateway</span></span></td>
|
| 474 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">gateway</span></span></td>
|
| 475 |
+
<td></td>
|
| 476 |
+
</tr><tr class="tie">
|
| 477 |
+
<td><a href="https://github.com/openclaw/openclaw/issues/52249">#52249</a></td>
|
| 478 |
+
<td>ACP parent session stuck until refresh when yielded waiting for child completion</td>
|
| 479 |
+
<td>0.5000</td><td>0.5000</td><td class="neutral">+0.0000</td>
|
| 480 |
+
<td><span class="pill">acp</span> <span class="pill">sessions</span> <span class="pill">reliability</span></td>
|
| 481 |
+
<td><span class="pill">acp</span> <span class="pill">sessions</span></td>
|
| 482 |
+
<td><span class="pill">acp</span> <span class="pill">sessions</span></td>
|
| 483 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">reliability</span></span></td>
|
| 484 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">reliability</span></span></td>
|
| 485 |
+
<td></td>
|
| 486 |
+
</tr><tr class="win">
|
| 487 |
+
<td><a href="https://github.com/openclaw/openclaw/issues/44202">#44202</a></td>
|
| 488 |
+
<td>[Bug]: local memory embeddings on Apple Silicon can crash gateway in ggml-metal / node-llama-cpp; need official Metal/GPU guidance</td>
|
| 489 |
+
<td>0.1429</td><td>0.2000</td><td class="good">+0.0571</td>
|
| 490 |
+
<td><span class="pill">local_models</span> <span class="pill">memory</span> <span class="pill">self_hosted_inference</span></td>
|
| 491 |
+
<td><span class="pill">memory</span> <span class="pill">gateway</span> <span class="pill">reliability</span></td>
|
| 492 |
+
<td><span class="pill">memory</span> <span class="pill">reliability</span></td>
|
| 493 |
+
<td><span class="pill">gateway</span> <span class="pill">reliability</span><br><span class="miss">FN <span class="pill">local_models</span> <span class="pill">self_hosted_inference</span></span></td>
|
| 494 |
+
<td><span class="pill">reliability</span><br><span class="miss">FN <span class="pill">local_models</span> <span class="pill">self_hosted_inference</span></span></td>
|
| 495 |
+
<td></td>
|
| 496 |
+
</tr><tr class="win">
|
| 497 |
+
<td><a href="https://github.com/openclaw/openclaw/pull/42027">#42027</a></td>
|
| 498 |
+
<td>fix: resolve exec PATH fallback, layered browser diagnostics, and cron force-run deadlock</td>
|
| 499 |
+
<td>0.1429</td><td>0.2500</td><td class="good">+0.1071</td>
|
| 500 |
+
<td><span class="pill">exec_tools</span> <span class="pill">browser_automation</span> <span class="pill">cron_automation</span></td>
|
| 501 |
+
<td><span class="pill">exec_tools</span> <span class="pill">gateway</span> <span class="pill">ui_tui</span></td>
|
| 502 |
+
<td><span class="pill">exec_tools</span> <span class="pill">cron_automation</span> <span class="pill">ui_tui</span></td>
|
| 503 |
+
<td><span class="pill">gateway</span> <span class="pill">ui_tui</span><br><span class="miss">FN <span class="pill">browser_automation</span> <span class="pill">cron_automation</span></span></td>
|
| 504 |
+
<td><span class="pill">ui_tui</span><br><span class="miss">FN <span class="pill">browser_automation</span></span></td>
|
| 505 |
+
<td></td>
|
| 506 |
+
</tr><tr class="win">
|
| 507 |
+
<td><a href="https://github.com/openclaw/openclaw/pull/43765">#43765</a></td>
|
| 508 |
+
<td>Improve runtime recovery for heartbeat, Feishu, and exec sessions</td>
|
| 509 |
+
<td>0.1429</td><td>0.2500</td><td class="good">+0.1071</td>
|
| 510 |
+
<td><span class="pill">reliability</span> <span class="pill">exec_tools</span> <span class="pill">cron_automation</span></td>
|
| 511 |
+
<td><span class="pill">reliability</span> <span class="pill">agent_runtime</span> <span class="pill">chat_integrations</span></td>
|
| 512 |
+
<td><span class="pill">reliability</span> <span class="pill">gateway</span> <span class="pill">exec_tools</span></td>
|
| 513 |
+
<td><span class="pill">agent_runtime</span> <span class="pill">chat_integrations</span><br><span class="miss">FN <span class="pill">exec_tools</span> <span class="pill">cron_automation</span></span></td>
|
| 514 |
+
<td><span class="pill">gateway</span><br><span class="miss">FN <span class="pill">cron_automation</span></span></td>
|
| 515 |
+
<td></td>
|
| 516 |
+
</tr><tr class="win">
|
| 517 |
+
<td><a href="https://github.com/openclaw/openclaw/issues/39248">#39248</a></td>
|
| 518 |
+
<td>Bug: sandbox.mode: "non-main" silently breaks sessions_spawn subagent initialization</td>
|
| 519 |
+
<td>0.2000</td><td>0.3333</td><td class="good">+0.1333</td>
|
| 520 |
+
<td><span class="pill">coding_agents</span> <span class="pill">sandboxing</span> <span class="pill">agent_runtime</span></td>
|
| 521 |
+
<td><span class="pill">sandboxing</span> <span class="pill">sessions</span></td>
|
| 522 |
+
<td><span class="pill">sandboxing</span></td>
|
| 523 |
+
<td><span class="pill">sessions</span><br><span class="miss">FN <span class="pill">coding_agents</span> <span class="pill">agent_runtime</span></span></td>
|
| 524 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">coding_agents</span> <span class="pill">agent_runtime</span></span></td>
|
| 525 |
+
<td></td>
|
| 526 |
+
</tr><tr class="win">
|
| 527 |
+
<td><a href="https://github.com/openclaw/openclaw/issues/51667">#51667</a></td>
|
| 528 |
+
<td>Feature: Native Audio Input for Omni-Modal Models (skip STT transcription)</td>
|
| 529 |
+
<td>0.3333</td><td>0.5000</td><td class="good">+0.1667</td>
|
| 530 |
+
<td><span class="pill">model_serving</span> <span class="pill">security</span> <span class="pill">config</span></td>
|
| 531 |
+
<td><span class="pill">model_serving</span></td>
|
| 532 |
+
<td><span class="pill">model_serving</span> <span class="pill">config</span></td>
|
| 533 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">security</span> <span class="pill">config</span></span></td>
|
| 534 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">security</span></span></td>
|
| 535 |
+
<td></td>
|
| 536 |
+
</tr><tr class="win">
|
| 537 |
+
<td><a href="https://github.com/openclaw/openclaw/pull/68725">#68725</a></td>
|
| 538 |
+
<td>feat(amazon-bedrock-mantle): add known context windows for open-weight Mantle models</td>
|
| 539 |
+
<td>0.2500</td><td>0.5000</td><td class="good">+0.2500</td>
|
| 540 |
+
<td><span class="pill">open_weight_models</span> <span class="pill">local_model_providers</span></td>
|
| 541 |
+
<td><span class="pill">open_weight_models</span> <span class="pill">model_serving</span></td>
|
| 542 |
+
<td><span class="pill">open_weight_models</span></td>
|
| 543 |
+
<td><span class="pill">model_serving</span><br><span class="miss">FN <span class="pill">local_model_providers</span></span></td>
|
| 544 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">local_model_providers</span></span></td>
|
| 545 |
+
<td></td>
|
| 546 |
+
</tr><tr class="win">
|
| 547 |
+
<td><a href="https://github.com/openclaw/openclaw/pull/43246">#43246</a></td>
|
| 548 |
+
<td>fix(message): deny same-provider cross-context sends by default [AI-assisted]</td>
|
| 549 |
+
<td>0.0000</td><td>0.2500</td><td class="good">+0.2500</td>
|
| 550 |
+
<td><span class="pill">tool_calling</span> <span class="pill">security</span></td>
|
| 551 |
+
<td><span class="muted">none</span></td>
|
| 552 |
+
<td><span class="pill">security</span> <span class="pill">config</span></td>
|
| 553 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="muted">none</span></span></td>
|
| 554 |
+
<td><span class="pill">config</span><br><span class="miss">FN <span class="pill">tool_calling</span></span></td>
|
| 555 |
+
<td>seed fail</td>
|
| 556 |
+
</tr><tr class="win">
|
| 557 |
+
<td><a href="https://github.com/openclaw/openclaw/issues/82507">#82507</a></td>
|
| 558 |
+
<td>[Feature]: ACPX Codex sandbox should inherit user-installed plugins (e.g. Superpowers)</td>
|
| 559 |
+
<td>0.2000</td><td>0.5000</td><td class="good">+0.3000</td>
|
| 560 |
+
<td><span class="pill">acpx</span> <span class="pill">codex</span> <span class="pill">skills_plugins</span></td>
|
| 561 |
+
<td><span class="pill">acpx</span> <span class="pill">security</span></td>
|
| 562 |
+
<td><span class="pill">acpx</span> <span class="pill">skills_plugins</span></td>
|
| 563 |
+
<td><span class="pill">security</span><br><span class="miss">FN <span class="pill">codex</span> <span class="pill">skills_plugins</span></span></td>
|
| 564 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">codex</span></span></td>
|
| 565 |
+
<td></td>
|
| 566 |
+
</tr><tr class="win">
|
| 567 |
+
<td><a href="https://github.com/openclaw/openclaw/issues/83863">#83863</a></td>
|
| 568 |
+
<td>ACP/Codex child tasks can be marked succeeded with progress-only output and no final deliverable</td>
|
| 569 |
+
<td>0.2000</td><td>0.5000</td><td class="good">+0.3000</td>
|
| 570 |
+
<td><span class="pill">acp</span> <span class="pill">codex</span> <span class="pill">agent_runtime</span></td>
|
| 571 |
+
<td><span class="pill">acp</span> <span class="pill">reliability</span></td>
|
| 572 |
+
<td><span class="pill">acp</span> <span class="pill">codex</span></td>
|
| 573 |
+
<td><span class="pill">reliability</span><br><span class="miss">FN <span class="pill">codex</span> <span class="pill">agent_runtime</span></span></td>
|
| 574 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">agent_runtime</span></span></td>
|
| 575 |
+
<td></td>
|
| 576 |
+
</tr><tr class="win">
|
| 577 |
+
<td><a href="https://github.com/openclaw/openclaw/issues/48580">#48580</a></td>
|
| 578 |
+
<td>Bug: sessions created by acpx codex sessions exit immediately - stdin is not a terminal</td>
|
| 579 |
+
<td>0.2000</td><td>0.5000</td><td class="good">+0.3000</td>
|
| 580 |
+
<td><span class="pill">acpx</span> <span class="pill">codex</span> <span class="pill">sessions</span></td>
|
| 581 |
+
<td><span class="pill">acp</span> <span class="pill">sessions</span></td>
|
| 582 |
+
<td><span class="pill">codex</span> <span class="pill">sessions</span></td>
|
| 583 |
+
<td><span class="pill">acp</span><br><span class="miss">FN <span class="pill">acpx</span> <span class="pill">codex</span></span></td>
|
| 584 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">acpx</span></span></td>
|
| 585 |
+
<td></td>
|
| 586 |
+
</tr><tr class="win">
|
| 587 |
+
<td><a href="https://github.com/openclaw/openclaw/pull/46552">#46552</a></td>
|
| 588 |
+
<td>docs(queue): clarify steer behavior with partial streaming and tool boundaries</td>
|
| 589 |
+
<td>0.5000</td><td>1.0000</td><td class="good">+0.5000</td>
|
| 590 |
+
<td><span class="pill">queueing</span> <span class="pill">docs</span></td>
|
| 591 |
+
<td><span class="pill">docs</span></td>
|
| 592 |
+
<td><span class="pill">docs</span> <span class="pill">queueing</span></td>
|
| 593 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">queueing</span></span></td>
|
| 594 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="muted">none</span></span></td>
|
| 595 |
+
<td></td>
|
| 596 |
+
</tr><tr class="win">
|
| 597 |
+
<td><a href="https://github.com/openclaw/openclaw/pull/63826">#63826</a></td>
|
| 598 |
+
<td>security: fix HIGH/CRITICAL vulns in skill scanner, SSRF, hook priority, and token verification</td>
|
| 599 |
+
<td>0.0000</td><td>0.5000</td><td class="good">+0.5000</td>
|
| 600 |
+
<td><span class="pill">security</span> <span class="pill">hooks</span> <span class="pill">skills_plugins</span></td>
|
| 601 |
+
<td><span class="muted">none</span></td>
|
| 602 |
+
<td><span class="pill">security</span> <span class="pill">hooks</span></td>
|
| 603 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="muted">none</span></span></td>
|
| 604 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">skills_plugins</span></span></td>
|
| 605 |
+
<td>seed fail</td>
|
| 606 |
+
</tr><tr class="win">
|
| 607 |
+
<td><a href="https://github.com/openclaw/openclaw/issues/70529">#70529</a></td>
|
| 608 |
+
<td>[Bug]: Desktop cannot use existing Chrome sessions: EasyClaw Google sign-in fails, and user profile attach fails with spawn npx ENOENT</td>
|
| 609 |
+
<td>0.5000</td><td>1.0000</td><td class="good">+0.5000</td>
|
| 610 |
+
<td><span class="pill">browser_automation</span> <span class="pill">packaging_deployment</span></td>
|
| 611 |
+
<td><span class="pill">browser_automation</span></td>
|
| 612 |
+
<td><span class="pill">browser_automation</span> <span class="pill">packaging_deployment</span></td>
|
| 613 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">packaging_deployment</span></span></td>
|
| 614 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="muted">none</span></span></td>
|
| 615 |
+
<td></td>
|
| 616 |
+
</tr><tr class="win">
|
| 617 |
+
<td><a href="https://github.com/openclaw/openclaw/pull/84752">#84752</a></td>
|
| 618 |
+
<td>fix: self-heal lane wedges + restore openai-codex OAuth on embedded path</td>
|
| 619 |
+
<td>0.0000</td><td>0.5000</td><td class="good">+0.5000</td>
|
| 620 |
+
<td><span class="pill">reliability</span> <span class="pill">auth_identity</span> <span class="pill">sessions</span></td>
|
| 621 |
+
<td><span class="muted">none</span></td>
|
| 622 |
+
<td><span class="pill">reliability</span> <span class="pill">auth_identity</span></td>
|
| 623 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="muted">none</span></span></td>
|
| 624 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">sessions</span></span></td>
|
| 625 |
+
<td>seed fail</td>
|
| 626 |
+
</tr><tr class="win">
|
| 627 |
+
<td><a href="https://github.com/openclaw/openclaw/issues/78528">#78528</a></td>
|
| 628 |
+
<td>Security: skill SecretRef API keys still leak into exec child environments</td>
|
| 629 |
+
<td>0.5000</td><td>1.0000</td><td class="good">+0.5000</td>
|
| 630 |
+
<td><span class="pill">security</span> <span class="pill">exec_tools</span> <span class="pill">skills_plugins</span></td>
|
| 631 |
+
<td><span class="pill">security</span> <span class="pill">exec_tools</span></td>
|
| 632 |
+
<td><span class="pill">security</span> <span class="pill">exec_tools</span> <span class="pill">skills_plugins</span></td>
|
| 633 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">skills_plugins</span></span></td>
|
| 634 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="muted">none</span></span></td>
|
| 635 |
+
<td></td>
|
| 636 |
+
</tr><tr class="win">
|
| 637 |
+
<td><a href="https://github.com/openclaw/openclaw/issues/74305">#74305</a></td>
|
| 638 |
+
<td>[Bug]: ACPX Codex worker fails when model/thinking overrides are configured</td>
|
| 639 |
+
<td>0.5000</td><td>1.0000</td><td class="good">+0.5000</td>
|
| 640 |
+
<td><span class="pill">acpx</span> <span class="pill">codex</span></td>
|
| 641 |
+
<td><span class="pill">acpx</span></td>
|
| 642 |
+
<td><span class="pill">acpx</span> <span class="pill">codex</span></td>
|
| 643 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="pill">codex</span></span></td>
|
| 644 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="muted">none</span></span></td>
|
| 645 |
+
<td></td>
|
| 646 |
+
</tr><tr class="win">
|
| 647 |
+
<td><a href="https://github.com/openclaw/openclaw/issues/79897">#79897</a></td>
|
| 648 |
+
<td>OpenAI-compatible streaming with llama.cpp saves zero usage (stream closed before final usage chunk)</td>
|
| 649 |
+
<td>0.2857</td><td>1.0000</td><td class="good">+0.7143</td>
|
| 650 |
+
<td><span class="pill">model_serving</span></td>
|
| 651 |
+
<td><span class="pill">telemetry_usage</span> <span class="pill">model_serving</span></td>
|
| 652 |
+
<td><span class="pill">model_serving</span></td>
|
| 653 |
+
<td><span class="pill">telemetry_usage</span><br><span class="miss">FN <span class="muted">none</span></span></td>
|
| 654 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="muted">none</span></span></td>
|
| 655 |
+
<td></td>
|
| 656 |
+
</tr><tr class="win">
|
| 657 |
+
<td><a href="https://github.com/openclaw/openclaw/pull/80479">#80479</a></td>
|
| 658 |
+
<td>feat(memory/embeddings): add openai-compatible provider for self-hosted servers (llama.cpp, Ollama, vLLM, TGI, LocalAI)</td>
|
| 659 |
+
<td>0.2500</td><td>1.0000</td><td class="good">+0.7500</td>
|
| 660 |
+
<td><span class="pill">self_hosted_inference</span> <span class="pill">memory</span></td>
|
| 661 |
+
<td><span class="pill">local_models</span> <span class="pill">self_hosted_inference</span></td>
|
| 662 |
+
<td><span class="pill">memory</span> <span class="pill">self_hosted_inference</span></td>
|
| 663 |
+
<td><span class="pill">local_models</span><br><span class="miss">FN <span class="pill">memory</span></span></td>
|
| 664 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="muted">none</span></span></td>
|
| 665 |
+
<td></td>
|
| 666 |
+
</tr><tr class="win">
|
| 667 |
+
<td><a href="https://github.com/openclaw/openclaw/issues/71216">#71216</a></td>
|
| 668 |
+
<td>Config schema: add `sandbox`, `routing.rules`, `instances`, and `gateway.nodes.denyPaths`</td>
|
| 669 |
+
<td>0.2500</td><td>1.0000</td><td class="good">+0.7500</td>
|
| 670 |
+
<td><span class="pill">config</span> <span class="pill">sandboxing</span> <span class="pill">gateway</span></td>
|
| 671 |
+
<td><span class="pill">sandboxing</span> <span class="pill">model_serving</span> <span class="pill">gateway</span></td>
|
| 672 |
+
<td><span class="pill">config</span> <span class="pill">sandboxing</span> <span class="pill">gateway</span></td>
|
| 673 |
+
<td><span class="pill">model_serving</span><br><span class="miss">FN <span class="pill">config</span></span></td>
|
| 674 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="muted">none</span></span></td>
|
| 675 |
+
<td></td>
|
| 676 |
+
</tr><tr class="win">
|
| 677 |
+
<td><a href="https://github.com/openclaw/openclaw/issues/84789">#84789</a></td>
|
| 678 |
+
<td>Active memory crashes on Telegram forum topic sessions (dirName validation)</td>
|
| 679 |
+
<td>0.2500</td><td>1.0000</td><td class="good">+0.7500</td>
|
| 680 |
+
<td><span class="pill">memory</span> <span class="pill">sessions</span></td>
|
| 681 |
+
<td><span class="pill">memory</span> <span class="pill">chat_integrations</span></td>
|
| 682 |
+
<td><span class="pill">memory</span> <span class="pill">sessions</span></td>
|
| 683 |
+
<td><span class="pill">chat_integrations</span><br><span class="miss">FN <span class="pill">sessions</span></span></td>
|
| 684 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="muted">none</span></span></td>
|
| 685 |
+
<td></td>
|
| 686 |
+
</tr><tr class="win">
|
| 687 |
+
<td><a href="https://github.com/openclaw/openclaw/pull/80783">#80783</a></td>
|
| 688 |
+
<td>Policy: add model, network, and MCP conformance checks</td>
|
| 689 |
+
<td>0.2000</td><td>1.0000</td><td class="good">+0.8000</td>
|
| 690 |
+
<td><span class="pill">mcp_tooling</span> <span class="pill">config</span> <span class="pill">security</span></td>
|
| 691 |
+
<td><span class="pill">mcp_tooling</span> <span class="pill">local_model_providers</span></td>
|
| 692 |
+
<td><span class="pill">mcp_tooling</span> <span class="pill">security</span> <span class="pill">config</span></td>
|
| 693 |
+
<td><span class="pill">local_model_providers</span><br><span class="miss">FN <span class="pill">config</span> <span class="pill">security</span></span></td>
|
| 694 |
+
<td><span class="muted">none</span><br><span class="miss">FN <span class="muted">none</span></span></td>
|
| 695 |
+
<td></td>
|
| 696 |
+
</tr></tbody></table></div>
|
| 697 |
+
</section>
|
| 698 |
+
<section id="topics">
|
| 699 |
+
<h2>Per-Topic TP/FP/FN and F1</h2>
|
| 700 |
+
<p class="muted">Cells in TP/FP/FN order. Rows are sorted by F1 delta, worst first.</p>
|
| 701 |
+
<div class="scroll"><table><thead><tr><th>Topic</th><th>Seed TP/FP/FN</th><th>Seed P</th><th>Seed R</th><th>Seed F1</th><th>GEPA TP/FP/FN</th><th>GEPA P</th><th>GEPA R</th><th>GEPA F1</th><th>ΔF1</th><th>ΔFP</th><th>ΔFN</th></tr></thead><tbody><tr class="bad"><td>local_models</td>
|
| 702 |
+
<td>2/1/1</td><td>0.6667</td><td>0.6667</td><td>0.6667</td>
|
| 703 |
+
<td>0/0/3</td><td>0.0000</td><td>0.0000</td><td>0.0000</td>
|
| 704 |
+
<td class="bad">-0.6667</td><td>-1</td><td>+2</td></tr><tr class="bad"><td>telemetry_usage</td>
|
| 705 |
+
<td>1/1/0</td><td>0.5000</td><td>1.0000</td><td>0.6667</td>
|
| 706 |
+
<td>0/0/1</td><td>0.0000</td><td>0.0000</td><td>0.0000</td>
|
| 707 |
+
<td class="bad">-0.6667</td><td>-1</td><td>+1</td></tr><tr class="bad"><td>tool_calling</td>
|
| 708 |
+
<td>2/0/0</td><td>1.0000</td><td>1.0000</td><td>1.0000</td>
|
| 709 |
+
<td>1/0/2</td><td>1.0000</td><td>0.3333</td><td>0.5000</td>
|
| 710 |
+
<td class="bad">-0.5000</td><td>+0</td><td>+2</td></tr><tr class="bad"><td>reliability</td>
|
| 711 |
+
<td>9/2/6</td><td>0.8182</td><td>0.6000</td><td>0.6923</td>
|
| 712 |
+
<td>5/1/11</td><td>0.8333</td><td>0.3125</td><td>0.4545</td>
|
| 713 |
+
<td class="bad">-0.2378</td><td>-1</td><td>+5</td></tr><tr class="bad"><td>agent_runtime</td>
|
| 714 |
+
<td>1/3/4</td><td>0.2500</td><td>0.2000</td><td>0.2222</td>
|
| 715 |
+
<td>0/2/5</td><td>0.0000</td><td>0.0000</td><td>0.0000</td>
|
| 716 |
+
<td class="bad">-0.2222</td><td>-1</td><td>+1</td></tr><tr class="bad"><td>ui_tui</td>
|
| 717 |
+
<td>1/1/0</td><td>0.5000</td><td>1.0000</td><td>0.6667</td>
|
| 718 |
+
<td>1/2/0</td><td>0.3333</td><td>1.0000</td><td>0.5000</td>
|
| 719 |
+
<td class="bad">-0.1667</td><td>+1</td><td>+0</td></tr><tr class="bad"><td>gateway</td>
|
| 720 |
+
<td>4/2/1</td><td>0.6667</td><td>0.8000</td><td>0.7273</td>
|
| 721 |
+
<td>4/4/1</td><td>0.5000</td><td>0.8000</td><td>0.6154</td>
|
| 722 |
+
<td class="bad">-0.1119</td><td>+2</td><td>+0</td></tr><tr class="bad"><td>sessions</td>
|
| 723 |
+
<td>14/2/4</td><td>0.8750</td><td>0.7778</td><td>0.8235</td>
|
| 724 |
+
<td>12/1/7</td><td>0.9231</td><td>0.6316</td><td>0.7500</td>
|
| 725 |
+
<td class="bad">-0.0735</td><td>-1</td><td>+3</td></tr><tr class="bad"><td>acp</td>
|
| 726 |
+
<td>13/1/0</td><td>0.9286</td><td>1.0000</td><td>0.9630</td>
|
| 727 |
+
<td>11/0/2</td><td>1.0000</td><td>0.8462</td><td>0.9167</td>
|
| 728 |
+
<td class="bad">-0.0463</td><td>-1</td><td>+2</td></tr><tr class="neutral"><td>local_model_providers</td>
|
| 729 |
+
<td>0/1/3</td><td>0.0000</td><td>0.0000</td><td>0.0000</td>
|
| 730 |
+
<td>0/0/3</td><td>0.0000</td><td>0.0000</td><td>0.0000</td>
|
| 731 |
+
<td class="neutral">+0.0000</td><td>-1</td><td>+0</td></tr><tr class="neutral"><td>model_releases</td>
|
| 732 |
+
<td>0/1/0</td><td>0.0000</td><td>0.0000</td><td>0.0000</td>
|
| 733 |
+
<td>0/0/0</td><td>0.0000</td><td>0.0000</td><td>0.0000</td>
|
| 734 |
+
<td class="neutral">+0.0000</td><td>-1</td><td>+0</td></tr><tr class="neutral"><td>notifications</td>
|
| 735 |
+
<td>0/0/1</td><td>0.0000</td><td>0.0000</td><td>0.0000</td>
|
| 736 |
+
<td>0/1/1</td><td>0.0000</td><td>0.0000</td><td>0.0000</td>
|
| 737 |
+
<td class="neutral">+0.0000</td><td>+1</td><td>+0</td></tr><tr class="neutral"><td>api_surface</td>
|
| 738 |
+
<td>0/0/2</td><td>0.0000</td><td>0.0000</td><td>0.0000</td>
|
| 739 |
+
<td>0/0/2</td><td>0.0000</td><td>0.0000</td><td>0.0000</td>
|
| 740 |
+
<td class="neutral">+0.0000</td><td>+0</td><td>+0</td></tr><tr class="neutral"><td>approvals</td>
|
| 741 |
+
<td>1/0/1</td><td>1.0000</td><td>0.5000</td><td>0.6667</td>
|
| 742 |
+
<td>1/0/1</td><td>1.0000</td><td>0.5000</td><td>0.6667</td>
|
| 743 |
+
<td class="neutral">+0.0000</td><td>+0</td><td>+0</td></tr><tr class="neutral"><td>browser_automation</td>
|
| 744 |
+
<td>1/0/1</td><td>1.0000</td><td>0.5000</td><td>0.6667</td>
|
| 745 |
+
<td>1/0/1</td><td>1.0000</td><td>0.5000</td><td>0.6667</td>
|
| 746 |
+
<td class="neutral">+0.0000</td><td>+0</td><td>+0</td></tr><tr class="neutral"><td>coding_agents</td>
|
| 747 |
+
<td>0/0/6</td><td>0.0000</td><td>0.0000</td><td>0.0000</td>
|
| 748 |
+
<td>0/0/6</td><td>0.0000</td><td>0.0000</td><td>0.0000</td>
|
| 749 |
+
<td class="neutral">+0.0000</td><td>+0</td><td>+0</td></tr><tr class="neutral"><td>docs</td>
|
| 750 |
+
<td>2/0/0</td><td>1.0000</td><td>1.0000</td><td>1.0000</td>
|
| 751 |
+
<td>2/0/0</td><td>1.0000</td><td>1.0000</td><td>1.0000</td>
|
| 752 |
+
<td class="neutral">+0.0000</td><td>+0</td><td>+0</td></tr><tr class="neutral"><td>mcp_tooling</td>
|
| 753 |
+
<td>3/0/0</td><td>1.0000</td><td>1.0000</td><td>1.0000</td>
|
| 754 |
+
<td>3/0/0</td><td>1.0000</td><td>1.0000</td><td>1.0000</td>
|
| 755 |
+
<td class="neutral">+0.0000</td><td>+0</td><td>+0</td></tr><tr class="neutral"><td>open_weight_models</td>
|
| 756 |
+
<td>1/0/0</td><td>1.0000</td><td>1.0000</td><td>1.0000</td>
|
| 757 |
+
<td>1/0/0</td><td>1.0000</td><td>1.0000</td><td>1.0000</td>
|
| 758 |
+
<td class="neutral">+0.0000</td><td>+0</td><td>+0</td></tr><tr class="neutral"><td>sandboxing</td>
|
| 759 |
+
<td>2/0/1</td><td>1.0000</td><td>0.6667</td><td>0.8000</td>
|
| 760 |
+
<td>2/0/1</td><td>1.0000</td><td>0.6667</td><td>0.8000</td>
|
| 761 |
+
<td class="neutral">+0.0000</td><td>+0</td><td>+0</td></tr><tr class="neutral"><td>self_hosted_inference</td>
|
| 762 |
+
<td>1/0/2</td><td>1.0000</td><td>0.3333</td><td>0.5000</td>
|
| 763 |
+
<td>1/0/2</td><td>1.0000</td><td>0.3333</td><td>0.5000</td>
|
| 764 |
+
<td class="neutral">+0.0000</td><td>+0</td><td>+0</td></tr><tr class="good"><td>chat_integrations</td>
|
| 765 |
+
<td>4/3/0</td><td>0.5714</td><td>1.0000</td><td>0.7273</td>
|
| 766 |
+
<td>3/0/1</td><td>1.0000</td><td>0.7500</td><td>0.8571</td>
|
| 767 |
+
<td class="good">+0.1299</td><td>-3</td><td>+1</td></tr><tr class="good"><td>memory</td>
|
| 768 |
+
<td>3/0/2</td><td>1.0000</td><td>0.6000</td><td>0.7500</td>
|
| 769 |
+
<td>4/0/1</td><td>1.0000</td><td>0.8000</td><td>0.8889</td>
|
| 770 |
+
<td class="good">+0.1389</td><td>+0</td><td>-1</td></tr><tr class="good"><td>exec_tools</td>
|
| 771 |
+
<td>3/0/1</td><td>1.0000</td><td>0.7500</td><td>0.8571</td>
|
| 772 |
+
<td>4/0/0</td><td>1.0000</td><td>1.0000</td><td>1.0000</td>
|
| 773 |
+
<td class="good">+0.1429</td><td>+0</td><td>-1</td></tr><tr class="good"><td>acpx</td>
|
| 774 |
+
<td>2/0/5</td><td>1.0000</td><td>0.2857</td><td>0.4444</td>
|
| 775 |
+
<td>3/0/4</td><td>1.0000</td><td>0.4286</td><td>0.6000</td>
|
| 776 |
+
<td class="good">+0.1556</td><td>+0</td><td>-1</td></tr><tr class="good"><td>security</td>
|
| 777 |
+
<td>3/4/2</td><td>0.4286</td><td>0.6000</td><td>0.5000</td>
|
| 778 |
+
<td>6/3/1</td><td>0.6667</td><td>0.8571</td><td>0.7500</td>
|
| 779 |
+
<td class="good">+0.2500</td><td>-1</td><td>-1</td></tr><tr class="good"><td>model_serving</td>
|
| 780 |
+
<td>4/3/0</td><td>0.5714</td><td>1.0000</td><td>0.7273</td>
|
| 781 |
+
<td>4/0/0</td><td>1.0000</td><td>1.0000</td><td>1.0000</td>
|
| 782 |
+
<td class="good">+0.2727</td><td>-3</td><td>+0</td></tr><tr class="good"><td>codex</td>
|
| 783 |
+
<td>3/0/4</td><td>1.0000</td><td>0.4286</td><td>0.6000</td>
|
| 784 |
+
<td>6/0/1</td><td>1.0000</td><td>0.8571</td><td>0.9231</td>
|
| 785 |
+
<td class="good">+0.3231</td><td>+0</td><td>-3</td></tr><tr class="good"><td>auth_identity</td>
|
| 786 |
+
<td>1/0/1</td><td>1.0000</td><td>0.5000</td><td>0.6667</td>
|
| 787 |
+
<td>3/0/0</td><td>1.0000</td><td>1.0000</td><td>1.0000</td>
|
| 788 |
+
<td class="good">+0.3333</td><td>+0</td><td>-1</td></tr><tr class="good"><td>queueing</td>
|
| 789 |
+
<td>1/0/1</td><td>1.0000</td><td>0.5000</td><td>0.6667</td>
|
| 790 |
+
<td>2/0/0</td><td>1.0000</td><td>1.0000</td><td>1.0000</td>
|
| 791 |
+
<td class="good">+0.3333</td><td>+0</td><td>-1</td></tr><tr class="good"><td>cron_automation</td>
|
| 792 |
+
<td>1/1/3</td><td>0.5000</td><td>0.2500</td><td>0.3333</td>
|
| 793 |
+
<td>3/1/1</td><td>0.7500</td><td>0.7500</td><td>0.7500</td>
|
| 794 |
+
<td class="good">+0.4167</td><td>+0</td><td>-2</td></tr><tr class="good"><td>hooks</td>
|
| 795 |
+
<td>0/0/1</td><td>0.0000</td><td>0.0000</td><td>0.0000</td>
|
| 796 |
+
<td>1/1/1</td><td>0.5000</td><td>0.5000</td><td>0.5000</td>
|
| 797 |
+
<td class="good">+0.5000</td><td>+1</td><td>+0</td></tr><tr class="good"><td>packaging_deployment</td>
|
| 798 |
+
<td>0/1/2</td><td>0.0000</td><td>0.0000</td><td>0.0000</td>
|
| 799 |
+
<td>1/1/1</td><td>0.5000</td><td>0.5000</td><td>0.5000</td>
|
| 800 |
+
<td class="good">+0.5000</td><td>+0</td><td>-1</td></tr><tr class="good"><td>skills_plugins</td>
|
| 801 |
+
<td>0/2/2</td><td>0.0000</td><td>0.0000</td><td>0.0000</td>
|
| 802 |
+
<td>2/2/1</td><td>0.5000</td><td>0.6667</td><td>0.5714</td>
|
| 803 |
+
<td class="good">+0.5714</td><td>+0</td><td>-1</td></tr><tr class="good"><td>config</td>
|
| 804 |
+
<td>0/0/3</td><td>0.0000</td><td>0.0000</td><td>0.0000</td>
|
| 805 |
+
<td>3/4/0</td><td>0.4286</td><td>1.0000</td><td>0.6000</td>
|
| 806 |
+
<td class="good">+0.6000</td><td>+4</td><td>-3</td></tr></tbody></table></div>
|
| 807 |
+
</section>
|
| 808 |
+
<section id="failures">
|
| 809 |
+
<h2>Structural Failures</h2>
|
| 810 |
+
<div class="scroll"><table><thead><tr><th>Candidate</th><th>Row</th><th>Title</th><th>Error</th></tr></thead><tbody><tr><td>Seed</td><td>openclaw-openclaw-63826</td><td>security: fix HIGH/CRITICAL vulns in skill scanner, SSRF, hook priority, and token verification</td><td><code>classifier exit 2: prompt: /home/bob/.local/state/localpager/classifier/prompts/20260612T201105Z-3532713.md
|
| 811 |
+
schema: /home/bob/.local/state/localpager/classifier/schemas/20260612T201105Z-3532713.json
|
| 812 |
+
session: /home/bob/.local/state/localpager/classifier/sessions/20260612T201105Z-3532713
|
| 813 |
+
localpager-agent: final_json was not called; no structured output was captured</code></td></tr><tr><td>Seed</td><td>openclaw-openclaw-84752</td><td>fix: self-heal lane wedges + restore openai-codex OAuth on embedded path</td><td><code>classifier exit 2: prompt: /home/bob/.local/state/localpager/classifier/prompts/20260612T201357Z-3534749.md
|
| 814 |
+
schema: /home/bob/.local/state/localpager/classifier/schemas/20260612T201357Z-3534749.json
|
| 815 |
+
session: /home/bob/.local/state/localpager/classifier/sessions/20260612T201357Z-3534749
|
| 816 |
+
localpager-agent: final_json was not called; no structured output was captured</code></td></tr><tr><td>Seed</td><td>openclaw-openclaw-43246</td><td>fix(message): deny same-provider cross-context sends by default [AI-assisted]</td><td><code>classifier exit 2: prompt: /home/bob/.local/state/localpager/classifier/prompts/20260612T204409Z-3555993.md
|
| 817 |
+
schema: /home/bob/.local/state/localpager/classifier/schemas/20260612T204409Z-3555993.json
|
| 818 |
+
session: /home/bob/.local/state/localpager/classifier/sessions/20260612T204409Z-3555993
|
| 819 |
+
localpager-agent: final_json was not called; no structured output was captured</code></td></tr></tbody></table></div>
|
| 820 |
+
</section>
|
| 821 |
+
<section id="gepa">
|
| 822 |
+
<h2>GEPA Run Artifacts</h2>
|
| 823 |
+
<ul class="links"><li><a href="../gepa-12b-six-20260612T190217Z/candidate_tree.html">Best current 12B six candidate tree</a></li><li><a href="../gepa-12b-twelve-from-six-iter-20260612T192815Z/candidate_tree.html">Later 12B continuation candidate tree</a></li><li><a href="../gepa-e4b-smoke-20260612T184748Z/candidate_tree.html">E4B smoke candidate tree</a></li></ul>
|
| 824 |
+
<h3>Selected prompt artifacts</h3>
|
| 825 |
+
<ul>
|
| 826 |
+
<li><a href="artifacts/2026-06-13-gepa-12b-six-best.prompt.md">tracked assembled prompt</a></li>
|
| 827 |
+
<li><a href="artifacts/2026-06-13-gepa-12b-six-best.routing_policy.md">tracked routing policy</a></li>
|
| 828 |
+
<li><a href="../gepa-12b-six-20260612T190217Z/best.prompt.md">generated best prompt</a></li>
|
| 829 |
+
<li><a href="../gepa-12b-six-20260612T190217Z/best.routing_policy.md">generated best routing policy</a></li>
|
| 830 |
+
</ul>
|
| 831 |
+
</section>
|
| 832 |
+
<section id="raw">
|
| 833 |
+
<h2>Raw Evaluation JSON</h2>
|
| 834 |
+
<ul class="links"><li><a href="../compare-12b-e4b-smoke-20260612T1856/e4b-smoke-best-12b.json">E4B smoke candidate rows 1-6</a></li><li><a href="../gepa-12b-six-20260612T190217Z/best-reeval-12b-train6.json">GEPA six rows 1-6</a></li><li><a href="../validation-12b-rows19-30-20260613/gepa-12b-six-best-offset12-limit6.json">GEPA six rows 13-18</a></li><li><a href="../validation-12b-rows19-30-20260613/gepa-12b-six-best-offset18-limit12.json">GEPA six rows 19-30</a></li><li><a href="../validation-12b-rows31-60-20260613/gepa-12b-six-best-offset30-limit30.json">GEPA six rows 31-60</a></li><li><a href="../compare-12b-gepa-six-holdout-20260612T1914/gepa-12b-six-best-12b-offset6-limit6.json">GEPA six rows 7-12</a></li><li><a href="../gepa-12b-twelve-from-six-iter-20260612T192815Z/best-reeval-12b-train12.json">GEPA twelve rows 1-12</a></li><li><a href="../compare-12b-gepa-twelve-holdout-20260612T1944/gepa-12b-twelve-best-12b-offset12-limit6.json">GEPA twelve rows 13-18</a></li><li><a href="../validation-12b-rows19-30-20260613/gepa-12b-twelve-best-offset18-limit12.json">GEPA twelve rows 19-30</a></li><li><a href="../compare-12b-e4b-smoke-20260612T1856/seed-12b.json">Seed rows 1-6</a></li><li><a href="../compare-12b-gepa-twelve-holdout-20260612T1944/seed-12b-offset12-limit6.json">Seed rows 13-18</a></li><li><a href="../validation-12b-rows19-30-20260613/seed-12b-offset18-limit12.json">Seed rows 19-30</a></li><li><a href="../validation-12b-rows31-60-20260613/seed-12b-offset30-limit30.json">Seed rows 31-60</a></li><li><a href="../compare-12b-gepa-six-holdout-20260612T1914/seed-12b-offset6-limit6.json">Seed rows 7-12</a></li></ul>
|
| 835 |
+
</section>
|
| 836 |
+
</main>
|
| 837 |
+
</body>
|
| 838 |
+
</html>
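The precision, recall, and F1 columns in the Per-Topic TP/FP/FN table above can be reproduced from the raw counts. The following is a minimal sketch, not the report generator's actual code: the function name `prf` is hypothetical, and the rule that an empty denominator yields 0 is an assumption inferred from rows such as `model_releases` (0/0/0 reported as 0.0000).

```javascript
// Hypothetical helper: derive precision, recall, and F1 from TP/FP/FN counts.
// Assumption: an empty denominator yields 0, matching rows like 0/0/0 -> 0.0000.
function prf(tp, fp, fn) {
  const precision = tp + fp > 0 ? tp / (tp + fp) : 0;
  const recall = tp + fn > 0 ? tp / (tp + fn) : 0;
  const f1 =
    precision + recall > 0
      ? (2 * precision * recall) / (precision + recall)
      : 0;
  return { precision, recall, f1 };
}

// Worked example using the "reliability" row: Seed 9/2/6, GEPA 5/1/11.
const seed = prf(9, 2, 6);
const gepa = prf(5, 1, 11);
console.log(
  seed.f1.toFixed(4),
  gepa.f1.toFixed(4),
  (gepa.f1 - seed.f1).toFixed(4)
); // prints "0.6923 0.4545 -0.2378"
```

These values agree with the reliability row (Seed F1 0.6923, GEPA F1 0.4545, ΔF1 -0.2378), which suggests the table uses the standard definitions.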
|
dashboard-20260613-gepa/live-gepa-iteration-data.json
ADDED
|
@@ -0,0 +1,29 @@
|
| 1 |
+
{
|
| 2 |
+
"generated_at": "2026-06-13T05:35:08.830332+00:00",
|
| 3 |
+
"run_dir": "prompt-optimizer/out/gepa-12b-multi-from-six-20260613T051216Z",
|
| 4 |
+
"status": "running",
|
| 5 |
+
"candidate_scores": [
|
| 6 |
+
{
|
| 7 |
+
"iteration": 0,
|
| 8 |
+
"candidate": 0,
|
| 9 |
+
"score": 0.4972222222222222,
|
| 10 |
+
"kind": "base"
|
| 11 |
+
},
|
| 12 |
+
{
|
| 13 |
+
"iteration": 1,
|
| 14 |
+
"candidate": 1,
|
| 15 |
+
"score": 0.5380952380952381,
|
| 16 |
+
"kind": "candidate"
|
| 17 |
+
}
|
| 18 |
+
],
|
| 19 |
+
"proposal_events": [
|
| 20 |
+
{
|
| 21 |
+
"iteration": 1,
|
| 22 |
+
"old_subsample_sum": 2.0357142857142856,
|
| 23 |
+
"new_subsample_sum": 4.0,
|
| 24 |
+
"accepted_for_full_eval": true
|
| 25 |
+
}
|
| 26 |
+
],
|
| 27 |
+
"current_selected_candidate": 1,
|
| 28 |
+
"run_log_bytes": 7569
|
| 29 |
+
}
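As a reading aid for the JSON above, here is a minimal sketch of how its fields could be summarized. It is not the dashboard's actual code: the function name `summarize` is hypothetical, the delta is assumed to be `new_subsample_sum - old_subsample_sum`, and picking the highest `score` as the selected candidate is an assumption that happens to agree with `current_selected_candidate` in this snapshot.

```javascript
// Subset of the snapshot shown above (values copied from the JSON).
const liveData = {
  candidate_scores: [
    { iteration: 0, candidate: 0, score: 0.4972222222222222, kind: "base" },
    { iteration: 1, candidate: 1, score: 0.5380952380952381, kind: "candidate" },
  ],
  proposal_events: [
    {
      iteration: 1,
      old_subsample_sum: 2.0357142857142856,
      new_subsample_sum: 4.0,
      accepted_for_full_eval: true,
    },
  ],
  current_selected_candidate: 1,
};

// Hypothetical helper: find the best-scoring candidate and per-event deltas.
function summarize(data) {
  // Assumption: "best" means the highest full-validation score.
  const best = data.candidate_scores.reduce((a, b) => (b.score > a.score ? b : a));
  // Assumption: delta is new subsample sum minus old subsample sum.
  const events = data.proposal_events.map((e) => ({
    iteration: e.iteration,
    delta: e.new_subsample_sum - e.old_subsample_sum,
    accepted: e.accepted_for_full_eval,
  }));
  return { best, events };
}

const summary = summarize(liveData);
console.log(summary.best.candidate, summary.events[0].delta.toFixed(4)); // prints "1 1.9643"
```

Note that the subsample sums are totals over a minibatch rather than averages, so they are not directly comparable to the full-validation `score` values.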
|
dashboard-20260613-gepa/live-iterations.html
ADDED
|
@@ -0,0 +1,121 @@
|
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width,initial-scale=1">
<title>Live GEPA Iteration Scores</title>
<style>
body{margin:0;background:#f5f6f8;color:#1f2733;font-family:-apple-system,BlinkMacSystemFont,"Segoe UI",Roboto,Arial,sans-serif;line-height:1.45}
header{background:#fff;border-bottom:1px solid #d8dee8;padding:24px 32px}
main{max-width:1080px;margin:0 auto;padding:24px 32px 48px}
section{background:#fff;border:1px solid #d8dee8;border-radius:8px;padding:20px;margin-bottom:20px}
h1{margin:0 0 8px;font-size:26px} h2{margin:0 0 12px;font-size:20px}
.muted{color:#677180} .status{font-variant-numeric:tabular-nums}
.chart-wrap{overflow-x:auto;border:1px solid #d8dee8;border-radius:8px;background:#fff}
svg{min-width:880px;width:100%;height:auto;display:block}
.axis{fill:#596273;font-size:12px} .axis.label{font-size:13px;fill:#273244} .point{fill:#1f2733;font-size:12px;font-weight:700}
table{border-collapse:collapse;width:100%;font-size:13px} th,td{border-bottom:1px solid #d8dee8;padding:8px 9px;text-align:left;vertical-align:top} th{background:#f0f3f8} .num{text-align:right;font-variant-numeric:tabular-nums}
a{color:#124fba} code{font-family:ui-monospace,SFMono-Regular,Consolas,monospace;font-size:12px}
.good{color:#0f7a45;font-weight:650} .bad{color:#b42318;font-weight:650}
</style>
</head>
<body>
<header>
<h1>Live GEPA Iteration Scores</h1>
<div class="muted status" id="status">Loading run log...</div>
</header>
<main>
<section>
<h2>Current Run</h2>
<p><code>prompt-optimizer/out/gepa-12b-proper-from-best-20260613T055906Z</code></p>
<p class="muted">Live properly budgeted 12B GEPA run continuing from the best candidate found so far. Runtime settings: row_limit=18, concurrency=2, max_tokens=1536, max_metric_calls=240. This page fetches <code>run_log.txt</code> every 30 seconds and redraws the graph.</p>
</section>
<section>
<h2>Validation Score By Candidate</h2>
<div class="chart-wrap" id="chart"></div>
<table><thead><tr><th>GEPA iteration</th><th>Candidate</th><th class="num">18-row validation score</th><th>Kind</th></tr></thead><tbody id="scoreRows"></tbody></table>
</section>
<section>
<h2>Proposal Events</h2>
<table><thead><tr><th>GEPA iteration</th><th class="num">Old subsample sum</th><th class="num">New subsample sum</th><th class="num">Delta</th><th>Decision</th></tr></thead><tbody id="eventRows"></tbody></table>
</section>
<section>
<h2>Raw</h2>
<p><a href="../gepa-12b-proper-from-best-20260613T055906Z/run_log.txt">run_log.txt</a> · <a href="iterations.html">static iteration dashboard</a> · <a href="index.html">summary dashboard</a></p>
</section>
</main>
<script>
const runLogUrl = '../gepa-12b-proper-from-best-20260613T055906Z/run_log.txt';
// Regex fragment for a decimal number. The backslash is doubled so that the
// string passed to RegExp contains "\." (a literal dot), not "." (any character).
const num = '([0-9]+(?:\\.[0-9]+)?)';
function fmt(x) { return Number(x).toFixed(4); }
// Escape text for safe insertion into innerHTML.
function esc(s) { return String(s).replace(/[&<>"']/g, c => ({'&':'&amp;','<':'&lt;','>':'&gt;','"':'&quot;',"'":'&#39;'}[c])); }
// Extract validation scores, proposal events, and the last selected program from the GEPA log text.
function parseLog(text) {
  const scores = [];
  const events = [];
  let pending = null;
  let selected = null;
  for (const line of text.split(/\n/)) {
    let m = line.match(new RegExp('Iteration 0: Base program full valset score: ' + num));
    if (m) { scores.push({iteration:0,candidate:0,score:Number(m[1]),kind:'base'}); continue; }
    m = line.match(new RegExp('Iteration ([0-9]+): New subsample score ' + num + ' is better than old score ' + num));
    if (m) { events.push({iteration:Number(m[1]), old:Number(m[3]), next:Number(m[2]), accepted:true}); continue; }
    m = line.match(new RegExp('Iteration ([0-9]+): New subsample score ' + num + ' is not better than old score ' + num));
    if (m) { events.push({iteration:Number(m[1]), old:Number(m[3]), next:Number(m[2]), accepted:false}); continue; }
    m = line.match(new RegExp('Iteration ([0-9]+): Valset score for new program: ' + num));
    if (m) { pending = {iteration:Number(m[1]), score:Number(m[2])}; continue; }
    // A valset score only becomes a chart point once the matching candidate index line arrives.
    m = line.match(/Iteration ([0-9]+): New program candidate index: ([0-9]+)/);
    if (m && pending && pending.iteration === Number(m[1])) { scores.push({iteration:pending.iteration,candidate:Number(m[2]),score:pending.score,kind:'candidate'}); pending = null; continue; }
    m = line.match(new RegExp('Iteration ([0-9]+): Selected program ([0-9]+) score: ' + num));
    if (m) { selected = Number(m[2]); continue; }
  }
  return {scores, events, selected, bytes: new Blob([text]).size};
}
// Build the SVG line chart (candidate index on x, validation score on y) as an HTML string.
function renderChart(points) {
  if (!points.length) return '<p class="muted">No validation points yet.</p>';
  const width = 880, height = 320;
  const m = {l:60,r:24,t:28,b:54};
  const pw = width - m.l - m.r, ph = height - m.t - m.b;
  const xs = points.map(p => p.candidate), ys = points.map(p => p.score);
  let xmin = Math.min(...xs), xmax = Math.max(...xs); if (xmin === xmax) xmax = xmin + 1;
  let ymin = Math.max(0, Math.min(...ys) - 0.05), ymax = Math.min(1, Math.max(...ys) + 0.05); if (ymin === ymax) ymax = ymin + 0.1;
  const sx = x => m.l + (x - xmin) / (xmax - xmin) * pw;
  const sy = y => m.t + (ymax - y) / (ymax - ymin) * ph;
  let out = `<svg viewBox="0 0 ${width} ${height}" role="img" aria-label="Live GEPA candidate score chart"><rect width="100%" height="100%" fill="#fff"/>`;
  for (let i=0; i<5; i++) {
    const y = ymin + (ymax-ymin)*i/4, yy = sy(y);
    out += `<line x1="${m.l}" y1="${yy.toFixed(1)}" x2="${m.l+pw}" y2="${yy.toFixed(1)}" stroke="#e4e9f1"/>`;
    out += `<text x="${m.l-10}" y="${(yy+4).toFixed(1)}" text-anchor="end" class="axis">${y.toFixed(2)}</text>`;
  }
  for (const p of points) out += `<text x="${sx(p.candidate).toFixed(1)}" y="${m.t+ph+24}" text-anchor="middle" class="axis">${p.candidate}</text>`;
  if (points.length > 1) {
    const d = points.map((p,i) => `${i === 0 ? 'M' : 'L'}${sx(p.candidate).toFixed(1)},${sy(p.score).toFixed(1)}`).join(' ');
    out += `<path d="${d}" fill="none" stroke="#1463d9" stroke-width="3" stroke-linejoin="round"/>`;
  }
  for (const p of points) {
    const x = sx(p.candidate), y = sy(p.score);
    out += `<circle cx="${x.toFixed(1)}" cy="${y.toFixed(1)}" r="6" fill="#1463d9" stroke="#fff" stroke-width="2"/>`;
    out += `<text x="${x.toFixed(1)}" y="${(y-12).toFixed(1)}" text-anchor="middle" class="point">${fmt(p.score)}</text>`;
  }
  out += `<text x="${(m.l+pw/2).toFixed(1)}" y="${height-12}" text-anchor="middle" class="axis label">candidate index</text></svg>`;
  return out;
}
function render(parsed) {
  document.getElementById('chart').innerHTML = renderChart(parsed.scores);
  document.getElementById('scoreRows').innerHTML = parsed.scores.map(p => `<tr><td>${p.iteration}</td><td>${p.candidate}</td><td class="num">${fmt(p.score)}</td><td>${esc(p.kind)}</td></tr>`).join('');
  document.getElementById('eventRows').innerHTML = parsed.events.map(e => `<tr><td>${e.iteration}</td><td class="num">${fmt(e.old)}</td><td class="num">${fmt(e.next)}</td><td class="num ${e.next-e.old >= 0 ? 'good' : 'bad'}">${fmt(e.next-e.old)}</td><td>${e.accepted ? 'accepted for full eval' : 'rejected'}</td></tr>`).join('');
  document.getElementById('status').textContent = `Last refresh: ${new Date().toLocaleTimeString()}; log bytes: ${parsed.bytes}; selected candidate: ${parsed.selected ?? 'n/a'}`;
}
async function refresh() {
  try {
    // The timestamp query string and cache:'no-store' both guard against a stale cached log.
    const res = await fetch(runLogUrl + '?t=' + Date.now(), {cache:'no-store'});
    if (!res.ok) throw new Error(res.status + ' ' + res.statusText);
    render(parseLog(await res.text()));
  } catch (err) {
    document.getElementById('status').textContent = 'Failed to refresh run log: ' + err;
  }
}
refresh();
setInterval(refresh, 30000);
</script>
</body>
</html>
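The log parsing in the page above can be sanity-checked outside the browser. The following is a minimal sketch, not the page's own code: `parseSample` is a hypothetical, trimmed-down variant of `parseLog`, and the log excerpt is invented for illustration. Its wording follows the patterns the page matches, but the numbers do not come from any real `run_log.txt`.

```javascript
// Hypothetical log excerpt: line wording mirrors the patterns matched by the
// page's parseLog(); the numeric values are made up for this example.
const sampleLog = [
  'Iteration 0: Base program full valset score: 0.5',
  'Iteration 1: Selected program 0 score: 0.5',
  'Iteration 1: New subsample score 2.5 is better than old score 2.0',
  'Iteration 1: Valset score for new program: 0.625',
  'Iteration 1: New program candidate index: 1',
  'Iteration 2: New subsample score 1.5 is not better than old score 2.0',
].join('\n');

// Decimal-number fragment; the doubled backslash keeps "\." literal in the RegExp.
const NUM = '([0-9]+(?:\\.[0-9]+)?)';

// Trimmed-down parser (assumption: same line formats as the dashboard expects).
function parseSample(text) {
  const scores = [];
  const events = [];
  let pending = null;
  for (const line of text.split('\n')) {
    let m = line.match(new RegExp('Iteration 0: Base program full valset score: ' + NUM));
    if (m) { scores.push({ iteration: 0, candidate: 0, score: Number(m[1]) }); continue; }
    // One pattern covers both outcomes; the optional "not " group marks a rejection.
    m = line.match(new RegExp('Iteration ([0-9]+): New subsample score ' + NUM + ' is (not )?better than old score ' + NUM));
    if (m) { events.push({ iteration: Number(m[1]), next: Number(m[2]), old: Number(m[4]), accepted: !m[3] }); continue; }
    m = line.match(new RegExp('Iteration ([0-9]+): Valset score for new program: ' + NUM));
    if (m) { pending = { iteration: Number(m[1]), score: Number(m[2]) }; continue; }
    m = line.match(/Iteration ([0-9]+): New program candidate index: ([0-9]+)/);
    if (m && pending && pending.iteration === Number(m[1])) {
      scores.push({ iteration: pending.iteration, candidate: Number(m[2]), score: pending.score });
      pending = null;
    }
  }
  return { scores, events };
}

const parsedSample = parseSample(sampleLog);
console.log(parsedSample.scores.length, 'score points;', parsedSample.events.length, 'proposal events');
```

Merging the accepted and rejected patterns into one regex is a choice made for this sketch only; the page keeps them as two separate matches.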
dashboard-20260613-gepa/summary.json
ADDED
|
@@ -0,0 +1,51 @@
{
  "dashboard": "prompt-optimizer/out/dashboard-20260613-gepa/index.html",
  "generated_at": "2026-06-13T03:00:34.089711+00:00",
  "losses": 13,
  "summaries": {
    "gepa_six": {
      "empty_pred": 0,
      "exact": 12,
      "f1": 0.6818181818181819,
      "failures": 0,
      "fn": 61,
      "fp": 23,
      "mean": 0.4911904761904762,
      "n": 60,
      "over": 1,
      "precision": 0.7964601769911505,
      "recall": 0.5960264900662252,
      "tp": 90
    },
    "gepa_twelve_1_30": {
      "empty_pred": 2,
      "exact": 8,
      "f1": 0.7317073170731708,
      "failures": 2,
      "fn": 17,
      "fp": 16,
      "mean": 0.480018315018315,
      "n": 30,
      "over": 6,
      "precision": 0.7377049180327869,
      "recall": 0.7258064516129032,
      "tp": 45
    },
    "seed": {
      "empty_pred": 3,
      "exact": 10,
      "f1": 0.6509803921568628,
      "failures": 3,
      "fn": 60,
      "fp": 29,
      "mean": 0.43222222222222223,
      "n": 60,
      "over": 2,
      "precision": 0.7410714285714286,
      "recall": 0.5804195804195804,
      "tp": 83
    }
  },
  "ties": 26,
  "wins": 21
}
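The `precision`, `recall`, and `f1` fields in `summary.json` appear to be micro-averaged values derived from the `tp`, `fp`, and `fn` counts. The sketch below recomputes them under that assumption; `microMetrics` is a hypothetical helper written for this check, not part of the report tooling, and it has no guard against zero denominators.

```javascript
// Micro-averaged precision, recall and F1 from aggregate label counts.
function microMetrics({ tp, fp, fn }) {
  const precision = tp / (tp + fp);
  const recall = tp / (tp + fn);
  // Equivalent to the harmonic mean 2PR / (P + R), written in terms of counts.
  const f1 = (2 * tp) / (2 * tp + fp + fn);
  return { precision, recall, f1 };
}

// Counts copied from the three entries under "summaries" in summary.json.
const summaries = {
  gepa_six: { tp: 90, fp: 23, fn: 61 },
  gepa_twelve_1_30: { tp: 45, fp: 16, fn: 17 },
  seed: { tp: 83, fp: 29, fn: 60 },
};

for (const [name, counts] of Object.entries(summaries)) {
  const { precision, recall, f1 } = microMetrics(counts);
  console.log(name, precision.toFixed(4), recall.toFixed(4), f1.toFixed(4));
}
```

Note that `tp + fn` is 151 for `gepa_six` but 143 for `seed`, even though both report `n: 60`, so the two recall values are not computed over the same number of gold labels.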
final-cardinality-report.html
ADDED
|
@@ -0,0 +1,107 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Localpager GEPA Final Report</title>
<style>
:root { --bg:#f6f8fb; --panel:#fff; --ink:#172033; --muted:#667085; --line:#d8dee9; --blue:#185abc; --green:#137333; --red:#b42318; --amber:#ad5f00; }
* { box-sizing:border-box; }
body { margin:0; background:var(--bg); color:var(--ink); font-family:-apple-system,BlinkMacSystemFont,"Segoe UI",Roboto,Arial,sans-serif; line-height:1.45; }
header { background:#fff; border-bottom:1px solid var(--line); padding:28px 32px 20px; position:sticky; top:0; z-index:2; }
h1 { margin:0 0 8px; font-size:28px; }
main { max-width:1280px; margin:0 auto; padding:24px 32px 52px; }
nav { display:flex; gap:10px; flex-wrap:wrap; margin-top:14px; }
a { color:#0b57d0; text-decoration:none; }
nav a, .button { border:1px solid #c9dcff; background:#eef4ff; border-radius:7px; padding:7px 10px; }
section { background:var(--panel); border:1px solid var(--line); border-radius:8px; padding:20px; margin-bottom:20px; }
h2 { margin:0 0 14px; font-size:20px; }
h3 { margin:18px 0 10px; font-size:16px; }
.gridcards { display:grid; grid-template-columns:repeat(auto-fit,minmax(180px,1fr)); gap:12px; }
.card { border:1px solid var(--line); border-radius:8px; padding:14px; background:#fbfcfe; min-height:104px; }
.card .label { color:var(--muted); font-size:12px; text-transform:uppercase; letter-spacing:.04em; }
.card .value { margin-top:8px; font-size:26px; font-weight:750; }
.card .sub { margin-top:5px; color:var(--muted); font-size:13px; }
.good { color:var(--green); font-weight:700; } .bad { color:var(--red); font-weight:700; } .neutral,.muted { color:var(--muted); }
.scroll { overflow:auto; border:1px solid var(--line); border-radius:8px; }
table { width:100%; border-collapse:collapse; font-size:13px; }
th,td { border-bottom:1px solid var(--line); padding:8px 9px; text-align:left; vertical-align:top; }
th { background:#f0f3f8; color:#384152; position:sticky; top:92px; z-index:1; vertical-align:middle; }
tr.selected { background:#eefbf2; }
code { font-family:ui-monospace,SFMono-Regular,Consolas,monospace; font-size:12px; }
.chart-wrap { border:1px solid var(--line); border-radius:8px; background:#fff; overflow-x:auto; }
svg { min-width:900px; width:100%; height:auto; display:block; }
.grid { stroke:#e5eaf2; } .tick { fill:#667085; font-size:12px; } .label { font-size:11px; } .point { fill:#1f2937; font-size:12px; font-weight:700; } .axis-label { fill:#334155; font-size:13px; }
.note { border-left:4px solid var(--amber); padding:10px 12px; background:#fff8eb; color:#4a2d00; }
@media (max-width:800px) { header { position:static; } th { position:static; } main,header { padding-left:18px; padding-right:18px; } }
</style>
</head>
<body>
<header>
<h1>Localpager GEPA Final Report</h1>
<div class="muted">Updated 2026-06-14T13:35:45Z. Dataset: Shaun ordered 60-row set. Gold labels: canonical <code>ds4.jsonl</code>. Task model: local 12B. Concurrency: 2.</div>
<nav><a href="#summary">Summary</a><a href="#trajectory">Score Trajectory</a><a href="#metrics">Metrics</a><a href="#gates">Gates</a><a href="#misses">Remaining Misses</a><a href="gepa-12b-row30-prop20-continuation-20260614T021448Z/score_report.html">GEPA Iteration Graph</a><a href="dashboard-20260613-gepa/legacy-gepa-six-dashboard.html">Legacy Dashboard</a><a href="prompt-diffs/index.html">Prompt Diffs</a></nav>
</header>
<main>
<section id="summary">
<h2>Promoted Result</h2>
<div class="gridcards"><div class="card"><div class="label">Final mean score</div><div class="value good">0.6233</div><div class="sub">delta 0.1911 vs v9.1</div></div><div class="card"><div class="label">Micro-F1</div><div class="value good">0.7985</div><div class="sub">target &gt;= 0.7000</div></div><div class="card"><div class="label">Precision / recall</div><div class="value good">0.9375 / 0.6954</div><div class="sub">FP weighted harder than FN</div></div><div class="card"><div class="label">False positives</div><div class="value good">7</div><div class="sub">target &lt;= 20</div></div><div class="card"><div class="label">False negatives</div><div class="value good">46</div><div class="sub">target &lt;= 58</div></div><div class="card"><div class="label">Over-label events</div><div class="value good">0</div><div class="sub">target 0 or 1</div></div><div class="card"><div class="label">Structural failures</div><div class="value good">0</div><div class="sub">target 0</div></div><div class="card"><div class="label">Mean predicted labels</div><div class="value good">1.8667</div><div class="sub">same as v9.1 baseline</div></div></div>
<p class="note">The promoted v3 candidate clears the strict gates and keeps mean predicted labels at the v9.1 baseline. The main caveat is scientific, not mechanical: the manual repair used mistakes from this same 60-row set, so a fresh holdout is still needed before treating it as deployment evidence.</p>
</section>
<section id="trajectory">
<h2>60-Row Score Trajectory</h2>
<div class="chart-wrap"><svg viewBox="0 0 920 310" role="img" aria-label="60-row score trajectory">
<rect width="100%" height="100%" fill="#fff"/>
<line x1="64" y1="244.0" x2="892" y2="244.0" class="grid"/>
<text x="54" y="248.0" text-anchor="end" class="tick">0.38</text>
<line x1="64" y1="189.0" x2="892" y2="189.0" class="grid"/>
<text x="54" y="193.0" text-anchor="end" class="tick">0.48</text>
<line x1="64" y1="134.0" x2="892" y2="134.0" class="grid"/>
<text x="54" y="138.0" text-anchor="end" class="tick">0.59</text>
<line x1="64" y1="79.0" x2="892" y2="79.0" class="grid"/>
<text x="54" y="83.0" text-anchor="end" class="tick">0.70</text>
<line x1="64" y1="24.0" x2="892" y2="24.0" class="grid"/>
<text x="54" y="28.0" text-anchor="end" class="tick">0.80</text>
<polyline points="64.0,216.6 229.6,185.8 395.2,162.5 560.8,132.1 726.4,38.2 892.0,116.5" fill="none" stroke="#185abc" stroke-width="3" stroke-linecap="round" stroke-linejoin="round"/>
<circle cx="64.0" cy="216.6" r="6" fill="#185abc" stroke="#fff" stroke-width="2"><title>v9.1 seed 0.4322</title></circle>
<text x="64.0" y="204.6" text-anchor="middle" class="point">0.4322</text>
<text x="64.0" y="274" text-anchor="middle" class="tick label">v9.1 seed</text>
<circle cx="229.6" cy="185.8" r="6" fill="#185abc" stroke="#fff" stroke-width="2"><title>GEPA six 0.4912</title></circle>
<text x="229.6" y="173.8" text-anchor="middle" class="point">0.4912</text>
<text x="229.6" y="274" text-anchor="middle" class="tick label">GEPA six</text>
<circle cx="395.2" cy="162.5" r="6" fill="#185abc" stroke="#fff" stroke-width="2"><title>Previous proper best 0.5355</title></circle>
<text x="395.2" y="150.5" text-anchor="middle" class="point">0.5355</text>
<text x="395.2" y="274" text-anchor="middle" class="tick label">Previous proper best</text>
<circle cx="560.8" cy="132.1" r="6" fill="#185abc" stroke="#fff" stroke-width="2"><title>Prop20 best 0.5936</title></circle>
<text x="560.8" y="120.1" text-anchor="middle" class="point">0.5936</text>
<text x="560.8" y="274" text-anchor="middle" class="tick label">Prop20 best</text>
<circle cx="726.4" cy="38.2" r="6" fill="#185abc" stroke="#fff" stroke-width="2"><title>Hardcase repair v2 0.7729</title></circle>
<text x="726.4" y="26.2" text-anchor="middle" class="point">0.7729</text>
<text x="726.4" y="274" text-anchor="middle" class="tick label">Hardcase repair v2</text>
<circle cx="892.0" cy="116.5" r="6" fill="#15803d" stroke="#fff" stroke-width="2"><title>Cardinality repair v3 0.6233</title></circle>
<text x="892.0" y="104.5" text-anchor="middle" class="point">0.6233</text>
<text x="892.0" y="274" text-anchor="middle" class="tick label">Cardinality repair v3</text>
<text x="478.0" y="300" text-anchor="middle" class="axis-label">candidate timeline</text>
</svg></div>
</section>
<section id="metrics">
<h2>Candidate Metrics</h2>
<div class="scroll"><table><thead><tr><th>Candidate</th><th>Mean</th><th>Delta</th><th>Precision</th><th>Recall</th><th>F1</th><th>FP</th><th>FN</th><th>Over</th><th>Struct</th><th>Exact</th><th>Mean Labels</th><th>Label Delta</th><th>Source</th></tr></thead><tbody><tr class=''><td><b>v9.1 seed</b><br><span class='muted'>baseline</span></td><td>0.4322</td><td class='neutral'>0.0000</td><td>0.7411</td><td>0.5804</td><td>0.6510</td><td>29</td><td>60</td><td>2</td><td>3</td><td>10</td><td>1.8667</td><td>0.0000</td><td><a href='dashboard-20260613-gepa/summary.json'>source</a></td></tr><tr class=''><td><b>GEPA six</b><br><span class='muted'>early GEPA</span></td><td>0.4912</td><td class='good'>0.0590</td><td>0.7965</td><td>0.5960</td><td>0.6818</td><td>23</td><td>61</td><td>1</td><td>0</td><td>12</td><td>n/a</td><td></td><td><a href='dashboard-20260613-gepa/summary.json'>source</a></td></tr><tr class=''><td><b>Previous proper best</b><br><span class='muted'>proper GEPA</span></td><td>0.5355</td><td class='good'>0.1033</td><td>0.7778</td><td>0.6490</td><td>0.7076</td><td>28</td><td>53</td><td>5</td><td>2</td><td>21</td><td>2.1000</td><td>0.2333</td><td><a href='validation-12b-proper-best-20260613T111155Z/gepa-12b-proper-best-limit60.json'>source</a></td></tr><tr class=''><td><b>Prop20 best</b><br><span class='muted'>proper GEPA</span></td><td>0.5936</td><td class='good'>0.1613</td><td>0.8707</td><td>0.6689</td><td>0.7566</td><td>15</td><td>50</td><td>2</td><td>4</td><td>24</td><td>1.9333</td><td>0.0667</td><td><a href='validation-12b-row30-prop16-best-20260614T081931Z/gepa-12b-row30-prop16-best-limit60.json'>source</a></td></tr><tr class=''><td><b>Hardcase repair v2</b><br><span class='muted'>manual repair, rejected</span></td><td>0.7729</td><td class='good'>0.3407</td><td>0.9343</td><td>0.8477</td><td>0.8889</td><td>9</td><td>23</td><td>4</td><td>0</td><td>38</td><td>2.2833</td><td>0.4167</td><td><a 
href='validation-manual-hardcase-repair-v2-max4096-20260614T100359Z/manual-hardcase-repair-v2-max4096-limit60.json'>source</a></td></tr><tr class='selected'><td><b>Cardinality repair v3</b><br><span class='muted'>promoted</span></td><td>0.6233</td><td class='good'>0.1911</td><td>0.9375</td><td>0.6954</td><td>0.7985</td><td>7</td><td>46</td><td>0</td><td>0</td><td>19</td><td>1.8667</td><td>0.0000</td><td><a href='validation-manual-hardcase-repair-v3-cardinality-max4096-20260614T103814Z/manual-hardcase-repair-v3-cardinality-max4096-limit60.json'>source</a></td></tr></tbody></table></div>
</section>
<section id="gates">
<h2>Strict Gate Check</h2>
<div class="scroll"><table><thead><tr><th>Gate</th><th>Target</th><th>Observed</th><th>Result</th></tr></thead><tbody><tr><td>mean weighted score</td><td>&gt;= 0.5400</td><td>0.6233</td><td class='good'>pass</td></tr><tr><td>score delta vs v9.1</td><td>&gt;= +0.1000</td><td>0.1911</td><td class='good'>pass</td></tr><tr><td>micro-F1</td><td>&gt;= 0.7000</td><td>0.7985</td><td class='good'>pass</td></tr><tr><td>precision</td><td>&gt;= 0.8000</td><td>0.9375</td><td class='good'>pass</td></tr><tr><td>recall</td><td>&gt;= 0.6100</td><td>0.6954</td><td class='good'>pass</td></tr><tr><td>false positives</td><td>&lt;= 20</td><td>7</td><td class='good'>pass</td></tr><tr><td>false negatives</td><td>&lt;= 58</td><td>46</td><td class='good'>pass</td></tr><tr><td>over-label events</td><td>0 or 1</td><td>0</td><td class='good'>pass</td></tr><tr><td>structural failures</td><td>0</td><td>0</td><td class='good'>pass</td></tr><tr><td>exact matches</td><td>&gt;= 15</td><td>19</td><td class='good'>pass</td></tr><tr><td>mean predicted-label delta</td><td>&lt;= +0.10</td><td>0.0000</td><td class='good'>pass</td></tr></tbody></table></div>
</section>
<section id="links">
<h2>Updated Links</h2>
<p><a class="button" href="gepa-12b-row30-prop20-continuation-20260614T021448Z/score_report.html">Prop20 GEPA score report</a> <a class="button" href="gepa-12b-row30-prop20-continuation-20260614T021448Z/candidate_tree.html">Prop20 candidate tree</a> <a class="button" href="dashboard-20260613-gepa/legacy-gepa-six-dashboard.html">Legacy GEPA-six dashboard</a> <a class="button" href="prompt-diffs/index.html">Prompt diffs</a></p>
<p class="muted">Promoted routing-policy SHA-256: <code>b2576ca027148e109a1e72029c192ea9e26be486508671ec7c11025ac80f948b</code>.</p>
</section>
<section id="misses">
<h2>Remaining v3 Misses</h2>
<p class="muted">Rows where v3 is not an exact match. These are the next useful targets for a clean holdout-aware follow-up.</p>
<div class="scroll"><table><thead><tr><th>Target</th><th>Title</th><th>Score</th><th>Gold</th><th>Predicted</th><th>FP</th><th>FN</th></tr></thead><tbody><tr><td><a href='https://github.com/openclaw/openclaw/pull/44379'>#44379</a></td><td>fix(pi-runner): harden context-overflow recovery with one suppress-hook retry</td><td>0.1667</td><td>coding_agents, memory, hooks, reliability</td><td>agent_runtime, reliability</td><td>agent_runtime</td><td>coding_agents, memory, hooks</td></tr><tr><td><a href='https://github.com/openclaw/openclaw/pull/45393'>#45393</a></td><td>fix(errors): friendly message and last-message repair for tool_use/tool_result mismatch (#45385)</td><td>0.2000</td><td>tool_calling, coding_agents, reliability</td><td>tool_calling, agent_runtime</td><td>agent_runtime</td><td>coding_agents, reliability</td></tr><tr><td><a href='https://github.com/openclaw/openclaw/pull/47083'>#47083</a></td><td>fix: respect totalTokensFresh flag to avoid showing stale token counts</td><td>0.2000</td><td>sessions, telemetry_usage</td><td>ui_tui</td><td>ui_tui</td><td>sessions, telemetry_usage</td></tr><tr><td><a href='https://github.com/openclaw/openclaw/pull/81957'>#81957</a></td><td>ci: harden GitHub Actions supply-chain boundaries</td><td>0.2500</td><td>security</td><td>packaging_deployment</td><td>packaging_deployment</td><td>security</td></tr><tr><td><a href='https://github.com/openclaw/openclaw/pull/65364'>#65364</a></td><td>feat(plugins): add registerProviderRuntimeAuthOverride API</td><td>0.2500</td><td>auth_identity, api_surface</td><td>skills_plugins, auth_identity</td><td>skills_plugins</td><td>api_surface</td></tr><tr><td><a href='https://github.com/openclaw/openclaw/pull/80008'>#80008</a></td><td>feat(plugins): expose ACP spawn and prompt in plugin runtime</td><td>0.2500</td><td>acp, coding_agents</td><td>skills_plugins, acp</td><td>skills_plugins</td><td>coding_agents</td></tr><tr><td><a 
href='https://github.com/openclaw/openclaw/issues/90146'>#90146</a></td><td>google-vertex: Missing gemini-3.1-flash-lite in provider catalog causes silent failure instead of error</td><td>0.2500</td><td>local_model_providers, reliability</td><td>local_model_providers, model_serving</td><td>model_serving</td><td>reliability</td></tr><tr><td><a href='https://github.com/openclaw/openclaw/issues/73910'>#73910</a></td><td>BUG: OpenClaw-managed Codex ACP uses isolated CODEX_HOME without auth bridge and sends unsupported timeout config</td><td>0.3333</td><td>codex, acp, acpx, auth_identity</td><td>codex, acp</td><td>none</td><td>acpx, auth_identity</td></tr><tr><td><a href='https://github.com/openclaw/openclaw/issues/52249'>#52249</a></td><td>ACP parent session stuck until refresh when yielded waiting for child completion</td><td>0.5000</td><td>acp, sessions, reliability</td><td>acp, sessions</td><td>none</td><td>reliability</td></tr><tr><td><a href='https://github.com/openclaw/openclaw/issues/83863'>#83863</a></td><td>ACP/Codex child tasks can be marked succeeded with progress-only output and no final deliverable</td><td>0.5000</td><td>acp, codex, agent_runtime</td><td>acp, codex</td><td>none</td><td>agent_runtime</td></tr><tr><td><a href='https://github.com/openclaw/openclaw/pull/48940'>#48940</a></td><td>ACP: add gateway-owned node-backed runtime</td><td>0.5000</td><td>acp, gateway, agent_runtime</td><td>acp, gateway</td><td>none</td><td>agent_runtime</td></tr><tr><td><a href='https://github.com/openclaw/openclaw/issues/48580'>#48580</a></td><td>Bug: acpx codex sessions 创建的会话立即退出 - stdin is not a terminal</td><td>0.5000</td><td>acpx, codex, sessions</td><td>acpx, codex</td><td>none</td><td>sessions</td></tr><tr><td><a href='https://github.com/openclaw/openclaw/issues/39248'>#39248</a></td><td>Bug: sandbox.mode: "non-main" silently breaks sessions_spawn subagent initialization</td><td>0.5000</td><td>coding_agents, sandboxing, agent_runtime</td><td>sandboxing, 
agent_runtime</td><td>none</td><td>coding_agents</td></tr><tr><td><a href='https://github.com/openclaw/openclaw/issues/71216'>#71216</a></td><td>Config schema: add `sandbox`, `routing.rules`, `instances`, and `gateway.nodes.denyPaths`</td><td>0.5000</td><td>config, sandboxing, gateway</td><td>config, gateway</td><td>none</td><td>sandboxing</td></tr><tr><td><a href='https://github.com/openclaw/openclaw/issues/84477'>#84477</a></td><td>Discord embedded-run prep wedge before strict-agentic, recovery skips sessionId=unknown lanes</td><td>0.5000</td><td>sessions, agent_runtime, reliability</td><td>sessions, reliability</td><td>none</td><td>agent_runtime</td></tr><tr><td><a href='https://github.com/openclaw/openclaw/issues/51667'>#51667</a></td><td>Feature: Native Audio Input for Omni-Modal Models (skip STT transcription)</td><td>0.5000</td><td>model_serving, security, config</td><td>model_serving, config</td><td>none</td><td>security</td></tr><tr><td><a href='https://github.com/openclaw/openclaw/pull/43765'>#43765</a></td><td>Improve runtime recovery for heartbeat, Feishu, and exec sessions</td><td>0.5000</td><td>reliability, exec_tools, cron_automation</td><td>reliability, exec_tools</td><td>none</td><td>cron_automation</td></tr><tr><td><a href='https://github.com/openclaw/openclaw/pull/80783'>#80783</a></td><td>Policy: add model, network, and MCP conformance checks</td><td>0.5000</td><td>mcp_tooling, config, security</td><td>config, mcp_tooling</td><td>none</td><td>security</td></tr><tr><td><a href='https://github.com/openclaw/openclaw/issues/68187'>#68187</a></td><td>SSE-backed MCP sessions can stay stale after server restart and fail with 'Session not found'</td><td>0.5000</td><td>mcp_tooling, sessions, gateway</td><td>mcp_tooling, sessions</td><td>none</td><td>gateway</td></tr><tr><td><a href='https://github.com/openclaw/openclaw/issues/78528'>#78528</a></td><td>Security: skill SecretRef API keys still leak into exec child 
environments</td><td>0.5000</td><td>security, exec_tools, skills_plugins</td><td>security, exec_tools</td><td>none</td><td>skills_plugins</td></tr><tr><td><a href='https://github.com/openclaw/openclaw/issues/84715'>#84715</a></td><td>[Bug]: @openclaw/codex peer link failure reproduced on 2026.5.19 after update</td><td>0.5000</td><td>codex, packaging_deployment</td><td>codex</td><td>none</td><td>packaging_deployment</td></tr><tr><td><a href='https://github.com/openclaw/openclaw/issues/70529'>#70529</a></td><td>[Bug]: Desktop cannot use existing Chrome sessions: EasyClaw Google sign-in fails, and user profile attach fails with spawn npx ENOENT</td><td>0.5000</td><td>browser_automation, packaging_deployment</td><td>browser_automation</td><td>none</td><td>packaging_deployment</td></tr><tr><td><a href='https://github.com/openclaw/openclaw/issues/84757'>#84757</a></td><td>[Bug]: Telegram session can get stuck after compaction when encrypted reasoning content fails verification</td><td>0.5000</td><td>sessions, chat_integrations, reliability</td><td>sessions, chat_integrations</td><td>none</td><td>reliability</td></tr><tr><td><a href='https://github.com/openclaw/openclaw/issues/44202'>#44202</a></td><td>[Bug]: local memory embeddings on Apple Silicon can crash gateway in ggml-metal / node-llama-cpp; need official Metal/GPU guidance</td><td>0.5000</td><td>local_models, memory, self_hosted_inference</td><td>local_models, memory</td><td>none</td><td>self_hosted_inference</td></tr><tr><td><a href='https://github.com/openclaw/openclaw/issues/10467'>#10467</a></td><td>[Feature Request]: Multi-lane concurrency support for sub-agents via sessions_spawn</td><td>0.5000</td><td>queueing, sessions, coding_agents</td><td>queueing, sessions</td><td>none</td><td>coding_agents</td></tr><tr><td><a href='https://github.com/openclaw/openclaw/issues/82507'>#82507</a></td><td>[Feature]: ACPX Codex sandbox should inherit user-installed plugins (e.g. 
Superpowers)</td><td>0.5000</td><td>acpx, codex, skills_plugins</td><td>acpx, skills_plugins</td><td>none</td><td>codex</td></tr><tr><td><a href='https://github.com/openclaw/openclaw/issues/40332'>#40332</a></td><td>[Feature]: Per-binding and per-agent permissionMode for ACP sessions</td><td>0.5000</td><td>acp, approvals, acpx</td><td>approvals, acpx</td><td>none</td><td>acp</td></tr><tr><td><a href='https://github.com/openclaw/openclaw/pull/84670'>#84670</a></td><td>[codex] fix webchat full-message reader for truncated history</td><td>0.5000</td><td>gateway, api_surface, ui_tui</td><td>gateway, ui_tui</td><td>none</td><td>api_surface</td></tr><tr><td><a href='https://github.com/openclaw/openclaw/issues/84583'>#84583</a></td><td>cron announce delivery triggers EmbeddedAttemptSessionTakeoverError when user is actively chatting</td><td>0.5000</td><td>cron_automation, sessions, reliability</td><td>cron_automation, sessions</td><td>none</td><td>reliability</td></tr><tr><td><a href='https://github.com/openclaw/openclaw/pull/68725'>#68725</a></td><td>feat(amazon-bedrock-mantle): add known context windows for open-weight Mantle models</td><td>0.5000</td><td>open_weight_models, local_model_providers</td><td>open_weight_models</td><td>none</td><td>local_model_providers</td></tr><tr><td><a href='https://github.com/openclaw/openclaw/pull/56442'>#56442</a></td><td>feat: Add opt-in ACP parent completion notify for sessions_spawn</td><td>0.5000</td><td>acp, sessions, agent_runtime</td><td>acp, sessions</td><td>none</td><td>agent_runtime</td></tr><tr><td><a href='https://github.com/openclaw/openclaw/issues/60979'>#60979</a></td><td>feature: sessions_spawn ACP delivery to channel (stream output to Zulip/Discord topic)</td><td>0.5000</td><td>acp, chat_integrations, sessions</td><td>acp, sessions</td><td>none</td><td>chat_integrations</td></tr><tr><td><a href='https://github.com/openclaw/openclaw/pull/52747'>#52747</a></td><td>fix(acp): time out stuck session lane 
tasks</td><td>0.5000</td><td>acp, sessions, reliability</td><td>acp, sessions</td><td>none</td><td>reliability</td></tr><tr><td><a href='https://github.com/openclaw/openclaw/pull/84763'>#84763</a></td><td>fix(acpx): scrub provider credential env from ACP harness spawns</td><td>0.5000</td><td>acpx, acp, security</td><td>acp, acpx</td><td>none</td><td>security</td></tr><tr><td><a href='https://github.com/openclaw/openclaw/pull/69256'>#69256</a></td><td>fix(cron): prevent premature session cleanup when subagents are running</td><td>0.5000</td><td>cron_automation, sessions, reliability</td><td>cron_automation, sessions</td><td>none</td><td>reliability</td></tr><tr><td><a href='https://github.com/openclaw/openclaw/pull/65242'>#65242</a></td><td>fix: CompletionDeliveryGate to prevent duplicate ACP completion delivery</td><td>0.5000</td><td>acp, coding_agents, reliability</td><td>acp, reliability</td><td>none</td><td>coding_agents</td></tr><tr><td><a href='https://github.com/openclaw/openclaw/pull/77827'>#77827</a></td><td>fix: LM Studio thinking blocks invisible with Responses API</td><td>0.5000</td><td>model_serving, local_models</td><td>local_models</td><td>none</td><td>model_serving</td></tr><tr><td><a href='https://github.com/openclaw/openclaw/pull/42027'>#42027</a></td><td>fix: resolve exec PATH fallback, layered browser diagnostics, and cron force-run deadlock</td><td>0.5000</td><td>exec_tools, browser_automation, cron_automation</td><td>exec_tools, browser_automation</td><td>none</td><td>cron_automation</td></tr><tr><td><a href='https://github.com/openclaw/openclaw/pull/84752'>#84752</a></td><td>fix: self-heal lane wedges + restore openai-codex OAuth on embedded path</td><td>0.5000</td><td>reliability, auth_identity, sessions</td><td>reliability, auth_identity</td><td>none</td><td>sessions</td></tr><tr><td><a href='https://github.com/openclaw/openclaw/pull/63826'>#63826</a></td><td>security: fix HIGH/CRITICAL vulns in skill scanner, SSRF, hook priority, and token 
verification</td><td>0.5000</td><td>security, hooks, skills_plugins</td><td>security, skills_plugins</td><td>none</td><td>hooks</td></tr><tr><td><a href='https://github.com/openclaw/openclaw/pull/62428'>#62428</a></td><td>test(exec): land exec v2 contract follow-through</td><td>0.5000</td><td>exec_tools, sandboxing, approvals</td><td>exec_tools, sandboxing</td><td>none</td><td>approvals</td></tr></tbody></table></div>
|
| 104 |
+
</section>
|
| 105 |
+
</main>
|
| 106 |
+
</body>
|
| 107 |
+
</html>
|
gepa-12b-multi-from-six-20260613T051216Z/best.prompt.md
ADDED
|
@@ -0,0 +1,165 @@
|
| 1 |
+
# OpenClaw Routing Classifier
|
| 2 |
+
|
| 3 |
+
Classify one OpenClaw GitHub issue or pull request for maintainer notification
|
| 4 |
+
routing, not code search. Return only the final structured JSON required by the
|
| 5 |
+
schema. No prose, markdown, analysis, or extra fields.
|
| 6 |
+
|
| 7 |
+
Required output shape:
|
| 8 |
+
|
| 9 |
+
```json
|
| 10 |
+
{"topics_of_interest":[],"description":"One concise evidence-backed sentence.","caveats":[]}
|
| 11 |
+
```
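
A reply can be checked against this shape before it is scored or routed. The following is a minimal sketch, not part of the original prompt: `validate_output` and its `allowed_topics` argument are hypothetical names, and the allowed ids are assumed to be whatever list is substituted for the allowed-topics placeholder at render time.

```python
import json

# Keys taken from the required output shape above.
REQUIRED_KEYS = {"topics_of_interest", "description", "caveats"}


def validate_output(raw, allowed_topics):
    """Return a list of problems; an empty list means the reply looks valid.

    `allowed_topics` is a hypothetical stand-in for the allowed topic ids.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top-level value must be a JSON object"]

    problems = []
    keys = set(data)
    if keys != REQUIRED_KEYS:
        missing = sorted(REQUIRED_KEYS - keys)
        extra = sorted(keys - REQUIRED_KEYS)
        problems.append(f"missing keys {missing}, extra keys {extra}")

    topics = data.get("topics_of_interest", [])
    if not isinstance(topics, list):
        problems.append("topics_of_interest must be a list")
    else:
        # "Never invent topic ids": every id must come from the allowed list.
        invented = [
            t for t in topics
            if not isinstance(t, str) or t not in allowed_topics
        ]
        if invented:
            problems.append(f"topic ids not in the allowed list: {invented}")

    if not isinstance(data.get("description", ""), str):
        problems.append("description must be a string")
    if not isinstance(data.get("caveats", []), list):
        problems.append("caveats must be a list")
    return problems
```

Returning a list of problems rather than raising on the first one keeps every violation visible in a single pass, which may be convenient when reviewing many model replies at once.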
|
| 12 |
+
|
| 13 |
+
## Inner Monologue
|
| 14 |
+
|
| 15 |
+
You MUST keep your inner monologue, your thought process, your Chain of Thought restricted to 2 short paragraphs maximum. Do not deliberate topic by topic; weigh only the strongest candidates, then call final_json. It is ABSOLUTELY IMPERATIVE that you DO NOT EXCEED 50 WORDS and reply as soon as possible.
|
| 16 |
+
|
| 17 |
+
## Repository Reads
|
| 18 |
+
|
| 19 |
+
A read-only `bash` tool may be available in the OpenClaw repo snapshot. Use it
|
| 20 |
+
only when the GitHub context is ambiguous or missing repo evidence needed for a
|
| 21 |
+
correct routing decision. Prefer short commands such as `pwd`, `ls`, `find`,
|
| 22 |
+
`rg`, `grep`, `sed -n`, `cat`, `head`, `tail`, `wc -l`,
|
| 23 |
+
`git show --name-only`, `git ls-files`, or `git grep`.
|
| 24 |
+
For repo-wide text search, use `rg -n -i "phrase"` or explicit recursive grep
|
| 25 |
+
such as `grep -R -n -i "phrase" .`. For file discovery, use
|
| 26 |
+
`rg --files -g "*.ts"` or `git ls-files src`.
|
| 27 |
+
Do not call `bash` when the provided GitHub context is enough.
|
| 28 |
+
|
| 29 |
+
## Allowed Topics
|
| 30 |
+
|
| 31 |
+
```json
|
| 32 |
+
__ALLOWED_TOPICS_JSON__
|
| 33 |
+
```
|
| 34 |
+
|
| 35 |
+
Topic definitions and cue words:
|
| 36 |
+
|
| 37 |
+
__TOPIC_DESCRIPTIONS__
|
| 38 |
+
|
| 39 |
+
You are classifying GitHub issues or pull requests into the smallest complete set of allowed topic ids.
|
| 40 |
+
|
| 41 |
+
Input format:
|
| 42 |
+
- You may receive a GitHub target URL, title, and sometimes a body or summary.
|
| 43 |
+
- The title is the primary signal.
|
| 44 |
+
- Use the first clear body summary only when the title is ambiguous.
|
| 45 |
+
- Ignore examples, tests, files changed, implementation details, incidental keywords, and broad impact unless they are the actual user-visible subject.
|
| 46 |
+
- Return only final JSON using exact allowed topic ids, for example:
|
| 47 |
+
{"topics_of_interest":["queueing","docs"]}
|
| 48 |
+
|
| 49 |
+
Task:
|
| 50 |
+
Choose the minimum topic set that routes the item to the right maintainer bucket without dropping an explicitly central second or third concern.
|
| 51 |
+
|
| 52 |
+
General process:
|
| 53 |
+
1. Read the title first.
|
| 54 |
+
2. Identify the main user-visible problem, feature, documentation change, policy change, or contract being changed.
|
| 55 |
+
3. Pick one primary topic.
|
| 56 |
+
4. Add a secondary topic only when it is explicitly central and removing it would route the item away from a maintainer who must see it.
|
| 57 |
+
5. Use 3 topics only when the title or first clear summary explicitly names three central facets.
|
| 58 |
+
6. Use 0 topics when no allowed topic is central.
|
| 59 |
+
7. Never invent topic ids.
|
| 60 |
+
8. Output only JSON.
|
| 61 |
+
|
| 62 |
+
High-signal title patterns:
|
| 63 |
+
- A Conventional Commit type like `docs(...)`, `feat(...)`, `fix(...)`, `test(...)`, or `policy(...)` can indicate the kind of change.
|
| 64 |
+
- A scope inside parentheses is often central. For example, `docs(queue): ...` usually includes both `docs` and `queueing`.
|
| 65 |
+
- Do not ignore `test(...)` scopes when the title is about landing or enforcing a behavior contract. The tested contract can be the central subject.
|
| 66 |
+
- Do not blindly label every word in the title. Confirm the word names the subject, not just a path, symptom, or context.
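
The title patterns above can be illustrated by splitting a Conventional Commit style title into its type, scope, and subject. This is a rough sketch under stated assumptions: `parse_title` and the regular expression are hypothetical, they only surface the scope as a hint, and they do not decide which topic ids apply.

```python
import re

# Hypothetical pattern for titles such as "docs(queue): clarify ...".
# The scope in parentheses is optional; an optional "!" marks a breaking change.
TITLE_RE = re.compile(
    r"^(?P<type>[a-z]+)(?:\((?P<scope>[^)]+)\))?!?:\s*(?P<subject>.+)$"
)


def parse_title(title):
    """Return (type, scope, subject), or None if the title does not match."""
    match = TITLE_RE.match(title.strip())
    if match is None:
        return None
    return match.group("type"), match.group("scope"), match.group("subject")


if __name__ == "__main__":
    for title in (
        "docs(queue): clarify steer behavior with partial streaming and tool boundaries",
        "test(exec): land exec v2 contract follow-through",
        "[Bug]: Telegram session can get stuck after compaction",
    ):
        print(title, "->", parse_title(title))
```

Titles that do not follow the convention, such as the bracketed `[Bug]:` form, return `None` here, so the classifier would fall back to reading the title as plain text.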
|
| 67 |
+
|
| 68 |
+
Domain rules and corrections:
|
| 69 |
+
|
| 70 |
+
Documentation:
|
| 71 |
+
- Documentation-only PRs should usually include `docs` plus the central documented area.
|
| 72 |
+
- Example: `docs(queue): clarify steer behavior with partial streaming and tool boundaries` => `docs`, `queueing`.
|
| 73 |
+
- Do not add `tool_calling` just because the title says “tool boundaries” unless tool-call behavior itself is the central feature or bug.
|
| 74 |
+
|
| 75 |
+
Queueing:
|
| 76 |
+
- Queue, queueing, queued execution, steer behavior in queues, or queue lifecycle route to `queueing` when central.
|
| 77 |
+
|
| 78 |
+
Tool calling:
|
| 79 |
+
- `tool_calling` is only for tool-call execution, tool-call APIs, tool selection, tool schema handling, or tool-call runtime behavior.
|
| 80 |
+
- Mentions of “tool boundaries” in docs about another system are usually context, not `tool_calling`.
|
| 81 |
+
|
| 82 |
+
ACP, gateway, and runtime:
|
| 83 |
+
- ACP-related work routes to `acp` when ACP is named centrally.
|
| 84 |
+
- ACPX-related sandbox or workflow issues route to `acpx` when ACPX is named centrally.
|
| 85 |
+
- Gateway-owned behavior routes to `gateway` only when gateway is explicitly the owner or subject.
|
| 86 |
+
- Runtime work routes to `agent_runtime` when the title is about runtimes, node-backed runtimes, agent execution runtimes, or runtime ownership.
|
| 87 |
+
- Example: `ACP: add gateway-owned node-backed runtime` => `acp`, `gateway`, `agent_runtime`.
|
| 88 |
+
|
| 89 |
+
Codex and plugins:
|
| 90 |
+
- Codex-related behavior routes to `codex` when Codex is named centrally.
|
| 91 |
+
- User-installed plugins, plugin inheritance, Superpowers, skills, plugin discovery, plugin installation, or skill/plugin availability route to `skills_plugins`.
|
| 92 |
+
- Example: `[Feature]: ACPX Codex sandbox should inherit user-installed plugins (e.g. Superpowers)` => `acpx`, `codex`, `skills_plugins`.
|
| 93 |
+
- Do not drop `skills_plugins` when plugins are the requested feature.
|
| 94 |
+
|
| 95 |
+
Notifications and chat integrations:
|
| 96 |
+
- Slack, chat app delivery, chat target channels, and chat push behavior route to `chat_integrations`.
|
| 97 |
+
- Announce messages, heartbeat pushes, target-channel pushes, identity overlays on pushed messages, and notification delivery route to `notifications`.
|
| 98 |
+
- Do not add `cron_automation` merely because the notification path mentions `cron --announce`; cron is context unless scheduling, force-run behavior, cron lifecycle, or cron execution is itself broken.
|
| 99 |
+
- Example: `Per-agent identity overlay dropped on cron --announce and heartbeat target-channel Slack pushes` => `notifications`, `chat_integrations`.
|
| 100 |
+
|
| 101 |
+
Cron:
|
| 102 |
+
- Use `cron_automation` when cron scheduling, cron force-run, cron lifecycle, cron execution, or a cron deadlock is central.
|
| 103 |
+
- Example: `cron force-run deadlock` => `cron_automation`.
|
| 104 |
+
|
| 105 |
+
Exec, sandboxing, and approvals:
|
| 106 |
+
- Exec command/tool behavior routes to `exec_tools`.
|
| 107 |
+
- Exec PATH fallback is `exec_tools`.
|
| 108 |
+
- Exec v2 contract follow-through or contract enforcement can centrally include `exec_tools`, `sandboxing`, and `approvals` when the contract covers sandbox and approval behavior.
|
| 109 |
+
- Example: `test(exec): land exec v2 contract follow-through` => `exec_tools`, `sandboxing`, `approvals`.
|
| 110 |
+
- Do not replace sandboxing or approvals with `security` unless the title is actually about a security policy, vulnerability, network restriction, credential boundary, or allowed/blocked security behavior.
|
| 111 |
+
|
| 112 |
+
Browser automation:
|
| 113 |
+
- Browser diagnostics, browser automation layers, browser runtime behavior, and browser tooling issues route to `browser_automation`.
|
| 114 |
+
- Example: `layered browser diagnostics` => `browser_automation`.
|
| 115 |
+
- Do not add `gateway` for browser diagnostics unless the gateway itself is explicitly the subject.
|
| 116 |
+
|
| 117 |
+
Memory and inference:
|
| 118 |
+
- Memory or embeddings provider work routes to `memory` when the provider exists for memory/embeddings.
|
| 119 |
+
- Self-hosted inference servers such as llama.cpp, Ollama, vLLM, TGI, and LocalAI route to `self_hosted_inference` when the item is about using those servers as inference providers.
|
| 120 |
+
- Example: `feat(memory/embeddings): add openai-compatible provider for self-hosted servers (llama.cpp, Ollama, vLLM, TGI, LocalAI)` => `memory`, `self_hosted_inference`.
|
| 121 |
+
- Do not add `model_serving` merely because the title says “openai-compatible”, “provider”, llama.cpp, Ollama, vLLM, TGI, or LocalAI.
|
| 122 |
+
|
| 123 |
+
Model serving:
|
| 124 |
+
- Use `model_serving` only when the central subject is serving endpoints, OpenAI-compatible request/response protocol behavior, streaming lifecycle, final usage chunks, base URL behavior, endpoint compatibility, request routing, or model-server compatibility.
|
| 125 |
+
- OpenAI-compatible streaming, final usage chunks, stream lifecycle, endpoint compatibility, base URL behavior, vLLM/TGI/LocalAI/llama.cpp serving behavior, and request routing are `model_serving`.
|
| 126 |
+
- Do not add `telemetry_usage` merely because the title mentions usage, tokens, counts, cost, or chunks when those are symptoms of a model-serving protocol bug.
|
| 127 |
+
- Example: `OpenAI-compatible streaming with llama.cpp saves zero usage (stream closed before final usage chunk)` => `model_serving`.
|
| 128 |
+
|
| 129 |
+
Telemetry and usage:
|
| 130 |
+
- Use `telemetry_usage` only when metric collection, usage accounting/reporting, cost display, diagnostic counts, traces, or status reporting surfaces are themselves the feature or bug.
|
| 131 |
+
|
| 132 |
+
Policy/config:
|
| 133 |
+
- Items about policy rules, conformance checks, quality gates, allowed behavior, or configuration-governed enforcement usually include `config` when the policy/checking behavior is central.
|
| 134 |
+
- Do not map “model” in “model policy”, “model conformance”, or “model checks” to `model_serving` unless the item is actually about serving endpoints, streaming, endpoint lifecycle, routing, or model-server compatibility.
|
| 135 |
+
- Network policy, network conformance, access restrictions, outbound rules, or boundary checks can be `security` when they concern allowed/blocked network behavior.
|
| 136 |
+
- MCP conformance, MCP policy, MCP tool behavior, or MCP protocol checks route to `mcp_tooling`.
|
| 137 |
+
- Example: `Policy: add model, network, and MCP conformance checks` => `mcp_tooling`, `config`, `security`, not `model_serving`.
|
| 138 |
+
|
| 139 |
+
Composite fixes:
|
| 140 |
+
- If a title lists several independent fixes, classify each central fix up to the smallest complete set.
|
| 141 |
+
- Example: `fix: resolve exec PATH fallback, layered browser diagnostics, and cron force-run deadlock` => `exec_tools`, `browser_automation`, `cron_automation`.
|
| 142 |
+
- Do not substitute a broad infrastructure topic like `gateway` unless it is explicitly one of the listed user-visible subjects.
|
| 143 |
+
|
| 144 |
+
Final suppression checks:
|
| 145 |
+
- If a topic was added only because of a word like “usage”, “model”, “network”, “test”, “policy”, “status”, “tool”, “plugin”, “chunk”, “cron”, “gateway”, or “security”, verify that the topic is actually the subject.
|
| 146 |
+
- Prefer the narrow central topic over broad fallback labels.
|
| 147 |
+
- Remove labels that come only from symptoms, implementation details, tests, examples, files changed, or incidental words.
|
| 148 |
+
- Keep required central second and third topics when dropping them would hide the item from a maintainer who owns that area.

## Target
|
| 149 |
+
|
| 150 |
+
`__TARGET__`
|
| 151 |
+
|
| 152 |
+
## GitHub Context
|
| 153 |
+
|
| 154 |
+
__GITHUB_CONTEXT__
|
| 155 |
+
|
| 156 |
+
Use this context as source of truth. If important sections are missing,
|
| 157 |
+
unavailable, selected, or truncated, classify from what is available and mention
|
| 158 |
+
material limits in `caveats`.
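
The double-underscore tokens in this template (`__ALLOWED_TOPICS_JSON__`, `__TOPIC_DESCRIPTIONS__`, `__TARGET__`, `__GITHUB_CONTEXT__`) look like placeholders that are filled in before the prompt is sent. How the actual pipeline does this is not shown here; the sketch below is one possible approach, and `render_prompt` is a hypothetical helper.

```python
import re

# Matches tokens such as __TARGET__ or __ALLOWED_TOPICS_JSON__.
PLACEHOLDER_RE = re.compile(r"__([A-Z]+(?:_[A-Z]+)*)__")


def render_prompt(template, values):
    """Substitute every __NAME__ token in a single pass.

    A single pass means text inserted for one placeholder (for example a
    GitHub body that happens to contain "__TARGET__") is never re-expanded.
    Raises KeyError if the template names a placeholder with no value.
    """
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for placeholder __{name}__")
        return values[name]

    return PLACEHOLDER_RE.sub(substitute, template)


if __name__ == "__main__":
    template = "## Target\n\n`__TARGET__`\n\n## GitHub Context\n\n__GITHUB_CONTEXT__\n"
    print(render_prompt(template, {
        "TARGET": "https://github.com/openclaw/openclaw/pull/62428",
        "GITHUB_CONTEXT": "test(exec): land exec v2 contract follow-through",
    }))
```

Failing loudly on an unfilled placeholder is a deliberate choice in this sketch: a prompt that still contains a raw token would otherwise reach the model unnoticed.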
|
| 159 |
+
|
| 160 |
+
|
| 161 |
+
You MUST keep your inner monologue, your thought process, your Chain of Thought restricted to 2 short paragraphs maximum. Do not deliberate topic by topic; weigh only the strongest candidates, then call final_json. It is ABSOLUTELY IMPERATIVE that you DO NOT EXCEED 50 WORDS and reply as soon as possible.
|
| 162 |
+
|
| 163 |
+
You MUST keep your inner monologue, your thought process, your Chain of Thought restricted to 2 short paragraphs maximum. Do not deliberate topic by topic; weigh only the strongest candidates, then call final_json. It is ABSOLUTELY IMPERATIVE that you DO NOT EXCEED 50 WORDS and reply as soon as possible.
|
| 164 |
+
|
| 165 |
+
You MUST keep your inner monologue, your thought process, your Chain of Thought restricted to 2 short paragraphs maximum. Do not deliberate topic by topic; weigh only the strongest candidates, then call final_json. It is ABSOLUTELY IMPERATIVE that you DO NOT EXCEED 50 WORDS and reply as soon as possible.
|
gepa-12b-multi-from-six-20260613T051216Z/best.routing_policy.md
ADDED
|
@@ -0,0 +1,110 @@
|
| 1 |
+
You are classifying GitHub issues or pull requests into the smallest complete set of allowed topic ids.
|
| 2 |
+
|
| 3 |
+
Input format:
|
| 4 |
+
- You may receive a GitHub target URL, title, and sometimes a body or summary.
|
| 5 |
+
- The title is the primary signal.
|
| 6 |
+
- Use the first clear body summary only when the title is ambiguous.
|
| 7 |
+
- Ignore examples, tests, files changed, implementation details, incidental keywords, and broad impact unless they are the actual user-visible subject.
|
| 8 |
+
- Return only final JSON using exact allowed topic ids, for example:
|
| 9 |
+
{"topics_of_interest":["queueing","docs"]}
|
| 10 |
+
|
| 11 |
+
Task:
|
| 12 |
+
Choose the minimum topic set that routes the item to the right maintainer bucket without dropping an explicitly central second or third concern.
|
| 13 |
+
|
| 14 |
+
General process:
|
| 15 |
+
1. Read the title first.
|
| 16 |
+
2. Identify the main user-visible problem, feature, documentation change, policy change, or contract being changed.
|
| 17 |
+
3. Pick one primary topic.
|
| 18 |
+
4. Add a secondary topic only when it is explicitly central and removing it would route the item away from a maintainer who must see it.
|
| 19 |
+
5. Use 3 topics only when the title or first clear summary explicitly names three central facets.
|
| 20 |
+
6. Use 0 topics when no allowed topic is central.
|
| 21 |
+
7. Never invent topic ids.
|
| 22 |
+
8. Output only JSON.
|
| 23 |
+
|
| 24 |
+
High-signal title patterns:
|
| 25 |
+
- A Conventional Commit type like `docs(...)`, `feat(...)`, `fix(...)`, `test(...)`, or `policy(...)` can indicate the kind of change.
|
| 26 |
+
- A scope inside parentheses is often central. For example, `docs(queue): ...` usually includes both `docs` and `queueing`.
|
| 27 |
+
- Do not ignore `test(...)` scopes when the title is about landing or enforcing a behavior contract. The tested contract can be the central subject.
|
| 28 |
+
- Do not blindly label every word in the title. Confirm the word names the subject, not just a path, symptom, or context.
|
| 29 |
+
|
| 30 |
+
Domain rules and corrections:
|
| 31 |
+
|
| 32 |
+
Documentation:
|
| 33 |
+
- Documentation-only PRs should usually include `docs` plus the central documented area.
|
| 34 |
+
- Example: `docs(queue): clarify steer behavior with partial streaming and tool boundaries` => `docs`, `queueing`.
|
| 35 |
+
- Do not add `tool_calling` just because the title says “tool boundaries” unless tool-call behavior itself is the central feature or bug.
|
| 36 |
+
|
| 37 |
+
Queueing:
|
| 38 |
+
- Queue, queueing, queued execution, steer behavior in queues, or queue lifecycle route to `queueing` when central.
|
| 39 |
+
|
| 40 |
+
Tool calling:
|
| 41 |
+
- `tool_calling` is only for tool-call execution, tool-call APIs, tool selection, tool schema handling, or tool-call runtime behavior.
|
| 42 |
+
- Mentions of “tool boundaries” in docs about another system are usually context, not `tool_calling`.
|
| 43 |
+
|
| 44 |
+
ACP, gateway, and runtime:
|
| 45 |
+
- ACP-related work routes to `acp` when ACP is named centrally.
|
| 46 |
+
- ACPX-related sandbox or workflow issues route to `acpx` when ACPX is named centrally.
|
| 47 |
+
- Gateway-owned behavior routes to `gateway` only when gateway is explicitly the owner or subject.
|
| 48 |
+
- Runtime work routes to `agent_runtime` when the title is about runtimes, node-backed runtimes, agent execution runtimes, or runtime ownership.
|
| 49 |
+
- Example: `ACP: add gateway-owned node-backed runtime` => `acp`, `gateway`, `agent_runtime`.
|
| 50 |
+
|
| 51 |
+
Codex and plugins:
|
| 52 |
+
- Codex-related behavior routes to `codex` when Codex is named centrally.
|
| 53 |
+
- User-installed plugins, plugin inheritance, Superpowers, skills, plugin discovery, plugin installation, or skill/plugin availability route to `skills_plugins`.
|
| 54 |
+
- Example: `[Feature]: ACPX Codex sandbox should inherit user-installed plugins (e.g. Superpowers)` => `acpx`, `codex`, `skills_plugins`.
|
| 55 |
+
- Do not drop `skills_plugins` when plugins are the requested feature.
|
| 56 |
+
|
| 57 |
+
Notifications and chat integrations:
|
| 58 |
+
- Slack, chat app delivery, chat target channels, and chat push behavior route to `chat_integrations`.
|
| 59 |
+
- Announce messages, heartbeat pushes, target-channel pushes, identity overlays on pushed messages, and notification delivery route to `notifications`.
|
| 60 |
+
- Do not add `cron_automation` merely because the notification path mentions `cron --announce`; cron is context unless scheduling, force-run behavior, cron lifecycle, or cron execution is itself broken.
|
| 61 |
+
- Example: `Per-agent identity overlay dropped on cron --announce and heartbeat target-channel Slack pushes` => `notifications`, `chat_integrations`.
|
| 62 |
+
|
| 63 |
+
Cron:
|
| 64 |
+
- Use `cron_automation` when cron scheduling, cron force-run, cron lifecycle, cron execution, or a cron deadlock is central.
|
| 65 |
+
- Example: `cron force-run deadlock` => `cron_automation`.
|
| 66 |
+
|
| 67 |
+
Exec, sandboxing, and approvals:
|
| 68 |
+
- Exec command/tool behavior routes to `exec_tools`.
|
| 69 |
+
- Exec PATH fallback is `exec_tools`.
|
| 70 |
+
- Exec v2 contract follow-through or contract enforcement can centrally include `exec_tools`, `sandboxing`, and `approvals` when the contract covers sandbox and approval behavior.
|
| 71 |
+
- Example: `test(exec): land exec v2 contract follow-through` => `exec_tools`, `sandboxing`, `approvals`.
|
| 72 |
+
- Do not replace sandboxing or approvals with `security` unless the title is actually about a security policy, vulnerability, network restriction, credential boundary, or allowed/blocked security behavior.
|
| 73 |
+
|
| 74 |
+
Browser automation:
|
| 75 |
+
- Browser diagnostics, browser automation layers, browser runtime behavior, and browser tooling issues route to `browser_automation`.
|
| 76 |
+
- Example: `layered browser diagnostics` => `browser_automation`.
|
| 77 |
+
- Do not add `gateway` for browser diagnostics unless the gateway itself is explicitly the subject.
|
| 78 |
+
|
| 79 |
+
Memory and inference:
|
| 80 |
+
- Memory or embeddings provider work routes to `memory` when the provider exists for memory/embeddings.
|
| 81 |
+
- Self-hosted inference servers such as llama.cpp, Ollama, vLLM, TGI, and LocalAI route to `self_hosted_inference` when the item is about using those servers as inference providers.
|
| 82 |
+
- Example: `feat(memory/embeddings): add openai-compatible provider for self-hosted servers (llama.cpp, Ollama, vLLM, TGI, LocalAI)` => `memory`, `self_hosted_inference`.
|
| 83 |
+
- Do not add `model_serving` merely because the title says “openai-compatible”, “provider”, llama.cpp, Ollama, vLLM, TGI, or LocalAI.
|
| 84 |
+
|
| 85 |
+
Model serving:
|
| 86 |
+
- Use `model_serving` only when the central subject is serving endpoints, OpenAI-compatible request/response protocol behavior, streaming lifecycle, final usage chunks, base URL behavior, endpoint compatibility, request routing, or model-server compatibility.
|
| 87 |
+
- OpenAI-compatible streaming, final usage chunks, stream lifecycle, endpoint compatibility, base URL behavior, vLLM/TGI/LocalAI/llama.cpp serving behavior, and request routing are `model_serving`.
|
| 88 |
+
- Do not add `telemetry_usage` merely because the title mentions usage, tokens, counts, cost, or chunks when those are symptoms of a model-serving protocol bug.
|
| 89 |
+
- Example: `OpenAI-compatible streaming with llama.cpp saves zero usage (stream closed before final usage chunk)` => `model_serving`.
|
| 90 |
+
|
| 91 |
+
Telemetry and usage:
|
| 92 |
+
- Use `telemetry_usage` only when metric collection, usage accounting/reporting, cost display, diagnostic counts, traces, or status reporting surfaces are themselves the feature or bug.
|
| 93 |
+
|
| 94 |
+
Policy/config:
|
| 95 |
+
- Items about policy rules, conformance checks, quality gates, allowed behavior, or configuration-governed enforcement usually include `config` when the policy/checking behavior is central.
|
| 96 |
+
- Do not map “model” in “model policy”, “model conformance”, or “model checks” to `model_serving` unless the item is actually about serving endpoints, streaming, endpoint lifecycle, routing, or model-server compatibility.
|
| 97 |
+
- Network policy, network conformance, access restrictions, outbound rules, or boundary checks can be `security` when they concern allowed/blocked network behavior.
|
| 98 |
+
- MCP conformance, MCP policy, MCP tool behavior, or MCP protocol checks route to `mcp_tooling`.
|
| 99 |
+
- Example: `Policy: add model, network, and MCP conformance checks` => `mcp_tooling`, `config`, `security`, not `model_serving`.
|
| 100 |
+
|
| 101 |
+
Composite fixes:
|
| 102 |
+
- If a title lists several independent fixes, classify each central fix up to the smallest complete set.
|
| 103 |
+
- Example: `fix: resolve exec PATH fallback, layered browser diagnostics, and cron force-run deadlock` => `exec_tools`, `browser_automation`, `cron_automation`.
|
| 104 |
+
- Do not substitute a broad infrastructure topic like `gateway` unless it is explicitly one of the listed user-visible subjects.
|
| 105 |
+
|
| 106 |
+
Final suppression checks:
|
| 107 |
+
- If a topic was added only because of a word like “usage”, “model”, “network”, “test”, “policy”, “status”, “tool”, “plugin”, “chunk”, “cron”, “gateway”, or “security”, verify that the topic is actually the subject.
|
| 108 |
+
- Prefer the narrow central topic over broad fallback labels.
|
| 109 |
+
- Remove labels that come only from symptoms, implementation details, tests, examples, files changed, or incidental words.
|
| 110 |
+
- Keep required central second and third topics when dropping them would hide the item from a maintainer who owns that area.
|
gepa-12b-multi-from-six-20260613T051216Z/candidate_tree.html
ADDED
|
@@ -0,0 +1,179 @@
|
| 1 |
+
<!DOCTYPE html>
|
| 2 |
+
<html lang="en">
|
| 3 |
+
<head>
|
| 4 |
+
<meta charset="UTF-8">
|
| 5 |
+
<meta name="viewport" content="width=device-width, initial-scale=1.0">
|
| 6 |
+
<title>GEPA Candidate Tree</title>
|
| 7 |
+
<style>
|
| 8 |
+
* { margin: 0; padding: 0; box-sizing: border-box; }
|
| 9 |
+
html, body { height: 100%; width: 100%; font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, sans-serif; background: #f8f9fa; display: flex; flex-direction: column; }
|
| 10 |
+
#header { background: #fff; border-bottom: 1px solid #dee2e6; padding: 12px 20px; display: flex; align-items: center; gap: 16px; flex-shrink: 0; }
|
| 11 |
+
#header h1 { font-size: 18px; font-weight: 600; color: #212529; }
|
| 12 |
+
.legend { display: flex; gap: 14px; font-size: 13px; color: #495057; }
|
| 13 |
+
.legend-item { display: flex; align-items: center; gap: 4px; }
|
| 14 |
+
.legend-dot { width: 12px; height: 12px; border-radius: 50%; border: 1px solid #adb5bd; }
|
| 15 |
+
#graph-container { flex: 1 1 auto; width: 100%; overflow: auto; padding: 20px; text-align: center; }
|
| 16 |
+
#graph-container svg { width: 100%; height: 100%; }
|
| 17 |
+
#tooltip {
|
| 18 |
+
display: none; position: fixed; background: #fff; border: 1px solid #dee2e6;
|
| 19 |
+
border-radius: 8px; padding: 14px 16px; max-width: 560px; max-height: 70vh; overflow-y: auto;
|
| 20 |
+
box-shadow: 0 4px 16px rgba(0,0,0,0.18); z-index: 1000; font-size: 13px; line-height: 1.5;
|
| 21 |
+
}
|
| 22 |
+
#tooltip .tt-header { font-weight: 700; font-size: 15px; margin-bottom: 6px; color: #212529; }
|
| 23 |
+
#tooltip .tt-meta { color: #6c757d; margin-bottom: 4px; }
|
| 24 |
+
#tooltip .tt-hint { color: #adb5bd; font-size: 11px; font-style: italic; margin-bottom: 10px; }
|
| 25 |
+
#tooltip .tt-comp-name { font-weight: 600; color: #495057; margin-top: 10px; border-bottom: 1px solid #e9ecef; padding-bottom: 2px; }
|
| 26 |
+
#tooltip .tt-comp-text {
|
| 27 |
+
white-space: pre-wrap; word-break: break-word; background: #f8f9fa;
|
| 28 |
+
border: 1px solid #e9ecef; border-radius: 4px; padding: 8px 10px; margin-top: 4px;
|
| 29 |
+
font-family: "SF Mono", SFMono-Regular, Menlo, Consolas, monospace; font-size: 12px;
|
| 30 |
+
max-height: 200px; overflow-y: auto;
|
| 31 |
+
}
|
| 32 |
+
.role-badge { display: inline-block; padding: 2px 8px; border-radius: 4px; font-size: 11px; font-weight: 600; margin-left: 6px; }
|
| 33 |
+
.role-best { background: #00e5ff33; color: #006064; }
|
| 34 |
+
.role-pareto { background: #ff980033; color: #e65100; }
|
| 35 |
+
.role-seed { background: #e0e0e0; color: #424242; }
|
| 36 |
+
</style>
|
| 37 |
+
</head>
|
| 38 |
+
<body>
|
| 39 |
+
<div id="header">
|
| 40 |
+
<h1>GEPA Candidate Tree</h1>
|
| 41 |
+
<div class="legend">
|
| 42 |
+
<div class="legend-item"><div class="legend-dot" style="background:cyan"></div> Best</div>
|
| 43 |
+
<div class="legend-item"><div class="legend-dot" style="background:orange"></div> Pareto Front</div>
|
| 44 |
+
<div class="legend-item"><div class="legend-dot" style="background:lightgray"></div> Other</div>
|
| 45 |
+
</div>
|
| 46 |
+
</div>
|
| 47 |
+
<div id="graph-container"><p>Loading graph…</p></div>
|
| 48 |
+
<div id="tooltip"></div>
|
| 49 |
+
|
| 50 |
+
<script type="module">
|
| 51 |
+
const NODES = [{"idx": 0, "score": 0.4972, "parents": "seed", "role": "Pareto Front", "components": {"routing_policy": "You are classifying GitHub issues or pull requests into the smallest complete set of allowed topic ids.\n\nThis is a fuzzy multi-label routing task. Your goal is not to mention every related area. Your goal is to choose the minimum topic set that sends the item to the right maintainer bucket without dropping an explicit central second concern.\n\nProcess:\n\n1. Read the title first.\n2. Identify the main user-visible problem, feature, or policy change.\n3. Pick one primary topic.\n4. Read only the first clear body summary if needed to disambiguate.\n5. Add a secondary topic only when it is explicitly central and removing it would route the item away from a maintainer who must see it.\n6. Remove topics that come only from symptoms, implementation details, tests, examples, files changed, broad impact, or incidental words.\n7. Return only exact allowed topic ids.\n\nDo not over-label from keywords.\n\nImportant domain rules:\n\n- OpenAI-compatible streaming, final usage chunks, stream lifecycle, endpoint compatibility, base URL behavior, vLLM/TGI/LocalAI/llama.cpp serving behavior, and request routing are `model_serving`.\n- Do not add `telemetry_usage` merely because the title mentions usage, tokens, counts, cost, or chunks when those are symptoms of a model-serving protocol bug.\n- Example: \u201cOpenAI-compatible streaming with llama.cpp saves zero usage (stream closed before final usage chunk)\u201d is only `model_serving`. 
The central issue is the OpenAI-compatible streaming/final usage chunk behavior, not telemetry reporting.\n- Use `telemetry_usage` only when the metric, usage accounting/reporting, cost display, diagnostic count, trace, or status reporting surface is itself the feature or bug.\n\nPolicy/config rules:\n\n- Items about policy rules, conformance checks, quality gates, allowed behavior, or configuration-governed enforcement usually include `config` when the policy/checking behavior is central.\n- Do not map the word \u201cmodel\u201d in \u201cmodel policy\u201d, \u201cmodel conformance\u201d, or \u201cmodel checks\u201d to `model_serving` unless the item is actually about serving endpoints, streaming, endpoint lifecycle, routing, or model-server compatibility.\n- Network policy, network conformance, access restrictions, outbound rules, or boundary checks can be `security` when they concern allowed/blocked network behavior.\n- MCP conformance, MCP policy, MCP tool behavior, or MCP protocol checks route to `mcp_tooling`.\n- Example: \u201cPolicy: add model, network, and MCP conformance checks\u201d should be `mcp_tooling`, `config`, and `security`, not `model_serving`.\n\nCardinality guidance:\n\n- Use 0 topics when no allowed topic is central.\n- Use 1 topic for a single-focus item.\n- Use 2 topics for normal cross-topic items.\n- Use 3 topics only when the title or first clear summary explicitly has three central facets.\n- Use 4+ topics only for explicit multi-system coordination.\n\nFinal suppression checks before output:\n\n- If a topic was added only because of a word like \u201cusage\u201d, \u201cmodel\u201d, \u201cnetwork\u201d, \u201ctest\u201d, \u201cpolicy\u201d, \u201cstatus\u201d, or \u201cchunk\u201d, verify that the topic is actually the subject, not just context.\n- Prefer the narrower central topic over a broad fallback.\n- Never invent topic ids.\n- Output only the final JSON with the selected topic ids."}},
{"idx": 1, "score": 0.5381, "parents": "0", "role": "Pareto Front", "components": {"routing_policy": "You are classifying GitHub issues or pull requests into the smallest complete set of allowed topic ids.\n\nInput format:\n- You may receive a GitHub target URL, a title, and sometimes a body or summary.\n- The title is the primary signal.\n- Use the first clear body summary only when the title is ambiguous.\n- Ignore examples, tests, files changed, implementation details, incidental keywords, and broad impact unless they are the actual user-visible subject.\n- Return only final JSON using exact allowed topic ids, for example:\n {\"topics_of_interest\":[\"queueing\",\"docs\"]}\n\nTask:\nChoose the minimum topic set that routes the item to the right maintainer bucket without dropping an explicitly central second concern.\n\nGeneral process:\n1. Read the title first.\n2. Identify the main user-visible problem, feature, documentation change, or policy change.\n3. Pick one primary topic.\n4. Add a secondary topic only when it is explicitly central and removing it would route the item away from a maintainer who must see it.\n5. Use 3 topics only when the title or first clear summary explicitly names three central facets.\n6. Use 0 topics when no allowed topic is central.\n7. Never invent topic ids.\n8. Output only JSON.\n\nHigh-signal title patterns:\n- A Conventional Commit type like `docs(...)`, `feat(...)`, `fix(...)`, or `policy(...)` can indicate the kind of change.\n- A scope inside parentheses is often central. For example, `docs(queue): ...` usually includes both `docs` and `queueing`.\n- Do not blindly label every word in the title. 
Confirm the word names the subject, not just context.\n\nDomain rules and corrections:\n- Documentation-only PRs should usually include `docs` plus the central documented area.\n - Example: `docs(queue): clarify steer behavior with partial streaming and tool boundaries` => `docs`, `queueing`.\n - Do not add `tool_calling` just because the title says \u201ctool boundaries\u201d unless tool calling behavior itself is the central feature or bug.\n\n- Queue, queueing, queued execution, steer behavior in queues, or queue lifecycle route to `queueing` when central.\n\n- `tool_calling` is only for tool-call execution, tool-call APIs, tool selection, tool schema handling, or tool-call runtime behavior.\n - Mentions of \u201ctool boundaries\u201d in docs about another system are usually context, not `tool_calling`.\n\n- ACPX-related sandbox or workflow issues route to `acpx` when ACPX is named centrally.\n- Codex-related behavior routes to `codex` when Codex is named centrally.\n- User-installed plugins, plugin inheritance, Superpowers, skills, plugin discovery, plugin installation, or skill/plugin availability route to `skills_plugins`.\n - Example: `[Feature]: ACPX Codex sandbox should inherit user-installed plugins (e.g. 
Superpowers)` => `acpx`, `codex`, `skills_plugins`.\n - Do not drop `skills_plugins` when plugins are the requested feature.\n\n- Memory or embeddings provider work routes to `memory` when the provider exists for memory/embeddings.\n- Self-hosted inference servers such as llama.cpp, Ollama, vLLM, TGI, and LocalAI route to `self_hosted_inference` when the item is about using those servers as inference providers.\n - Example: `feat(memory/embeddings): add openai-compatible provider for self-hosted servers (llama.cpp, Ollama, vLLM, TGI, LocalAI)` => `memory`, `self_hosted_inference`.\n - Do not add `model_serving` merely because the title says \u201copenai-compatible\u201d, \u201cprovider\u201d, llama.cpp, Ollama, vLLM, TGI, or LocalAI.\n\n- Use `model_serving` only when the central subject is serving endpoints, OpenAI-compatible request/response protocol behavior, streaming lifecycle, final usage chunks, base URL behavior, endpoint compatibility, request routing, or model-server compatibility.\n - OpenAI-compatible streaming, final usage chunks, stream lifecycle, endpoint compatibility, base URL behavior, vLLM/TGI/LocalAI/llama.cpp serving behavior, and request routing are `model_serving`.\n - Do not add `telemetry_usage` merely because the title mentions usage, tokens, counts, cost, or chunks when those are symptoms of a model-serving protocol bug.\n - Example: \u201cOpenAI-compatible streaming with llama.cpp saves zero usage (stream closed before final usage chunk)\u201d => `model_serving` only.\n\n- Use `telemetry_usage` only when metric collection, usage accounting/reporting, cost display, diagnostic counts, traces, or status reporting surfaces are themselves the feature or bug.\n\nPolicy/config rules:\n- Items about policy rules, conformance checks, quality gates, allowed behavior, or configuration-governed enforcement usually include `config` when the policy/checking behavior is central.\n- Do not map \u201cmodel\u201d in \u201cmodel policy\u201d, \u201cmodel 
conformance\u201d, or \u201cmodel checks\u201d to `model_serving` unless the item is actually about serving endpoints, streaming, endpoint lifecycle, routing, or model-server compatibility.\n- Network policy, network conformance, access restrictions, outbound rules, or boundary checks can be `security` when they concern allowed/blocked network behavior.\n- MCP conformance, MCP policy, MCP tool behavior, or MCP protocol checks route to `mcp_tooling`.\n- Example: \u201cPolicy: add model, network, and MCP conformance checks\u201d => `mcp_tooling`, `config`, `security`, not `model_serving`.\n\nFinal suppression checks:\n- If a topic was added only because of a word like \u201cusage\u201d, \u201cmodel\u201d, \u201cnetwork\u201d, \u201ctest\u201d, \u201cpolicy\u201d, \u201cstatus\u201d, \u201ctool\u201d, \u201cplugin\u201d, or \u201cchunk\u201d, verify that the topic is actually the subject.\n- Prefer the narrow central topic over broad fallback labels.\n- Remove labels that come only from symptoms, implementation details, tests, examples, files changed, or incidental words."}},
{"idx": 2, "score": 0.7361, "parents": "1", "role": "Best", "components": {"routing_policy": "You are classifying GitHub issues or pull requests into the smallest complete set of allowed topic ids.\n\nInput format:\n- You may receive a GitHub target URL, title, and sometimes a body or summary.\n- The title is the primary signal.\n- Use the first clear body summary only when the title is ambiguous.\n- Ignore examples, tests, files changed, implementation details, incidental keywords, and broad impact unless they are the actual user-visible subject.\n- Return only final JSON using exact allowed topic ids, for example:\n {\"topics_of_interest\":[\"queueing\",\"docs\"]}\n\nTask:\nChoose the minimum topic set that routes the item to the right maintainer bucket without dropping an explicitly central second or third concern.\n\nGeneral process:\n1. Read the title first.\n2. Identify the main user-visible problem, feature, documentation change, policy change, or contract being changed.\n3. Pick one primary topic.\n4. Add a secondary topic only when it is explicitly central and removing it would route the item away from a maintainer who must see it.\n5. Use 3 topics only when the title or first clear summary explicitly names three central facets.\n6. Use 0 topics when no allowed topic is central.\n7. Never invent topic ids.\n8. Output only JSON.\n\nHigh-signal title patterns:\n- A Conventional Commit type like `docs(...)`, `feat(...)`, `fix(...)`, `test(...)`, or `policy(...)` can indicate the kind of change.\n- A scope inside parentheses is often central. For example, `docs(queue): ...` usually includes both `docs` and `queueing`.\n- Do not ignore `test(...)` scopes when the title is about landing or enforcing a behavior contract. The tested contract can be the central subject.\n- Do not blindly label every word in the title. 
Confirm the word names the subject, not just a path, symptom, or context.\n\nDomain rules and corrections:\n\nDocumentation:\n- Documentation-only PRs should usually include `docs` plus the central documented area.\n- Example: `docs(queue): clarify steer behavior with partial streaming and tool boundaries` => `docs`, `queueing`.\n- Do not add `tool_calling` just because the title says \u201ctool boundaries\u201d unless tool-call behavior itself is the central feature or bug.\n\nQueueing:\n- Queue, queueing, queued execution, steer behavior in queues, or queue lifecycle route to `queueing` when central.\n\nTool calling:\n- `tool_calling` is only for tool-call execution, tool-call APIs, tool selection, tool schema handling, or tool-call runtime behavior.\n- Mentions of \u201ctool boundaries\u201d in docs about another system are usually context, not `tool_calling`.\n\nACP, gateway, and runtime:\n- ACP-related work routes to `acp` when ACP is named centrally.\n- ACPX-related sandbox or workflow issues route to `acpx` when ACPX is named centrally.\n- Gateway-owned behavior routes to `gateway` only when gateway is explicitly the owner or subject.\n- Runtime work routes to `agent_runtime` when the title is about runtimes, node-backed runtimes, agent execution runtimes, or runtime ownership.\n- Example: `ACP: add gateway-owned node-backed runtime` => `acp`, `gateway`, `agent_runtime`.\n\nCodex and plugins:\n- Codex-related behavior routes to `codex` when Codex is named centrally.\n- User-installed plugins, plugin inheritance, Superpowers, skills, plugin discovery, plugin installation, or skill/plugin availability route to `skills_plugins`.\n- Example: `[Feature]: ACPX Codex sandbox should inherit user-installed plugins (e.g. 
Superpowers)` => `acpx`, `codex`, `skills_plugins`.\n- Do not drop `skills_plugins` when plugins are the requested feature.\n\nNotifications and chat integrations:\n- Slack, chat app delivery, chat target channels, and chat push behavior route to `chat_integrations`.\n- Announce messages, heartbeat pushes, target-channel pushes, identity overlays on pushed messages, and notification delivery route to `notifications`.\n- Do not add `cron_automation` merely because the notification path mentions `cron --announce`; cron is context unless scheduling, force-run behavior, cron lifecycle, or cron execution is itself broken.\n- Example: `Per-agent identity overlay dropped on cron --announce and heartbeat target-channel Slack pushes` => `notifications`, `chat_integrations`.\n\nCron:\n- Use `cron_automation` when cron scheduling, cron force-run, cron lifecycle, cron execution, or a cron deadlock is central.\n- Example: `cron force-run deadlock` => `cron_automation`.\n\nExec, sandboxing, and approvals:\n- Exec command/tool behavior routes to `exec_tools`.\n- Exec PATH fallback is `exec_tools`.\n- Exec v2 contract follow-through or contract enforcement can centrally include `exec_tools`, `sandboxing`, and `approvals` when the contract covers sandbox and approval behavior.\n- Example: `test(exec): land exec v2 contract follow-through` => `exec_tools`, `sandboxing`, `approvals`.\n- Do not replace sandboxing or approvals with `security` unless the title is actually about a security policy, vulnerability, network restriction, credential boundary, or allowed/blocked security behavior.\n\nBrowser automation:\n- Browser diagnostics, browser automation layers, browser runtime behavior, and browser tooling issues route to `browser_automation`.\n- Example: `layered browser diagnostics` => `browser_automation`.\n- Do not add `gateway` for browser diagnostics unless the gateway itself is explicitly the subject.\n\nMemory and inference:\n- Memory or embeddings provider work routes to 
`memory` when the provider exists for memory/embeddings.\n- Self-hosted inference servers such as llama.cpp, Ollama, vLLM, TGI, and LocalAI route to `self_hosted_inference` when the item is about using those servers as inference providers.\n- Example: `feat(memory/embeddings): add openai-compatible provider for self-hosted servers (llama.cpp, Ollama, vLLM, TGI, LocalAI)` => `memory`, `self_hosted_inference`.\n- Do not add `model_serving` merely because the title says \u201copenai-compatible\u201d, \u201cprovider\u201d, llama.cpp, Ollama, vLLM, TGI, or LocalAI.\n\nModel serving:\n- Use `model_serving` only when the central subject is serving endpoints, OpenAI-compatible request/response protocol behavior, streaming lifecycle, final usage chunks, base URL behavior, endpoint compatibility, request routing, or model-server compatibility.\n- OpenAI-compatible streaming, final usage chunks, stream lifecycle, endpoint compatibility, base URL behavior, vLLM/TGI/LocalAI/llama.cpp serving behavior, and request routing are `model_serving`.\n- Do not add `telemetry_usage` merely because the title mentions usage, tokens, counts, cost, or chunks when those are symptoms of a model-serving protocol bug.\n- Example: `OpenAI-compatible streaming with llama.cpp saves zero usage (stream closed before final usage chunk)` => `model_serving`.\n\nTelemetry and usage:\n- Use `telemetry_usage` only when metric collection, usage accounting/reporting, cost display, diagnostic counts, traces, or status reporting surfaces are themselves the feature or bug.\n\nPolicy/config:\n- Items about policy rules, conformance checks, quality gates, allowed behavior, or configuration-governed enforcement usually include `config` when the policy/checking behavior is central.\n- Do not map \u201cmodel\u201d in \u201cmodel policy\u201d, \u201cmodel conformance\u201d, or \u201cmodel checks\u201d to `model_serving` unless the item is actually about serving endpoints, streaming, endpoint lifecycle, routing, or 
model-server compatibility.\n- Network policy, network conformance, access restrictions, outbound rules, or boundary checks can be `security` when they concern allowed/blocked network behavior.\n- MCP conformance, MCP policy, MCP tool behavior, or MCP protocol checks route to `mcp_tooling`.\n- Example: `Policy: add model, network, and MCP conformance checks` => `mcp_tooling`, `config`, `security`, not `model_serving`.\n\nComposite fixes:\n- If a title lists several independent fixes, classify each central fix up to the smallest complete set.\n- Example: `fix: resolve exec PATH fallback, layered browser diagnostics, and cron force-run deadlock` => `exec_tools`, `browser_automation`, `cron_automation`.\n- Do not substitute a broad infrastructure topic like `gateway` unless it is explicitly one of the listed user-visible subjects.\n\nFinal suppression checks:\n- If a topic was added only because of a word like \u201cusage\u201d, \u201cmodel\u201d, \u201cnetwork\u201d, \u201ctest\u201d, \u201cpolicy\u201d, \u201cstatus\u201d, \u201ctool\u201d, \u201cplugin\u201d, \u201cchunk\u201d, \u201ccron\u201d, \u201cgateway\u201d, or \u201csecurity\u201d, verify that the topic is actually the subject.\n- Prefer the narrow central topic over broad fallback labels.\n- Remove labels that come only from symptoms, implementation details, tests, examples, files changed, or incidental words.\n- Keep required central second and third topics when dropping them would hide the item from a maintainer who owns that area."}},
{"idx": 3, "score": 0.5089, "parents": "2", "role": "Pareto Front", "components": {"routing_policy": "Add these routing corrections to the classifier instructions:\n\n- Treat compound titles as lists of central user-visible fixes. Classify each central item, but do not add labels for every noun.\n- `skills_plugins` is label spam unless the plugin system itself is the requested feature or bug: user-installed plugins, plugin inheritance, Superpowers, skill/plugin discovery, plugin installation, or skill/plugin availability.\n- In titles like `fix: Codex startup plugins + WhatsApp history & Docker Codex OAuth`, keep `codex` because Codex behavior is central, but do not add `skills_plugins` for \u201cstartup plugins\u201d unless the plugin lifecycle is the actual subject.\n- WhatsApp, Slack, chat history, chat app delivery, chat target channels, and chat push behavior route to `chat_integrations` when central.\n- ACP session permission-mode work can require all three topics: `acp`, `approvals`, and `acpx`.\n- Specifically, titles mentioning per-binding or per-agent `permissionMode` for ACP sessions should include `acp`, `approvals`, and `acpx`. 
`permissionMode` is an approval/permission contract, and ACPX owns the ACP session/binding workflow concern.\n- Add `local_models` when the title centrally names local model apps or local model providers such as LM Studio.\n- LM Studio issues involving Responses API behavior, thinking blocks, streaming, request/response compatibility, or visibility of model output should usually include both `model_serving` and `local_models`.\n- Do not replace `local_models` with `self_hosted_inference` when the named subject is LM Studio or another local-model product/app rather than a generic inference server integration.\n- `Responses API`, invisible thinking blocks, OpenAI-compatible behavior, streaming lifecycle, request/response protocol handling, and model-output protocol bugs route to `model_serving`.\n\nAdditional suppression checks:\n- If `skills_plugins` was added only because the title contains \u201cplugins\u201d inside a broader Codex startup or OAuth fix, remove it unless plugin installation/discovery/inheritance/availability is the central user-visible bug.\n- If a chat product name such as WhatsApp appears as a central listed fix, include `chat_integrations`.\n- If ACP + `permissionMode` + per-binding/per-agent/session language appears, include `acpx` in addition to `acp` and `approvals`.\n- If LM Studio appears as a central subject, include `local_models`."}}];
const DOT = `digraph G {
    rankdir=TB;
    node [style=filled, shape=circle, fontsize=14, width=0.6, height=0.6];
    0 [label="0\\n(0.50)", fillcolor=orange, tooltip=" "];
    1 [label="1\\n(0.54)", fillcolor=orange, tooltip=" "];
    2 [label="2\\n(0.74)", fillcolor=cyan, tooltip=" "];
    3 [label="3\\n(0.51)", fillcolor=orange, tooltip=" "];
    0 -> 1;
    1 -> 2;
    2 -> 3;
}`;

// Look up candidates by index when a graph node is hovered or clicked.
const nodeMap = {};
NODES.forEach(n => { nodeMap[n.idx] = n; });

let pinnedIdx = null;  // non-null when tooltip is click-pinned (scrollable)
let hoverIdx = null;   // non-null when hovering over a node

function renderTooltip(idx) {
  const n = nodeMap[idx];
  if (!n) return "";
  let roleBadge = "";
  if (n.role === "Best") roleBadge = '<span class="role-badge role-best">BEST</span>';
  else if (n.role === "Pareto Front") roleBadge = '<span class="role-badge role-pareto">PARETO</span>';
  else if (n.role === "Seed") roleBadge = '<span class="role-badge role-seed">SEED</span>';

  let comps = "";
  for (const [name, text] of Object.entries(n.components)) {
    // Escape HTML metacharacters so prompt text is shown literally, not parsed as markup.
    const escaped = text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
    comps += '<div class="tt-comp-name">' + name + '</div><div class="tt-comp-text">' + escaped + '</div>';
  }

  return '<div class="tt-header">Candidate ' + n.idx + roleBadge + '</div>' +
    '<div class="tt-meta">Score: <strong>' + n.score + '</strong> | Parent(s): ' + n.parents +
    '</div><div class="tt-hint">' + (pinnedIdx === idx ? 'Click node again to dismiss' : 'Click to pin &amp; scroll') + '</div>' +
    comps;
}

function positionTooltip(x, y) {
  const tt = document.getElementById("tooltip");
  let tx = x + 16, ty = y + 16;
  const r = tt.getBoundingClientRect();
  // Flip to the other side of the cursor when the tooltip would overflow the viewport.
  if (tx + r.width > window.innerWidth - 8) tx = x - r.width - 16;
  if (ty + r.height > window.innerHeight - 8) ty = y - r.height - 16;
  if (tx < 8) tx = 8;
  if (ty < 8) ty = 8;
  tt.style.left = tx + "px";
  tt.style.top = ty + "px";
}

function showTooltip(idx, x, y) {
  const tt = document.getElementById("tooltip");
  tt.innerHTML = renderTooltip(idx);
  tt.style.display = "block";
  positionTooltip(x, y);
}

function hideTooltip() {
  pinnedIdx = null;
  hoverIdx = null;
  document.getElementById("tooltip").style.display = "none";
}

// Click outside tooltip and outside nodes dismisses pinned tooltip
document.addEventListener("mousedown", function(e) {
  const tt = document.getElementById("tooltip");
  if (pinnedIdx !== null && tt.style.display === "block"
      && !tt.contains(e.target) && !e.target.closest(".node")) {
    hideTooltip();
  }
});

async function render() {
  const { instance } = await import("https://cdn.jsdelivr.net/npm/@viz-js/viz@3.11.0/lib/viz-standalone.mjs");
  const viz = await instance();
  const svg = viz.renderSVGElement(DOT);
  const container = document.getElementById("graph-container");
  container.innerHTML = "";
  container.appendChild(svg);

  // Attach hover listeners and strip native <title> tooltips to avoid double tooltips
  svg.querySelectorAll(".node").forEach(node => {
    const title = node.querySelector("title");
    if (!title) return;
    // Graphviz writes the node name (the candidate index) into each node's <title>.
    const idx = parseInt(title.textContent, 10);
    if (isNaN(idx)) return;
    title.remove();  // remove native SVG tooltip
    node.style.cursor = "pointer";
    // Hover: show tooltip following the mouse (non-interactive)
    node.addEventListener("mouseenter", e => {
      if (pinnedIdx !== null) return;  // don't override a pinned tooltip
      hoverIdx = idx;
      showTooltip(idx, e.clientX, e.clientY);
    });
    node.addEventListener("mousemove", e => {
      if (pinnedIdx !== null || hoverIdx !== idx) return;
      positionTooltip(e.clientX, e.clientY);
    });
    node.addEventListener("mouseleave", () => {
      if (pinnedIdx !== null) return;
      hoverIdx = null;
      document.getElementById("tooltip").style.display = "none";
    });
    // Click: pin the tooltip so user can scroll it
    node.addEventListener("click", e => {
      e.stopPropagation();
      if (pinnedIdx === idx) { hideTooltip(); return; }
      pinnedIdx = idx;
      showTooltip(idx, e.clientX, e.clientY);
    });
  });
  // Also strip <title> from edges and the graph itself
  svg.querySelectorAll(".edge title").forEach(t => t.remove());
  const graphTitle = svg.querySelector(":scope > title");
  if (graphTitle) graphTitle.remove();
}

render().catch(err => {
  // On failure, show the error and the raw DOT source (with "<" escaped) as a fallback.
  document.getElementById("graph-container").innerHTML =
    "<p style='color:red'>Failed to render graph: " + err.message + "</p>" +
    "<pre>" + DOT.replace(/</g, "&lt;") + "</pre>";
});
</script>
</body>
</html>
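The page above hardcodes the `DOT` string separately from the `NODES` array, so labels, fill colors, and edges have to be kept in sync by hand. A minimal sketch of deriving the DOT source from a `NODES`-like array follows. The names `ROLE_COLORS`, `buildDot`, and `sampleNodes` are hypothetical and do not appear in the original file, and the sketch assumes `parents` is either `"seed"` or a comma-separated list of candidate indices.

```javascript
// Hypothetical helper, not part of the original page: build the DOT source
// from candidate records so the graph cannot drift from the data.
const ROLE_COLORS = { "Best": "cyan", "Pareto Front": "orange" };

function buildDot(nodes) {
  const lines = [
    "digraph G {",
    "    rankdir=TB;",
    "    node [style=filled, shape=circle, fontsize=14, width=0.6, height=0.6];",
  ];
  for (const n of nodes) {
    // Any role without a legend color falls back to the "Other" color.
    const color = ROLE_COLORS[n.role] || "lightgray";
    // "\\n" emits a literal backslash-n, which Graphviz treats as a label line break.
    lines.push(`    ${n.idx} [label="${n.idx}\\n(${n.score.toFixed(2)})", fillcolor=${color}, tooltip=" "];`);
  }
  for (const n of nodes) {
    // Assumption: the root candidate is marked with parents === "seed" and has no edge.
    if (n.parents === "seed") continue;
    for (const p of n.parents.split(",")) {
      lines.push(`    ${p.trim()} -> ${n.idx};`);
    }
  }
  lines.push("}");
  return lines.join("\n");
}

// Scores, parents, and roles copied from the NODES array in the page.
const sampleNodes = [
  { idx: 0, score: 0.4972, parents: "seed", role: "Pareto Front" },
  { idx: 1, score: 0.5381, parents: "0", role: "Pareto Front" },
  { idx: 2, score: 0.7361, parents: "1", role: "Best" },
  { idx: 3, score: 0.5089, parents: "2", role: "Pareto Front" },
];
const dot = buildDot(sampleNodes);
```

With the four candidates above, `dot` contains the same node and edge statements as the hardcoded `DOT` constant, so it could be passed to `viz.renderSVGElement` in its place.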
gepa-12b-multi-from-six-20260613T051216Z/candidates.json
ADDED
@@ -0,0 +1,14 @@
[
{
"routing_policy": "You are classifying GitHub issues or pull requests into the smallest complete set of allowed topic ids.\n\nThis is a fuzzy multi-label routing task. Your goal is not to mention every related area. Your goal is to choose the minimum topic set that sends the item to the right maintainer bucket without dropping an explicit central second concern.\n\nProcess:\n\n1. Read the title first.\n2. Identify the main user-visible problem, feature, or policy change.\n3. Pick one primary topic.\n4. Read only the first clear body summary if needed to disambiguate.\n5. Add a secondary topic only when it is explicitly central and removing it would route the item away from a maintainer who must see it.\n6. Remove topics that come only from symptoms, implementation details, tests, examples, files changed, broad impact, or incidental words.\n7. Return only exact allowed topic ids.\n\nDo not over-label from keywords.\n\nImportant domain rules:\n\n- OpenAI-compatible streaming, final usage chunks, stream lifecycle, endpoint compatibility, base URL behavior, vLLM/TGI/LocalAI/llama.cpp serving behavior, and request routing are `model_serving`.\n- Do not add `telemetry_usage` merely because the title mentions usage, tokens, counts, cost, or chunks when those are symptoms of a model-serving protocol bug.\n- Example: \u201cOpenAI-compatible streaming with llama.cpp saves zero usage (stream closed before final usage chunk)\u201d is only `model_serving`. 
The central issue is the OpenAI-compatible streaming/final usage chunk behavior, not telemetry reporting.\n- Use `telemetry_usage` only when the metric, usage accounting/reporting, cost display, diagnostic count, trace, or status reporting surface is itself the feature or bug.\n\nPolicy/config rules:\n\n- Items about policy rules, conformance checks, quality gates, allowed behavior, or configuration-governed enforcement usually include `config` when the policy/checking behavior is central.\n- Do not map the word \u201cmodel\u201d in \u201cmodel policy\u201d, \u201cmodel conformance\u201d, or \u201cmodel checks\u201d to `model_serving` unless the item is actually about serving endpoints, streaming, endpoint lifecycle, routing, or model-server compatibility.\n- Network policy, network conformance, access restrictions, outbound rules, or boundary checks can be `security` when they concern allowed/blocked network behavior.\n- MCP conformance, MCP policy, MCP tool behavior, or MCP protocol checks route to `mcp_tooling`.\n- Example: \u201cPolicy: add model, network, and MCP conformance checks\u201d should be `mcp_tooling`, `config`, and `security`, not `model_serving`.\n\nCardinality guidance:\n\n- Use 0 topics when no allowed topic is central.\n- Use 1 topic for a single-focus item.\n- Use 2 topics for normal cross-topic items.\n- Use 3 topics only when the title or first clear summary explicitly has three central facets.\n- Use 4+ topics only for explicit multi-system coordination.\n\nFinal suppression checks before output:\n\n- If a topic was added only because of a word like \u201cusage\u201d, \u201cmodel\u201d, \u201cnetwork\u201d, \u201ctest\u201d, \u201cpolicy\u201d, \u201cstatus\u201d, or \u201cchunk\u201d, verify that the topic is actually the subject, not just context.\n- Prefer the narrower central topic over a broad fallback.\n- Never invent topic ids.\n- Output only the final JSON with the selected topic ids."
},
{
"routing_policy": "You are classifying GitHub issues or pull requests into the smallest complete set of allowed topic ids.\n\nInput format:\n- You may receive a GitHub target URL, a title, and sometimes a body or summary.\n- The title is the primary signal.\n- Use the first clear body summary only when the title is ambiguous.\n- Ignore examples, tests, files changed, implementation details, incidental keywords, and broad impact unless they are the actual user-visible subject.\n- Return only final JSON using exact allowed topic ids, for example:\n {\"topics_of_interest\":[\"queueing\",\"docs\"]}\n\nTask:\nChoose the minimum topic set that routes the item to the right maintainer bucket without dropping an explicitly central second concern.\n\nGeneral process:\n1. Read the title first.\n2. Identify the main user-visible problem, feature, documentation change, or policy change.\n3. Pick one primary topic.\n4. Add a secondary topic only when it is explicitly central and removing it would route the item away from a maintainer who must see it.\n5. Use 3 topics only when the title or first clear summary explicitly names three central facets.\n6. Use 0 topics when no allowed topic is central.\n7. Never invent topic ids.\n8. Output only JSON.\n\nHigh-signal title patterns:\n- A Conventional Commit type like `docs(...)`, `feat(...)`, `fix(...)`, or `policy(...)` can indicate the kind of change.\n- A scope inside parentheses is often central. For example, `docs(queue): ...` usually includes both `docs` and `queueing`.\n- Do not blindly label every word in the title. 
Confirm the word names the subject, not just context.\n\nDomain rules and corrections:\n- Documentation-only PRs should usually include `docs` plus the central documented area.\n - Example: `docs(queue): clarify steer behavior with partial streaming and tool boundaries` => `docs`, `queueing`.\n - Do not add `tool_calling` just because the title says \u201ctool boundaries\u201d unless tool calling behavior itself is the central feature or bug.\n\n- Queue, queueing, queued execution, steer behavior in queues, or queue lifecycle route to `queueing` when central.\n\n- `tool_calling` is only for tool-call execution, tool-call APIs, tool selection, tool schema handling, or tool-call runtime behavior.\n - Mentions of \u201ctool boundaries\u201d in docs about another system are usually context, not `tool_calling`.\n\n- ACPX-related sandbox or workflow issues route to `acpx` when ACPX is named centrally.\n- Codex-related behavior routes to `codex` when Codex is named centrally.\n- User-installed plugins, plugin inheritance, Superpowers, skills, plugin discovery, plugin installation, or skill/plugin availability route to `skills_plugins`.\n - Example: `[Feature]: ACPX Codex sandbox should inherit user-installed plugins (e.g. 
Superpowers)` => `acpx`, `codex`, `skills_plugins`.\n - Do not drop `skills_plugins` when plugins are the requested feature.\n\n- Memory or embeddings provider work routes to `memory` when the provider exists for memory/embeddings.\n- Self-hosted inference servers such as llama.cpp, Ollama, vLLM, TGI, and LocalAI route to `self_hosted_inference` when the item is about using those servers as inference providers.\n - Example: `feat(memory/embeddings): add openai-compatible provider for self-hosted servers (llama.cpp, Ollama, vLLM, TGI, LocalAI)` => `memory`, `self_hosted_inference`.\n - Do not add `model_serving` merely because the title says \u201copenai-compatible\u201d, \u201cprovider\u201d, llama.cpp, Ollama, vLLM, TGI, or LocalAI.\n\n- Use `model_serving` only when the central subject is serving endpoints, OpenAI-compatible request/response protocol behavior, streaming lifecycle, final usage chunks, base URL behavior, endpoint compatibility, request routing, or model-server compatibility.\n - OpenAI-compatible streaming, final usage chunks, stream lifecycle, endpoint compatibility, base URL behavior, vLLM/TGI/LocalAI/llama.cpp serving behavior, and request routing are `model_serving`.\n - Do not add `telemetry_usage` merely because the title mentions usage, tokens, counts, cost, or chunks when those are symptoms of a model-serving protocol bug.\n - Example: \u201cOpenAI-compatible streaming with llama.cpp saves zero usage (stream closed before final usage chunk)\u201d => `model_serving` only.\n\n- Use `telemetry_usage` only when metric collection, usage accounting/reporting, cost display, diagnostic counts, traces, or status reporting surfaces are themselves the feature or bug.\n\nPolicy/config rules:\n- Items about policy rules, conformance checks, quality gates, allowed behavior, or configuration-governed enforcement usually include `config` when the policy/checking behavior is central.\n- Do not map \u201cmodel\u201d in \u201cmodel policy\u201d, \u201cmodel 
conformance\u201d, or \u201cmodel checks\u201d to `model_serving` unless the item is actually about serving endpoints, streaming, endpoint lifecycle, routing, or model-server compatibility.\n- Network policy, network conformance, access restrictions, outbound rules, or boundary checks can be `security` when they concern allowed/blocked network behavior.\n- MCP conformance, MCP policy, MCP tool behavior, or MCP protocol checks route to `mcp_tooling`.\n- Example: \u201cPolicy: add model, network, and MCP conformance checks\u201d => `mcp_tooling`, `config`, `security`, not `model_serving`.\n\nFinal suppression checks:\n- If a topic was added only because of a word like \u201cusage\u201d, \u201cmodel\u201d, \u201cnetwork\u201d, \u201ctest\u201d, \u201cpolicy\u201d, \u201cstatus\u201d, \u201ctool\u201d, \u201cplugin\u201d, or \u201cchunk\u201d, verify that the topic is actually the subject.\n- Prefer the narrow central topic over broad fallback labels.\n- Remove labels that come only from symptoms, implementation details, tests, examples, files changed, or incidental words."
|
| 7 |
+
},
|
| 8 |
+
{
|
| 9 |
+
"routing_policy": "You are classifying GitHub issues or pull requests into the smallest complete set of allowed topic ids.\n\nInput format:\n- You may receive a GitHub target URL, title, and sometimes a body or summary.\n- The title is the primary signal.\n- Use the first clear body summary only when the title is ambiguous.\n- Ignore examples, tests, files changed, implementation details, incidental keywords, and broad impact unless they are the actual user-visible subject.\n- Return only final JSON using exact allowed topic ids, for example:\n {\"topics_of_interest\":[\"queueing\",\"docs\"]}\n\nTask:\nChoose the minimum topic set that routes the item to the right maintainer bucket without dropping an explicitly central second or third concern.\n\nGeneral process:\n1. Read the title first.\n2. Identify the main user-visible problem, feature, documentation change, policy change, or contract being changed.\n3. Pick one primary topic.\n4. Add a secondary topic only when it is explicitly central and removing it would route the item away from a maintainer who must see it.\n5. Use 3 topics only when the title or first clear summary explicitly names three central facets.\n6. Use 0 topics when no allowed topic is central.\n7. Never invent topic ids.\n8. Output only JSON.\n\nHigh-signal title patterns:\n- A Conventional Commit type like `docs(...)`, `feat(...)`, `fix(...)`, `test(...)`, or `policy(...)` can indicate the kind of change.\n- A scope inside parentheses is often central. For example, `docs(queue): ...` usually includes both `docs` and `queueing`.\n- Do not ignore `test(...)` scopes when the title is about landing or enforcing a behavior contract. The tested contract can be the central subject.\n- Do not blindly label every word in the title. 
Confirm the word names the subject, not just a path, symptom, or context.\n\nDomain rules and corrections:\n\nDocumentation:\n- Documentation-only PRs should usually include `docs` plus the central documented area.\n- Example: `docs(queue): clarify steer behavior with partial streaming and tool boundaries` => `docs`, `queueing`.\n- Do not add `tool_calling` just because the title says \u201ctool boundaries\u201d unless tool-call behavior itself is the central feature or bug.\n\nQueueing:\n- Queue, queueing, queued execution, steer behavior in queues, or queue lifecycle route to `queueing` when central.\n\nTool calling:\n- `tool_calling` is only for tool-call execution, tool-call APIs, tool selection, tool schema handling, or tool-call runtime behavior.\n- Mentions of \u201ctool boundaries\u201d in docs about another system are usually context, not `tool_calling`.\n\nACP, gateway, and runtime:\n- ACP-related work routes to `acp` when ACP is named centrally.\n- ACPX-related sandbox or workflow issues route to `acpx` when ACPX is named centrally.\n- Gateway-owned behavior routes to `gateway` only when gateway is explicitly the owner or subject.\n- Runtime work routes to `agent_runtime` when the title is about runtimes, node-backed runtimes, agent execution runtimes, or runtime ownership.\n- Example: `ACP: add gateway-owned node-backed runtime` => `acp`, `gateway`, `agent_runtime`.\n\nCodex and plugins:\n- Codex-related behavior routes to `codex` when Codex is named centrally.\n- User-installed plugins, plugin inheritance, Superpowers, skills, plugin discovery, plugin installation, or skill/plugin availability route to `skills_plugins`.\n- Example: `[Feature]: ACPX Codex sandbox should inherit user-installed plugins (e.g. 
Superpowers)` => `acpx`, `codex`, `skills_plugins`.\n- Do not drop `skills_plugins` when plugins are the requested feature.\n\nNotifications and chat integrations:\n- Slack, chat app delivery, chat target channels, and chat push behavior route to `chat_integrations`.\n- Announce messages, heartbeat pushes, target-channel pushes, identity overlays on pushed messages, and notification delivery route to `notifications`.\n- Do not add `cron_automation` merely because the notification path mentions `cron --announce`; cron is context unless scheduling, force-run behavior, cron lifecycle, or cron execution is itself broken.\n- Example: `Per-agent identity overlay dropped on cron --announce and heartbeat target-channel Slack pushes` => `notifications`, `chat_integrations`.\n\nCron:\n- Use `cron_automation` when cron scheduling, cron force-run, cron lifecycle, cron execution, or a cron deadlock is central.\n- Example: `cron force-run deadlock` => `cron_automation`.\n\nExec, sandboxing, and approvals:\n- Exec command/tool behavior routes to `exec_tools`.\n- Exec PATH fallback is `exec_tools`.\n- Exec v2 contract follow-through or contract enforcement can centrally include `exec_tools`, `sandboxing`, and `approvals` when the contract covers sandbox and approval behavior.\n- Example: `test(exec): land exec v2 contract follow-through` => `exec_tools`, `sandboxing`, `approvals`.\n- Do not replace sandboxing or approvals with `security` unless the title is actually about a security policy, vulnerability, network restriction, credential boundary, or allowed/blocked security behavior.\n\nBrowser automation:\n- Browser diagnostics, browser automation layers, browser runtime behavior, and browser tooling issues route to `browser_automation`.\n- Example: `layered browser diagnostics` => `browser_automation`.\n- Do not add `gateway` for browser diagnostics unless the gateway itself is explicitly the subject.\n\nMemory and inference:\n- Memory or embeddings provider work routes to 
`memory` when the provider exists for memory/embeddings.\n- Self-hosted inference servers such as llama.cpp, Ollama, vLLM, TGI, and LocalAI route to `self_hosted_inference` when the item is about using those servers as inference providers.\n- Example: `feat(memory/embeddings): add openai-compatible provider for self-hosted servers (llama.cpp, Ollama, vLLM, TGI, LocalAI)` => `memory`, `self_hosted_inference`.\n- Do not add `model_serving` merely because the title says \u201copenai-compatible\u201d, \u201cprovider\u201d, llama.cpp, Ollama, vLLM, TGI, or LocalAI.\n\nModel serving:\n- Use `model_serving` only when the central subject is serving endpoints, OpenAI-compatible request/response protocol behavior, streaming lifecycle, final usage chunks, base URL behavior, endpoint compatibility, request routing, or model-server compatibility.\n- OpenAI-compatible streaming, final usage chunks, stream lifecycle, endpoint compatibility, base URL behavior, vLLM/TGI/LocalAI/llama.cpp serving behavior, and request routing are `model_serving`.\n- Do not add `telemetry_usage` merely because the title mentions usage, tokens, counts, cost, or chunks when those are symptoms of a model-serving protocol bug.\n- Example: `OpenAI-compatible streaming with llama.cpp saves zero usage (stream closed before final usage chunk)` => `model_serving`.\n\nTelemetry and usage:\n- Use `telemetry_usage` only when metric collection, usage accounting/reporting, cost display, diagnostic counts, traces, or status reporting surfaces are themselves the feature or bug.\n\nPolicy/config:\n- Items about policy rules, conformance checks, quality gates, allowed behavior, or configuration-governed enforcement usually include `config` when the policy/checking behavior is central.\n- Do not map \u201cmodel\u201d in \u201cmodel policy\u201d, \u201cmodel conformance\u201d, or \u201cmodel checks\u201d to `model_serving` unless the item is actually about serving endpoints, streaming, endpoint lifecycle, routing, or 
model-server compatibility.\n- Network policy, network conformance, access restrictions, outbound rules, or boundary checks can be `security` when they concern allowed/blocked network behavior.\n- MCP conformance, MCP policy, MCP tool behavior, or MCP protocol checks route to `mcp_tooling`.\n- Example: `Policy: add model, network, and MCP conformance checks` => `mcp_tooling`, `config`, `security`, not `model_serving`.\n\nComposite fixes:\n- If a title lists several independent fixes, classify each central fix up to the smallest complete set.\n- Example: `fix: resolve exec PATH fallback, layered browser diagnostics, and cron force-run deadlock` => `exec_tools`, `browser_automation`, `cron_automation`.\n- Do not substitute a broad infrastructure topic like `gateway` unless it is explicitly one of the listed user-visible subjects.\n\nFinal suppression checks:\n- If a topic was added only because of a word like \u201cusage\u201d, \u201cmodel\u201d, \u201cnetwork\u201d, \u201ctest\u201d, \u201cpolicy\u201d, \u201cstatus\u201d, \u201ctool\u201d, \u201cplugin\u201d, \u201cchunk\u201d, \u201ccron\u201d, \u201cgateway\u201d, or \u201csecurity\u201d, verify that the topic is actually the subject.\n- Prefer the narrow central topic over broad fallback labels.\n- Remove labels that come only from symptoms, implementation details, tests, examples, files changed, or incidental words.\n- Keep required central second and third topics when dropping them would hide the item from a maintainer who owns that area."
|
| 10 |
+
},
|
| 11 |
+
{
|
| 12 |
+
"routing_policy": "Add these routing corrections to the classifier instructions:\n\n- Treat compound titles as lists of central user-visible fixes. Classify each central item, but do not add labels for every noun.\n- `skills_plugins` is label spam unless the plugin system itself is the requested feature or bug: user-installed plugins, plugin inheritance, Superpowers, skill/plugin discovery, plugin installation, or skill/plugin availability.\n- In titles like `fix: Codex startup plugins + WhatsApp history & Docker Codex OAuth`, keep `codex` because Codex behavior is central, but do not add `skills_plugins` for \u201cstartup plugins\u201d unless the plugin lifecycle is the actual subject.\n- WhatsApp, Slack, chat history, chat app delivery, chat target channels, and chat push behavior route to `chat_integrations` when central.\n- ACP session permission-mode work can require all three topics: `acp`, `approvals`, and `acpx`.\n- Specifically, titles mentioning per-binding or per-agent `permissionMode` for ACP sessions should include `acp`, `approvals`, and `acpx`. 
`permissionMode` is an approval/permission contract, and ACPX owns the ACP session/binding workflow concern.\n- Add `local_models` when the title centrally names local model apps or local model providers such as LM Studio.\n- LM Studio issues involving Responses API behavior, thinking blocks, streaming, request/response compatibility, or visibility of model output should usually include both `model_serving` and `local_models`.\n- Do not replace `local_models` with `self_hosted_inference` when the named subject is LM Studio or another local-model product/app rather than a generic inference server integration.\n- `Responses API`, invisible thinking blocks, OpenAI-compatible behavior, streaming lifecycle, request/response protocol handling, and model-output protocol bugs route to `model_serving`.\n\nAdditional suppression checks:\n- If `skills_plugins` was added only because the title contains \u201cplugins\u201d inside a broader Codex startup or OAuth fix, remove it unless plugin installation/discovery/inheritance/availability is the central user-visible bug.\n- If a chat product name such as WhatsApp appears as a central listed fix, include `chat_integrations`.\n- If ACP + `permissionMode` + per-binding/per-agent/session language appears, include `acpx` in addition to `acp` and `approvals`.\n- If LM Studio appears as a central subject, include `local_models`."
|
| 13 |
+
}
|
| 14 |
+
]
|
gepa-12b-multi-from-six-20260613T051216Z/generated_best_outputs_valset/task_0/iter_0_prog_0.json
ADDED
|
@@ -0,0 +1 @@
| 1 |
+
"ClassifierOutput(topics_of_interest=(), description='', caveats=(), error='classifier exit 217: prompt: /home/bob/.local/state/localpager/classifier/prompts/20260613T051216Z-3827873.md\\nschema: /home/bob/.local/state/localpager/classifier/schemas/20260613T051216Z-3827873.json\\nsession: /home/bob/.local/state/localpager/classifier/sessions/20260613T051216Z-3827873')"
|
gepa-12b-multi-from-six-20260613T051216Z/generated_best_outputs_valset/task_1/iter_0_prog_0.json
ADDED
|
@@ -0,0 +1 @@
| 1 |
+
"ClassifierOutput(topics_of_interest=('mcp_tooling', 'security', 'config'), description='This PR implements configuration-governed conformance checks for MCP servers (mcp_tooling), network security (security), and general policy configurations (config).', caveats=('The \"model\" aspect refers to policy enforcement on provider lists rather than serving protocols, thus it is handled by config/mcp_tooling logic.',), error=None)"
|
gepa-12b-multi-from-six-20260613T051216Z/generated_best_outputs_valset/task_10/iter_0_prog_0.json
ADDED
|
@@ -0,0 +1 @@
| 1 |
+
"ClassifierOutput(topics_of_interest=('exec_tools', 'security'), description=\"The PR implements security hardening and contract enforcement for the 'exec' tool, specifically regarding safe binary path validation and allowlist logic.\", caveats=(), error=None)"
|
gepa-12b-multi-from-six-20260613T051216Z/generated_best_outputs_valset/task_11/iter_0_prog_0.json
ADDED
|
@@ -0,0 +1 @@
| 1 |
+
"ClassifierOutput(topics_of_interest=('acpx', 'codex', 'skills_plugins'), description='This feature request addresses plugin inheritance for the ACPX Codex sandbox, involving core components like acpx protocol handling, codex runtime environments, and skill/plugin integration.', caveats=('The issue explicitly involves the ACPX infrastructure, Codex runtime environment, and the functionality of user-installed skills/plugins.',), error=None)"
|
gepa-12b-multi-from-six-20260613T051216Z/generated_best_outputs_valset/task_12/iter_0_prog_0.json
ADDED
|
@@ -0,0 +1 @@
| 1 |
+
"ClassifierOutput(topics_of_interest=('memory', 'self_hosted_inference'), description='Adds an openai-compatible embedding provider to support various self-hosted inference engines like Ollama and vLLM within the memory system.', caveats=(), error=None)"
|
gepa-12b-multi-from-six-20260613T051216Z/generated_best_outputs_valset/task_13/iter_0_prog_0.json
ADDED
|
@@ -0,0 +1 @@
| 1 |
+
"ClassifierOutput(topics_of_interest=(), description='', caveats=(), error='classifier exit 2: prompt: /home/bob/.local/state/localpager/classifier/prompts/20260613T051752Z-3832136.md\\nschema: /home/bob/.local/state/localpager/classifier/schemas/20260613T051752Z-3832136.json\\nsession: /home/bob/.local/state/localpager/classifier/sessions/20260613T051752Z-3832136\\nlocalpager-agent: final_json was not called; no structured output was captured')"
|
gepa-12b-multi-from-six-20260613T051216Z/generated_best_outputs_valset/task_14/iter_0_prog_0.json
ADDED
|
@@ -0,0 +1 @@
| 1 |
+
"ClassifierOutput(topics_of_interest=('docs',), description='The pull request exclusively modifies documentation by adding a YouTube tutorial link and embed to the showcase page.', caveats=(), error=None)"
|
gepa-12b-multi-from-six-20260613T051216Z/generated_best_outputs_valset/task_15/iter_0_prog_0.json
ADDED
|
@@ -0,0 +1 @@
| 1 |
+
"ClassifierOutput(topics_of_interest=('open_weight_models', 'model_serving'), description='This PR corrects context window metadata for open-weight models served via Mantle to fix premature truncation issues.', caveats=('The pull request specifically targets the inclusion of known context windows for \"open-weight\" models on the Mantle provider.',), error=None)"
|
gepa-12b-multi-from-six-20260613T051216Z/generated_best_outputs_valset/task_16/iter_0_prog_0.json
ADDED
|
@@ -0,0 +1 @@
| 1 |
+
"ClassifierOutput(topics_of_interest=('cron_automation', 'chat_integrations'), description='Issue reports that per-agent identity settings are ignored during cron announcements and heartbeats sent to Slack, affecting both automation logic and chat integration behavior.', caveats=('The bug specifically involves the failure of agent identity overlays in automated notification paths (cron/heartbeat) for Slack.',), error=None)"
|
gepa-12b-multi-from-six-20260613T051216Z/generated_best_outputs_valset/task_17/iter_0_prog_0.json
ADDED
|
@@ -0,0 +1 @@
| 1 |
+
"ClassifierOutput(topics_of_interest=('model_serving',), description='The PR fixes a bug in processing OpenAI-compatible streaming responses, specifically handling reasoning content from providers like LM Studio.', caveats=(), error=None)"
|
gepa-12b-multi-from-six-20260613T051216Z/generated_best_outputs_valset/task_2/iter_0_prog_0.json
ADDED
|
@@ -0,0 +1 @@
| 1 |
+
"ClassifierOutput(topics_of_interest=('exec_tools', 'gateway', 'cron_automation'), description='This multi-faceted PR fixes critical issues across shell tool execution paths, gateway-level browser status diagnostics, and prevents deadlocks in concurrent cron operations.', caveats=('The PR addresses three distinct systems: tool execution logic (exec_tools), gateway-level service reporting and lane management (gateway), and cron job concurrency handling (cron_automation).',), error=None)"
|
gepa-12b-multi-from-six-20260613T051216Z/generated_best_outputs_valset/task_3/iter_0_prog_0.json
ADDED
|
@@ -0,0 +1 @@
| 1 |
+
"ClassifierOutput(topics_of_interest=('codex', 'gateway'), description='The PR fixes gateway startup logic for Codex plugins and implements a new history feature for WhatsApp.', caveats=(), error=None)"
|
gepa-12b-multi-from-six-20260613T051216Z/generated_best_outputs_valset/task_4/iter_0_prog_0.json
ADDED
|
@@ -0,0 +1 @@
| 1 |
+
"ClassifierOutput(topics_of_interest=('model_serving',), description='The issue describes an OpenAI-compatible streaming bug where usage data is missed due to incorrect stream termination, falling under model_serving.', caveats=(), error=None)"
|
gepa-12b-multi-from-six-20260613T051216Z/generated_best_outputs_valset/task_5/iter_0_prog_0.json
ADDED
|
@@ -0,0 +1 @@
| 1 |
+
"ClassifierOutput(topics_of_interest=('acp', 'approvals'), description='The issue proposes granular per-binding and per-agent permission modes for ACP sessions, directly impacting the core ACP protocol and approval flows.', caveats=('None.',), error=None)"
|
gepa-12b-multi-from-six-20260613T051216Z/generated_best_outputs_valset/task_6/iter_0_prog_0.json
ADDED
|
@@ -0,0 +1 @@
| 1 |
+
"ClassifierOutput(topics_of_interest=('gateway', 'hooks'), description=\"The PR enhances the gateway's ability to correctly process and report outcomes from guarded hook actions during outbound message sending.\", caveats=(), error=None)"
|
gepa-12b-multi-from-six-20260613T051216Z/generated_best_outputs_valset/task_7/iter_0_prog_0.json
ADDED
|
@@ -0,0 +1 @@
| 1 |
+
"ClassifierOutput(topics_of_interest=('memory', 'reliability'), description='This PR fixes a deadlock in the active-memory recall subagent by isolating it onto its own dedicated lane.', caveats=(\"Included 'reliability' due to the specific fix for a concurrent execution deadlock.\",), error=None)"
|
gepa-12b-multi-from-six-20260613T051216Z/generated_best_outputs_valset/task_8/iter_0_prog_0.json
ADDED
|
@@ -0,0 +1 @@
| 1 |
+
"ClassifierOutput(topics_of_interest=('ui_tui', 'gateway'), description='The PR adds a new gateway RPC to fetch full message content for the web-based chat sidebar UI.', caveats=(), error=None)"
|
gepa-12b-multi-from-six-20260613T051216Z/generated_best_outputs_valset/task_9/iter_0_prog_0.json
ADDED
|
@@ -0,0 +1 @@
| 1 |
+
"ClassifierOutput(topics_of_interest=(), description='', caveats=(), error='classifier exit 2: prompt: /home/bob/.local/state/localpager/classifier/prompts/20260613T051540Z-3830567.md\\nschema: /home/bob/.local/state/localpager/classifier/schemas/20260613T051540Z-3830567.json\\nsession: /home/bob/.local/state/localpager/classifier/sessions/20260613T051540Z-3830567\\nlocalpager-agent: final_json was not called; no structured output was captured')"
|
gepa-12b-multi-from-six-20260613T051216Z/gepa-result.json
ADDED
|
@@ -0,0 +1,223 @@
| 1 |
+
{
|
| 2 |
+
"_str_candidate_key": null,
|
| 3 |
+
"best_idx": 2,
|
| 4 |
+
"best_outputs_valset": null,
|
| 5 |
+
"candidates": [
|
| 6 |
+
{
|
| 7 |
+
"routing_policy": "You are classifying GitHub issues or pull requests into the smallest complete set of allowed topic ids.\n\nThis is a fuzzy multi-label routing task. Your goal is not to mention every related area. Your goal is to choose the minimum topic set that sends the item to the right maintainer bucket without dropping an explicit central second concern.\n\nProcess:\n\n1. Read the title first.\n2. Identify the main user-visible problem, feature, or policy change.\n3. Pick one primary topic.\n4. Read only the first clear body summary if needed to disambiguate.\n5. Add a secondary topic only when it is explicitly central and removing it would route the item away from a maintainer who must see it.\n6. Remove topics that come only from symptoms, implementation details, tests, examples, files changed, broad impact, or incidental words.\n7. Return only exact allowed topic ids.\n\nDo not over-label from keywords.\n\nImportant domain rules:\n\n- OpenAI-compatible streaming, final usage chunks, stream lifecycle, endpoint compatibility, base URL behavior, vLLM/TGI/LocalAI/llama.cpp serving behavior, and request routing are `model_serving`.\n- Do not add `telemetry_usage` merely because the title mentions usage, tokens, counts, cost, or chunks when those are symptoms of a model-serving protocol bug.\n- Example: \u201cOpenAI-compatible streaming with llama.cpp saves zero usage (stream closed before final usage chunk)\u201d is only `model_serving`. 
The central issue is the OpenAI-compatible streaming/final usage chunk behavior, not telemetry reporting.\n- Use `telemetry_usage` only when the metric, usage accounting/reporting, cost display, diagnostic count, trace, or status reporting surface is itself the feature or bug.\n\nPolicy/config rules:\n\n- Items about policy rules, conformance checks, quality gates, allowed behavior, or configuration-governed enforcement usually include `config` when the policy/checking behavior is central.\n- Do not map the word \u201cmodel\u201d in \u201cmodel policy\u201d, \u201cmodel conformance\u201d, or \u201cmodel checks\u201d to `model_serving` unless the item is actually about serving endpoints, streaming, endpoint lifecycle, routing, or model-server compatibility.\n- Network policy, network conformance, access restrictions, outbound rules, or boundary checks can be `security` when they concern allowed/blocked network behavior.\n- MCP conformance, MCP policy, MCP tool behavior, or MCP protocol checks route to `mcp_tooling`.\n- Example: \u201cPolicy: add model, network, and MCP conformance checks\u201d should be `mcp_tooling`, `config`, and `security`, not `model_serving`.\n\nCardinality guidance:\n\n- Use 0 topics when no allowed topic is central.\n- Use 1 topic for a single-focus item.\n- Use 2 topics for normal cross-topic items.\n- Use 3 topics only when the title or first clear summary explicitly has three central facets.\n- Use 4+ topics only for explicit multi-system coordination.\n\nFinal suppression checks before output:\n\n- If a topic was added only because of a word like \u201cusage\u201d, \u201cmodel\u201d, \u201cnetwork\u201d, \u201ctest\u201d, \u201cpolicy\u201d, \u201cstatus\u201d, or \u201cchunk\u201d, verify that the topic is actually the subject, not just context.\n- Prefer the narrower central topic over a broad fallback.\n- Never invent topic ids.\n- Output only the final JSON with the selected topic ids."
|
| 8 |
+
},
|
| 9 |
+
{
|
| 10 |
+
"routing_policy": "You are classifying GitHub issues or pull requests into the smallest complete set of allowed topic ids.\n\nInput format:\n- You may receive a GitHub target URL, a title, and sometimes a body or summary.\n- The title is the primary signal.\n- Use the first clear body summary only when the title is ambiguous.\n- Ignore examples, tests, files changed, implementation details, incidental keywords, and broad impact unless they are the actual user-visible subject.\n- Return only final JSON using exact allowed topic ids, for example:\n {\"topics_of_interest\":[\"queueing\",\"docs\"]}\n\nTask:\nChoose the minimum topic set that routes the item to the right maintainer bucket without dropping an explicitly central second concern.\n\nGeneral process:\n1. Read the title first.\n2. Identify the main user-visible problem, feature, documentation change, or policy change.\n3. Pick one primary topic.\n4. Add a secondary topic only when it is explicitly central and removing it would route the item away from a maintainer who must see it.\n5. Use 3 topics only when the title or first clear summary explicitly names three central facets.\n6. Use 0 topics when no allowed topic is central.\n7. Never invent topic ids.\n8. Output only JSON.\n\nHigh-signal title patterns:\n- A Conventional Commit type like `docs(...)`, `feat(...)`, `fix(...)`, or `policy(...)` can indicate the kind of change.\n- A scope inside parentheses is often central. For example, `docs(queue): ...` usually includes both `docs` and `queueing`.\n- Do not blindly label every word in the title. 
Confirm the word names the subject, not just context.\n\nDomain rules and corrections:\n- Documentation-only PRs should usually include `docs` plus the central documented area.\n - Example: `docs(queue): clarify steer behavior with partial streaming and tool boundaries` => `docs`, `queueing`.\n - Do not add `tool_calling` just because the title says \u201ctool boundaries\u201d unless tool calling behavior itself is the central feature or bug.\n\n- Queue, queueing, queued execution, steer behavior in queues, or queue lifecycle route to `queueing` when central.\n\n- `tool_calling` is only for tool-call execution, tool-call APIs, tool selection, tool schema handling, or tool-call runtime behavior.\n - Mentions of \u201ctool boundaries\u201d in docs about another system are usually context, not `tool_calling`.\n\n- ACPX-related sandbox or workflow issues route to `acpx` when ACPX is named centrally.\n- Codex-related behavior routes to `codex` when Codex is named centrally.\n- User-installed plugins, plugin inheritance, Superpowers, skills, plugin discovery, plugin installation, or skill/plugin availability route to `skills_plugins`.\n - Example: `[Feature]: ACPX Codex sandbox should inherit user-installed plugins (e.g. 
Superpowers)` => `acpx`, `codex`, `skills_plugins`.\n - Do not drop `skills_plugins` when plugins are the requested feature.\n\n- Memory or embeddings provider work routes to `memory` when the provider exists for memory/embeddings.\n- Self-hosted inference servers such as llama.cpp, Ollama, vLLM, TGI, and LocalAI route to `self_hosted_inference` when the item is about using those servers as inference providers.\n - Example: `feat(memory/embeddings): add openai-compatible provider for self-hosted servers (llama.cpp, Ollama, vLLM, TGI, LocalAI)` => `memory`, `self_hosted_inference`.\n - Do not add `model_serving` merely because the title says \u201copenai-compatible\u201d, \u201cprovider\u201d, llama.cpp, Ollama, vLLM, TGI, or LocalAI.\n\n- Use `model_serving` only when the central subject is serving endpoints, OpenAI-compatible request/response protocol behavior, streaming lifecycle, final usage chunks, base URL behavior, endpoint compatibility, request routing, or model-server compatibility.\n - OpenAI-compatible streaming, final usage chunks, stream lifecycle, endpoint compatibility, base URL behavior, vLLM/TGI/LocalAI/llama.cpp serving behavior, and request routing are `model_serving`.\n - Do not add `telemetry_usage` merely because the title mentions usage, tokens, counts, cost, or chunks when those are symptoms of a model-serving protocol bug.\n - Example: \u201cOpenAI-compatible streaming with llama.cpp saves zero usage (stream closed before final usage chunk)\u201d => `model_serving` only.\n\n- Use `telemetry_usage` only when metric collection, usage accounting/reporting, cost display, diagnostic counts, traces, or status reporting surfaces are themselves the feature or bug.\n\nPolicy/config rules:\n- Items about policy rules, conformance checks, quality gates, allowed behavior, or configuration-governed enforcement usually include `config` when the policy/checking behavior is central.\n- Do not map \u201cmodel\u201d in \u201cmodel policy\u201d, \u201cmodel 
conformance\u201d, or \u201cmodel checks\u201d to `model_serving` unless the item is actually about serving endpoints, streaming, endpoint lifecycle, routing, or model-server compatibility.\n- Network policy, network conformance, access restrictions, outbound rules, or boundary checks can be `security` when they concern allowed/blocked network behavior.\n- MCP conformance, MCP policy, MCP tool behavior, or MCP protocol checks route to `mcp_tooling`.\n- Example: \u201cPolicy: add model, network, and MCP conformance checks\u201d => `mcp_tooling`, `config`, `security`, not `model_serving`.\n\nFinal suppression checks:\n- If a topic was added only because of a word like \u201cusage\u201d, \u201cmodel\u201d, \u201cnetwork\u201d, \u201ctest\u201d, \u201cpolicy\u201d, \u201cstatus\u201d, \u201ctool\u201d, \u201cplugin\u201d, or \u201cchunk\u201d, verify that the topic is actually the subject.\n- Prefer the narrow central topic over broad fallback labels.\n- Remove labels that come only from symptoms, implementation details, tests, examples, files changed, or incidental words."
    },
    {
"routing_policy": "You are classifying GitHub issues or pull requests into the smallest complete set of allowed topic ids.\n\nInput format:\n- You may receive a GitHub target URL, title, and sometimes a body or summary.\n- The title is the primary signal.\n- Use the first clear body summary only when the title is ambiguous.\n- Ignore examples, tests, files changed, implementation details, incidental keywords, and broad impact unless they are the actual user-visible subject.\n- Return only final JSON using exact allowed topic ids, for example:\n {\"topics_of_interest\":[\"queueing\",\"docs\"]}\n\nTask:\nChoose the minimum topic set that routes the item to the right maintainer bucket without dropping an explicitly central second or third concern.\n\nGeneral process:\n1. Read the title first.\n2. Identify the main user-visible problem, feature, documentation change, policy change, or contract being changed.\n3. Pick one primary topic.\n4. Add a secondary topic only when it is explicitly central and removing it would route the item away from a maintainer who must see it.\n5. Use 3 topics only when the title or first clear summary explicitly names three central facets.\n6. Use 0 topics when no allowed topic is central.\n7. Never invent topic ids.\n8. Output only JSON.\n\nHigh-signal title patterns:\n- A Conventional Commit type like `docs(...)`, `feat(...)`, `fix(...)`, `test(...)`, or `policy(...)` can indicate the kind of change.\n- A scope inside parentheses is often central. For example, `docs(queue): ...` usually includes both `docs` and `queueing`.\n- Do not ignore `test(...)` scopes when the title is about landing or enforcing a behavior contract. The tested contract can be the central subject.\n- Do not blindly label every word in the title. 
Confirm the word names the subject, not just a path, symptom, or context.\n\nDomain rules and corrections:\n\nDocumentation:\n- Documentation-only PRs should usually include `docs` plus the central documented area.\n- Example: `docs(queue): clarify steer behavior with partial streaming and tool boundaries` => `docs`, `queueing`.\n- Do not add `tool_calling` just because the title says \u201ctool boundaries\u201d unless tool-call behavior itself is the central feature or bug.\n\nQueueing:\n- Queue, queueing, queued execution, steer behavior in queues, or queue lifecycle route to `queueing` when central.\n\nTool calling:\n- `tool_calling` is only for tool-call execution, tool-call APIs, tool selection, tool schema handling, or tool-call runtime behavior.\n- Mentions of \u201ctool boundaries\u201d in docs about another system are usually context, not `tool_calling`.\n\nACP, gateway, and runtime:\n- ACP-related work routes to `acp` when ACP is named centrally.\n- ACPX-related sandbox or workflow issues route to `acpx` when ACPX is named centrally.\n- Gateway-owned behavior routes to `gateway` only when gateway is explicitly the owner or subject.\n- Runtime work routes to `agent_runtime` when the title is about runtimes, node-backed runtimes, agent execution runtimes, or runtime ownership.\n- Example: `ACP: add gateway-owned node-backed runtime` => `acp`, `gateway`, `agent_runtime`.\n\nCodex and plugins:\n- Codex-related behavior routes to `codex` when Codex is named centrally.\n- User-installed plugins, plugin inheritance, Superpowers, skills, plugin discovery, plugin installation, or skill/plugin availability route to `skills_plugins`.\n- Example: `[Feature]: ACPX Codex sandbox should inherit user-installed plugins (e.g. 
Superpowers)` => `acpx`, `codex`, `skills_plugins`.\n- Do not drop `skills_plugins` when plugins are the requested feature.\n\nNotifications and chat integrations:\n- Slack, chat app delivery, chat target channels, and chat push behavior route to `chat_integrations`.\n- Announce messages, heartbeat pushes, target-channel pushes, identity overlays on pushed messages, and notification delivery route to `notifications`.\n- Do not add `cron_automation` merely because the notification path mentions `cron --announce`; cron is context unless scheduling, force-run behavior, cron lifecycle, or cron execution is itself broken.\n- Example: `Per-agent identity overlay dropped on cron --announce and heartbeat target-channel Slack pushes` => `notifications`, `chat_integrations`.\n\nCron:\n- Use `cron_automation` when cron scheduling, cron force-run, cron lifecycle, cron execution, or a cron deadlock is central.\n- Example: `cron force-run deadlock` => `cron_automation`.\n\nExec, sandboxing, and approvals:\n- Exec command/tool behavior routes to `exec_tools`.\n- Exec PATH fallback is `exec_tools`.\n- Exec v2 contract follow-through or contract enforcement can centrally include `exec_tools`, `sandboxing`, and `approvals` when the contract covers sandbox and approval behavior.\n- Example: `test(exec): land exec v2 contract follow-through` => `exec_tools`, `sandboxing`, `approvals`.\n- Do not replace sandboxing or approvals with `security` unless the title is actually about a security policy, vulnerability, network restriction, credential boundary, or allowed/blocked security behavior.\n\nBrowser automation:\n- Browser diagnostics, browser automation layers, browser runtime behavior, and browser tooling issues route to `browser_automation`.\n- Example: `layered browser diagnostics` => `browser_automation`.\n- Do not add `gateway` for browser diagnostics unless the gateway itself is explicitly the subject.\n\nMemory and inference:\n- Memory or embeddings provider work routes to 
`memory` when the provider exists for memory/embeddings.\n- Self-hosted inference servers such as llama.cpp, Ollama, vLLM, TGI, and LocalAI route to `self_hosted_inference` when the item is about using those servers as inference providers.\n- Example: `feat(memory/embeddings): add openai-compatible provider for self-hosted servers (llama.cpp, Ollama, vLLM, TGI, LocalAI)` => `memory`, `self_hosted_inference`.\n- Do not add `model_serving` merely because the title says \u201copenai-compatible\u201d, \u201cprovider\u201d, llama.cpp, Ollama, vLLM, TGI, or LocalAI.\n\nModel serving:\n- Use `model_serving` only when the central subject is serving endpoints, OpenAI-compatible request/response protocol behavior, streaming lifecycle, final usage chunks, base URL behavior, endpoint compatibility, request routing, or model-server compatibility.\n- OpenAI-compatible streaming, final usage chunks, stream lifecycle, endpoint compatibility, base URL behavior, vLLM/TGI/LocalAI/llama.cpp serving behavior, and request routing are `model_serving`.\n- Do not add `telemetry_usage` merely because the title mentions usage, tokens, counts, cost, or chunks when those are symptoms of a model-serving protocol bug.\n- Example: `OpenAI-compatible streaming with llama.cpp saves zero usage (stream closed before final usage chunk)` => `model_serving`.\n\nTelemetry and usage:\n- Use `telemetry_usage` only when metric collection, usage accounting/reporting, cost display, diagnostic counts, traces, or status reporting surfaces are themselves the feature or bug.\n\nPolicy/config:\n- Items about policy rules, conformance checks, quality gates, allowed behavior, or configuration-governed enforcement usually include `config` when the policy/checking behavior is central.\n- Do not map \u201cmodel\u201d in \u201cmodel policy\u201d, \u201cmodel conformance\u201d, or \u201cmodel checks\u201d to `model_serving` unless the item is actually about serving endpoints, streaming, endpoint lifecycle, routing, or 
model-server compatibility.\n- Network policy, network conformance, access restrictions, outbound rules, or boundary checks can be `security` when they concern allowed/blocked network behavior.\n- MCP conformance, MCP policy, MCP tool behavior, or MCP protocol checks route to `mcp_tooling`.\n- Example: `Policy: add model, network, and MCP conformance checks` => `mcp_tooling`, `config`, `security`, not `model_serving`.\n\nComposite fixes:\n- If a title lists several independent fixes, classify each central fix up to the smallest complete set.\n- Example: `fix: resolve exec PATH fallback, layered browser diagnostics, and cron force-run deadlock` => `exec_tools`, `browser_automation`, `cron_automation`.\n- Do not substitute a broad infrastructure topic like `gateway` unless it is explicitly one of the listed user-visible subjects.\n\nFinal suppression checks:\n- If a topic was added only because of a word like \u201cusage\u201d, \u201cmodel\u201d, \u201cnetwork\u201d, \u201ctest\u201d, \u201cpolicy\u201d, \u201cstatus\u201d, \u201ctool\u201d, \u201cplugin\u201d, \u201cchunk\u201d, \u201ccron\u201d, \u201cgateway\u201d, or \u201csecurity\u201d, verify that the topic is actually the subject.\n- Prefer the narrow central topic over broad fallback labels.\n- Remove labels that come only from symptoms, implementation details, tests, examples, files changed, or incidental words.\n- Keep required central second and third topics when dropping them would hide the item from a maintainer who owns that area."
    },
    {
"routing_policy": "Add these routing corrections to the classifier instructions:\n\n- Treat compound titles as lists of central user-visible fixes. Classify each central item, but do not add labels for every noun.\n- `skills_plugins` is label spam unless the plugin system itself is the requested feature or bug: user-installed plugins, plugin inheritance, Superpowers, skill/plugin discovery, plugin installation, or skill/plugin availability.\n- In titles like `fix: Codex startup plugins + WhatsApp history & Docker Codex OAuth`, keep `codex` because Codex behavior is central, but do not add `skills_plugins` for \u201cstartup plugins\u201d unless the plugin lifecycle is the actual subject.\n- WhatsApp, Slack, chat history, chat app delivery, chat target channels, and chat push behavior route to `chat_integrations` when central.\n- ACP session permission-mode work can require all three topics: `acp`, `approvals`, and `acpx`.\n- Specifically, titles mentioning per-binding or per-agent `permissionMode` for ACP sessions should include `acp`, `approvals`, and `acpx`. 
`permissionMode` is an approval/permission contract, and ACPX owns the ACP session/binding workflow concern.\n- Add `local_models` when the title centrally names local model apps or local model providers such as LM Studio.\n- LM Studio issues involving Responses API behavior, thinking blocks, streaming, request/response compatibility, or visibility of model output should usually include both `model_serving` and `local_models`.\n- Do not replace `local_models` with `self_hosted_inference` when the named subject is LM Studio or another local-model product/app rather than a generic inference server integration.\n- `Responses API`, invisible thinking blocks, OpenAI-compatible behavior, streaming lifecycle, request/response protocol handling, and model-output protocol bugs route to `model_serving`.\n\nAdditional suppression checks:\n- If `skills_plugins` was added only because the title contains \u201cplugins\u201d inside a broader Codex startup or OAuth fix, remove it unless plugin installation/discovery/inheritance/availability is the central user-visible bug.\n- If a chat product name such as WhatsApp appears as a central listed fix, include `chat_integrations`.\n- If ACP + `permissionMode` + per-binding/per-agent/session language appears, include `acpx` in addition to `acp` and `approvals`.\n- If LM Studio appears as a central subject, include `local_models`."
    }
  ],
  "discovery_eval_counts": [
    0,
    26,
    52,
    78
  ],
  "num_full_val_evals": 4,
  "objective_pareto_front": {
    "weighted_score": 0.7361111111111112
  },
  "parents": [
    [
      null
    ],
    [
      0
    ],
    [
      1
    ],
    [
      2
    ]
  ],
  "per_objective_best_candidates": {
    "weighted_score": [
      2
    ]
  },
  "per_val_instance_best_candidates": {
    "0": [
      2
    ],
    "1": [
      0,
      2
    ],
    "2": [
      2
    ],
    "3": [
      3
    ],
    "4": [
      0,
      1,
      2
    ],
    "5": [
      3
    ],
    "6": [
      1
    ],
    "7": [
      0
    ],
    "8": [
      0
    ],
    "9": [
      1,
      2,
      3
    ],
    "10": [
      2
    ],
    "11": [
      0,
      1,
      2,
      3
    ],
    "12": [
      0,
      1,
      2
    ],
    "13": [
      2
    ],
    "14": [
      0,
      2,
      3
    ],
    "15": [
      1,
      2
    ],
    "16": [
      2
    ],
    "17": [
      3
    ]
  },
  "run_dir": "prompt-optimizer/out/gepa-12b-multi-from-six-20260613T051216Z",
  "seed": 0,
  "total_metric_calls": 96,
  "val_aggregate_scores": [
    0.4972222222222222,
    0.5380952380952381,
    0.7361111111111112,
    0.5088929588929588
  ],
  "val_aggregate_subscores": [
    {
      "weighted_score": 0.4972222222222222
    },
    {
      "weighted_score": 0.5380952380952381
    },
    {
      "weighted_score": 0.7361111111111112
    },
    {
      "weighted_score": 0.5088929588929589
    }
  ],
  "val_subscores": [
    {
      "0": 0.0,
      "1": 1.0,
      "2": 0.25,
      "3": 0.25,
      "4": 1.0,
      "5": 0.5,
      "6": 0.25,
      "7": 1.0,
      "8": 0.5,
      "9": 0.0,
      "10": 0.2,
      "11": 1.0,
      "12": 1.0,
      "13": 0.0,
      "14": 1.0,
      "15": 0.25,
      "16": 0.25,
      "17": 0.5
    },
    {
      "0": 0.5,
      "1": 0.0,
      "2": 0.5,
      "3": 0.25,
      "4": 1.0,
      "5": 0.5,
      "6": 1.0,
      "7": 0.5,
      "8": 0.25,
      "9": 1.0,
      "10": 0.2,
      "11": 1.0,
      "12": 1.0,
      "13": 0.5,
      "14": 0.2857142857142857,
      "15": 0.5,
      "16": 0.2,
      "17": 0.5
    },
    {
      "0": 1.0,
      "1": 1.0,
      "2": 1.0,
      "3": 0.0,
      "4": 1.0,
      "5": 0.5,
      "6": 0.5,
      "7": 0.5,
      "8": 0.25,
      "9": 1.0,
      "10": 0.5,
      "11": 1.0,
      "12": 1.0,
      "13": 1.0,
      "14": 1.0,
      "15": 0.5,
      "16": 1.0,
      "17": 0.5
    },
    {
      "0": 0.5,
      "1": 0.25,
      "2": 0.15384615384615385,
      "3": 1.0,
      "4": 0.16666666666666666,
      "5": 1.0,
      "6": 0.2857142857142857,
      "7": 0.5,
      "8": 0.25,
      "9": 1.0,
      "10": 0.2,
      "11": 1.0,
      "12": 0.15384615384615385,
      "13": 0.2,
      "14": 1.0,
      "15": 0.25,
      "16": 0.25,
      "17": 1.0
    }
  ],
  "validation_schema_version": 2
}
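The aggregate and per-instance fields in the JSON above appear to be derivable from `val_subscores` alone. The sketch below is a hedged illustration, not the optimizer's own code: it assumes the aggregate is an unweighted mean over the 18 validation instances and that "best candidates" for an instance are all candidates tying for the top score. The helper names `aggregate` and `best_candidates_per_instance` are hypothetical. Under those assumptions the numbers reproduce the values recorded in the file.

```python
from math import fsum

# Per-instance validation scores for candidates 0..3, copied from "val_subscores".
val_subscores = [
    [0.0, 1.0, 0.25, 0.25, 1.0, 0.5, 0.25, 1.0, 0.5,
     0.0, 0.2, 1.0, 1.0, 0.0, 1.0, 0.25, 0.25, 0.5],
    [0.5, 0.0, 0.5, 0.25, 1.0, 0.5, 1.0, 0.5, 0.25,
     1.0, 0.2, 1.0, 1.0, 0.5, 0.2857142857142857, 0.5, 0.2, 0.5],
    [1.0, 1.0, 1.0, 0.0, 1.0, 0.5, 0.5, 0.5, 0.25,
     1.0, 0.5, 1.0, 1.0, 1.0, 1.0, 0.5, 1.0, 0.5],
    [0.5, 0.25, 0.15384615384615385, 1.0, 0.16666666666666666, 1.0,
     0.2857142857142857, 0.5, 0.25, 1.0, 0.2, 1.0,
     0.15384615384615385, 0.2, 1.0, 0.25, 0.25, 1.0],
]


def aggregate(scores):
    """Unweighted mean over validation instances (assumed aggregation rule)."""
    return fsum(scores) / len(scores)


def best_candidates_per_instance(subscores):
    """Map each instance index to the candidate indices tying for its top score."""
    best = {}
    for i in range(len(subscores[0])):
        column = [candidate[i] for candidate in subscores]
        top = max(column)
        best[i] = [c for c, s in enumerate(column) if s == top]
    return best


def front_aggregate(subscores):
    """Mean of the per-instance maximum across the given candidates."""
    n = len(subscores[0])
    return aggregate([max(c[i] for c in subscores) for i in range(n)])


aggregates = [aggregate(s) for s in val_subscores]
best_overall = max(range(len(aggregates)), key=aggregates.__getitem__)
per_instance_best = best_candidates_per_instance(val_subscores)
# Front over candidates 0 and 1 only, i.e. the pool after the first iteration.
front_after_iter1 = front_aggregate(val_subscores[:2])

if __name__ == "__main__":
    print("aggregates:", aggregates)
    print("best candidate by aggregate:", best_overall)
    print("per-instance best:", per_instance_best)
    print("front aggregate after iteration 1:", front_after_iter1)
```

Comparing `aggregates` with `val_aggregate_scores` and `per_instance_best` with `per_val_instance_best_candidates` is a quick consistency check on a run directory. Candidate 2 leads on the mean while candidate 3 is the sole best on instances 3, 5, and 17, which is why a per-instance front retains it despite its lower aggregate.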
gepa-12b-multi-from-six-20260613T051216Z/gepa_state.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b92f2612215b619430254c1627adcc22c907e17753a4bb9b4f615f4f48b940c2
size 22761
gepa-12b-multi-from-six-20260613T051216Z/optimize.stderr.log
ADDED
File without changes
gepa-12b-multi-from-six-20260613T051216Z/optimize.stdout.json
ADDED
@@ -0,0 +1,275 @@
Iteration 0: Base program full valset score: 0.4972222222222222 over 18 / 18 examples
Iteration 1: Selected program 0 score: 0.4972222222222222
Iteration 1: Proposed new text for routing_policy: You are classifying GitHub issues or pull requests into the smallest complete set of allowed topic ids.

Input format:
- You may receive a GitHub target URL, a title, and sometimes a body or summary.
- The title is the primary signal.
- Use the first clear body summary only when the title is ambiguous.
- Ignore examples, tests, files changed, implementation details, incidental keywords, and broad impact unless they are the actual user-visible subject.
- Return only final JSON using exact allowed topic ids, for example:
{"topics_of_interest":["queueing","docs"]}

Task:
Choose the minimum topic set that routes the item to the right maintainer bucket without dropping an explicitly central second concern.

General process:
1. Read the title first.
2. Identify the main user-visible problem, feature, documentation change, or policy change.
3. Pick one primary topic.
4. Add a secondary topic only when it is explicitly central and removing it would route the item away from a maintainer who must see it.
5. Use 3 topics only when the title or first clear summary explicitly names three central facets.
6. Use 0 topics when no allowed topic is central.
7. Never invent topic ids.
8. Output only JSON.

High-signal title patterns:
- A Conventional Commit type like `docs(...)`, `feat(...)`, `fix(...)`, or `policy(...)` can indicate the kind of change.
- A scope inside parentheses is often central. For example, `docs(queue): ...` usually includes both `docs` and `queueing`.
- Do not blindly label every word in the title. Confirm the word names the subject, not just context.

Domain rules and corrections:
- Documentation-only PRs should usually include `docs` plus the central documented area.
- Example: `docs(queue): clarify steer behavior with partial streaming and tool boundaries` => `docs`, `queueing`.
- Do not add `tool_calling` just because the title says “tool boundaries” unless tool calling behavior itself is the central feature or bug.

- Queue, queueing, queued execution, steer behavior in queues, or queue lifecycle route to `queueing` when central.

- `tool_calling` is only for tool-call execution, tool-call APIs, tool selection, tool schema handling, or tool-call runtime behavior.
- Mentions of “tool boundaries” in docs about another system are usually context, not `tool_calling`.

- ACPX-related sandbox or workflow issues route to `acpx` when ACPX is named centrally.
- Codex-related behavior routes to `codex` when Codex is named centrally.
- User-installed plugins, plugin inheritance, Superpowers, skills, plugin discovery, plugin installation, or skill/plugin availability route to `skills_plugins`.
- Example: `[Feature]: ACPX Codex sandbox should inherit user-installed plugins (e.g. Superpowers)` => `acpx`, `codex`, `skills_plugins`.
- Do not drop `skills_plugins` when plugins are the requested feature.

- Memory or embeddings provider work routes to `memory` when the provider exists for memory/embeddings.
- Self-hosted inference servers such as llama.cpp, Ollama, vLLM, TGI, and LocalAI route to `self_hosted_inference` when the item is about using those servers as inference providers.
- Example: `feat(memory/embeddings): add openai-compatible provider for self-hosted servers (llama.cpp, Ollama, vLLM, TGI, LocalAI)` => `memory`, `self_hosted_inference`.
- Do not add `model_serving` merely because the title says “openai-compatible”, “provider”, llama.cpp, Ollama, vLLM, TGI, or LocalAI.

- Use `model_serving` only when the central subject is serving endpoints, OpenAI-compatible request/response protocol behavior, streaming lifecycle, final usage chunks, base URL behavior, endpoint compatibility, request routing, or model-server compatibility.
- OpenAI-compatible streaming, final usage chunks, stream lifecycle, endpoint compatibility, base URL behavior, vLLM/TGI/LocalAI/llama.cpp serving behavior, and request routing are `model_serving`.
- Do not add `telemetry_usage` merely because the title mentions usage, tokens, counts, cost, or chunks when those are symptoms of a model-serving protocol bug.
- Example: “OpenAI-compatible streaming with llama.cpp saves zero usage (stream closed before final usage chunk)” => `model_serving` only.

- Use `telemetry_usage` only when metric collection, usage accounting/reporting, cost display, diagnostic counts, traces, or status reporting surfaces are themselves the feature or bug.

Policy/config rules:
- Items about policy rules, conformance checks, quality gates, allowed behavior, or configuration-governed enforcement usually include `config` when the policy/checking behavior is central.
- Do not map “model” in “model policy”, “model conformance”, or “model checks” to `model_serving` unless the item is actually about serving endpoints, streaming, endpoint lifecycle, routing, or model-server compatibility.
- Network policy, network conformance, access restrictions, outbound rules, or boundary checks can be `security` when they concern allowed/blocked network behavior.
- MCP conformance, MCP policy, MCP tool behavior, or MCP protocol checks route to `mcp_tooling`.
- Example: “Policy: add model, network, and MCP conformance checks” => `mcp_tooling`, `config`, `security`, not `model_serving`.

Final suppression checks:
- If a topic was added only because of a word like “usage”, “model”, “network”, “test”, “policy”, “status”, “tool”, “plugin”, or “chunk”, verify that the topic is actually the subject.
- Prefer the narrow central topic over broad fallback labels.
- Remove labels that come only from symptoms, implementation details, tests, examples, files changed, or incidental words.
Iteration 1: New subsample score 4.0 is better than old score 2.0357142857142856. Continue to full eval and add to candidate pool.
Iteration 1: Found a better program on the valset with score 0.5380952380952381.
Iteration 1: Valset score for new program: 0.5380952380952381 (coverage 18 / 18)
Iteration 1: Val aggregate for new program: 0.5380952380952381
Iteration 1: Individual valset scores for new program: {0: 0.5, 1: 0.0, 2: 0.5, 3: 0.25, 4: 1.0, 5: 0.5, 6: 1.0, 7: 0.5, 8: 0.25, 9: 1.0, 10: 0.2, 11: 1.0, 12: 1.0, 13: 0.5, 14: 0.2857142857142857, 15: 0.5, 16: 0.2, 17: 0.5}
Iteration 1: Objective aggregate scores for new program: {'weighted_score': 0.5380952380952381}
Iteration 1: New valset pareto front scores: {0: 0.5, 1: 1.0, 2: 0.5, 3: 0.25, 4: 1.0, 5: 0.5, 6: 1.0, 7: 1.0, 8: 0.5, 9: 1.0, 10: 0.2, 11: 1.0, 12: 1.0, 13: 0.5, 14: 1.0, 15: 0.5, 16: 0.25, 17: 0.5}
Iteration 1: Objective pareto front scores: {'weighted_score': 0.5380952380952381}
Iteration 1: Valset pareto front aggregate score: 0.6777777777777777
Iteration 1: Updated valset pareto front programs: {0: {1}, 1: {0}, 2: {1}, 3: {0, 1}, 4: {0, 1}, 5: {0, 1}, 6: {1}, 7: {0}, 8: {0}, 9: {1}, 10: {0, 1}, 11: {0, 1}, 12: {0, 1}, 13: {1}, 14: {0}, 15: {1}, 16: {0}, 17: {0, 1}}
Iteration 1: Updated objective pareto front programs: {'weighted_score': {1}}
Iteration 1: Best valset aggregate score so far: 0.5380952380952381
Iteration 1: Best program as per aggregate score on valset: 1
Iteration 1: Best score on valset: 0.5380952380952381
Iteration 1: Linear pareto front program index: 1
Iteration 1: New program candidate index: 1
Iteration 2: Selected program 1 score: 0.5380952380952381
Iteration 2: Proposed new text for routing_policy: You are classifying GitHub issues or pull requests into the smallest complete set of allowed topic ids.

Input format:
- You may receive a GitHub target URL, title, and sometimes a body or summary.
- The title is the primary signal.
- Use the first clear body summary only when the title is ambiguous.
- Ignore examples, tests, files changed, implementation details, incidental keywords, and broad impact unless they are the actual user-visible subject.
- Return only final JSON using exact allowed topic ids, for example:
{"topics_of_interest":["queueing","docs"]}

Task:
Choose the minimum topic set that routes the item to the right maintainer bucket without dropping an explicitly central second or third concern.

General process:
1. Read the title first.
2. Identify the main user-visible problem, feature, documentation change, policy change, or contract being changed.
3. Pick one primary topic.
4. Add a secondary topic only when it is explicitly central and removing it would route the item away from a maintainer who must see it.
5. Use 3 topics only when the title or first clear summary explicitly names three central facets.
6. Use 0 topics when no allowed topic is central.
7. Never invent topic ids.
8. Output only JSON.

High-signal title patterns:
- A Conventional Commit type like `docs(...)`, `feat(...)`, `fix(...)`, `test(...)`, or `policy(...)` can indicate the kind of change.
- A scope inside parentheses is often central. For example, `docs(queue): ...` usually includes both `docs` and `queueing`.
- Do not ignore `test(...)` scopes when the title is about landing or enforcing a behavior contract. The tested contract can be the central subject.
- Do not blindly label every word in the title. Confirm the word names the subject, not just a path, symptom, or context.

Domain rules and corrections:

Documentation:
- Documentation-only PRs should usually include `docs` plus the central documented area.
- Example: `docs(queue): clarify steer behavior with partial streaming and tool boundaries` => `docs`, `queueing`.
- Do not add `tool_calling` just because the title says “tool boundaries” unless tool-call behavior itself is the central feature or bug.

Queueing:
- Queue, queueing, queued execution, steer behavior in queues, or queue lifecycle route to `queueing` when central.

Tool calling:
- `tool_calling` is only for tool-call execution, tool-call APIs, tool selection, tool schema handling, or tool-call runtime behavior.
- Mentions of “tool boundaries” in docs about another system are usually context, not `tool_calling`.

ACP, gateway, and runtime:
- ACP-related work routes to `acp` when ACP is named centrally.
- ACPX-related sandbox or workflow issues route to `acpx` when ACPX is named centrally.
- Gateway-owned behavior routes to `gateway` only when gateway is explicitly the owner or subject.
- Runtime work routes to `agent_runtime` when the title is about runtimes, node-backed runtimes, agent execution runtimes, or runtime ownership.
- Example: `ACP: add gateway-owned node-backed runtime` => `acp`, `gateway`, `agent_runtime`.

Codex and plugins:
- Codex-related behavior routes to `codex` when Codex is named centrally.
- User-installed plugins, plugin inheritance, Superpowers, skills, plugin discovery, plugin installation, or skill/plugin availability route to `skills_plugins`.
- Example: `[Feature]: ACPX Codex sandbox should inherit user-installed plugins (e.g. Superpowers)` => `acpx`, `codex`, `skills_plugins`.
- Do not drop `skills_plugins` when plugins are the requested feature.

Notifications and chat integrations:
- Slack, chat app delivery, chat target channels, and chat push behavior route to `chat_integrations`.
- Announce messages, heartbeat pushes, target-channel pushes, identity overlays on pushed messages, and notification delivery route to `notifications`.
- Do not add `cron_automation` merely because the notification path mentions `cron --announce`; cron is context unless scheduling, force-run behavior, cron lifecycle, or cron execution is itself broken.
- Example: `Per-agent identity overlay dropped on cron --announce and heartbeat target-channel Slack pushes` => `notifications`, `chat_integrations`.

Cron:
- Use `cron_automation` when cron scheduling, cron force-run, cron lifecycle, cron execution, or a cron deadlock is central.
- Example: `cron force-run deadlock` => `cron_automation`.

Exec, sandboxing, and approvals:
- Exec command/tool behavior routes to `exec_tools`.
- Exec PATH fallback is `exec_tools`.
- Exec v2 contract follow-through or contract enforcement can centrally include `exec_tools`, `sandboxing`, and `approvals` when the contract covers sandbox and approval behavior.
- Example: `test(exec): land exec v2 contract follow-through` => `exec_tools`, `sandboxing`, `approvals`.
- Do not replace sandboxing or approvals with `security` unless the title is actually about a security policy, vulnerability, network restriction, credential boundary, or allowed/blocked security behavior.

Browser automation:
- Browser diagnostics, browser automation layers, browser runtime behavior, and browser tooling issues route to `browser_automation`.
- Example: `layered browser diagnostics` => `browser_automation`.
- Do not add `gateway` for browser diagnostics unless the gateway itself is explicitly the subject.

Memory and inference:
- Memory or embeddings provider work routes to `memory` when the provider exists for memory/embeddings.
- Self-hosted inference servers such as llama.cpp, Ollama, vLLM, TGI, and LocalAI route to `self_hosted_inference` when the item is about using those servers as inference providers.
- Example: `feat(memory/embeddings): add openai-compatible provider for self-hosted servers (llama.cpp, Ollama, vLLM, TGI, LocalAI)` => `memory`, `self_hosted_inference`.
- Do not add `model_serving` merely because the title says “openai-compatible”, “provider”, llama.cpp, Ollama, vLLM, TGI, or LocalAI.

Model serving:
- Use `model_serving` only when the central subject is serving endpoints, OpenAI-compatible request/response protocol behavior, streaming lifecycle, final usage chunks, base URL behavior, endpoint compatibility, request routing, or model-server compatibility.
- OpenAI-compatible streaming, final usage chunks, stream lifecycle, endpoint compatibility, base URL behavior, vLLM/TGI/LocalAI/llama.cpp serving behavior, and request routing are `model_serving`.
- Do not add `telemetry_usage` merely because the title mentions usage, tokens, counts, cost, or chunks when those are symptoms of a model-serving protocol bug.
- Example: `OpenAI-compatible streaming with llama.cpp saves zero usage (stream closed before final usage chunk)` => `model_serving`.
|
| 176 |
+
|
| 177 |
+
Telemetry and usage:
|
| 178 |
+
- Use `telemetry_usage` only when metric collection, usage accounting/reporting, cost display, diagnostic counts, traces, or status reporting surfaces are themselves the feature or bug.
|
| 179 |
+
|
| 180 |
+
Policy/config:
|
| 181 |
+
- Items about policy rules, conformance checks, quality gates, allowed behavior, or configuration-governed enforcement usually include `config` when the policy/checking behavior is central.
|
| 182 |
+
- Do not map “model” in “model policy”, “model conformance”, or “model checks” to `model_serving` unless the item is actually about serving endpoints, streaming, endpoint lifecycle, routing, or model-server compatibility.
|
| 183 |
+
- Network policy, network conformance, access restrictions, outbound rules, or boundary checks can be `security` when they concern allowed/blocked network behavior.
|
| 184 |
+
- MCP conformance, MCP policy, MCP tool behavior, or MCP protocol checks route to `mcp_tooling`.
|
| 185 |
+
- Example: `Policy: add model, network, and MCP conformance checks` => `mcp_tooling`, `config`, `security`, not `model_serving`.
|
| 186 |
+
|
| 187 |
+
Composite fixes:
|
| 188 |
+
- If a title lists several independent fixes, classify each central fix up to the smallest complete set.
|
| 189 |
+
- Example: `fix: resolve exec PATH fallback, layered browser diagnostics, and cron force-run deadlock` => `exec_tools`, `browser_automation`, `cron_automation`.
|
| 190 |
+
- Do not substitute a broad infrastructure topic like `gateway` unless it is explicitly one of the listed user-visible subjects.
|
| 191 |
+
|
| 192 |
+
Final suppression checks:
|
| 193 |
+
- If a topic was added only because of a word like “usage”, “model”, “network”, “test”, “policy”, “status”, “tool”, “plugin”, “chunk”, “cron”, “gateway”, or “security”, verify that the topic is actually the subject.
|
| 194 |
+
- Prefer the narrow central topic over broad fallback labels.
|
| 195 |
+
- Remove labels that come only from symptoms, implementation details, tests, examples, files changed, or incidental words.
|
| 196 |
+
- Keep required central second and third topics when dropping them would hide the item from a maintainer who owns that area.
|
Iteration 2: New subsample score 3.2 is better than old score 1.2. Continue to full eval and add to candidate pool.
Iteration 2: Found a better program on the valset with score 0.7361111111111112.
Iteration 2: Valset score for new program: 0.7361111111111112 (coverage 18 / 18)
Iteration 2: Val aggregate for new program: 0.7361111111111112
Iteration 2: Individual valset scores for new program: {0: 1.0, 1: 1.0, 2: 1.0, 3: 0.0, 4: 1.0, 5: 0.5, 6: 0.5, 7: 0.5, 8: 0.25, 9: 1.0, 10: 0.5, 11: 1.0, 12: 1.0, 13: 1.0, 14: 1.0, 15: 0.5, 16: 1.0, 17: 0.5}
Iteration 2: Objective aggregate scores for new program: {'weighted_score': 0.7361111111111112}
Iteration 2: New valset pareto front scores: {0: 1.0, 1: 1.0, 2: 1.0, 3: 0.25, 4: 1.0, 5: 0.5, 6: 1.0, 7: 1.0, 8: 0.5, 9: 1.0, 10: 0.5, 11: 1.0, 12: 1.0, 13: 1.0, 14: 1.0, 15: 0.5, 16: 1.0, 17: 0.5}
Iteration 2: Objective pareto front scores: {'weighted_score': 0.7361111111111112}
Iteration 2: Valset pareto front aggregate score: 0.8194444444444444
Iteration 2: Updated valset pareto front programs: {0: {2}, 1: {0, 2}, 2: {2}, 3: {0, 1}, 4: {0, 1, 2}, 5: {0, 1, 2}, 6: {1}, 7: {0}, 8: {0}, 9: {1, 2}, 10: {2}, 11: {0, 1, 2}, 12: {0, 1, 2}, 13: {2}, 14: {0, 2}, 15: {1, 2}, 16: {2}, 17: {0, 1, 2}}
Iteration 2: Updated objective pareto front programs: {'weighted_score': {2}}
Iteration 2: Best valset aggregate score so far: 0.7361111111111112
Iteration 2: Best program as per aggregate score on valset: 2
Iteration 2: Best score on valset: 0.7361111111111112
Iteration 2: Linear pareto front program index: 2
Iteration 2: New program candidate index: 2
Iteration 3: Selected program 2 score: 0.7361111111111112
Iteration 3: Proposed new text for routing_policy: Add these routing corrections to the classifier instructions:

- Treat compound titles as lists of central user-visible fixes. Classify each central item, but do not add labels for every noun.
- `skills_plugins` is label spam unless the plugin system itself is the requested feature or bug: user-installed plugins, plugin inheritance, Superpowers, skill/plugin discovery, plugin installation, or skill/plugin availability.
- In titles like `fix: Codex startup plugins + WhatsApp history & Docker Codex OAuth`, keep `codex` because Codex behavior is central, but do not add `skills_plugins` for “startup plugins” unless the plugin lifecycle is the actual subject.
- WhatsApp, Slack, chat history, chat app delivery, chat target channels, and chat push behavior route to `chat_integrations` when central.
- ACP session permission-mode work can require all three topics: `acp`, `approvals`, and `acpx`.
- Specifically, titles mentioning per-binding or per-agent `permissionMode` for ACP sessions should include `acp`, `approvals`, and `acpx`. `permissionMode` is an approval/permission contract, and ACPX owns the ACP session/binding workflow concern.
- Add `local_models` when the title centrally names local model apps or local model providers such as LM Studio.
- LM Studio issues involving Responses API behavior, thinking blocks, streaming, request/response compatibility, or visibility of model output should usually include both `model_serving` and `local_models`.
- Do not replace `local_models` with `self_hosted_inference` when the named subject is LM Studio or another local-model product/app rather than a generic inference server integration.
- `Responses API`, invisible thinking blocks, OpenAI-compatible behavior, streaming lifecycle, request/response protocol handling, and model-output protocol bugs route to `model_serving`.

Additional suppression checks:
- If `skills_plugins` was added only because the title contains “plugins” inside a broader Codex startup or OAuth fix, remove it unless plugin installation/discovery/inheritance/availability is the central user-visible bug.
- If a chat product name such as WhatsApp appears as a central listed fix, include `chat_integrations`.
- If ACP + `permissionMode` + per-binding/per-agent/session language appears, include `acpx` in addition to `acp` and `approvals`.
- If LM Studio appears as a central subject, include `local_models`.
Iteration 3: New subsample score 2.571428571428571 is better than old score 2.25. Continue to full eval and add to candidate pool.
Iteration 3: Valset score for new program: 0.5088929588929588 (coverage 18 / 18)
Iteration 3: Val aggregate for new program: 0.5088929588929588
Iteration 3: Individual valset scores for new program: {0: 0.5, 1: 0.25, 2: 0.15384615384615385, 3: 1.0, 4: 0.16666666666666666, 5: 1.0, 6: 0.2857142857142857, 7: 0.5, 8: 0.25, 9: 1.0, 10: 0.2, 11: 1.0, 12: 0.15384615384615385, 13: 0.2, 14: 1.0, 15: 0.25, 16: 0.25, 17: 1.0}
Iteration 3: Objective aggregate scores for new program: {'weighted_score': 0.5088929588929589}
Iteration 3: New valset pareto front scores: {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0, 5: 1.0, 6: 1.0, 7: 1.0, 8: 0.5, 9: 1.0, 10: 0.5, 11: 1.0, 12: 1.0, 13: 1.0, 14: 1.0, 15: 0.5, 16: 1.0, 17: 1.0}
Iteration 3: Objective pareto front scores: {'weighted_score': 0.7361111111111112}
Iteration 3: Valset pareto front aggregate score: 0.9166666666666666
Iteration 3: Updated valset pareto front programs: {0: {2}, 1: {0, 2}, 2: {2}, 3: {3}, 4: {0, 1, 2}, 5: {3}, 6: {1}, 7: {0}, 8: {0}, 9: {1, 2, 3}, 10: {2}, 11: {0, 1, 2, 3}, 12: {0, 1, 2}, 13: {2}, 14: {0, 2, 3}, 15: {1, 2}, 16: {2}, 17: {3}}
Iteration 3: Updated objective pareto front programs: {'weighted_score': {2}}
Iteration 3: Best valset aggregate score so far: 0.7361111111111112
Iteration 3: Best program as per aggregate score on valset: 2
Iteration 3: Best score on valset: 0.7361111111111112
Iteration 3: Linear pareto front program index: 2
Iteration 3: New program candidate index: 3
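
The iteration 3 numbers above can be reproduced by hand. The sketch below assumes, based only on the logged values, that the per-program aggregate is the plain mean of the 18 per-task scores and that the per-task Pareto front is the element-wise maximum over all candidates seen so far. The variable and function names are illustrative and are not taken from the optimizer.

```python
import math

# Per-task valset scores for candidate 3, copied from the iteration 3 log line.
program_3 = [0.5, 0.25, 0.15384615384615385, 1.0, 0.16666666666666666, 1.0,
             0.2857142857142857, 0.5, 0.25, 1.0, 0.2, 1.0,
             0.15384615384615385, 0.2, 1.0, 0.25, 0.25, 1.0]

# Per-task Pareto front as logged at the end of iteration 2.
front_before = [1.0, 1.0, 1.0, 0.25, 1.0, 0.5, 1.0, 1.0, 0.5,
                1.0, 0.5, 1.0, 1.0, 1.0, 1.0, 0.5, 1.0, 0.5]


def mean(values):
    """Plain arithmetic mean (assumed aggregation rule)."""
    return sum(values) / len(values)


def update_front(front, candidate):
    """Keep the best score seen so far for each validation task."""
    return [max(f, c) for f, c in zip(front, candidate)]


candidate_aggregate = mean(program_3)
front_after = update_front(front_before, program_3)
front_aggregate = mean(front_after)

# Tasks where candidate 3 strictly improved the front.
improved_tasks = [i for i, (old, new) in enumerate(zip(front_before, front_after))
                  if new > old]

print(candidate_aggregate, front_aggregate, improved_tasks)
```

This illustrates why candidate 3 raises the front aggregate while losing on the aggregate score: it is the sole best program on a few tasks but scores lower than candidate 2 on most of the others.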
{
  "best_idx": 2,
  "best_prompt_path": "prompt-optimizer/out/gepa-12b-multi-from-six-20260613T051216Z/best.prompt.md",
  "best_routing_policy_path": "prompt-optimizer/out/gepa-12b-multi-from-six-20260613T051216Z/best.routing_policy.md",
  "best_score": 0.7361111111111112,
  "config": {
    "harness": {
      "base_url": null,
      "concurrency": 2,
      "context_window": null,
      "max_tokens": 1536,
      "model": "gemma-12b-q4km-reason",
      "state_dir": null,
      "timeout_ms": 900000
    },
    "max_metric_calls": 96,
    "output_dir": "prompt-optimizer/out/gepa-12b-multi-from-six-20260613T051216Z",
    "reflection_minibatch_size": 4,
    "row_limit": 18,
    "seed": 0,
    "seed_routing_policy_chars": 3224,
    "seed_routing_policy_sha256": "f4b161bb9bbaf366f1d4f1841243d73544bbd3c553ca6be5eb2818e757007187"
  },
  "created_at": "2026-06-13T05:55:33.484027+00:00",
  "num_candidates": 4,
  "num_full_val_evals": 4,
  "result_path": "prompt-optimizer/out/gepa-12b-multi-from-six-20260613T051216Z/gepa-result.json",
  "total_metric_calls": 96
}
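
The summary reports `total_metric_calls` equal to `max_metric_calls` (96) after four full validation evaluations. One accounting that is consistent with these figures is sketched below. It assumes that every scored example counts as one metric call, and that each iteration scores the minibatch twice (old and new program) before a full validation pass. That counting rule is an assumption and is not stated in the run output.

```python
# Values taken from the summary JSON and the run log.
valset_size = 18      # "row_limit": 18, logged as "coverage 18 / 18"
minibatch_size = 4    # "reflection_minibatch_size": 4
iterations = 3        # iterations 1 to 3 each produced a new candidate

# Assumed counting rule: one metric call per scored example.
base_eval_calls = valset_size                      # iteration 0, seed program
calls_per_iteration = (
    minibatch_size      # selected program on the minibatch
    + minibatch_size    # proposed program on the same minibatch
    + valset_size       # full validation pass for the accepted proposal
)
total_calls = base_eval_calls + iterations * calls_per_iteration
full_val_evals = 1 + iterations

print(total_calls, full_val_evals)
```

Under this assumption the budget of 96 calls is used up exactly after the third iteration, which would explain why the run stops with four candidates.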
gepa-12b-multi-from-six-20260613T051216Z/run_log.json
ADDED
|
@@ -0,0 +1,131 @@
[
  {
    "i": 0,
    "selected_program_candidate": 0,
    "subsample_ids": [
      9,
      11,
      14,
      12
    ],
    "subsample_scores": [
      0.25,
      0.5,
      1.0,
      0.2857142857142857
    ],
    "new_subsample_scores": [
      1.0,
      1.0,
      1.0,
      1.0
    ],
    "new_program_idx": 1,
    "evaluated_val_indices": [
      0,
      1,
      2,
      3,
      4,
      5,
      6,
      7,
      8,
      9,
      10,
      11,
      12,
      13,
      14,
      15,
      16,
      17
    ]
  },
  {
    "i": 1,
    "selected_program_candidate": 1,
    "subsample_ids": [
      0,
      16,
      10,
      2
    ],
    "subsample_scores": [
      0.5,
      0.25,
      0.2,
      0.25
    ],
    "new_subsample_scores": [
      1.0,
      1.0,
      0.2,
      1.0
    ],
    "new_program_idx": 2,
    "evaluated_val_indices": [
      0,
      1,
      2,
      3,
      4,
      5,
      6,
      7,
      8,
      9,
      10,
      11,
      12,
      13,
      14,
      15,
      16,
      17
    ]
  },
  {
    "i": 2,
    "selected_program_candidate": 2,
    "subsample_ids": [
      3,
      5,
      17,
      4
    ],
    "subsample_scores": [
      0.25,
      0.5,
      0.5,
      1.0
    ],
    "new_subsample_scores": [
      1.0,
      0.2857142857142857,
      1.0,
      0.2857142857142857
    ],
    "new_program_idx": 3,
    "evaluated_val_indices": [
      0,
      1,
      2,
      3,
      4,
      5,
      6,
      7,
      8,
      9,
      10,
      11,
      12,
      13,
      14,
      15,
      16,
      17
    ]
  }
]
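
The `subsample_scores` and `new_subsample_scores` arrays in `run_log.json` line up with the "New subsample score X is better than old score Y" messages in the text log. A minimal sketch of that comparison follows, assuming the rule is a strict comparison of the summed minibatch scores. The log wording suggests this but does not confirm it, and the helper names are hypothetical.

```python
import math

# Only the fields needed for the comparison, copied from run_log.json.
run_log = [
    {"i": 0, "subsample_scores": [0.25, 0.5, 1.0, 0.2857142857142857],
     "new_subsample_scores": [1.0, 1.0, 1.0, 1.0]},
    {"i": 1, "subsample_scores": [0.5, 0.25, 0.2, 0.25],
     "new_subsample_scores": [1.0, 1.0, 0.2, 1.0]},
    {"i": 2, "subsample_scores": [0.25, 0.5, 0.5, 1.0],
     "new_subsample_scores": [1.0, 0.2857142857142857, 1.0, 0.2857142857142857]},
]


def subsample_totals(entry):
    """Return (old_total, new_total) for one log entry."""
    return sum(entry["subsample_scores"]), sum(entry["new_subsample_scores"])


def is_accepted(entry):
    """Assumed rule: the proposal passes if its minibatch total is strictly higher."""
    old_total, new_total = subsample_totals(entry)
    return new_total > old_total


totals = [subsample_totals(entry) for entry in run_log]
decisions = [is_accepted(entry) for entry in run_log]

for entry, (old_total, new_total), ok in zip(run_log, totals, decisions):
    print(f"i={entry['i']}: old={old_total:.4f} new={new_total:.4f} accepted={ok}")
```

Note that the third proposal passes this minibatch gate yet scores lower than its parent on the full validation set, so a minibatch improvement on four examples is weak evidence on its own.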
gepa-12b-multi-from-six-20260613T051216Z/run_log.txt
ADDED
|
@@ -0,0 +1,246 @@
Iteration 0: Base program full valset score: 0.4972222222222222 over 18 / 18 examples
Iteration 1: Selected program 0 score: 0.4972222222222222
Iteration 1: Proposed new text for routing_policy: You are classifying GitHub issues or pull requests into the smallest complete set of allowed topic ids.

Input format:
- You may receive a GitHub target URL, a title, and sometimes a body or summary.
- The title is the primary signal.
- Use the first clear body summary only when the title is ambiguous.
- Ignore examples, tests, files changed, implementation details, incidental keywords, and broad impact unless they are the actual user-visible subject.
- Return only final JSON using exact allowed topic ids, for example:
{"topics_of_interest":["queueing","docs"]}

Task:
Choose the minimum topic set that routes the item to the right maintainer bucket without dropping an explicitly central second concern.

General process:
1. Read the title first.
2. Identify the main user-visible problem, feature, documentation change, or policy change.
3. Pick one primary topic.
4. Add a secondary topic only when it is explicitly central and removing it would route the item away from a maintainer who must see it.
5. Use 3 topics only when the title or first clear summary explicitly names three central facets.
6. Use 0 topics when no allowed topic is central.
7. Never invent topic ids.
8. Output only JSON.

High-signal title patterns:
- A Conventional Commit type like `docs(...)`, `feat(...)`, `fix(...)`, or `policy(...)` can indicate the kind of change.
- A scope inside parentheses is often central. For example, `docs(queue): ...` usually includes both `docs` and `queueing`.
- Do not blindly label every word in the title. Confirm the word names the subject, not just context.

Domain rules and corrections:
- Documentation-only PRs should usually include `docs` plus the central documented area.
- Example: `docs(queue): clarify steer behavior with partial streaming and tool boundaries` => `docs`, `queueing`.
- Do not add `tool_calling` just because the title says “tool boundaries” unless tool calling behavior itself is the central feature or bug.

- Queue, queueing, queued execution, steer behavior in queues, or queue lifecycle route to `queueing` when central.

- `tool_calling` is only for tool-call execution, tool-call APIs, tool selection, tool schema handling, or tool-call runtime behavior.
- Mentions of “tool boundaries” in docs about another system are usually context, not `tool_calling`.

- ACPX-related sandbox or workflow issues route to `acpx` when ACPX is named centrally.
- Codex-related behavior routes to `codex` when Codex is named centrally.
- User-installed plugins, plugin inheritance, Superpowers, skills, plugin discovery, plugin installation, or skill/plugin availability route to `skills_plugins`.
- Example: `[Feature]: ACPX Codex sandbox should inherit user-installed plugins (e.g. Superpowers)` => `acpx`, `codex`, `skills_plugins`.
- Do not drop `skills_plugins` when plugins are the requested feature.

- Memory or embeddings provider work routes to `memory` when the provider exists for memory/embeddings.
- Self-hosted inference servers such as llama.cpp, Ollama, vLLM, TGI, and LocalAI route to `self_hosted_inference` when the item is about using those servers as inference providers.
- Example: `feat(memory/embeddings): add openai-compatible provider for self-hosted servers (llama.cpp, Ollama, vLLM, TGI, LocalAI)` => `memory`, `self_hosted_inference`.
- Do not add `model_serving` merely because the title says “openai-compatible”, “provider”, llama.cpp, Ollama, vLLM, TGI, or LocalAI.

- Use `model_serving` only when the central subject is serving endpoints, OpenAI-compatible request/response protocol behavior, streaming lifecycle, final usage chunks, base URL behavior, endpoint compatibility, request routing, or model-server compatibility.
- OpenAI-compatible streaming, final usage chunks, stream lifecycle, endpoint compatibility, base URL behavior, vLLM/TGI/LocalAI/llama.cpp serving behavior, and request routing are `model_serving`.
- Do not add `telemetry_usage` merely because the title mentions usage, tokens, counts, cost, or chunks when those are symptoms of a model-serving protocol bug.
- Example: “OpenAI-compatible streaming with llama.cpp saves zero usage (stream closed before final usage chunk)” => `model_serving` only.

- Use `telemetry_usage` only when metric collection, usage accounting/reporting, cost display, diagnostic counts, traces, or status reporting surfaces are themselves the feature or bug.

Policy/config rules:
- Items about policy rules, conformance checks, quality gates, allowed behavior, or configuration-governed enforcement usually include `config` when the policy/checking behavior is central.
- Do not map “model” in “model policy”, “model conformance”, or “model checks” to `model_serving` unless the item is actually about serving endpoints, streaming, endpoint lifecycle, routing, or model-server compatibility.
- Network policy, network conformance, access restrictions, outbound rules, or boundary checks can be `security` when they concern allowed/blocked network behavior.
- MCP conformance, MCP policy, MCP tool behavior, or MCP protocol checks route to `mcp_tooling`.
- Example: “Policy: add model, network, and MCP conformance checks” => `mcp_tooling`, `config`, `security`, not `model_serving`.

Final suppression checks:
- If a topic was added only because of a word like “usage”, “model”, “network”, “test”, “policy”, “status”, “tool”, “plugin”, or “chunk”, verify that the topic is actually the subject.
- Prefer the narrow central topic over broad fallback labels.
- Remove labels that come only from symptoms, implementation details, tests, examples, files changed, or incidental words.
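
The output contract in the proposed policy (JSON only, exact allowed topic ids, zero to three topics) lends itself to a mechanical check on the classifier's reply. The following is a minimal sketch of such a check. `ALLOWED_TOPICS` is a hypothetical subset assembled from the ids named in the policy text, not the project's real list, and treating three as a hard cap is an assumption drawn from the "General process" steps.

```python
import json

# Hypothetical subset of topic ids, collected from the policy text above.
ALLOWED_TOPICS = {
    "queueing", "docs", "tool_calling", "acpx", "codex", "skills_plugins",
    "memory", "self_hosted_inference", "model_serving", "telemetry_usage",
    "config", "security", "mcp_tooling",
}
MAX_TOPICS = 3  # assumed cap, based on "Use 3 topics only when ..."


def validate_output(raw: str) -> list:
    """Parse a classifier reply and return its topic list, or raise ValueError."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or set(data) != {"topics_of_interest"}:
        raise ValueError("reply must be an object with only 'topics_of_interest'")
    topics = data["topics_of_interest"]
    if not isinstance(topics, list) or not all(isinstance(t, str) for t in topics):
        raise ValueError("'topics_of_interest' must be a list of strings")
    unknown = [t for t in topics if t not in ALLOWED_TOPICS]
    if unknown:
        raise ValueError(f"invented topic ids: {unknown}")
    if len(set(topics)) != len(topics):
        raise ValueError("duplicate topic ids")
    if len(topics) > MAX_TOPICS:
        raise ValueError(f"more than {MAX_TOPICS} topics")
    return topics


print(validate_output('{"topics_of_interest":["queueing","docs"]}'))
```

A check like this only enforces the format rules. Whether a label is central or incidental still depends on the model following the suppression checks.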
Iteration 1: New subsample score 4.0 is better than old score 2.0357142857142856. Continue to full eval and add to candidate pool.
Iteration 1: Found a better program on the valset with score 0.5380952380952381.
Iteration 1: Valset score for new program: 0.5380952380952381 (coverage 18 / 18)
Iteration 1: Val aggregate for new program: 0.5380952380952381
Iteration 1: Individual valset scores for new program: {0: 0.5, 1: 0.0, 2: 0.5, 3: 0.25, 4: 1.0, 5: 0.5, 6: 1.0, 7: 0.5, 8: 0.25, 9: 1.0, 10: 0.2, 11: 1.0, 12: 1.0, 13: 0.5, 14: 0.2857142857142857, 15: 0.5, 16: 0.2, 17: 0.5}
Iteration 1: Objective aggregate scores for new program: {'weighted_score': 0.5380952380952381}
Iteration 1: New valset pareto front scores: {0: 0.5, 1: 1.0, 2: 0.5, 3: 0.25, 4: 1.0, 5: 0.5, 6: 1.0, 7: 1.0, 8: 0.5, 9: 1.0, 10: 0.2, 11: 1.0, 12: 1.0, 13: 0.5, 14: 1.0, 15: 0.5, 16: 0.25, 17: 0.5}
Iteration 1: Objective pareto front scores: {'weighted_score': 0.5380952380952381}
Iteration 1: Valset pareto front aggregate score: 0.6777777777777777
Iteration 1: Updated valset pareto front programs: {0: {1}, 1: {0}, 2: {1}, 3: {0, 1}, 4: {0, 1}, 5: {0, 1}, 6: {1}, 7: {0}, 8: {0}, 9: {1}, 10: {0, 1}, 11: {0, 1}, 12: {0, 1}, 13: {1}, 14: {0}, 15: {1}, 16: {0}, 17: {0, 1}}
Iteration 1: Updated objective pareto front programs: {'weighted_score': {1}}
Iteration 1: Best valset aggregate score so far: 0.5380952380952381
Iteration 1: Best program as per aggregate score on valset: 1
Iteration 1: Best score on valset: 0.5380952380952381
Iteration 1: Linear pareto front program index: 1
Iteration 1: New program candidate index: 1
| 86 |
+
Iteration 2: Selected program 1 score: 0.5380952380952381
|
| 87 |
+
Iteration 2: Proposed new text for routing_policy: You are classifying GitHub issues or pull requests into the smallest complete set of allowed topic ids.
|
| 88 |
+
|
| 89 |
+
Input format:
|
| 90 |
+
- You may receive a GitHub target URL, title, and sometimes a body or summary.
|
| 91 |
+
- The title is the primary signal.
|
| 92 |
+
- Use the first clear body summary only when the title is ambiguous.
|
| 93 |
+
- Ignore examples, tests, files changed, implementation details, incidental keywords, and broad impact unless they are the actual user-visible subject.
|
| 94 |
+
- Return only final JSON using exact allowed topic ids, for example:
|
| 95 |
+
{"topics_of_interest":["queueing","docs"]}
|
| 96 |
+
|
| 97 |
+
Task:
|
| 98 |
+
Choose the minimum topic set that routes the item to the right maintainer bucket without dropping an explicitly central second or third concern.
|
| 99 |
+
|
| 100 |
+
General process:
|
| 101 |
+
1. Read the title first.
|
| 102 |
+
2. Identify the main user-visible problem, feature, documentation change, policy change, or contract being changed.
|
| 103 |
+
3. Pick one primary topic.
|
| 104 |
+
4. Add a secondary topic only when it is explicitly central and removing it would route the item away from a maintainer who must see it.
|
| 105 |
+
5. Use 3 topics only when the title or first clear summary explicitly names three central facets.
|
| 106 |
+
6. Use 0 topics when no allowed topic is central.
|
| 107 |
+
7. Never invent topic ids.
|
| 108 |
+
8. Output only JSON.
|
| 109 |
+
|
| 110 |
+
High-signal title patterns:
|
| 111 |
+
- A Conventional Commit type like `docs(...)`, `feat(...)`, `fix(...)`, `test(...)`, or `policy(...)` can indicate the kind of change.
|
| 112 |
+
- A scope inside parentheses is often central. For example, `docs(queue): ...` usually includes both `docs` and `queueing`.
|
| 113 |
+
- Do not ignore `test(...)` scopes when the title is about landing or enforcing a behavior contract. The tested contract can be the central subject.
|
| 114 |
+
- Do not blindly label every word in the title. Confirm the word names the subject, not just a path, symptom, or context.
|
| 115 |
+
|
| 116 |
+
Domain rules and corrections:
|
| 117 |
+
|
| 118 |
+
Documentation:
|
| 119 |
+
- Documentation-only PRs should usually include `docs` plus the central documented area.
|
| 120 |
+
- Example: `docs(queue): clarify steer behavior with partial streaming and tool boundaries` => `docs`, `queueing`.
|
| 121 |
+
- Do not add `tool_calling` just because the title says “tool boundaries” unless tool-call behavior itself is the central feature or bug.
|
| 122 |
+
|
| 123 |
+
Queueing:
|
| 124 |
+
- Queue, queueing, queued execution, steer behavior in queues, or queue lifecycle route to `queueing` when central.
|
| 125 |
+
|
| 126 |
+
Tool calling:
|
| 127 |
+
- `tool_calling` is only for tool-call execution, tool-call APIs, tool selection, tool schema handling, or tool-call runtime behavior.
|
| 128 |
+
- Mentions of “tool boundaries” in docs about another system are usually context, not `tool_calling`.
|
| 129 |
+
|
| 130 |
+
ACP, gateway, and runtime:
|
| 131 |
+
- ACP-related work routes to `acp` when ACP is named centrally.
|
| 132 |
+
- ACPX-related sandbox or workflow issues route to `acpx` when ACPX is named centrally.
|
| 133 |
+
- Gateway-owned behavior routes to `gateway` only when gateway is explicitly the owner or subject.
|
| 134 |
+
- Runtime work routes to `agent_runtime` when the title is about runtimes, node-backed runtimes, agent execution runtimes, or runtime ownership.
|
| 135 |
+
- Example: `ACP: add gateway-owned node-backed runtime` => `acp`, `gateway`, `agent_runtime`.
|
| 136 |
+
|
| 137 |
+
Codex and plugins:
|
| 138 |
+
- Codex-related behavior routes to `codex` when Codex is named centrally.
|
| 139 |
+
- User-installed plugins, plugin inheritance, Superpowers, skills, plugin discovery, plugin installation, or skill/plugin availability route to `skills_plugins`.
|
| 140 |
+
- Example: `[Feature]: ACPX Codex sandbox should inherit user-installed plugins (e.g. Superpowers)` => `acpx`, `codex`, `skills_plugins`.
|
| 141 |
+
- Do not drop `skills_plugins` when plugins are the requested feature.
|
| 142 |
+
|
| 143 |
+
Notifications and chat integrations:
|
| 144 |
+
- Slack, chat app delivery, chat target channels, and chat push behavior route to `chat_integrations`.
|
| 145 |
+
- Announce messages, heartbeat pushes, target-channel pushes, identity overlays on pushed messages, and notification delivery route to `notifications`.
|
| 146 |
+
- Do not add `cron_automation` merely because the notification path mentions `cron --announce`; cron is context unless scheduling, force-run behavior, cron lifecycle, or cron execution is itself broken.
|
| 147 |
+
- Example: `Per-agent identity overlay dropped on cron --announce and heartbeat target-channel Slack pushes` => `notifications`, `chat_integrations`.
|
| 148 |
+
|
| 149 |
+
Cron:
|
| 150 |
+
- Use `cron_automation` when cron scheduling, cron force-run, cron lifecycle, cron execution, or a cron deadlock is central.
|
| 151 |
+
- Example: `cron force-run deadlock` => `cron_automation`.
|
| 152 |
+
|
| 153 |
+
Exec, sandboxing, and approvals:
|
| 154 |
+
- Exec command/tool behavior routes to `exec_tools`.
|
| 155 |
+
- Exec PATH fallback is `exec_tools`.
|
| 156 |
+
- Exec v2 contract follow-through or contract enforcement can centrally include `exec_tools`, `sandboxing`, and `approvals` when the contract covers sandbox and approval behavior.
|
| 157 |
+
- Example: `test(exec): land exec v2 contract follow-through` => `exec_tools`, `sandboxing`, `approvals`.
|
| 158 |
+
- Do not replace sandboxing or approvals with `security` unless the title is actually about a security policy, vulnerability, network restriction, credential boundary, or allowed/blocked security behavior.
|
| 159 |
+
|
| 160 |
+
Browser automation:
|
| 161 |
+
- Browser diagnostics, browser automation layers, browser runtime behavior, and browser tooling issues route to `browser_automation`.
|
| 162 |
+
- Example: `layered browser diagnostics` => `browser_automation`.
|
| 163 |
+
- Do not add `gateway` for browser diagnostics unless the gateway itself is explicitly the subject.

Memory and inference:
- Memory or embeddings provider work routes to `memory` when the provider exists for memory/embeddings.
- Self-hosted inference servers such as llama.cpp, Ollama, vLLM, TGI, and LocalAI route to `self_hosted_inference` when the item is about using those servers as inference providers.
- Example: `feat(memory/embeddings): add openai-compatible provider for self-hosted servers (llama.cpp, Ollama, vLLM, TGI, LocalAI)` => `memory`, `self_hosted_inference`.
- Do not add `model_serving` merely because the title says “openai-compatible”, “provider”, llama.cpp, Ollama, vLLM, TGI, or LocalAI.

Model serving:
- Use `model_serving` only when the central subject is serving endpoints, OpenAI-compatible request/response protocol behavior, streaming lifecycle, final usage chunks, base URL behavior, endpoint compatibility, request routing, or model-server compatibility.
- OpenAI-compatible streaming, final usage chunks, stream lifecycle, endpoint compatibility, base URL behavior, vLLM/TGI/LocalAI/llama.cpp serving behavior, and request routing are `model_serving`.
- Do not add `telemetry_usage` merely because the title mentions usage, tokens, counts, cost, or chunks when those are symptoms of a model-serving protocol bug.
- Example: `OpenAI-compatible streaming with llama.cpp saves zero usage (stream closed before final usage chunk)` => `model_serving`.

Telemetry and usage:
- Use `telemetry_usage` only when metric collection, usage accounting/reporting, cost display, diagnostic counts, traces, or status reporting surfaces are themselves the feature or bug.

Policy/config:
- Items about policy rules, conformance checks, quality gates, allowed behavior, or configuration-governed enforcement usually include `config` when the policy/checking behavior is central.
- Do not map “model” in “model policy”, “model conformance”, or “model checks” to `model_serving` unless the item is actually about serving endpoints, streaming, endpoint lifecycle, routing, or model-server compatibility.
- Network policy, network conformance, access restrictions, outbound rules, or boundary checks can be `security` when they concern allowed/blocked network behavior.
- MCP conformance, MCP policy, MCP tool behavior, or MCP protocol checks route to `mcp_tooling`.
- Example: `Policy: add model, network, and MCP conformance checks` => `mcp_tooling`, `config`, `security`, not `model_serving`.

Composite fixes:
- If a title lists several independent fixes, classify each central fix up to the smallest complete set.
- Example: `fix: resolve exec PATH fallback, layered browser diagnostics, and cron force-run deadlock` => `exec_tools`, `browser_automation`, `cron_automation`.
- Do not substitute a broad infrastructure topic like `gateway` unless it is explicitly one of the listed user-visible subjects.
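
The composite-fix rule above can be illustrated with a toy phrase matcher. This is only a sketch: the classifier in this run is a prompted model, not a lookup table, and `PHRASE_TOPICS` and `route_composite` are hypothetical names introduced here. The phrases and topics are copied from the policy's own examples.

```python
# Hypothetical phrase -> topic table; phrases come from the policy's examples.
PHRASE_TOPICS = [
    ("exec path fallback", "exec_tools"),
    ("browser diagnostics", "browser_automation"),
    ("cron force-run", "cron_automation"),
]


def route_composite(title: str) -> list[str]:
    """Return one topic per listed fix, in table order, without duplicates."""
    lowered = title.lower()
    topics: list[str] = []
    for phrase, topic in PHRASE_TOPICS:
        if phrase in lowered and topic not in topics:
            topics.append(topic)
    return topics


title = (
    "fix: resolve exec PATH fallback, layered browser diagnostics, "
    "and cron force-run deadlock"
)
print(route_composite(title))
```

Because the table holds no entry for `gateway`, the sketch cannot fall back to a broad infrastructure label, which mirrors the last bullet of the rule.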

Final suppression checks:
- If a topic was added only because of a word like “usage”, “model”, “network”, “test”, “policy”, “status”, “tool”, “plugin”, “chunk”, “cron”, “gateway”, or “security”, verify that the topic is actually the subject.
- Prefer the narrow central topic over broad fallback labels.
- Remove labels that come only from symptoms, implementation details, tests, examples, files changed, or incidental words.
- Keep required central second and third topics when dropping them would hide the item from a maintainer who owns that area.
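
One way to picture the first suppression check is a filter that drops any label whose only evidence is a bare trigger word. The sketch below is a simplification under stated assumptions: the evidence strings are supplied by hand, `suppress` and the `candidates` shape are hypothetical, and the policy asks for a judgment about the subject rather than a string test.

```python
# Trigger words listed in the first suppression check.
TRIGGER_WORDS = {
    "usage", "model", "network", "test", "policy", "status",
    "tool", "plugin", "chunk", "cron", "gateway", "security",
}


def suppress(candidates: dict[str, set[str]]) -> list[str]:
    """Keep a topic only if some evidence is more than a bare trigger word.

    `candidates` maps a topic to the title fragments that suggested it
    (a hypothetical structure used for illustration only).
    """
    kept = []
    for topic, evidence in candidates.items():
        if any(fragment.lower() not in TRIGGER_WORDS for fragment in evidence):
            kept.append(topic)
    return kept


# Evidence hand-assigned for the llama.cpp streaming example in the policy.
candidates = {
    "model_serving": {"OpenAI-compatible streaming", "final usage chunk"},
    "telemetry_usage": {"usage"},
}
print(suppress(candidates))
```

Here `telemetry_usage` is removed because the word “usage” is its only support, while `model_serving` survives on the streaming-protocol evidence.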
Iteration 2: New subsample score 3.2 is better than old score 1.2. Continue to full eval and add to candidate pool.
Iteration 2: Found a better program on the valset with score 0.7361111111111112.
Iteration 2: Valset score for new program: 0.7361111111111112 (coverage 18 / 18)
Iteration 2: Val aggregate for new program: 0.7361111111111112
Iteration 2: Individual valset scores for new program: {0: 1.0, 1: 1.0, 2: 1.0, 3: 0.0, 4: 1.0, 5: 0.5, 6: 0.5, 7: 0.5, 8: 0.25, 9: 1.0, 10: 0.5, 11: 1.0, 12: 1.0, 13: 1.0, 14: 1.0, 15: 0.5, 16: 1.0, 17: 0.5}
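
The logged aggregate is consistent with a plain mean over the 18 per-task scores. The check below is a sketch that assumes unweighted averaging (the log does not state the formula); the variable names are introduced here for illustration, and the numbers are copied from the log line above.

```python
from statistics import fmean

# Per-task valset scores for the Iteration 2 program, copied from the log.
iteration_2_scores = {
    0: 1.0, 1: 1.0, 2: 1.0, 3: 0.0, 4: 1.0, 5: 0.5,
    6: 0.5, 7: 0.5, 8: 0.25, 9: 1.0, 10: 0.5, 11: 1.0,
    12: 1.0, 13: 1.0, 14: 1.0, 15: 0.5, 16: 1.0, 17: 0.5,
}

# Assumption: the aggregate is the unweighted mean of the per-task scores.
aggregate = fmean(list(iteration_2_scores.values()))
print(len(iteration_2_scores), aggregate)
```

The scores sum to 13.25 over 18 tasks, which agrees with the reported 0.7361111111111112 up to floating-point rounding.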
Iteration 2: Objective aggregate scores for new program: {'weighted_score': 0.7361111111111112}
Iteration 2: New valset pareto front scores: {0: 1.0, 1: 1.0, 2: 1.0, 3: 0.25, 4: 1.0, 5: 0.5, 6: 1.0, 7: 1.0, 8: 0.5, 9: 1.0, 10: 0.5, 11: 1.0, 12: 1.0, 13: 1.0, 14: 1.0, 15: 0.5, 16: 1.0, 17: 0.5}
Iteration 2: Objective pareto front scores: {'weighted_score': 0.7361111111111112}
Iteration 2: Valset pareto front aggregate score: 0.8194444444444444
Iteration 2: Updated valset pareto front programs: {0: {2}, 1: {0, 2}, 2: {2}, 3: {0, 1}, 4: {0, 1, 2}, 5: {0, 1, 2}, 6: {1}, 7: {0}, 8: {0}, 9: {1, 2}, 10: {2}, 11: {0, 1, 2}, 12: {0, 1, 2}, 13: {2}, 14: {0, 2}, 15: {1, 2}, 16: {2}, 17: {0, 1, 2}}
Iteration 2: Updated objective pareto front programs: {'weighted_score': {2}}
Iteration 2: Best valset aggregate score so far: 0.7361111111111112
Iteration 2: Best program as per aggregate score on valset: 2
Iteration 2: Best score on valset: 0.7361111111111112
Iteration 2: Linear pareto front program index: 2
Iteration 2: New program candidate index: 2
Iteration 3: Selected program 2 score: 0.7361111111111112
Iteration 3: Proposed new text for routing_policy: Add these routing corrections to the classifier instructions:

- Treat compound titles as lists of central user-visible fixes. Classify each central item, but do not add labels for every noun.
- `skills_plugins` is label spam unless the plugin system itself is the requested feature or bug: user-installed plugins, plugin inheritance, Superpowers, skill/plugin discovery, plugin installation, or skill/plugin availability.
- In titles like `fix: Codex startup plugins + WhatsApp history & Docker Codex OAuth`, keep `codex` because Codex behavior is central, but do not add `skills_plugins` for “startup plugins” unless the plugin lifecycle is the actual subject.
- WhatsApp, Slack, chat history, chat app delivery, chat target channels, and chat push behavior route to `chat_integrations` when central.
- ACP session permission-mode work can require all three topics: `acp`, `approvals`, and `acpx`.
- Specifically, titles mentioning per-binding or per-agent `permissionMode` for ACP sessions should include `acp`, `approvals`, and `acpx`. `permissionMode` is an approval/permission contract, and ACPX owns the ACP session/binding workflow concern.
- Add `local_models` when the title centrally names local model apps or local model providers such as LM Studio.
- LM Studio issues involving Responses API behavior, thinking blocks, streaming, request/response compatibility, or visibility of model output should usually include both `model_serving` and `local_models`.
- Do not replace `local_models` with `self_hosted_inference` when the named subject is LM Studio or another local-model product/app rather than a generic inference server integration.
- `Responses API`, invisible thinking blocks, OpenAI-compatible behavior, streaming lifecycle, request/response protocol handling, and model-output protocol bugs route to `model_serving`.

Additional suppression checks:
- If `skills_plugins` was added only because the title contains “plugins” inside a broader Codex startup or OAuth fix, remove it unless plugin installation/discovery/inheritance/availability is the central user-visible bug.
- If a chat product name such as WhatsApp appears as a central listed fix, include `chat_integrations`.
- If ACP + `permissionMode` + per-binding/per-agent/session language appears, include `acpx` in addition to `acp` and `approvals`.
- If LM Studio appears as a central subject, include `local_models`.
Iteration 3: New subsample score 2.571428571428571 is better than old score 2.25. Continue to full eval and add to candidate pool.
Iteration 3: Valset score for new program: 0.5088929588929588 (coverage 18 / 18)
Iteration 3: Val aggregate for new program: 0.5088929588929588
Iteration 3: Individual valset scores for new program: {0: 0.5, 1: 0.25, 2: 0.15384615384615385, 3: 1.0, 4: 0.16666666666666666, 5: 1.0, 6: 0.2857142857142857, 7: 0.5, 8: 0.25, 9: 1.0, 10: 0.2, 11: 1.0, 12: 0.15384615384615385, 13: 0.2, 14: 1.0, 15: 0.25, 16: 0.25, 17: 1.0}
Iteration 3: Objective aggregate scores for new program: {'weighted_score': 0.5088929588929589}
Iteration 3: New valset pareto front scores: {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0, 5: 1.0, 6: 1.0, 7: 1.0, 8: 0.5, 9: 1.0, 10: 0.5, 11: 1.0, 12: 1.0, 13: 1.0, 14: 1.0, 15: 0.5, 16: 1.0, 17: 1.0}
Iteration 3: Objective pareto front scores: {'weighted_score': 0.7361111111111112}
Iteration 3: Valset pareto front aggregate score: 0.9166666666666666
Iteration 3: Updated valset pareto front programs: {0: {2}, 1: {0, 2}, 2: {2}, 3: {3}, 4: {0, 1, 2}, 5: {3}, 6: {1}, 7: {0}, 8: {0}, 9: {1, 2, 3}, 10: {2}, 11: {0, 1, 2, 3}, 12: {0, 1, 2}, 13: {2}, 14: {0, 2, 3}, 15: {1, 2}, 16: {2}, 17: {3}}
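
The Iteration 3 lines show a candidate with a lower aggregate (about 0.509 against 0.736) still entering the per-task Pareto front. The sketch below reproduces that bookkeeping from the logged numbers under one assumption: a new program replaces the front members of a task when it scores strictly higher and joins them on a tie. `update_front` is a hypothetical helper, not GEPA's own code.

```python
def update_front(front_scores, front_programs, new_scores, new_index):
    """Return updated per-task best scores and the programs achieving them."""
    scores = dict(front_scores)
    programs = {task: set(members) for task, members in front_programs.items()}
    for task, score in new_scores.items():
        if score > scores[task]:        # strictly better: replace the members
            scores[task] = score
            programs[task] = {new_index}
        elif score == scores[task]:     # tie: join the existing members
            programs[task].add(new_index)
    return scores, programs


# Front after Iteration 2, copied from the log.
front_scores = {0: 1.0, 1: 1.0, 2: 1.0, 3: 0.25, 4: 1.0, 5: 0.5, 6: 1.0, 7: 1.0,
                8: 0.5, 9: 1.0, 10: 0.5, 11: 1.0, 12: 1.0, 13: 1.0, 14: 1.0,
                15: 0.5, 16: 1.0, 17: 0.5}
front_programs = {0: {2}, 1: {0, 2}, 2: {2}, 3: {0, 1}, 4: {0, 1, 2}, 5: {0, 1, 2},
                  6: {1}, 7: {0}, 8: {0}, 9: {1, 2}, 10: {2}, 11: {0, 1, 2},
                  12: {0, 1, 2}, 13: {2}, 14: {0, 2}, 15: {1, 2}, 16: {2},
                  17: {0, 1, 2}}
# Per-task valset scores of candidate 3, copied from the log.
program_3 = {0: 0.5, 1: 0.25, 2: 0.15384615384615385, 3: 1.0,
             4: 0.16666666666666666, 5: 1.0, 6: 0.2857142857142857, 7: 0.5,
             8: 0.25, 9: 1.0, 10: 0.2, 11: 1.0, 12: 0.15384615384615385,
             13: 0.2, 14: 1.0, 15: 0.25, 16: 0.25, 17: 1.0}

new_scores, new_programs = update_front(front_scores, front_programs, program_3, 3)
front_aggregate = sum(new_scores.values()) / len(new_scores)
print(sorted(task for task, members in new_programs.items() if 3 in members))
print(front_aggregate)
```

Under this rule candidate 3 takes over tasks 3, 5, and 17, ties on tasks 9, 11, and 14, and the front aggregate rises to 16.5 / 18, in agreement with the logged front programs and the logged 0.9166666666666666.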
Iteration 3: Updated objective pareto front programs: {'weighted_score': {2}}
Iteration 3: Best valset aggregate score so far: 0.7361111111111112
Iteration 3: Best program as per aggregate score on valset: 2
Iteration 3: Best score on valset: 0.7361111111111112
Iteration 3: Linear pareto front program index: 2
Iteration 3: New program candidate index: 3
gepa-12b-multi-from-six-20260613T051216Z/run_log_stderr.txt
ADDED
|
File without changes
|