Promoted Result
| Metric | Value | Note |
|---|---|---|
| Final mean score | 0.6233 | delta 0.1911 vs v9.1 |
| Micro-F1 | 0.7985 | target >= 0.7000 |
| Precision / recall | 0.9375 / 0.6954 | FP weighted harder than FN |
| False positives | 7 | target <= 20 |
| False negatives | 46 | target <= 58 |
| Over-label events | 0 | target 0 or 1 |
| Structural failures | 0 | target 0 |
| Mean predicted labels | 1.8667 | same as v9.1 baseline |
The promoted v3 candidate clears every strict gate and keeps mean predicted labels at the v9.1 baseline. The main caveat is scientific, not mechanical: the manual repair was built from mistakes on this same 60-row set, so the scores are in-sample and a fresh holdout is still needed before treating them as deployment evidence.
60-Row Score Trajectory
Candidate Metrics
| Candidate | Mean | Delta | Precision | Recall | F1 | FP | FN | Over-label | Struct. failures | Exact matches | Mean labels | Label delta | Source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| v9.1 seed (baseline) | 0.4322 | 0.0000 | 0.7411 | 0.5804 | 0.6510 | 29 | 60 | 2 | 3 | 10 | 1.8667 | 0.0000 | source |
| GEPA six (early GEPA) | 0.4912 | 0.0590 | 0.7965 | 0.5960 | 0.6818 | 23 | 61 | 1 | 0 | 12 | n/a | n/a | source |
| Previous proper best (proper GEPA) | 0.5355 | 0.1033 | 0.7778 | 0.6490 | 0.7076 | 28 | 53 | 5 | 2 | 21 | 2.1000 | 0.2333 | source |
| Prop20 best (proper GEPA) | 0.5936 | 0.1613 | 0.8707 | 0.6689 | 0.7566 | 15 | 50 | 2 | 4 | 24 | 1.9333 | 0.0667 | source |
| Hardcase repair v2 (manual repair, rejected) | 0.7729 | 0.3407 | 0.9343 | 0.8477 | 0.8889 | 9 | 23 | 4 | 0 | 38 | 2.2833 | 0.4167 | source |
| Cardinality repair v3 (promoted) | 0.6233 | 0.1911 | 0.9375 | 0.6954 | 0.7985 | 7 | 46 | 0 | 0 | 19 | 1.8667 | 0.0000 | source |
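The precision, recall, and F1 columns can be cross-checked from pooled label counts. The sketch below does this for the promoted v3 row, assuming the standard micro-averaged definitions. FP = 7 and FN = 46 come from the table; the true-positive count of 105 is not stated in the report and is inferred here from precision 0.9375 with 7 false positives. The function name `micro_prf` is illustrative.

```python
def micro_prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 from pooled label counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return precision, recall, f1


# FP and FN are taken from the table; TP = 105 is an inferred assumption.
precision, recall, f1 = micro_prf(tp=105, fp=7, fn=46)
print(f"{precision:.4f} {recall:.4f} {f1:.4f}")  # 0.9375 0.6954 0.7985

# Predicted labels are TP + FP, spread over the 60 evaluation rows.
mean_predicted_labels = (105 + 7) / 60
print(f"{mean_predicted_labels:.4f}")  # 1.8667
```

Under that assumed count, all four reported v3 figures are mutually consistent.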
Strict Gate Check
| Gate | Target | Observed | Result |
|---|---|---|---|
| mean weighted score | >= 0.5400 | 0.6233 | pass |
| score delta vs v9.1 | >= +0.1000 | 0.1911 | pass |
| micro-F1 | >= 0.7000 | 0.7985 | pass |
| precision | >= 0.8000 | 0.9375 | pass |
| recall | >= 0.6100 | 0.6954 | pass |
| false positives | <= 20 | 7 | pass |
| false negatives | <= 58 | 46 | pass |
| over-label events | 0 or 1 | 0 | pass |
| structural failures | 0 | 0 | pass |
| exact matches | >= 15 | 19 | pass |
| mean predicted-label delta | <= +0.10 | 0.0000 | pass |
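The gate table amounts to a list of threshold comparisons, which can be sketched as a small check. This is a hypothetical helper, not the evaluation harness behind the report; the names `GATES` and `check_gates` are assumptions, and the "0 or 1" over-label gate is encoded as `<= 1` because the count cannot be negative.

```python
import operator

# (gate, comparison, target, observed) copied from the Strict Gate Check table.
GATES = [
    ("mean weighted score", operator.ge, 0.5400, 0.6233),
    ("score delta vs v9.1", operator.ge, 0.1000, 0.1911),
    ("micro-F1", operator.ge, 0.7000, 0.7985),
    ("precision", operator.ge, 0.8000, 0.9375),
    ("recall", operator.ge, 0.6100, 0.6954),
    ("false positives", operator.le, 20, 7),
    ("false negatives", operator.le, 58, 46),
    ("over-label events", operator.le, 1, 0),  # "0 or 1" for a non-negative count
    ("structural failures", operator.eq, 0, 0),
    ("exact matches", operator.ge, 15, 19),
    ("mean predicted-label delta", operator.le, 0.10, 0.0000),
]


def check_gates(gates):
    """Return a pass/fail flag per gate; a candidate is promotable only if all pass."""
    return {name: compare(observed, target) for name, compare, target, observed in gates}


results = check_gates(GATES)
promote = all(results.values())
for name, passed in results.items():
    print(f"{name}: {'pass' if passed else 'fail'}")
```

Applied to the Hardcase repair v2 row of the Candidate Metrics table, the same thresholds would fail on over-label events (4) and mean predicted-label delta (0.4167).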
Updated Links
Prop20 GEPA score report, Prop20 candidate tree, Prompt diffs
Promoted routing-policy SHA-256: b2576ca027148e109a1e72029c192ea9e26be486508671ec7c11025ac80f948b.
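A recorded digest like this can be checked against a local copy of the policy before use. The sketch below is one way to do that with the standard library. It assumes the digest covers the raw bytes of the policy file, which the report does not state, and the helper names are hypothetical.

```python
import hashlib
from pathlib import Path

# Digest recorded in the report for the promoted routing policy.
EXPECTED_SHA256 = "b2576ca027148e109a1e72029c192ea9e26be486508671ec7c11025ac80f948b"


def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large files are never loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_promoted_policy(path: Path) -> bool:
    """True when the file's raw bytes hash to the recorded digest (assumed scope)."""
    return sha256_of_file(path) == EXPECTED_SHA256
```

A mismatch means either that the file differs from the promoted policy or that the digest was computed over something other than the raw file bytes.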
Remaining v3 Misses
The 41 rows (60 minus 19 exact matches) where v3 is not an exact match. These are the next useful targets for a clean, holdout-aware follow-up.
| Target | Title | Score | Gold | Predicted | FP | FN |
|---|---|---|---|---|---|---|
| #44379 | fix(pi-runner): harden context-overflow recovery with one suppress-hook retry | 0.1667 | coding_agents, memory, hooks, reliability | agent_runtime, reliability | agent_runtime | coding_agents, memory, hooks |
| #45393 | fix(errors): friendly message and last-message repair for tool_use/tool_result mismatch (#45385) | 0.2000 | tool_calling, coding_agents, reliability | tool_calling, agent_runtime | agent_runtime | coding_agents, reliability |
| #47083 | fix: respect totalTokensFresh flag to avoid showing stale token counts | 0.2000 | sessions, telemetry_usage | ui_tui | ui_tui | sessions, telemetry_usage |
| #81957 | ci: harden GitHub Actions supply-chain boundaries | 0.2500 | security | packaging_deployment | packaging_deployment | security |
| #65364 | feat(plugins): add registerProviderRuntimeAuthOverride API | 0.2500 | auth_identity, api_surface | skills_plugins, auth_identity | skills_plugins | api_surface |
| #80008 | feat(plugins): expose ACP spawn and prompt in plugin runtime | 0.2500 | acp, coding_agents | skills_plugins, acp | skills_plugins | coding_agents |
| #90146 | google-vertex: Missing gemini-3.1-flash-lite in provider catalog causes silent failure instead of error | 0.2500 | local_model_providers, reliability | local_model_providers, model_serving | model_serving | reliability |
| #73910 | BUG: OpenClaw-managed Codex ACP uses isolated CODEX_HOME without auth bridge and sends unsupported timeout config | 0.3333 | codex, acp, acpx, auth_identity | codex, acp | none | acpx, auth_identity |
| #52249 | ACP parent session stuck until refresh when yielded waiting for child completion | 0.5000 | acp, sessions, reliability | acp, sessions | none | reliability |
| #83863 | ACP/Codex child tasks can be marked succeeded with progress-only output and no final deliverable | 0.5000 | acp, codex, agent_runtime | acp, codex | none | agent_runtime |
| #48940 | ACP: add gateway-owned node-backed runtime | 0.5000 | acp, gateway, agent_runtime | acp, gateway | none | agent_runtime |
| #48580 | Bug: acpx codex sessions 创建的会话立即退出 [created sessions exit immediately] - stdin is not a terminal | 0.5000 | acpx, codex, sessions | acpx, codex | none | sessions |
| #39248 | Bug: sandbox.mode: "non-main" silently breaks sessions_spawn subagent initialization | 0.5000 | coding_agents, sandboxing, agent_runtime | sandboxing, agent_runtime | none | coding_agents |
| #71216 | Config schema: add `sandbox`, `routing.rules`, `instances`, and `gateway.nodes.denyPaths` | 0.5000 | config, sandboxing, gateway | config, gateway | none | sandboxing |
| #84477 | Discord embedded-run prep wedge before strict-agentic, recovery skips sessionId=unknown lanes | 0.5000 | sessions, agent_runtime, reliability | sessions, reliability | none | agent_runtime |
| #51667 | Feature: Native Audio Input for Omni-Modal Models (skip STT transcription) | 0.5000 | model_serving, security, config | model_serving, config | none | security |
| #43765 | Improve runtime recovery for heartbeat, Feishu, and exec sessions | 0.5000 | reliability, exec_tools, cron_automation | reliability, exec_tools | none | cron_automation |
| #80783 | Policy: add model, network, and MCP conformance checks | 0.5000 | mcp_tooling, config, security | config, mcp_tooling | none | security |
| #68187 | SSE-backed MCP sessions can stay stale after server restart and fail with 'Session not found' | 0.5000 | mcp_tooling, sessions, gateway | mcp_tooling, sessions | none | gateway |
| #78528 | Security: skill SecretRef API keys still leak into exec child environments | 0.5000 | security, exec_tools, skills_plugins | security, exec_tools | none | skills_plugins |
| #84715 | [Bug]: @openclaw/codex peer link failure reproduced on 2026.5.19 after update | 0.5000 | codex, packaging_deployment | codex | none | packaging_deployment |
| #70529 | [Bug]: Desktop cannot use existing Chrome sessions: EasyClaw Google sign-in fails, and user profile attach fails with spawn npx ENOENT | 0.5000 | browser_automation, packaging_deployment | browser_automation | none | packaging_deployment |
| #84757 | [Bug]: Telegram session can get stuck after compaction when encrypted reasoning content fails verification | 0.5000 | sessions, chat_integrations, reliability | sessions, chat_integrations | none | reliability |
| #44202 | [Bug]: local memory embeddings on Apple Silicon can crash gateway in ggml-metal / node-llama-cpp; need official Metal/GPU guidance | 0.5000 | local_models, memory, self_hosted_inference | local_models, memory | none | self_hosted_inference |
| #10467 | [Feature Request]: Multi-lane concurrency support for sub-agents via sessions_spawn | 0.5000 | queueing, sessions, coding_agents | queueing, sessions | none | coding_agents |
| #82507 | [Feature]: ACPX Codex sandbox should inherit user-installed plugins (e.g. Superpowers) | 0.5000 | acpx, codex, skills_plugins | acpx, skills_plugins | none | codex |
| #40332 | [Feature]: Per-binding and per-agent permissionMode for ACP sessions | 0.5000 | acp, approvals, acpx | approvals, acpx | none | acp |
| #84670 | [codex] fix webchat full-message reader for truncated history | 0.5000 | gateway, api_surface, ui_tui | gateway, ui_tui | none | api_surface |
| #84583 | cron announce delivery triggers EmbeddedAttemptSessionTakeoverError when user is actively chatting | 0.5000 | cron_automation, sessions, reliability | cron_automation, sessions | none | reliability |
| #68725 | feat(amazon-bedrock-mantle): add known context windows for open-weight Mantle models | 0.5000 | open_weight_models, local_model_providers | open_weight_models | none | local_model_providers |
| #56442 | feat: Add opt-in ACP parent completion notify for sessions_spawn | 0.5000 | acp, sessions, agent_runtime | acp, sessions | none | agent_runtime |
| #60979 | feature: sessions_spawn ACP delivery to channel (stream output to Zulip/Discord topic) | 0.5000 | acp, chat_integrations, sessions | acp, sessions | none | chat_integrations |
| #52747 | fix(acp): time out stuck session lane tasks | 0.5000 | acp, sessions, reliability | acp, sessions | none | reliability |
| #84763 | fix(acpx): scrub provider credential env from ACP harness spawns | 0.5000 | acpx, acp, security | acp, acpx | none | security |
| #69256 | fix(cron): prevent premature session cleanup when subagents are running | 0.5000 | cron_automation, sessions, reliability | cron_automation, sessions | none | reliability |
| #65242 | fix: CompletionDeliveryGate to prevent duplicate ACP completion delivery | 0.5000 | acp, coding_agents, reliability | acp, reliability | none | coding_agents |
| #77827 | fix: LM Studio thinking blocks invisible with Responses API | 0.5000 | model_serving, local_models | local_models | none | model_serving |
| #42027 | fix: resolve exec PATH fallback, layered browser diagnostics, and cron force-run deadlock | 0.5000 | exec_tools, browser_automation, cron_automation | exec_tools, browser_automation | none | cron_automation |
| #84752 | fix: self-heal lane wedges + restore openai-codex OAuth on embedded path | 0.5000 | reliability, auth_identity, sessions | reliability, auth_identity | none | sessions |
| #63826 | security: fix HIGH/CRITICAL vulns in skill scanner, SSRF, hook priority, and token verification | 0.5000 | security, hooks, skills_plugins | security, skills_plugins | none | hooks |
| #62428 | test(exec): land exec v2 contract follow-through | 0.5000 | exec_tools, sandboxing, approvals | exec_tools, sandboxing | none | approvals |
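The FP and FN columns above are the two set differences between the predicted and gold labels. A minimal sketch using two rows from the table follows; the helper name `label_diff` is illustrative. The per-row Score column is not reproduced, because the report does not state its weighting.

```python
from collections import Counter


def label_diff(gold: set[str], predicted: set[str]) -> dict[str, object]:
    """FP = predicted but not gold; FN = gold but not predicted."""
    return {
        "fp": sorted(predicted - gold),
        "fn": sorted(gold - predicted),
        "exact": gold == predicted,
    }


# Gold and predicted labels copied from two rows of the table above.
rows = {
    "#44379": ({"coding_agents", "memory", "hooks", "reliability"},
               {"agent_runtime", "reliability"}),
    "#52249": ({"acp", "sessions", "reliability"},
               {"acp", "sessions"}),
}

diffs = {target: label_diff(gold, pred) for target, (gold, pred) in rows.items()}

# Tally which labels are missed most often, to prioritise follow-up work.
missed = Counter(label for diff in diffs.values() for label in diff["fn"])

for target, diff in diffs.items():
    print(target, "FP:", diff["fp"], "FN:", diff["fn"])
```

Running the tally over all remaining misses shows which labels v3 drops most often. That is a natural starting point for a follow-up evaluated on a fresh holdout.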