Localpager GEPA Final Report

Updated 2026-06-14T13:35:45Z. Dataset: Shaun ordered 60-row set. Gold labels: canonical ds4.jsonl. Task model: local 12B. Concurrency: 2.

Promoted Result

Final mean score: 0.6233 (delta 0.1911 vs v9.1)
Micro-F1: 0.7985 (target >= 0.7000)
Precision / recall: 0.9375 / 0.6954 (FP weighted harder than FN)
False positives: 7 (target <= 20)
False negatives: 46 (target <= 58)
Over-label events: 0 (target 0 or 1)
Structural failures: 0 (target 0)
Mean predicted labels: 1.8667 (same as v9.1 baseline)

The promoted v3 candidate clears every strict gate and keeps mean predicted labels at the v9.1 baseline of 1.8667. The main caveat is scientific, not mechanical: the manual repair was tuned on mistakes from this same 60-row set, so these scores are in-sample, and a fresh holdout is still needed before treating them as deployment evidence.

60-Row Score Trajectory

[Chart: mean score on the 60-row set by candidate, in timeline order. v9.1 seed 0.4322; GEPA six 0.4912; Previous proper best 0.5355; Prop20 best 0.5936; Hardcase repair v2 0.7729; Cardinality repair v3 0.6233.]

Candidate Metrics

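| Candidate | Mean | Delta | Precision | Recall | F1 | FP | FN | Over | Struct | Exact | Mean Labels | Label Delta |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| v9.1 seed (baseline) | 0.4322 | 0.0000 | 0.7411 | 0.5804 | 0.6510 | 29 | 60 | 2 | 3 | 10 | 1.8667 | 0.0000 |
| GEPA six (early GEPA) | 0.4912 | 0.0590 | 0.7965 | 0.5960 | 0.6818 | 23 | 61 | 1 | 0 | 12 | n/a | n/a |
| Previous proper best (proper GEPA) | 0.5355 | 0.1033 | 0.7778 | 0.6490 | 0.7076 | 28 | 53 | 5 | 2 | 21 | 2.1000 | 0.2333 |
| Prop20 best (proper GEPA) | 0.5936 | 0.1613 | 0.8707 | 0.6689 | 0.7566 | 15 | 50 | 2 | 4 | 24 | 1.9333 | 0.0667 |
| Hardcase repair v2 (manual repair, rejected) | 0.7729 | 0.3407 | 0.9343 | 0.8477 | 0.8889 | 9 | 23 | 4 | 0 | 38 | 2.2833 | 0.4167 |
| Cardinality repair v3 (promoted) | 0.6233 | 0.1911 | 0.9375 | 0.6954 | 0.7985 | 7 | 46 | 0 | 0 | 19 | 1.8667 | 0.0000 |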
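
The promoted row's precision, recall, and micro-F1 can be cross-checked from its error counts. The sketch below is not the report's evaluation code. It assumes micro-averaging over all 60 rows and derives the true-positive count (105) from the mean predicted labels and the FP count; that figure is an inference, not a number stated in the report.

```python
def micro_metrics(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from label counts pooled over all rows."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


ROWS = 60
fp, fn = 7, 46                       # reported for Cardinality repair v3
predicted = round(1.8667 * ROWS)     # mean predicted labels * rows = 112
tp = predicted - fp                  # assumption: every prediction is a TP or an FP

p, r, f1 = micro_metrics(tp, fp, fn)
print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
# precision=0.9375 recall=0.6954 f1=0.7985
```

The same arithmetic implies 151 gold labels across the set (105 + 46) for the promoted candidate.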

Strict Gate Check

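| Gate | Target | Observed | Result |
| --- | --- | --- | --- |
| mean weighted score | >= 0.5400 | 0.6233 | pass |
| score delta vs v9.1 | >= +0.1000 | 0.1911 | pass |
| micro-F1 | >= 0.7000 | 0.7985 | pass |
| precision | >= 0.8000 | 0.9375 | pass |
| recall | >= 0.6100 | 0.6954 | pass |
| false positives | <= 20 | 7 | pass |
| false negatives | <= 58 | 46 | pass |
| over-label events | 0 or 1 | 0 | pass |
| structural failures | 0 | 0 | pass |
| exact matches | >= 15 | 19 | pass |
| mean predicted-label delta | <= +0.10 | 0.0000 | pass |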
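
A gate check like this one is easy to automate. The following is a minimal sketch, assuming the thresholds listed in this section; the metric key names and the `failed_gates` helper are hypothetical and are not taken from the project's tooling.

```python
import operator

# (metric key, comparison, threshold), copied from the Strict Gate Check values.
# Key names are illustrative assumptions.
GATES = [
    ("mean_score", operator.ge, 0.54),
    ("score_delta", operator.ge, 0.10),
    ("micro_f1", operator.ge, 0.70),
    ("precision", operator.ge, 0.80),
    ("recall", operator.ge, 0.61),
    ("false_positives", operator.le, 20),
    ("false_negatives", operator.le, 58),
    ("over_label_events", operator.le, 1),    # "0 or 1"
    ("structural_failures", operator.le, 0),  # must be exactly 0
    ("exact_matches", operator.ge, 15),
    ("label_delta", operator.le, 0.10),
]


def failed_gates(metrics: dict) -> list[str]:
    """Return the names of all gates that the given metrics do not satisfy."""
    return [name for name, compare, threshold in GATES
            if not compare(metrics[name], threshold)]


# Observed values for Cardinality repair v3.
V3 = {
    "mean_score": 0.6233, "score_delta": 0.1911, "micro_f1": 0.7985,
    "precision": 0.9375, "recall": 0.6954, "false_positives": 7,
    "false_negatives": 46, "over_label_events": 0, "structural_failures": 0,
    "exact_matches": 19, "label_delta": 0.0,
}

print(failed_gates(V3))  # []
```

Under the same thresholds, a candidate with a label delta of 0.4167 (the value reported for Hardcase repair v2) would fail the mean predicted-label delta gate regardless of its score.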

Remaining v3 Misses

Rows where v3 is not an exact match (41 of 60). They are the next useful targets for a clean, holdout-aware follow-up.

| Target | Title | Score | Gold | Predicted | FP | FN |
| --- | --- | --- | --- | --- | --- | --- |
| #44379 | fix(pi-runner): harden context-overflow recovery with one suppress-hook retry | 0.1667 | coding_agents, memory, hooks, reliability | agent_runtime, reliability | agent_runtime | coding_agents, memory, hooks |
| #45393 | fix(errors): friendly message and last-message repair for tool_use/tool_result mismatch (#45385) | 0.2000 | tool_calling, coding_agents, reliability | tool_calling, agent_runtime | agent_runtime | coding_agents, reliability |
| #47083 | fix: respect totalTokensFresh flag to avoid showing stale token counts | 0.2000 | sessions, telemetry_usage | ui_tui | ui_tui | sessions, telemetry_usage |
| #81957 | ci: harden GitHub Actions supply-chain boundaries | 0.2500 | security | packaging_deployment | packaging_deployment | security |
| #65364 | feat(plugins): add registerProviderRuntimeAuthOverride API | 0.2500 | auth_identity, api_surface | skills_plugins, auth_identity | skills_plugins | api_surface |
| #80008 | feat(plugins): expose ACP spawn and prompt in plugin runtime | 0.2500 | acp, coding_agents | skills_plugins, acp | skills_plugins | coding_agents |
| #90146 | google-vertex: Missing gemini-3.1-flash-lite in provider catalog causes silent failure instead of error | 0.2500 | local_model_providers, reliability | local_model_providers, model_serving | model_serving | reliability |
| #73910 | BUG: OpenClaw-managed Codex ACP uses isolated CODEX_HOME without auth bridge and sends unsupported timeout config | 0.3333 | codex, acp, acpx, auth_identity | codex, acp | none | acpx, auth_identity |
| #52249 | ACP parent session stuck until refresh when yielded waiting for child completion | 0.5000 | acp, sessions, reliability | acp, sessions | none | reliability |
| #83863 | ACP/Codex child tasks can be marked succeeded with progress-only output and no final deliverable | 0.5000 | acp, codex, agent_runtime | acp, codex | none | agent_runtime |
| #48940 | ACP: add gateway-owned node-backed runtime | 0.5000 | acp, gateway, agent_runtime | acp, gateway | none | agent_runtime |
| #48580 | Bug: sessions created by acpx codex sessions exit immediately - stdin is not a terminal | 0.5000 | acpx, codex, sessions | acpx, codex | none | sessions |
| #39248 | Bug: sandbox.mode: "non-main" silently breaks sessions_spawn subagent initialization | 0.5000 | coding_agents, sandboxing, agent_runtime | sandboxing, agent_runtime | none | coding_agents |
| #71216 | Config schema: add `sandbox`, `routing.rules`, `instances`, and `gateway.nodes.denyPaths` | 0.5000 | config, sandboxing, gateway | config, gateway | none | sandboxing |
| #84477 | Discord embedded-run prep wedge before strict-agentic, recovery skips sessionId=unknown lanes | 0.5000 | sessions, agent_runtime, reliability | sessions, reliability | none | agent_runtime |
| #51667 | Feature: Native Audio Input for Omni-Modal Models (skip STT transcription) | 0.5000 | model_serving, security, config | model_serving, config | none | security |
| #43765 | Improve runtime recovery for heartbeat, Feishu, and exec sessions | 0.5000 | reliability, exec_tools, cron_automation | reliability, exec_tools | none | cron_automation |
| #80783 | Policy: add model, network, and MCP conformance checks | 0.5000 | mcp_tooling, config, security | config, mcp_tooling | none | security |
| #68187 | SSE-backed MCP sessions can stay stale after server restart and fail with 'Session not found' | 0.5000 | mcp_tooling, sessions, gateway | mcp_tooling, sessions | none | gateway |
| #78528 | Security: skill SecretRef API keys still leak into exec child environments | 0.5000 | security, exec_tools, skills_plugins | security, exec_tools | none | skills_plugins |
| #84715 | [Bug]: @openclaw/codex peer link failure reproduced on 2026.5.19 after update | 0.5000 | codex, packaging_deployment | codex | none | packaging_deployment |
| #70529 | [Bug]: Desktop cannot use existing Chrome sessions: EasyClaw Google sign-in fails, and user profile attach fails with spawn npx ENOENT | 0.5000 | browser_automation, packaging_deployment | browser_automation | none | packaging_deployment |
| #84757 | [Bug]: Telegram session can get stuck after compaction when encrypted reasoning content fails verification | 0.5000 | sessions, chat_integrations, reliability | sessions, chat_integrations | none | reliability |
| #44202 | [Bug]: local memory embeddings on Apple Silicon can crash gateway in ggml-metal / node-llama-cpp; need official Metal/GPU guidance | 0.5000 | local_models, memory, self_hosted_inference | local_models, memory | none | self_hosted_inference |
| #10467 | [Feature Request]: Multi-lane concurrency support for sub-agents via sessions_spawn | 0.5000 | queueing, sessions, coding_agents | queueing, sessions | none | coding_agents |
| #82507 | [Feature]: ACPX Codex sandbox should inherit user-installed plugins (e.g. Superpowers) | 0.5000 | acpx, codex, skills_plugins | acpx, skills_plugins | none | codex |
| #40332 | [Feature]: Per-binding and per-agent permissionMode for ACP sessions | 0.5000 | acp, approvals, acpx | approvals, acpx | none | acp |
| #84670 | [codex] fix webchat full-message reader for truncated history | 0.5000 | gateway, api_surface, ui_tui | gateway, ui_tui | none | api_surface |
| #84583 | cron announce delivery triggers EmbeddedAttemptSessionTakeoverError when user is actively chatting | 0.5000 | cron_automation, sessions, reliability | cron_automation, sessions | none | reliability |
| #68725 | feat(amazon-bedrock-mantle): add known context windows for open-weight Mantle models | 0.5000 | open_weight_models, local_model_providers | open_weight_models | none | local_model_providers |
| #56442 | feat: Add opt-in ACP parent completion notify for sessions_spawn | 0.5000 | acp, sessions, agent_runtime | acp, sessions | none | agent_runtime |
| #60979 | feature: sessions_spawn ACP delivery to channel (stream output to Zulip/Discord topic) | 0.5000 | acp, chat_integrations, sessions | acp, sessions | none | chat_integrations |
| #52747 | fix(acp): time out stuck session lane tasks | 0.5000 | acp, sessions, reliability | acp, sessions | none | reliability |
| #84763 | fix(acpx): scrub provider credential env from ACP harness spawns | 0.5000 | acpx, acp, security | acp, acpx | none | security |
| #69256 | fix(cron): prevent premature session cleanup when subagents are running | 0.5000 | cron_automation, sessions, reliability | cron_automation, sessions | none | reliability |
| #65242 | fix: CompletionDeliveryGate to prevent duplicate ACP completion delivery | 0.5000 | acp, coding_agents, reliability | acp, reliability | none | coding_agents |
| #77827 | fix: LM Studio thinking blocks invisible with Responses API | 0.5000 | model_serving, local_models | local_models | none | model_serving |
| #42027 | fix: resolve exec PATH fallback, layered browser diagnostics, and cron force-run deadlock | 0.5000 | exec_tools, browser_automation, cron_automation | exec_tools, browser_automation | none | cron_automation |
| #84752 | fix: self-heal lane wedges + restore openai-codex OAuth on embedded path | 0.5000 | reliability, auth_identity, sessions | reliability, auth_identity | none | sessions |
| #63826 | security: fix HIGH/CRITICAL vulns in skill scanner, SSRF, hook priority, and token verification | 0.5000 | security, hooks, skills_plugins | security, skills_plugins | none | hooks |
| #62428 | test(exec): land exec v2 contract follow-through | 0.5000 | exec_tools, sandboxing, approvals | exec_tools, sandboxing | none | approvals |
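
The per-row scores listed above are consistent with a simple penalty formula, score = 1 / (1 + FN + 2 * FP), which also matches the note that FP is weighted harder than FN. The report does not state its scoring function, so the formula and the `row_score` helper below are inferred from the listed rows and should be read as an assumption, not as the actual scorer.

```python
def row_score(fp: int, fn: int, fp_weight: float = 2.0) -> float:
    """Inferred per-row score: 1.0 for an exact match, shrinking with each error."""
    return 1.0 / (1.0 + fn + fp_weight * fp)


# (FP count, FN count, number of rows) tallied from the miss list,
# plus the 19 exact matches that do not appear in it.
buckets = [
    (0, 0, 19),  # exact matches, score 1.0000
    (1, 3, 1),   # score 0.1667
    (1, 2, 2),   # score 0.2000
    (1, 1, 4),   # score 0.2500
    (0, 2, 1),   # score 0.3333
    (0, 1, 33),  # score 0.5000
]

total_rows = sum(n for _, _, n in buckets)
total_fp = sum(fp * n for fp, _, n in buckets)
total_fn = sum(fn * n for _, fn, n in buckets)
mean = sum(row_score(fp, fn) * n for fp, fn, n in buckets) / total_rows

print(total_rows, total_fp, total_fn, f"{mean:.4f}")
# 60 7 46 0.6233
```

Under this assumed formula the row tallies reproduce the reported mean score of 0.6233 and the totals of 7 false positives and 46 false negatives.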