chore(eval): pin gpt-4o-mini snapshot + wire fastapi golden_dataset + pre-commit tolerances
Prep commit before the counterfactual-query prompt regression. Three
sub-changes bundled because each is "what must be true before the
measurement is valid" (drift control, evaluation path, decision
criteria) and none are the experimental variable itself.
1. Pin provider.py:208 to gpt-4o-mini-2024-07-18 (dated snapshot).
The unpinned alias is a known drift vector; the Mar 25 → Apr 12
P@5 slide is an open parallel item traceable to silent alias
migration. Pre/post regression runs against a floating alias
conflate prompt effect with model drift. Pricing dict gets a
matching dated entry so provider.py:209 cost-lookup resolves.
2. Wire corpora.fastapi.golden_dataset. Production /ask/stream
already routes through format_system_prompt() via
routes.py:_resolve_system_prompt; the app.state.system_prompt
legacy fallback is effectively dead code given the shipped
multi-corpus config. Adding the missing golden_dataset field
makes --corpus fastapi work so the regression gate can measure
the actual production prompt path, not the legacy eval
scaffolding at make evaluate-fast. Purely additive; zero blast
radius on serving.
3. Pre-commit four-metric tolerances (P@5 -0.02, R@5 -0.02, citation
hard, mean tool_calls +0.30, individual cap-hit gate) and strict
pilot_005 flip criteria. Written down before the post-edit runs
so the pass/fail call is not a judgment under confirmation-bias
pressure.
Full test suite passes (438/438). No code touches the prompt
template itself; the prompt edit is the next commit and is the sole
experimental variable.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
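
The tolerances in item 3 of the commit message can be applied mechanically. Below is a minimal sketch, assuming each run is summarized into the four aggregate metrics plus a per-question iteration count. `RunSummary`, `check_tolerances`, and the field names are hypothetical and not taken from the repository; only the thresholds come from the commit message. The cap-hit gate is read here as "no question that stayed under the iteration cap pre-edit may reach it post-edit".

```python
from dataclasses import dataclass

MAX_ITERATIONS = 3  # iteration cap; matches max_iterations in configs/default.yaml
EPS = 1e-9          # absorbs float rounding when a metric sits exactly on a bound


@dataclass(frozen=True)
class RunSummary:
    """Hypothetical aggregate of one evaluation run (not the repo's schema)."""
    p_at_5: float
    r_at_5: float
    citation_accuracy: float
    mean_tool_calls: float
    iterations: dict[str, int]  # question id -> iterations used


def check_tolerances(pre: RunSummary, post: RunSummary) -> dict[str, bool]:
    """Return one pass/fail flag per gate; the change passes only if all are True."""
    # Questions that stayed under the cap pre-edit but reach it post-edit.
    # Questions absent from the pre-edit run are ignored here.
    newly_capped = [
        qid
        for qid, used in post.iterations.items()
        if used >= MAX_ITERATIONS
        and pre.iterations.get(qid, MAX_ITERATIONS) < MAX_ITERATIONS
    ]
    return {
        "p_at_5": post.p_at_5 >= pre.p_at_5 - 0.02 - EPS,
        "r_at_5": post.r_at_5 >= pre.r_at_5 - 0.02 - EPS,
        "citation": post.citation_accuracy >= pre.citation_accuracy,  # hard gate
        "mean_tool_calls": post.mean_tool_calls <= pre.mean_tool_calls + 0.30 + EPS,
        "cap_hit": not newly_capped,
    }


# Illustrative numbers only, not results from any real run.
pre = RunSummary(0.80, 1.00, 1.00, 1.00, {"q1": 1, "q2": 2})
post = RunSummary(0.79, 1.00, 1.00, 1.25, {"q1": 1, "q2": 2})
gates = check_tolerances(pre, post)
print(all(gates.values()), gates)
```

Returning one flag per gate, rather than a single boolean, keeps a failing run attributable to a specific criterion, which fits the stated goal of fixing the pass/fail rule before the post-edit numbers exist.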
- DECISIONS.md +64 -0
- agent_bench/core/provider.py +2 -2
- configs/default.yaml +4 -0
@@ -718,3 +718,67 @@ the threshold calibration should stand on its own empirical merit.
 Feedback memory `feedback_fix_before_sweep.md` applies recursively:
 fix measurement-affecting bugs at every layer before combining
 fixes into single experiments.
+
+## Prep for counterfactual-query prompt regression: pin, wire, tolerances
+
+**Three sub-changes bundled as one prep commit, each small and in
+service of making the downstream regression measurement valid.**
+
+**1. OpenAI model pin.** `agent_bench/core/provider.py:208` changes
+`self.model = "gpt-4o-mini"` → `self.model = "gpt-4o-mini-2024-07-18"`.
+The unpinned alias is a known drift vector: the Mar 25 → Apr 12 P@5
+slide bisection is an already-open parallel track item traceable to
+silent alias migration. A regression run that uses the alias across
+pre-edit and post-edit phases conflates prompt-clause effect with
+model drift, even within a single session if the alias happens to
+roll between runs. Pinning the dated snapshot removes the variable.
+Pricing dict in `configs/default.yaml` gets a matching
+`gpt-4o-mini-2024-07-18` entry so the cost-lookup at
+`provider.py:209` still resolves. Tests that pin the model string
+live in mock response payloads (not outgoing assertions) and the
+langchain baseline (separate code path); neither affected.
+
+**2. FastAPI multi-corpus eval wiring.** `configs/default.yaml`
+adds `corpora.fastapi.golden_dataset: agent_bench/evaluation/datasets/tech_docs_golden.json`.
+The production serving path at `routes.py:105-120 _resolve_system_prompt`
+already routes `/ask` and `/ask/stream` through `format_system_prompt(label)`
+from `core/prompts.py`; the `app.state.system_prompt` legacy fallback
+(serving/app.py:276) is effectively dead code given the shipped multi-corpus
+config. The **only** remaining caller of `task.system_prompt` is the
+`scripts/evaluate.py` legacy branch used by `make evaluate-fast`. Adding
+the missing `golden_dataset` field makes `--corpus fastapi` work so the
+regression gate can measure the actual production prompt path, not the
+legacy eval-scaffolding prompt. Purely additive; zero blast radius on
+serving (serving doesn't read `golden_dataset`).
+
+**3. Pre-committed four-metric tolerances.** Written down now, before
+the post-edit runs, so the pass/fail call on the counterfactual-query
+prompt clause is not a judgment under confirmation-bias pressure.
+Applied identically to FastAPI and K8s:
+
+| Metric | Pass criterion |
+|---|---|
+| P@5 | post-edit ≥ pre-edit - 0.02 |
+| R@5 | post-edit ≥ pre-edit - 0.02 |
+| Citation accuracy | post-edit ≥ pre-edit (**hard gate**: any drop blocks commit) |
+| Mean `tool_calls_made` | post-edit ≤ pre-edit + 0.30 |
+| Individual question cap | no question that used fewer than `max_iterations=3` iterations pre-edit may hit the cap post-edit |
+
+**pilot_005 strict flip criterion (K8s-only):**
+- `keyword_hit_rate ≥ 0.60` against golden keywords `["not", "does not", "NetworkPolicy", "service mesh", "TLS", "ingress controller"]`
+- Answer cites `k8s_network_policies.md`
+- Answer contains "service mesh" OR "ingress controller" (the concrete documented-negative evidence the pre-edit refusal lacked)
+- Answer does NOT begin with refusal phrasing ("The ... documentation does not provide", "I cannot answer")
+
+**Baseline reference:** K8s pre-edit numbers from `results/k8s_preedit.json`
+at commit `b97f00f`: P@5 0.80, R@5 1.00, citation 1.00 (all 6),
+mean tool_calls 1.167. FastAPI pre-edit reference established by
+`results/fastapi_preedit.json` in the next step of this session,
+same pinned ID, same refusal threshold (0.02).
+
+**Rationale for bundling.** All three sub-changes answer "what must
+be true before the regression measurement is valid": drift control,
+evaluation path, decision criteria. Splitting into three commits
+would add noise without adding signal. None of them change the
+prompt template itself; the prompt edit is the NEXT commit and is
+the sole experimental variable the regression measures.
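
The pilot_005 strict flip criterion in the DECISIONS.md hunk above combines four conditions, so it may help to see them as one predicate. This is a rough sketch under stated assumptions: `keyword_hit_rate` is taken to be the fraction of golden keywords found as case-insensitive substrings (the repository's actual definition is not shown in this diff), the elided "..." in the refusal phrasing is treated as a wildcard, and `pilot_005_flipped` with its `cited_sources` argument is a hypothetical interface.

```python
import re

# Golden keywords and refusal phrasings as listed in DECISIONS.md.
GOLDEN_KEYWORDS = ["not", "does not", "NetworkPolicy", "service mesh", "TLS", "ingress controller"]
REFUSAL_PREFIXES = [
    # The "..." in the documented phrasing is matched as a short wildcard (assumption).
    re.compile(r"^the\b.{0,80}?\bdocumentation does not provide", re.IGNORECASE),
    re.compile(r"^i cannot answer", re.IGNORECASE),
]


def keyword_hit_rate(answer: str, keywords: list[str] = GOLDEN_KEYWORDS) -> float:
    """Fraction of keywords present as case-insensitive substrings (crude: 'not' also matches 'cannot')."""
    lowered = answer.lower()
    hits = sum(1 for kw in keywords if kw.lower() in lowered)
    return hits / len(keywords)


def pilot_005_flipped(answer: str, cited_sources: list[str]) -> bool:
    """True only if all four documented conditions hold."""
    text = answer.strip()
    lowered = text.lower()
    return (
        keyword_hit_rate(text) >= 0.60
        and "k8s_network_policies.md" in cited_sources
        and ("service mesh" in lowered or "ingress controller" in lowered)
        and not any(pattern.match(text) for pattern in REFUSAL_PREFIXES)
    )


# Made-up answers for illustration; neither is model output from the project.
good = "NetworkPolicy does not terminate TLS; use a service mesh or an ingress controller."
refusal = "The Kubernetes documentation does not provide details on TLS."
print(pilot_005_flipped(good, ["k8s_network_policies.md"]))     # True
print(pilot_005_flipped(refusal, ["k8s_network_policies.md"]))  # False
```

The refusal check runs on the start of the answer only, matching the "does NOT begin with" wording, so an answer that mentions a documentation gap later in the text is not rejected by that condition.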
@@ -192,7 +192,7 @@ class MockProvider(LLMProvider):
 
 
 class OpenAIProvider(LLMProvider):
-    """OpenAI API provider
+    """OpenAI API provider pinned to a dated gpt-4o-mini snapshot."""
 
     def __init__(self, config: AppConfig | None = None) -> None:
         try:
@@ -205,7 +205,7 @@ class OpenAIProvider(LLMProvider):
         self.config = config or load_config()
         api_key = os.environ.get("OPENAI_API_KEY", "")
         self.client = AsyncOpenAI(api_key=api_key)
-        self.model = "gpt-4o-mini"
+        self.model = "gpt-4o-mini-2024-07-18"
         model_pricing = self.config.provider.models.get(self.model)
         self._input_cost = model_pricing.input_cost_per_mtok if model_pricing else 0.15
         self._output_cost = model_pricing.output_cost_per_mtok if model_pricing else 0.60
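
The pricing lookup in the hunk above falls back to hard-coded costs when the model ID has no entry, so a renamed model would not fail loudly. The sketch below isolates that behavior to show why the matching pricing entry matters. `ModelPricing`, `resolve_pricing`, and `estimate_cost` are hypothetical stand-ins rather than the repository's classes, and "mtok" is assumed to mean one million tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPricing:
    """Hypothetical stand-in for one entry under provider.models."""
    input_cost_per_mtok: float
    output_cost_per_mtok: float


# Same fallback values that appear in the provider.py hunk.
DEFAULT_PRICING = ModelPricing(0.15, 0.60)


def resolve_pricing(models: dict[str, ModelPricing], model_id: str) -> tuple[ModelPricing, bool]:
    """Return (pricing, found); found is False when the fallback was used."""
    pricing = models.get(model_id)
    if pricing is None:
        return DEFAULT_PRICING, False
    return pricing, True


def estimate_cost(pricing: ModelPricing, input_tokens: int, output_tokens: int) -> float:
    """Cost of one call, assuming prices are quoted per million tokens."""
    return (
        input_tokens * pricing.input_cost_per_mtok
        + output_tokens * pricing.output_cost_per_mtok
    ) / 1_000_000


models = {"gpt-4o-mini": ModelPricing(0.15, 0.60)}
_, found_before = resolve_pricing(models, "gpt-4o-mini-2024-07-18")

models["gpt-4o-mini-2024-07-18"] = ModelPricing(0.15, 0.60)
pricing, found_after = resolve_pricing(models, "gpt-4o-mini-2024-07-18")

print(found_before, found_after)  # False True
```

With identical prices the fallback would have produced the same cost figures here, but surfacing the `found` flag makes a missing entry visible if the alias and the snapshot ever diverge in price.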
@@ -8,6 +8,9 @@ provider:
     gpt-4o-mini:
       input_cost_per_mtok: 0.15
       output_cost_per_mtok: 0.60
+    gpt-4o-mini-2024-07-18:  # dated pin used by OpenAIProvider.model at runtime
+      input_cost_per_mtok: 0.15
+      output_cost_per_mtok: 0.60
     claude-sonnet-4-20250514:
       input_cost_per_mtok: 3.0
       output_cost_per_mtok: 15.0
@@ -99,6 +102,7 @@ corpora:
     refusal_threshold: 0.02  # matches legacy rag.refusal_threshold
     top_k: 5
     max_iterations: 3
+    golden_dataset: agent_bench/evaluation/datasets/tech_docs_golden.json
   k8s:
     label: "Kubernetes"
     store_path: .cache/store_k8s
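
Since `--corpus fastapi` only works once the `golden_dataset` field exists, a small pre-flight check over the parsed config could catch the same gap for any corpus added later. This is a sketch only: it operates on a plain dict standing in for the parsed `corpora:` section, and `corpora_missing_golden_dataset` is a hypothetical helper, not something the repository is known to contain.

```python
from typing import Any, Mapping


def corpora_missing_golden_dataset(corpora: Mapping[str, Mapping[str, Any]]) -> list[str]:
    """Return the names of corpora that have no non-empty golden_dataset entry."""
    return sorted(
        name
        for name, settings in corpora.items()
        if not settings.get("golden_dataset")
    )


# Illustrative stand-ins for the fastapi corpus before and after this commit.
before = {"fastapi": {"refusal_threshold": 0.02, "top_k": 5, "max_iterations": 3}}
after = {
    "fastapi": {
        **before["fastapi"],
        "golden_dataset": "agent_bench/evaluation/datasets/tech_docs_golden.json",
    }
}

print(corpora_missing_golden_dataset(before))  # ['fastapi']
print(corpora_missing_golden_dataset(after))   # []
```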