Nomearod Claude Opus 4.6 (1M context) committed on
Commit 5c1f49f · 1 Parent(s): 125dac0

chore(eval): pin gpt-4o-mini snapshot + wire fastapi golden_dataset + pre-commit tolerances


Prep commit before the counterfactual-query prompt regression. Three
sub-changes are bundled because each answers "what must be true before
the measurement is valid" (drift control, evaluation path, decision
criteria), and none is the experimental variable itself.

1. Pin provider.py:208 to gpt-4o-mini-2024-07-18 (dated snapshot).
The unpinned alias is a known drift vector; the Mar 25 → Apr 12
P@5 slide is an open parallel item traceable to silent alias
migration. Pre/post regression runs against a floating alias
conflate prompt effect with model drift. The pricing dict gets a
matching dated entry so the provider.py:209 cost lookup resolves.

2. Wire corpora.fastapi.golden_dataset. Production /ask/stream
already routes through format_system_prompt() via
routes.py:_resolve_system_prompt; the app.state.system_prompt
legacy fallback is effectively dead code given the shipped
multi-corpus config. Adding the missing golden_dataset field
makes --corpus fastapi work so the regression gate can measure
the actual production prompt path, not the legacy eval
scaffolding at make evaluate-fast. Purely additive; zero blast
radius on serving.

3. Pre-commit four-metric tolerances (P@5 -0.02, R@5 -0.02, citation
hard, mean tool_calls +0.30, individual cap-hit gate) and strict
pilot_005 flip criteria. Written down before the post-edit runs
so the pass/fail call is not a judgment under confirmation-bias
pressure.
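
The tolerances in item 3 can be read as a mechanical pass/fail check. Below is a minimal sketch of such a gate, not the repository's implementation: `RunMetrics`, `gate_failures`, the per-question iteration dicts, and the `EPS` rounding guard are all hypothetical names introduced for illustration, and the default `cap=3` is taken from `max_iterations: 3` in `configs/default.yaml`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunMetrics:
    """Aggregate metrics for one eval run (hypothetical field names)."""
    p_at_5: float
    r_at_5: float
    citation_accuracy: float
    mean_tool_calls: float


EPS = 1e-9  # assumption: absorbs float rounding exactly at a tolerance boundary


def gate_failures(pre, post, pre_iters, post_iters, cap=3):
    """Return the names of failed gates; an empty list means the run passes.

    pre_iters / post_iters map a question id to the iterations it used.
    """
    failures = []
    if post.p_at_5 < pre.p_at_5 - 0.02 - EPS:
        failures.append("P@5")
    if post.r_at_5 < pre.r_at_5 - 0.02 - EPS:
        failures.append("R@5")
    # Hard gate: any drop at all blocks the commit, so no tolerance here.
    if post.citation_accuracy < pre.citation_accuracy:
        failures.append("citation")
    if post.mean_tool_calls > pre.mean_tool_calls + 0.30 + EPS:
        failures.append("tool_calls")
    # Cap-hit gate: a question below the cap pre-edit must not reach it post-edit.
    for qid, before in pre_iters.items():
        if before < cap and post_iters.get(qid, 0) >= cap:
            failures.append(f"cap_hit:{qid}")
    return failures
```

Returning the list of failed gates rather than a single boolean keeps the decision auditable: the pass/fail call and its reason come from the same pre-committed rules.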

Full test suite passes (438/438). No code touches the prompt
template itself; the prompt edit is the next commit and is the sole
experimental variable.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

DECISIONS.md CHANGED
@@ -718,3 +718,67 @@ the threshold calibration should stand on its own empirical merit.
  Feedback memory `feedback_fix_before_sweep.md` applies recursively:
  fix measurement-affecting bugs at every layer before combining
  fixes into single experiments.
+
+ ## Prep for counterfactual-query prompt regression: pin, wire, tolerances
+
+ **Three sub-changes bundled as one prep commit, each small and in
+ service of making the downstream regression measurement valid.**
+
+ **1. OpenAI model pin.** `agent_bench/core/provider.py:208` changes
+ `self.model = "gpt-4o-mini"` → `self.model = "gpt-4o-mini-2024-07-18"`.
+ The unpinned alias is a known drift vector: the Mar 25 → Apr 12 P@5
+ slide bisection is an already-open parallel track item traceable to
+ silent alias migration. A regression run that uses the alias across
+ pre-edit and post-edit phases conflates prompt-clause effect with
+ model drift, even within a single session if the alias happens to
+ roll between runs. Pinning the dated snapshot removes the variable.
+ Pricing dict in `configs/default.yaml` gets a matching
+ `gpt-4o-mini-2024-07-18` entry so the cost-lookup at
+ `provider.py:209` still resolves. Tests that pin the model string
+ live in mock response payloads (not outgoing assertions) and the
+ langchain baseline (separate code path); neither is affected.
+
+ **2. FastAPI multi-corpus eval wiring.** `configs/default.yaml`
+ adds `corpora.fastapi.golden_dataset: agent_bench/evaluation/datasets/tech_docs_golden.json`.
+ The production serving path at `routes.py:105-120 _resolve_system_prompt`
+ already routes `/ask` and `/ask/stream` through `format_system_prompt(label)`
+ from `core/prompts.py`; the `app.state.system_prompt` legacy fallback
+ (serving/app.py:276) is effectively dead code given the shipped multi-corpus
+ config. The **only** remaining caller of `task.system_prompt` is the
+ `scripts/evaluate.py` legacy branch used by `make evaluate-fast`. Adding
+ the missing `golden_dataset` field makes `--corpus fastapi` work so the
+ regression gate can measure the actual production prompt path, not the
+ legacy eval-scaffolding prompt. Purely additive; zero blast radius on
+ serving (serving doesn't read `golden_dataset`).
+
+ **3. Pre-committed four-metric tolerances.** Written down now, before
+ the post-edit runs, so the pass/fail call on the counterfactual-query
+ prompt clause is not a judgment under confirmation-bias pressure.
+ Applied identically to FastAPI and K8s:
+
+ | Metric | Pass criterion |
+ |---|---|
+ | P@5 | post-edit ≥ pre-edit − 0.02 |
+ | R@5 | post-edit ≥ pre-edit − 0.02 |
+ | Citation accuracy | post-edit ≥ pre-edit (**hard gate**: any drop blocks commit) |
+ | Mean `tool_calls_made` | post-edit ≤ pre-edit + 0.30 |
+ | Individual question cap | no question that used fewer than `max_iterations=3` iterations pre-edit may hit the cap post-edit |
+
+ **pilot_005 strict flip criterion (K8s-only):**
+ - `keyword_hit_rate ≥ 0.60` against golden keywords `["not", "does not", "NetworkPolicy", "service mesh", "TLS", "ingress controller"]`
+ - Answer cites `k8s_network_policies.md`
+ - Answer contains "service mesh" OR "ingress controller" (the concrete documented-negative evidence the pre-edit refusal lacked)
+ - Answer does NOT begin with refusal phrasing ("The ... documentation does not provide", "I cannot answer")
+
+ **Baseline reference:** K8s pre-edit numbers from `results/k8s_preedit.json`
+ at commit `b97f00f`: P@5 0.80, R@5 1.00, citation 1.00 (all 6),
+ mean tool_calls 1.167. FastAPI pre-edit reference established by
+ `results/fastapi_preedit.json` in the next step of this session,
+ same pinned ID, same refusal threshold (0.02).
+
+ **Rationale for bundling.** All three sub-changes answer "what must
+ be true before the regression measurement is valid": drift control,
+ evaluation path, decision criteria. Splitting into three commits
+ would add noise without adding signal. None of them change the
+ prompt template itself; the prompt edit is the NEXT commit and is
+ the sole experimental variable the regression measures.
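
The pilot_005 strict flip criterion above is a conjunction of four checks. The sketch below shows one way to evaluate it, and is not the repository's scorer. It assumes that `keyword_hit_rate` is case-insensitive substring matching, that citations arrive as a list of source filenames, and that "begins with refusal phrasing" can be approximated by a regex on the first line. The names `pilot_005_flipped`, `cited_sources`, and `REFUSAL_PATTERNS` are hypothetical.

```python
import re

GOLDEN_KEYWORDS = ["not", "does not", "NetworkPolicy", "service mesh",
                   "TLS", "ingress controller"]
REQUIRED_SOURCE = "k8s_network_policies.md"

# Assumption: a refusal is detected by how the first line of the answer opens.
REFUSAL_PATTERNS = [
    re.compile(r"^the\b.*?\bdocumentation does not provide", re.IGNORECASE),
    re.compile(r"^i cannot answer", re.IGNORECASE),
]


def keyword_hit_rate(answer, keywords):
    """Fraction of keywords found in the answer (case-insensitive substring)."""
    text = answer.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords)


def pilot_005_flipped(answer, cited_sources):
    """True only if all four strict flip conditions hold."""
    text = answer.strip()
    lowered = text.lower()
    return (
        keyword_hit_rate(text, GOLDEN_KEYWORDS) >= 0.60
        and REQUIRED_SOURCE in cited_sources
        and ("service mesh" in lowered or "ingress controller" in lowered)
        and not any(p.search(text) for p in REFUSAL_PATTERNS)
    )
```

Note that a refusal can still score well on keywords alone, because "not" and "does not" are themselves golden keywords. The refusal-phrasing check is therefore the condition that separates a real flip from a reworded refusal.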
agent_bench/core/provider.py CHANGED
@@ -192,7 +192,7 @@ class MockProvider(LLMProvider):
 
 
 class OpenAIProvider(LLMProvider):
-    """OpenAI API provider using gpt-4o-mini."""
+    """OpenAI API provider pinned to a dated gpt-4o-mini snapshot."""
 
     def __init__(self, config: AppConfig | None = None) -> None:
         try:
@@ -205,7 +205,7 @@ class OpenAIProvider(LLMProvider):
         self.config = config or load_config()
         api_key = os.environ.get("OPENAI_API_KEY", "")
         self.client = AsyncOpenAI(api_key=api_key)
-        self.model = "gpt-4o-mini"
+        self.model = "gpt-4o-mini-2024-07-18"
         model_pricing = self.config.provider.models.get(self.model)
         self._input_cost = model_pricing.input_cost_per_mtok if model_pricing else 0.15
         self._output_cost = model_pricing.output_cost_per_mtok if model_pricing else 0.60
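
The hunk above looks up pricing by the exact model string, so changing `self.model` changes the lookup key. The following sketch isolates that behaviour under stated assumptions: `ModelPricing`, `resolve_costs`, and `estimate_cost_usd` are hypothetical stand-ins for the repository's config objects, and `per_mtok` is read as "per million tokens".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPricing:
    """Hypothetical stand-in for one pricing entry in the config."""
    input_cost_per_mtok: float
    output_cost_per_mtok: float


PINNED_MODEL = "gpt-4o-mini-2024-07-18"
DEFAULT_INPUT_COST = 0.15   # fallback values visible in the diff
DEFAULT_OUTPUT_COST = 0.60


def resolve_costs(models, model_id):
    """Return (input_cost, output_cost, found_in_config) for a model id."""
    pricing = models.get(model_id)
    if pricing is None:
        # No exact-key match: fall back to the hardcoded defaults.
        return DEFAULT_INPUT_COST, DEFAULT_OUTPUT_COST, False
    return pricing.input_cost_per_mtok, pricing.output_cost_per_mtok, True


def estimate_cost_usd(input_tokens, output_tokens, input_cost, output_cost):
    """Cost in USD, assuming prices are quoted per million tokens."""
    return (input_tokens * input_cost + output_tokens * output_cost) / 1_000_000
```

With only a `gpt-4o-mini` entry in the pricing dict, the pinned ID would miss and fall through to the defaults. Those defaults happen to match today, but the miss would be silent, which is presumably why the commit adds a matching dated entry.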
configs/default.yaml CHANGED
@@ -8,6 +8,9 @@ provider:
     gpt-4o-mini:
       input_cost_per_mtok: 0.15
       output_cost_per_mtok: 0.60
+    gpt-4o-mini-2024-07-18:  # dated pin used by OpenAIProvider.model at runtime
+      input_cost_per_mtok: 0.15
+      output_cost_per_mtok: 0.60
     claude-sonnet-4-20250514:
       input_cost_per_mtok: 3.0
       output_cost_per_mtok: 15.0
@@ -99,6 +102,7 @@ corpora:
     refusal_threshold: 0.02  # matches legacy rag.refusal_threshold
     top_k: 5
     max_iterations: 3
+    golden_dataset: agent_bench/evaluation/datasets/tech_docs_golden.json
   k8s:
     label: "Kubernetes"
     store_path: .cache/store_k8s
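
The new `golden_dataset` key is what lets a per-corpus evaluation find its dataset. The sketch below shows how an eval script might resolve it from an already-parsed config dict. This is an assumption about the lookup, not the code in `scripts/evaluate.py`, and `golden_dataset_for` is a hypothetical helper name.

```python
from pathlib import Path


def golden_dataset_for(config, corpus):
    """Resolve the golden dataset path for a corpus from a parsed config dict."""
    corpora = config.get("corpora", {})
    if corpus not in corpora:
        raise KeyError(f"unknown corpus {corpus!r}; known: {sorted(corpora)}")
    dataset = corpora[corpus].get("golden_dataset")
    if not dataset:
        # Before this commit, the fastapi corpus would land here.
        raise ValueError(f"corpus {corpus!r} has no golden_dataset configured")
    return Path(dataset)


# Config fragment mirroring the keys shown in the diff above.
config = {
    "corpora": {
        "fastapi": {
            "top_k": 5,
            "max_iterations": 3,
            "golden_dataset": "agent_bench/evaluation/datasets/tech_docs_golden.json",
        },
    },
}
dataset_path = golden_dataset_for(config, "fastapi")
```

Failing loudly on a missing key, instead of defaulting to some other dataset, keeps a regression run from quietly measuring the wrong corpus.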