Nomearod Claude Opus 4.7 (1M context) committed on
Commit
c39d5c7
·
1 Parent(s): 226b6f4

docs(harness,readme): two re-review must-fix items


harness.py:84 - EvalResult.judge_scores docstring was stale: said
'Empty when item.category == "out_of_scope"', no longer true after
the per-dimension OOS gate fix. Updated to describe the actual
behavior: OOS items get relevance only, reference-based dims
(groundedness, completeness) skipped on OOS, completeness skipped
when reference_answer is empty regardless of category.
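
The gating described above can be sketched as a small pure function. This is an illustration only: `dimensions_to_judge` and its parameters are hypothetical names, not the harness's actual code, and the `"retrieval"` category value is taken from the README's dataset description.

```python
# Hypothetical helper illustrating the per-dimension gate described in the
# commit message. The name and signature are assumptions, not harness.py code.
def dimensions_to_judge(
    category: str,
    reference_answer: str,
    has_judge_provider: bool = True,
) -> list[str]:
    """Return the judge dimensions that would be scored for one item."""
    if not has_judge_provider:
        # No judge_provider configured: judge_scores stays empty.
        return []
    if category == "out_of_scope":
        # OOS items get relevance only; groundedness and completeness
        # are skipped because they depend on a reference.
        return ["relevance"]
    dims = ["groundedness", "relevance"]
    if reference_answer:
        # Completeness needs a non-empty reference_answer, in any category.
        dims.append("completeness")
    return dims
```

Under these assumptions, an in-scope item with a reference answer gets all three dimensions, an in-scope item without one gets groundedness and relevance, and an OOS item gets relevance only.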

README.md - test count 509 → 523 (caught up with the +9 review
regression tests + the +5 Codex regression tests), and the
evaluate-full cost note widened from $0.01–0.05 to $0.01–0.10 with
an explicit per-item breakdown so the OOS-gets-relevance change
doesn't quietly outgrow the upper bound. The change is small for
the current dataset (~10% OOS × 1 extra judge call) but the cost
note should reflect the cost model, not just one dataset's
empirical floor.
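
As a rough sketch of that cost model, the judge-call count can be written as a function of the item mix. Everything here is an assumption for illustration: the function names are hypothetical, the per-call price reuses the README's ~$0.0001/item figure for a single relevance call, every in-scope item is assumed to have a reference answer, and the 24/3 split in the usage note is made up rather than read from the dataset.

```python
# Assumed per-call judge cost, taken from the README's "~$0.0001/item" note
# for a single relevance call. Real pricing depends on provider and prompt size.
COST_PER_JUDGE_CALL_USD = 0.0001


def judge_calls(n_in_scope: int, n_oos: int, oos_gets_relevance: bool = True) -> int:
    """Count judge calls for one run (hypothetical helper).

    In-scope items get 3 dimensions (groundedness, relevance, completeness);
    OOS items get 1 (relevance) after the gate fix, 0 before it.
    """
    in_scope_calls = 3 * n_in_scope
    oos_calls = n_oos if oos_gets_relevance else 0
    return in_scope_calls + oos_calls


def judge_cost_usd(
    n_in_scope: int,
    n_oos: int,
    cost_per_call: float = COST_PER_JUDGE_CALL_USD,
) -> float:
    """Estimate judge-only spend; generation cost is not included."""
    return judge_calls(n_in_scope, n_oos) * cost_per_call
```

With an illustrative split of 24 in-scope and 3 OOS items, `judge_calls(24, 3)` is 75 against 72 before the change, so the fix adds one call per OOS item and the extra spend grows with the OOS share rather than with dataset size alone.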

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (2)
  1. README.md +2 -2
  2. agent_bench/evaluation/harness.py +5 -1
README.md CHANGED
@@ -302,7 +302,7 @@ The golden dataset contains 27 hand-crafted FastAPI questions (19 retrieval · 3
 ## Testing
 
 ```bash
-make test # 509 deterministic tests, no API keys needed
+make test # 523 deterministic tests, no API keys needed
 make lint # ruff + mypy
 ```
 
@@ -314,7 +314,7 @@ These Make targets call paid LLM APIs. Run locally; they are excluded from CI.
 
 | Target | Requires API key | Approximate cost | What it produces |
 |---|---|---|---|
-| `make evaluate-full` | OpenAI or Anthropic | $0.01–0.05 per run | Full-corpus harness run with L1 + L2 judges; results in `results/{run_label}.json` |
+| `make evaluate-full` | OpenAI or Anthropic | $0.01–0.10 per run | Full-corpus harness run with L1 + L2 judges; results in `results/{run_label}.json`. Cost scales with item count × judge dimensions: in-scope items get all 3 (groundedness + relevance + completeness), out-of-scope items get relevance only (~$0.0001/item). |
 | `make calibrate` | Anthropic + OpenAI | ~$2 per full run | Generates frozen system outputs, scores all 6 ablation rows, builds `docs/_generated/kappa_table.md` |
 | `make evaluate-judges` | Anthropic + OpenAI | ~$1 per run | Re-runs the 6 rows against existing system outputs (no regeneration) |
 | `make evaluate-langchain` | OpenAI or Anthropic | $0.01–0.05 per run | LangChain baseline harness for the comparison report |
agent_bench/evaluation/harness.py CHANGED
@@ -82,7 +82,11 @@ class EvalResult(BaseModel):
     answer: str = ""
     retrieved_sources: list[str] = []
     # New in judge-layer v1: per-dimension judge scores. Empty when no
-    # judge_provider configured or item.category == "out_of_scope".
+    # judge_provider is configured. With a provider, OOS items receive
+    # relevance only (refusal-vs-engagement is the L2 signal worth
+    # measuring); reference-based dimensions (groundedness, completeness)
+    # are skipped on OOS. Completeness is also skipped when
+    # reference_answer is empty regardless of category.
     judge_scores: dict[str, ScoreResult] = Field(default_factory=dict)
 
 