Nomearod/agentbench

Nomearod Claude Opus 4.7 (1M context) committed 26 days ago

Commit c39d5c7

1 Parent(s): 226b6f4

docs(harness,readme): two re-review must-fix items

harness.py:84 — EvalResult.judge_scores docstring was stale: said
'Empty when item.category == "out_of_scope"', no longer true after
the per-dimension OOS gate fix. Updated to describe the actual
behavior: OOS items get relevance only, reference-based dims
(groundedness, completeness) skipped on OOS, completeness skipped
when reference_answer is empty regardless of category.

README.md — test count 509 → 523 (caught up with the +9 review
regression tests + the +5 Codex regression tests), and the
evaluate-full cost note widened from $0.01–0.05 to $0.01–0.10 with
an explicit per-item breakdown so the OOS-gets-relevance change
doesn't quietly outgrow the upper bound. The change is small for
the current dataset (~10% OOS × 1 extra judge call) but the cost
note should reflect the cost model, not just one dataset's
empirical floor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (2)

README.md +2 -2
agent_bench/evaluation/harness.py +5 -1

README.md

@@ -302,7 +302,7 @@ The golden dataset contains 27 hand-crafted FastAPI questions (19 retrieval · 3
 ## Testing
 ```bash
-make test    # 509 deterministic tests, no API keys needed
+make test    # 523 deterministic tests, no API keys needed
 make lint    # ruff + mypy
 ```
@@ -314,7 +314,7 @@ These Make targets call paid LLM APIs. Run locally; they are excluded from CI.
 | Target | Requires API key | Approximate cost | What it produces |
 |---|---|---|---|
-| `make evaluate-full` | OpenAI or Anthropic | $0.01–0.05 per run | Full-corpus harness run with L1 + L2 judges; results in `results/{run_label}.json` |
+| `make evaluate-full` | OpenAI or Anthropic | $0.01–0.10 per run | Full-corpus harness run with L1 + L2 judges; results in `results/{run_label}.json`. Cost scales with item count × judge dimensions: in-scope items get all 3 (groundedness + relevance + completeness), out-of-scope items get relevance only (~$0.0001/item). |
 | `make calibrate` | Anthropic + OpenAI | ~$2 per full run | Generates frozen system outputs, scores all 6 ablation rows, builds `docs/_generated/kappa_table.md` |
 | `make evaluate-judges` | Anthropic + OpenAI | ~$1 per run | Re-runs the 6 rows against existing system outputs (no regeneration) |
 | `make evaluate-langchain` | OpenAI or Anthropic | $0.01–0.05 per run | LangChain baseline harness for the comparison report |
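
The cost model named in the updated `make evaluate-full` row (item count × judge dimensions) can be sketched as below. This is an illustration only, not code from the repository: the function names are hypothetical, the per-call price is an assumption read off the "~$0.0001/item" figure for relevance-only items, and the item counts in the usage lines are made up. It counts judge calls only and ignores any other cost in a run.

```python
# Hypothetical sketch of the judge-call cost model; not part of agent_bench.

IN_SCOPE_DIMS = 3   # groundedness + relevance + completeness
OOS_DIMS = 1        # relevance only


def estimate_judge_calls(in_scope_items: int, oos_items: int) -> int:
    """Upper bound on judge calls: one call per item per judged dimension."""
    return in_scope_items * IN_SCOPE_DIMS + oos_items * OOS_DIMS


def estimate_judge_cost(
    in_scope_items: int,
    oos_items: int,
    cost_per_call: float = 0.0001,  # assumed, from the ~$0.0001/item OOS figure
) -> float:
    """Approximate judge spend in dollars under the assumed per-call price."""
    return estimate_judge_calls(in_scope_items, oos_items) * cost_per_call


# Illustrative counts only (not the real dataset split).
calls = estimate_judge_calls(in_scope_items=24, oos_items=3)
cost = estimate_judge_cost(in_scope_items=24, oos_items=3)
```

The in-scope count is an upper bound because completeness is skipped for items without a reference answer.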

agent_bench/evaluation/harness.py

@@ -82,7 +82,11 @@ class EvalResult(BaseModel):
     answer: str = ""
     retrieved_sources: list[str] = []
     # New in judge-layer v1: per-dimension judge scores. Empty when no
-    # judge_provider configured or item.category == "out_of_scope".
+    # judge_provider is configured. With a provider, OOS items receive
+    # relevance only (refusal-vs-engagement is the L2 signal worth
+    # measuring); reference-based dimensions (groundedness, completeness)
+    # are skipped on OOS. Completeness is also skipped when
+    # reference_answer is empty regardless of category.
     judge_scores: dict[str, ScoreResult] = Field(default_factory=dict)
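
The gating rules in the updated comment can be read as a small decision function. The sketch below is a minimal illustration of those rules as stated, not the harness's actual implementation: `Item` and `dimensions_to_judge` are hypothetical names, and only `category`, `reference_answer`, and the three dimension names come from the diff.

```python
# Hypothetical sketch of the per-dimension gate described in the comment.
from dataclasses import dataclass


@dataclass
class Item:
    """Stand-in for a golden-dataset item (assumed shape)."""
    category: str
    reference_answer: str = ""


def dimensions_to_judge(item: Item, judge_configured: bool) -> list[str]:
    """Return the judge dimensions that would populate judge_scores."""
    if not judge_configured:
        return []  # no judge_provider: judge_scores stays empty

    dims = ["relevance"]  # every judged item, including OOS, gets relevance
    if item.category != "out_of_scope":
        dims.append("groundedness")  # reference-based: skipped on OOS
        if item.reference_answer:
            dims.append("completeness")  # also needs a non-empty reference
    return dims


oos = dimensions_to_judge(Item("out_of_scope", "unused"), judge_configured=True)
in_scope = dimensions_to_judge(Item("retrieval", "some reference"), judge_configured=True)
```

Under these rules `oos` contains only `"relevance"`, while `in_scope` contains all three dimensions.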