docs(harness,readme): two re-review must-fix items
harness.py:84 - EvalResult.judge_scores docstring was stale: said
'Empty when item.category == "out_of_scope"', no longer true after
the per-dimension OOS gate fix. Updated to describe the actual
behavior: OOS items get relevance only, reference-based dims
(groundedness, completeness) skipped on OOS, completeness skipped
when reference_answer is empty regardless of category.
README.md - test count 509 → 523 (caught up with the +9 review
regression tests + the +5 Codex regression tests), and the
evaluate-full cost note widened from $0.01–0.05 to $0.01–0.10 with
an explicit per-item breakdown so the OOS-gets-relevance change
doesn't quietly outgrow the upper bound. The change is small for
the current dataset (~10% OOS × 1 extra judge call) but the cost
note should reflect the cost model, not just one dataset's
empirical floor.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- README.md +2 -2
- agent_bench/evaluation/harness.py +5 -1
README.md
@@ -302,7 +302,7 @@ The golden dataset contains 27 hand-crafted FastAPI questions (19 retrieval · 3
 ## Testing
 
 ```bash
-make test #
+make test # 523 deterministic tests, no API keys needed
 make lint # ruff + mypy
 ```
 
@@ -314,7 +314,7 @@ These Make targets call paid LLM APIs. Run locally; they are excluded from CI.
 
 | Target | Requires API key | Approximate cost | What it produces |
 |---|---|---|---|
-| `make evaluate-full` | OpenAI or Anthropic | $0.01–0.
+| `make evaluate-full` | OpenAI or Anthropic | $0.01–0.10 per run | Full-corpus harness run with L1 + L2 judges; results in `results/{run_label}.json`. Cost scales with item count × judge dimensions: in-scope items get all 3 (groundedness + relevance + completeness), out-of-scope items get relevance only (~$0.0001/item). |
 | `make calibrate` | Anthropic + OpenAI | ~$2 per full run | Generates frozen system outputs, scores all 6 ablation rows, builds `docs/_generated/kappa_table.md` |
 | `make evaluate-judges` | Anthropic + OpenAI | ~$1 per run | Re-runs the 6 rows against existing system outputs (no regeneration) |
 | `make evaluate-langchain` | OpenAI or Anthropic | $0.01–0.05 per run | LangChain baseline harness for the comparison report |
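The cost model described in the `make evaluate-full` row (item count × judge dimensions) can be sketched roughly as below. This is an illustration, not code from the repository: the function names and the `cost_per_call_usd` parameter are hypothetical, the per-call price is a placeholder the caller must supply, and the sketch counts judge calls only (it ignores generation cost and the case where completeness is skipped for an in-scope item with an empty reference answer).

```python
# Hypothetical sketch of the judge-call cost model; names are illustrative.
IN_SCOPE_DIMENSIONS = 3      # groundedness + relevance + completeness
OUT_OF_SCOPE_DIMENSIONS = 1  # relevance only


def judge_call_count(n_in_scope: int, n_out_of_scope: int) -> int:
    """Number of judge calls for one run: items times dimensions per item."""
    return (n_in_scope * IN_SCOPE_DIMENSIONS
            + n_out_of_scope * OUT_OF_SCOPE_DIMENSIONS)


def estimated_judge_cost(n_in_scope: int, n_out_of_scope: int,
                         cost_per_call_usd: float) -> float:
    """Judge-only cost estimate; cost_per_call_usd is an assumed input."""
    return judge_call_count(n_in_scope, n_out_of_scope) * cost_per_call_usd
```

Because the out-of-scope term grows with the share of OOS items rather than with one dataset's mix, an upper bound derived from this model stays valid if the dataset changes, which is the motivation the commit message gives for widening the range.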
agent_bench/evaluation/harness.py
@@ -82,7 +82,11 @@ class EvalResult(BaseModel):
     answer: str = ""
     retrieved_sources: list[str] = []
     # New in judge-layer v1: per-dimension judge scores. Empty when no
-    # judge_provider configured
+    # judge_provider is configured. With a provider, OOS items receive
+    # relevance only (refusal-vs-engagement is the L2 signal worth
+    # measuring); reference-based dimensions (groundedness, completeness)
+    # are skipped on OOS. Completeness is also skipped when
+    # reference_answer is empty regardless of category.
     judge_scores: dict[str, ScoreResult] = Field(default_factory=dict)
 
 
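The per-dimension gate that the updated comment describes might look something like the following. This is a minimal sketch under stated assumptions, not the harness's actual implementation: `dimensions_to_score` and the `judge_configured` flag are hypothetical names, and treating a whitespace-only `reference_answer` as empty is an assumption.

```python
# Hypothetical sketch of the per-dimension OOS gate; not the real harness code.
def dimensions_to_score(category: str, reference_answer: str,
                        judge_configured: bool = True) -> list[str]:
    """Return the judge dimensions that would run for a single item."""
    if not judge_configured:
        # No judge_provider configured: judge_scores stays empty.
        return []
    if category == "out_of_scope":
        # OOS items: relevance only; reference-based dimensions are skipped.
        return ["relevance"]
    dims = ["groundedness", "relevance"]
    # Completeness needs a reference answer, regardless of category.
    # Assumption: a whitespace-only reference counts as empty.
    if reference_answer.strip():
        dims.append("completeness")
    return dims
```

For example, `dimensions_to_score("out_of_scope", "")` yields only `relevance`, while an in-scope item with a non-empty reference yields all three dimensions.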