Nomearod Claude Opus 4.6 (1M context) committed on
Commit 086ad86 · 1 Parent(s): 2d1d822

docs: step 8.1 — tagline reframe + README honest-scope + rename closure


Tagline reframe (chosen path from rename due-diligence): keep the
repo name agent-bench, add a one-line positioning tagline under the
H1 — "A RAG benchmark built from primitives, with honest evaluation
of retrieval, refusal, and grounded citation." This captures the
refusal-focused launch angle without the ~350-reference rename cost
and without the HF Space URL break with no redirect.

Close parallel-tracks item #5 (agent-bench → refusal-bench rename)
with the full due-diligence reasoning. Name is hyphenated so the
AgentBench (Liu et al., ICLR 2024, camelcase) collision is visual
not functional; the two projects target different audiences; the
scope is visible to any reviewer reading the README in 30 seconds.
HF Space will be renamed from Nomearod/agentbench to
Nomearod/agent-bench at deploy time to fix the pre-existing
GitHub/HF name inconsistency; that rename is the user's next
action on HF UI.

README updates:
- H1 tagline added under the title
- 288 tests → 444 tests (header badge row, engineering scope,
make test comment block)
- 27-question golden dataset → updated to include both corpora
with the CRAG 8-type K8s distribution made explicit
- Live-demo URLs updated from nomearod-agentbench.hf.space →
nomearod-agent-bench.hf.space (4 references; goes live when
HF Space rename completes)
- New "Methodology Notes" section documenting the LLM-query-
variance finding from the threshold sweep. This is the
structural observation the launch story hinges on: threshold
calibration in RAG systems with LLM-driven query formulation
requires measuring run-to-run variance and sitting below the
noisy floor with margin, not picking the highest value from a
one-shot sweep. Generalizes beyond this repo.
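
The calibration rule in the Methodology Notes item above can be sketched roughly as follows. This is a minimal illustration, not the repo's implementation: the function names, the `margin_sigmas` parameter, and the sample scores are all hypothetical. It assumes each sweep run yields one retrieval max_score per in-scope question, takes the lowest score of each run as that run's floor, and places the threshold below the lowest observed floor by a multiple of the run-to-run standard deviation.

```python
"""Hypothetical sketch: pick a refusal threshold from repeated sweep runs."""
import statistics
from typing import Sequence


def per_run_floors(runs: Sequence[Sequence[float]]) -> list[float]:
    """Return the lowest in-scope retrieval max_score of each run."""
    return [min(scores) for scores in runs]


def calibrate_threshold(
    runs: Sequence[Sequence[float]], margin_sigmas: float = 2.0
) -> float:
    """Sit below the noisy floor with margin instead of trusting one run.

    runs[i][j] is the retrieval max_score of in-scope question j in run i.
    The margin rule (k standard deviations) is an assumption for illustration.
    """
    if len(runs) < 2:
        raise ValueError("need at least two runs to estimate run-to-run variance")
    floors = per_run_floors(runs)
    spread = statistics.stdev(floors)  # sample std dev of the per-run floors
    # Clamp at zero: a negative threshold would never trigger a refusal.
    return max(0.0, min(floors) - margin_sigmas * spread)


if __name__ == "__main__":
    # Invented scores: three runs of the same three in-scope questions.
    runs = [
        [0.030, 0.022, 0.041],
        [0.028, 0.019, 0.044],
        [0.033, 0.025, 0.039],
    ]
    print("per-run floors:", per_run_floors(runs))
    print("one-shot pick from run 0:", min(runs[0]))
    print("calibrated threshold:", calibrate_threshold(runs, margin_sigmas=1.0))
```

In the invented data, a threshold taken from the first run alone would sit above the floor of the second run, so an in-scope question would be refused there. That is the failure mode a one-shot sweep hides.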

Dashboard tile + footer stat: 441 → 444 (matches README).

Not pushed: waiting on the HF Space rename before the new URLs
go live. Once renamed, this commit pushes to origin cleanly and
the README/dashboard links are consistent with the HF Space slug.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Files changed (3)
  1. DECISIONS.md +21 -5
  2. README.md +15 -12
  3. agent_bench/serving/static/index.html +2 -2
DECISIONS.md CHANGED
@@ -1327,11 +1327,27 @@ and decision criteria before measuring.
  4. **Serving-migration deferral.** Tied to external references to
  the counterfactual-query fix. Unchanged from prior sessions.

- 5. **`agent-bench` → `refusal-bench` rename due diligence.**
- Outstanding from session 1. No scope change; the rename has
- repo-wide implications that need a dedicated session (GitHub
- repo URL, HF Space slug, README, import paths, commit
- message prefix conventions).
+ 5. **`agent-bench` → `refusal-bench` rename CLOSED 2026-04-14.**
+ Decision: keep `agent-bench`, reframe via tagline. The original
+ concern was name collision with AgentBench (Liu et al., ICLR
+ 2024, ~1000 citations). Due-diligence at launch time: the name
+ is `agent-bench` (hyphenated) vs. `AgentBench` (camelcase),
+ which are distinct identifiers across GitHub, arXiv, and PyPI.
+ The two projects target different audiences (LLM-as-agent
+ capability vs. RAG+refusal benchmark) and any reviewer reaching
+ the repo via LinkedIn or CV sees the scope in the README within
+ seconds. Rename cost is substantial (~350 internal references
+ across ~60 files, two external account renames, one HF Space
+ URL break with no redirect) for a naming-precision benefit that
+ isn't supported by the actual scope — the benchmark measures
+ retrieval, grounding, multi-hop, citation accuracy, and refusal
+ as seven axes, not refusal alone. Tagline reframe captures the
+ honest-evaluation positioning without the rename cost:
+ > "A RAG benchmark built from primitives, with honest
+ > evaluation of retrieval, refusal, and grounded citation."
+ HF Space `Nomearod/agentbench` renamed to `Nomearod/agent-bench`
+ for consistency with GitHub repo name at the same time (absorbs
+ the HF URL break before the first LinkedIn post links out).

  6. **OpenAI snapshot drift bisection.** Mar 25 → Apr 12 P@5 slide;
  the model pin at `77017db` (`gpt-4o-mini-2024-07-18`) removed
README.md CHANGED
@@ -1,10 +1,12 @@
  # agent-bench

+ **A RAG benchmark built from primitives, with honest evaluation of retrieval, refusal, and grounded citation.**
+
  ![CI](https://github.com/tyy0811/agent-bench/actions/workflows/ci.yaml/badge.svg)

- Agentic knowledge retrieval system with evaluation benchmark. Custom orchestration pipeline + LangChain baseline, evaluated on the same 27-question golden dataset across 3 providers (OpenAI, Anthropic, self-hosted vLLM on Modal). Zero hallucinated citations on all API provider configurations. The separate self-hosted Mistral-7B benchmark is included to show the practical model-size floor where agentic retrieval starts to break down.
+ Agentic knowledge retrieval system with evaluation benchmark. Custom orchestration pipeline + LangChain baseline, evaluated on matched golden datasets across 3 providers (OpenAI, Anthropic, self-hosted vLLM on Modal) and two corpora (FastAPI + Kubernetes). Zero hallucinated citations on all API provider configurations. The separate self-hosted Mistral-7B benchmark is included to show the practical model-size floor where agentic retrieval starts to break down.

- `288 tests` · `3 providers` · `LangChain comparison` · `K8s + Terraform` · `CI`
+ `444 tests` · `3 providers` · `2 corpora` · `LangChain comparison` · `K8s + Terraform` · `CI`

  ## Benchmark Results

@@ -43,21 +45,21 @@ API providers are directly comparable (same config). The self-hosted row uses `m

  ## Live Demo

- **https://nomearod-agentbench.hf.space** (Hugging Face Spaces — first request after idle may take ~30s for cold start)
+ **https://nomearod-agent-bench.hf.space** (Hugging Face Spaces — first request after idle may take ~30s for cold start)

  ```bash
  # In-scope question (expect answer with sources)
- curl -X POST https://nomearod-agentbench.hf.space/ask \
+ curl -X POST https://nomearod-agent-bench.hf.space/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "How do I define a path parameter in FastAPI?"}'

  # Out-of-scope question (expect grounded refusal)
- curl -X POST https://nomearod-agentbench.hf.space/ask \
+ curl -X POST https://nomearod-agent-bench.hf.space/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "How do I cook pasta?"}'

  # Health check
- curl https://nomearod-agentbench.hf.space/health
+ curl https://nomearod-agent-bench.hf.space/health
  ```

  ## Quick Start (Local)
@@ -238,7 +240,7 @@ security:
  - **MLOps:** Provider comparison benchmark (API vs self-hosted, real measured data)
  - **Security — detection & redaction**: Two-tier prompt injection detection (heuristic regex + DeBERTa classifier), PII redaction on retrieved context, output validation gate (PII leakage, URL hallucination, blocklist)
  - **Security — audit & compliance**: Append-only JSONL audit trail, HMAC-SHA256 IP hashing (GDPR-aligned), log rotation, config-driven security with Literal-constrained enums
- - **Production engineering**: FastAPI, Docker, CI/CD, structured logging, rate limiting, SSE streaming, conversation sessions, 288 deterministic tests with mock providers
+ - **Production engineering**: FastAPI, Docker, CI/CD, structured logging, rate limiting, SSE streaming, conversation sessions, 444 deterministic tests with mock providers

  <details><summary>API Reference</summary>

@@ -291,15 +293,16 @@ make benchmark # Generate markdown report from results
  make evaluate-langchain # Run LangChain baseline comparison
  ```

- The golden dataset contains 27 hand-crafted questions:
- - 19 retrieval: 8 easy (single chunk), 7 medium (multi-chunk), 4 hard (multi-source)
- - 3 calculation: questions requiring the calculator tool
- - 5 out-of-scope: questions testing grounded refusal (answer not in corpus)
+ The golden dataset contains 27 hand-crafted FastAPI questions (19 retrieval · 3 calculation · 5 out-of-scope) and 25 hand-crafted Kubernetes questions across the CRAG 8-type taxonomy (6 simple · 4 simple-with-condition · 4 comparison · 6 multi-hop · 4 false-premise · 1 set · 2 time-sensitive). Questions are authored with index-aligned `source_snippets`/`source_chunk_ids` so every expected answer can be traced back to a verbatim string in the ingested store — no LLM-judged ground truth, no paraphrase fuzz.
+
+ ## Methodology Notes
+
+ **Refusal-gate thresholds under LLM-driven query formulation are non-deterministic.** During the Kubernetes 25-question threshold sweep (see [DECISIONS.md](DECISIONS.md) for the full write-up), an unexpected result surfaced: raising `refusal_threshold` from 0.015 to 0.025 produced _fewer_ retrieval-gate trips than 0.020, even though higher thresholds should be strictly more restrictive. Root cause: the orchestrator issues LLM-written queries to the search tool, so the same golden-dataset question produces different retrieval max_scores run-to-run, depending on what query the LLM chose to write. The sweep's "broken retrieval" count at each threshold is therefore not a fixed number but a distribution. The practical implication is that refusal-gate calibration in RAG systems with LLM-driven query formulation requires measuring run-to-run variance and sitting below the noisy floor with margin, not just picking the highest value that passes a one-shot sweep. The K8s threshold is pinned at 0.015 — the empirical pilot floor, validated against the full 25-question set with the variance finding explicitly accounted for.

  ## Testing

  ```bash
- make test # 288 deterministic tests, no API keys needed
+ make test # 444 deterministic tests, no API keys needed
  make lint # ruff + mypy
  ```

agent_bench/serving/static/index.html CHANGED
@@ -228,7 +228,7 @@ code{background:var(--code-bg);padding:2px 6px;border-radius:3px;font-size:0.9em
  <div class="label">Citation Acc</div>
  </div>
  <div class="tile">
- <div class="value">441</div>
+ <div class="value">444</div>
  <div class="label">Tests</div>
  </div>
  <div class="tile">
@@ -393,7 +393,7 @@ code{background:var(--code-bg);padding:2px 6px;border-radius:3px;font-size:0.9em

  <!-- ── Footer ─── -->
  <footer class="footer">
- <div class="footer-stats">agent-bench &middot; MIT License &middot; 441 tests &middot; 3 providers</div>
+ <div class="footer-stats">agent-bench &middot; MIT License &middot; 444 tests &middot; 3 providers</div>
  <div class="footer-name">Built by Jane Yeung &mdash; Munich, Germany</div>
  <div class="footer-links">
  <a href="mailto:">Email</a>