docs: step 8.1 — tagline reframe + README honest-scope + rename closure
Tagline reframe (chosen path from rename due-diligence): keep the
repo name agent-bench, add a one-line positioning tagline under the
H1 — "A RAG benchmark built from primitives, with honest evaluation
of retrieval, refusal, and grounded citation." This captures the
refusal-focused launch angle without the ~350-reference rename cost
or an HF Space URL break that has no redirect.
Close parallel-tracks item #5 (agent-bench → refusal-bench rename)
with the full due-diligence reasoning. Name is hyphenated so the
AgentBench (Liu et al., ICLR 2024, camelcase) collision is visual
not functional; the two projects target different audiences; the
scope is visible to any reviewer reading the README in 30 seconds.
HF Space will be renamed from Nomearod/agentbench to
Nomearod/agent-bench at deploy time to fix the pre-existing
GitHub/HF name inconsistency; that rename is the user's next
action on HF UI.
README updates:
- H1 tagline added under the title
- 288 tests → 444 tests (header badge row, engineering scope,
make test comment block)
- 27-question golden dataset → updated to include both corpora
with the CRAG 8-type K8s distribution made explicit
- Live-demo URLs updated from nomearod-agentbench.hf.space →
nomearod-agent-bench.hf.space (4 references; goes live when
HF Space rename completes)
- New "Methodology Notes" section documenting the LLM-query-
variance finding from the threshold sweep. This is the
structural observation the launch story hinges on: threshold
calibration in RAG systems with LLM-driven query formulation
requires measuring run-to-run variance and sitting below the
noisy floor with margin, not picking the highest value from a
one-shot sweep. Generalizes beyond this repo.
Dashboard tile + footer stat: 441 → 444 (matches README).
Not pushed: waiting on the HF Space rename before the new URLs
go live. Once renamed, this commit pushes to origin cleanly and
the README/dashboard links are consistent with the HF Space slug.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- DECISIONS.md +21 -5
- README.md +15 -12
- agent_bench/serving/static/index.html +2 -2
|
@@ -1327,11 +1327,27 @@ and decision criteria before measuring.
|
|
| 1327 |
4. **Serving-migration deferral.** Tied to external references to
|
| 1328 |
the counterfactual-query fix. Unchanged from prior sessions.
|
| 1329 |
|
| 1330 |
-
5. **`agent-bench` → `refusal-bench` rename
|
| 1331 |
-
|
| 1332 |
-
|
| 1333 |
-
|
| 1334 |
-
| 1335 |
|
| 1336 |
6. **OpenAI snapshot drift bisection.** Mar 25 → Apr 12 P@5 slide;
|
| 1337 |
the model pin at `77017db` (`gpt-4o-mini-2024-07-18`) removed
|
|
|
|
| 1327 |
4. **Serving-migration deferral.** Tied to external references to
|
| 1328 |
the counterfactual-query fix. Unchanged from prior sessions.
|
| 1329 |
|
| 1330 |
+
5. **`agent-bench` → `refusal-bench` rename — CLOSED 2026-04-14.**
|
| 1331 |
+
Decision: keep `agent-bench`, reframe via tagline. The original
|
| 1332 |
+
concern was name collision with AgentBench (Liu et al., ICLR
|
| 1333 |
+
2024, ~1000 citations). Due-diligence at launch time: the name
|
| 1334 |
+
is `agent-bench` (hyphenated) vs. `AgentBench` (camelcase),
|
| 1335 |
+
which are distinct identifiers across GitHub, arXiv, and PyPI.
|
| 1336 |
+
The two projects target different audiences (LLM-as-agent
|
| 1337 |
+
capability vs. RAG+refusal benchmark) and any reviewer reaching
|
| 1338 |
+
the repo via LinkedIn or CV sees the scope in the README within
|
| 1339 |
+
seconds. Rename cost is substantial (~350 internal references
|
| 1340 |
+
across ~60 files, two external account renames, one HF Space
|
| 1341 |
+
URL break with no redirect) for a naming-precision benefit that
|
| 1342 |
+
isn't supported by the actual scope — the benchmark measures
|
| 1343 |
+
retrieval, grounding, multi-hop, citation accuracy, and refusal
|
| 1344 |
+
as seven axes, not refusal alone. Tagline reframe captures the
|
| 1345 |
+
honest-evaluation positioning without the rename cost:
|
| 1346 |
+
> "A RAG benchmark built from primitives, with honest
|
| 1347 |
+
> evaluation of retrieval, refusal, and grounded citation."
|
| 1348 |
+
HF Space `Nomearod/agentbench` renamed to `Nomearod/agent-bench`
|
| 1349 |
+
for consistency with GitHub repo name at the same time (absorbs
|
| 1350 |
+
the HF URL break before the first LinkedIn post links out).
|
| 1351 |
|
| 1352 |
6. **OpenAI snapshot drift bisection.** Mar 25 → Apr 12 P@5 slide;
|
| 1353 |
the model pin at `77017db` (`gpt-4o-mini-2024-07-18`) removed
|
|
@@ -1,10 +1,12 @@
|
|
| 1 |
# agent-bench
|
| 2 |
|
|
|
|
|
|
|
| 3 |

|
| 4 |
|
| 5 |
-
Agentic knowledge retrieval system with evaluation benchmark. Custom orchestration pipeline + LangChain baseline, evaluated on
|
| 6 |
|
| 7 |
-
`
|
| 8 |
|
| 9 |
## Benchmark Results
|
| 10 |
|
|
@@ -43,21 +45,21 @@ API providers are directly comparable (same config). The self-hosted row uses `m
|
|
| 43 |
|
| 44 |
## Live Demo
|
| 45 |
|
| 46 |
-
**https://nomearod-
|
| 47 |
|
| 48 |
```bash
|
| 49 |
# In-scope question (expect answer with sources)
|
| 50 |
-
curl -X POST https://nomearod-
|
| 51 |
-H "Content-Type: application/json" \
|
| 52 |
-d '{"question": "How do I define a path parameter in FastAPI?"}'
|
| 53 |
|
| 54 |
# Out-of-scope question (expect grounded refusal)
|
| 55 |
-
curl -X POST https://nomearod-
|
| 56 |
-H "Content-Type: application/json" \
|
| 57 |
-d '{"question": "How do I cook pasta?"}'
|
| 58 |
|
| 59 |
# Health check
|
| 60 |
-
curl https://nomearod-
|
| 61 |
```
|
| 62 |
|
| 63 |
## Quick Start (Local)
|
|
@@ -238,7 +240,7 @@ security:
|
|
| 238 |
- **MLOps:** Provider comparison benchmark (API vs self-hosted, real measured data)
|
| 239 |
- **Security — detection & redaction**: Two-tier prompt injection detection (heuristic regex + DeBERTa classifier), PII redaction on retrieved context, output validation gate (PII leakage, URL hallucination, blocklist)
|
| 240 |
- **Security — audit & compliance**: Append-only JSONL audit trail, HMAC-SHA256 IP hashing (GDPR-aligned), log rotation, config-driven security with Literal-constrained enums
|
| 241 |
-
- **Production engineering**: FastAPI, Docker, CI/CD, structured logging, rate limiting, SSE streaming, conversation sessions,
|
| 242 |
|
| 243 |
<details><summary>API Reference</summary>
|
| 244 |
|
|
@@ -291,15 +293,16 @@ make benchmark # Generate markdown report from results
|
|
| 291 |
make evaluate-langchain # Run LangChain baseline comparison
|
| 292 |
```
|
| 293 |
|
| 294 |
-
The golden dataset contains 27 hand-crafted questions
|
| 295 |
-
|
| 296 |
-
|
| 297 |
-
|
|
|
|
| 298 |
|
| 299 |
## Testing
|
| 300 |
|
| 301 |
```bash
|
| 302 |
-
make test #
|
| 303 |
make lint # ruff + mypy
|
| 304 |
```
|
| 305 |
|
|
|
|
| 1 |
# agent-bench
|
| 2 |
|
| 3 |
+
**A RAG benchmark built from primitives, with honest evaluation of retrieval, refusal, and grounded citation.**
|
| 4 |
+
|
| 5 |

|
| 6 |
|
| 7 |
+
Agentic knowledge retrieval system with evaluation benchmark. Custom orchestration pipeline + LangChain baseline, evaluated on matched golden datasets across 3 providers (OpenAI, Anthropic, self-hosted vLLM on Modal) and two corpora (FastAPI + Kubernetes). Zero hallucinated citations on all API provider configurations. The separate self-hosted Mistral-7B benchmark is included to show the practical model-size floor where agentic retrieval starts to break down.
|
| 8 |
|
| 9 |
+
`444 tests` · `3 providers` · `2 corpora` · `LangChain comparison` · `K8s + Terraform` · `CI`
|
| 10 |
|
| 11 |
## Benchmark Results
|
| 12 |
|
|
|
|
| 45 |
|
| 46 |
## Live Demo
|
| 47 |
|
| 48 |
+
**https://nomearod-agent-bench.hf.space** (Hugging Face Spaces — first request after idle may take ~30s for cold start)
|
| 49 |
|
| 50 |
```bash
|
| 51 |
# In-scope question (expect answer with sources)
|
| 52 |
+
curl -X POST https://nomearod-agent-bench.hf.space/ask \
|
| 53 |
-H "Content-Type: application/json" \
|
| 54 |
-d '{"question": "How do I define a path parameter in FastAPI?"}'
|
| 55 |
|
| 56 |
# Out-of-scope question (expect grounded refusal)
|
| 57 |
+
curl -X POST https://nomearod-agent-bench.hf.space/ask \
|
| 58 |
-H "Content-Type: application/json" \
|
| 59 |
-d '{"question": "How do I cook pasta?"}'
|
| 60 |
|
| 61 |
# Health check
|
| 62 |
+
curl https://nomearod-agent-bench.hf.space/health
|
| 63 |
```
|
| 64 |
|
| 65 |
## Quick Start (Local)
|
|
|
|
| 240 |
- **MLOps:** Provider comparison benchmark (API vs self-hosted, real measured data)
|
| 241 |
- **Security — detection & redaction**: Two-tier prompt injection detection (heuristic regex + DeBERTa classifier), PII redaction on retrieved context, output validation gate (PII leakage, URL hallucination, blocklist)
|
| 242 |
- **Security — audit & compliance**: Append-only JSONL audit trail, HMAC-SHA256 IP hashing (GDPR-aligned), log rotation, config-driven security with Literal-constrained enums
|
| 243 |
+
- **Production engineering**: FastAPI, Docker, CI/CD, structured logging, rate limiting, SSE streaming, conversation sessions, 444 deterministic tests with mock providers
|
| 244 |
|
| 245 |
<details><summary>API Reference</summary>
|
| 246 |
|
|
|
|
| 293 |
make evaluate-langchain # Run LangChain baseline comparison
|
| 294 |
```
|
| 295 |
|
| 296 |
+
The golden dataset contains 27 hand-crafted FastAPI questions (19 retrieval · 3 calculation · 5 out-of-scope) and 25 hand-crafted Kubernetes questions across the CRAG 8-type taxonomy (6 simple · 4 simple-with-condition · 4 comparison · 6 multi-hop · 4 false-premise · 1 set · 2 time-sensitive). Questions are authored with index-aligned `source_snippets`/`source_chunk_ids` so every expected answer can be traced back to a verbatim string in the ingested store — no LLM-judged ground truth, no paraphrase fuzz.
|
| 297 |
+
|
| 298 |
+
## Methodology Notes
|
| 299 |
+
|
| 300 |
+
**Refusal-gate thresholds under LLM-driven query formulation are non-deterministic.** During the Kubernetes 25-question threshold sweep (see [DECISIONS.md](DECISIONS.md) for the full write-up), an unexpected result surfaced: sweeping `refusal_threshold` from 0.015 to 0.025, the 0.025 setting produced _fewer_ retrieval-gate trips than 0.020, even though a higher threshold should be at least as restrictive. Root cause: the orchestrator issues LLM-written queries to the search tool, so the same golden-dataset question produces different retrieval max_scores from run to run, depending on which query the LLM chose to write. The sweep's "broken retrieval" count at each threshold is therefore not a fixed number but a distribution. In practice, refusal-gate calibration in RAG systems with LLM-driven query formulation requires measuring run-to-run variance and sitting below the noisy floor with margin, not just picking the highest value that passes a one-shot sweep. The K8s threshold is pinned at 0.015, the empirical pilot floor, validated against the full 25-question set with the variance finding explicitly accounted for.
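The calibration idea can be illustrated with a minimal sketch. It is not the repository's implementation: the function names, the `margin` parameter, and the score values are hypothetical, and it assumes the gate trips whenever a question's retrieval max score falls below the threshold.

```python
# Hypothetical max retrieval scores for the same four in-scope questions,
# recorded over three runs. Values are illustrative, not measured data.
RUNS = [
    [0.031, 0.026, 0.040, 0.027],
    [0.024, 0.019, 0.038, 0.030],
    [0.033, 0.026, 0.018, 0.029],
]


def gate_trips(scores: list[float], threshold: float) -> int:
    """Count questions whose max score falls below the threshold.

    Assumption: the refusal gate trips when max_score < threshold.
    """
    return sum(1 for score in scores if score < threshold)


def trip_distribution(runs: list[list[float]], threshold: float) -> list[int]:
    """Gate-trip count per run: a distribution, not a single number."""
    return [gate_trips(run, threshold) for run in runs]


def calibrate_threshold(runs: list[list[float]], margin: float = 0.1) -> float:
    """Place the threshold below the lowest score seen in any run.

    `margin` is a hypothetical relative safety factor (0.1 = 10% below the floor).
    """
    noisy_floor = min(min(run) for run in runs)
    return noisy_floor * (1.0 - margin)


if __name__ == "__main__":
    for threshold in (0.020, 0.025):
        print(threshold, trip_distribution(RUNS, threshold))
    print("calibrated:", calibrate_threshold(RUNS))
```

With these made-up scores, a one-shot sweep that happened to pair the second run at 0.020 with the first run at 0.025 would report fewer trips at the higher threshold. That apparent non-monotonicity disappears once each threshold is evaluated over all runs.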
|
| 301 |
|
| 302 |
## Testing
|
| 303 |
|
| 304 |
```bash
|
| 305 |
+
make test # 444 deterministic tests, no API keys needed
|
| 306 |
make lint # ruff + mypy
|
| 307 |
```
|
| 308 |
|
|
@@ -228,7 +228,7 @@ code{background:var(--code-bg);padding:2px 6px;border-radius:3px;font-size:0.9em
|
|
| 228 |
<div class="label">Citation Acc</div>
|
| 229 |
</div>
|
| 230 |
<div class="tile">
|
| 231 |
-
<div class="value">
|
| 232 |
<div class="label">Tests</div>
|
| 233 |
</div>
|
| 234 |
<div class="tile">
|
|
@@ -393,7 +393,7 @@ code{background:var(--code-bg);padding:2px 6px;border-radius:3px;font-size:0.9em
|
|
| 393 |
|
| 394 |
<!-- ── Footer ─── -->
|
| 395 |
<footer class="footer">
|
| 396 |
-
<div class="footer-stats">agent-bench · MIT License ·
|
| 397 |
<div class="footer-name">Built by Jane Yeung — Munich, Germany</div>
|
| 398 |
<div class="footer-links">
|
| 399 |
<a href="mailto:">Email</a>
|
|
|
|
| 228 |
<div class="label">Citation Acc</div>
|
| 229 |
</div>
|
| 230 |
<div class="tile">
|
| 231 |
+
<div class="value">444</div>
|
| 232 |
<div class="label">Tests</div>
|
| 233 |
</div>
|
| 234 |
<div class="tile">
|
|
|
|
| 393 |
|
| 394 |
<!-- ── Footer ─── -->
|
| 395 |
<footer class="footer">
|
| 396 |
+
<div class="footer-stats">agent-bench · MIT License · 444 tests · 3 providers</div>
|
| 397 |
<div class="footer-name">Built by Jane Yeung — Munich, Germany</div>
|
| 398 |
<div class="footer-links">
|
| 399 |
<a href="mailto:">Email</a>
|