Nomearod Claude Opus 4.6 (1M context) committed on Apr 14

Commit

086ad86

1 Parent(s): 2d1d822

docs: step 8.1 — tagline reframe + README honest-scope + rename closure

Tagline reframe (chosen path from rename due-diligence): keep the
repo name agent-bench, add a one-line positioning tagline under the
H1 — "A RAG benchmark built from primitives, with honest evaluation
of retrieval, refusal, and grounded citation." This captures the
refusal-focused launch angle without the ~350-reference rename cost
and without the HF Space URL break with no redirect.

Close parallel-tracks item #5 (agent-bench → refusal-bench rename)
with the full due-diligence reasoning. Name is hyphenated so the
AgentBench (Liu et al., ICLR 2024, camelcase) collision is visual
not functional; the two projects target different audiences; the
scope is visible to any reviewer reading the README in 30 seconds.
HF Space will be renamed from Nomearod/agentbench to
Nomearod/agent-bench at deploy time to fix the pre-existing
GitHub/HF name inconsistency; that rename is the user's next
action on HF UI.

README updates:
- H1 tagline added under the title
- 288 tests → 444 tests (header badge row, engineering scope,
make test comment block)
- 27-question golden dataset → updated to include both corpora
with the CRAG 8-type K8s distribution made explicit
- Live-demo URLs updated from nomearod-agentbench.hf.space →
nomearod-agent-bench.hf.space (4 references; goes live when
HF Space rename completes)
- New "Methodology Notes" section documenting the LLM-query-
variance finding from the threshold sweep. This is the
structural observation the launch story hinges on: threshold
calibration in RAG systems with LLM-driven query formulation
requires measuring run-to-run variance and sitting below the
noisy floor with margin, not picking the highest value from a
one-shot sweep. Generalizes beyond this repo.

Dashboard tile + footer stat: 441 → 444 (matches README).

Not pushed: waiting on the HF Space rename before the new URLs
go live. Once renamed, this commit pushes to origin cleanly and
the README/dashboard links are consistent with the HF Space slug.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Files changed (3)

DECISIONS.md +21 -5
README.md +15 -12
agent_bench/serving/static/index.html +2 -2

DECISIONS.md CHANGED

@@ -1327,11 +1327,27 @@ and decision criteria before measuring.
 4. **Serving-migration deferral.** Tied to external references to
    the counterfactual-query fix. Unchanged from prior sessions.
-5. **`agent-bench` → `refusal-bench` rename due diligence.**
-   Outstanding from session 1. No scope change; the rename has
-   repo-wide implications that need a dedicated session (GitHub
-   repo URL, HF Space slug, README, import paths, commit
-   message prefix conventions).
 6. **OpenAI snapshot drift bisection.** Mar 25 → Apr 12 P@5 slide;
    the model pin at `77017db` (`gpt-4o-mini-2024-07-18`) removed

 4. **Serving-migration deferral.** Tied to external references to
    the counterfactual-query fix. Unchanged from prior sessions.
+5. **`agent-bench` → `refusal-bench` rename — CLOSED 2026-04-14.**
+   Decision: keep `agent-bench`, reframe via tagline. The original
+   concern was name collision with AgentBench (Liu et al., ICLR
+   2024, ~1000 citations). Due-diligence at launch time: the name
+   is `agent-bench` (hyphenated) vs. `AgentBench` (camelcase),
+   which are distinct identifiers across GitHub, arXiv, and PyPI.
+   The two projects target different audiences (LLM-as-agent
+   capability vs. RAG+refusal benchmark) and any reviewer reaching
+   the repo via LinkedIn or CV sees the scope in the README within
+   seconds. Rename cost is substantial (~350 internal references
+   across ~60 files, two external account renames, one HF Space
+   URL break with no redirect) for a naming-precision benefit that
+   isn't supported by the actual scope — the benchmark measures
+   retrieval, grounding, multi-hop, citation accuracy, and refusal
+   as seven axes, not refusal alone. Tagline reframe captures the
+   honest-evaluation positioning without the rename cost:
+   > "A RAG benchmark built from primitives, with honest
+   >  evaluation of retrieval, refusal, and grounded citation."
+   HF Space `Nomearod/agentbench` renamed to `Nomearod/agent-bench`
+   for consistency with GitHub repo name at the same time (absorbs
+   the HF URL break before the first LinkedIn post links out).
 6. **OpenAI snapshot drift bisection.** Mar 25 → Apr 12 P@5 slide;
    the model pin at `77017db` (`gpt-4o-mini-2024-07-18`) removed

README.md CHANGED

@@ -1,10 +1,12 @@
 # agent-bench
 ![CI](https://github.com/tyy0811/agent-bench/actions/workflows/ci.yaml/badge.svg)
-Agentic knowledge retrieval system with evaluation benchmark. Custom orchestration pipeline + LangChain baseline, evaluated on the same 27-question golden dataset across 3 providers (OpenAI, Anthropic, self-hosted vLLM on Modal). Zero hallucinated citations on all API provider configurations. The separate self-hosted Mistral-7B benchmark is included to show the practical model-size floor where agentic retrieval starts to break down.
-`288 tests` · `3 providers` · `LangChain comparison` · `K8s + Terraform` · `CI`
 ## Benchmark Results
@@ -43,21 +45,21 @@ API providers are directly comparable (same config). The self-hosted row uses `m
 ## Live Demo
-**https://nomearod-agentbench.hf.space** (Hugging Face Spaces — first request after idle may take ~30s for cold start)
 ```bash
 # In-scope question (expect answer with sources)
-curl -X POST https://nomearod-agentbench.hf.space/ask \
   -H "Content-Type: application/json" \
   -d '{"question": "How do I define a path parameter in FastAPI?"}'
 # Out-of-scope question (expect grounded refusal)
-curl -X POST https://nomearod-agentbench.hf.space/ask \
   -H "Content-Type: application/json" \
   -d '{"question": "How do I cook pasta?"}'
 # Health check
-curl https://nomearod-agentbench.hf.space/health
 ```
 ## Quick Start (Local)
@@ -238,7 +240,7 @@ security:
 - **MLOps:** Provider comparison benchmark (API vs self-hosted, real measured data)
 - **Security — detection & redaction**: Two-tier prompt injection detection (heuristic regex + DeBERTa classifier), PII redaction on retrieved context, output validation gate (PII leakage, URL hallucination, blocklist)
 - **Security — audit & compliance**: Append-only JSONL audit trail, HMAC-SHA256 IP hashing (GDPR-aligned), log rotation, config-driven security with Literal-constrained enums
-- **Production engineering**: FastAPI, Docker, CI/CD, structured logging, rate limiting, SSE streaming, conversation sessions, 288 deterministic tests with mock providers
 <details><summary>API Reference</summary>
@@ -291,15 +293,16 @@ make benchmark            # Generate markdown report from results
 make evaluate-langchain   # Run LangChain baseline comparison
 ```
-The golden dataset contains 27 hand-crafted questions:
-- 19 retrieval: 8 easy (single chunk), 7 medium (multi-chunk), 4 hard (multi-source)
-- 3 calculation: questions requiring the calculator tool
-- 5 out-of-scope: questions testing grounded refusal (answer not in corpus)
 ## Testing
 ```bash
-make test    # 288 deterministic tests, no API keys needed
 make lint    # ruff + mypy
 ```

 # agent-bench
+**A RAG benchmark built from primitives, with honest evaluation of retrieval, refusal, and grounded citation.**
 ![CI](https://github.com/tyy0811/agent-bench/actions/workflows/ci.yaml/badge.svg)
+Agentic knowledge retrieval system with evaluation benchmark. Custom orchestration pipeline + LangChain baseline, evaluated on matched golden datasets across 3 providers (OpenAI, Anthropic, self-hosted vLLM on Modal) and two corpora (FastAPI + Kubernetes). Zero hallucinated citations on all API provider configurations. The separate self-hosted Mistral-7B benchmark is included to show the practical model-size floor where agentic retrieval starts to break down.
+`444 tests` · `3 providers` · `2 corpora` · `LangChain comparison` · `K8s + Terraform` · `CI`
 ## Benchmark Results
 ## Live Demo
+**https://nomearod-agent-bench.hf.space** (Hugging Face Spaces — first request after idle may take ~30s for cold start)
 ```bash
 # In-scope question (expect answer with sources)
+curl -X POST https://nomearod-agent-bench.hf.space/ask \
   -H "Content-Type: application/json" \
   -d '{"question": "How do I define a path parameter in FastAPI?"}'
 # Out-of-scope question (expect grounded refusal)
+curl -X POST https://nomearod-agent-bench.hf.space/ask \
   -H "Content-Type: application/json" \
   -d '{"question": "How do I cook pasta?"}'
 # Health check
+curl https://nomearod-agent-bench.hf.space/health
 ```
 ## Quick Start (Local)
 - **MLOps:** Provider comparison benchmark (API vs self-hosted, real measured data)
 - **Security — detection & redaction**: Two-tier prompt injection detection (heuristic regex + DeBERTa classifier), PII redaction on retrieved context, output validation gate (PII leakage, URL hallucination, blocklist)
 - **Security — audit & compliance**: Append-only JSONL audit trail, HMAC-SHA256 IP hashing (GDPR-aligned), log rotation, config-driven security with Literal-constrained enums
+- **Production engineering**: FastAPI, Docker, CI/CD, structured logging, rate limiting, SSE streaming, conversation sessions, 444 deterministic tests with mock providers
 <details><summary>API Reference</summary>
 make evaluate-langchain   # Run LangChain baseline comparison
 ```
+The golden dataset contains 27 hand-crafted FastAPI questions (19 retrieval · 3 calculation · 5 out-of-scope) and 25 hand-crafted Kubernetes questions across the CRAG 8-type taxonomy (6 simple · 4 simple-with-condition · 4 comparison · 6 multi-hop · 4 false-premise · 1 set · 2 time-sensitive). Questions are authored with index-aligned `source_snippets`/`source_chunk_ids` so every expected answer can be traced back to a verbatim string in the ingested store — no LLM-judged ground truth, no paraphrase fuzz.
+## Methodology Notes
+**Refusal-gate thresholds under LLM-driven query formulation are non-deterministic.** During the Kubernetes 25-question threshold sweep (see [DECISIONS.md](DECISIONS.md) for the full write-up), an unexpected result surfaced: raising `refusal_threshold` from 0.015 to 0.025 produced _fewer_ retrieval-gate trips than 0.020, even though higher thresholds should be strictly more restrictive. Root cause: the orchestrator issues LLM-written queries to the search tool, so the same golden-dataset question produces different retrieval max_scores run-to-run, depending on what query the LLM chose to write. The sweep's "broken retrieval" count at each threshold is therefore not a fixed number but a distribution. The practical implication is that refusal-gate calibration in RAG systems with LLM-driven query formulation requires measuring run-to-run variance and sitting below the noisy floor with margin, not just picking the highest value that passes a one-shot sweep. The K8s threshold is pinned at 0.015 — the empirical pilot floor, validated against the full 25-question set with the variance finding explicitly accounted for.
 ## Testing
 ```bash
+make test    # 444 deterministic tests, no API keys needed
 make lint    # ruff + mypy
 ```
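
The calibration rule stated in the Methodology Notes above (measure run-to-run variance, then sit below the noisy floor with margin) can be sketched roughly as follows. This is an illustrative sketch only, not the repo's implementation: the function names, the 20% relative margin, and the simulated retriever with its made-up scores are all assumptions introduced here. Only the idea of repeated runs and a threshold placed below the lowest observed `max_score` comes from the notes.

```python
import random
from typing import Callable, Dict, List

# Maps a question to the retrieval max_score observed on each repeated run.
ScoresByQuestion = Dict[str, List[float]]


def collect_max_scores(
    questions: List[str],
    retrieve_max_score: Callable[[str], float],
    runs: int,
) -> ScoresByQuestion:
    """Ask every question `runs` times and record the max_score of each run."""
    return {q: [retrieve_max_score(q) for _ in range(runs)] for q in questions}


def noisy_floor(scores: ScoresByQuestion) -> float:
    """Lowest max_score any in-scope question produced on any run."""
    return min(min(per_run) for per_run in scores.values())


def calibrate_threshold(scores: ScoresByQuestion, margin: float = 0.2) -> float:
    """Sit a relative `margin` below the noisy floor (the 0.2 default is an assumption)."""
    return noisy_floor(scores) * (1.0 - margin)


def gate_trips_per_run(scores: ScoresByQuestion, threshold: float) -> List[int]:
    """For each run, count the in-scope questions the refusal gate would reject."""
    runs = len(next(iter(scores.values())))
    return [
        sum(1 for per_run in scores.values() if per_run[r] < threshold)
        for r in range(runs)
    ]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the simulated jitter is repeatable
    base = {"q1": 0.030, "q2": 0.024, "q3": 0.041}  # made-up per-question scores

    def fake_retrieve(question: str) -> float:
        # Hypothetical stand-in for the orchestrator: the jitter models the LLM
        # writing a different search query for the same question on each run.
        return max(0.0, rng.gauss(base[question], 0.004))

    scores = collect_max_scores(list(base), fake_retrieve, runs=5)
    threshold = calibrate_threshold(scores)
    print("noisy floor:", round(noisy_floor(scores), 4))
    print("calibrated threshold:", round(threshold, 4))
    print("trips at calibrated threshold:", gate_trips_per_run(scores, threshold))
    print("trips at 0.025:", gate_trips_per_run(scores, 0.025))
```

A one-shot sweep reads a single entry from each list returned by `gate_trips_per_run`, and each threshold is evaluated on a different run. That is why the trip count need not be monotonic in the threshold, and why the sketch reports the whole per-run list rather than one number.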

agent_bench/serving/static/index.html CHANGED

@@ -228,7 +228,7 @@ code{background:var(--code-bg);padding:2px 6px;border-radius:3px;font-size:0.9em
       <div class="label">Citation Acc</div>
     </div>
     <div class="tile">
-      <div class="value">441</div>
       <div class="label">Tests</div>
     </div>
     <div class="tile">
@@ -393,7 +393,7 @@ code{background:var(--code-bg);padding:2px 6px;border-radius:3px;font-size:0.9em
 <!-- ── Footer ─── -->
 <footer class="footer">
-  <div class="footer-stats">agent-bench &middot; MIT License &middot; 441 tests &middot; 3 providers</div>
   <div class="footer-name">Built by Jane Yeung &mdash; Munich, Germany</div>
   <div class="footer-links">
     <a href="mailto:">Email</a>

       <div class="label">Citation Acc</div>
     </div>
     <div class="tile">
+      <div class="value">444</div>
       <div class="label">Tests</div>
     </div>
     <div class="tile">
 <!-- ── Footer ─── -->
 <footer class="footer">
+  <div class="footer-stats">agent-bench &middot; MIT License &middot; 444 tests &middot; 3 providers</div>
   <div class="footer-name">Built by Jane Yeung &mdash; Munich, Germany</div>
   <div class="footer-links">
     <a href="mailto:">Email</a>