agentbench / README.md
Nomearod's picture
Merge remote-tracking branch 'origin/main' into hf-deploy
4158bba
---
title: agent-bench
emoji: "🔍"
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
---
# agent-bench
**A RAG benchmark built from primitives, with honest evaluation of retrieval, refusal, and grounded citation.**
![CI](https://github.com/tyy0811/agent-bench/actions/workflows/ci.yaml/badge.svg)
Agentic knowledge retrieval system with evaluation benchmark. Custom orchestration pipeline + LangChain baseline, evaluated on matched golden datasets across 3 providers (OpenAI, Anthropic, self-hosted vLLM on Modal) and two corpora (FastAPI + Kubernetes). Zero hallucinated citations on all API provider configurations. The separate self-hosted Mistral-7B benchmark is included to show the practical model-size floor where agentic retrieval starts to break down.
`443 tests` · `3 providers` · `2 corpora` · `LangChain comparison` · `K8s + Terraform` · `CI`
## Benchmark Results
Evaluated on 27 hand-crafted questions over 16 FastAPI documentation files. Both pipelines use identical retrieval (FAISS + BM25 + RRF + cross-encoder reranker).
### Framework Comparison: Custom vs. LangChain
| Metric | Custom OpenAI | Custom Anthropic | LC OpenAI | LC Anthropic |
|--------|--------------|-----------------|-----------|-------------|
| P@5 | 0.70 | 0.74 | 0.64 | **0.75** |
| R@5 | 0.83 | **0.84** | **0.86** | **0.84** |
| KHR | 0.89 | **0.92** | 0.85 | 0.91 |
| Citation Acc | 1.00 | 1.00 | 1.00 | 1.00 |
| Cost/query | **$0.0004** | $0.0007 | $0.0003 | $0.0046 |
> **Key insight:** Retrieval quality is dominated by the shared retrieval stack (FAISS + BM25 + RRF + cross-encoder), not the orchestration layer. P@5 and R@5 vary by less than 0.12 across all four configurations. The main cost of framework abstraction is visible in LangChain's Anthropic path: 6.6x higher per-query cost with no retrieval improvement.
Citation accuracy is 1.00 everywhere, confirming the retrieval-grounded approach prevents hallucination regardless of framework or provider choice.
Full analysis: [comparison report](results/comparison_custom_vs_langchain.md)
### Provider Comparison (Custom Pipeline)
| Metric | OpenAI gpt-4o-mini | Anthropic claude-haiku | Self-hosted Mistral-7B |
|--------|-------------------|----------------------|----------------------|
| Retrieval P@5 | 0.70 | **0.74** | 0.05 |
| Retrieval R@5 | 0.83 | **0.84** | 0.05 |
| Keyword Hit Rate | 0.89 | **0.92** | 0.61 |
| Citation Acc | **1.00** | **1.00** | 0.14 |
| Latency p50 | 4,690 ms | 5,120 ms | 6,709 ms |
| Cost per query | **$0.0004** | $0.0007 | $0.0031 |
API providers are directly comparable (same config). The self-hosted row uses `max_iterations=1` and `top_k=3` (vs 3/5 for API) to fit Mistral-7B's 8K context window. Mistral-7B's context constraint forces single-iteration retrieval with fewer chunks, demonstrating that agentic tool-calling workflows have a practical model-size floor — a genuine architectural finding, not a system failure. See [provider comparison](docs/provider_comparison.md) for full analysis.
[Full benchmark report](docs/benchmark_report.md) | [Provider comparison](docs/provider_comparison.md) | [Design decisions](DECISIONS.md)
## Live Demo
**https://nomearod-agentbench.hf.space** (Hugging Face Spaces — cold wake on idle takes ~2 minutes, warm queries respond in ~5s; see [DECISIONS.md](DECISIONS.md) for the bounded measurement and v1.1 contingency)
```bash
# In-scope question (expect answer with sources)
curl -X POST https://nomearod-agentbench.hf.space/ask \
-H "Content-Type: application/json" \
-d '{"question": "How do I define a path parameter in FastAPI?"}'
# Out-of-scope question (expect grounded refusal)
curl -X POST https://nomearod-agentbench.hf.space/ask \
-H "Content-Type: application/json" \
-d '{"question": "How do I cook pasta?"}'
# Health check
curl https://nomearod-agentbench.hf.space/health
```
## Quick Start (Local)
```bash
make install # Install dependencies
make ingest # Chunk + embed 16 FastAPI docs into FAISS + BM25
make serve # Start FastAPI server on :8000
```
```bash
curl -X POST http://localhost:8000/ask \
-H "Content-Type: application/json" \
-d '{"question": "How do I define a path parameter in FastAPI?"}'
```
### With Docker
```bash
OPENAI_API_KEY=sk-... docker-compose -f docker/docker-compose.yaml up --build
```
### Self-Hosted LLM via Modal (no local GPU needed)
```bash
pip install -e ".[modal]" # Install Modal SDK
modal setup # Authenticate with Modal
modal secret create huggingface-secret HF_TOKEN=hf_... # HF token for model download
make modal-deploy # Deploy vLLM on Modal A10G
export MODAL_VLLM_URL=https://your--agent-bench-vllm-serve.modal.run/v1
AGENT_BENCH_ENV=selfhosted_modal make serve # Serve with self-hosted provider
# Run provider comparison (requires all provider API keys)
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
make benchmark-all
# Or run only the self-hosted provider
python modal/run_benchmark.py --base-url $MODAL_VLLM_URL --only selfhosted_modal
```
### Self-Hosted LLM via Docker Compose (requires local NVIDIA GPU)
```bash
docker compose -f docker/docker-compose.vllm.yml up --build
```
### Kubernetes (Helm)
```bash
make k8s-dev # Dev: 1 replica, no HPA
make k8s-prod # Prod: 3 replicas, HPA 2-8 pods
```
See [docs/k8s-local-setup.md](docs/k8s-local-setup.md) for minikube walkthrough.
## Architecture
```mermaid
flowchart LR
Client -->|POST /ask| MW[Middleware<br/>request_id, timing, errors]
MW --> Orch[Orchestrator<br/>max 3 iterations]
Orch --> LLM[LLM Provider<br/>OpenAI / Anthropic]
LLM -->|tool_calls| Reg[Tool Registry]
Reg --> Search[search_documents]
Reg --> Calc[calculator]
Search --> Store[Hybrid Store<br/>FAISS + BM25 + RRF]
LLM -->|no tool_calls| Resp[AskResponse<br/>answer + sources + metadata]
subgraph Providers
LLM --- OpenAI[OpenAI<br/>gpt-4o-mini]
LLM --- Anthropic[Anthropic<br/>claude-haiku]
LLM --- SelfHosted[SelfHosted<br/>vLLM / TGI / Ollama]
end
```
## Security Architecture
Injection detection → PII redaction → output validation → audit logging. Four guardrails, each independently configurable, each degrades gracefully.
```
User Input
┌──────────────────────┐
│ Injection Detection │ Tier 1: heuristic regex (local, <1ms)
│ (pre-retrieval) │ Tier 2: DeBERTa classifier (Modal GPU)
└──────────┬───────────┘
│ safe
┌──────────────────────┐
│ Retrieval │ FAISS + BM25 + RRF + cross-encoder
│ (existing pipeline) │
└──────────┬───────────┘
┌──────────────────────┐
│ PII Redaction │ regex (always) + spaCy NER (optional)
│ (post-retrieval) │
└──────────┬───────────┘
┌──────────────────────┐
│ LLM Generation │ OpenAI / Anthropic / vLLM (Modal)
│ (existing pipeline) │
└──────────┬───────────┘
┌──────────────────────┐
│ Output Validation │ PII leakage + URL check + blocklist
│ (post-generation) │
└──────────┬───────────┘
┌──────────────────────┐
│ Audit Log │ JSONL, HMAC-hashed IPs, rotated
│ (every request) │
└──────────┬───────────┘
Response
```
**Injection detection** uses a two-tier architecture: heuristic regex rules catch common patterns (<1ms), and an optional DeBERTa classifier on Modal GPU provides high-confidence classification. Without GPU, the system runs heuristic-only — honest degradation, not silent failure.
**PII redaction** runs regex patterns for high-risk types (SSN, credit card, email, phone, IP address) on every retrieved chunk before it enters the LLM context window. Optional spaCy NER adds PERSON/ORG detection for deployments that need it.
**Output validation** catches PII leakage (LLM reconstructing redacted data), URL hallucination (URLs not in retrieved chunks), and blocklisted patterns (system prompt fragments, API keys).
**Audit logging** writes one structured JSON record per request to an append-only JSONL file. Client IPs are HMAC-SHA256 hashed with a server secret (`AUDIT_HMAC_KEY` env var) so they are irreversible even against offline enumeration of the IPv4 address space. Logs include injection verdicts, output validation results, and response metadata.
```bash
# Query the audit log with jq
jq 'select(.injection_verdict.safe == false)' logs/audit.jsonl
jq 'select(.session_id == "abc123")' logs/audit.jsonl
```
This is an application-layer security pipeline — it does not replace network-level security, authentication, or infrastructure hardening.
See [SECURITY.md](SECURITY.md) for the OWASP LLM Top 10 (2025) mapping. See [DECISIONS.md](DECISIONS.md) for why we chose two-tier detection over three, regex-only PII by default, JSONL over SQLite for audit, and HMAC over plain SHA-256 for IP hashing.
<details><summary>Security configuration</summary>
All security settings live in `configs/default.yaml` under the `security` key and map to Pydantic models with Literal-constrained enums:
```yaml
security:
injection:
enabled: true
action: block # block | warn | flag
tiers: [heuristic, classifier]
classifier_url: "" # Modal endpoint URL when using Tier 2
pii:
enabled: true
mode: redact # redact | detect_only | passthrough
redact_patterns: [EMAIL, PHONE, SSN, CREDIT_CARD, IP_ADDRESS]
use_ner: false # requires: pip install -e ".[ner]"
ner_entities: [PERSON]
output:
enabled: true
pii_check: true
url_check: true
blocklist: [] # regex patterns to block in output
audit:
enabled: true
path: logs/audit.jsonl
max_size_mb: 100
rotate: true
```
</details>
## Engineering Scope
- **Agent design & evaluation**: Built two independent orchestration approaches (custom tool-calling loop + LangChain AgentExecutor) and evaluated both on identical metrics to quantify framework tradeoffs
- **Retrieval engineering**: Hybrid FAISS + BM25 with Reciprocal Rank Fusion, cross-encoder reranking, evaluated across 27 questions with P@5, R@5, citation accuracy
- **Infrastructure:** Kubernetes (Helm), Terraform (GCP/GKE), self-hosted LLM serving (vLLM on Modal + Docker Compose)
- **MLOps:** Provider comparison benchmark (API vs self-hosted, real measured data)
- **Security — detection & redaction**: Two-tier prompt injection detection (heuristic regex + DeBERTa classifier), PII redaction on retrieved context, output validation gate (PII leakage, URL hallucination, blocklist)
- **Security — audit & compliance**: Append-only JSONL audit trail, HMAC-SHA256 IP hashing (GDPR-aligned), log rotation, config-driven security with Literal-constrained enums
- **Production engineering**: FastAPI, Docker, CI/CD, structured logging, rate limiting, SSE streaming, conversation sessions, 443 deterministic tests with mock providers
<details><summary>API Reference</summary>
## API Endpoints
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/ask` | POST | Ask a question, get answer with sources |
| `/ask/stream` | POST | SSE streaming (sources → chunks → done) |
| `/health` | GET | Store stats, provider status, uptime |
| `/metrics` | GET | Request count, latency p50/p95, cost (JSON) |
| `/metrics/prometheus` | GET | Prometheus text exposition format |
### POST /ask
```json
{
"question": "How do I define a path parameter in FastAPI?",
"top_k": 5,
"retrieval_strategy": "hybrid"
}
```
Response:
```json
{
"answer": "Path parameters in FastAPI are defined using curly braces...",
"sources": [{"source": "fastapi_path_params.md"}],
"metadata": {
"provider": "openai",
"model": "gpt-4o-mini",
"iterations": 2,
"tools_used": ["search_documents"],
"latency_ms": 1234.5,
"token_usage": {"input_tokens": 500, "output_tokens": 150, "estimated_cost_usd": 0.0002},
"request_id": "abc-123"
}
}
```
</details>
## Evaluation
```bash
make evaluate-fast # Deterministic metrics only (needs API key)
make evaluate-full # + LLM-judge metrics (costs more)
make benchmark # Generate markdown report from results
make evaluate-langchain # Run LangChain baseline comparison
```
The golden dataset contains 27 hand-crafted FastAPI questions (19 retrieval · 3 calculation · 5 out-of-scope) and 25 hand-crafted Kubernetes questions across the CRAG 8-type taxonomy (6 simple · 4 simple-with-condition · 4 comparison · 6 multi-hop · 4 false-premise · 1 set · 2 time-sensitive). Questions are authored with index-aligned `source_snippets`/`source_chunk_ids` so every expected answer can be traced back to a verbatim string in the ingested store — no LLM-judged ground truth, no paraphrase fuzz.
## Methodology Notes
**Refusal-gate thresholds under LLM-driven query formulation are non-deterministic.** During the Kubernetes 25-question threshold sweep (see [DECISIONS.md](DECISIONS.md) for the full write-up), an unexpected result surfaced: raising `refusal_threshold` from 0.015 to 0.025 produced _fewer_ retrieval-gate trips than 0.020, even though higher thresholds should be strictly more restrictive. Root cause: the orchestrator issues LLM-written queries to the search tool, so the same golden-dataset question produces different retrieval max_scores run-to-run, depending on what query the LLM chose to write. The sweep's "broken retrieval" count at each threshold is therefore not a fixed number but a distribution. The practical implication is that refusal-gate calibration in RAG systems with LLM-driven query formulation requires measuring run-to-run variance and sitting below the noisy floor with margin, not just picking the highest value that passes a one-shot sweep. The K8s threshold is pinned at 0.015 — the empirical pilot floor, validated against the full 25-question set with the variance finding explicitly accounted for.
## Testing
```bash
make test # 523 deterministic tests, no API keys needed
make lint # ruff + mypy
```
All tests use MockProvider + MockEmbeddingModel. No API keys. No model downloads. CI-safe.
### Targets that cost money
These Make targets call paid LLM APIs. Run locally; they are excluded from CI.
| Target | Requires API key | Approximate cost | What it produces |
|---|---|---|---|
| `make evaluate-full` | OpenAI or Anthropic | $0.01–0.10 per run | Full-corpus harness run with L1 + L2 judges; results in `results/{run_label}.json`. Cost scales with item count × judge dimensions: in-scope items get all 3 (groundedness + relevance + completeness), out-of-scope items get relevance only (~$0.0001/item). |
| `make calibrate` | Anthropic + OpenAI | ~$2 per full run | Generates frozen system outputs, scores all 6 ablation rows, builds `docs/_generated/kappa_table.md` |
| `make evaluate-judges` | Anthropic + OpenAI | ~$1 per run | Re-runs the 6 rows against existing system outputs (no regeneration) |
| `make evaluate-langchain` | OpenAI or Anthropic | $0.01–0.05 per run | LangChain baseline harness for the comparison report |
Set keys via `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` environment variables. CI does not have these (test job uses `MockProvider`).
## Design Decisions
See [DECISIONS.md](DECISIONS.md) for rationale on building from primitives, RRF over score normalization, negative evaluation cases, deterministic eval + optional LLM judge, security architecture tradeoffs, and more.
### V1 → V2 → V3 Evolution
| Feature | V1 | V2 | V3 |
|---------|----|----|-----|
| Grounded refusal | 0/5 | Threshold gate | Threshold gate |
| Retrieval P@5 | 0.70 | 0.74 (cross-encoder) | 0.74 |
| Provider support | OpenAI only | OpenAI + Anthropic + vLLM | Same |
| Streaming | None | SSE (`/ask/stream`) | SSE |
| Infrastructure | Local only | Docker, K8s, Terraform, Modal | Same |
| **Injection detection** | None | None | Two-tier (heuristic + DeBERTa) |
| **PII redaction** | None | None | Regex + optional NER |
| **Output validation** | None | None | PII leakage + URL + blocklist |
| **Audit logging** | None | None | JSONL, HMAC-hashed IPs |
| Tests | 97 | 205 | 443 |