---
title: agent-bench
emoji: "🔍"
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
---

# agent-bench

**A RAG benchmark built from primitives, with honest evaluation of retrieval, refusal, and grounded citation.**

![CI](https://github.com/tyy0811/agent-bench/actions/workflows/ci.yaml/badge.svg)

Agentic knowledge retrieval system with evaluation benchmark. Custom orchestration pipeline + LangChain baseline, evaluated on matched golden datasets across 3 providers (OpenAI, Anthropic, self-hosted vLLM on Modal) and 2 corpora (FastAPI + Kubernetes). Zero hallucinated citations on all API provider configurations. The separate self-hosted Mistral-7B benchmark is included to show the practical model-size floor where agentic retrieval starts to break down.

`443 tests` · `3 providers` · `2 corpora` · `LangChain comparison` · `K8s + Terraform` · `CI`

## Benchmark Results

Evaluated on 27 hand-crafted questions over 16 FastAPI documentation files. Both pipelines use identical retrieval (FAISS + BM25 + RRF + cross-encoder reranker).

### Framework Comparison: Custom vs. LangChain

| Metric | Custom OpenAI | Custom Anthropic | LC OpenAI | LC Anthropic |
|--------|--------------|-----------------|-----------|-------------|
| P@5 | 0.70 | 0.74 | 0.64 | **0.75** |
| R@5 | 0.83 | **0.84** | **0.86** | **0.84** |
| KHR | 0.89 | **0.92** | 0.85 | 0.91 |
| Citation Acc | 1.00 | 1.00 | 1.00 | 1.00 |
| Cost/query | **$0.0004** | $0.0007 | $0.0003 | $0.0046 |

> **Key insight:** Retrieval quality is dominated by the shared retrieval stack (FAISS + BM25 + RRF + cross-encoder), not the orchestration layer. P@5 and R@5 vary by less than 0.12 across all four configurations. The main cost of framework abstraction is visible in LangChain's Anthropic path: 6.6x higher per-query cost than the custom Anthropic pipeline, with no retrieval improvement. Citation accuracy is 1.00 in all four configurations, indicating that the retrieval-grounded approach prevents hallucinated citations regardless of framework or API provider.

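
The RRF step in the shared retrieval stack can be sketched as follows. This is a minimal illustration of Reciprocal Rank Fusion, not the repository's implementation: the function name, the document IDs, and the constant `k=60` (a commonly used default) are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) per list it appears in (rank starts
    at 1); the scores are summed across lists.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by doc_id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a dense (FAISS) and a sparse (BM25) retriever.
dense = ["path_params.md", "query_params.md", "security.md"]
sparse = ["query_params.md", "testing.md", "path_params.md"]

fused = reciprocal_rank_fusion([dense, sparse])
print(fused)
```

Because RRF uses only ranks, it needs no normalization between FAISS similarity scores and BM25 scores, which live on different scales.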
Full analysis: [comparison report](results/comparison_custom_vs_langchain.md)

### Provider Comparison (Custom Pipeline)

| Metric | OpenAI gpt-4o-mini | Anthropic claude-haiku | Self-hosted Mistral-7B |
|--------|-------------------|----------------------|----------------------|
| Retrieval P@5 | 0.70 | **0.74** | 0.05 |
| Retrieval R@5 | 0.83 | **0.84** | 0.05 |
| Keyword Hit Rate | 0.89 | **0.92** | 0.61 |
| Citation Acc | **1.00** | **1.00** | 0.14 |
| Latency p50 | 4,690 ms | 5,120 ms | 6,709 ms |
| Cost per query | **$0.0004** | $0.0007 | $0.0031 |

The two API providers are directly comparable (same config). The self-hosted row uses `max_iterations=1` and `top_k=3` (vs `max_iterations=3` and `top_k=5` for the API providers) to fit Mistral-7B's 8K context window. That constraint forces single-iteration retrieval with fewer chunks, demonstrating that agentic tool-calling workflows have a practical model-size floor. This is a genuine architectural finding, not a system failure. See [provider comparison](docs/provider_comparison.md) for full analysis.

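
For readers unfamiliar with the P@5 and R@5 rows, the sketch below shows one common way to compute them for a single question. It assumes precision is divided by `k` (not by the number of chunks actually returned) and that relevance is a set of expected source IDs; the evaluation harness in this repository may define either differently, and all names here are illustrative.

```python
def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top-k retrieved IDs that are relevant (denominator is k)."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved, relevant, k=5):
    """Fraction of the relevant IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in relevant if doc_id in retrieved[:k])
    return hits / len(relevant)


# Hypothetical example: 2 of the 3 expected sources show up in the top 5.
retrieved = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "z"}

p_at_5 = precision_at_k(retrieved, relevant)
r_at_5 = recall_at_k(retrieved, relevant)
print(p_at_5, r_at_5)
```

A benchmark-level figure would then be the mean of these per-question values over the golden dataset.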
[Full benchmark report](docs/benchmark_report.md) | [Provider comparison](docs/provider_comparison.md) | [Design decisions](DECISIONS.md)

## Live Demo

**https://nomearod-agentbench.hf.space** (Hugging Face Spaces: cold wake on idle takes ~2 minutes, warm queries respond in ~5s; see [DECISIONS.md](DECISIONS.md) for the bounded measurement and v1.1 contingency)

```bash
# In-scope question (expect answer with sources)
curl -X POST https://nomearod-agentbench.hf.space/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "How do I define a path parameter in FastAPI?"}'

# Out-of-scope question (expect grounded refusal)
curl -X POST https://nomearod-agentbench.hf.space/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "How do I cook pasta?"}'

# Health check
curl https://nomearod-agentbench.hf.space/health
```

## Quick Start (Local)

```bash
make install   # Install dependencies
make ingest    # Chunk + embed 16 FastAPI docs into FAISS + BM25
make serve     # Start FastAPI server on :8000
```

```bash
curl -X POST http://localhost:8000/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "How do I define a path parameter in FastAPI?"}'
```

### With Docker

```bash
OPENAI_API_KEY=sk-... docker-compose -f docker/docker-compose.yaml up --build
```

### Self-Hosted LLM via Modal (no local GPU needed)

```bash
pip install -e ".[modal]"                               # Install Modal SDK
modal setup                                             # Authenticate with Modal
modal secret create huggingface-secret HF_TOKEN=hf_...  # HF token for model download
make modal-deploy                                       # Deploy vLLM on Modal A10G

export MODAL_VLLM_URL=https://your--agent-bench-vllm-serve.modal.run/v1
AGENT_BENCH_ENV=selfhosted_modal make serve             # Serve with self-hosted provider

# Run provider comparison (requires all provider API keys)
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
make benchmark-all

# Or run only the self-hosted provider
python modal/run_benchmark.py --base-url $MODAL_VLLM_URL --only selfhosted_modal
```

### Self-Hosted LLM via Docker Compose (requires local NVIDIA GPU)

```bash
docker compose -f docker/docker-compose.vllm.yml up --build
```

### Kubernetes (Helm)

```bash
make k8s-dev    # Dev: 1 replica, no HPA
make k8s-prod   # Prod: 3 replicas, HPA 2-8 pods
```

See [docs/k8s-local-setup.md](docs/k8s-local-setup.md) for minikube walkthrough.

## Architecture

```mermaid
flowchart LR
    Client -->|POST /ask| MW[Middleware<br/>request_id, timing, errors]
    MW --> Orch[Orchestrator<br/>max 3 iterations]
    Orch --> LLM[LLM Provider<br/>OpenAI / Anthropic]
    LLM -->|tool_calls| Reg[Tool Registry]
    Reg --> Search[search_documents]
    Reg --> Calc[calculator]
    Search --> Store[Hybrid Store<br/>FAISS + BM25 + RRF]
    LLM -->|no tool_calls| Resp[AskResponse<br/>answer + sources + metadata]

    subgraph Providers
        LLM --- OpenAI[OpenAI<br/>gpt-4o-mini]
        LLM --- Anthropic[Anthropic<br/>claude-haiku]
        LLM --- SelfHosted[SelfHosted<br/>vLLM / TGI / Ollama]
    end
```

## Security Architecture

Injection detection → PII redaction → output validation → audit logging. Four guardrails, each independently configurable, each degrades gracefully.

```
       User Input
           │
           ▼
┌──────────────────────┐
│  Injection Detection │  Tier 1: heuristic regex (local, <1ms)
│  (pre-retrieval)     │  Tier 2: DeBERTa classifier (Modal GPU)
└──────────┬───────────┘
           │ safe
           ▼
┌──────────────────────┐
│  Retrieval           │  FAISS + BM25 + RRF + cross-encoder
│  (existing pipeline) │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│  PII Redaction       │  regex (always) + spaCy NER (optional)
│  (post-retrieval)    │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│  LLM Generation      │  OpenAI / Anthropic / vLLM (Modal)
│  (existing pipeline) │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│  Output Validation   │  PII leakage + URL check + blocklist
│  (post-generation)   │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│  Audit Log           │  JSONL, HMAC-hashed IPs, rotated
│  (every request)     │
└──────────┬───────────┘
           │
           ▼
        Response
```

**Injection detection** uses a two-tier architecture: heuristic regex rules catch common patterns (<1ms), and an optional DeBERTa classifier on Modal GPU provides high-confidence classification. Without GPU, the system runs heuristic-only: honest degradation, not silent failure.

**PII redaction** runs regex patterns for high-risk types (SSN, credit card, email, phone, IP address) on every retrieved chunk before it enters the LLM context window. Optional spaCy NER adds PERSON/ORG detection for deployments that need it.

**Output validation** catches PII leakage (LLM reconstructing redacted data), URL hallucination (URLs not in retrieved chunks), and blocklisted patterns (system prompt fragments, API keys).

**Audit logging** writes one structured JSON record per request to an append-only JSONL file.
Client IPs are HMAC-SHA256 hashed with a server secret (`AUDIT_HMAC_KEY` env var) so they are irreversible even against offline enumeration of the IPv4 address space. Logs include injection verdicts, output validation results, and response metadata.

```bash
# Query the audit log with jq
jq 'select(.injection_verdict.safe == false)' logs/audit.jsonl
jq 'select(.session_id == "abc123")' logs/audit.jsonl
```

This is an application-layer security pipeline. It does not replace network-level security, authentication, or infrastructure hardening. See [SECURITY.md](SECURITY.md) for the OWASP LLM Top 10 (2025) mapping.

See [DECISIONS.md](DECISIONS.md) for why we chose two-tier detection over three, regex-only PII by default, JSONL over SQLite for audit, and HMAC over plain SHA-256 for IP hashing.

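
The IP hashing described above can be sketched with Python's standard library. This is an illustration rather than the project's code: the function name and the fallback key are assumptions, and only the `AUDIT_HMAC_KEY` variable name comes from this README. A plain SHA-256 of an IPv4 address can be reversed by hashing all 2^32 addresses offline; a keyed HMAC blocks that attack as long as the key stays secret.

```python
import hashlib
import hmac
import os


def hash_client_ip(ip: str, key: bytes) -> str:
    """Return a hex HMAC-SHA256 digest of the client IP under a server-side key."""
    return hmac.new(key, ip.encode("utf-8"), hashlib.sha256).hexdigest()


# The fallback value is for local experiments only (an assumption of this sketch);
# a real deployment should fail fast if the secret is missing.
audit_key = os.environ.get("AUDIT_HMAC_KEY", "dev-only-not-a-secret").encode("utf-8")

digest = hash_client_ip("203.0.113.7", audit_key)
print(len(digest))  # 64 hex characters for a SHA-256 digest
```

The same IP always maps to the same digest under one key, so audit records can still be grouped per client without storing the address itself.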
### Security configuration

All security settings live in `configs/default.yaml` under the `security` key and map to Pydantic models with Literal-constrained enums:

```yaml
security:
  injection:
    enabled: true
    action: block          # block | warn | flag
    tiers: [heuristic, classifier]
    classifier_url: ""     # Modal endpoint URL when using Tier 2
  pii:
    enabled: true
    mode: redact           # redact | detect_only | passthrough
    redact_patterns: [EMAIL, PHONE, SSN, CREDIT_CARD, IP_ADDRESS]
    use_ner: false         # requires: pip install -e ".[ner]"
    ner_entities: [PERSON]
  output:
    enabled: true
    pii_check: true
    url_check: true
    blocklist: []          # regex patterns to block in output
  audit:
    enabled: true
    path: logs/audit.jsonl
    max_size_mb: 100
    rotate: true
```

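
To show what "Literal-constrained" buys you, here is a standard-library stand-in for two of the config sections. The project itself uses Pydantic, which performs this check automatically for `Literal`-typed fields; the dataclasses, the `_check_literal` helper, and the class names below are assumptions made so the sketch runs without third-party packages.

```python
from dataclasses import dataclass
from typing import Literal, get_args

# Allowed values mirror the comments in the YAML above.
InjectionAction = Literal["block", "warn", "flag"]
PiiMode = Literal["redact", "detect_only", "passthrough"]


def _check_literal(name, value, literal_type):
    """Raise ValueError if value is not one of the Literal's allowed options."""
    allowed = get_args(literal_type)
    if value not in allowed:
        raise ValueError(f"{name} must be one of {allowed}, got {value!r}")


@dataclass(frozen=True)
class InjectionConfig:
    enabled: bool = True
    action: InjectionAction = "block"

    def __post_init__(self):
        _check_literal("security.injection.action", self.action, InjectionAction)


@dataclass(frozen=True)
class PiiConfig:
    enabled: bool = True
    mode: PiiMode = "redact"

    def __post_init__(self):
        _check_literal("security.pii.mode", self.mode, PiiMode)


ok = InjectionConfig(action="warn")
try:
    PiiConfig(mode="delete")  # typo-style value that is not in the Literal
except ValueError as exc:
    print(f"rejected: {exc}")
```

The point of the constraint is that a misspelled `action` or `mode` fails at startup instead of silently falling through to a default code path.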
## Engineering Scope

- **Agent design & evaluation**: Built two independent orchestration approaches (custom tool-calling loop + LangChain AgentExecutor) and evaluated both on identical metrics to quantify framework tradeoffs
- **Retrieval engineering**: Hybrid FAISS + BM25 with Reciprocal Rank Fusion, cross-encoder reranking, evaluated across 27 questions with P@5, R@5, citation accuracy
- **Infrastructure**: Kubernetes (Helm), Terraform (GCP/GKE), self-hosted LLM serving (vLLM on Modal + Docker Compose)
- **MLOps**: Provider comparison benchmark (API vs self-hosted, real measured data)
- **Security (detection & redaction)**: Two-tier prompt injection detection (heuristic regex + DeBERTa classifier), PII redaction on retrieved context, output validation gate (PII leakage, URL hallucination, blocklist)
- **Security (audit & compliance)**: Append-only JSONL audit trail, HMAC-SHA256 IP hashing (GDPR-aligned), log rotation, config-driven security with Literal-constrained enums
- **Production engineering**: FastAPI, Docker, CI/CD, structured logging, rate limiting, SSE streaming, conversation sessions, 443 deterministic tests with mock providers

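
As an illustration of the "PII redaction on retrieved context" item, the sketch below replaces matches with a type label before a chunk would reach the LLM. The regexes are deliberately simplified examples covering three of the listed types; they are assumptions of this sketch, not the patterns shipped in the repository.

```python
import re

# Simplified, illustrative patterns (assumptions, not the project's own regexes).
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def redact(text, enabled=("EMAIL", "SSN", "IP_ADDRESS")):
    """Replace every match of each enabled pattern with a [LABEL] placeholder."""
    for label in enabled:
        text = PII_PATTERNS[label].sub(f"[{label}]", text)
    return text


chunk = "Contact admin@example.com from 10.0.0.7, SSN 123-45-6789."
redacted = redact(chunk)
print(redacted)  # Contact [EMAIL] from [IP_ADDRESS], SSN [SSN].
```

Passing a shorter `enabled` tuple mirrors the idea behind `redact_patterns` in the security config: each PII type can be switched on or off independently.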
## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/ask` | POST | Ask a question, get answer with sources |
| `/ask/stream` | POST | SSE streaming (sources → chunks → done) |
| `/health` | GET | Store stats, provider status, uptime |
| `/metrics` | GET | Request count, latency p50/p95, cost (JSON) |
| `/metrics/prometheus` | GET | Prometheus text exposition format |

### POST /ask

```json
{
  "question": "How do I define a path parameter in FastAPI?",
  "top_k": 5,
  "retrieval_strategy": "hybrid"
}
```

Response:

```json
{
  "answer": "Path parameters in FastAPI are defined using curly braces...",
  "sources": [{"source": "fastapi_path_params.md"}],
  "metadata": {
    "provider": "openai",
    "model": "gpt-4o-mini",
    "iterations": 2,
    "tools_used": ["search_documents"],
    "latency_ms": 1234.5,
    "token_usage": {"input_tokens": 500, "output_tokens": 150, "estimated_cost_usd": 0.0002},
    "request_id": "abc-123"
  }
}
```

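
A small Python client for `POST /ask` might look like the sketch below. It relies only on the request and response fields shown above; the helper names (`build_ask_request`, `parse_ask_response`, `AskResult`) are hypothetical, and the request is built but not sent so the example runs without a server.

```python
import json
import urllib.request
from dataclasses import dataclass


@dataclass
class AskResult:
    answer: str
    sources: list
    iterations: int
    cost_usd: float


def build_ask_request(base_url, question, top_k=5, retrieval_strategy="hybrid"):
    """Build (but do not send) a POST /ask request with a JSON body."""
    payload = {"question": question, "top_k": top_k, "retrieval_strategy": retrieval_strategy}
    return urllib.request.Request(
        base_url.rstrip("/") + "/ask",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_ask_response(body):
    """Extract the fields most callers need from an /ask JSON response."""
    data = json.loads(body)
    meta = data["metadata"]
    return AskResult(
        answer=data["answer"],
        sources=[s["source"] for s in data["sources"]],
        iterations=meta["iterations"],
        cost_usd=meta["token_usage"]["estimated_cost_usd"],
    )


request = build_ask_request("http://localhost:8000", "How do I define a path parameter in FastAPI?")

# Sample body trimmed from the response documented above.
sample = json.dumps({
    "answer": "Path parameters in FastAPI are defined using curly braces...",
    "sources": [{"source": "fastapi_path_params.md"}],
    "metadata": {
        "iterations": 2,
        "token_usage": {"input_tokens": 500, "output_tokens": 150, "estimated_cost_usd": 0.0002},
    },
})
result = parse_ask_response(sample)
print(result.sources, result.iterations)
```

To call a running server, pass `request` to `urllib.request.urlopen` and feed the decoded response body to `parse_ask_response`.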
## Evaluation

```bash
make evaluate-fast        # Deterministic metrics only (needs API key)
make evaluate-full        # + LLM-judge metrics (costs more)
make benchmark            # Generate markdown report from results
make evaluate-langchain   # Run LangChain baseline comparison
```

The golden dataset contains 27 hand-crafted FastAPI questions (19 retrieval · 3 calculation · 5 out-of-scope) and 25 hand-crafted Kubernetes questions across the CRAG 8-type taxonomy (6 simple · 4 simple-with-condition · 4 comparison · 6 multi-hop · 4 false-premise · 1 set · 2 time-sensitive). Questions are authored with index-aligned `source_snippets`/`source_chunk_ids` so every expected answer can be traced back to a verbatim string in the ingested store: no LLM-judged ground truth, no paraphrase fuzz.

## Methodology Notes

**Refusal-gate thresholds under LLM-driven query formulation are non-deterministic.** During the Kubernetes 25-question threshold sweep (see [DECISIONS.md](DECISIONS.md) for the full write-up), an unexpected result surfaced: raising `refusal_threshold` from 0.015 to 0.025 produced _fewer_ retrieval-gate trips than 0.020, even though higher thresholds should be strictly more restrictive.

Root cause: the orchestrator issues LLM-written queries to the search tool, so the same golden-dataset question produces different retrieval max_scores run-to-run, depending on what query the LLM chose to write. The sweep's "broken retrieval" count at each threshold is therefore not a fixed number but a distribution.

The practical implication is that refusal-gate calibration in RAG systems with LLM-driven query formulation requires measuring run-to-run variance and sitting below the noisy floor with margin, not just picking the highest value that passes a one-shot sweep. The K8s threshold is pinned at 0.015, the empirical pilot floor, validated against the full 25-question set with the variance finding explicitly accounted for.

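
The calibration idea in the last paragraph can be sketched as follows. Everything here is illustrative: the scores are made-up numbers, the function names and the 20% margin are assumptions, and the sketch assumes the gate refuses when a run's retrieval max_score falls below `refusal_threshold`.

```python
def calibrate_refusal_threshold(scores_by_question, margin=0.2):
    """Pick a threshold below the lowest max_score seen across repeated runs.

    scores_by_question maps an in-scope question ID to the retrieval
    max_score observed in each run. Returns (threshold, noisy_floor).
    """
    noisy_floor = min(min(runs) for runs in scores_by_question.values())
    return noisy_floor * (1.0 - margin), noisy_floor


def count_gate_trips(scores_by_question, threshold):
    """Count runs of in-scope questions that the gate would wrongly refuse."""
    return sum(
        1
        for runs in scores_by_question.values()
        for score in runs
        if score < threshold
    )


# Hypothetical max_scores from three runs of two in-scope questions.
observed = {
    "q1": [0.031, 0.024, 0.029],
    "q2": [0.019, 0.027, 0.022],
}

threshold, floor = calibrate_refusal_threshold(observed)
trips_one_shot = count_gate_trips(observed, 0.020)   # value a single lucky sweep might pick
trips_calibrated = count_gate_trips(observed, threshold)
print(floor, trips_one_shot, trips_calibrated)
```

A one-shot sweep that happened to see only q2's 0.027 run would accept 0.020, yet another run of the same question scores 0.019 and trips the gate. Measuring several runs exposes that floor before the threshold is pinned.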
## Testing

```bash
make test   # 523 deterministic tests, no API keys needed
make lint   # ruff + mypy
```

All tests use MockProvider + MockEmbeddingModel. No API keys. No model downloads. CI-safe.

### Targets that cost money

These Make targets call paid LLM APIs. Run locally; they are excluded from CI.

| Target | Requires API key | Approximate cost | What it produces |
|---|---|---|---|
| `make evaluate-full` | OpenAI or Anthropic | $0.01–0.10 per run | Full-corpus harness run with L1 + L2 judges; results in `results/{run_label}.json`. Cost scales with item count × judge dimensions: in-scope items get all 3 (groundedness + relevance + completeness), out-of-scope items get relevance only (~$0.0001/item). |
| `make calibrate` | Anthropic + OpenAI | ~$2 per full run | Generates frozen system outputs, scores all 6 ablation rows, builds `docs/_generated/kappa_table.md` |
| `make evaluate-judges` | Anthropic + OpenAI | ~$1 per run | Re-runs the 6 rows against existing system outputs (no regeneration) |
| `make evaluate-langchain` | OpenAI or Anthropic | $0.01–0.05 per run | LangChain baseline harness for the comparison report |

Set keys via `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` environment variables. CI does not have these (test job uses `MockProvider`).

## Design Decisions

See [DECISIONS.md](DECISIONS.md) for rationale on building from primitives, RRF over score normalization, negative evaluation cases, deterministic eval + optional LLM judge, security architecture tradeoffs, and more.

### V1 → V2 → V3 Evolution

| Feature | V1 | V2 | V3 |
|---------|----|----|-----|
| Grounded refusal | 0/5 | Threshold gate | Threshold gate |
| Retrieval P@5 | 0.70 | 0.74 (cross-encoder) | 0.74 |
| Provider support | OpenAI only | OpenAI + Anthropic + vLLM | Same |
| Streaming | None | SSE (`/ask/stream`) | SSE |
| Infrastructure | Local only | Docker, K8s, Terraform, Modal | Same |
| **Injection detection** | None | None | Two-tier (heuristic + DeBERTa) |
| **PII redaction** | None | None | Regex + optional NER |
| **Output validation** | None | None | PII leakage + URL + blocklist |
| **Audit logging** | None | None | JSONL, HMAC-hashed IPs |
| Tests | 97 | 205 | 443 |