Spaces:
Running
Running
| title: agent-bench | |
| emoji: "🔍" | |
| colorFrom: blue | |
| colorTo: purple | |
| sdk: docker | |
| app_port: 7860 | |
| # agent-bench | |
| **A RAG benchmark built from primitives, with honest evaluation of retrieval, refusal, and grounded citation.** | |
|  | |
| Agentic knowledge retrieval system with evaluation benchmark. Custom orchestration pipeline + LangChain baseline, evaluated on matched golden datasets across 3 providers (OpenAI, Anthropic, self-hosted vLLM on Modal) and two corpora (FastAPI + Kubernetes). Zero hallucinated citations on all API provider configurations. The separate self-hosted Mistral-7B benchmark is included to show the practical model-size floor where agentic retrieval starts to break down. | |
| `443 tests` · `3 providers` · `2 corpora` · `LangChain comparison` · `K8s + Terraform` · `CI` | |
| ## Benchmark Results | |
| Evaluated on 27 hand-crafted questions over 16 FastAPI documentation files. Both pipelines use identical retrieval (FAISS + BM25 + RRF + cross-encoder reranker). | |
| ### Framework Comparison: Custom vs. LangChain | |
| | Metric | Custom OpenAI | Custom Anthropic | LC OpenAI | LC Anthropic | | |
| |--------|--------------|-----------------|-----------|-------------| | |
| | P@5 | 0.70 | 0.74 | 0.64 | **0.75** | | |
| | R@5 | 0.83 | **0.84** | **0.86** | **0.84** | | |
| | KHR | 0.89 | **0.92** | 0.85 | 0.91 | | |
| | Citation Acc | 1.00 | 1.00 | 1.00 | 1.00 | | |
| | Cost/query | **$0.0004** | $0.0007 | $0.0003 | $0.0046 | | |
| > **Key insight:** Retrieval quality is dominated by the shared retrieval stack (FAISS + BM25 + RRF + cross-encoder), not the orchestration layer. P@5 and R@5 vary by less than 0.12 across all four configurations. The main cost of framework abstraction is visible in LangChain's Anthropic path: 6.6x higher per-query cost with no retrieval improvement. | |
| Citation accuracy is 1.00 everywhere, confirming the retrieval-grounded approach prevents hallucination regardless of framework or provider choice. | |
| Full analysis: [comparison report](results/comparison_custom_vs_langchain.md) | |
| ### Provider Comparison (Custom Pipeline) | |
| | Metric | OpenAI gpt-4o-mini | Anthropic claude-haiku | Self-hosted Mistral-7B | | |
| |--------|-------------------|----------------------|----------------------| | |
| | Retrieval P@5 | 0.70 | **0.74** | 0.05 | | |
| | Retrieval R@5 | 0.83 | **0.84** | 0.05 | | |
| | Keyword Hit Rate | 0.89 | **0.92** | 0.61 | | |
| | Citation Acc | **1.00** | **1.00** | 0.14 | | |
| | Latency p50 | 4,690 ms | 5,120 ms | 6,709 ms | | |
| | Cost per query | **$0.0004** | $0.0007 | $0.0031 | | |
| API providers are directly comparable (same config). The self-hosted row uses `max_iterations=1` and `top_k=3` (vs 3/5 for API) to fit Mistral-7B's 8K context window. Mistral-7B's context constraint forces single-iteration retrieval with fewer chunks, demonstrating that agentic tool-calling workflows have a practical model-size floor — a genuine architectural finding, not a system failure. See [provider comparison](docs/provider_comparison.md) for full analysis. | |
| [Full benchmark report](docs/benchmark_report.md) | [Provider comparison](docs/provider_comparison.md) | [Design decisions](DECISIONS.md) | |
| ## Live Demo | |
| **https://nomearod-agentbench.hf.space** (Hugging Face Spaces — cold wake on idle takes ~2 minutes, warm queries respond in ~5s; see [DECISIONS.md](DECISIONS.md) for the bounded measurement and v1.1 contingency) | |
| ```bash | |
| # In-scope question (expect answer with sources) | |
| curl -X POST https://nomearod-agentbench.hf.space/ask \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"question": "How do I define a path parameter in FastAPI?"}' | |
| # Out-of-scope question (expect grounded refusal) | |
| curl -X POST https://nomearod-agentbench.hf.space/ask \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"question": "How do I cook pasta?"}' | |
| # Health check | |
| curl https://nomearod-agentbench.hf.space/health | |
| ``` | |
| ## Quick Start (Local) | |
| ```bash | |
| make install # Install dependencies | |
| make ingest # Chunk + embed 16 FastAPI docs into FAISS + BM25 | |
| make serve # Start FastAPI server on :8000 | |
| ``` | |
| ```bash | |
| curl -X POST http://localhost:8000/ask \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"question": "How do I define a path parameter in FastAPI?"}' | |
| ``` | |
| ### With Docker | |
| ```bash | |
| OPENAI_API_KEY=sk-... docker-compose -f docker/docker-compose.yaml up --build | |
| ``` | |
| ### Self-Hosted LLM via Modal (no local GPU needed) | |
| ```bash | |
| pip install -e ".[modal]" # Install Modal SDK | |
| modal setup # Authenticate with Modal | |
| modal secret create huggingface-secret HF_TOKEN=hf_... # HF token for model download | |
| make modal-deploy # Deploy vLLM on Modal A10G | |
| export MODAL_VLLM_URL=https://your--agent-bench-vllm-serve.modal.run/v1 | |
| AGENT_BENCH_ENV=selfhosted_modal make serve # Serve with self-hosted provider | |
| # Run provider comparison (requires all provider API keys) | |
| export OPENAI_API_KEY=sk-... | |
| export ANTHROPIC_API_KEY=sk-ant-... | |
| make benchmark-all | |
| # Or run only the self-hosted provider | |
| python modal/run_benchmark.py --base-url $MODAL_VLLM_URL --only selfhosted_modal | |
| ``` | |
| ### Self-Hosted LLM via Docker Compose (requires local NVIDIA GPU) | |
| ```bash | |
| docker compose -f docker/docker-compose.vllm.yml up --build | |
| ``` | |
| ### Kubernetes (Helm) | |
| ```bash | |
| make k8s-dev # Dev: 1 replica, no HPA | |
| make k8s-prod # Prod: 3 replicas, HPA 2-8 pods | |
| ``` | |
| See [docs/k8s-local-setup.md](docs/k8s-local-setup.md) for minikube walkthrough. | |
| ## Architecture | |
| ```mermaid | |
| flowchart LR | |
| Client -->|POST /ask| MW[Middleware<br/>request_id, timing, errors] | |
| MW --> Orch[Orchestrator<br/>max 3 iterations] | |
| Orch --> LLM[LLM Provider<br/>OpenAI / Anthropic] | |
| LLM -->|tool_calls| Reg[Tool Registry] | |
| Reg --> Search[search_documents] | |
| Reg --> Calc[calculator] | |
| Search --> Store[Hybrid Store<br/>FAISS + BM25 + RRF] | |
| LLM -->|no tool_calls| Resp[AskResponse<br/>answer + sources + metadata] | |
| subgraph Providers | |
| LLM --- OpenAI[OpenAI<br/>gpt-4o-mini] | |
| LLM --- Anthropic[Anthropic<br/>claude-haiku] | |
| LLM --- SelfHosted[SelfHosted<br/>vLLM / TGI / Ollama] | |
| end | |
| ``` | |
| ## Security Architecture | |
| Injection detection → PII redaction → output validation → audit logging. Four guardrails, each independently configurable, each degrades gracefully. | |
| ``` | |
| User Input | |
| │ | |
| ▼ | |
| ┌──────────────────────┐ | |
| │ Injection Detection │ Tier 1: heuristic regex (local, <1ms) | |
| │ (pre-retrieval) │ Tier 2: DeBERTa classifier (Modal GPU) | |
| └──────────┬───────────┘ | |
| │ safe | |
| ▼ | |
| ┌──────────────────────┐ | |
| │ Retrieval │ FAISS + BM25 + RRF + cross-encoder | |
| │ (existing pipeline) │ | |
| └──────────┬───────────┘ | |
| │ | |
| ▼ | |
| ┌──────────────────────┐ | |
| │ PII Redaction │ regex (always) + spaCy NER (optional) | |
| │ (post-retrieval) │ | |
| └──────────┬───────────┘ | |
| │ | |
| ▼ | |
| ┌──────────────────────┐ | |
| │ LLM Generation │ OpenAI / Anthropic / vLLM (Modal) | |
| │ (existing pipeline) │ | |
| └──────────┬───────────┘ | |
| │ | |
| ▼ | |
| ┌──────────────────────┐ | |
| │ Output Validation │ PII leakage + URL check + blocklist | |
| │ (post-generation) │ | |
| └──────────┬───────────┘ | |
| │ | |
| ▼ | |
| ┌──────────────────────┐ | |
| │ Audit Log │ JSONL, HMAC-hashed IPs, rotated | |
| │ (every request) │ | |
| └──────────┬───────────┘ | |
| │ | |
| ▼ | |
| Response | |
| ``` | |
| **Injection detection** uses a two-tier architecture: heuristic regex rules catch common patterns (<1ms), and an optional DeBERTa classifier on Modal GPU provides high-confidence classification. Without GPU, the system runs heuristic-only — honest degradation, not silent failure. | |
| **PII redaction** runs regex patterns for high-risk types (SSN, credit card, email, phone, IP address) on every retrieved chunk before it enters the LLM context window. Optional spaCy NER adds PERSON/ORG detection for deployments that need it. | |
| **Output validation** catches PII leakage (LLM reconstructing redacted data), URL hallucination (URLs not in retrieved chunks), and blocklisted patterns (system prompt fragments, API keys). | |
| **Audit logging** writes one structured JSON record per request to an append-only JSONL file. Client IPs are HMAC-SHA256 hashed with a server secret (`AUDIT_HMAC_KEY` env var) so they are irreversible even against offline enumeration of the IPv4 address space. Logs include injection verdicts, output validation results, and response metadata. | |
| ```bash | |
| # Query the audit log with jq | |
| jq 'select(.injection_verdict.safe == false)' logs/audit.jsonl | |
| jq 'select(.session_id == "abc123")' logs/audit.jsonl | |
| ``` | |
| This is an application-layer security pipeline — it does not replace network-level security, authentication, or infrastructure hardening. | |
| See [SECURITY.md](SECURITY.md) for the OWASP LLM Top 10 (2025) mapping. See [DECISIONS.md](DECISIONS.md) for why we chose two-tier detection over three, regex-only PII by default, JSONL over SQLite for audit, and HMAC over plain SHA-256 for IP hashing. | |
| <details><summary>Security configuration</summary> | |
| All security settings live in `configs/default.yaml` under the `security` key and map to Pydantic models with Literal-constrained enums: | |
| ```yaml | |
| security: | |
| injection: | |
| enabled: true | |
| action: block # block | warn | flag | |
| tiers: [heuristic, classifier] | |
| classifier_url: "" # Modal endpoint URL when using Tier 2 | |
| pii: | |
| enabled: true | |
| mode: redact # redact | detect_only | passthrough | |
| redact_patterns: [EMAIL, PHONE, SSN, CREDIT_CARD, IP_ADDRESS] | |
| use_ner: false # requires: pip install -e ".[ner]" | |
| ner_entities: [PERSON] | |
| output: | |
| enabled: true | |
| pii_check: true | |
| url_check: true | |
| blocklist: [] # regex patterns to block in output | |
| audit: | |
| enabled: true | |
| path: logs/audit.jsonl | |
| max_size_mb: 100 | |
| rotate: true | |
| ``` | |
| </details> | |
| ## Engineering Scope | |
| - **Agent design & evaluation**: Built two independent orchestration approaches (custom tool-calling loop + LangChain AgentExecutor) and evaluated both on identical metrics to quantify framework tradeoffs | |
| - **Retrieval engineering**: Hybrid FAISS + BM25 with Reciprocal Rank Fusion, cross-encoder reranking, evaluated across 27 questions with P@5, R@5, citation accuracy | |
| - **Infrastructure:** Kubernetes (Helm), Terraform (GCP/GKE), self-hosted LLM serving (vLLM on Modal + Docker Compose) | |
| - **MLOps:** Provider comparison benchmark (API vs self-hosted, real measured data) | |
| - **Security — detection & redaction**: Two-tier prompt injection detection (heuristic regex + DeBERTa classifier), PII redaction on retrieved context, output validation gate (PII leakage, URL hallucination, blocklist) | |
| - **Security — audit & compliance**: Append-only JSONL audit trail, HMAC-SHA256 IP hashing (GDPR-aligned), log rotation, config-driven security with Literal-constrained enums | |
| - **Production engineering**: FastAPI, Docker, CI/CD, structured logging, rate limiting, SSE streaming, conversation sessions, 443 deterministic tests with mock providers | |
| <details><summary>API Reference</summary> | |
| ## API Endpoints | |
| | Endpoint | Method | Description | | |
| |----------|--------|-------------| | |
| | `/ask` | POST | Ask a question, get answer with sources | | |
| | `/ask/stream` | POST | SSE streaming (sources → chunks → done) | | |
| | `/health` | GET | Store stats, provider status, uptime | | |
| | `/metrics` | GET | Request count, latency p50/p95, cost (JSON) | | |
| | `/metrics/prometheus` | GET | Prometheus text exposition format | | |
| ### POST /ask | |
| ```json | |
| { | |
| "question": "How do I define a path parameter in FastAPI?", | |
| "top_k": 5, | |
| "retrieval_strategy": "hybrid" | |
| } | |
| ``` | |
| Response: | |
| ```json | |
| { | |
| "answer": "Path parameters in FastAPI are defined using curly braces...", | |
| "sources": [{"source": "fastapi_path_params.md"}], | |
| "metadata": { | |
| "provider": "openai", | |
| "model": "gpt-4o-mini", | |
| "iterations": 2, | |
| "tools_used": ["search_documents"], | |
| "latency_ms": 1234.5, | |
| "token_usage": {"input_tokens": 500, "output_tokens": 150, "estimated_cost_usd": 0.0002}, | |
| "request_id": "abc-123" | |
| } | |
| } | |
| ``` | |
| </details> | |
| ## Evaluation | |
| ```bash | |
| make evaluate-fast # Deterministic metrics only (needs API key) | |
| make evaluate-full # + LLM-judge metrics (costs more) | |
| make benchmark # Generate markdown report from results | |
| make evaluate-langchain # Run LangChain baseline comparison | |
| ``` | |
| The golden dataset contains 27 hand-crafted FastAPI questions (19 retrieval · 3 calculation · 5 out-of-scope) and 25 hand-crafted Kubernetes questions across the CRAG 8-type taxonomy (6 simple · 4 simple-with-condition · 4 comparison · 6 multi-hop · 4 false-premise · 1 set · 2 time-sensitive). Questions are authored with index-aligned `source_snippets`/`source_chunk_ids` so every expected answer can be traced back to a verbatim string in the ingested store — no LLM-judged ground truth, no paraphrase fuzz. | |
| ## Methodology Notes | |
| **Refusal-gate thresholds under LLM-driven query formulation are non-deterministic.** During the Kubernetes 25-question threshold sweep (see [DECISIONS.md](DECISIONS.md) for the full write-up), an unexpected result surfaced: raising `refusal_threshold` from 0.015 to 0.025 produced _fewer_ retrieval-gate trips than 0.020, even though higher thresholds should be strictly more restrictive. Root cause: the orchestrator issues LLM-written queries to the search tool, so the same golden-dataset question produces different retrieval max_scores run-to-run, depending on what query the LLM chose to write. The sweep's "broken retrieval" count at each threshold is therefore not a fixed number but a distribution. The practical implication is that refusal-gate calibration in RAG systems with LLM-driven query formulation requires measuring run-to-run variance and sitting below the noisy floor with margin, not just picking the highest value that passes a one-shot sweep. The K8s threshold is pinned at 0.015 — the empirical pilot floor, validated against the full 25-question set with the variance finding explicitly accounted for. | |
| ## Testing | |
| ```bash | |
| make test # 523 deterministic tests, no API keys needed | |
| make lint # ruff + mypy | |
| ``` | |
| All tests use MockProvider + MockEmbeddingModel. No API keys. No model downloads. CI-safe. | |
| ### Targets that cost money | |
| These Make targets call paid LLM APIs. Run locally; they are excluded from CI. | |
| | Target | Requires API key | Approximate cost | What it produces | | |
| |---|---|---|---| | |
| | `make evaluate-full` | OpenAI or Anthropic | $0.01–0.10 per run | Full-corpus harness run with L1 + L2 judges; results in `results/{run_label}.json`. Cost scales with item count × judge dimensions: in-scope items get all 3 (groundedness + relevance + completeness), out-of-scope items get relevance only (~$0.0001/item). | | |
| | `make calibrate` | Anthropic + OpenAI | ~$2 per full run | Generates frozen system outputs, scores all 6 ablation rows, builds `docs/_generated/kappa_table.md` | | |
| | `make evaluate-judges` | Anthropic + OpenAI | ~$1 per run | Re-runs the 6 rows against existing system outputs (no regeneration) | | |
| | `make evaluate-langchain` | OpenAI or Anthropic | $0.01–0.05 per run | LangChain baseline harness for the comparison report | | |
| Set keys via `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` environment variables. CI does not have these (test job uses `MockProvider`). | |
| ## Design Decisions | |
| See [DECISIONS.md](DECISIONS.md) for rationale on building from primitives, RRF over score normalization, negative evaluation cases, deterministic eval + optional LLM judge, security architecture tradeoffs, and more. | |
| ### V1 → V2 → V3 Evolution | |
| | Feature | V1 | V2 | V3 | | |
| |---------|----|----|-----| | |
| | Grounded refusal | 0/5 | Threshold gate | Threshold gate | | |
| | Retrieval P@5 | 0.70 | 0.74 (cross-encoder) | 0.74 | | |
| | Provider support | OpenAI only | OpenAI + Anthropic + vLLM | Same | | |
| | Streaming | None | SSE (`/ask/stream`) | SSE | | |
| | Infrastructure | Local only | Docker, K8s, Terraform, Modal | Same | | |
| | **Injection detection** | None | None | Two-tier (heuristic + DeBERTa) | | |
| | **PII redaction** | None | None | Regex + optional NER | | |
| | **Output validation** | None | None | PII leakage + URL + blocklist | | |
| | **Audit logging** | None | None | JSONL, HMAC-hashed IPs | | |
| | Tests | 97 | 205 | 443 | | |