diff --git "a/agent_bench/serving/static/index.html" "b/agent_bench/serving/static/index.html"
--- "a/agent_bench/serving/static/index.html"
+++ "b/agent_bench/serving/static/index.html"
@@ -4,1091 +4,1680 @@ agent-bench
+ + + + + + + + + - + + - - - - + +
+
agent-bench
+ +
+ +
-

agent-bench

-

Production RAG with honest evaluation. Custom orchestration benchmarked against LangChain across 3 LLM providers — including the model-size floor where agentic retrieval breaks down.

-

Built by Jane Yeung · Munich · Open to AI/ML roles in Germany

- -
-
-
0.84
-
R@5 (best)
-
-
-
1.00 API / 0.14 self-hosted
-
Citation Acc
-
-
-
444
-
Tests
+
LIVE · FASTAPI + K8S CORPORA · 3 PROVIDERS
+

Production RAG, benchmarked honestly — including the model-size floor where agentic retrieval breaks down.

+

A custom tool-calling orchestrator and a LangChain baseline, evaluated on the same 27-question FastAPI golden set (plus a 6-question Kubernetes set) across OpenAI, Anthropic, and a self-hosted Mistral-7B. Every stage is instrumented. The interesting finding isn't which pipeline wins — it's where both fail.

+ + + +
+
+
API models
+
1.00
+
OpenAI gpt-4o-mini and Anthropic claude-haiku-4-5, 27/27 correct citations.
-
-
3
-
Providers
+ +
+
Self-hosted · 7B
+
0.14
+
Mistral-7B on 8K context — agentic retrieval can't recover from a weak first pass.
+
+ R@5 0.83–0.86 across 4 configs
+ 27 FastAPI + 6 K8s questions
+ 2 corpora · FastAPI · Kubernetes
+ 6.6× cost delta · custom vs LangChain (Anthropic)
+
+
- -
-
- - -
-
-
5 of 10 OWASP demoable · 3 infrastructure-layer · 2 out of scope · SECURITY.md has the full mapping
-
-
Pick a corpus and ask a question to see the RAG pipeline in action.
-
-
- - -
+ +
+
+
+

Live pipeline

+

Ask a question. Watch every stage — injection check, hybrid retrieval, rerank, iterative tool-calls, LLM synthesis, output validation — with real latencies and token counts.

+
- -
-
+
+
+ Provider +
- Mistral-7B +
- -
- +
+
+ Corpus +
+
+
+
running on OpenAI · FastAPI corpus
+
-
+
+ +
+ +
+ + session · local-dev + demo data + + + open live demo ↗ + idle + +
+
+
+
Pick an example chip above — or type a question. Press Enter to send.
+
+
+ + +
+
-
-
Pipeline
-
-
-
-
Injection Check
+ +
+
+
+ Pipeline + idle · schematic +
+
+
+
+
+
injection_check
+
regex + classifier, tiered
+
+
~3ms
+
+
+
+
+
retrieval
+
FAISS + BM25 + RRF, top-20
+
+
~40ms
-
-
-
LLM Synthesis
+
+
+
+
reranking
+
cross-encoder, top-5
+
+
~60ms
-
-
-
Output Validation
+
+
+
+
llm_synthesis
+
tool-calling loop · max 3 iter
+
+
~800ms
+
+
+
+
output_validation
+
post-stream · monitored, not gated ?
+
+
~12ms
+
+
+
+ latency + tokens + cost +
+
+ +
+
+ Retrieval + waiting
- -
-
-

Retrieval Results

- +
+
+ Security + 3 layers
-
-
Waiting for query...
+ Mapped against the OWASP LLM Top 10 (2025) — named residual risks for LLM01, scope limits for LLM02 → SECURITY.md ↗ +
+
+
Injection
+
+
regex + classifier
+
+
+
PII redact
+
+
context only
+
+
+
Output
+
+
monitored
+
+
Try a guardrail
+
+
5 of 10 OWASP demoable · 3 infrastructure-layer · 2 out of scope · SECURITY.md has the full mapping
+
+
+
+
+ + +
+
+

Three findings

+ 27 FastAPI + 6 K8s · custom + langchain · 3 providers +
+
+ +
+
01 / orchestration
+

Retrieval dominates orchestration.

+
+
custom · oai 0.83
+
langchain · oai 0.86
+
custom · anth 0.84
+
langchain · anth 0.84
+
max spread 0.03
+
+

R@5 ranges from 0.83 to 0.86, a spread of only 0.03, across all four configs ({Custom, LangChain} × {OpenAI, Anthropic}), which share an identical retrieval stack. The orchestration layer is interchangeable; FAISS + BM25 + RRF + cross-encoder is what matters.

+ comparison_custom_vs_langchain.md ↗ +
+ +
+
02 / cost
+

LangChain's Anthropic adapter carries a 6.6× cost tax.

+
+
custom $0.0007
+
langchain $0.0046
+

Same model (claude-haiku-4-5), same retrieval, same 27-question FastAPI set. The multiplier comes from LangChain's prompt construction in the Anthropic tool-calling adapter — extra system prompt and tool schema re-sends on every iteration.

+ docs/provider_comparison.md ↗ +
-
-

Security

- Mapped against the OWASP LLM Top 10 (2025) — named residual risks for LLM01, scope limits for LLM02 → SECURITY.md -
-
- Injection - - +
+
03 / model-size floor
+

There's a model-size floor for agentic retrieval — and a 7B model falls off it.

+
+
+
+
+
1.00
+
gpt-4o-mini
+
+
+
+
1.00
+
haiku-4-5
-
- PII Redacted - - context +
+
+
0.14
+
mistral-7B · citation
-
- Output - - monitored +
+
+
0.05
+
mistral-7B · R@5
+
Three of the four bars are citation accuracy. The rightmost shows Mistral-7B's R@5 (0.05) on the same axis — both retrieval and citation collapse together.
+

Not because the model is bad — because 8K context forces top_k=3, single-iteration retrieval that can't recover from a weak first pass. This is a context-window + iteration-budget effect, not a claim about Mistral-7B's general capability. The chart above isolates the failure: both layers (retrieval R@5 and citation accuracy) collapse together.

+ docs/provider_comparison.md ↗
+
- -
-

Request Log

-

Every query is instrumented. Metrics accumulate as you interact.

-
- + +
+
+

Request log

+ cached — previous session · 6 queries +
+
+
- - - - - - - - - - - - + + + - - +
# Question Provider Injection Chunks Reranked PII Output Iters Tokens Latency Cost
# Question Provider Injection Chunks Reranked PII Output Iter Tokens Latency Cost
-
No queries yet. Try an example above.
-
- -
- - -
-

Key Findings

-

From the 27-question benchmark across Custom and LangChain pipelines, 3 providers.

-
-
-

Retrieval dominates orchestration

-

R@5 varies by less than 0.03 across Custom and LangChain with identical retrieval stacks. The orchestration layer is interchangeable; the retrieval stack (FAISS + BM25 + RRF + cross-encoder) is what matters.

- View benchmark comparison → -
-
-

LangChain abstraction has a real cost

-

$0.0046/query vs $0.0007/query (custom Anthropic). Same model, same retrieval, 6.6x cost multiplier from LangChain's prompt construction in the Anthropic adapter.

- View cost analysis → -
-
-

There's a model-size floor for agentic retrieval

-

Mistral-7B citation accuracy: 0.14. R@5: 0.05. Not because the model is bad — because 8K context forces top_k=3 single-iteration retrieval that can't recover from a weak first pass. This is a context-window + iteration-budget effect, not a claim about Mistral-7B's general capability.

- View provider comparison → +
+ queries 6
+ avg latency 984ms
+ total tokens 14,220
+ total cost $0.0081
+ blocked 1
- + - -
- Email - LinkedIn - GitHub + + + + +