LIVE · FASTAPI + K8S CORPORA · 3 PROVIDERS

Production RAG, benchmarked honestly — including the model-size floor where agentic retrieval breaks down.

A custom tool-calling orchestrator and a LangChain baseline, evaluated on the same 27-question FastAPI golden set (plus a 6-question Kubernetes set) across OpenAI, Anthropic, and a self-hosted Mistral-7B. Every stage is instrumented. The interesting finding isn't which pipeline wins — it's where both fail.

Built by Jane Yeung · Munich · Open to AI/ML roles in Germany

API models

1.00

OpenAI gpt-4o-mini and Anthropic claude-haiku-4-5, 27/27 correct citations.

Self-hosted · 7B

0.14

Mistral-7B on 8K context — agentic retrieval can't recover from a weak first pass.

R@5 0.83–0.86 across 4 configs 27 FastAPI + 6 K8s questions 2 corpora · FastAPI · Kubernetes 6.6× cost delta · custom vs LangChain (Anthropic)

Try the demo ↓ Source on GitHub ↗

Live pipeline

Ask a question. Watch every stage — injection check, hybrid retrieval, rerank, iterative tool-calls, LLM synthesis, output validation — with real latencies and token counts.

session · local-dev demo data open live demo ↗ idle

Pick an example chip above — or type a question. Press Enter to send.

Pipeline idle · schematic

injection_check

regex + classifier, tiered

~3ms

retrieval

FAISS + BM25 + RRF, top-20

~40ms

reranking

cross-encoder, top-5

~60ms

llm_synthesis

tool-calling loop · max 3 iter

~800ms

output_validation

post-stream · monitored, not gated ?

~12ms

latency — tokens — cost —

Retrieval waiting

The top-5 reranked chunks land here, with RRF-normalized scores.

Security 3 layers

Mapped against the OWASP LLM Top 10 (2025) — named residual risks for LLM01, scope limits for LLM02 → SECURITY.md ↗

Injection

—

regex + classifier

PII redact

—

context only

Output

—

monitored

Try a guardrail

5 of 10 OWASP demoable · 3 infrastructure-layer · 2 out of scope · SECURITY.md has the full mapping

How we grade it

4 anchored rubrics · LLM-as-judge · κ-calibrated against human labels

Benchmark numbers are only as good as the grader. Each answer is scored by an LLM judge against an anchored markdown rubric — strict scope, fixed scale, abstain-allowed — and the judges themselves are calibrated against human labels on a held-out set before they're trusted on the main run.

30 calibration items · human-labeled v1.1 rubric · sha-pinned per result headline metric: Cohen's κ · Gwet's AC1 on prevalence-skewed dims

Groundedness

01abstain

Every claim must be entailed by gold snippets. A claim that's correct in the world but not in the snippets scores 0 — strict-snippet measures retrieval-grounded behavior, not LLM general knowledge passing through.

ANCHOR · q006
Answer adds "particularly useful for expensive operations like database connections" — not in snippet → 0.

Relevance

012abstain

Reference-free. Does the answer address the user's question? Score the topic-match, not the truth-value. A refusal that doesn't engage with the premise scores 0.

ANCHOR
Q: "How do I deploy to Kubernetes?"
A: "Python virtual environments isolate dependencies." → 0.

Completeness

012abstain

Reference-based against gold answer. Score coverage of the reference's key points only — extra correct detail isn't penalized here.

ANCHOR
Reference covers ordinal, hostname, storage. Answer covers ordinal, hostname only → 1.

Citation faithfulness

01abstain

For every [source: X.md] in the answer, does the cited chunk actually support the claim next to it? All-or-nothing per item — one bad citation fails the whole answer.

ANCHOR
Claim: "default port is 8080." Cited chunk: about OAuth and SAML auth → 0 (citation drift).

Inter-rater agreement vs. human labels (calibration v1, baseline)

groundednessAC1 = 1.000

relevanceAC1 = 0.964

completenessκ = 0.416

Full table + variance hardening ↓

Three findings

27 FastAPI + 6 K8s · custom + langchain · 3 providers

01 / orchestration

Retrieval dominates orchestration.

custom · oai0.83

langchain · oai0.86

custom · anth0.84

langchain · anth0.84

max spread0.03

R@5 spans only 0.03 across all four Custom × LangChain × OpenAI × Anthropic configs with identical retrieval stacks. The orchestration layer is interchangeable; FAISS + BM25 + RRF + cross-encoder is what matters.

comparison_custom_vs_langchain.md ↗

02 / cost

LangChain's Anthropic adapter carries a 6.6× cost tax.

custom$0.0007

langchain$0.0046

Same model (claude-haiku-4-5), same retrieval, same 27-question FastAPI set. The multiplier comes from LangChain's prompt construction in the Anthropic tool-calling adapter — extra system prompt and tool schema re-sends on every iteration.

docs/provider_comparison.md ↗

03 / model-size floor

There's a model-size floor for agentic retrieval — and a 7B model falls off it.

1.00

gpt-4o-mini

1.00

haiku-4-5

0.14

mistral-7B · citation

0.05

mistral-7B · R@5

Three of the four bars are citation accuracy. The rightmost shows Mistral-7B's R@5 (0.05) on the same axis — both retrieval and citation collapse together.

Not because the model is bad — because 8K context forces top_k=3, single-iteration retrieval that can't recover from a weak first pass. This is a context-window + iteration-budget effect, not a claim about Mistral-7B's general capability. The chart above isolates the failure: both layers (retrieval R@5 and citation accuracy) collapse together.

docs/provider_comparison.md ↗

Request log

cached — previous session · 6 queries

#	Question	Provider	Injection	Chunks	Reranked	PII	Output	Iter	Tokens	Latency	Cost

queries 6 avg latency 984ms total tokens 14,220 total cost $0.0081 blocked 1

Methodology appendix

κ ablations · variance hardening · abstain semantics

κ ablation table · calibration v1

Configuration	Groundedness AC1	Relevance AC1	Completeness κ
baseline (v1.1, anchors, CoT)	1.000	0.964	0.416
baseline · no anchors	0.953	0.964	0.623
baseline · no CoT	0.897	0.963	1.000
permute (n=2 seeds)	1.000	0.966	0.506
jury · κ-weighted (haiku + gpt-4o-mini)	1.000	1.000	0.416

Reading this: groundedness and relevance gold are prevalence-skewed (29×0 / 1×1 and 29×2 / 1×1 respectively), which makes Cohen's κ degenerate to ≈0 even at 95%+ raw agreement. AC1 is the right metric there. Completeness gold is balanced enough (23×2 / 5×1) for κ to behave normally. The no-CoT κ=1.000 looks like a win but comes with an 11.5% abstain rate — the headline is the baseline row.

Variance hardening

PermutedJudge · level-order permutation

Wrap a judge with n=2 prompt-seed permutations of the rubric's level order; aggregate by mean. Catches judges whose verdict flips when "Score 0" anchor moves above "Score 2" — a presentation-order artifact, not a content disagreement.

Jury · κ-weighted multi-judge aggregation

Run the same item through claude-haiku-4-5 and gpt-4o-mini, weight each judge's vote by its calibration κ, abstain if any member abstains. Surfaces single-model bias without flattening to majority-rule, and keeps abstain as a first-class outcome.

Abstain semantics · "Unknown" sentinel

Schema-parse failures retry once, then abstain with a typed prefix; rubric-allowed model abstains use the empty-string sentinel. The metric drops the item, doesn't pretend it scored 0 — visible in the abstain rate column above.