A custom tool-calling orchestrator and a LangChain baseline, evaluated on the same 27-question FastAPI golden set (plus a 6-question Kubernetes set) across OpenAI, Anthropic, and a self-hosted Mistral-7B. Every stage is instrumented. The interesting finding isn't which pipeline wins — it's where both fail.
Ask a question. Watch every stage — injection check, hybrid retrieval, rerank, iterative tool-calls, LLM synthesis, output validation — with real latencies and token counts.
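Per-stage latencies like the ones shown in the trace can be collected with a small timing wrapper. This is a minimal sketch, not the project's actual instrumentation: the `Trace` class, the stage names, and the placeholder stage bodies are all hypothetical.

```python
import time
from contextlib import contextmanager


class Trace:
    """Collects (stage_name, latency_ms) pairs in execution order."""

    def __init__(self):
        self.stages = []

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the latency even if the stage raises.
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.stages.append((name, elapsed_ms))


trace = Trace()
question = "How do I declare a path parameter?"

with trace.stage("injection_check"):
    # Placeholder check; a real injection filter would be more involved.
    flagged = "ignore previous instructions" in question.lower()

with trace.stage("retrieval"):
    # Placeholder result standing in for hybrid retrieval.
    chunks = ["chunk-a", "chunk-b", "chunk-c"]

for name, ms in trace.stages:
    print(f"{name}: {ms:.3f} ms")
```

Recording in a `finally` block means a failing stage still shows up in the trace, which helps when the goal is to see where a pipeline breaks.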
Benchmark numbers are only as good as the grader. Each answer is scored by an LLM judge against an anchored markdown rubric — strict scope, fixed scale, abstain-allowed — and the judges themselves are calibrated against human labels on a held-out set before they're trusted on the main run.
For each [source: X.md] in the answer, does the cited chunk actually support the claim next to it? All-or-nothing per item: one bad citation fails the whole answer.

R@5 spans only 0.03 across all four configurations (Custom and LangChain, each on OpenAI and Anthropic), which share an identical retrieval stack. The orchestration layer is interchangeable; FAISS + BM25 + RRF + cross-encoder is what matters.
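The RRF step in that stack merges the dense (FAISS) and sparse (BM25) rankings by summing reciprocal ranks. The sketch below shows the general technique only: the function name, the document ids, and the constant `k=60` (a commonly used default) are assumptions, not values taken from this project.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=5):
    """Reciprocal rank fusion: score(d) = sum over rankings of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


# Hypothetical ranked lists from a dense and a sparse retriever.
dense = ["routing.md", "security.md", "deps.md"]
sparse = ["security.md", "testing.md", "routing.md"]

fused = rrf_fuse([dense, sparse])
print(fused)
```

A document that both retrievers rank well (here `security.md`) rises above one that only a single retriever ranks first, which is the property that makes the fused list a good input for a cross-encoder reranker.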
comparison_custom_vs_langchain.md ↗

Same model (claude-haiku-4-5), same retrieval, same 27-question FastAPI set. The multiplier comes from LangChain's prompt construction in the Anthropic tool-calling adapter: the system prompt and tool schema are re-sent on every iteration.
docs/provider_comparison.md ↗

Not because the model is bad, but because the 8K context forces top_k=3 and single-iteration retrieval, which can't recover from a weak first pass. This is a context-window + iteration-budget effect, not a claim about Mistral-7B's general capability. The chart above isolates the failure: both layers (retrieval R@5 and citation accuracy) collapse together.
docs/provider_comparison.md ↗

| # | Question | Provider | Injection | Chunks | Reranked | PII | Output | Iter | Tokens | Latency | Cost |
|---|---|---|---|---|---|---|---|---|---|---|---|

| Configuration | Groundedness AC1 | Relevance AC1 | Completeness κ |
|---|---|---|---|
| baseline (v1.1, anchors, CoT) | 1.000 | 0.964 | 0.416 |
| baseline · no anchors | 0.953 | 0.964 | 0.623 |
| baseline · no CoT | 0.897 | 0.963 | 1.000 |
| permute (n=2 seeds) | 1.000 | 0.966 | 0.506 |
| jury · κ-weighted (haiku + gpt-4o-mini) | 1.000 | 1.000 | 0.416 |
Reading this: the groundedness and relevance gold labels are prevalence-skewed (29×0 / 1×1 and 29×2 / 1×1 respectively, so 29 of 30 items share a single label), which makes Cohen's κ degenerate to ≈0 even at 95%+ raw agreement. Gwet's AC1 is the right metric there. Completeness gold is balanced enough (23×2 / 5×1) for κ to behave normally. The no-CoT κ=1.000 looks like a win but comes with an 11.5% abstain rate, so the headline is the baseline row.
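The κ collapse is easy to reproduce. The sketch below computes Cohen's κ and Gwet's AC1 for two raters on a 29×0 / 1×1 gold set. The judge labels are invented for illustration (a judge that answers 0 on every item) and are not the project's actual judge outputs; the function names are likewise hypothetical.

```python
from collections import Counter


def cohen_kappa(a, b, categories):
    """Cohen's kappa: chance agreement from the product of rater marginals."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    p_exp = sum((ca[k] / n) * (cb[k] / n) for k in categories)
    if p_exp == 1.0:
        return float("nan")  # undefined when both raters use one label only
    return (p_obs - p_exp) / (1.0 - p_exp)


def gwet_ac1(a, b, categories):
    """Gwet's AC1: chance agreement from the averaged marginals pi_k."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    pi = [(ca[k] / n + cb[k] / n) / 2.0 for k in categories]
    p_exp = sum(p * (1.0 - p) for p in pi) / (len(categories) - 1)
    return (p_obs - p_exp) / (1.0 - p_exp)


gold = [0] * 29 + [1]   # prevalence-skewed gold labels, as in the text
judge = [0] * 30        # hypothetical judge: agrees on 29 of 30 items

raw_agreement = sum(g == j for g, j in zip(gold, judge)) / len(gold)
kappa = cohen_kappa(gold, judge, categories=[0, 1])
ac1 = gwet_ac1(gold, judge, categories=[0, 1])
print(f"raw={raw_agreement:.3f} kappa={kappa:.3f} ac1={ac1:.3f}")
```

With these inputs raw agreement is about 0.97, κ is exactly 0 because the expected agreement under Cohen's model equals the observed agreement, and AC1 stays close to the raw agreement.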
PermutedJudge · level-order permutation
Jury · κ-weighted multi-judge aggregation
"Unknown" sentinel
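One plausible reading of κ-weighted aggregation with an abstain sentinel is a weighted vote in which each judge's weight is its κ against human labels and "Unknown" votes are skipped. The sketch below follows that reading; it is an assumption about the design, not the project's implementation, and the judge names and weights are made up.

```python
UNKNOWN = "Unknown"  # abstain sentinel


def jury_vote(votes, weights):
    """Weighted vote over judge labels; abstentions carry no weight."""
    totals = {}
    for judge, label in votes.items():
        if label == UNKNOWN:
            continue
        # Negative or missing weights are treated as zero influence.
        weight = max(weights.get(judge, 0.0), 0.0)
        totals[label] = totals.get(label, 0.0) + weight
    if not totals:
        return UNKNOWN  # every judge abstained
    # Iterating in sorted order makes ties resolve to the lower label.
    return max(sorted(totals), key=lambda label: totals[label])


# Hypothetical calibration weights (kappa against human labels).
weights = {"judge_a": 0.6, "judge_b": 0.4}

split = jury_vote({"judge_a": 2, "judge_b": 1}, weights)
one_abstains = jury_vote({"judge_a": UNKNOWN, "judge_b": 1}, weights)
all_abstain = jury_vote({"judge_a": UNKNOWN, "judge_b": UNKNOWN}, weights)
print(split, one_abstains, all_abstain)
```

Returning the sentinel when every judge abstains keeps abstentions visible in the results instead of silently converting them into a score.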