diff --git "a/agent_bench/serving/static/index.html" "b/agent_bench/serving/static/index.html"
--- "a/agent_bench/serving/static/index.html"
+++ "b/agent_bench/serving/static/index.html"
@@ -4,1091 +4,1680 @@
Production RAG with honest evaluation. Custom orchestration benchmarked against LangChain across 3 LLM providers — including the model-size floor where agentic retrieval breaks down.
- - -A custom tool-calling orchestrator and a LangChain baseline, evaluated on the same 27-question FastAPI golden set (plus a 6-question Kubernetes set) across OpenAI, Anthropic, and a self-hosted Mistral-7B. Every stage is instrumented. The interesting finding isn't which pipeline wins — it's where both fail.
+ + + +Ask a question. Watch every stage — injection check, hybrid retrieval, rerank, iterative tool-calls, LLM synthesis, output validation — with real latencies and token counts.