Spaces: Nomearod/agentbench

Nomearod Claude Opus 4.6 (1M context) committed on Apr 7

Commit 2293da9 (1 parent: 503f5c4)

docs: sharpen zero-hallucination claim, explain Mistral-7B row

Reframe the headline metric to "on all API provider configurations"
and add a sentence explaining the self-hosted Mistral-7B benchmark
as a deliberate model-size floor for agentic retrieval.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Files changed (1)

README.md +1 -1

@@ -2,7 +2,7 @@
 ![CI](https://github.com/tyy0811/agent-bench/actions/workflows/ci.yaml/badge.svg)
-Agentic knowledge retrieval system with evaluation benchmark. Custom orchestration pipeline + LangChain baseline, evaluated on the same 27-question golden dataset across 3 providers (OpenAI, Anthropic, self-hosted vLLM on Modal). Zero hallucinated citations in all API configurations.
+Agentic knowledge retrieval system with evaluation benchmark. Custom orchestration pipeline + LangChain baseline, evaluated on the same 27-question golden dataset across 3 providers (OpenAI, Anthropic, self-hosted vLLM on Modal). Zero hallucinated citations on all API provider configurations. The separate self-hosted Mistral-7B benchmark is included to show the practical model-size floor where agentic retrieval starts to break down.
 `288 tests` · `3 providers` · `LangChain comparison` · `K8s + Terraform` · `CI`