LLM Hallucination Leaderboard • Running on CPU Upgrade • 187 • View and filter LLM hallucination leaderboard
Clinical Trials Assistant • Running on CPU Upgrade • 1 • Clinical Trial assistant using vectara-agentic
vectara/hallucination_evaluation_model • Text Classification • 0.1B • Updated Oct 20, 2025 • 153k • 339
BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent • Paper • 2508.06600 • Published Aug 8, 2025 • 41
FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents • Paper • 2504.13128 • Published Apr 17, 2025 • 7
Chatbot Arena Meets Nuggets: Towards Explanations and Diagnostics in the Evaluation of LLM Responses • Paper • 2504.20006 • Published Apr 28, 2025
Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval • Paper • 2505.16967 • Published May 22, 2025 • 24