PeterKruger committed on
Commit 687a524 · verified · 1 Parent(s): 0475477
Files changed (1)
  1. README.md +54 -52
README.md CHANGED
@@ -10,75 +10,77 @@ license: apache-2.0
10
 
11
  # AutoBench
12
 
13
- ## Organization Details
14
-
15
- * **Organization:** AutoBench
16
- * **Point of Contact:** [Peter Kruger](https://huggingface.co/PeterKruger), CEO eZecute
17
- * **Website:** [www.autobench.org]
18
- * **Funding:** Self-funded, with the support of inference API providers providing free inference compute support.
19
-
20
  ## Organization Description
21
 
22
- AutoBench is an organization dedicated to advancing the evaluation of Large Language Models (LLMs) through innovative, automated benchmarking systems. Our flagship project is the **AutoBench 1.0** benchmark, a novel system that utilizes a "Collective-LLM-as-a-Judge" approach. This approach leverages LLMs themselves to assess the quality of both questions and answers generated by other LLMs. AutoBench aims to address the limitations of traditional, static benchmarks by providing a dynamic, scalable, cost-effective, and less human-biased evaluation framework.
23
-
24
- ## Benchmarking System: AutoBench 1.0
25
 
26
- ### Overview
27
 
28
- AutoBench 1.0 is a fully automated, iterative benchmark system for evaluating LLMs. It dynamically generates questions, assesses their quality, and ranks LLM-generated answers using a collective of LLMs as judges. This system is designed to be:
29
 
30
- * **Dynamic:** Questions are generated on-the-fly for each iteration, reducing the risk of benchmark gaming.
31
- * **Scalable:** The system is designed to handle a large number of models and can be easily scaled up.
32
- * **Cost-Effective:** AutoBench 1.0 achieves high correlation with established benchmarks at a significantly lower cost than human-based evaluations (under $100 for a full run with 20 models).
33
- * **Less Human-Biased:** While model bias exists, the "Collective-LLM-as-a-Judge" approach reduces reliance on subjective human judgment.
34
- * **Granular:** Provides topic-specific performance insights, not just an aggregate score.
35
- * **Adaptive:** Model weights are adjusted with use, meaning the benchmark improves in quality the more it is used.
36
 
37
- ### Key Features
 
38
 
39
- * **Collective-LLM-as-a-Judge:** Employs a group of LLMs to evaluate both the quality of generated questions and the answers provided by other LLMs.
40
- * **Dynamic Question Generation:** Generates new questions in each iteration, covering a range of topics and difficulty levels.
41
- * **Iterative Evaluation:** Runs for a predefined number of iterations to provide robust and statistically meaningful results.
42
- * **Model Weighting and Adaptation:** Dynamically adjusts the influence of individual judging models based on their performance.
43
- * **Comprehensive Metrics:** Provides overall average rank, topic-specific ranks, and correlations with established benchmarks (Chatbot Arena, MMLU, AAQI).
44
- * **Error Handling and Robustness:** Includes mechanisms for handling API errors, unresponsive models, and invalid responses.
45
 
46
- ### Intended Use
 
47
 
48
- The AutoBench 1.0 benchmark is intended for:
 
49
 
50
- * Researchers and developers working on LLMs.
51
- * Organizations evaluating LLMs for deployment.
52
- * Anyone interested in tracking the progress of LLM capabilities.
53
 
54
- The benchmark provides a standardized, automated, and cost-effective way to assess the performance of LLMs across a variety of tasks and topics.
55
 
56
- ## Ethical Considerations
57
 
58
- AutoBench is committed to the responsible development and use of LLMs. We encourage users of the benchmark to consider the potential ethical implications of their work and to use the benchmark results responsibly. The limitations and biases of AutoBench 1.0 should be carefully considered when interpreting the results.
 
 
 
 
59
 
60
- ## Inference cost Support
 
61
 
62
- Running a compute intensive benchmark like AutoBench can be expensive. We welcome all inference API providers to support us with free inference credits.
63
 
64
- ## Citation
65
 
66
- If you use AutoBench 1.0 in your research, please cite:
67
 
68
- @misc{autobench2024,
69
- title={AutoBench 1.0: A Collective-LLM-as-a-Judge Benchmark System},
70
- author={AutoBench},
71
- year={2024},
72
- publisher = {Hugging Face},
73
- howpublished = {\url{https://huggingface.co/AutoBench}},
74
- note = {Accessed: [Date Accessed]}
75
- }
76
 
77
- ## Learn more and contribute
78
 
79
- * **Start from our blog post on Hugging Face**: [Escape the Benchmark Trap: AutoBench – the Collective-LLM-as-a-Judge System for Evaluating AI models (ASI-Ready!)](https://huggingface.co/blog/PeterKruger/autobench)
80
- * **Explore the code and data:** [Hugging Face AutoBench 1.0 Repository](https://huggingface.co/PeterKruger/AutoBench) <!-- Replace with actual link -->
81
- * **Try our Demo on Spaces:** [AutoBench 1.0 Demo](https://huggingface.co/spaces/PeterKruger/AutoBench) <!-- Replace with actual link -->
82
- * **Read the detailed methodology:** [Detailed Methodology Document](https://huggingface.co/PeterKruger/AutoBench/blob/main/AutoBench_1_0_Detailed_Methodology_Document.pdf) <!-- Replace with link -->
83
- * **Join the discussion:** [Hugging Face AutoBench Community Discussion](https://huggingface.co/PeterKruger/AutoBench/discussions) <!-- Replace with link -->
84
- * **Contribute:** Help us by suggesting new topics, refining prompts, or enhancing the weighting algorithm—submit pull requests or issues via the Hugging Face Repo.
 
10
 
11
  # AutoBench
12
 
13
  ## Organization Description
14
 
15
+ **AutoBench** is the premier LLM evaluation and routing infrastructure for the Agentic Era. We are dedicated to solving the LLM evaluation crisis by moving the industry beyond static, domain-rigid, easily gameable text prompts and by building the first open LLM-based API router for agentic applications.
 
 
16
 
17
+ Pioneering the **"Collective-LLM-as-a-Judge"** methodology, AutoBench uses massive pools of LLMs to dynamically generate tasks, execute multi-turn workflows, and granularly evaluate performance across the AI ecosystem. Today, AutoBench provides fully automated, highly correlated, and strictly un-gameable benchmarking. Furthermore, we leverage the massive synthetic execution datasets generated by our benchmarks to train next-generation **Agentic LLM Routers**, helping agent developers and enterprises optimize for both absolute quality and unit economics.
18
 
19
+ ## The AutoBench Ecosystem
20
 
21
+ ### 1. AutoBench Agentic (Latest Evolution)
22
+ Current agentic benchmarks (like GDPval-AA or Terminal-Bench) are static, allowing models to simply "train to the test." AutoBench Agentic fundamentally changes this by dropping LLMs into dynamically generated, multi-turn **Agentic Virtual Environments**.
23
+ * **Technical Complexity:** Our infrastructure deterministically injects business-flavored personas into a native Universal Intermediate Representation (UIR). We inject stateful "memory lines" of previous workflow failures, and force models to navigate complex native JSON `tools[]` arrays filled with randomly injected "distractor" tools.
24
+ * **10 Granular Task Types:** We evaluate true orchestration under pressure, measuring specific capabilities like *Adaptive Replanning*, *Parameter Complexity*, *Single Tool Call*, and *Failure Recovery*.
25
+ * **Cost vs. Performance Tracking:** Tracks exact P99 Latency and strict USD/run costs to help developers define their efficiency frontier.
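The distractor-tool idea above can be illustrated with a minimal sketch. It is not AutoBench's implementation: the function names, the seed-based determinism, and the OpenAI-style function-tool schema are all assumptions made for illustration.

```python
import random


def make_tool(name, description):
    """Build one entry for a tools[] array.

    Assumption: an OpenAI-style function-tool schema; the README does not
    say which schema AutoBench Agentic uses.
    """
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {"type": "object", "properties": {}},
        },
    }


def build_tools_array(required_tools, distractor_pool, n_distractors, seed):
    """Mix the tools a task needs with randomly sampled distractor tools.

    A seeded generator (hypothetical design choice) makes the same task
    reproducible across models while still varying between tasks.
    """
    rng = random.Random(seed)
    distractors = rng.sample(distractor_pool, n_distractors)
    tools = list(required_tools) + distractors
    rng.shuffle(tools)  # hide the required tools among the distractors
    return tools


# Hypothetical task: one required tool hidden among three distractors.
required = [make_tool("get_invoice", "Fetch an invoice by id")]
pool = [make_tool(f"distractor_{i}", "Unrelated tool") for i in range(10)]
tools = build_tools_array(required, pool, n_distractors=3, seed=42)
```

A model under test would then receive `tools` and be scored on whether it calls only the required tool.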
 
26
 
27
+ ### 2. Agentic LLM Routing
28
+ Benchmarking is just the first step. AutoBench uses the millions of execution traces and granular performance data generated by our runs to train dynamic **Agentic LLM Routers**. Instead of passive gateways or superficial semantic heuristic routers, AutoBench empowers active pipeline optimization—routing complex edge cases to frontier models and standard tasks to open-weight models, saving agent applications and enterprises up to 90% in API costs.
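As a rough illustration of cost-aware routing, the sketch below picks the cheapest model whose benchmark score for a task type clears a threshold. It is a static lookup standing in for the trained routers described above; `ModelProfile`, `route`, and every score and price are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ModelProfile:
    """Hypothetical per-model summary derived from benchmark runs."""
    name: str
    usd_per_run: float
    score_by_task: dict = field(default_factory=dict)  # task type -> score in [0, 1]


def route(task_type, models, min_score):
    """Return the cheapest model meeting min_score for this task type.

    If no model qualifies, fall back to the highest-scoring model.
    """
    eligible = [m for m in models if m.score_by_task.get(task_type, 0.0) >= min_score]
    if eligible:
        return min(eligible, key=lambda m: m.usd_per_run)
    return max(models, key=lambda m: m.score_by_task.get(task_type, 0.0))


# Illustrative numbers only, not AutoBench results.
models = [
    ModelProfile("frontier", 0.50, {"failure_recovery": 0.92, "single_tool_call": 0.97}),
    ModelProfile("open_weight", 0.05, {"failure_recovery": 0.61, "single_tool_call": 0.95}),
]
easy = route("single_tool_call", models, min_score=0.9)
hard = route("failure_recovery", models, min_score=0.9)
```

Here the simple task goes to the cheaper open-weight model and the harder one to the frontier model, which is the cost-saving pattern the paragraph describes.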
29
 
30
+ ### 3. AutoBench 2.0 & Domain Benchmarks
31
+ AutoBench 2.0 is the core engine powering our latest generalist and domain-specific runs (such as our Agronomy vertical). It introduces three major technical breakthroughs to the Collective-LLM-as-a-Judge framework:
32
+ * **Random Score Pooling:** Instead of a fixed, predefined set of judging models, we pool random models for every scoring session, expanding exploration of the "LLM performance space" while reducing required compute.
33
+ * **Nonlinear Weighting:** Replaces simple linear averaging with advanced weighting functions (exponential, power-law, Boltzmann) to compensate for variance and improve convergence among highly capable frontier models.
34
+ * **Parallel Iteration:** Reduces evaluation cycles from days to mere hours.
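Random pooling and nonlinear weighting can be sketched as follows. The README names the weighting families but gives no formulas, so the functional forms, the `temperature` and `exponent` parameters, and the choice to weight judges by a reliability score are assumptions, not AutoBench's actual method.

```python
import math
import random


def sample_judges(judges, pool_size, rng):
    """Draw a random judge pool for one scoring session (no repeats)."""
    return rng.sample(judges, pool_size)


def boltzmann_weights(reliabilities, temperature):
    """Softmax-style weights: exp(r / T), normalized to sum to 1.

    Lower temperatures concentrate weight on the most reliable judges.
    """
    top = max(reliabilities)  # subtract the max for numerical stability
    exps = [math.exp((r - top) / temperature) for r in reliabilities]
    total = sum(exps)
    return [e / total for e in exps]


def power_law_weights(reliabilities, exponent):
    """Weights proportional to r ** exponent, normalized to sum to 1."""
    powered = [r ** exponent for r in reliabilities]
    total = sum(powered)
    return [p / total for p in powered]


def weighted_score(scores, weights):
    """Combine per-judge scores using normalized weights."""
    return sum(s * w for s, w in zip(scores, weights))


# Hypothetical session: 3 judges drawn from 5, with made-up reliabilities.
rng = random.Random(0)
judge_pool = sample_judges(["a", "b", "c", "d", "e"], pool_size=3, rng=rng)
b_weights = boltzmann_weights([0.9, 0.8, 0.5], temperature=0.1)
p_weights = power_law_weights([0.9, 0.8, 0.5], exponent=2.0)
final = weighted_score([4.5, 4.0, 2.0], b_weights)
```

Compared with a plain average, both schemes pull the final score toward the judges with higher reliability, more sharply so for the Boltzmann form at low temperature.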
 
35
 
36
+ ### 4. Bot Scanner (Consumer/Dev Platform)
37
+ Powered by AutoBench's evaluation methodology, [**Bot Scanner**](https://botscanner.ai) is the "Skyscanner for LLM responses." It is a live platform that lets users route a single prompt to multiple "responder" LLMs simultaneously, then uses AutoBench's "judge" LLMs to evaluate and rank the responses and deliver the best answer instantly, ending LLM guesswork.
38
 
39
+ ### 5. AutoBench 1.0 (Open Source)
40
+ The foundational open-source framework that proved the Collective-LLM-as-a-Judge concept. It remains free and available for researchers and developers to run local evaluations and explore the core architecture.
41
 
42
+ ## Key Differentiators & Industry Correlations
 
 
43
 
44
+ AutoBench solves the traditional tradeoff between scalability, cost, and accuracy:
45
+ * **Strictly Un-gameable:** Because tasks and environments are dynamically generated at runtime, test-set contamination is impossible. Models cannot "memorize" the benchmark.
46
+ * **Highly Correlated (Scientific Validation):** Despite its dynamic nature, AutoBench achieves strong correlation with rigid, human-verified industry standards:
47
+     * **Agentic Correlations:** 85.15% with the Artificial Analysis Intelligence Index, 84.56% with GDPval-AA, and 83.00% with Terminal-Bench Hard.
48
+     * **Generalist Correlations:** 89.38% with the Artificial Analysis Index, 82.21% with MMLU-Pro, and 71.84% with LMSYS Chatbot Arena (Human Preference).
49
+ * **High Granularity & Adaptability:** Unlike one-size-fits-all tests, AutoBench's architecture easily adapts to highly specialized, domain-specific verticals. We provide granular, topic-specific performance insights—such as our recent **[Agronomic Benchmark](https://huggingface.co/blog/PeterKruger/autobench-run-agronomy-1)**, allowing enterprises to test models on their exact proprietary schemas and niche industry knowledge.
50
+ * **Highly Scalable & Cost-Effective:** A comprehensive benchmark evaluating 30+ models costs a fraction of human-annotated alternatives (often under $100 in raw compute).
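For readers who want to reproduce a correlation figure of this kind, the sketch below computes a Pearson coefficient between two score lists. The README does not say which correlation statistic AutoBench reports, so Pearson is an assumption, and the scores are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-model scores from two benchmarks (same model order).
autobench_scores = [4.1, 3.8, 3.5, 3.0]
reference_scores = [62.0, 58.0, 55.0, 41.0]
r = pearson(autobench_scores, reference_scores)
```

A value of `r` near 1 means the two benchmarks score the same models in a closely matching, linearly related way; multiplying by 100 gives a percentage in the style quoted above.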
51
 
52
+ ## Scientific Validation & Acknowledgements
53
 
54
+ Our methodology is scientifically validated and continuously peer-reviewed. We extend our immense gratitude to our partners and supporters:
55
+ * **Translated:** Global leader in professional AI-enabled translation and high-quality human training data generation, for their continued support in compute resources and strategic insight.
56
+ * **DIAG, Sapienza Università di Roma:** The team led by **Prof. Fabrizio Silvestri** for providing the rigorous scientific validation that underpins our methodology.
57
+ * **eZecute:** The venture builder for enabling the industrialization and scaling of this platform.
58
+ * **AWS Startups:** For compute credits.
59
 
60
+ ### Citation
61
+ If you use AutoBench in your research, please cite our validation paper:
62
 
63
+ ```bibtex
64
+ @misc{autobench2025,
65
+ title={AutoBench: Automating LLM Evaluation through Reciprocal Peer Assessment},
66
+ author={AutoBench},
67
+ year={2025},
68
+ eprint={2510.22593},
69
+ archivePrefix={arXiv},
70
+ primaryClass={cs.CL},
71
+ url={https://arxiv.org/abs/2510.22593},
72
+ }
+ ```
73
 
74
+ ## Explore, Connect, and Contribute
75
 
76
+ Whether you are an AI researcher, a prompt engineer, or an enterprise IT architect deploying autonomous agents, AutoBench has the data you need to stop flying blind.
77
 
78
+ * 🌐 **Official Website & Data Archive:** [autobench.org](https://autobench.org/)
79
+ * 🏆 **Interactive Agentic Leaderboard:** [AutoBench-Leaderboard](https://huggingface.co/spaces/AutoBench/AutoBench-Leaderboard)
80
+ * 🤖 **Test Bot Scanner:** [botscanner.ai](https://botscanner.ai/)
81
+ * 📖 **Read Our Official Blog:** [AutoBench Blog](https://autobench.org/blog)
82
+ * 💻 **Explore the Open-Source Code:** [AutoBench 1.0 Repository](https://huggingface.co/AutoBench/AutoBench_1.0)
83
+ * 📝 **Read the Scientific Paper:** [arXiv:2510.22593](https://arxiv.org/abs/2510.22593)
 
 
84
 
85
+ *Inference Support: Running a compute-intensive benchmark like AutoBench can be expensive. We welcome all inference API providers to support us with free inference credits to expand the scope of our evaluations.*
86