Updated
README.md CHANGED
@@ -10,75 +10,77 @@ license: apache-2.0

# AutoBench

-## Organization Details
-
-* **Organization:** AutoBench
-* **Point of Contact:** [Peter Kruger](https://huggingface.co/PeterKruger), CEO eZecute
-* **Website:** [www.autobench.org]
-* **Funding:** Self-funded, with the support of inference API providers providing free inference compute support.
-
## Organization Description

-AutoBench is
-
-## Benchmarking System: AutoBench 1.0
-* **Adaptive:** Model weights are adjusted with use, meaning the benchmark improves in quality the more it is used.
-* **Error Handling and Robustness:** Includes mechanisms for handling API errors, unresponsive models, and invalid responses.
-* Organizations evaluating LLMs for deployment.
-* Anyone interested in tracking the progress of LLM capabilities.
-note = {Accessed: [Date Accessed]}
-}
-* **Start from our blog post on Hugging Face**: [Escape the Benchmark Trap: AutoBench – the Collective-LLM-as-a-Judge System for Evaluating AI models (ASI-Ready!)](https://huggingface.co/blog/PeterKruger/autobench)
-* **Explore the code and data:** [Hugging Face AutoBench 1.0 Repository](https://huggingface.co/PeterKruger/AutoBench) <!-- Replace with actual link -->
-* **Try our Demo on Spaces:** [AutoBench 1.0 Demo](https://huggingface.co/spaces/PeterKruger/AutoBench) <!-- Replace with actual link -->
-* **Read the detailed methodology:** [Detailed Methodology Document](https://huggingface.co/PeterKruger/AutoBench/blob/main/AutoBench_1_0_Detailed_Methodology_Document.pdf) <!-- Replace with link -->
-* **Join the discussion:** [Hugging Face AutoBench Community Discussion](https://huggingface.co/PeterKruger/AutoBench/discussions) <!-- Replace with link -->
-* **Contribute:** Help us by suggesting new topics, refining prompts, or enhancing the weighting algorithm; submit pull requests or issues via the Hugging Face Repo.

# AutoBench

## Organization Description

+**AutoBench** is the premier LLM evaluation and routing infrastructure for the Agentic Era. We are dedicated to solving the LLM evaluation crisis by moving the industry beyond static, domain-rigid, easily gameable text prompts and by building the first open LLM-based API router for the agentic era.

+Pioneering the **"Collective-LLM-as-a-Judge"** methodology, AutoBench uses massive pools of LLMs to dynamically generate tasks, execute multi-turn workflows, and granularly evaluate performance across the AI ecosystem. Today, AutoBench provides fully automated, highly correlated, and strictly un-gameable benchmarking. Furthermore, we leverage the massive synthetic execution datasets generated by our benchmarks to train next-generation **Agentic LLM Routers**, helping agent developers and enterprises optimize for both absolute quality and unit economics.

+## The AutoBench Ecosystem

+### 1. AutoBench Agentic (Latest Evolution)
+Current agentic benchmarks (like GDPval-AA or Terminal-Bench) are static, allowing models to simply "train to the test." AutoBench Agentic fundamentally changes this by dropping LLMs into dynamically generated, multi-turn **Agentic Virtual Environments**.
+* **Technical Complexity:** Our infrastructure deterministically injects business-flavored personas into a native Universal Intermediate Representation (UIR). We inject stateful "memory lines" of previous workflow failures and force models to navigate complex native JSON `tools[]` arrays filled with randomly injected "distractor" tools.
+* **10 Granular Task Types:** We evaluate true orchestration under pressure, measuring specific capabilities like *Adaptive Replanning*, *Parameter Complexity*, *Single Tool Call*, and *Failure Recovery*.
+* **Cost vs. Performance Tracking:** Tracks exact P99 latency and strict USD/run costs to help developers define their efficiency frontier.
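
As a rough illustration of the "distractor" idea in the bullets above, the sketch below mixes a task's real tools with randomly sampled distractors and shuffles the resulting `tools[]` array with a seeded generator. The tool schemas, the `build_tools_array` helper, and the seeding scheme are all assumptions made for this example; they are not taken from AutoBench's implementation.

```python
import random


def build_tools_array(real_tools, distractor_pool, n_distractors, seed):
    """Return a shuffled tools[] list holding every real tool plus sampled distractors."""
    rng = random.Random(seed)  # seeded so that a given environment can be rebuilt
    distractors = rng.sample(distractor_pool, n_distractors)
    tools = list(real_tools) + distractors
    rng.shuffle(tools)
    return tools


# Hypothetical tool definitions (names and schemas are illustrative only).
real_tools = [
    {"name": "get_invoice", "description": "Fetch an invoice by ID.",
     "parameters": {"type": "object",
                    "properties": {"invoice_id": {"type": "string"}}}},
]
distractor_pool = [
    {"name": "get_weather", "description": "Look up a weather forecast.",
     "parameters": {"type": "object",
                    "properties": {"city": {"type": "string"}}}},
    {"name": "send_fax", "description": "Send a fax to a phone number.",
     "parameters": {"type": "object",
                    "properties": {"number": {"type": "string"}}}},
    {"name": "convert_currency", "description": "Convert an amount between currencies.",
     "parameters": {"type": "object",
                    "properties": {"amount": {"type": "number"}}}},
]

tools = build_tools_array(real_tools, distractor_pool, n_distractors=2, seed=7)
print([tool["name"] for tool in tools])
```

A model under test would then have to pick `get_invoice` out of the mixed list. Reusing the same seed reproduces the same array, which is one way to keep a randomized environment auditable.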

+### 2. Agentic LLM Routing
+Benchmarking is just the first step. AutoBench uses the millions of execution traces and granular performance data generated by our runs to train dynamic **Agentic LLM Routers**. Instead of passive gateways or superficial semantic heuristic routers, AutoBench enables active pipeline optimization: complex edge cases are routed to frontier models and standard tasks to open-weight models, saving agent applications and enterprises up to 90% in API costs.
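
To make the routing idea concrete, here is a deliberately simplified, rule-based stand-in for a router: it picks the cheapest model whose benchmarked quality for a task type clears a threshold. The routers described above are trained on execution traces, which this sketch does not attempt; the `ModelProfile` fields, model names, scores, and costs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str
    cost_per_run_usd: float  # hypothetical cost figure
    quality_by_task: dict    # task type -> score in [0, 1], hypothetical


def route(task_type, min_quality, profiles):
    """Pick the cheapest model that clears the quality bar, else the best-scoring one."""
    def quality(profile):
        return profile.quality_by_task.get(task_type, 0.0)

    eligible = [p for p in profiles if quality(p) >= min_quality]
    if eligible:
        return min(eligible, key=lambda p: p.cost_per_run_usd)
    return max(profiles, key=quality)


profiles = [
    ModelProfile("frontier-large", 0.40,
                 {"failure_recovery": 0.92, "single_tool_call": 0.97}),
    ModelProfile("open-weight-small", 0.03,
                 {"failure_recovery": 0.61, "single_tool_call": 0.95}),
]

print(route("single_tool_call", 0.90, profiles).name)
print(route("failure_recovery", 0.90, profiles).name)
```

With these made-up numbers, the simple task goes to the cheaper model and the harder one to the frontier model, which is the cost-saving pattern the paragraph describes.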

+### 3. AutoBench 2.0 & Domain Benchmarks
+AutoBench 2.0 is the core engine powering our latest generalist and domain-specific runs (such as our Agronomy vertical). It introduces three major technical breakthroughs to the Collective-LLM-as-a-Judge framework:
+* **Random Score Pooling:** Instead of a fixed set of judging models, we pool random models for every scoring session, expanding exploration of the "LLM performance space" while reducing required compute.
+* **Nonlinear Weighting:** Replaces simple linear averaging with advanced weighting functions (exponential, power-law, Boltzmann) to compensate for variance and improve convergence among highly capable frontier models.
+* **Parallel Iteration:** Reduces evaluation cycles from days to mere hours.
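
The first two bullets can be sketched together: draw a random pool of judges, then combine their grades with a nonlinear weight on each judge's quality. The functional forms below (`q ** p`, `exp(k * q)`, `exp(q / T)`), the parameter defaults, and the judge names and numbers are assumptions for illustration; the source names the weighting families but not their exact definitions.

```python
import math
import random


def power_law(q, p=2.0):
    return q ** p


def exponential(q, k=1.0):
    return math.exp(k * q)


def boltzmann(q, temperature=0.5):
    return math.exp(q / temperature)


def pooled_score(grades, judge_quality, weight_fn, pool_size, rng):
    """Weighted mean of the grades given by a randomly drawn pool of judges."""
    pool = rng.sample(sorted(grades), pool_size)  # random judges for this session
    weights = [weight_fn(judge_quality[judge]) for judge in pool]
    total = sum(w * grades[judge] for w, judge in zip(weights, pool))
    return total / sum(weights)


# Hypothetical grades for one answer and hypothetical judge quality scores.
grades = {"judge_a": 4.0, "judge_b": 3.0, "judge_c": 5.0, "judge_d": 2.0}
judge_quality = {"judge_a": 0.9, "judge_b": 0.5, "judge_c": 0.8, "judge_d": 0.3}

rng = random.Random(0)
for fn in (power_law, exponential, boltzmann):
    print(fn.__name__, round(pooled_score(grades, judge_quality, fn, 3, rng), 3))
```

Compared with a plain average, each of these functions shifts influence toward judges with higher quality scores; how sharply depends on `p`, `k`, or `temperature`.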

+### 4. Bot Scanner (Consumer/Dev Platform)
+Powered by AutoBench's evaluation methodology, [**Bot Scanner**](https://botscanner.ai) is the "Skyscanner for LLM responses." It is a live platform that lets users route a single prompt to multiple "responder" LLMs simultaneously, then uses AutoBench's "judge" LLMs to evaluate, rank, and deliver the absolute best answer instantly, ending LLM guesswork.
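
The responder/judge flow described above can be sketched as a fan-out followed by a ranking step. This is a minimal illustration with stub functions in place of real model calls; `best_answer`, the responder and judge callables, and the scoring scale are assumptions, not Bot Scanner's actual API.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def best_answer(prompt, responders, judges):
    """Send a prompt to every responder in parallel, then rank answers by mean judge score."""
    names = list(responders)
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        answers = list(pool.map(lambda name: responders[name](prompt), names))
    scored = [
        (mean(judge(prompt, answer) for judge in judges), name, answer)
        for name, answer in zip(names, answers)
    ]
    return sorted(scored, reverse=True)  # best first


# Stub responders and judges standing in for real LLM calls.
responders = {
    "model_a": lambda prompt: "Paris",
    "model_b": lambda prompt: "It might be Lyon or Paris",
}
judges = [
    lambda prompt, answer: 5.0 if answer == "Paris" else 2.0,
    lambda prompt, answer: 4.0 if "Paris" in answer else 1.0,
]

ranking = best_answer("What is the capital of France?", responders, judges)
print(ranking[0][1], ranking[0][2])
```

In a real deployment the lambdas would be network calls, so the thread pool lets slow responders overlap instead of running one after another.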

+### 5. AutoBench 1.0 (Open Source)
+The foundational open-source framework that proved the Collective-LLM-as-a-Judge concept. It remains free and available for researchers and developers to run local evaluations and explore the core architecture.

+## Key Differentiators & Industry Correlations

+AutoBench solves the traditional tradeoff between scalability, cost, and accuracy:
+* **Strictly Un-gameable:** Because tasks and environments are dynamically generated at runtime, test-set contamination is impossible. Models cannot "memorize" the benchmark.
+* **Highly Correlated (Scientific Validation):** Despite its dynamic nature, AutoBench correlates strongly with rigid, human-verified industry standards:
+    * **Agentic Correlations:** 85.15% with the Artificial Analysis Intelligence Index, 84.56% with GDPval-AA, and 83.00% with Terminal-Bench Hard.
+    * **Generalist Correlations:** 89.38% with the Artificial Analysis Index, 82.21% with MMLU-Pro, and 71.84% with LMSYS Chatbot Arena (Human Preference).
+* **High Granularity & Adaptability:** Unlike one-size-fits-all tests, AutoBench's architecture easily adapts to highly specialized, domain-specific verticals. We provide granular, topic-specific performance insights, such as our recent **[Agronomic Benchmark](https://huggingface.co/blog/PeterKruger/autobench-run-agronomy-1)**, allowing enterprises to test models on their exact proprietary schemas and niche industry knowledge.
+* **Highly Scalable & Cost-Effective:** A comprehensive benchmark evaluating 30+ models costs a fraction of human-annotated alternatives (often under $100 in raw compute).
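
For readers who want to run a similar comparison on their own data, the sketch below computes a Pearson correlation between two lists of per-model scores. The text above does not say which correlation statistic AutoBench reports, so Pearson is an assumption here, and the score lists are placeholder values rather than real results.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)  # undefined if either list is constant


# Placeholder scores for the same four models on two benchmarks (not real data).
dynamic_benchmark = [4.2, 3.9, 3.1, 2.8]
reference_benchmark = [71.0, 68.0, 55.0, 52.0]

print(f"{pearson(dynamic_benchmark, reference_benchmark):.2%}")
```

A rank-based statistic such as Spearman's would be a reasonable alternative when only the ordering of models matters.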

+## Scientific Validation & Acknowledgements

+Our methodology is scientifically validated and continuously peer-reviewed. We extend our immense gratitude to our partners and supporters:
+* **Translated:** Global leader in professional AI-enabled translation and high-quality human training data generation, for their continued support in compute resources and strategic insight.
+* **DIAG, Sapienza Università di Roma:** The team led by **Prof. Fabrizio Silvestri** for providing the rigorous scientific validation that underpins our methodology.
+* **eZecute:** The venture builder for enabling the industrialization and scaling of this platform.
+* **AWS Startups:** For compute credits.

+### Citation
+If you use AutoBench in your research, please cite our validation paper:

+```bibtex
+@misc{autobench2025,
+  title={AutoBench: Automating LLM Evaluation through Reciprocal Peer Assessment},
+  author={AutoBench},
+  year={2025},
+  eprint={2510.22593},
+  archivePrefix={arXiv},
+  primaryClass={cs.CL},
+  url={https://arxiv.org/abs/2510.22593},
+}
+```

+## Explore, Connect, and Contribute

+Whether you are an AI researcher, a prompt engineer, or an enterprise IT architect deploying autonomous agents, AutoBench has the data you need to stop flying blind.

+* 🌐 **Official Website & Data Archive:** [autobench.org](https://autobench.org/)
+* 🏆 **Interactive Agentic Leaderboard:** [AutoBench-Leaderboard](https://huggingface.co/spaces/AutoBench/AutoBench-Leaderboard)
+* 🤖 **Test Bot Scanner:** [botscanner.ai](https://botscanner.ai/)
+* 📖 **Read Our Official Blog:** [AutoBench Blog](https://autobench.org/blog)
+* 💻 **Explore the Open-Source Code:** [AutoBench 1.0 Repository](https://huggingface.co/AutoBench/AutoBench_1.0)
+* 📝 **Read the Scientific Paper:** [arXiv:2510.22593](https://arxiv.org/abs/2510.22593)

+*Inference Support: Running a compute-intensive benchmark like AutoBench can be expensive. We welcome all inference API providers to support us with free inference credits to expand the scope of our evaluations.*