Update README.md
README.md
CHANGED
|
@@ -12,19 +12,23 @@ license: apache-2.0
|
|
| 12 |
|
| 13 |
## Organization Description
|
| 14 |
|
| 15 |
-
**[AutoBench](https://autobench.org/)** is the premier LLM evaluation and routing infra for the Agentic Era.
|
| 16 |
|
| 17 |
-
|
| 18 |
|
| 19 |
## The AutoBench Ecosystem
|
| 20 |
|
| 21 |
### 1. AutoBench Agentic (Latest Evolution)
|
| 22 |
Current agentic benchmarks (like Gval-AA or Terminal-bench) are static, allowing models to simply "train to the test." AutoBench Agentic fundamentally changes this by dropping LLMs into dynamically generated, multi-turn **Agentic Virtual Environments**.
|
| 23 |
-
* **Technical Complexity:** Our infrastructure
|
| 24 |
* **10 Granular Task Types:** We evaluate true orchestration under pressure, measuring specific capabilities like *Adaptive Replanning*, *Parameter Complexity*, *Single Tool Call*, and *Failure Recovery*.
|
| 25 |
* **Cost vs. Performance Tracking:** Tracks P99 latency and USD-per-run cost for every model, so developers can locate their own efficiency frontier.
|
| 26 |
|
| 27 |
-
### 2. Agentic LLM Routing
|
| 28 |
Benchmarking is just the first step. AutoBench uses the millions of execution traces and granular performance data generated by our runs to train dynamic **Agentic LLM Routers**. Instead of passive gateways or superficial semantic heuristic routers, AutoBench enables active pipeline optimization: complex edge cases are routed to frontier models and standard tasks to open-weight models, saving agent applications and enterprises up to 90% in API costs.
|
| 29 |
|
| 30 |
### 3. AutoBench 2.0 & Domain Benchmarks
|
| 12 |
|
| 13 |
## Organization Description
|
| 14 |
|
| 15 |
+
**[AutoBench](https://autobench.org/)** is the premier LLM evaluation and routing infrastructure for the Agentic Era. It goes beyond LLM benchmarking to real-time, AI-trained LLM routing for agents, delivering up to 90% inference cost savings.
|
| 16 |
|
| 17 |
+
We are solving the LLM evaluation crisis by moving the industry beyond static, domain-rigid, and easily gameable benchmarks. AutoBench uses large pools of LLMs to dynamically generate tasks, execute multi-turn workflows, and evaluate LLM performance at a granular level. Our benchmarks correlate 80-90% with industry standards while remaining ungameable, unbiased, granular, and flexible, all at a fraction of the cost.
|
| 18 |
+
|
| 19 |
+
And that is just the beginning. We leverage the massive synthetic datasets generated by our benchmarks to train **next-gen Agentic LLM Routers**, helping agentic frameworks optimize for both quality and economics.
|
| 20 |
+
|
| 21 |
+
We envision AutoBench as the essential, universal layer that will soon mediate between all AI agents and their underlying LLMs.
|
| 22 |
|
| 23 |
## The AutoBench Ecosystem
|
| 24 |
|
| 25 |
### 1. AutoBench Agentic (Latest Evolution)
|
| 26 |
Current agentic benchmarks (like Gval-AA or Terminal-bench) are static, allowing models to simply "train to the test." AutoBench Agentic fundamentally changes this by dropping LLMs into dynamically generated, multi-turn **Agentic Virtual Environments**.
|
| 27 |
+
* **Technical Complexity:** Our infrastructure combines deterministic procedures and LLM generation to build complex business-flavored agentic task payloads via a native Universal Intermediate Representation (UIR). We inject stateful "memory lines" of previous workflow failures, and force models to navigate complex native JSON `tools[]` arrays filled with randomly injected "distractor" tools.
|
| 28 |
* **10 Granular Task Types:** We evaluate true orchestration under pressure, measuring specific capabilities like *Adaptive Replanning*, *Parameter Complexity*, *Single Tool Call*, and *Failure Recovery*.
|
| 29 |
* **Cost vs. Performance Tracking:** Tracks P99 latency and USD-per-run cost for every model, so developers can locate their own efficiency frontier.
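
To make the "distractor" idea from the Technical Complexity bullet concrete, here is a minimal Python sketch of how a task payload might mix required tools with randomly sampled distractors and attach "memory lines". It is an illustration only: the function names (`make_tool`, `build_task_payload`), the payload keys, and the tool schema (loosely modeled on common function-calling formats) are all assumptions, not AutoBench's actual UIR or code.

```python
import random


def make_tool(name, description):
    """Build one entry for a tools[] array (hypothetical schema, assumed for illustration)."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {"type": "object", "properties": {}},
        },
    }


def build_task_payload(required_tools, distractor_pool, n_distractors, memory_lines, seed=0):
    """Mix the tools a task needs with random distractors and attach memory lines."""
    rng = random.Random(seed)  # seeded so the same task can be regenerated
    distractors = rng.sample(distractor_pool, n_distractors)
    tools = list(required_tools) + distractors
    rng.shuffle(tools)  # the required tools should not sit in a predictable position
    return {
        "memory": list(memory_lines),  # notes about earlier workflow failures
        "tools": tools,
    }


# Hypothetical tools, invented for this example.
required = [make_tool("lookup_invoice", "Fetch an invoice by its id")]
pool = [
    make_tool("send_fax", "Send a fax to a phone number"),
    make_tool("convert_currency", "Convert an amount between currencies"),
    make_tool("get_weather", "Return the weather for a city"),
]

payload = build_task_payload(
    required,
    pool,
    n_distractors=2,
    memory_lines=["Previous attempt failed: invoice id was malformed."],
    seed=42,
)
tool_names = [tool["function"]["name"] for tool in payload["tools"]]
```

A model under test would then have to pick `lookup_invoice` out of `tool_names` while ignoring the two distractors, which is the kind of selection pressure the bullet describes.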
|
| 30 |
|
| 31 |
+
### 2. Agentic LLM Routing (alpha)
|
| 32 |
Benchmarking is just the first step. AutoBench uses the millions of execution traces and granular performance data generated by our runs to train dynamic **Agentic LLM Routers**. Instead of passive gateways or superficial semantic heuristic routers, AutoBench enables active pipeline optimization: complex edge cases are routed to frontier models and standard tasks to open-weight models, saving agent applications and enterprises up to 90% in API costs.
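
The routing idea can be sketched with a simple rule: send each task to the cheapest model whose estimated success rate for that task type clears a threshold, and fall back to the strongest model otherwise. The Python below is a hand-written stand-in, not AutoBench's trained router; the model names, prices, success rates, and the `route` function are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ModelOption:
    name: str
    usd_per_run: float
    # Estimated success rate per task type, e.g. derived from benchmark traces.
    success_by_task: dict = field(default_factory=dict)


def route(task_type, options, min_success=0.9):
    """Return the cheapest model that meets the success bar, else the most capable one."""
    eligible = [m for m in options if m.success_by_task.get(task_type, 0.0) >= min_success]
    if eligible:
        return min(eligible, key=lambda m: m.usd_per_run)
    # No model clears the bar: fall back to the best estimated success rate.
    return max(options, key=lambda m: m.success_by_task.get(task_type, 0.0))


# Hypothetical models and numbers, not measured results.
options = [
    ModelOption("frontier-model", 0.50, {"single_tool_call": 0.99, "failure_recovery": 0.95}),
    ModelOption("open-weight-model", 0.05, {"single_tool_call": 0.97, "failure_recovery": 0.70}),
]

easy_choice = route("single_tool_call", options)
hard_choice = route("failure_recovery", options)
```

With these made-up numbers, the standard task goes to the cheaper open-weight model and the harder recovery task goes to the frontier model. A trained router would replace the fixed lookup table with predictions learned from execution traces.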
|
| 33 |
|
| 34 |
### 3. AutoBench 2.0 & Domain Benchmarks
|