PeterKruger committed on
Commit 508b3df · verified · 1 Parent(s): 5e33e6a

Update README.md

Files changed (1)
  1. README.md +8 -4
README.md CHANGED
@@ -12,19 +12,23 @@ license: apache-2.0

  ## Organization Description

- **[AutoBench](https://autobench.org/)** is the premier LLM evaluation and routing infra for the Agentic Era. We are dedicated to solving the LLM evaluation crisis by moving the industry beyond static, domain-rigid, easily gameable benchmarks and building the next generation LLM-based API routers for the agentic era.

- Pioneering the **"Collective-LLM-as-a-Judge"** methodology, AutoBench uses massive pools of LLMs to dynamically generate tasks, execute multi-turn workflows, and granularly evaluate performance across the AI ecosystem. Today, AutoBench provides fully automated, highly correlated, and strictly un-gameable benchmarking. Furthermore, we leverage the massive synthetic execution datasets generated by our benchmarks to train next-generation **Agentic LLM Routers**, helping agent developers and enterprises optimize for both absolute quality and unit economics.

  ## The AutoBench Ecosystem

  ### 1. AutoBench Agentic (Latest Evolution)
  Current agentic benchmarks (like Gval-AA or Terminal-bench) are static, allowing models to simply "train to the test." AutoBench Agentic fundamentally changes this by dropping LLMs into dynamically generated, multi-turn **Agentic Virtual Environments**.
- * **Technical Complexity:** Our infrastructure deterministically injects business-flavored personas into a native Universal Intermediate Representation (UIR). We inject stateful "memory lines" of previous workflow failures, and force models to navigate complex native JSON `tools[]` arrays filled with randomly injected "distractor" tools.
  * **10 Granular Task Types:** We evaluate true orchestration under pressure, measuring specific capabilities like *Adaptive Replanning*, *Parameter Complexity*, *Single Tool Call*, and *Failure Recovery*.
  * **Cost vs. Performance Tracking:** Tracks exact P99 Latency and strict USD/run costs to help developers define their efficiency frontier.

- ### 2. Agentic LLM Routing
  Benchmarking is just the first step. AutoBench uses the millions of execution traces and granular performance data generated by our runs to train dynamic **Agentic LLM Routers**. Instead of passive gateways or superficial semantic heuristic routers, AutoBench empowers active pipeline optimization—routing complex edge cases to frontier models and standard tasks to open-weight models, saving agent applications and enterprises up to 90% in API costs.

  ### 3. AutoBench 2.0 & Domain Benchmarks
 

  ## Organization Description

+ **[AutoBench](https://autobench.org/)** is the premier LLM evaluation and routing infra for the Agentic Era. This is not just about LLM benchmarking, but real-time, AI-trained LLM routing for agents (delivering up to 90% inference cost savings).

+ We are solving the LLM evaluation crisis by moving the industry beyond static, domain-rigid, and easily gameable benchmarks. AutoBench uses massive pools of LLMs to dynamically generate tasks, execute multi-turn workflows, and granularly evaluate LLM performance. Our benchmarks correlate 80-90% with industry standards, but they remain strictly un-gameable, unbiased, granular, flexible. At a fraction of the cost.
+
+ And that is just the beginning. We leverage the massive synthetic datasets generated by our benchmarks to train **next-gen Agentic LLM Routers**, helping agentic frameworks optimize for both quality and economics.
+
+ Our vision sees AutoBench as the essential, universal layer that will soon intermediate between all AI agents and underlying LLMs.

  ## The AutoBench Ecosystem

  ### 1. AutoBench Agentic (Latest Evolution)
  Current agentic benchmarks (like Gval-AA or Terminal-bench) are static, allowing models to simply "train to the test." AutoBench Agentic fundamentally changes this by dropping LLMs into dynamically generated, multi-turn **Agentic Virtual Environments**.
+ * **Technical Complexity:** Our infrastructure combines deterministic procedures and LLM generation to build complex business-flavored agentic task payloads via a native Universal Intermediate Representation (UIR). We inject stateful "memory lines" of previous workflow failures, and force models to navigate complex native JSON `tools[]` arrays filled with randomly injected "distractor" tools.
  * **10 Granular Task Types:** We evaluate true orchestration under pressure, measuring specific capabilities like *Adaptive Replanning*, *Parameter Complexity*, *Single Tool Call*, and *Failure Recovery*.
  * **Cost vs. Performance Tracking:** Tracks exact P99 Latency and strict USD/run costs to help developers define their efficiency frontier.

+ ### 2. Agentic LLM Routing (alpha)
  Benchmarking is just the first step. AutoBench uses the millions of execution traces and granular performance data generated by our runs to train dynamic **Agentic LLM Routers**. Instead of passive gateways or superficial semantic heuristic routers, AutoBench empowers active pipeline optimization—routing complex edge cases to frontier models and standard tasks to open-weight models, saving agent applications and enterprises up to 90% in API costs.

  ### 3. AutoBench 2.0 & Domain Benchmarks