| --- |
| title: AutoBench Leaderboard |
| emoji: π |
| colorFrom: green |
| colorTo: pink |
| sdk: gradio |
| sdk_version: 5.27.0 |
| app_file: app.py |
| pinned: false |
| license: mit |
| short_description: Multi-run AutoBench leaderboard with historical navigation |
| --- |
| |
| # AutoBench LLM Leaderboard |
|
|
| Interactive leaderboard for AutoBench, a benchmark in which Large Language Models (LLMs) evaluate and rank responses from other LLMs. The application supports multiple benchmark runs and lets you switch between them to compare results over time. |
|
|
| ## Features |
|
|
| ### Multi-Run Navigation |
| - **Run Selector**: Switch between different AutoBench runs using the dropdown menu |
| - **Historical Data**: View and compare results across different time periods |
| - **Reactive Interface**: All tabs and visualizations update automatically when switching runs |
| - **Enhanced Metrics**: Support for evaluation iterations and fail rates in newer runs |
|
|
| ### Comprehensive Analysis |
| - **Overall Ranking**: Model performance with AutoBench scores, costs, latency, and reliability metrics |
| - **Benchmark Comparison**: Correlations with Chatbot Arena, the Artificial Analysis Intelligence (AAI) Index, and MMLU benchmarks |
| - **Performance Plots**: Interactive scatter plots showing cost vs. performance trade-offs |
| - **Cost & Latency Analysis**: Detailed breakdown by domain and response time percentiles |
| - **Domain Performance**: Model rankings across specific knowledge areas |
|
|
| ### Dynamic Features |
| - **Benchmark Correlations**: Displays correlation percentages with other popular benchmarks |
| - **Cost Conversion**: Automatic conversion to cents for better readability |
| - **Performance Metrics**: Average and P99 latency measurements |
| - **Fail Rate Tracking**: Model reliability metrics (for supported runs) |
| - **Iteration Counts**: Number of evaluations per model (for supported runs) |
|
|
| ## How to Use |
|
|
| ### Navigation |
| 1. **Select a Run**: Use the dropdown menu at the top to choose between available benchmark runs |
| 2. **Explore Tabs**: Navigate through different analysis views using the tab interface |
| 3. **Interactive Tables**: Sort and filter data by clicking on column headers |
| 4. **Hover for Details**: Get additional information by hovering over chart elements |
|
|
| ### Understanding the Data |
| - **AutoBench Score**: Higher scores indicate better performance |
| - **Cost**: Lower values are better (displayed in cents per response) |
| - **Latency**: Lower response times are better (average and 99th percentile, P99) |
| - **Fail Rate**: Lower percentages indicate more reliable models |
| - **Iterations**: Number of evaluation attempts per model |
|
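The cost and latency figures above can be illustrated with a short sketch. The helper names `usd_to_cents` and `p99_latency`, the rounding precision, and the nearest-rank percentile method are assumptions made for illustration; the app may compute these values differently.

```python
def usd_to_cents(cost_usd, digits=3):
    """Convert a cost in USD to cents, rounded for display (hypothetical helper)."""
    return round(cost_usd * 100, digits)


def p99_latency(durations_sec):
    """Nearest-rank 99th percentile of a non-empty list of durations (assumed method)."""
    ordered = sorted(durations_sec)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) using integer arithmetic
    return ordered[rank - 1]


print(usd_to_cents(0.0125))              # 1.25
print(p99_latency(list(range(1, 101))))  # 99
```

With 100 sorted samples, the nearest-rank P99 is the 99th value, so a single very slow response does not change it, while the average would shift.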
|
| ## Adding New Runs |
|
|
| ### Directory Structure |
| ``` |
| runs/ |
| ├── run_YYYY-MM-DD/ |
| │   ├── metadata.json          # Run information and metadata |
| │   ├── correlations.json      # Benchmark correlation data |
| │   ├── summary_data.csv       # Main leaderboard data |
| │   ├── domain_ranks.csv       # Domain-specific rankings |
| │   ├── cost_data.csv          # Cost breakdown by domain |
| │   ├── avg_latency.csv        # Average latency by domain |
| │   └── p99_latency.csv        # P99 latency by domain |
| ``` |
|
|
| ### Required Files |
|
|
| #### 1. metadata.json |
| ```json |
| { |
| "run_id": "run_2025-08-14", |
| "title": "AutoBench Run 3 - August 2025", |
| "date": "2025-08-14", |
| "description": "Latest AutoBench run with enhanced metrics", |
| "blog_url": "https://huggingface.co/blog/PeterKruger/autobench-3rd-run", |
| "model_count": 34, |
| "is_latest": true |
| } |
| ``` |
|
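As a rough sketch of how run discovery could work with this metadata, the code below scans a `runs/` directory for `metadata.json` files and picks the run to show by default. The function names `discover_runs` and `latest_run` and the fallback to the newest date are assumptions, not the app's actual implementation.

```python
import json
from pathlib import Path


def discover_runs(runs_dir):
    """Return metadata dicts for every run_* folder that contains a metadata.json."""
    runs = []
    for meta_path in Path(runs_dir).glob("run_*/metadata.json"):
        with meta_path.open(encoding="utf-8") as fh:
            meta = json.load(fh)
        meta["path"] = str(meta_path.parent)  # remember where the run's files live
        runs.append(meta)
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings; newest first.
    runs.sort(key=lambda m: m["date"], reverse=True)
    return runs


def latest_run(runs):
    """Prefer the run flagged is_latest; otherwise fall back to the newest date."""
    for meta in runs:
        if meta.get("is_latest"):
            return meta
    return runs[0] if runs else None
```

Calling `latest_run(discover_runs("runs"))` would return the metadata of the default run, or `None` when no runs are present.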
|
| #### 2. correlations.json |
| ```json |
| { |
| "correlations": { |
| "Chatbot Arena": 82.51, |
| "Artificial Analysis Intelligence Index": 83.74, |
| "MMLU": 71.51 |
| }, |
| "description": "Correlation percentages between AutoBench scores and other benchmark scores" |
| } |
| ``` |
|
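The snippet below is a minimal sketch of reading this file's structure and turning it into display strings. The function name `format_correlations` and the output format are illustrative assumptions; only the JSON layout comes from the example above.

```python
import json


def format_correlations(raw_json):
    """Turn the 'correlations' mapping into 'Name: 12.34%' strings."""
    data = json.loads(raw_json)
    return [f"{name}: {value:.2f}%" for name, value in data["correlations"].items()]


example = """
{
  "correlations": {
    "Chatbot Arena": 82.51,
    "Artificial Analysis Intelligence Index": 83.74,
    "MMLU": 71.51
  }
}
"""
for line in format_correlations(example):
    print(line)
# Chatbot Arena: 82.51%
# Artificial Analysis Intelligence Index: 83.74%
# MMLU: 71.51%
```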
|
| #### 3. summary_data.csv |
| Required columns: |
| - `Model`: Model name |
| - `AutoBench`: AutoBench score |
| - `Costs (USD)`: Cost per response in USD |
| - `Avg Answer Duration (sec)`: Average response time |
| - `P99 Answer Duration (sec)`: 99th percentile response time |
| |
| Optional columns (for enhanced metrics): |
| - `Iterations`: Number of evaluation iterations |
| - `Fail Rate %`: Percentage of failed responses |
| - `LMArena` or `Chatbot Ar.`: Chatbot Arena scores |
| - `MMLU-Pro` or `MMLU Index`: MMLU benchmark scores |
| - `AAI Index`: Artificial Analysis Intelligence Index scores |
| |
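Before adding a run, it can help to check that `summary_data.csv` has the required columns listed above. The sketch below uses only the standard library `csv` module rather than Pandas; the helper name `missing_required_columns` and the sample row are hypothetical.

```python
import csv
import io

REQUIRED_COLUMNS = [
    "Model",
    "AutoBench",
    "Costs (USD)",
    "Avg Answer Duration (sec)",
    "P99 Answer Duration (sec)",
]


def missing_required_columns(csv_file):
    """Return the required columns that are absent from the CSV header row."""
    header = next(csv.reader(csv_file), [])
    present = {name.strip() for name in header}
    return [col for col in REQUIRED_COLUMNS if col not in present]


# Placeholder data for illustration only, not real leaderboard values.
sample = io.StringIO("Model,AutoBench,Costs (USD)\nexample-model,4.2,0.01\n")
print(missing_required_columns(sample))
# ['Avg Answer Duration (sec)', 'P99 Answer Duration (sec)']
```

For a real file, pass an open handle instead, for example `open("runs/run_YYYY-MM-DD/summary_data.csv", newline="", encoding="utf-8")`.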
| ### Adding a New Run |
| |
| 1. **Create Directory**: `mkdir runs/run_YYYY-MM-DD` |
| 2. **Add Data Files**: Copy your CSV files to the new directory |
| 3. **Create Metadata**: Add `metadata.json` with run information |
| 4. **Add Correlations**: Create `correlations.json` with benchmark correlations |
| 5. **Update Previous Run**: Set `"is_latest": false` in the previous latest run's metadata |
| 6. **Restart App**: The new run will be automatically discovered |
|
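Steps 1, 3, and 5 above could be scripted along these lines. This is a sketch under assumptions: the function name `register_run` and its parameters are hypothetical, and the optional `description` and `blog_url` fields are left out and would still need to be added by hand, as would the CSV files and `correlations.json`.

```python
import json
from pathlib import Path


def register_run(runs_dir, run_date, title, model_count):
    """Create runs/run_<date>/metadata.json and unset is_latest on older runs."""
    runs_dir = Path(runs_dir)
    run_id = f"run_{run_date}"

    # Step 5: clear the is_latest flag on any existing run that has it set.
    for meta_path in runs_dir.glob("run_*/metadata.json"):
        meta = json.loads(meta_path.read_text(encoding="utf-8"))
        if meta.get("is_latest"):
            meta["is_latest"] = False
            meta_path.write_text(json.dumps(meta, indent=2), encoding="utf-8")

    # Steps 1 and 3: create the run directory and write its metadata file.
    run_dir = runs_dir / run_id
    run_dir.mkdir(parents=True, exist_ok=True)
    metadata = {
        "run_id": run_id,
        "title": title,
        "date": run_date,
        "model_count": model_count,
        "is_latest": True,
    }
    (run_dir / "metadata.json").write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return run_dir
```

Clearing the old flag before writing the new metadata keeps exactly one run marked as latest, even if the script is run twice for the same date.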
|
| ### Column Compatibility |
|
|
| The application automatically adapts to different column structures: |
| - **Legacy Runs**: Support basic columns (Model, AutoBench, Cost, Latency) |
| - **Enhanced Runs**: Include additional metrics (Iterations, Fail Rate %) |
| - **Flexible Naming**: Handles variations in benchmark column names |
|
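One way to handle the naming variations is an alias table that maps each known column name to a single canonical name. The sketch below is an assumption about how such normalization might look; the canonical names and the helpers `normalize_columns` and `has_enhanced_metrics` are hypothetical, while the aliases come from the optional columns listed earlier.

```python
# Known header variants mapped to one canonical name (canonical names are assumed).
COLUMN_ALIASES = {
    "LMArena": "Chatbot Arena",
    "Chatbot Ar.": "Chatbot Arena",
    "MMLU-Pro": "MMLU",
    "MMLU Index": "MMLU",
}
ENHANCED_COLUMNS = ["Iterations", "Fail Rate %"]


def normalize_columns(header):
    """Replace known aliases with canonical names; leave other columns unchanged."""
    return [COLUMN_ALIASES.get(name.strip(), name.strip()) for name in header]


def has_enhanced_metrics(header):
    """True when the run provides both optional enhanced-metric columns."""
    return all(col in header for col in ENHANCED_COLUMNS)


legacy = ["Model", "AutoBench", "Chatbot Ar.", "MMLU Index"]
print(normalize_columns(legacy))     # ['Model', 'AutoBench', 'Chatbot Arena', 'MMLU']
print(has_enhanced_metrics(legacy))  # False
```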
|
| ## Development |
|
|
| ### Requirements |
| - Python 3.8+ |
| - Gradio 5.27.0+ |
| - Pandas |
| - Plotly |
|
|
| ### Installation |
| ```bash |
| pip install -r requirements.txt |
| ``` |
|
|
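| ### Running Locally |
| ```bash |
| python app.py |
| ``` |
|
|
| The app will automatically discover available runs and launch on a local port. |
|
|
| ### Stopping All Python Processes (Windows) |
|
| The command below is Windows-specific (`taskkill`) and force-terminates every running `python.exe` process, not only this app, so use it with care. |
|
| ```bash |
| taskkill /F /IM python.exe 2>/dev/null || echo "No Python processes to kill" |
| ``` |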
|
|
| ## Data Sources |
|
|
| AutoBench evaluations are conducted using LLM-generated questions across diverse domains, with responses ranked by evaluation LLMs. For more information about the methodology, visit the [AutoBench blog posts](https://huggingface.co/blog/PeterKruger/autobench). |
|
|
| ## License |
|
|
| MIT License - see LICENSE file for details. |
|
|
| --- |
|
|
| Check out the [Hugging Face Spaces configuration reference](https://huggingface.co/docs/hub/spaces-config-reference) for deployment options. |
|
|