# Your leaderboard name
TITLE = """<h1 align="center" id="space-title">Evaluation Leaderboard</h1>"""
# SINGLE_TURN_TASK_NAMES = ", ".join([f"[{task.value.col_name}]({task.value.source})" for task in Tasks if task.value.type == "base"])
# AGENTIC_TASK_NAMES = ", ".join([f"[{task.value.col_name}]({task.value.source})" for task in Tasks if task.value.type == "agentic"])
# What does your leaderboard evaluate?
INTRODUCTION_TEXT = f"""
Powered by **Inspect** and **Inspect Evals**, the **Vector Evaluation Leaderboard** presents an evaluation of leading frontier models across a comprehensive suite of benchmarks. Go beyond the summary metrics: click through to interactive reporting for each model and benchmark to explore sample-level performance and detailed traces."""
# Which evaluations are you running? how can people reproduce what you have?
ABOUT_TEXT = f"""
## Vector Institute
The **Vector Institute** is dedicated to advancing the field of artificial intelligence through cutting-edge research and application. Our mission is to drive excellence and innovation in AI, fostering a community of researchers, developers, and industry partners.
## 🎯 Benchmarks
This leaderboard showcases performance across a comprehensive suite of benchmarks designed to rigorously evaluate different aspects of AI model capabilities. Let's explore the benchmarks we use:
### Inspect Evals
This leaderboard uses [Inspect Evals](https://ukgovernmentbeis.github.io/inspect_evals/) to power its evaluations. Inspect Evals is an open-source repository built on the Inspect AI framework. Developed in collaboration between the Vector Institute, Arcadia Impact, and the UK AI Security Institute, it provides a comprehensive suite of high-quality benchmarks spanning diverse domains such as coding, mathematics, cybersecurity, reasoning, and general knowledge.
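As an illustration of the underlying tooling, a single Inspect Evals benchmark can be run from the command line once ```inspect_ai``` and ```inspect_evals``` are installed (the task and model below are examples only, and the relevant provider API key, e.g. ```OPENAI_API_KEY```, must be set in the environment); the leaderboard's own results are produced with the scripts described under Reproducibility:
```bash
# Illustrative only: run one Inspect Evals benchmark against one model.
inspect eval inspect_evals/gsm8k --model openai/gpt-4o
```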
#### Transparent and Detailed Insights
All evaluations presented on this leaderboard are run using Inspect Evals. To facilitate in-depth analysis and promote transparency, we provide [Inspect Logs](https://inspect.ai-safety-institute.org.uk/log-viewer.html) for every benchmark run. These logs offer sample- and trace-level reporting, allowing the community to explore the granular details of model performance.
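Downloaded logs can also be browsed locally with Inspect's built-in viewer; a minimal sketch, assuming the log files sit in a local ```./logs``` directory:
```bash
# Illustrative only: open the Inspect log viewer on a local directory of logs.
inspect view --log-dir ./logs
```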
### ⚙️ Base Benchmarks
These benchmarks assess fundamental reasoning and knowledge capabilities of models.
<div class="benchmark-table-container">

| Benchmark | Description |
|--------------------|----------------------------------------------------------------------------------|
| **ARC-Easy** / **ARC-Challenge** | Multiple-choice science questions; the Challenge split contains the harder items. |
| **DROP** | Reading-comprehension benchmark requiring discrete reasoning (counting, arithmetic, sorting) over paragraphs. |
| **WinoGrande** | Fill-in-the-blank pronoun-resolution task testing commonsense reasoning. |
| **GSM8K** | Grade-school math word problems testing math capability & multi-step reasoning. |
| **HellaSwag** | Sentence-completion task testing commonsense inference. |
| **HumanEval** | Python code generation from docstrings, checked against unit tests. |
| **IFEval** | Tests the ability to follow verifiable natural-language instructions. |
| **MATH** | Challenging questions sourced from math competitions. |
| **MMLU** / **MMLU-Pro** | Multi-subject multiple-choice tests of advanced knowledge. |
| **GPQA-Diamond** | Graduate-level, "Google-proof" science questions assessing deeper reasoning. |
| **MMMU** (Multi-Choice / Open-Ended) | Multi-modal, college-level tasks testing structured & open responses. |

</div>
### 🚀 Agentic Benchmarks
These benchmarks go beyond basic reasoning and evaluate more advanced, autonomous, or "agentic" capabilities of models, such as planning, tool use, and interaction with an environment.
<div class="benchmark-table-container">

| Benchmark | Description |
|-----------------------|----------------------------------------------------------------------------|
| **GAIA** | Evaluates autonomous reasoning, planning, tool use, and problem-solving on real-world question answering. |
| **InterCode-CTF** | Interactive capture-the-flag challenges testing cyber-security skills. |
| **In-House-CTF** | Capture-the-flag challenges that probe attacks on locally hosted services. |
| **AgentHarm** / **AgentHarm-Benign** | Measures harmfulness of LLM agents (with a benign-task baseline). |
| **SWE-Bench-Verified** | Tests an agent's ability to resolve real GitHub issues from open-source Python repositories. |

</div>
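Agentic benchmarks are run through the same Inspect Evals interface, but many of them need additional setup; a hedged sketch, assuming Docker is available for sandboxed tool use and that any task-specific dependencies or dataset access have been arranged (the task and model are examples only):
```bash
# Illustrative only: agentic evals such as GAIA typically execute tools inside a sandbox.
inspect eval inspect_evals/gaia --model openai/gpt-4o
```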
| """ | |
| REPRODUCIBILITY_TEXT = """ | |
| ## 🛠️ Reproducibility | |
| The [Vector State of Evaluation Leaderboard Repository](https://github.com/VectorInstitute/evaluation) repository contains the evaluation script to reproduce results presented on the leaderboard. | |
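To get started, clone the repository and work from its root (a minimal sketch; the directory name simply follows the repository name):
```bash
# Clone the leaderboard evaluation repository and move into it.
git clone https://github.com/VectorInstitute/evaluation.git
cd evaluation
```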
### Install dependencies
1. Create a Python virtual environment with ```python>=3.10``` and activate it:
```bash
python -m venv env
source env/bin/activate
```
2. Install ```inspect_ai```, ```inspect_evals```, and the other dependencies listed in ```requirements.txt```:
```bash
python -m pip install -r requirements.txt
```
3. Install any packages required for the models you'd like to evaluate and use as grader models (see the example below):
```bash
python -m pip install <model_package>
```
Note: the ```openai``` package is already included in ```requirements.txt```.
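For example, to evaluate or grade with Anthropic-hosted models you would also install the Anthropic client package (shown purely as an illustration; install whichever provider packages your chosen models require):
```bash
# Illustrative only: provider package needed for Anthropic-hosted models.
python -m pip install anthropic
```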
### Run Inspect evaluation
1. Update the ```src/evals_cfg/run_cfg.yaml``` file to select the evals (base/agentic) and list all models to be evaluated.
2. Run the evaluation:
```bash
python src/run_evals.py
```
"""