# Your leaderboard name
TITLE = """<h1 align="center" id="space-title">Evaluation Leaderboard</h1>"""
# SINGLE_TURN_TASK_NAMES = ", ".join([f"[{task.value.col_name}]({task.value.source})" for task in Tasks if task.value.type == "base"])
# AGENTIC_TASK_NAMES = ", ".join([f"[{task.value.col_name}]({task.value.source})" for task in Tasks if task.value.type == "agentic"])
# What does your leaderboard evaluate?
INTRODUCTION_TEXT = f"""
Powered by **Inspect** and **Inspect Evals**, the **Vector Evaluation Leaderboard** presents an evaluation of leading frontier models across a comprehensive suite of benchmarks. Go beyond the summary metrics: click through to interactive reporting for each model and benchmark to explore sample-level performance and detailed traces."""
# Which evaluations are you running? how can people reproduce what you have?
ABOUT_TEXT = f"""
## Vector Institute
The **Vector Institute** is dedicated to advancing the field of artificial intelligence through cutting-edge research and application. Our mission is to drive excellence and innovation in AI, fostering a community of researchers, developers, and industry partners.
## 🎯 Benchmarks
This leaderboard showcases performance across a comprehensive suite of benchmarks designed to rigorously evaluate different aspects of AI model capabilities. Let's explore the benchmarks we use:
### Inspect Evals
This leaderboard uses [Inspect Evals](https://ukgovernmentbeis.github.io/inspect_evals/) to power its evaluations. Inspect Evals is an open-source repository built on the Inspect AI framework. Developed in collaboration between the Vector Institute, Arcadia Impact, and the UK AI Security Institute, it provides a comprehensive suite of high-quality benchmarks spanning diverse domains such as coding, mathematics, cybersecurity, reasoning, and general knowledge.
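As an illustration of the underlying tooling, a single Inspect Evals benchmark can be run from the command line once ```inspect_ai``` and ```inspect_evals``` are installed (the task and model below are examples only, and the relevant provider API key, e.g. ```OPENAI_API_KEY```, must be set in the environment); the leaderboard's own results are produced with the scripts described under Reproducibility:
```bash
# Illustrative only: run one Inspect Evals benchmark against one model.
inspect eval inspect_evals/gsm8k --model openai/gpt-4o
```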
#### Transparent and Detailed Insights
All evaluations presented on this leaderboard are run using Inspect Evals. To facilitate in-depth analysis and promote transparency, we provide [Inspect Logs](https://inspect.ai-safety-institute.org.uk/log-viewer.html) for every benchmark run. These logs offer sample- and trace-level reporting, allowing the community to explore the granular details of model performance.
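Downloaded logs can also be browsed locally with Inspect's built-in viewer; a minimal sketch, assuming the log files sit in a local ```./logs``` directory:
```bash
# Illustrative only: open the Inspect log viewer on a local directory of logs.
inspect view --log-dir ./logs
```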
### ⚙️ Base Benchmarks
These benchmarks assess fundamental reasoning and knowledge capabilities of models.
<div class="benchmark-table-container">

| Benchmark | Description |
|--------------------|----------------------------------------------------------------------------------|
| **ARC-Easy** / **ARC-Challenge** | Multiple-choice science questions; the Challenge split contains the harder items. |
| **DROP** | Reading-comprehension benchmark requiring discrete reasoning (counting, arithmetic, sorting) over paragraphs. |
| **WinoGrande** | Fill-in-the-blank pronoun-resolution task testing commonsense reasoning. |
| **GSM8K** | Grade-school math word problems testing math capability & multi-step reasoning. |
| **HellaSwag** | Sentence-completion task testing commonsense inference. |
| **HumanEval** | Python code generation from docstrings, checked against unit tests. |
| **IFEval** | Tests the ability to follow verifiable natural-language instructions. |
| **MATH** | Challenging questions sourced from math competitions. |
| **MMLU** / **MMLU-Pro** | Multi-subject multiple-choice tests of advanced knowledge. |
| **GPQA-Diamond** | Graduate-level, "Google-proof" science questions assessing deeper reasoning. |
| **MMMU** (Multi-Choice / Open-Ended) | Multi-modal, college-level tasks testing structured & open responses. |

</div>
### 🚀 Agentic Benchmarks
These benchmarks go beyond basic reasoning and evaluate more advanced, autonomous, or "agentic" capabilities of models, such as planning, tool use, and interaction with an environment.
<div class="benchmark-table-container">

| Benchmark | Description |
|-----------------------|----------------------------------------------------------------------------|
| **GAIA** | Evaluates autonomous reasoning, planning, tool use, and problem-solving on real-world question answering. |
| **InterCode-CTF** | Interactive capture-the-flag challenges testing cyber-security skills. |
| **In-House-CTF** | Capture-the-flag challenges that probe attacks on locally hosted services. |
| **AgentHarm** / **AgentHarm-Benign** | Measures harmfulness of LLM agents (with a benign-task baseline). |
| **SWE-Bench-Verified** | Tests an agent's ability to resolve real GitHub issues from open-source Python repositories. |

</div>
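Agentic benchmarks are run through the same Inspect Evals interface, but many of them need additional setup; a hedged sketch, assuming Docker is available for sandboxed tool use and that any task-specific dependencies or dataset access have been arranged (the task and model are examples only):
```bash
# Illustrative only: agentic evals such as GAIA typically execute tools inside a sandbox.
inspect eval inspect_evals/gaia --model openai/gpt-4o
```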
| """ | |
| REPRODUCIBILITY_TEXT = """ | |
| ## 🛠️ Reproducibility | |
| The [Vector State of Evaluation Leaderboard Repository](https://github.com/VectorInstitute/evaluation) repository contains the evaluation script to reproduce results presented on the leaderboard. | |
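To get started, clone the repository and work from its root (a minimal sketch; the directory name simply follows the repository name):
```bash
# Clone the leaderboard evaluation repository and move into it.
git clone https://github.com/VectorInstitute/evaluation.git
cd evaluation
```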
### Install dependencies
1. Create a Python virtual environment with ```python>=3.10``` and activate it:
```bash
python -m venv env
source env/bin/activate
```
2. Install ```inspect_ai```, ```inspect_evals```, and the other dependencies listed in ```requirements.txt```:
```bash
python -m pip install -r requirements.txt
```
3. Install any packages required for the models you'd like to evaluate and use as grader models (see the example below):
```bash
python -m pip install <model_package>
```
Note: the ```openai``` package is already included in ```requirements.txt```.
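For example, to evaluate or grade with Anthropic-hosted models you would also install the Anthropic client package (shown purely as an illustration; install whichever provider packages your chosen models require):
```bash
# Illustrative only: provider package needed for Anthropic-hosted models.
python -m pip install anthropic
```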
### Run Inspect evaluation
1. Update the ```src/evals_cfg/run_cfg.yaml``` file to select the evals (base/agentic) and list all models to be evaluated.
2. Run the evaluation:
```bash
python src/run_evals.py
```
"""