xeon27 committed · ba14348
Parent(s): 6410971

Add task link in description

Files changed: src/about.py (+7 -6)
src/about.py CHANGED

@@ -42,20 +42,21 @@ NUM_FEWSHOT = 0 # Change with your few shot
 # Your leaderboard name
 TITLE = """<h1 align="center" id="space-title">LLM Evaluation Leaderboard</h1>"""
 
+BASE_TASK_NAMES = ", ".join([f"[{task.value.col_name}]({task.value.source})" for task in Tasks if task.value.type == "base"])
+AGENTIC_TASK_NAMES = ", ".join([f"[{task.value.col_name}]({task.value.source})" for task in Tasks if task.value.type == "agentic"])
+
 # What does your leaderboard evaluate?
-INTRODUCTION_TEXT = """
-This leaderboard presents the performance of selected LLM models on a set of tasks. The tasks are divided into two categories: base and agentic. The base tasks are:
-""" + ", ".join([f"[{task.value.col_name}]({task.value.source})" for task in Tasks if task.value.type == "base"]) + """. The agentic tasks are:
-""" + ", ".join([f"[{task.value.col_name}]({task.value.source})" for task in Tasks if task.value.type == "agentic"]) + """."""
+INTRODUCTION_TEXT = f"""
+This leaderboard presents the performance of selected LLM models on a set of tasks. The tasks are divided into two categories: base and agentic. The base tasks are: {BASE_TASK_NAMES}. The agentic tasks are: {AGENTIC_TASK_NAMES}."""
 
 # Which evaluations are you running? how can people reproduce what you have?
 LLM_BENCHMARKS_TEXT = f"""
 ## How it works
 The following benchmarks are included:
 
-Base:
+Base: {BASE_TASK_NAMES}
 
-Agentic:
+Agentic: {AGENTIC_TASK_NAMES}
 
 ## Reproducibility
 To reproduce our results, here is the commands you can run:
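The pattern this commit introduces — precomputing the joined markdown links once, then interpolating them into f-strings — can be sketched in isolation. Note that `TaskInfo` and the two example tasks below are hypothetical stand-ins: the real `Tasks` enum lives elsewhere in the Space's source and is not shown in this diff.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical stand-in for the Space's task definitions (assumption:
# the real Tasks enum exposes .value.col_name, .value.source, .value.type).
@dataclass
class TaskInfo:
    col_name: str
    source: str
    type: str

class Tasks(Enum):
    task0 = TaskInfo("MMLU", "https://example.com/mmlu", "base")
    task1 = TaskInfo("SWE-bench", "https://example.com/swe-bench", "agentic")

# Same refactor as the commit: build each "[name](link)" once,
# then reuse the joined string in any f-string that needs it.
BASE_TASK_NAMES = ", ".join(
    f"[{task.value.col_name}]({task.value.source})"
    for task in Tasks if task.value.type == "base"
)
AGENTIC_TASK_NAMES = ", ".join(
    f"[{task.value.col_name}]({task.value.source})"
    for task in Tasks if task.value.type == "agentic"
)

INTRODUCTION_TEXT = f"""
The base tasks are: {BASE_TASK_NAMES}. The agentic tasks are: {AGENTIC_TASK_NAMES}."""

print(BASE_TASK_NAMES)     # → [MMLU](https://example.com/mmlu)
print(AGENTIC_TASK_NAMES)  # → [SWE-bench](https://example.com/swe-bench)
```

Keeping the joins in module-level constants avoids repeating the comprehension in every string and lets the plain `"""…""" + join + """…"""` concatenation collapse into single f-strings, which is what the `+7 -6` change amounts to.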