Open LLM Leaderboard
Track, rank and evaluate open LLMs and chatbots
Cool leaderboard spaces collection for models across modalities! Text, vision, audio, ...
Track, rank and evaluate open LLMs and chatbots
Note The reference leaderboard for Open LLMs! Find the best LLM for your size and precision needs, and compare your models to others. (Evaluates on ARC, HellaSwag, TruthfulQA, and MMLU)
Explore and submit code model evaluations on a leaderboard
Note Specialized leaderboard for models with coding capabilities (Evaluates on HumanEval and MultiPL-E)
View LMArena model leaderboard
Note Pitches chatbots against one another to compare their output quality (ranks models with an Elo score, and also reports MTBench and MMLU)
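The Elo score mentioned above is updated from pairwise votes. As a minimal sketch of how arena-style ranking works (the K-factor and starting ratings here are illustrative assumptions, not LMArena's actual parameters):

```python
# Minimal sketch of the Elo update rule used by arena-style leaderboards
# to rank chatbots from head-to-head votes.
# K_FACTOR is an assumed step size, not the arena's real setting.

K_FACTOR = 32


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, a_won: bool) -> tuple[float, float]:
    """Return the new (rating_a, rating_b) after one vote."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + K_FACTOR * (score_a - exp_a)
    new_b = rating_b + K_FACTOR * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b
```

For two evenly matched models at 1000, a single win moves the winner to 1016 and the loser to 984; upsets against a higher-rated opponent move ratings further.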
Explore LLM performance across hardware configurations
Note Do you want to know which model to use for which hardware? This leaderboard is for you! (Looks at the throughput of many LLMs in different hardware settings)
Note This paper introduces (among other things) the Eleuther AI Harness, a reference evaluation suite which is simple to use and quite complete!
Note The HELM paper! A super cool reference paper on the many axes to look at when creating an LLM benchmark or evaluation suite. Super exhaustive and interesting to read.
Note The BigBench paper! A bunch of tasks to evaluate edge cases and random unusual LLM capabilities. The associated benchmark has since been extended with a lot of fun crowdsourced tasks.
Embedding Leaderboard
Note Text Embeddings benchmark across 58 tasks and 112 languages!
Submit your model answers to GAIA benchmark and view leaderboard
Note A leaderboard for tool augmented LLMs!
Display a web page
Note An LLM leaderboard for Chinese models on many metric axes - super complete
Explore and filter language model benchmark results
Note An Open LLM Leaderboard specially for Korean models by our friends at Upstage!
Redirect to leaderboard page
Note A leaderboard to evaluate the propensity of LLMs to hallucinate
View and submit LLM evaluations
Note A lot of metrics if you are interested in the propensity of LLMs to hallucinate!
Display benchmark results for models on various tasks
Note Tests LLM API usage and calls (few models at the moment)
Evaluate LLMs' cybersecurity risks and capabilities
Note How likely is your LLM to help cyber attacks?
Launch a Streamlit web app interface
Note An aggregation of benchmarks well correlated with human preferences
Explore and submit LLM benchmarks
Note Bias, safety, toxicity, all those things that are important to test when your chatbot actually interacts with users
Display and filter video generation model leaderboard
Note Text to video generation leaderboard
Can AI Code? An LLM leaderboard including quantized models.
Note Coding benchmark
Show OCRBench leaderboard rankings for OCR models
Note An OCR benchmark
Explore and filter LLM benchmark results
Note Dynamic leaderboard using complexity classes to create reasoning problems for LLMs - quite a cool one
Submit models for evaluation on a leaderboard
Note The Open LLM Leaderboard, but for structured state models!
Analyze images with multiple vision models for labels and boxes
Note A multimodal arena!
Upload video model evaluation data to update the VBench leaderboard
Track, rank and evaluate open LLMs in Portuguese
Note An LLM leaderboard for Portuguese
Track, rank and evaluate open LLMs in the Italian language!
Note An LLM leaderboard for Italian
Realtime Image/Video Gen AI Arena
Note An arena for image generation!
View leaderboard results for Q-Bench
Display leaderboard comparing text-to-image models based on human preferences
View and filter LLM hallucination leaderboard
Note A hallucination leaderboard, focused on a different set of tasks
Compare and rank large language models side-by-side
Explore and submit LLM benchmarks
Display and explore a leaderboard of language models
Explore and compare speech-recognition model benchmarks
VLMEvalKit Evaluation Results Collection
Explore RewardBench model rankings and scores
Vote on the latest TTS models!
View and compare leaderboard results for coding tasks
Uncensored General Intelligence Leaderboard
View the Berkeley Function-Calling Leaderboard
Track, rank and evaluate open LLMs' CoT quality
Display a static leaderboard for language models
Explore and compare Indic LLMs on a leaderboard
Leaderboard for LLM for Science Reasoning
Explore and submit models for benchmarking
Display and filter leaderboard data for language models
Explore and compare large language model benchmarks and submit your own models for evaluation
Visualize Open vs. Proprietary LLM Progress
Track, rank and evaluate open LLMs and chatbots
Explore and compare QA and long doc benchmarks
Track, rank and evaluate open Arabic LLMs and chatbots
Explore LLM benchmark leaderboard and submit models
Vote for 3D creations and see the leaderboard
Explore code-generation model leaderboards and task details
Explore and submit LLM benchmarks
Show leaderboard and explore model puzzle results
Explore and compare multilingual LLM benchmarks
Submit and track model performance on a leaderboard
Browse and submit evaluation results for AI benchmarks
Benchmarking LLMs on the stability of simulated populations
Compact LLM Battle Arena: Frugal AI Face-Off!
Evaluate open LLMs in the languages of LATAM and Spain.
Explore and submit LLM benchmarks
GIFT-Eval: A Benchmark for General Time Series Forecasting
View and compare open-source AI model rankings with Elo scores
Open Persian LLM Leaderboard
Compare two chatbots and vote on the better one
Explore and compare LLM models with interactive filters and visualizations
Submit models to MLSB 2024 leaderboard
Explore toxicity scores of models
Vote for the best background removal model
Forecast evaluation benchmark
AI Phone Leaderboard
Display model leaderboard and performance plot
Display and filter LLM benchmark results
Analyze complex Polish text understanding in models
Browse and compare language model responses in Polish
DABstep Reasoning Benchmark Leaderboard