Open LLM Leaderboard
Track, rank and evaluate open LLMs and chatbots
Cool leaderboard spaces collection for models across modalities! Text, vision, audio, ...
Track, rank and evaluate open LLMs and chatbots
Note The reference leaderboard for Open LLMs! Find the best LLM for your size and precision needs, and compare your models to others. (Evaluates on ARC, HellaSwag, TruthfulQA, and MMLU)
Explore and submit code model evaluations on a leaderboard
Note Specialized leaderboard for models with coding capabilities (Evaluates on HumanEval and MultiPL-E)
View LMArena model leaderboard
Note Pitches chatbots against one another to compare their output quality (ranks models with an Elo score, and also reports MTBench and MMLU)
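The Elo score mentioned above is updated from pairwise votes. As a minimal sketch of how arena-style ranking works (the K-factor and starting ratings here are illustrative assumptions, not LMArena's actual parameters):

```python
# Minimal sketch of the Elo update rule used by arena-style leaderboards
# to rank chatbots from head-to-head votes.
# K_FACTOR is an assumed step size, not the arena's real setting.

K_FACTOR = 32


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, a_won: bool) -> tuple[float, float]:
    """Return the new (rating_a, rating_b) after one vote."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + K_FACTOR * (score_a - exp_a)
    new_b = rating_b + K_FACTOR * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b
```

For two evenly matched models at 1000, a single win moves the winner to 1016 and the loser to 984; upsets against a higher-rated opponent move ratings further.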
Explore LLM performance across hardware configurations
Note Do you want to know which model to use for which hardware? This leaderboard is for you! (Looks at the throughput of many LLMs in different hardware settings)
Note This paper introduces (among other things) the Eleuther AI Harness, a reference evaluation suite which is simple to use and quite complete!
Note The HELM paper! A super cool reference paper on the many axes to look at when creating an LLM benchmark or evaluation suite. Super exhaustive and interesting to read.
Note The BigBench paper! A bunch of tasks to evaluate edge cases and random unusual LLM capabilities. The associated benchmark has since been extended with a lot of fun crowdsourced tasks.
Embedding Leaderboard
Note Text Embeddings benchmark across 58 tasks and 112 languages!
Submit your model answers to GAIA benchmark and view leaderboard
Note A leaderboard for tool augmented LLMs!
Display a web page
Note An LLM leaderboard for Chinese models on many metric axes - super complete
Explore and filter language model benchmark results
Note An Open LLM Leaderboard specially for Korean models by our friends at Upstage!
Redirect to leaderboard page
Note A leaderboard to evaluate the propensity of LLMs to hallucinate
View and submit LLM evaluations
Note A lot of metrics if you are interested in the propensity of LLMs to hallucinate!
Display benchmark results for models on various tasks
Note Tests LLM API usage and calls (few models at the moment)
Evaluate LLMs' cybersecurity risks and capabilities
Note How likely is your LLM to help cyber attacks?
Launch a Streamlit web app interface
Note An aggregation of benchmarks well correlated with human preferences
Explore and submit LLM benchmarks
Note Bias, safety, toxicity, all those things that are important to test when your chatbot actually interacts with users
Display and filter video generation model leaderboard
Note Text to video generation leaderboard
Can AI Code? An LLM leaderboard including quantized models.
Note Coding benchmark
Show OCRBench leaderboard rankings for OCR models
Note An OCR benchmark
Explore and filter LLM benchmark results
Note Dynamic leaderboard using complexity classes to create reasoning problems for LLMs - quite a cool one
Submit models for evaluation on a leaderboard
Note The Open LLM Leaderboard, but for structured state models!
Analyze images with multiple vision models for labels and boxes
Note A multimodal arena!
Upload video model evaluation data to update the VBench leaderboard
Track, rank and evaluate open LLMs in Portuguese
Note An LLM leaderboard for Portuguese
Track, rank and evaluate open LLMs in the Italian language!
Note An LLM leaderboard for Italian
Realtime Image/Video Gen AI Arena
Note An arena for image generation!
View leaderboard results for Q-Bench
Display leaderboard comparing text-to-image models based on human preferences
View and filter LLM hallucination leaderboard
Note A hallucination leaderboard, focused on a different set of tasks
Compare and rank large language models side-by-side
Explore and submit LLM benchmarks
Display and explore a leaderboard of language models
Explore and compare speech-recognition model benchmarks
VLMEvalKit Evaluation Results Collection
Explore RewardBench model rankings and scores
Vote on the latest TTS models!
View and compare leaderboard results for coding tasks
Uncensored General Intelligence Leaderboard
View the Berkeley Function-Calling Leaderboard
Track, rank and evaluate open LLMs' CoT quality
Display a static leaderboard for language models
Explore and compare Indic LLMs on a leaderboard
Leaderboard for LLM for Science Reasoning
Explore and submit models for benchmarking
Display and filter leaderboard data for language models
Explore and compare large language model benchmarks and submit your own models for evaluation
Visualize Open vs. Proprietary LLM Progress
Track, rank and evaluate open LLMs and chatbots
Explore and compare QA and long doc benchmarks
Track, rank and evaluate open Arabic LLMs and chatbots
Explore LLM benchmark leaderboard and submit models
Vote for 3D creations and see the leaderboard
Explore code-generation model leaderboards and task details
Explore and submit LLM benchmarks
Show leaderboard and explore model puzzle results
Explore and compare multilingual LLM benchmarks
Submit and track model performance on a leaderboard
Browse and submit evaluation results for AI benchmarks
Benchmarking LLMs on the stability of simulated populations
Compact LLM Battle Arena: Frugal AI Face-Off!
Evaluate open LLMs in the languages of LATAM and Spain.
Explore and submit LLM benchmarks
GIFT-Eval: A Benchmark for General Time Series Forecasting
View and compare open-source AI model rankings with Elo scores
Open Persian LLM Leaderboard
Compare two chatbots and vote on the better one
Explore and compare LLM models with interactive filters and visualizations
Submit models to MLSB 2024 leaderboard
Explore toxicity scores of models
Vote for the best background removal model
Forecast evaluation benchmark
AI Phone Leaderboard
Display model leaderboard and performance plot
Display and filter LLM benchmark results
Analyze complex Polish text understanding in models
Browse and compare language model responses in Polish
DABstep Reasoning Benchmark Leaderboard