MMLU-Pro Leaderboard
More advanced and challenging multi-task evaluation
Visualize Open vs. Proprietary LLM Progress
View how beam search decoding works, in detail!
Submit and score your model on the GAIA benchmark