On the Workflows and Smells of Leaderboard Operations (LBOps): An Exploratory Study of Foundation Model Leaderboards Paper • 2407.04065 • Published Jul 4, 2024 • 10
AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security Paper • 2605.29801 • Published 29 days ago • 144
Towards Evaluation Engineering: An Empirical Study of ML Evaluation Harnesses in the Wild Paper • 2605.24213 • Published May 22 • 14
Article Open LLM Leaderboard: DROP deep dive +3 clefourrier, cabreraalex, stellaathena, SaylorTwift, thomwolf • Dec 1, 2023 • 11
Article What's going on with the Open LLM Leaderboard? +2 clefourrier, SaylorTwift, slippylolo, thomwolf • Jun 23, 2023 • 51
ScholarCopilot: Training Large Language Models for Academic Writing with Accurate Citations Paper • 2504.00824 • Published Apr 1, 2025 • 43
Leaderboards and benchmarks ✨ Collection Cool leaderboard spaces collection for models across modalities! Text, vision, audio, ... • 88 items • Updated Mar 2 • 120
Open LLM Leaderboard best models ❤️🔥 Collection A daily-uploaded list of the models with the best evaluations on the LLM leaderboard • 50 items • Updated Mar 13 • 694
The Big Benchmarks Collection Collection Gathering benchmark spaces on the hub (beyond the Open LLM Leaderboard) • 13 items • Updated Nov 18, 2024 • 267