VHELM: A Holistic Evaluation of Vision Language Models Paper • 2410.07112 • Published Oct 9, 2024
AIR-Bench 2024: A Safety Benchmark Based on Risk Categories from Regulations and Policies Paper • 2407.17436 • Published Jul 11, 2024
Image2Struct: Benchmarking Structure Extraction for Vision-Language Models Paper • 2410.22456 • Published Oct 29, 2024
SEA-HELM: Southeast Asian Holistic Evaluation of Language Models Paper • 2502.14301 • Published Feb 20, 2025
AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons Paper • 2503.05731 • Published Feb 19, 2025
AHELM: A Holistic Evaluation of Audio-Language Models Paper • 2508.21376 • Published Aug 29, 2025
Structured Prompting Enables More Robust Evaluation of Language Models Paper • 2511.20836 • Published Nov 25, 2025
Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation Paper • 2510.11977 • Published Oct 13, 2025
Introducing v0.5 of the AI Safety Benchmark from MLCommons Paper • 2404.12241 • Published Apr 18, 2024