title,subtitle,authors,link,code_link,date,purpose,principles_tested,functional_props,input_modality,output_modality,input_source,output_source,size,splits,design,judge,protocol,model_access,has_heldout,heldout_details,alignment_validation,is_valid,baseline_models,robustness_measures,known_limitations,benchmarks_list
DigiData: Training and Evaluating General-Purpose Mobile Control Agents,"We present DigiData-Bench, a benchmark for evaluating mobile control agents on real-world complex tasks. We demonstrate that the commonly used step-accuracy metric falls short in reliably assessing mobile control agents and, to address this, we propose dynamic evaluation protocols and AI-powered evaluations as rigorous alternatives for agent assessment.",Meta FAIR,https://arxiv.org/abs/2511.07413,https://github.com/facebookresearch/DigiData,2025-11-08,Development; Research,mobile control agents,Core Performance,Text + Vision,Actions,New dataset (released with eval),Human annotations,Small (< 1K samples),,Dynamic data-driven (adaptive/interactive),Model-based: In the wild,"1. Given a goal and a trajectory produced by an agent, an LLM judge classifies whether it successfully achieved the goal. 2. The LLM judges rely on both the screenshot and the UI tree.",Outputs,False,,A human operator judges whether the trajectory successfully achieves the goal; we then measure agreement between the human and LLM judges to ensure high alignment,unknown,,Prompt variations tested; Multiple runs per sample; Temperature sensitivity tested; Repeated evaluations; Ablation studies; Inter-rater reliability; Significance testing,"- The live, dynamic environment introduces uncontrollable factors (feature deprecation, version changes, etc.) that make certain tasks unavailable after a period. - Sizable manual effort is required to set up prework in the environment.","AndroidControl, Android in the wild"
IntPhys 2,"IntPhys 2 offers a comprehensive suite of tests, based on the violation-of-expectation framework, that challenge models to differentiate between possible and impossible events within controlled and diverse virtual environments.",Meta FAIR,https://arxiv.org/abs/2506.09849,https://github.com/facebookresearch/IntPhys2,2025-05-31,Research,Intuitive Physics,Core Performance,Video,Scores/Embeddings,New dataset (released with eval),Simulation-based,Medium (1K - 100K samples),"Debug Set: 60 videos for model calibration; Main Set: 1,012 videos as the main evaluation set; Held-Out Set: 344 videos as the test set",Fixed data-driven (static test set),Automatic (Reference-based),1. Feed a video to the model. 2. Ask the model using specific prompts whether the video is physically plausible. 3. Check if the model's answer matches the ground truth label.,Outputs,True,Held-Out Set: 344 videos as the test set,We ran human baselines to ensure that this task is easy for humans.,unknown,- Human baseline: 96% - Best model baseline: V-JEPA2 56%,Prompt variations tested; Multiple runs per sample; Temperature sensitivity tested,"Models can be very sensitive to the way they are prompted, and video compression artefacts can also impact results. These are mostly limitations on the model side, since humans are not sensitive to them.","IntPhys: A Framework and Benchmark for Visual Intuitive Physics Reasoning, Riochet et al. 2020"
ImageNet,"A large-scale hierarchical image database designed for visual object recognition research, containing over 14 million images across 20,000+ categories.",Princeton University,https://ieeexplore.ieee.org/document/5206848,https://image-net.org/,2009-01-01,Research; Development,"Object recognition, Visual categorization, Multi-class classification",Core Performance,Vision (Image),Structured Data,New dataset (released with eval),Human annotations,Huge (> 10M samples),"Train, Validation, Test",Fixed data-driven (static test set),Automatic (Reference-based),1. Model receives an image as input 2. Model predicts class label from 1000 classes 3. Top-1 and Top-5 accuracy computed against ground truth,Outputs,True,"Private test set maintained by organizers, used for annual ILSVRC competitions","Expert curation of hierarchical categories, manual verification of labels, consistency checks across similar categories",unknown,AlexNet: 63.3% top-1 VGG: 71.5% top-1 ResNet-50: 76.1% top-1 ResNet-152: 78.3% top-1 EfficientNet-B7: 84.3% top-1 Human performance: ~95% top-5,Multiple runs per sample; Significance testing; Inter-rater reliability,Class imbalance in some categories; Label noise in training set; Some ambiguous images with multiple valid labels; Bias towards certain object viewpoints and contexts,"ImageNet-V2, ImageNet-C, ImageNet-R, ImageNet-A, ImageNet-Sketch"
COCO (Common Objects in Context),"A large-scale object detection, segmentation, and captioning dataset containing 330K images with 80 object categories, designed to advance scene understanding.",Microsoft Research,https://arxiv.org/abs/1405.0312,https://github.com/cocodataset/cocoapi,2014-01-01,Research; Development; Selection,"Object detection, Instance segmentation, Keypoint detection, Panoptic segmentation, Image captioning",Core Performance,Vision (Image),Text; Structured Data,New dataset (released with eval),Human annotations,Large (100K - 1M samples),"Train, Validation, Test",Fixed data-driven (static test set),Automatic (Reference-based),"1. Model receives image as input 2. Model outputs bounding boxes, segmentation masks, or captions 3. Metrics computed: mAP for detection, IoU for segmentation, BLEU/CIDEr for captioning",Outputs,True,"Test-dev and test-challenge splits maintained privately, submissions via evaluation server","Multi-annotator consensus for instance annotations, quality control through redundant labeling",unknown,Faster R-CNN: 42.0 mAP Mask R-CNN: 37.1 mask mAP YOLOv8: 53.9 mAP Human performance (detection): ~70 mAP,Inter-rater reliability; Multiple runs per sample; Confidence intervals,Small object detection remains challenging; Occlusion handling difficulties; Dataset bias toward certain object contexts; Annotation inconsistencies in crowded scenes,"LVIS, Objects365, Open Images, Visual Genome"
CIFAR-10 and CIFAR-100,"Small-scale image classification benchmarks with 60K 32x32 color images in 10 (CIFAR-10) or 100 (CIFAR-100) classes, widely used for algorithm development and testing.",University of Toronto,https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf,https://github.com/pytorch/vision,2009-04-08,Research; Development; Selection,"Image classification, Transfer learning, Generalization",Core Performance; Robustness,Vision (Image),Structured Data,New dataset (released with eval),Human annotations,Medium (1K - 100K samples),"Train (50K), Test (10K)",Fixed data-driven (static test set),Automatic (Reference-based),1. Model receives 32x32 RGB image 2. Model predicts class label (10 or 100 classes) 3. Classification accuracy computed,Outputs,False,,"Systematic sampling from larger dataset (80 million tiny images), human verification of labels",unknown,CIFAR-10: ResNet-56: 93.03% Wide ResNet-28-10: 96.11% PyramidNet: 96.54% CIFAR-100: ResNet-56: 71.35% Wide ResNet-28-10: 81.15% PyramidNet: 83.78%,Multiple runs per sample; Seed variation tested; Ablation studies,Low resolution (32x32) limits fine-grained recognition; Some label noise in CIFAR-100; Dataset size enables memorization in large models; Limited diversity in poses and contexts,"CIFAR-10-C, CIFAR-10.1, CIFAR-100-C, STL-10, Tiny ImageNet"
Pascal VOC,"A pioneering object detection and segmentation benchmark with 20 object classes, providing standardized evaluation for visual recognition tasks.",University of Oxford,https://link.springer.com/article/10.1007/s11263-009-0275-4,https://github.com/pytorch/vision,2007-01-01,Research; Development; Selection,"Object detection, Semantic segmentation, Instance segmentation, Action classification",Core Performance,Vision (Image),Structured Data,New dataset (released with eval),Human annotations,Medium (1K - 100K samples),"Train, Validation, Test",Fixed data-driven (static test set),Automatic (Reference-based),1. Model receives image as input 2. Model predicts bounding boxes and class labels 3. mAP computed at IoU threshold 0.5,Outputs,True,"Test set labels held privately, evaluation via submission to organizers (historically)","Careful manual annotation with quality control, multiple annotators for difficult cases",unknown,R-CNN: 58.5 mAP (VOC 2007) Fast R-CNN: 70.0 mAP Faster R-CNN: 75.9 mAP YOLOv3: 78.6 mAP,Inter-rater reliability; Significance testing,Limited to 20 classes; Small dataset size by modern standards; Some annotation inconsistencies; Relatively simple backgrounds,"COCO, LVIS, Open Images, Cityscapes"
Cityscapes,"A large-scale dataset for urban scene understanding with pixel-level annotations for semantic segmentation, instance segmentation, and depth estimation in autonomous driving contexts.","Daimler AG, MPI for Informatics, TU Darmstadt",https://arxiv.org/abs/1604.01685,https://github.com/mcordts/cityscapesScripts,2016-04-01,Research; Development; Deployment,"Semantic segmentation, Instance segmentation, Scene understanding, Autonomous driving perception",Core Performance; Robustness,Vision (Image); Video,Structured Data,New dataset (released with eval),Expert annotations,Medium (1K - 100K samples),"Train (2975), Val (500), Test (1525)",Fixed data-driven (static test set),Automatic (Reference-based),1. Model receives street scene image (2048×1024) 2. Model predicts per-pixel semantic class (19 or 30 classes) 3. IoU and accuracy metrics computed,Outputs,True,"Test set labels private, evaluation via online server with leaderboard","Expert annotators with domain knowledge, multi-pass quality control, consistency verification across video sequences",unknown,FCN-8s: 65.3 mIoU DeepLab v3+: 82.1 mIoU HRNetV2: 83.0 mIoU SegFormer: 84.0 mIoU,Multiple runs per sample; Ablation studies; Significance testing,Limited to European cities; Weather bias (mostly good conditions); Class imbalance for rare objects; Fine annotation boundaries challenging,"ADE20K, KITTI, Mapillary Vistas, BDD100K, nuScenes"
ADE20K,"A comprehensive scene parsing benchmark with 150 semantic categories, designed for understanding diverse indoor and outdoor scenes with detailed object and part annotations.",MIT CSAIL,https://arxiv.org/abs/1608.05442,https://github.com/CSAILVision/ADE20K,2017-06-01,Research; Development,"Scene parsing, Semantic segmentation, Multi-scale recognition, Part segmentation",Core Performance,Vision (Image),Structured Data,New dataset (released with eval),Human annotations,Medium (1K - 100K samples),"Train (20K), Val (2K), Test (3K)",Fixed data-driven (static test set),Automatic (Reference-based),1. Model receives diverse scene image 2. Model predicts per-pixel semantic labels (150 classes) 3. Mean IoU and pixel accuracy computed,Outputs,False,,"Crowdsourced annotations with expert review, hierarchical consistency checks, multi-round verification",unknown,PSPNet: 43.29 mIoU DeepLab v3: 45.65 mIoU UPerNet: 44.85 mIoU SegFormer-B5: 51.8 mIoU,Inter-rater reliability; Multiple runs per sample,Long-tail distribution of object classes; Annotation granularity varies; Some scenes have ambiguous boundaries; Challenging for rare categories,"Cityscapes, Pascal Context, COCO-Stuff, Mapillary Vistas"
Kinetics,"A large-scale video action recognition dataset with 400/600/700 human action classes, designed to advance video understanding and temporal reasoning.",DeepMind,https://arxiv.org/abs/1705.06950,https://github.com/cvdfoundation/kinetics-dataset,2017-05-01,Research; Development,"Action recognition, Video understanding, Temporal reasoning, Human activity recognition",Core Performance,Video,Structured Data,New dataset (released with eval),Human annotations,Large (100K - 1M samples),"Train, Val, Test",Fixed data-driven (static test set),Automatic (Reference-based),1. Model receives 10-second video clip 2. Model predicts action class (400/600/700 classes) 3. Top-1 and Top-5 accuracy computed,Outputs,False,,"Human verification of video labels, removal of ambiguous clips, consistency checks for similar actions",unknown,I3D: 71.1% top-1 (K400) SlowFast: 79.8% top-1 (K400) X3D: 80.4% top-1 (K400) VideoMAE: 81.5% top-1 (K400),Multiple runs per sample; Temporal ordering tested,YouTube videos may become unavailable over time; Some action classes overlap or are ambiguous; Camera viewpoint bias; Dataset drift as internet content changes,"UCF-101, HMDB-51, ActivityNet, Something-Something, Moments in Time"
KITTI,"An autonomous driving benchmark suite providing stereo vision, optical flow, 3D object detection, and tracking datasets collected from real-world driving scenarios.","Karlsruhe Institute of Technology, Toyota Technological Institute",https://www.cvlibs.net/publications/Geiger2012CVPR.pdf,https://github.com/bostondiditeam/kitti,2012-01-01,Research; Development; Deployment,"3D object detection, Stereo vision, Optical flow, Visual odometry, Tracking",Core Performance; Robustness,Vision (Image); Video,Structured Data,New dataset (released with eval),Human annotations; Programmatically generated,Medium (1K - 100K samples),"Train, Val, Test",Fixed data-driven (static test set),Automatic (Reference-based),"1. Model receives stereo image pair or point cloud 2. Model predicts 3D bounding boxes, orientation, class 3. 3D AP computed at different difficulty levels (easy/moderate/hard)",Outputs,True,"Test set labels held privately, evaluation via online server with public leaderboard","LiDAR ground truth for 3D positions, manual verification of annotations, multi-sensor fusion for accuracy",unknown,"PointPillars: 79.05 AP (Car, Moderate) PV-RCNN: 83.90 AP (Car, Moderate) CenterPoint: 85.15 AP (Car, Moderate)",Multiple difficulty levels; Ablation studies; Significance testing,Limited to specific geographic region; Weather bias (mostly clear); Limited nighttime data; Class imbalance (cars dominate),"nuScenes, Waymo Open Dataset, Argoverse, A2D2, Lyft Level 5"
Places365,"A scene recognition benchmark with 365 scene categories and over 10 million images, designed to understand high-level visual concepts and environmental context.",MIT CSAIL,https://arxiv.org/abs/1610.02055,https://github.com/CSAILVision/places365,2017-07-01,Research; Development,"Scene recognition, Scene classification, Environmental understanding, Context recognition",Core Performance,Vision (Image),Structured Data,New dataset (released with eval),Human annotations,Huge (> 10M samples),"Train, Val, Test",Fixed data-driven (static test set),Automatic (Reference-based),1. Model receives scene image 2. Model predicts scene category (365 classes) 3. Top-1 and Top-5 accuracy computed,Outputs,False,,"Human verification, consistency checks for scene categories, hierarchical taxonomy validation",unknown,ResNet-152: 55.24% top-1 DenseNet-161: 56.12% top-1 ResNeXt-101: 56.05% top-1,Inter-rater reliability; Multiple runs per sample,Some scene categories overlap semantically; Cultural bias in scene definitions; Indoor scenes better represented than outdoor; Ambiguous boundary cases,"SUN397, MIT Indoor 67, Scene-15, ADE20K"
UCF-101,"An action recognition benchmark with 101 action categories and 13,320 videos collected from YouTube, widely used for video understanding research.",University of Central Florida,https://arxiv.org/abs/1212.0402,https://github.com/pytorch/vision,2012-01-01,Research; Development,"Action recognition, Video classification, Temporal understanding",Core Performance,Video,Structured Data,New dataset (released with eval),Human annotations,Medium (1K - 100K samples),Train/Test (3 splits provided),Fixed data-driven (static test set),Automatic (Reference-based),1. Model receives video clip 2. Model predicts action class (101 classes) 3. Average accuracy across 3 splits reported,Outputs,False,,"Manual verification of action labels, removal of ambiguous videos",unknown,Two-Stream CNN: 88.0% I3D: 95.6% SlowFast: 96.8% VideoMAE: 97.2%,Multiple evaluation splits; Seed variation tested,YouTube videos may become unavailable; Camera motion and quality vary; Some action classes are very similar; Dataset saturation with modern methods,"HMDB-51, Kinetics, ActivityNet, Something-Something-V2"
NYU Depth V2,"An RGB-D dataset for indoor scene understanding with 1449 densely labeled pairs of aligned RGB and depth images, designed for depth estimation and semantic segmentation.",New York University,https://cs.nyu.edu/~fergus/datasets/indoor_seg_support.pdf,https://github.com/ankurhanda/nyuv2-meta-data,2012-06-01,Research; Development,"Depth estimation, RGB-D understanding, Indoor scene parsing, 3D reconstruction",Core Performance,Vision (Image); Structured Data,Structured Data,New dataset (released with eval),Expert annotations,Small (< 1K samples),"Train (795), Test (654)",Fixed data-driven (static test set),Automatic (Reference-based),"1. Model receives RGB image 2. Model predicts depth map or semantic segmentation 3. RMSE, absolute relative error, and accuracy metrics computed",Outputs,False,,"Kinect sensor ground truth with manual alignment corrections, multi-view consistency",unknown,Depth Estimation: Eigen et al.: 0.641 RMSE AdaBins: 0.364 RMSE BTS: 0.392 RMSE,Multiple runs per sample; Ablation studies,"Small dataset size; Limited to indoor scenes; Kinect depth sensor limitations (range, IR interference); Mostly residential environments","ScanNet, Matterport3D, SUNRGB-D, KITTI Depth"
CelebA,"A large-scale face attributes dataset with 202,599 face images annotated with 40 binary attributes, 5 landmark locations, and identity information for face recognition and attribute prediction.",The Chinese University of Hong Kong,https://arxiv.org/abs/1411.7766,https://github.com/tkarras/progressive_growing_of_gans,2015-09-01,Research; Development,"Face attribute recognition, Face detection, Facial landmark detection, Identity recognition",Core Performance; Fairness,Vision (Image),Structured Data,New dataset (released with eval),Human annotations,Large (100K - 1M samples),"Train (162,770), Val (19,867), Test (19,962)",Fixed data-driven (static test set),Automatic (Reference-based),1. Model receives face image 2. Model predicts binary attributes (40 dimensions) 3. Per-attribute accuracy and mean accuracy computed,Outputs,False,,"Manual annotation with quality control, consistency verification across multiple attributes",unknown,LNets+ANet: 87.30% mean accuracy Walk and Learn: 88.06% FaceNet: 89.35%,Inter-rater reliability; Multiple runs per sample,Demographic bias (celebrity images); Some attributes are subjective; Label noise in some attributes; Privacy concerns with celebrity images; Lighting and pose variation,"LFW, VGGFace2, MS-Celeb-1M, FFHQ"
Visual Genome,"A comprehensive visual understanding dataset with dense annotations of objects, attributes, relationships, and scene graphs across 108K images for structured scene understanding.",Stanford University,https://arxiv.org/abs/1602.07332,https://github.com/ranjaykrishna/visual_genome_python_driver,2016-05-01,Research; Development,"Scene graph generation, Visual relationship detection, Visual question answering, Dense captioning",Core Performance,Vision (Image),Structured Data; Text,MS COCO,Crowdsourced annotations,Large (100K - 1M samples),"Train, Val, Test",Fixed data-driven (static test set),Automatic (Reference-based),"1. Model receives image 2. Model predicts objects, attributes, and relationships 3. Recall@K metrics for scene graph components",Outputs,False,,"Multi-round crowdsourced annotations with validation, consistency checks for relationships",unknown,IMP: 14.6 R@50 (scene graph detection) Motifs: 21.4 R@50 VCTree: 22.0 R@50 GPS-Net: 24.0 R@50,Inter-rater reliability; Multiple runs per sample,Long-tail distribution of relationships; Annotation inconsistencies; Subjective relationship definitions; Incomplete annotations (not all relationships captured),"GQA, Scene Graph Benchmark, VRD, OpenImages V6"
LVIS (Large Vocabulary Instance Segmentation),A large-scale instance segmentation dataset with 1203 object categories designed to address the long-tail distribution challenge in object recognition.,Facebook AI Research,https://arxiv.org/abs/1908.03195,https://github.com/lvis-dataset/lvis-api,2019-08-01,Research; Development,"Instance segmentation, Long-tail recognition, Object detection, Fine-grained categorization",Core Performance; Robustness,Vision (Image),Structured Data,MS COCO,Expert annotations,Large (100K - 1M samples),"Train, Val, Test",Fixed data-driven (static test set),Automatic (Reference-based),"1. Model receives image 2. Model predicts instance masks and categories (1203 classes) 3. AP computed separately for rare, common, and frequent categories",Outputs,True,"Test set labels private, evaluation via online server","Expert annotators with WordNet taxonomy, quality control for long-tail categories",unknown,Mask R-CNN: 21.2 AP (v1.0) Cascade R-CNN: 26.2 AP Swin Transformer: 50.9 AP,Multiple runs per sample; Ablation studies; Category frequency stratification,Rare categories have very few examples; Annotation cost for 1203 categories; Some category definitions overlap; Challenging for zero-shot generalization,"COCO, Objects365, OpenImages, iNaturalist"
Mapillary Vistas,"A diverse street-level imagery dataset with pixel-level annotations for 66 object categories, designed for robust semantic segmentation across varied geographic locations and conditions.",Mapillary AB,https://openaccess.thecvf.com/content_ICCV_2017/papers/Neuhold_The_Mapillary_Vistas_ICCV_2017_paper.pdf,https://github.com/mapillary/mapillary_vistas,2017-08-01,Research; Development; Deployment,"Semantic segmentation, Panoptic segmentation, Scene understanding, Robust perception",Core Performance; Robustness,Vision (Image),Structured Data,User-generated content,Expert annotations,Medium (1K - 100K samples),"Train (18K), Val (2K), Test (5K)",Fixed data-driven (static test set),Automatic (Reference-based),1. Model receives street-level image 2. Model predicts per-pixel semantic labels (66 classes) 3. mIoU computed across all classes,Outputs,True,"Test set labels private, evaluation via online platform","Expert annotators, multi-stage quality control, geographic diversity validation",unknown,PSPNet: 42.7 mIoU DeepLab v3+: 45.8 mIoU HRNetV2: 50.3 mIoU,Geographic diversity tested; Weather variations included; Multiple runs per sample,Varying image quality from crowdsourced data; Camera parameter diversity; Some regions overrepresented; Annotation inconsistencies across diverse scenes,"Cityscapes, BDD100K, IDD, WildDash"
MPII Human Pose,"A benchmark for human pose estimation with 25K images containing over 40K annotated people with 16 body joints, covering diverse activities and viewpoints.",Max Planck Institute for Informatics,https://openaccess.thecvf.com/content_cvpr_2014/papers/Andriluka_2D_Human_Pose_2014_CVPR_paper.pdf,https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/software-and-datasets/mpii-human-pose-dataset,2014-06-01,Research; Development,"Human pose estimation, Keypoint detection, Articulated pose estimation, Activity recognition",Core Performance,Vision (Image),Structured Data,New dataset (released with eval),Human annotations,Medium (1K - 100K samples),"Train (~29K people), Test (~12K people)",Fixed data-driven (static test set),Automatic (Reference-based),1. Model receives image with person 2. Model predicts 2D joint locations (16 joints) 3. PCKh (Percentage of Correct Keypoints) at various thresholds,Outputs,False,,"Manual annotation with consistency checks, multi-annotator agreement for difficult poses",unknown,Hourglass Network: 90.9 PCKh@0.5 HRNet: 92.3 PCKh@0.5 SimpleBaseline: 91.5 PCKh@0.5,Multiple difficulty levels; Inter-rater reliability; Occlusion analysis,2D annotations only (no 3D); Occlusion and truncation challenges; Some joint definitions ambiguous; Dataset bias toward certain activities,"COCO Keypoints, Human3.6M, PoseTrack, CrowdPose"
Open Images,"A large-scale multi-task dataset with ~9M images annotated for image classification (20K classes), object detection (600 classes), visual relationships, and segmentation masks.",Google Research,https://arxiv.org/abs/1811.00982,https://github.com/openimages/dataset,2018-01-01,Research; Development; Selection,"Object detection, Image classification, Visual relationship detection, Instance segmentation",Core Performance; Robustness,Vision (Image),Structured Data,New dataset (released with eval),Human annotations; Crowdsourced annotations,Very Huge (> 100M samples),"Train, Val, Test",Fixed data-driven (static test set),Automatic (Reference-based),"1. Model receives image 2. Model performs classification, detection, or segmentation 3. mAP computed for detection, top-k accuracy for classification",Outputs,True,"Test set labels private for challenges, public validation set available","Multi-stage crowdsourced annotation with verification, automated quality filters",unknown,Faster R-CNN: 54.3 mAP (detection) YOLOv4: 55.8 mAP EfficientDet: 56.1 mAP,Multiple runs per sample; Confidence intervals; Large-scale diversity,Long-tail class distribution; Annotation inconsistencies at scale; Some classes poorly defined; Label noise in crowdsourced annotations,"COCO, LVIS, Objects365, ImageNet"
ScanNet,"A richly-annotated 3D dataset of indoor scenes with RGB-D scans, semantic segmentation, instance segmentation, and 3D object bounding boxes for 1513 scenes.","Stanford University, Princeton University, Technical University of Munich",https://arxiv.org/abs/1702.04405,https://github.com/ScanNet/ScanNet,2017-04-01,Research; Development,"3D scene understanding, Semantic segmentation, Instance segmentation, 3D reconstruction",Core Performance,Vision (Image); Video; Structured Data,Structured Data,New dataset (released with eval),Human annotations; Programmatically generated,Medium (1K - 100K samples),"Train (1201), Val (312), Test (100)",Fixed data-driven (static test set),Automatic (Reference-based),"1. Model receives 3D scene (RGB-D scans) 2. Model predicts semantic labels or instance masks 3. mIoU for semantic segmentation, mAP for instance segmentation",Outputs,True,"Test set labels private, evaluation via online benchmark server","Manual verification of 3D reconstructions, multi-annotator consistency for semantic labels",unknown,PointNet++: 53.5 mIoU SparseConvNet: 72.5 mIoU MinkowskiNet: 73.6 mIoU,Multiple runs per sample; Ablation studies,Limited to indoor scenes; Reconstruction artifacts; Scanning noise and occlusions; Limited scene diversity (mostly offices and apartments),"Matterport3D, S3DIS, 2D-3D-S, ARKitScenes"
nuScenes,"A large-scale autonomous driving dataset with 1000 scenes (40K frames) featuring 3D object annotations, tracking IDs, and multimodal sensor data including cameras, LiDAR, and radar.",Motional (formerly nuTonomy),https://arxiv.org/abs/1903.11027,https://github.com/nutonomy/nuscenes-devkit,2019-03-26,Research; Development; Deployment,"3D object detection, Multi-object tracking, Prediction, Sensor fusion, Scene understanding",Core Performance; Robustness,Vision (Image); Video; Structured Data,Structured Data,New dataset (released with eval),Human annotations,Medium (1K - 100K samples),"Train (700), Val (150), Test (150)",Fixed data-driven (static test set),Automatic (Reference-based),"1. Model receives multimodal sensor data (cameras, LiDAR, radar) 2. Model predicts 3D bounding boxes and tracking IDs 3. NDS (nuScenes Detection Score) and mAP computed",Outputs,True,"Test set labels private, evaluation via online leaderboard","Expert annotators, multi-sensor consistency checks, temporal coherence validation",unknown,PointPillars: 45.3 NDS CenterPoint: 65.5 NDS BEVFusion: 72.9 NDS,Geographic diversity; Weather conditions; Day/night variations; Multiple runs per sample,"Limited geographic coverage (Boston, Singapore); Sensor calibration challenges; Annotation latency for fast-moving objects; Class imbalance","Waymo Open Dataset, KITTI, Argoverse, Lyft Level 5, A2D2"
ActivityNet,"A large-scale video dataset for human activity understanding with 200 activity classes and temporal annotations, designed for action recognition and temporal action localization.","Universidad del Norte, KAUST",https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Heilbron_ActivityNet_A_Large-Scale_2015_CVPR_paper.pdf,https://github.com/activitynet/ActivityNet,2015-06-01,Research; Development,"Temporal action detection, Action recognition, Dense video captioning, Video understanding",Core Performance,Video,Structured Data; Text,New dataset (released with eval),Human annotations,Medium (1K - 100K samples),"Train (50%), Val (25%), Test (25%)",Fixed data-driven (static test set),Automatic (Reference-based),"1. Model receives untrimmed video 2. Model predicts action segments with class labels 3. mAP at different IoU thresholds (0.5, 0.75, 0.95)",Outputs,True,"Test set used for annual challenges, labels withheld","Multi-annotator temporal boundary agreement, consistency checks for activity definitions",unknown,SSN: 41.3 mAP@0.5 BMN: 50.1 mAP@0.5 TALLFormer: 59.8 mAP@0.5,Inter-rater reliability; Multiple IoU thresholds; Temporal boundary sensitivity,YouTube video availability issues; Temporal boundary ambiguity; Action class overlap; Video quality variance,"THUMOS14, Kinetics, Charades, MultiTHUMOS, AVA"
DAVIS (Densely Annotated VIdeo Segmentation),A video object segmentation benchmark with high-quality pixel-level annotations for densely segmenting objects in video sequences.,"ETH Zurich",https://arxiv.org/abs/1704.00675,https://github.com/davisvideochallenge/davis,2016-10-01,Research; Development,"Video object segmentation, Temporal consistency, Object tracking, Dense prediction",Core Performance; Robustness,Video,Structured Data,New dataset (released with eval),Human annotations,Small (< 1K samples),"Train/Val (60 sequences), Test-dev (30 sequences)",Fixed data-driven (static test set),Automatic (Reference-based),1. Model receives video sequence with first frame annotation 2. Model propagates segmentation to subsequent frames 3. J&F metric (region similarity and contour accuracy),Outputs,True,"Test-challenge set for competitions, labels withheld",High-quality manual annotations with temporal consistency verification,unknown,OSVOS: 79.8 J&F STM: 84.3 J&F XMem: 86.2 J&F,Temporal consistency tested; Occlusion robustness; Multiple runs per sample,Small dataset size; Limited object categories; Short video sequences; Primarily objects with clear boundaries,"YouTube-VOS, FBMS, SegTrack, OVIS"
VQA (Visual Question Answering),"A dataset with open-ended questions about images requiring visual understanding and reasoning, containing 265K images and over 1M questions.","Virginia Tech, Facebook AI Research",https://arxiv.org/abs/1505.00468,https://github.com/GT-Vision-Lab/VQA,2015-10-01,Research; Development,"Visual question answering, Visual reasoning, Multimodal understanding, Common sense reasoning",Core Performance; Robustness,Text + Vision,Text,MS COCO,Crowdsourced annotations,Huge (> 10M samples),"Train, Val, Test",Fixed data-driven (static test set),Automatic (Reference-based); Human: Representative sample,1. Model receives image and question 2. Model generates answer 3. Accuracy computed with consensus matching (multiple human answers),Outputs,True,"Test-dev and test-std splits, evaluation via server","Multiple human answers per question (10 answers), consensus-based evaluation",unknown,Bottom-Up Top-Down: 70.3% VILBERT: 72.4% OSCAR: 73.8% BLIP: 78.3%,Multiple human references; Prompt variations tested; Inter-rater reliability,Language bias (can answer many questions without image); Dataset bias toward common objects; Answer distribution imbalance; Ambiguous questions,"GQA, VQA v2, OK-VQA, TextVQA, VizWiz"
CLEVR,"A diagnostic dataset for compositional visual reasoning with 100K synthetic images and 1M questions testing spatial relations, counting, and logic.","Stanford University, Facebook AI Research",https://arxiv.org/abs/1612.06890,https://github.com/facebookresearch/clevr-dataset-gen,2016-12-20,Research; Development,"Compositional reasoning, Spatial reasoning, Counting, Logical reasoning, Visual reasoning",Core Performance; Robustness,Text + Vision,Text,Synthetic/Generated,Programmatically generated,Large (100K - 1M samples),"Train (70K images), Val (15K), Test (15K)",Fixed data-driven (static test set),Automatic (Reference-based),"1. Model receives synthetic image and question 2. Model predicts answer from predefined set 3. Accuracy computed, can analyze by question type",Outputs,False,,"Programmatically generated with known ground truth, exhaustive question type coverage",unknown,CNN+LSTM: 52.3% FiLM: 97.7% NS-VQA: 99.8% MAC: 98.9%,Ablation studies; Question type analysis; Compositional generalization tested,Synthetic domain (limited real-world applicability); Simple shapes and colors; No natural language variation; Programmatic biases,"CLEVR-CoGenT, GQA, NLVR2, CLOSURE"
Waymo Open Dataset,"A large-scale autonomous driving dataset with 1,150 driving scenes, synchronized LiDAR and high-resolution camera data, and roughly 12M 3D bounding-box annotations.",Waymo LLC,https://arxiv.org/abs/1912.04838,https://github.com/waymo-research/waymo-open-dataset,2019-12-10,Research; Development; Deployment,"3D object detection, 2D object detection, Tracking, Domain adaptation, Sensor fusion",Core Performance; Robustness,Vision (Image); Video; Structured Data,Structured Data,New dataset (released with eval),Human annotations; Programmatically generated,Huge (> 10M samples),"Train (798), Val (202), Test (150)",Fixed data-driven (static test set),Automatic (Reference-based),1. Model receives multimodal sensor data 2. Model predicts 3D bounding boxes with tracking IDs 3. AP/APH metrics at different IoU thresholds and difficulty levels,Outputs,True,"Test set labels private, evaluation via online leaderboard","Multi-stage annotation pipeline with quality checks, temporal consistency validation",unknown,PointPillars: 63.8 L2 APH (Vehicle) CenterPoint: 73.9 L2 APH PV-RCNN++: 77.8 L2 APH,Geographic diversity; Time of day variations; Weather conditions; Multiple difficulty levels,Geographic concentration in specific cities; Sensor-specific challenges; Annotation latency for distant objects; Class imbalance,"nuScenes, KITTI, Argoverse 2, Once, Lyft Level 5"
Fashion-MNIST,"A drop-in replacement for MNIST with 70K grayscale images of fashion products across 10 categories, designed to be a more challenging benchmark for image classification.",Zalando Research,https://arxiv.org/abs/1708.07747,https://github.com/zalandoresearch/fashion-mnist,2017-08-25,Research; Development; Selection,"Image classification, Transfer learning, Benchmark comparison",Core Performance,Vision (Image),Structured Data,New dataset (released with eval),Human annotations,Medium (1K - 100K samples),"Train (60K), Test (10K)",Fixed data-driven (static test set),Automatic (Reference-based),1. Model receives 28x28 grayscale image 2. Model predicts fashion category (10 classes) 3. Classification accuracy computed,Outputs,False,,Product catalog labels verified by domain experts,unknown,Linear Classifier: 83.7% CNN: 93.5% ResNet: 94.9% Vision Transformer: 95.1%,Multiple runs per sample; Seed variation tested,Low resolution (28x28); Grayscale only; Limited intra-class variation; Some categories overlap visually,"MNIST, EMNIST, Kuzushiji-MNIST, DeepFashion"
MMLU (Massive Multitask Language Understanding),"A comprehensive benchmark with 57 tasks spanning STEM, humanities, social sciences, and more, designed to measure multitask accuracy and knowledge breadth in language models.","UC Berkeley, Columbia University",https://arxiv.org/abs/2009.03300,https://github.com/hendrycks/test,2020-09-07,Research; Development; Selection,"World knowledge, Reasoning, Domain expertise across 57 subjects",Core Performance,Text,Text,New dataset (released with eval),Expert annotations,Medium (1K - 100K samples),"Dev (5 few-shot examples per task, 285 total), Val (1,540), Test (14,079, ~247 per task)",Fixed data-driven (static test set),Automatic (Reference-based),1. Model receives multiple choice question with 4 options 2. Model predicts answer (A/B/C/D) 3. Accuracy computed per subject and overall,Outputs,False,,"Questions sourced from practice exams and tests, expert review for correctness",unknown,GPT-3 (175B): 43.9% GPT-3.5: 70.0% GPT-4: 86.4% Claude 3.5 Sonnet: 88.7% Random baseline: 25%,Multiple evaluation runs; Few-shot prompting variations; Subject-wise analysis,Multiple choice format may not reflect real-world usage; Some questions have ambiguous answers; Dataset contamination concerns with web-trained models; Cultural and knowledge cutoff biases,"MMLU-Pro, AGIEval, C-Eval, CMMLU"
HumanEval,A code generation benchmark with 164 hand-written programming problems to evaluate functional correctness of synthesized Python code.,OpenAI,https://arxiv.org/abs/2107.03374,https://github.com/openai/human-eval,2021-07-07,Research; Development; Selection,"Code generation, Programming ability, Functional correctness, Code understanding",Core Performance,Text; Code,Code,New dataset (released with eval),Author-provided,Small (< 1K samples),Test (164 problems),Fixed data-driven (static test set),Automatic (Execution-based),1. Model receives function signature and docstring 2. Model generates function implementation 3. Generated code executed against unit tests 4. pass@k metric computed (% passing all tests in k samples),Outputs,False,,"Hand-written problems with comprehensive unit tests, manual verification of test correctness",unknown,Codex (12B): 28.8% pass@1 GPT-3.5-turbo: 48.1% pass@1 GPT-4: 67.0% pass@1 Claude 3.5 Sonnet: 92.0% pass@1,Multiple samples per problem (pass@k); Temperature sensitivity tested; Execution-based verification,Limited to Python; Small dataset size (164 problems); Relatively simple problems; May be contaminated in training data; No testing of code efficiency or style,"MBPP, APPS, CodeContests, HumanEval+, MultiPL-E"
HellaSwag,"A benchmark for commonsense natural language inference about physical situations, requiring models to complete scenarios with the most plausible continuation.","University of Washington, Allen Institute for AI",https://arxiv.org/abs/1905.07830,https://github.com/rowanz/hellaswag,2019-05-19,Research; Development,"Commonsense reasoning, Physical understanding, Situation modeling, Plausibility judgment",Core Performance; Robustness,Text,Text,New dataset (released with eval),Crowdsourced annotations,Medium (1K - 100K samples),"Train (39,905), Val (10,042), Test (10,003)",Fixed data-driven (static test set),Automatic (Reference-based),1. Model receives scenario context 2. Model selects most plausible continuation from 4 options 3. Accuracy computed,Outputs,False,,"Adversarial filtering using BERT to ensure difficulty, human validation of plausibility",unknown,BERT-Large: 47.3% GPT-2: 50.9% GPT-3: 78.9% GPT-4: 95.3% Human performance: 95.6%,Adversarial filtering; Multiple runs per sample; Human baseline comparison,Dataset may be easier than originally intended for modern LLMs; Multiple choice format; Adversarial examples may have artifacts; Potential data contamination,"PIQA, WinoGrande, CommonsenseQA, ARC"
GSM8K (Grade School Math 8K),"A dataset of 8,500 grade school math word problems requiring multi-step arithmetic reasoning to solve.",OpenAI,https://arxiv.org/abs/2110.14168,https://github.com/openai/grade-school-math,2021-10-27,Research; Development,"Mathematical reasoning, Multi-step problem solving, Arithmetic reasoning, Chain-of-thought reasoning",Core Performance,Text,Text,New dataset (released with eval),Author-provided,Medium (1K - 100K samples),"Train (7,473), Test (1,319)",Fixed data-driven (static test set),Automatic (Reference-based),1. Model receives math word problem 2. Model generates solution with reasoning steps 3. Final numerical answer extracted and compared to ground truth,Outputs,False,,"Human-written problems with verified solutions, consistency checks for answer correctness",unknown,GPT-3 (175B): 34.2% GPT-3.5: 57.1% GPT-4: 92.0% GPT-4o: 96.1% Claude 3.5 Sonnet: 96.4%,Multiple runs per sample; Chain-of-thought prompting tested; Answer extraction robustness,Limited to grade-school level math; Answer extraction can be brittle; Some problems have ambiguous wording; Potential contamination in training data,"MATH, GSM-Hard, SVAMP, ASDiv, MathQA"
TruthfulQA,A benchmark measuring whether language models generate truthful answers to questions that humans might answer falsely due to misconceptions or false beliefs.,"Oxford University, OpenAI",https://arxiv.org/abs/2109.07958,https://github.com/sylinrl/TruthfulQA,2021-09-16,Research; Development; Safety,"Truthfulness, Factual accuracy, Resistance to misconceptions, Calibration",Safety; Core Performance; Calibration,Text,Text,New dataset (released with eval),Expert annotations,Small (< 1K samples),Test (817 questions across 38 categories),Fixed data-driven (static test set),Model-based: Expert; Human: Experts,1. Model receives question designed to elicit false beliefs 2. Model generates answer 3. Answers judged for truthfulness and informativeness using GPT-judge or human evaluation,Outputs,False,,"Expert-curated questions targeting known misconceptions, multi-rater validation of truth labels",unknown,GPT-3 (175B): 58.0% GPT-3.5: 47.0% GPT-4: 59.0% Claude 2: 62.0% Human baseline: 94.0%,"Multiple evaluation methods (MC1, MC2, generative); Human validation; Model-based evaluation correlation",Subjective truthfulness judgments in some cases; Cultural bias in what constitutes 'truth'; Model-based evaluation may not align with human judgment; Limited coverage of misconceptions,"FactScore, HaluEval, SelfCheckGPT, FELM"
BIG-bench (Beyond the Imitation Game),"A collaborative benchmark with 204 tasks spanning linguistics, child development, math, commonsense reasoning, biology, physics, social bias, and software development.","Google Research, 450+ authors",https://arxiv.org/abs/2206.04615,https://github.com/google/BIG-bench,2022-06-09,Research; Development,"Diverse capabilities across 204 tasks including reasoning, knowledge, language understanding, bias detection",Core Performance; Fairness; Robustness,Text,Text,New dataset (released with eval),Multiple/Mixed sources,Large (100K - 1M samples),Varies by task,Composite,Automatic (Reference-based); Model-based: In the wild,1. Model evaluated on 204 diverse tasks 2. Each task has its own evaluation protocol 3. Performance aggregated across tasks,Outputs,False,,"Crowdsourced task creation with quality review, diverse authorship for broad coverage",unknown,Average human rater: 89.0% Few-shot PaLM (540B): 65.7% GPT-4: ~83% (estimated on BIG-Bench Hard),Multiple tasks provide robustness; Human baseline comparison; Cross-task analysis,Task quality varies; Some tasks too easy or too hard; Computational cost of running all 204 tasks; Aggregation methodology debatable,"BIG-Bench Hard, MMLU, HELM, SuperGLUE"
MATH,"A dataset of 12,500 problems from high school mathematics competitions, requiring advanced mathematical reasoning and multi-step problem-solving.",UC Berkeley,https://arxiv.org/abs/2103.03874,https://github.com/hendrycks/math,2021-03-05,Research; Development,"Advanced mathematical reasoning, Problem solving, Multi-step reasoning, Mathematical knowledge",Core Performance,Text,Text,New dataset (released with eval),Published references,Medium (1K - 100K samples),"Train (7,500), Test (5,000)",Fixed data-driven (static test set),Automatic (Reference-based),1. Model receives competition math problem 2. Model generates solution with steps 3. Final answer extracted and checked against ground truth 4. Problems span 7 subjects with 5 difficulty levels,Outputs,False,,"Problems from real math competitions with verified solutions, difficulty levels validated",unknown,GPT-3: 6.9% GPT-4: 42.5% Minerva (540B): 33.6% GPT-4 Turbo: 52.9% Claude 3.5 Sonnet: 71.1%,Multiple difficulty levels; Subject-wise analysis; Chain-of-thought evaluation,Answer extraction challenges; LaTeX formatting issues; Symbolic vs numeric answers; High difficulty may not reflect practical math usage,"GSM8K, MathQA, SVAMP, ASDiv"
ARC (AI2 Reasoning Challenge),"A dataset of 7,787 science exam questions from grades 3-9, designed to require reasoning beyond simple retrieval or pattern matching.",Allen Institute for AI,https://arxiv.org/abs/1803.05457,https://github.com/allenai/arc,2018-03-14,Research; Development,"Scientific reasoning, Commonsense reasoning, Knowledge retrieval, Multi-hop reasoning",Core Performance; Robustness,Text,Text,New dataset (released with eval),Existing dataset labels,Medium (1K - 100K samples),"Easy (2,251 train, 570 dev, 2,376 test), Challenge (1,119 train, 299 dev, 1,172 test)",Fixed data-driven (static test set),Automatic (Reference-based),1. Model receives science exam question with multiple choice options 2. Model selects answer 3. Accuracy computed separately for Easy and Challenge sets,Outputs,False,,"Real exam questions filtered to require reasoning, partitioned by difficulty using retrieval-based baseline",unknown,ARC-Challenge: BERT: 59.1% GPT-3: 51.4% GPT-3.5: 85.2% GPT-4: 96.3%,Two difficulty levels; Multiple choice format variations; Retrieval baseline comparison,Multiple choice format; Some questions solvable without reasoning; Challenge set becoming saturated; Grade-school level may not test advanced reasoning,"OpenBookQA, CommonsenseQA, QASC, SciQ"
DROP (Discrete Reasoning Over Paragraphs),"A reading comprehension benchmark requiring discrete reasoning operations over paragraph content, including sorting, counting, and arithmetic.","Allen Institute for AI, University of Washington",https://arxiv.org/abs/1903.00161,https://github.com/allenai/drop,2019-03-01,Research; Development,"Reading comprehension, Numerical reasoning, Discrete operations, Multi-hop reasoning",Core Performance,Text,Text,New dataset (released with eval),Crowdsourced annotations,Medium (1K - 100K samples),"Train (77,409), Dev (9,536), Test (9,622)",Fixed data-driven (static test set),Automatic (Reference-based),"1. Model receives paragraph and question 2. Model generates answer (number, span, or date) 3. F1 and Exact Match metrics computed",Outputs,False,,"Crowdsourced questions with verification, requires discrete reasoning operations confirmed through analysis",unknown,BERT: 47.0 F1 RoBERTa: 80.9 F1 GPT-3: 29.0 F1 GPT-4: 80.9 F1 Human performance: 96.4 F1,Multiple answer types; Question type analysis; Human baseline comparison,Requires careful answer extraction; Some questions ambiguous; Numerical reasoning can be brittle; Limited diversity in reasoning types,"SQuAD, NewsQA, NaturalQuestions, NumGLUE, TAT-QA"
WinoGrande,"A large-scale commonsense reasoning benchmark with 44,000 problems requiring resolving pronoun ambiguity through world knowledge and commonsense.","Allen Institute for AI, University of Washington",https://arxiv.org/abs/1907.10641,https://github.com/allenai/winogrande,2019-07-24,Research; Development,"Commonsense reasoning, Coreference resolution, World knowledge, Causal reasoning",Core Performance; Robustness,Text,Text,New dataset (released with eval),Crowdsourced annotations,Medium (1K - 100K samples),"Train (40,398), Dev (1,267), Test (1,767)",Fixed data-driven (static test set),Automatic (Reference-based),1. Model receives sentence with pronoun ambiguity 2. Model selects correct referent from 2 options 3. Accuracy computed,Outputs,False,,"Adversarial filtering using language models, crowdsourced generation with validation",unknown,BERT-Large: 59.4% RoBERTa-Large: 79.1% GPT-3: 70.2% GPT-4: 87.5% Human performance: 94.0%,Adversarial filtering; Large-scale dataset; Multiple difficulty levels,Binary choice may be limiting; Adversarial filtering may introduce artifacts; Saturation with modern models; Limited reasoning depth,"Winograd Schema Challenge, COPA, CommonsenseQA, PIQA"
BBH (BIG-Bench Hard),"A curated subset of 23 challenging tasks from BIG-Bench where language models perform below human raters, focusing on tasks requiring multi-step reasoning.",Google Research,https://arxiv.org/abs/2210.09261,https://github.com/suzgunmirac/BIG-Bench-Hard,2022-10-17,Research; Development,"Complex reasoning, Multi-step thinking, Challenging cognitive tasks, Chain-of-thought reasoning",Core Performance,Text,Text,One or Multiple existing datasets,Multiple/Mixed sources,Medium (1K - 100K samples),"Test (6,511 examples across 23 tasks)",Fixed data-driven (static test set),Automatic (Reference-based),1. Model receives task from one of 23 challenging categories 2. Model generates answer 3. Performance aggregated across all tasks 4. Chain-of-thought prompting typically used,Outputs,False,,"Tasks selected where models underperform humans, difficulty validated through empirical testing",unknown,PaLM (540B): 56.5% average PaLM (540B) + CoT: 78.1% GPT-4: ~91% Human raters: ~92% average,Multiple tasks; Chain-of-thought evaluation; Human baseline comparison; Cross-model validation,Only 23 tasks (limited coverage); Chain-of-thought prompting required for good performance; Task aggregation methodology; Rapidly saturating with newer models,"BIG-Bench, MMLU, AGIEval, HELM"
MT-Bench,"A multi-turn conversational benchmark with 80 high-quality multi-turn questions spanning 8 categories, evaluated using GPT-4 as a judge.","UC Berkeley, UCSD, CMU, MBZUAI",https://arxiv.org/abs/2306.05685,https://github.com/lm-sys/FastChat,2023-06-09,Development; Selection,"Multi-turn conversation, Instruction following, Reasoning, Writing, Role-playing, Knowledge",Core Performance; Core Quality Dimensions,Text,Text,New dataset (released with eval),Author-provided,Small (< 1K samples),Test (80 questions with 2 turns each),Fixed data-driven (static test set),Model-based: In the wild,1. Model engages in 2-turn conversation 2. GPT-4 evaluates responses on 10-point scale 3. Scores averaged across turns and categories,Outputs,False,,"Strong correlation with human preferences validated on Chatbot Arena data, GPT-4 judge agreement measured",unknown,Vicuna-13B: 6.39 GPT-3.5-turbo: 7.94 Claude 2: 8.06 GPT-4: 8.99 GPT-4-turbo: 9.32,Multiple categories; Position bias mitigation; Agreement with human ratings validated; Pairwise comparison,Small dataset (80 questions); GPT-4 judge may have biases; Evaluation cost; Model-based judge limitations; Prompt sensitivity,"Chatbot Arena, AlpacaEval, Arena-Hard, LiveBench"
GPQA (Google-Proof Q&A),"A challenging multiple-choice benchmark of 448 expert-level questions in biology, physics, and chemistry designed to be difficult even for skilled non-experts with internet access.",New York University,https://arxiv.org/abs/2311.12022,https://github.com/idavidrein/gpqa,2023-11-20,Research; Development,"Expert-level knowledge, Scientific reasoning, Domain expertise, Graduate-level understanding",Core Performance; Robustness,Text,Text,New dataset (released with eval),Expert annotations,Small (< 1K samples),"Main (448), Extended (546), Diamond (198 highest quality)",Fixed data-driven (static test set),Automatic (Reference-based),1. Model receives graduate-level science question 2. Model selects from 4 multiple choice options 3. Accuracy computed 4. Questions validated to be difficult for non-experts with Google access,Outputs,False,,"PhD-level experts write and validate questions, non-expert validators with Google access achieve <35% accuracy",unknown,Random: 25% Non-expert w/ Google: 34% Expert: 81% GPT-4: 39% Claude 3 Opus: 59.4% GPT-4o: 53.6%,Expert validation; Non-expert baseline; Multiple subject areas; Quality tiers (Diamond subset),Small dataset size; Limited to 3 scientific domains; Multiple choice format; High cost of expert question creation; Cultural bias toward Western scientific education,"MMLU-Pro, JEE-Advanced, SciBench, LiveBench"
IFEval (Instruction-Following Eval),"A benchmark of 541 prompts with verifiable instructions testing models' ability to follow precise formatting, length, and structural constraints.",Google DeepMind,https://arxiv.org/abs/2311.07911,https://github.com/google-research/google-research/tree/master/instruction_following_eval,2023-11-13,Research; Development; Selection,"Instruction following, Constraint satisfaction, Format compliance, Precise control",Core Performance; Robustness,Text,Text,New dataset (released with eval),Programmatically generated,Small (< 1K samples),Test (541 prompts covering ~25 types of verifiable instructions),Fixed data-driven (static test set),Automatic (Reference-free),"1. Model receives prompt with verifiable instructions (e.g., 'respond in exactly 3 paragraphs', 'include word X at least 5 times') 2. Model generates response 3. Programmatic checks verify instruction compliance 4. Strict and loose accuracy metrics computed",Outputs,False,,"Verifiable instructions with programmatic checking, no ambiguity in correctness",unknown,GPT-3.5: 57.4% strict GPT-4: 76.9% strict Claude 2: 66.7% strict Gemini Ultra: 79.4% strict,Strict and loose metrics; Multiple instruction types; Programmatic verification; No human evaluation needed,Limited to verifiable instructions only; May not reflect realistic usage; Some instructions may be conflicting; Excludes semantic quality assessment,"MT-Bench, AlpacaEval, InstructGPT evals, FollowBench"