| benchmark | category | year | size | languages | hf_dataset_id | data_access | visualization_complexity | influence | priority_score | batch | notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MBPP+ | Code Generation | 2023 | 399 | Python | evalplus/mbppplus | easy | simple | high | 9 | 1 | Natural companion to HumanEval+; same EvalPlus ecosystem |
| ClassEval | Code Generation | 2023 | 100 classes (410 methods) | Python | FudanSELab/ClassEval | easy | moderate | high | 9 | 1 | Class-level code generation with test classes |
| LiveCodeBench | Code Generation | 2024 | 1055+ | Python | livecodebench/code_generation_lite | easy | moderate | high | 9 | 1 | Continuously updated; contamination-free; high community interest |
| DebugBench | Code Editing/Debugging | 2024 | 4253 | C++, Java, Python | Rtian/DebugBench | easy | moderate | high | 8 | 1 | Buggy code with implanted bugs; 4 categories; 18 minor types |
| HumanEval-X | Code Translation | 2022 | 820 (164x5) | Python, C++, Java, JS, Go | THUDM/humaneval-x | easy | moderate | high | 8 | 1 | Same 164 problems in 5 languages with test cases |
| SWE-bench Lite | Code Editing | 2024 | 300 | Python | princeton-nlp/SWE-bench_Lite | easy | complex | very high | 8 | 2 | GitHub issue resolution; extremely high-profile |
| CodeContests | Code Generation | 2022 | 13328 | C++, Python, Java | deepmind/code_contests | easy | moderate | high | 8 | 2 | AlphaCode benchmark; competitive programming |
| APPS | Code Generation | 2021 | 10000 | Python | codeparrot/apps | easy | moderate | high | 7 | 2 | Large-scale coding problems at 3 difficulty levels |
| CanItEdit | Code Editing | 2023 | 105 | Python | nuprl/CanItEdit | easy | simple | medium | 7 | 2 | Before/after code editing with dual instruction types |
| MBPP | Code Generation | 2021 | 974 | Python | google-research-datasets/mbpp | easy | simple | high | 7 | 2 | Original MBPP; foundational benchmark |
| DS-1000 | Code Generation | 2023 | 1000 | Python | xlangai/DS-1000 | easy | moderate | high | 7 | 3 | Data science library-specific problems (NumPy/Pandas/etc.) |
| CodeEditorBench | Code Editing | 2024 | 7961 | Multiple | m-a-p/CodeEditorBench | easy | moderate | medium | 7 | 3 | 4 editing scenarios: debug/translate/polish/requirement switch |
| SAFIM | Code Completion | 2024 | 17720 | Python, Java, C++, C# | gonglinyuan/safim | easy | moderate | medium | 7 | 3 | Syntax-aware fill-in-the-middle; 3 subtasks |
| BigVul | Vulnerability Detection | 2020 | 190000 | C/C++ | bstee615/bigvul | easy | moderate | medium | 6 | 3 | CVE-linked vulnerability detection; 91 CWE types |
| RepoBench | Code Completion | 2023 | 10000+ | Python, Java | tianyang/repobench-c | easy | complex | medium | 6 | 3 | Repo-level code completion with 3 sub-tasks |
| MultiPL-E | Code Generation/Translation | 2023 | HumanEval+MBPP in 22 langs | 22 languages | nuprl/MultiPL-E | easy | moderate | medium | 6 | 4 | Translations of HumanEval/MBPP to 22 languages |
| DiverseVul | Vulnerability Detection | 2023 | 350000+ | C/C++ | claudios/DiverseVul | easy | simple | medium | 6 | 4 | Large-scale vulnerability detection; 150 CWEs |
| PrimeVul | Vulnerability Detection | 2024 | 236000+ | C/C++ | starsofchance/PrimeVul | easy | simple | medium | 6 | 4 | Highest quality labels for vuln detection |
| McEval | Code Generation | 2024 | 16000 | 40 languages | Multilingual-Multimodal-NLP/McEval | easy | complex | medium | 6 | 4 | Massive language coverage |
| CodeSearchNet | Code Search/Summarization | 2019 | 2000000 | Python, JS, Ruby, Go, Java, PHP | code-search-net/code_search_net | easy | moderate | medium | 6 | 4 | Foundational code search benchmark |
| xCodeEval | Multi-task | 2023 | 25000000 | 11-17 languages | NTU-NLP-sg/xCodeEval | easy | very complex | medium | 5 | 5 | 7 tasks; very large; complex format |
| Devign | Vulnerability Detection | 2019 | 20756 | C | google/code_x_glue_cc_defect_detection | easy | simple | medium | 5 | 5 | Function-level vulnerability identification |
| CrossVul | Vulnerability Detection | 2021 | 9313 | 40+ languages | hitoshura25/crossvul | easy | simple | medium | 5 | 5 | Cross-language vulnerability detection |
| SWE-bench Verified | Code Editing | 2024 | 500 | Python | princeton-nlp/SWE-bench_Verified | easy | complex | high | 5 | 5 | Curated subset of SWE-bench |
| CoderEval | Code Generation | 2023 | 460 | Python, Java | N/A (GitHub only) | medium | complex | medium | 4 | deferred | Requires project-level context |
| NaturalCodeBench | Code Generation | 2024 | 402 | Python, Java | N/A (GitHub only) | medium | moderate | medium | 4 | deferred | Only dev set released (140 problems) |
| DevEval | Code Generation | 2024 | 1874 | Python | N/A (GitHub only) | medium | complex | medium | 4 | deferred | Repository-level; complex dependencies |
| RunBugRun | Program Repair | 2023 | 450000+ | 9 languages | N/A (GitHub/SQLite) | hard | complex | medium | 3 | deferred | SQLite format; complex infrastructure |
| Defects4J | Program Repair | 2014 | 854 | Java | N/A (GitHub only) | hard | very complex | high | 3 | deferred | Requires Java tooling; full project repos |
| ConDefects | Program Repair | 2023 | 2879 | Java, Python | N/A (GitHub only) | medium | moderate | medium | 3 | deferred | AtCoder buggy/fixed pairs |
| FixEval | Program Repair | 2023 | varies | Python, Java | N/A (GitHub only) | medium | moderate | low | 3 | deferred | Competitive programming fixes |
| TransCoder | Code Translation | 2020 | 852 | Java, Python, C++ | N/A (GitHub only) | medium | moderate | medium | 3 | deferred | Facebook Research; unsupervised translation |
| AVATAR | Code Translation | 2021 | 9515 | Java, Python | N/A (GitHub only) | medium | moderate | low | 3 | deferred | Parallel Java-Python corpus |
| TypeEvalPy | Type Inference | 2023 | 154 | Python | N/A (GitHub only) | medium | moderate | low | 3 | deferred | Niche; type inference evaluation |
| VJBench | Vulnerability Repair | 2023 | 42 | Java | N/A (GitHub only) | hard | complex | low | 2 | deferred | Very small; requires Java tooling |
| SVEN | Vulnerability Detection | 2023 | 1606 | C/C++ | N/A (GitHub only) | medium | moderate | low | 2 | deferred | Small; security hardening focus |
| PyTER | Type Error Repair | 2022 | 93 | Python | N/A (Figshare) | hard | complex | low | 2 | deferred | Very small; niche |
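
The `batch` and `priority_score` columns appear to encode a processing order. The sketch below shows one way to turn them into a work plan. It rests on assumptions the catalog does not state: that numbered batches run in ascending order, that `deferred` rows are skipped, and that higher scores come first within a batch. The `plan_batches` helper and the `ROWS` sample are hypothetical names introduced for illustration. The rows are copied from the catalog and reduced to four columns.

```python
from collections import defaultdict

# Sample rows copied from the catalog, reduced to
# (benchmark, hf_dataset_id, priority_score, batch).
ROWS = [
    ("MBPP+", "evalplus/mbppplus", 9, "1"),
    ("DebugBench", "Rtian/DebugBench", 8, "1"),
    ("SWE-bench Lite", "princeton-nlp/SWE-bench_Lite", 8, "2"),
    ("APPS", "codeparrot/apps", 7, "2"),
    ("CoderEval", "N/A (GitHub only)", 4, "deferred"),
]


def plan_batches(rows):
    """Group rows by numeric batch and order each batch by descending priority.

    Assumption: rows whose batch is "deferred" are left out of the plan.
    """
    batches = defaultdict(list)
    for name, dataset_id, priority, batch in rows:
        if batch == "deferred":
            continue
        batches[int(batch)].append((priority, name, dataset_id))

    plan = {}
    for batch in sorted(batches):
        # sorted() is stable, so rows with equal priority keep catalog order.
        ordered = sorted(batches[batch], key=lambda item: -item[0])
        plan[batch] = [(name, dataset_id) for _, name, dataset_id in ordered]
    return plan


plan = plan_batches(ROWS)
for batch, entries in plan.items():
    print(f"batch {batch}: {[name for name, _ in entries]}")
```

Keeping the batch as a string until the `deferred` check means a non-numeric label never reaches `int()`. Any other non-numeric label would raise a `ValueError`, which is probably the desired behavior for an unexpected value.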