benchmark,category,year,size,languages,hf_dataset_id,data_access,visualization_complexity,influence,priority_score,batch,notes
MBPP+,Code Generation,2023,399,Python,evalplus/mbppplus,easy,simple,high,9,1,Natural companion to HumanEval+; same EvalPlus ecosystem
ClassEval,Code Generation,2023,100 classes (410 methods),Python,FudanSELab/ClassEval,easy,moderate,high,9,1,Class-level code generation with test classes
LiveCodeBench,Code Generation,2024,1055+,Python,livecodebench/code_generation_lite,easy,moderate,high,9,1,Continuously updated; contamination-free; high community interest
DebugBench,Code Editing/Debugging,2024,4253,"C++, Java, Python",Rtian/DebugBench,easy,moderate,high,8,1,Buggy code with implanted bugs; 4 categories; 18 minor types
HumanEval-X,Code Translation,2022,820 (164x5),"Python, C++, Java, JS, Go",THUDM/humaneval-x,easy,moderate,high,8,1,Same 164 problems in 5 languages with test cases
SWE-bench Lite,Code Editing,2024,300,Python,princeton-nlp/SWE-bench_Lite,easy,complex,very high,8,2,GitHub issue resolution; extremely high-profile
CodeContests,Code Generation,2022,13328,"C++, Python, Java",deepmind/code_contests,easy,moderate,high,8,2,AlphaCode benchmark; competitive programming
APPS,Code Generation,2021,10000,Python,codeparrot/apps,easy,moderate,high,7,2,Large-scale coding problems at 3 difficulty levels
CanItEdit,Code Editing,2023,105,Python,nuprl/CanItEdit,easy,simple,medium,7,2,Before/after code editing with dual instruction types
MBPP,Code Generation,2021,974,Python,google-research-datasets/mbpp,easy,simple,high,7,2,Original MBPP; foundational benchmark
DS-1000,Code Generation,2023,1000,Python,xlangai/DS-1000,easy,moderate,high,7,3,Data science library-specific problems (NumPy/Pandas/etc.)
CodeEditorBench,Code Editing,2024,7961,Multiple,m-a-p/CodeEditorBench,easy,moderate,medium,7,3,4 editing scenarios: debug/translate/polish/requirement switch
SAFIM,Code Completion,2024,17720,"Python, Java, C++, C#",gonglinyuan/safim,easy,moderate,medium,7,3,Syntax-aware fill-in-the-middle; 3 subtasks
BigVul,Vulnerability Detection,2020,190000,C/C++,bstee615/bigvul,easy,moderate,medium,6,3,CVE-linked vulnerability detection; 91 CWE types
RepoBench,Code Completion,2023,10000+,"Python, Java",tianyang/repobench-c,easy,complex,medium,6,3,Repo-level code completion with 3 sub-tasks
MultiPL-E,Code Generation/Translation,2023,HumanEval+MBPP in 22 langs,22 languages,nuprl/MultiPL-E,easy,moderate,medium,6,4,Translations of HumanEval/MBPP to 22 languages
DiverseVul,Vulnerability Detection,2023,350000+,C/C++,claudios/DiverseVul,easy,simple,medium,6,4,Large-scale vulnerability detection; 150 CWEs
PrimeVul,Vulnerability Detection,2024,236000+,C/C++,starsofchance/PrimeVul,easy,simple,medium,6,4,Highest-quality labels for vulnerability detection
McEval,Code Generation,2024,16000,40 languages,Multilingual-Multimodal-NLP/McEval,easy,complex,medium,6,4,Massive language coverage
CodeSearchNet,Code Search/Summarization,2019,2000000,"Python, JS, Ruby, Go, Java, PHP",code-search-net/code_search_net,easy,moderate,medium,6,4,Foundational code search benchmark
xCodeEval,Multi-task,2023,25000000,11-17 languages,NTU-NLP-sg/xCodeEval,easy,very complex,medium,5,5,7 tasks; very large; complex format
Devign,Vulnerability Detection,2019,20756,C,google/code_x_glue_cc_defect_detection,easy,simple,medium,5,5,Function-level vulnerability identification
CrossVul,Vulnerability Detection,2021,9313,40+ languages,hitoshura25/crossvul,easy,simple,medium,5,5,Cross-language vulnerability detection
SWE-bench Verified,Code Editing,2024,500,Python,princeton-nlp/SWE-bench_Verified,easy,complex,high,5,5,Curated subset of SWE-bench
CoderEval,Code Generation,2023,460,"Python, Java",N/A (GitHub only),medium,complex,medium,4,deferred,Requires project-level context
NaturalCodeBench,Code Generation,2024,402,"Python, Java",N/A (GitHub only),medium,moderate,medium,4,deferred,Only dev set released (140 problems)
DevEval,Code Generation,2024,1874,Python,N/A (GitHub only),medium,complex,medium,4,deferred,Repository-level; complex dependencies
RunBugRun,Program Repair,2023,450000+,9 languages,N/A (GitHub/SQLite),hard,complex,medium,3,deferred,SQLite format; complex infrastructure
Defects4J,Program Repair,2014,854,Java,N/A (GitHub only),hard,very complex,high,3,deferred,Requires Java tooling; full project repos
ConDefects,Program Repair,2023,2879,"Java, Python",N/A (GitHub only),medium,moderate,medium,3,deferred,AtCoder buggy/fixed pairs
FixEval,Program Repair,2023,varies,"Python, Java",N/A (GitHub only),medium,moderate,low,3,deferred,Competitive programming fixes
TransCoder,Code Translation,2020,852,"Java, Python, C++",N/A (GitHub only),medium,moderate,medium,3,deferred,Facebook Research; unsupervised translation
AVATAR,Code Translation,2021,9515,"Java, Python",N/A (GitHub only),medium,moderate,low,3,deferred,Parallel Java-Python corpus
TypeEvalPy,Type Inference,2023,154,Python,N/A (GitHub only),medium,moderate,low,3,deferred,Niche; type inference evaluation
VJBench,Vulnerability Repair,2023,42,Java,N/A (GitHub only),hard,complex,low,2,deferred,Very small; requires Java tooling
SVEN,Vulnerability Detection,2023,1606,C/C++,N/A (GitHub only),medium,moderate,low,2,deferred,Small; security hardening focus
PyTER,Type Error Repair,2022,93,Python,N/A (Figshare),hard,complex,low,2,deferred,Very small; niche