Commit 9f85fac by egor-bogomolov: Add 13 new benchmark datasets (batches 6-8)

Benchmark Integration Progress

Status: Batches 1-7 Complete

Batch Plan

Batch 1 (Highest Priority -- Easy HF, High Influence)

| Benchmark | Slug | Status | HF Dataset | View Type |
| --- | --- | --- | --- | --- |
| MBPP+ | mbppplus | Done | evalplus/mbppplus | Simple |
| ClassEval | classeval | Done | FudanSELab/ClassEval | Simple |
| LiveCodeBench | livecodebench | Done | livecodebench/code_generation_lite | Simple |
| DebugBench | debugbench | Done | Rtian/DebugBench | Before/After |
| HumanEval-X | humanevalx | Done | THUDM/humaneval-x | Multi-language |

Refactoring done: multi-language syntax highlighting via get_lexer_by_name(), a before/after code diff view, and a multi-language tab view.

Batch 2

| Benchmark | Slug | Status | HF Dataset | View Type |
| --- | --- | --- | --- | --- |
| SWE-bench Lite | swebenchlite | Done | princeton-nlp/SWE-bench_Lite | Diff |
| CodeContests | codecontests | Done | deepmind/code_contests | Multi-solution |
| APPS | apps | Done | codeparrot/apps | Multi-solution / Simple |
| CanItEdit | canitedit | Done | nuprl/CanItEdit | Before/After |
| MBPP | mbpp | Done | google-research-datasets/mbpp | Simple |

New views: Unified diff view for SWE-bench patches. Multi-solution view extended to show language labels for CodeContests.
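The unified diff view can be sketched with the standard library's difflib; the function name and call site here are illustrative, not the viewer's actual API:

```python
import difflib


def render_unified_diff(before: str, after: str, path: str) -> str:
    """Build a unified diff string between two versions of a file.

    Sketch only: the real SWE-bench patches ship pre-computed, so the
    viewer mostly renders existing patch text rather than recomputing it.
    """
    lines = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(lines)


diff = render_unified_diff("x = 1\n", "x = 2\n", "demo.py")
```

The `a/`-and-`b/` prefixes match git's patch convention, so the output slots into the same rendering path as a SWE-bench `patch` field.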

Batch 3

| Benchmark | Slug | Status | HF Dataset | View Type |
| --- | --- | --- | --- | --- |
| SAFIM | safim | Done | gonglinyuan/safim | Fill-in-the-Middle |
| BigVul | bigvul | Done | bstee615/bigvul | Vulnerability |
| DiverseVul | diversevul | Done | claudios/DiverseVul | Vulnerability |
| PrimeVul | primevul | Done | starsofchance/PrimeVul | Vulnerability |
| CodeEditorBench | codeeditorbench | Done | m-a-p/CodeEditorBench | Before/After |

New views: Fill-in-the-Middle view showing code with [HOLE] marker and ground truth. Vulnerability view with CWE badges and vulnerable/patched code comparison.
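The Fill-in-the-Middle view boils down to string surgery around the [HOLE] marker; the helper names below are hypothetical, but the split/merge logic is the core of the view:

```python
HOLE = "[HOLE]"


def split_fim(code_with_hole: str) -> tuple[str, str]:
    """Split a FIM sample into (prefix, suffix) around the [HOLE] marker."""
    prefix, _, suffix = code_with_hole.partition(HOLE)
    return prefix, suffix


def merge_with_ground_truth(code_with_hole: str, ground_truth: str) -> str:
    """Replace the first marker with the ground-truth span, producing the
    merged code shown in the view (with the filled span highlighted)."""
    return code_with_hole.replace(HOLE, ground_truth, 1)


sample = "def add(a, b):\n    return [HOLE]\n"
merged = merge_with_ground_truth(sample, "a + b")
```

Replacing only the first occurrence guards against samples whose suffix happens to contain the marker text verbatim.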

Batch 4

| Benchmark | Slug | Status | HF Dataset | View Type |
| --- | --- | --- | --- | --- |
| SWE-bench Verified | swebenchverified | Done | princeton-nlp/SWE-bench_Verified | Diff |
| CodeSearchNet | codesearchnet | Done | code-search-net/code_search_net | Simple |
| Devign | devign | Done | google/code_x_glue_cc_defect_detection | Vulnerability |

Dropped from original plan

| Benchmark | Reason |
| --- | --- |
| DS-1000 | Complex library-specific format, limited visualization value |
| RepoBench | Repo-level context too complex for per-problem viewing |
| MultiPL-E | 22 languages but same problems as HumanEval/MBPP already covered |
| McEval | Very large (40 languages), complex format |
| xCodeEval | Very large (25M rows), 7 tasks, too complex |
| CrossVul | Similar to DiverseVul/BigVul, diminishing returns |

Batch 5

| Benchmark | Slug | Status | HF Dataset | View Type |
| --- | --- | --- | --- | --- |
| BigCodeBench | bigcodebench | Done | bigcode/bigcodebench | Simple |
| HumanEvalPack | humanevalpack | Done | bigcode/humanevalpack | Multi-language + Before/After |
| CodeXGLUE Refinement | codexgluerefinement | Done | google/code_x_glue_cc_code_refinement | Before/After |
| SWE-bench | swebenchfull | Done | princeton-nlp/SWE-bench | Diff |
| CommitBench | commitbench | Done | Maxscha/commitbench | Diff |
| EffiBench | effibench | Done | DONG19/EffiBench | Simple |

New views: Multi-language view with canonical/buggy code toggle (HumanEvalPack). CommitBench reuses diff view. CodeXGLUE Refinement uses before/after Java view.

Deferred (GitHub-only or complex infrastructure)

CoderEval, NaturalCodeBench, DevEval, RunBugRun, Defects4J, ConDefects, FixEval, TransCoder, AVATAR, TypeEvalPy, VJBench, SVEN, PyTER

Architecture Decisions

Multi-language Support

  • highlight_code() in app.py accepts language parameter (default: "python")
  • Uses get_lexer_by_name() from Pygments for automatic lexer selection
  • Adapters pass language when calling _highlight_code(code, language=...)
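The highlight_code() flow described above can be sketched as follows, assuming Pygments' HtmlFormatter; the exact signature and formatter options in app.py may differ:

```python
from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import get_lexer_by_name
from pygments.util import ClassNotFound


def highlight_code(code: str, language: str = "python") -> str:
    """Render code as HTML spans via Pygments.

    Falls back to the plain-text lexer when the language name is unknown,
    so an adapter passing an odd language tag degrades gracefully.
    """
    try:
        lexer = get_lexer_by_name(language)
    except ClassNotFound:
        lexer = get_lexer_by_name("text")
    return highlight(code, lexer, HtmlFormatter(nowrap=True))


html = highlight_code('println("hi")', language="kotlin")
```

`nowrap=True` keeps the output to bare token spans, which is convenient when the surrounding template already supplies the `<pre>` wrapper; the real formatter options are an assumption here.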

View Types Implemented

  1. BigOBench view -- multiple solutions with complexity badges
  2. Simple view -- code + inputs/outputs + test suite (HumanEval+, MBPP+, MBPP, ClassEval, LiveCodeBench, APPS, CodeSearchNet)
  3. CRUXEval view -- given/predict task selector
  4. DREval view -- full interactive view with coverage, arrows, ground truth
  5. Before/After view -- side-by-side buggy/fixed code (DebugBench, CanItEdit, CodeEditorBench)
  6. Multi-language view -- same problem in multiple languages (HumanEval-X, HumanEvalPack)
  7. Diff view -- unified diff patch visualization (SWE-bench Lite, SWE-bench Verified, SWE-bench, CommitBench)
  8. Fill-in-the-Middle view -- prefix + [HOLE] + suffix (SAFIM)
  9. Vulnerability view -- vulnerable/patched code + CWE labels (BigVul, DiverseVul, PrimeVul, Devign)

Batch 6 -- Long Code Arena (6 project-level tasks)

| Benchmark | Slug | Status | HF Dataset | View Type |
| --- | --- | --- | --- | --- |
| LCA Library-Based Code Gen | lca-libcodegen | Done | JetBrains-Research/lca-library-based-code-generation | Simple |
| LCA Project-Level Completion | lca-codecompletion | Done | JetBrains-Research/lca-project-level-code-completion | Simple |
| LCA Bug Localization | lca-buglocalization | Done | JetBrains-Research/lca-bug-localization | Diff |
| LCA Commit Message Gen | lca-commitmsg | Done | JetBrains-Research/lca-commit-message-generation | Diff |
| LCA CI Builds Repair | lca-cirepair | Done | JetBrains-Research/lca-ci-builds-repair | Diff |
| LCA Module Summarization | lca-modulesumm | Done | JetBrains-Research/lca-module-summarization | Simple |

New adapter module: adapters/long_code_arena.py -- all 6 Long Code Arena project-level tasks.

Batch 7 -- dpaia & Additional Benchmarks (7 datasets)

| Benchmark | Slug | Status | Source | View Type |
| --- | --- | --- | --- | --- |
| DPAIA EE-Dataset | dpaia-ee | Done | github.com/dpaia/ee-dataset (JSON) | Diff (SWE-bench style) |
| Multi-SWE-bench | multiswebench | Done | ByteDance-Seed/Multi-SWE-bench (JSONL) | Diff |
| SWE-bench Multilingual | swebenchmultilingual | Done | SWE-bench/SWE-bench_Multilingual | Diff |
| CrossCodeEval | crosscodeeval | Done | Vincentvmt/CrossCodeEval (JSONL) | Fill-in-the-Middle |
| McEval | mceval | Done | Multilingual-Multimodal-NLP/McEval | Simple |
| MultiPL-E | multiple | Done | nuprl/MultiPL-E | Multi-language |
| Defects4J | defects4j | Done | rufimelo/defects4j | Before/After |

Dropped from Batch 7

| Benchmark | Reason |
| --- | --- |
| RepoBench | HF repo has only a deprecated loading script (repobench-p.py), no actual data files |

New adapter module: adapters/additional.py -- dpaia EE-Dataset, Multi-SWE-bench, SWE-bench Multilingual, CrossCodeEval, McEval, MultiPL-E, Defects4J.
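The per-file JSONL loading used for Multi-SWE-bench and CrossCodeEval (see the changelog fixes below) can be sketched with plain json parsing that tolerates columns varying across files; the file names and helper names here are hypothetical:

```python
import json
import tempfile
from pathlib import Path
from typing import Iterator


def iter_jsonl(path: Path) -> Iterator[dict]:
    """Yield one record per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def load_split_files(paths: list[Path]) -> list[dict]:
    """Concatenate records from several JSONL files.

    Missing keys are filled with None so every row shares one schema even
    when columns differ across files (the CrossCodeEval situation).
    """
    rows = [row for p in paths for row in iter_jsonl(p)]
    keys = {k for row in rows for k in row}
    return [{k: row.get(k) for k in keys} for row in rows]


# Demo with two per-language files whose columns disagree.
with tempfile.TemporaryDirectory() as d:
    a = Path(d) / "python.jsonl"
    b = Path(d) / "java.jsonl"
    a.write_text('{"id": 1, "prompt": "p"}\n', encoding="utf-8")
    b.write_text('{"id": 2, "groundtruth": "g"}\n', encoding="utf-8")
    rows = load_split_files([a, b])
```

Reading the files directly sidesteps load_dataset's schema unification, which is exactly what failed when the per-language files disagreed on columns.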

Sources:

  • Long Code Arena: https://huggingface.co/collections/JetBrains-Research/long-code-arena (OpenReview: aQoUjxlgNE)
  • DPAIA EE-Dataset: https://github.com/dpaia/ee-dataset (Java/Spring SWE-bench-style)
  • Multi-SWE-bench: ByteDance multilingual SWE-bench (7 languages, 1632 problems across 40 repos)
  • SWE-bench Multilingual: Official SWE-bench multilingual extension (42 repos)
  • CrossCodeEval: Cross-file code completion (4 languages, Amazon, 9928 problems)
  • McEval: Massively multilingual code evaluation (40 languages)
  • MultiPL-E: Multi-language HumanEval/MBPP translation (9 languages loaded)
  • Defects4J: Classic Java bug-fix benchmark (467 bugs)
  • arXiv survey reference: https://arxiv.org/abs/2505.08903

Total Datasets: 41

  • Base (4): REval, CRUXEval, HumanEval+, BigOBench
  • Batch 1 (5): MBPP+, ClassEval, LiveCodeBench, DebugBench, HumanEval-X
  • Batch 2 (5): SWE-bench Lite, CodeContests, APPS, CanItEdit, MBPP
  • Batch 3 (5): SAFIM, BigVul, DiverseVul, PrimeVul, CodeEditorBench
  • Batch 4 (3): SWE-bench Verified, CodeSearchNet, Devign
  • Batch 5 (6): BigCodeBench, HumanEvalPack, CodeXGLUE Refinement, SWE-bench, CommitBench, EffiBench
  • Batch 6 (6): LCA Library-Based Code Gen, LCA Project-Level Completion, LCA Bug Localization, LCA Commit Message Gen, LCA CI Builds Repair, LCA Module Summarization
  • Batch 7 (7): DPAIA EE-Dataset, Multi-SWE-bench, SWE-bench Multilingual, CrossCodeEval, McEval, MultiPL-E, Defects4J

Changelog

  • 2026-03-03: Initial benchmark analysis and prioritization complete
  • 2026-03-03: Batch 1 complete (MBPP+, ClassEval, LiveCodeBench, DebugBench, HumanEval-X)
  • 2026-03-03: Batch 2 complete (SWE-bench Lite, CodeContests, APPS, CanItEdit, MBPP)
  • 2026-03-03: Batch 3 complete (SAFIM, BigVul, DiverseVul, PrimeVul, CodeEditorBench)
  • 2026-03-03: Batch 4 complete (SWE-bench Verified, CodeSearchNet, Devign)
  • 2026-03-03: Fixed APPS loading (refs/convert/parquet), PrimeVul (direct JSONL), CodeEditorBench (per-task JSONL)
  • 2026-03-03: All 22 datasets verified loading successfully
  • 2026-03-04: Refactored adapters into submodules (adapters/code_generation.py, code_editing.py, code_reasoning.py, vulnerability.py)
  • 2026-03-04: Extracted CSS and JS into static/ directory (static/problem.css, static/problem.js)
  • 2026-03-04: Added sampling for large datasets (cap at 1000 with seed=42)
  • 2026-03-04: Enhanced FIM view (merged code with ground truth highlighting)
  • 2026-03-04: Enhanced Before/After view (diff highlighting)
  • 2026-03-04: Enhanced SWE-bench diff view (full file with diff chunks)
  • 2026-03-04: Batch 5 complete (BigCodeBench, HumanEvalPack, CodeXGLUE Refinement, SWE-bench, CommitBench, EffiBench)
  • 2026-03-04: All 28 datasets verified loading successfully
  • 2026-03-04: Batch 6 complete (Long Code Arena -- 6 project-level tasks)
  • 2026-03-04: Batch 7 complete (dpaia EE-Dataset, Multi-SWE-bench, SWE-bench Multilingual, CrossCodeEval, McEval, MultiPL-E, Defects4J)
  • 2026-03-04: Dropped RepoBench (HF repo has only deprecated loading script, no data files)
  • 2026-03-04: Fixed Multi-SWE-bench (load per-repo JSONL files directly instead of load_dataset)
  • 2026-03-04: Fixed CrossCodeEval (load per-language JSONL files directly, inconsistent columns across files)
  • 2026-03-04: Fixed Defects4J (split="train" not "test", fields: bug_id/func_before/func_after)
  • 2026-03-04: All 41 datasets verified loading successfully