# Benchmark Integration Progress

Status: Batches 1-7 complete (41 datasets)

## Batch Plan

### Batch 1 (Highest Priority -- Easy HF, High Influence)
| Benchmark | Slug | Status | HF Dataset | View Type |
|---|---|---|---|---|
| MBPP+ | mbppplus | Done | evalplus/mbppplus | Simple |
| ClassEval | classeval | Done | FudanSELab/ClassEval | Simple |
| LiveCodeBench | livecodebench | Done | livecodebench/code_generation_lite | Simple |
| DebugBench | debugbench | Done | Rtian/DebugBench | Before/After |
| HumanEval-X | humanevalx | Done | THUDM/humaneval-x | Multi-language |
Refactoring done: multi-language syntax highlighting via `get_lexer_by_name()`; before/after code diff view; multi-language tab view.
### Batch 2

| Benchmark | Slug | Status | HF Dataset | View Type |
|---|---|---|---|---|
| SWE-bench Lite | swebenchlite | Done | princeton-nlp/SWE-bench_Lite | Diff |
| CodeContests | codecontests | Done | deepmind/code_contests | Multi-solution |
| APPS | apps | Done | codeparrot/apps | Multi-solution / Simple |
| CanItEdit | canitedit | Done | nuprl/CanItEdit | Before/After |
| MBPP | mbpp | Done | google-research-datasets/mbpp | Simple |
New views: Unified diff view for SWE-bench patches. Multi-solution view extended to show language labels for CodeContests.
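The unified diff view can be built directly from the standard library's `difflib`; the sketch below shows the idea, with the function and parameter names being illustrative rather than the app's actual API.

```python
import difflib


def make_unified_diff(before: str, after: str, path: str = "file.py") -> str:
    """Build a unified diff between two file versions, git-patch style."""
    return "".join(
        difflib.unified_diff(
            before.splitlines(keepends=True),
            after.splitlines(keepends=True),
            fromfile=f"a/{path}",
            tofile=f"b/{path}",
        )
    )
```

The resulting text contains the familiar `---`/`+++` header and `@@` hunk markers, which a view can then colorize line by line.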
### Batch 3

| Benchmark | Slug | Status | HF Dataset | View Type |
|---|---|---|---|---|
| SAFIM | safim | Done | gonglinyuan/safim | Fill-in-the-Middle |
| BigVul | bigvul | Done | bstee615/bigvul | Vulnerability |
| DiverseVul | diversevul | Done | claudios/DiverseVul | Vulnerability |
| PrimeVul | primevul | Done | starsofchance/PrimeVul | Vulnerability |
| CodeEditorBench | codeeditorbench | Done | m-a-p/CodeEditorBench | Before/After |
New views: Fill-in-the-Middle view showing code with [HOLE] marker and ground truth. Vulnerability view with CWE badges and vulnerable/patched code comparison.
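The Fill-in-the-Middle view amounts to splicing either a `[HOLE]` marker or the ground truth into the gap between prefix and suffix; a minimal sketch (names are illustrative):

```python
def render_fim(prefix, suffix, ground_truth=None):
    """Merge a fill-in-the-middle problem into one code string.

    Without a ground truth, the gap is shown as a [HOLE] marker; with one,
    the answer is spliced in so the viewer can highlight the filled span.
    """
    middle = "[HOLE]" if ground_truth is None else ground_truth
    return prefix + middle + suffix
```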
### Batch 4

| Benchmark | Slug | Status | HF Dataset | View Type |
|---|---|---|---|---|
| SWE-bench Verified | swebenchverified | Done | princeton-nlp/SWE-bench_Verified | Diff |
| CodeSearchNet | codesearchnet | Done | code-search-net/code_search_net | Simple |
| Devign | devign | Done | google/code_x_glue_cc_defect_detection | Vulnerability |
### Dropped from original plan
| Benchmark | Reason |
|---|---|
| DS-1000 | Complex library-specific format, limited visualization value |
| RepoBench | Repo-level context too complex for per-problem viewing |
| MultiPL-E | 22 languages but same problems as HumanEval/MBPP already covered |
| McEval | Very large (40 languages), complex format |
| xCodeEval | Very large (25M rows), 7 tasks, too complex |
| CrossVul | Similar to DiverseVul/BigVul, diminishing returns |
### Batch 5

| Benchmark | Slug | Status | HF Dataset | View Type |
|---|---|---|---|---|
| BigCodeBench | bigcodebench | Done | bigcode/bigcodebench | Simple |
| HumanEvalPack | humanevalpack | Done | bigcode/humanevalpack | Multi-language + Before/After |
| CodeXGLUE Refinement | codexgluerefinement | Done | google/code_x_glue_cc_code_refinement | Before/After |
| SWE-bench | swebenchfull | Done | princeton-nlp/SWE-bench | Diff |
| CommitBench | commitbench | Done | Maxscha/commitbench | Diff |
| EffiBench | effibench | Done | DONG19/EffiBench | Simple |
New views: multi-language view with a canonical/buggy code toggle (HumanEvalPack). CommitBench reuses the diff view; CodeXGLUE Refinement uses the before/after view with Java code.
### Deferred (GitHub-only or complex infrastructure)
CoderEval, NaturalCodeBench, DevEval, RunBugRun, Defects4J, ConDefects, FixEval, TransCoder, AVATAR, TypeEvalPy, VJBench, SVEN, PyTER
## Architecture Decisions

### Multi-language Support
- `highlight_code()` in `app.py` accepts a `language` parameter (default: `"python"`)
- Uses `get_lexer_by_name()` from Pygments for automatic lexer selection
- Adapters pass the language when calling `_highlight_code(code, language=...)`
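A minimal sketch of the helper described above, using the Pygments API; the fallback to a plain-text lexer for unknown language names is an assumption, not confirmed behavior of the app.

```python
from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import get_lexer_by_name
from pygments.util import ClassNotFound


def highlight_code(code: str, language: str = "python") -> str:
    """Render code as HTML using the lexer registered for `language`.

    Assumed fallback: unknown language names degrade to plain text
    instead of raising, so a bad adapter value never breaks a page.
    """
    try:
        lexer = get_lexer_by_name(language)
    except ClassNotFound:
        lexer = get_lexer_by_name("text")
    return highlight(code, lexer, HtmlFormatter(nowrap=True))
```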
### View Types Implemented
- BigOBench view -- multiple solutions with complexity badges
- Simple view -- code + inputs/outputs + test suite (HumanEval+, MBPP+, MBPP, ClassEval, LiveCodeBench, APPS, CodeSearchNet)
- CRUXEval view -- given/predict task selector
- DREval view -- full interactive view with coverage, arrows, ground truth
- Before/After view -- side-by-side buggy/fixed code (DebugBench, CanItEdit, CodeEditorBench)
- Multi-language view -- same problem in multiple languages (HumanEval-X, HumanEvalPack)
- Diff view -- unified diff patch visualization (SWE-bench Lite, SWE-bench Verified, SWE-bench, CommitBench)
- Fill-in-the-Middle view -- prefix + [HOLE] + suffix (SAFIM)
- Vulnerability view -- vulnerable/patched code + CWE labels (BigVul, DiverseVul, PrimeVul, Devign)
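The view types above suggest a dispatch table from view-type name to render function. The sketch below is hypothetical: the renderer names, dict, and fallback are illustrative, not the app's actual routing.

```python
# Hypothetical renderers; the real ones build full HTML views.
def render_simple(problem):
    return f"<pre>{problem['code']}</pre>"


def render_diff(problem):
    return f"<pre>{problem['patch']}</pre>"


# Hypothetical registry mapping a view-type slug to its renderer.
VIEW_RENDERERS = {
    "simple": render_simple,
    "diff": render_diff,
}


def render_problem(problem, view_type="simple"):
    """Dispatch to the registered renderer, defaulting to the simple view."""
    renderer = VIEW_RENDERERS.get(view_type, render_simple)
    return renderer(problem)
```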
### Batch 6 -- Long Code Arena (6 project-level tasks)

| Benchmark | Slug | Status | HF Dataset | View Type |
|---|---|---|---|---|
| LCA Library-Based Code Gen | lca-libcodegen | Done | JetBrains-Research/lca-library-based-code-generation | Simple |
| LCA Project-Level Completion | lca-codecompletion | Done | JetBrains-Research/lca-project-level-code-completion | Simple |
| LCA Bug Localization | lca-buglocalization | Done | JetBrains-Research/lca-bug-localization | Diff |
| LCA Commit Message Gen | lca-commitmsg | Done | JetBrains-Research/lca-commit-message-generation | Diff |
| LCA CI Builds Repair | lca-cirepair | Done | JetBrains-Research/lca-ci-builds-repair | Diff |
| LCA Module Summarization | lca-modulesumm | Done | JetBrains-Research/lca-module-summarization | Simple |
New adapter module: `adapters/long_code_arena.py` -- all 6 Long Code Arena project-level tasks.
### Batch 7 -- dpaia & Additional Benchmarks (7 datasets)

| Benchmark | Slug | Status | Source | View Type |
|---|---|---|---|---|
| DPAIA EE-Dataset | dpaia-ee | Done | github.com/dpaia/ee-dataset (JSON) | Diff (SWE-bench style) |
| Multi-SWE-bench | multiswebench | Done | ByteDance-Seed/Multi-SWE-bench (JSONL) | Diff |
| SWE-bench Multilingual | swebenchmultilingual | Done | SWE-bench/SWE-bench_Multilingual | Diff |
| CrossCodeEval | crosscodeeval | Done | Vincentvmt/CrossCodeEval (JSONL) | Fill-in-the-Middle |
| McEval | mceval | Done | Multilingual-Multimodal-NLP/McEval | Simple |
| MultiPL-E | multiple | Done | nuprl/MultiPL-E | Multi-language |
| Defects4J | defects4j | Done | rufimelo/defects4j | Before/After |
### Dropped from Batch 7
| Benchmark | Reason |
|---|---|
| RepoBench | HF repo has only a deprecated loading script (repobench-p.py), no actual data files |
New adapter module: `adapters/additional.py` -- dpaia EE-Dataset, Multi-SWE-bench, SWE-bench Multilingual, CrossCodeEval, McEval, MultiPL-E, Defects4J.
Sources:
- Long Code Arena: https://huggingface.co/collections/JetBrains-Research/long-code-arena (OpenReview: aQoUjxlgNE)
- DPAIA EE-Dataset: https://github.com/dpaia/ee-dataset (Java/Spring SWE-bench-style)
- Multi-SWE-bench: ByteDance multilingual SWE-bench (7 languages, 1632 problems across 40 repos)
- SWE-bench Multilingual: Official SWE-bench multilingual extension (42 repos)
- CrossCodeEval: Cross-file code completion (4 languages, Amazon, 9928 problems)
- McEval: Massively multilingual code evaluation (40 languages)
- MultiPL-E: Multi-language HumanEval/MBPP translation (9 languages loaded)
- Defects4J: Classic Java bug-fix benchmark (467 bugs)
- arXiv survey reference: https://arxiv.org/abs/2505.08903
## Total Datasets: 41

- Base (4): REval, CRUXEval, HumanEval+, BigOBench
- Batch 1 (5): MBPP+, ClassEval, LiveCodeBench, DebugBench, HumanEval-X
- Batch 2 (5): SWE-bench Lite, CodeContests, APPS, CanItEdit, MBPP
- Batch 3 (5): SAFIM, BigVul, DiverseVul, PrimeVul, CodeEditorBench
- Batch 4 (3): SWE-bench Verified, CodeSearchNet, Devign
- Batch 5 (6): BigCodeBench, HumanEvalPack, CodeXGLUE Refinement, SWE-bench, CommitBench, EffiBench
- Batch 6 (6): LCA Library-Based Code Gen, LCA Project-Level Completion, LCA Bug Localization, LCA Commit Message Gen, LCA CI Builds Repair, LCA Module Summarization
- Batch 7 (7): DPAIA EE-Dataset, Multi-SWE-bench, SWE-bench Multilingual, CrossCodeEval, McEval, MultiPL-E, Defects4J
## Changelog
- 2026-03-03: Initial benchmark analysis and prioritization complete
- 2026-03-03: Batch 1 complete (MBPP+, ClassEval, LiveCodeBench, DebugBench, HumanEval-X)
- 2026-03-03: Batch 2 complete (SWE-bench Lite, CodeContests, APPS, CanItEdit, MBPP)
- 2026-03-03: Batch 3 complete (SAFIM, BigVul, DiverseVul, PrimeVul, CodeEditorBench)
- 2026-03-03: Batch 4 complete (SWE-bench Verified, CodeSearchNet, Devign)
- 2026-03-03: Fixed APPS loading (refs/convert/parquet), PrimeVul (direct JSONL), CodeEditorBench (per-task JSONL)
- 2026-03-03: All 22 datasets verified loading successfully
- 2026-03-04: Refactored adapters into submodules (adapters/code_generation.py, code_editing.py, code_reasoning.py, vulnerability.py)
- 2026-03-04: Extracted CSS and JS into static/ directory (static/problem.css, static/problem.js)
- 2026-03-04: Added sampling for large datasets (cap at 1000 with seed=42)
- 2026-03-04: Enhanced FIM view (merged code with ground truth highlighting)
- 2026-03-04: Enhanced Before/After view (diff highlighting)
- 2026-03-04: Enhanced SWE-bench diff view (full file with diff chunks)
- 2026-03-04: Batch 5 complete (BigCodeBench, HumanEvalPack, CodeXGLUE Refinement, SWE-bench, CommitBench, EffiBench)
- 2026-03-04: All 28 datasets verified loading successfully
- 2026-03-04: Batch 6 complete (Long Code Arena β 6 project-level tasks)
- 2026-03-04: Batch 7 complete (dpaia EE-Dataset, Multi-SWE-bench, SWE-bench Multilingual, CrossCodeEval, McEval, MultiPL-E, Defects4J)
- 2026-03-04: Dropped RepoBench (HF repo has only deprecated loading script, no data files)
- 2026-03-04: Fixed Multi-SWE-bench (load per-repo JSONL files directly instead of `load_dataset`)
- 2026-03-04: Fixed CrossCodeEval (load per-language JSONL files directly; columns are inconsistent across files)
- 2026-03-04: Fixed Defects4J (split="train" not "test", fields: bug_id/func_before/func_after)
- 2026-03-04: All 41 datasets verified loading successfully
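The JSONL fixes in the changelog amount to reading records line by line instead of going through `load_dataset`, which sidesteps schema mismatches across files. A minimal sketch of that pattern (helper name is illustrative):

```python
import json


def load_jsonl(path):
    """Load one JSON record per line, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

Because each file is parsed independently, per-repo or per-language files with differing columns can be normalized afterwards in the adapter instead of failing at load time.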