# Benchmark Integration Progress
## Status: Batches 1-7 Complete
## Batch Plan
### Batch 1 (Highest Priority -- Easy HF, High Influence)
| Benchmark | Slug | Status | HF Dataset | View Type |
|-----------|------|--------|------------|-----------|
| MBPP+ | `mbppplus` | Done | `evalplus/mbppplus` | Simple |
| ClassEval | `classeval` | Done | `FudanSELab/ClassEval` | Simple |
| LiveCodeBench | `livecodebench` | Done | `livecodebench/code_generation_lite` | Simple |
| DebugBench | `debugbench` | Done | `Rtian/DebugBench` | Before/After |
| HumanEval-X | `humanevalx` | Done | `THUDM/humaneval-x` | Multi-language |
**Refactoring done:** Multi-language syntax highlighting via `get_lexer_by_name()`, a before/after code diff view, and a multi-language tab view.
### Batch 2
| Benchmark | Slug | Status | HF Dataset | View Type |
|-----------|------|--------|------------|-----------|
| SWE-bench Lite | `swebenchlite` | Done | `princeton-nlp/SWE-bench_Lite` | Diff |
| CodeContests | `codecontests` | Done | `deepmind/code_contests` | Multi-solution |
| APPS | `apps` | Done | `codeparrot/apps` | Multi-solution / Simple |
| CanItEdit | `canitedit` | Done | `nuprl/CanItEdit` | Before/After |
| MBPP | `mbpp` | Done | `google-research-datasets/mbpp` | Simple |
**New views:** Unified diff view for SWE-bench patches. Multi-solution view extended to show language labels for CodeContests.
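The core of the unified diff view can be sketched with the standard library's `difflib`; `render_patch` and the CSS class names below are illustrative, not the app's actual API.

```python
import difflib
import html


def render_patch(before: str, after: str, path: str) -> str:
    """Render a unified diff of two file versions as minimal HTML.

    Illustrative sketch: one <div> per diff line, classed by kind
    so a stylesheet can color additions, deletions, and context.
    """
    lines = difflib.unified_diff(
        before.splitlines(), after.splitlines(),
        fromfile=f"a/{path}", tofile=f"b/{path}", lineterm="",
    )
    rows = []
    for line in lines:
        if line.startswith("+") and not line.startswith("+++"):
            cls = "add"
        elif line.startswith("-") and not line.startswith("---"):
            cls = "del"
        else:
            cls = "ctx"  # headers, hunk markers, unchanged lines
        rows.append(f'<div class="{cls}">{html.escape(line)}</div>')
    return "\n".join(rows)
```

The same helper would serve any benchmark that ships whole before/after file contents rather than pre-computed patches (SWE-bench rows already contain a unified patch, which can be rendered directly).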
### Batch 3
| Benchmark | Slug | Status | HF Dataset | View Type |
|-----------|------|--------|------------|-----------|
| SAFIM | `safim` | Done | `gonglinyuan/safim` | Fill-in-the-Middle |
| BigVul | `bigvul` | Done | `bstee615/bigvul` | Vulnerability |
| DiverseVul | `diversevul` | Done | `claudios/DiverseVul` | Vulnerability |
| PrimeVul | `primevul` | Done | `starsofchance/PrimeVul` | Vulnerability |
| CodeEditorBench | `codeeditorbench` | Done | `m-a-p/CodeEditorBench` | Before/After |
**New views:** Fill-in-the-Middle view showing code with a `[HOLE]` marker and the ground truth. Vulnerability view with CWE badges and a vulnerable/patched code comparison.
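The Fill-in-the-Middle rendering reduces to splitting the prompt on the marker; `split_fim` and `merge_ground_truth` are hypothetical helpers sketching the idea (SAFIM's actual field names vary per task, so this is not the adapter's real API).

```python
def split_fim(code_with_hole: str, marker: str = "[HOLE]") -> tuple:
    """Split an FIM prompt into (prefix, suffix) around the marker.

    Hypothetical helper: assumes exactly one marker occurrence.
    """
    prefix, _, suffix = code_with_hole.partition(marker)
    return prefix, suffix


def merge_ground_truth(code_with_hole: str, ground_truth: str,
                       marker: str = "[HOLE]") -> str:
    """Inline the ground truth where the hole was, for display."""
    return code_with_hole.replace(marker, ground_truth, 1)
```

The merged form is what the enhanced FIM view would highlight, with the ground-truth span styled differently from the surrounding context.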
### Batch 4
| Benchmark | Slug | Status | HF Dataset | View Type |
|-----------|------|--------|------------|-----------|
| SWE-bench Verified | `swebenchverified` | Done | `princeton-nlp/SWE-bench_Verified` | Diff |
| CodeSearchNet | `codesearchnet` | Done | `code-search-net/code_search_net` | Simple |
| Devign | `devign` | Done | `google/code_x_glue_cc_defect_detection` | Vulnerability |
### Dropped from original plan
| Benchmark | Reason |
|-----------|--------|
| DS-1000 | Complex library-specific format, limited visualization value |
| RepoBench | Repo-level context too complex for per-problem viewing |
| MultiPL-E | 22 languages, but the same problems as the already-covered HumanEval/MBPP (later added in Batch 7) |
| McEval | Very large (40 languages), complex format (later added in Batch 7) |
| xCodeEval | Very large (25M rows), 7 tasks, too complex |
| CrossVul | Similar to DiverseVul/BigVul, diminishing returns |
### Batch 5
| Benchmark | Slug | Status | HF Dataset | View Type |
|-----------|------|--------|------------|-----------|
| BigCodeBench | `bigcodebench` | Done | `bigcode/bigcodebench` | Simple |
| HumanEvalPack | `humanevalpack` | Done | `bigcode/humanevalpack` | Multi-language + Before/After |
| CodeXGLUE Refinement | `codexgluerefinement` | Done | `google/code_x_glue_cc_code_refinement` | Before/After |
| SWE-bench | `swebenchfull` | Done | `princeton-nlp/SWE-bench` | Diff |
| CommitBench | `commitbench` | Done | `Maxscha/commitbench` | Diff |
| EffiBench | `effibench` | Done | `DONG19/EffiBench` | Simple |
**New views:** Multi-language view with canonical/buggy code toggle (HumanEvalPack). CommitBench reuses the diff view. CodeXGLUE Refinement uses the before/after Java view.
### Deferred (GitHub-only or complex infrastructure)
CoderEval, NaturalCodeBench, DevEval, RunBugRun, ConDefects, FixEval, TransCoder, AVATAR, TypeEvalPy, VJBench, SVEN, PyTER (Defects4J, originally deferred, was added in Batch 7)
## Architecture Decisions
### Multi-language Support
- `highlight_code()` in `app.py` accepts a `language` parameter (default: `"python"`)
- Uses `get_lexer_by_name()` from Pygments for automatic lexer selection
- Adapters pass the language when calling `_highlight_code(code, language=...)`
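A minimal sketch of what `highlight_code()` could look like under these decisions; the unknown-language fallback, the plain-text degradation when Pygments is absent, and the `HtmlFormatter` options are assumptions, not the actual `app.py` code.

```python
import html


def highlight_code(code: str, language: str = "python") -> str:
    """Highlight code as HTML via Pygments (sketch, not app.py's code).

    Assumed behavior: unknown language names fall back to the Python
    lexer; if Pygments is unavailable, return escaped plain text.
    """
    try:
        from pygments import highlight
        from pygments.formatters import HtmlFormatter
        from pygments.lexers import get_lexer_by_name
        from pygments.util import ClassNotFound
    except ImportError:
        return f"<pre>{html.escape(code)}</pre>"
    try:
        lexer = get_lexer_by_name(language)
    except ClassNotFound:
        lexer = get_lexer_by_name("python")  # assumed fallback choice
    return highlight(code, lexer, HtmlFormatter(nowrap=True))
```

Keeping the lexer lookup in one place means each adapter only has to pass a language string (e.g. `"java"`, `"cpp"`) and never touches Pygments directly.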
### View Types Implemented
1. **BigOBench view** -- multiple solutions with complexity badges
2. **Simple view** -- code + inputs/outputs + test suite (HumanEval+, MBPP+, MBPP, ClassEval, LiveCodeBench, APPS, CodeSearchNet)
3. **CRUXEval view** -- given/predict task selector
4. **DREval view** -- full interactive view with coverage, arrows, ground truth
5. **Before/After view** -- side-by-side buggy/fixed code (DebugBench, CanItEdit, CodeEditorBench)
6. **Multi-language view** -- same problem in multiple languages (HumanEval-X, HumanEvalPack)
7. **Diff view** -- unified diff patch visualization (SWE-bench Lite, SWE-bench Verified, SWE-bench, CommitBench)
8. **Fill-in-the-Middle view** -- prefix + [HOLE] + suffix (SAFIM)
9. **Vulnerability view** -- vulnerable/patched code + CWE labels (BigVul, DiverseVul, PrimeVul, Devign)
### Batch 6 -- Long Code Arena (6 project-level tasks)
| Benchmark | Slug | Status | HF Dataset | View Type |
|-----------|------|--------|------------|-----------|
| LCA Library-Based Code Gen | `lca-libcodegen` | Done | `JetBrains-Research/lca-library-based-code-generation` | Simple |
| LCA Project-Level Completion | `lca-codecompletion` | Done | `JetBrains-Research/lca-project-level-code-completion` | Simple |
| LCA Bug Localization | `lca-buglocalization` | Done | `JetBrains-Research/lca-bug-localization` | Diff |
| LCA Commit Message Gen | `lca-commitmsg` | Done | `JetBrains-Research/lca-commit-message-generation` | Diff |
| LCA CI Builds Repair | `lca-cirepair` | Done | `JetBrains-Research/lca-ci-builds-repair` | Diff |
| LCA Module Summarization | `lca-modulesumm` | Done | `JetBrains-Research/lca-module-summarization` | Simple |
**New adapter module:** `adapters/long_code_arena.py` -- all 6 Long Code Arena project-level tasks.
### Batch 7 -- DPAIA & Additional Benchmarks (7 datasets)
| Benchmark | Slug | Status | Source | View Type |
|-----------|------|--------|--------|-----------|
| DPAIA EE-Dataset | `dpaia-ee` | Done | `github.com/dpaia/ee-dataset` (JSON) | Diff (SWE-bench style) |
| Multi-SWE-bench | `multiswebench` | Done | `ByteDance-Seed/Multi-SWE-bench` (JSONL) | Diff |
| SWE-bench Multilingual | `swebenchmultilingual` | Done | `SWE-bench/SWE-bench_Multilingual` | Diff |
| CrossCodeEval | `crosscodeeval` | Done | `Vincentvmt/CrossCodeEval` (JSONL) | Fill-in-the-Middle |
| McEval | `mceval` | Done | `Multilingual-Multimodal-NLP/McEval` | Simple |
| MultiPL-E | `multiple` | Done | `nuprl/MultiPL-E` | Multi-language |
| Defects4J | `defects4j` | Done | `rufimelo/defects4j` | Before/After |
### Dropped from Batch 7
| Benchmark | Reason |
|-----------|--------|
| RepoBench | HF repo has only a deprecated loading script (`repobench-p.py`), no actual data files |
**New adapter module:** `adapters/additional.py` -- DPAIA EE-Dataset, Multi-SWE-bench, SWE-bench Multilingual, CrossCodeEval, McEval, MultiPL-E, Defects4J.
**Sources:**
- Long Code Arena: https://huggingface.co/collections/JetBrains-Research/long-code-arena (OpenReview: aQoUjxlgNE)
- DPAIA EE-Dataset: https://github.com/dpaia/ee-dataset (Java/Spring SWE-bench-style)
- Multi-SWE-bench: ByteDance multilingual SWE-bench (7 languages, 1632 problems across 40 repos)
- SWE-bench Multilingual: Official SWE-bench multilingual extension (42 repos)
- CrossCodeEval: Cross-file code completion (4 languages, Amazon, 9928 problems)
- McEval: Massively multilingual code evaluation (40 languages)
- MultiPL-E: Multi-language HumanEval/MBPP translation (9 languages loaded)
- Defects4J: Classic Java bug-fix benchmark (467 bugs)
- arXiv survey reference: https://arxiv.org/abs/2505.08903
## Total Datasets: 41
Base (4): REval, CRUXEval, HumanEval+, BigOBench
Batch 1 (5): MBPP+, ClassEval, LiveCodeBench, DebugBench, HumanEval-X
Batch 2 (5): SWE-bench Lite, CodeContests, APPS, CanItEdit, MBPP
Batch 3 (5): SAFIM, BigVul, DiverseVul, PrimeVul, CodeEditorBench
Batch 4 (3): SWE-bench Verified, CodeSearchNet, Devign
Batch 5 (6): BigCodeBench, HumanEvalPack, CodeXGLUE Refinement, SWE-bench, CommitBench, EffiBench
Batch 6 (6): LCA Library-Based Code Gen, LCA Project-Level Completion, LCA Bug Localization, LCA Commit Message Gen, LCA CI Builds Repair, LCA Module Summarization
Batch 7 (7): DPAIA EE-Dataset, Multi-SWE-bench, SWE-bench Multilingual, CrossCodeEval, McEval, MultiPL-E, Defects4J
## Changelog
- 2026-03-03: Initial benchmark analysis and prioritization complete
- 2026-03-03: Batch 1 complete (MBPP+, ClassEval, LiveCodeBench, DebugBench, HumanEval-X)
- 2026-03-03: Batch 2 complete (SWE-bench Lite, CodeContests, APPS, CanItEdit, MBPP)
- 2026-03-03: Batch 3 complete (SAFIM, BigVul, DiverseVul, PrimeVul, CodeEditorBench)
- 2026-03-03: Batch 4 complete (SWE-bench Verified, CodeSearchNet, Devign)
- 2026-03-03: Fixed APPS loading (refs/convert/parquet), PrimeVul (direct JSONL), CodeEditorBench (per-task JSONL)
- 2026-03-03: All 22 datasets verified loading successfully
- 2026-03-04: Refactored adapters into submodules (adapters/code_generation.py, code_editing.py, code_reasoning.py, vulnerability.py)
- 2026-03-04: Extracted CSS and JS into static/ directory (static/problem.css, static/problem.js)
- 2026-03-04: Added sampling for large datasets (cap at 1000 with seed=42)
- 2026-03-04: Enhanced FIM view (merged code with ground truth highlighting)
- 2026-03-04: Enhanced Before/After view (diff highlighting)
- 2026-03-04: Enhanced SWE-bench diff view (full file with diff chunks)
- 2026-03-04: Batch 5 complete (BigCodeBench, HumanEvalPack, CodeXGLUE Refinement, SWE-bench, CommitBench, EffiBench)
- 2026-03-04: All 28 datasets verified loading successfully
- 2026-03-04: Batch 6 complete (Long Code Arena -- 6 project-level tasks)
- 2026-03-04: Batch 7 complete (DPAIA EE-Dataset, Multi-SWE-bench, SWE-bench Multilingual, CrossCodeEval, McEval, MultiPL-E, Defects4J)
- 2026-03-04: Dropped RepoBench (HF repo has only deprecated loading script, no data files)
- 2026-03-04: Fixed Multi-SWE-bench (load per-repo JSONL files directly instead of `load_dataset`)
- 2026-03-04: Fixed CrossCodeEval (load per-language JSONL files directly, inconsistent columns across files)
- 2026-03-04: Fixed Defects4J (split="train" not "test", fields: bug_id/func_before/func_after)
- 2026-03-04: All 41 datasets verified loading successfully
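The seeded sampling noted in the changelog (cap at 1000, seed=42) can be sketched as follows; `sample_problems` is an illustrative helper, not the app's exact implementation.

```python
import random


def sample_problems(problems: list, cap: int = 1000, seed: int = 42) -> list:
    """Deterministically downsample a large dataset to at most `cap` items.

    A fixed seed makes the subset reproducible across runs, so problem
    URLs stay stable between restarts. Illustrative sketch only.
    """
    if len(problems) <= cap:
        return list(problems)
    rng = random.Random(seed)
    return rng.sample(problems, cap)
```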