# Benchmark Integration Progress
## Status: Batches 1-7 Complete
## Batch Plan
### Batch 1 (Highest Priority -- Easy HF, High Influence)
| Benchmark | Slug | Status | HF Dataset | View Type |
|-----------|------|--------|------------|-----------|
| MBPP+ | `mbppplus` | Done | `evalplus/mbppplus` | Simple |
| ClassEval | `classeval` | Done | `FudanSELab/ClassEval` | Simple |
| LiveCodeBench | `livecodebench` | Done | `livecodebench/code_generation_lite` | Simple |
| DebugBench | `debugbench` | Done | `Rtian/DebugBench` | Before/After |
| HumanEval-X | `humanevalx` | Done | `THUDM/humaneval-x` | Multi-language |
**Refactoring done:** multi-language syntax highlighting via `get_lexer_by_name()`, a before/after code diff view, and a multi-language tab view.
### Batch 2
| Benchmark | Slug | Status | HF Dataset | View Type |
|-----------|------|--------|------------|-----------|
| SWE-bench Lite | `swebenchlite` | Done | `princeton-nlp/SWE-bench_Lite` | Diff |
| CodeContests | `codecontests` | Done | `deepmind/code_contests` | Multi-solution |
| APPS | `apps` | Done | `codeparrot/apps` | Multi-solution / Simple |
| CanItEdit | `canitedit` | Done | `nuprl/CanItEdit` | Before/After |
| MBPP | `mbpp` | Done | `google-research-datasets/mbpp` | Simple |
**New views:** Unified diff view for SWE-bench patches. Multi-solution view extended to show language labels for CodeContests.
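At its core, the unified diff view just tags each patch line by kind so the template can style additions, deletions, and hunk headers differently. A minimal stdlib sketch of that idea (the helper name and tag strings are hypothetical, not the repo's actual code):

```python
def classify_patch_lines(patch: str) -> list[tuple[str, str]]:
    """Tag each line of a unified diff for styling.

    Returns (kind, line) pairs, where kind is one of
    "header", "hunk", "add", "del", or "ctx".
    """
    tagged = []
    for line in patch.splitlines():
        if line.startswith("+++") or line.startswith("---"):
            kind = "header"   # file names
        elif line.startswith("@@"):
            kind = "hunk"     # hunk range marker
        elif line.startswith("+"):
            kind = "add"      # added line
        elif line.startswith("-"):
            kind = "del"      # removed line
        else:
            kind = "ctx"      # unchanged context
        tagged.append((kind, line))
    return tagged
```

Checking `+++`/`---` before the single-character prefixes matters, since file-name lines would otherwise be misread as additions or deletions.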
### Batch 3
| Benchmark | Slug | Status | HF Dataset | View Type |
|-----------|------|--------|------------|-----------|
| SAFIM | `safim` | Done | `gonglinyuan/safim` | Fill-in-the-Middle |
| BigVul | `bigvul` | Done | `bstee615/bigvul` | Vulnerability |
| DiverseVul | `diversevul` | Done | `claudios/DiverseVul` | Vulnerability |
| PrimeVul | `primevul` | Done | `starsofchance/PrimeVul` | Vulnerability |
| CodeEditorBench | `codeeditorbench` | Done | `m-a-p/CodeEditorBench` | Before/After |
**New views:** Fill-in-the-Middle view showing code with [HOLE] marker and ground truth. Vulnerability view with CWE badges and vulnerable/patched code comparison.
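The Fill-in-the-Middle view needs to splice the ground truth into the hole and remember where it went so the span can be highlighted. A minimal sketch, assuming prefix/suffix/ground-truth fields as described above (the function name is hypothetical):

```python
def merge_fim(prefix: str, suffix: str, ground_truth: str) -> tuple[str, tuple[int, int]]:
    """Merge a FIM problem into one code listing.

    Returns the merged code plus the (start, end) character
    offsets of the filled hole, so the view can highlight the
    ground-truth span within the surrounding code.
    """
    start = len(prefix)
    end = start + len(ground_truth)
    return prefix + ground_truth + suffix, (start, end)
```

When no ground truth is shown, the view renders a literal `[HOLE]` marker at the same position instead.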
### Batch 4
| Benchmark | Slug | Status | HF Dataset | View Type |
|-----------|------|--------|------------|-----------|
| SWE-bench Verified | `swebenchverified` | Done | `princeton-nlp/SWE-bench_Verified` | Diff |
| CodeSearchNet | `codesearchnet` | Done | `code-search-net/code_search_net` | Simple |
| Devign | `devign` | Done | `google/code_x_glue_cc_defect_detection` | Vulnerability |
### Dropped from original plan
| Benchmark | Reason |
|-----------|--------|
| DS-1000 | Complex library-specific format, limited visualization value |
| RepoBench | Repo-level context too complex for per-problem viewing |
| MultiPL-E | 22 languages but same problems as HumanEval/MBPP already covered |
| McEval | Very large (40 languages), complex format |
| xCodeEval | Very large (25M rows), 7 tasks, too complex |
| CrossVul | Similar to DiverseVul/BigVul, diminishing returns |

Note: MultiPL-E and McEval were later revisited and integrated in Batch 7.
### Batch 5
| Benchmark | Slug | Status | HF Dataset | View Type |
|-----------|------|--------|------------|-----------|
| BigCodeBench | `bigcodebench` | Done | `bigcode/bigcodebench` | Simple |
| HumanEvalPack | `humanevalpack` | Done | `bigcode/humanevalpack` | Multi-language + Before/After |
| CodeXGLUE Refinement | `codexgluerefinement` | Done | `google/code_x_glue_cc_code_refinement` | Before/After |
| SWE-bench | `swebenchfull` | Done | `princeton-nlp/SWE-bench` | Diff |
| CommitBench | `commitbench` | Done | `Maxscha/commitbench` | Diff |
| EffiBench | `effibench` | Done | `DONG19/EffiBench` | Simple |
**New views:** Multi-language view with canonical/buggy code toggle (HumanEvalPack). CommitBench reuses diff view. CodeXGLUE Refinement uses before/after Java view.
### Deferred (GitHub-only or complex infrastructure)
CoderEval, NaturalCodeBench, DevEval, RunBugRun, ConDefects, FixEval, TransCoder, AVATAR, TypeEvalPy, VJBench, SVEN, PyTER (Defects4J was originally deferred here but integrated in Batch 7)
## Architecture Decisions
### Multi-language Support
- `highlight_code()` in `app.py` accepts `language` parameter (default: `"python"`)
- Uses `get_lexer_by_name()` from Pygments for automatic lexer selection
- Adapters pass language when calling `_highlight_code(code, language=...)`
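A minimal sketch of how this could look, assuming unknown language names fall back to plain text (the actual `app.py` implementation may handle errors differently):

```python
from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import get_lexer_by_name
from pygments.util import ClassNotFound

def highlight_code(code: str, language: str = "python") -> str:
    """Render code as HTML, selecting a Pygments lexer by name."""
    try:
        lexer = get_lexer_by_name(language)
    except ClassNotFound:
        # Assumed fallback: unknown languages render as plain text.
        lexer = get_lexer_by_name("text")
    return highlight(code, lexer, HtmlFormatter(nowrap=True))
```

Pygments resolves common aliases (`"cpp"`, `"js"`, `"go"`, ...) via `get_lexer_by_name`, which is what makes the per-adapter `language=` parameter sufficient for multi-language benchmarks.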
### View Types Implemented
1. **BigOBench view** -- multiple solutions with complexity badges
2. **Simple view** -- code + inputs/outputs + test suite (HumanEval+, MBPP+, MBPP, ClassEval, LiveCodeBench, APPS, CodeSearchNet)
3. **CRUXEval view** -- given/predict task selector
4. **DREval view** -- full interactive view with coverage, arrows, ground truth
5. **Before/After view** -- side-by-side buggy/fixed code (DebugBench, CanItEdit, CodeEditorBench)
6. **Multi-language view** -- same problem in multiple languages (HumanEval-X, HumanEvalPack)
7. **Diff view** -- unified diff patch visualization (SWE-bench Lite, SWE-bench Verified, SWE-bench, CommitBench)
8. **Fill-in-the-Middle view** -- prefix + [HOLE] + suffix (SAFIM)
9. **Vulnerability view** -- vulnerable/patched code + CWE labels (BigVul, DiverseVul, PrimeVul, Devign)
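The vulnerable/patched comparison in the Vulnerability view can be approximated entirely with the stdlib. A sketch using `difflib.HtmlDiff` (the real view renders its own markup, so this is illustrative only):

```python
import difflib

def vuln_comparison_html(vulnerable: str, patched: str) -> str:
    """Render a side-by-side vulnerable/patched code comparison
    as an HTML table, with changed lines marked by difflib."""
    return difflib.HtmlDiff(wrapcolumn=80).make_table(
        vulnerable.splitlines(),
        patched.splitlines(),
        fromdesc="vulnerable",
        todesc="patched",
    )
```

The CWE badges are rendered separately from the dataset's CWE labels; only the code comparison is sketched here.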
### Batch 6 -- Long Code Arena (6 project-level tasks)
| Benchmark | Slug | Status | HF Dataset | View Type |
|-----------|------|--------|------------|-----------|
| LCA Library-Based Code Gen | `lca-libcodegen` | Done | `JetBrains-Research/lca-library-based-code-generation` | Simple |
| LCA Project-Level Completion | `lca-codecompletion` | Done | `JetBrains-Research/lca-project-level-code-completion` | Simple |
| LCA Bug Localization | `lca-buglocalization` | Done | `JetBrains-Research/lca-bug-localization` | Diff |
| LCA Commit Message Gen | `lca-commitmsg` | Done | `JetBrains-Research/lca-commit-message-generation` | Diff |
| LCA CI Builds Repair | `lca-cirepair` | Done | `JetBrains-Research/lca-ci-builds-repair` | Diff |
| LCA Module Summarization | `lca-modulesumm` | Done | `JetBrains-Research/lca-module-summarization` | Simple |
**New adapter module:** `adapters/long_code_arena.py` -- all 6 Long Code Arena project-level tasks.
### Batch 7 -- DPAIA & Additional Benchmarks (7 datasets)
| Benchmark | Slug | Status | Source | View Type |
|-----------|------|--------|--------|-----------|
| DPAIA EE-Dataset | `dpaia-ee` | Done | `github.com/dpaia/ee-dataset` (JSON) | Diff (SWE-bench style) |
| Multi-SWE-bench | `multiswebench` | Done | `ByteDance-Seed/Multi-SWE-bench` (JSONL) | Diff |
| SWE-bench Multilingual | `swebenchmultilingual` | Done | `SWE-bench/SWE-bench_Multilingual` | Diff |
| CrossCodeEval | `crosscodeeval` | Done | `Vincentvmt/CrossCodeEval` (JSONL) | Fill-in-the-Middle |
| McEval | `mceval` | Done | `Multilingual-Multimodal-NLP/McEval` | Simple |
| MultiPL-E | `multiple` | Done | `nuprl/MultiPL-E` | Multi-language |
| Defects4J | `defects4j` | Done | `rufimelo/defects4j` | Before/After |
### Dropped from Batch 7
| Benchmark | Reason |
|-----------|--------|
| RepoBench | HF repo has only a deprecated loading script (`repobench-p.py`), no actual data files |
**New adapter module:** `adapters/additional.py` -- DPAIA EE-Dataset, Multi-SWE-bench, SWE-bench Multilingual, CrossCodeEval, McEval, MultiPL-E, Defects4J.
**Sources:**
- Long Code Arena: https://huggingface.co/collections/JetBrains-Research/long-code-arena (OpenReview: aQoUjxlgNE)
- DPAIA EE-Dataset: https://github.com/dpaia/ee-dataset (Java/Spring SWE-bench-style)
- Multi-SWE-bench: ByteDance multilingual SWE-bench (7 languages, 1632 problems across 40 repos)
- SWE-bench Multilingual: Official SWE-bench multilingual extension (42 repos)
- CrossCodeEval: Cross-file code completion (4 languages, Amazon, 9928 problems)
- McEval: Massively multilingual code evaluation (40 languages)
- MultiPL-E: Multi-language HumanEval/MBPP translation (9 languages loaded)
- Defects4J: Classic Java bug-fix benchmark (467 bugs)
- Arxiv survey reference: https://arxiv.org/abs/2505.08903
## Total Datasets: 41
Base (4): REval, CRUXEval, HumanEval+, BigOBench
Batch 1 (5): MBPP+, ClassEval, LiveCodeBench, DebugBench, HumanEval-X
Batch 2 (5): SWE-bench Lite, CodeContests, APPS, CanItEdit, MBPP
Batch 3 (5): SAFIM, BigVul, DiverseVul, PrimeVul, CodeEditorBench
Batch 4 (3): SWE-bench Verified, CodeSearchNet, Devign
Batch 5 (6): BigCodeBench, HumanEvalPack, CodeXGLUE Refinement, SWE-bench, CommitBench, EffiBench
Batch 6 (6): LCA Library-Based Code Gen, LCA Project-Level Completion, LCA Bug Localization, LCA Commit Message Gen, LCA CI Builds Repair, LCA Module Summarization
Batch 7 (7): DPAIA EE-Dataset, Multi-SWE-bench, SWE-bench Multilingual, CrossCodeEval, McEval, MultiPL-E, Defects4J
## Changelog
- 2026-03-03: Initial benchmark analysis and prioritization complete
- 2026-03-03: Batch 1 complete (MBPP+, ClassEval, LiveCodeBench, DebugBench, HumanEval-X)
- 2026-03-03: Batch 2 complete (SWE-bench Lite, CodeContests, APPS, CanItEdit, MBPP)
- 2026-03-03: Batch 3 complete (SAFIM, BigVul, DiverseVul, PrimeVul, CodeEditorBench)
- 2026-03-03: Batch 4 complete (SWE-bench Verified, CodeSearchNet, Devign)
- 2026-03-03: Fixed APPS loading (refs/convert/parquet), PrimeVul (direct JSONL), CodeEditorBench (per-task JSONL)
- 2026-03-03: All 22 datasets verified loading successfully
- 2026-03-04: Refactored adapters into submodules (adapters/code_generation.py, code_editing.py, code_reasoning.py, vulnerability.py)
- 2026-03-04: Extracted CSS and JS into static/ directory (static/problem.css, static/problem.js)
- 2026-03-04: Added sampling for large datasets (cap at 1000 with seed=42)
- 2026-03-04: Enhanced FIM view (merged code with ground truth highlighting)
- 2026-03-04: Enhanced Before/After view (diff highlighting)
- 2026-03-04: Enhanced SWE-bench diff view (full file with diff chunks)
- 2026-03-04: Batch 5 complete (BigCodeBench, HumanEvalPack, CodeXGLUE Refinement, SWE-bench, CommitBench, EffiBench)
- 2026-03-04: All 28 datasets verified loading successfully
- 2026-03-04: Batch 6 complete (Long Code Arena -- 6 project-level tasks)
- 2026-03-04: Batch 7 complete (DPAIA EE-Dataset, Multi-SWE-bench, SWE-bench Multilingual, CrossCodeEval, McEval, MultiPL-E, Defects4J)
- 2026-03-04: Dropped RepoBench (HF repo has only deprecated loading script, no data files)
- 2026-03-04: Fixed Multi-SWE-bench (load per-repo JSONL files directly instead of `load_dataset`)
- 2026-03-04: Fixed CrossCodeEval (load per-language JSONL files directly, inconsistent columns across files)
- 2026-03-04: Fixed Defects4J (split="train" not "test", fields: bug_id/func_before/func_after)
- 2026-03-04: All 41 datasets verified loading successfully
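The deterministic sampling noted above (cap at 1000 with seed=42) can be sketched as follows; the helper name is hypothetical:

```python
import random

def sample_rows(rows, cap: int = 1000, seed: int = 42):
    """Deterministically down-sample a large dataset.

    Datasets at or under the cap pass through untouched; larger
    ones are sampled with a fixed seed so every load of the app
    shows the same subset of problems.
    """
    rows = list(rows)
    if len(rows) <= cap:
        return rows
    return random.Random(seed).sample(rows, cap)
```

Using a local `random.Random(seed)` instead of the module-level functions keeps the sampling reproducible regardless of what else has touched the global RNG state.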