# Benchmark Integration Progress

## Status: Batches 1-7 Complete

## Batch Plan

### Batch 1 (Highest Priority -- Easy HF, High Influence)
| Benchmark | Slug | Status | HF Dataset | View Type |
|-----------|------|--------|------------|-----------|
| MBPP+ | `mbppplus` | Done | `evalplus/mbppplus` | Simple |
| ClassEval | `classeval` | Done | `FudanSELab/ClassEval` | Simple |
| LiveCodeBench | `livecodebench` | Done | `livecodebench/code_generation_lite` | Simple |
| DebugBench | `debugbench` | Done | `Rtian/DebugBench` | Before/After |
| HumanEval-X | `humanevalx` | Done | `THUDM/humaneval-x` | Multi-language |

**Refactoring done:** Multi-language syntax highlighting via `get_lexer_by_name()`. Before/after code diff view. Multi-language tab view.

### Batch 2
| Benchmark | Slug | Status | HF Dataset | View Type |
|-----------|------|--------|------------|-----------|
| SWE-bench Lite | `swebenchlite` | Done | `princeton-nlp/SWE-bench_Lite` | Diff |
| CodeContests | `codecontests` | Done | `deepmind/code_contests` | Multi-solution |
| APPS | `apps` | Done | `codeparrot/apps` | Multi-solution / Simple |
| CanItEdit | `canitedit` | Done | `nuprl/CanItEdit` | Before/After |
| MBPP | `mbpp` | Done | `google-research-datasets/mbpp` | Simple |

**New views:** Unified diff view for SWE-bench patches. Multi-solution view extended to show language labels for CodeContests.

### Batch 3
| Benchmark | Slug | Status | HF Dataset | View Type |
|-----------|------|--------|------------|-----------|
| SAFIM | `safim` | Done | `gonglinyuan/safim` | Fill-in-the-Middle |
| BigVul | `bigvul` | Done | `bstee615/bigvul` | Vulnerability |
| DiverseVul | `diversevul` | Done | `claudios/DiverseVul` | Vulnerability |
| PrimeVul | `primevul` | Done | `starsofchance/PrimeVul` | Vulnerability |
| CodeEditorBench | `codeeditorbench` | Done | `m-a-p/CodeEditorBench` | Before/After |

**New views:** Fill-in-the-Middle view showing code with [HOLE] marker and ground truth. Vulnerability view with CWE badges and vulnerable/patched code comparison.
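The Fill-in-the-Middle assembly can be sketched as below. This is an illustration, not the project's actual view code; the record fields `prefix`, `suffix`, and `ground_truth` are hypothetical names standing in for whatever the SAFIM adapter emits.

```python
# Sketch of the FIM view's two renderings: the masked problem with a
# [HOLE] marker, and the same code with the ground truth filled in.
# Field names ("prefix", "suffix", "ground_truth") are illustrative.

def render_fim(record: dict) -> tuple[str, str]:
    """Return (masked, solved) text for a fill-in-the-middle problem."""
    masked = record["prefix"] + "[HOLE]" + record["suffix"]
    solved = record["prefix"] + record["ground_truth"] + record["suffix"]
    return masked, solved

masked, solved = render_fim({
    "prefix": "def add(a, b):\n    return ",
    "suffix": "\n",
    "ground_truth": "a + b",
})
```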

### Batch 4
| Benchmark | Slug | Status | HF Dataset | View Type |
|-----------|------|--------|------------|-----------|
| SWE-bench Verified | `swebenchverified` | Done | `princeton-nlp/SWE-bench_Verified` | Diff |
| CodeSearchNet | `codesearchnet` | Done | `code-search-net/code_search_net` | Simple |
| Devign | `devign` | Done | `google/code_x_glue_cc_defect_detection` | Vulnerability |

### Dropped from original plan
| Benchmark | Reason |
|-----------|--------|
| DS-1000 | Complex library-specific format, limited visualization value |
| RepoBench | Repo-level context too complex for per-problem viewing |
| MultiPL-E | 22 languages but same problems as HumanEval/MBPP already covered |
| McEval | Very large (40 languages), complex format |
| xCodeEval | Very large (25M rows), 7 tasks, too complex |
| CrossVul | Similar to DiverseVul/BigVul, diminishing returns |

### Batch 5
| Benchmark | Slug | Status | HF Dataset | View Type |
|-----------|------|--------|------------|-----------|
| BigCodeBench | `bigcodebench` | Done | `bigcode/bigcodebench` | Simple |
| HumanEvalPack | `humanevalpack` | Done | `bigcode/humanevalpack` | Multi-language + Before/After |
| CodeXGLUE Refinement | `codexgluerefinement` | Done | `google/code_x_glue_cc_code_refinement` | Before/After |
| SWE-bench | `swebenchfull` | Done | `princeton-nlp/SWE-bench` | Diff |
| CommitBench | `commitbench` | Done | `Maxscha/commitbench` | Diff |
| EffiBench | `effibench` | Done | `DONG19/EffiBench` | Simple |

**New views:** Multi-language view with canonical/buggy code toggle (HumanEvalPack). CommitBench reuses diff view. CodeXGLUE Refinement uses before/after Java view.

### Deferred (GitHub-only or complex infrastructure)
CoderEval, NaturalCodeBench, DevEval, RunBugRun, Defects4J, ConDefects, FixEval, TransCoder, AVATAR, TypeEvalPy, VJBench, SVEN, PyTER

## Architecture Decisions

### Multi-language Support
- `highlight_code()` in `app.py` accepts `language` parameter (default: `"python"`)
- Uses `get_lexer_by_name()` from Pygments for automatic lexer selection
- Adapters pass language when calling `_highlight_code(code, language=...)`
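A minimal reconstruction of the helper described above, using the standard Pygments API. This is a sketch, not the project's actual `highlight_code()`; the plain-text fallback for unknown language names is an assumption.

```python
# Sketch of multi-language highlighting via Pygments' get_lexer_by_name(),
# as described above. Falls back to the plain-text lexer when the language
# alias is unknown (assumed behavior, not confirmed by the source).
from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import get_lexer_by_name
from pygments.util import ClassNotFound

def highlight_code(code: str, language: str = "python") -> str:
    try:
        lexer = get_lexer_by_name(language)
    except ClassNotFound:
        lexer = get_lexer_by_name("text")  # plain-text fallback
    return highlight(code, lexer, HtmlFormatter(nowrap=True))

html = highlight_code('fn main() { println!("hi"); }', language="rust")
```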

### View Types Implemented
1. **BigOBench view** -- multiple solutions with complexity badges
2. **Simple view** -- code + inputs/outputs + test suite (HumanEval+, MBPP+, MBPP, ClassEval, LiveCodeBench, APPS, CodeSearchNet)
3. **CRUXEval view** -- given/predict task selector
4. **DREval view** -- full interactive view with coverage, arrows, ground truth
5. **Before/After view** -- side-by-side buggy/fixed code (DebugBench, CanItEdit, CodeEditorBench)
6. **Multi-language view** -- same problem in multiple languages (HumanEval-X, HumanEvalPack)
7. **Diff view** -- unified diff patch visualization (SWE-bench Lite, SWE-bench Verified, SWE-bench, CommitBench)
8. **Fill-in-the-Middle view** -- prefix + [HOLE] + suffix (SAFIM)
9. **Vulnerability view** -- vulnerable/patched code + CWE labels (BigVul, DiverseVul, PrimeVul, Devign)
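For datasets that ship before/after code rather than ready-made patches, a unified diff in the same format the Diff view renders can be produced with the standard library. This is a hedged sketch of the format, not the project's rendering code; the `a/`/`b/` path prefixes follow git convention.

```python
# Sketch: build a unified diff (the format shown in the Diff view) from
# before/after file contents using stdlib difflib.
import difflib

def make_unified_diff(before: str, after: str, path: str) -> str:
    """Return a git-style unified diff between two versions of a file."""
    lines = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(lines)

patch = make_unified_diff("x = 1\n", "x = 2\n", "demo.py")
```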

### Batch 6 -- Long Code Arena (6 project-level tasks)
| Benchmark | Slug | Status | HF Dataset | View Type |
|-----------|------|--------|------------|-----------|
| LCA Library-Based Code Gen | `lca-libcodegen` | Done | `JetBrains-Research/lca-library-based-code-generation` | Simple |
| LCA Project-Level Completion | `lca-codecompletion` | Done | `JetBrains-Research/lca-project-level-code-completion` | Simple |
| LCA Bug Localization | `lca-buglocalization` | Done | `JetBrains-Research/lca-bug-localization` | Diff |
| LCA Commit Message Gen | `lca-commitmsg` | Done | `JetBrains-Research/lca-commit-message-generation` | Diff |
| LCA CI Builds Repair | `lca-cirepair` | Done | `JetBrains-Research/lca-ci-builds-repair` | Diff |
| LCA Module Summarization | `lca-modulesumm` | Done | `JetBrains-Research/lca-module-summarization` | Simple |

**New adapter module:** `adapters/long_code_arena.py` -- all 6 Long Code Arena project-level tasks.

### Batch 7 -- DPAIA & Additional Benchmarks (7 datasets)
| Benchmark | Slug | Status | Source | View Type |
|-----------|------|--------|--------|-----------|
| DPAIA EE-Dataset | `dpaia-ee` | Done | `github.com/dpaia/ee-dataset` (JSON) | Diff (SWE-bench style) |
| Multi-SWE-bench | `multiswebench` | Done | `ByteDance-Seed/Multi-SWE-bench` (JSONL) | Diff |
| SWE-bench Multilingual | `swebenchmultilingual` | Done | `SWE-bench/SWE-bench_Multilingual` | Diff |
| CrossCodeEval | `crosscodeeval` | Done | `Vincentvmt/CrossCodeEval` (JSONL) | Fill-in-the-Middle |
| McEval | `mceval` | Done | `Multilingual-Multimodal-NLP/McEval` | Simple |
| MultiPL-E | `multiple` | Done | `nuprl/MultiPL-E` | Multi-language |
| Defects4J | `defects4j` | Done | `rufimelo/defects4j` | Before/After |

### Dropped from Batch 7
| Benchmark | Reason |
|-----------|--------|
| RepoBench | HF repo has only a deprecated loading script (`repobench-p.py`), no actual data files |

**New adapter module:** `adapters/additional.py` -- DPAIA EE-Dataset, Multi-SWE-bench, SWE-bench Multilingual, CrossCodeEval, McEval, MultiPL-E, Defects4J.
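Several Batch 7 sources are raw JSONL whose files do not share a column set (noted in the changelog for CrossCodeEval), so they are parsed directly instead of via `datasets.load_dataset`. A minimal sketch of that strategy, with illustrative field names:

```python
# Sketch: parse JSON Lines directly and pad rows to a shared schema, the
# workaround used when files have inconsistent columns and cannot be
# concatenated by the datasets library. Field names are illustrative.
import json

def load_jsonl(text: str) -> list[dict]:
    """Parse JSON Lines, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def union_columns(rows: list[dict]) -> list[dict]:
    """Fill missing keys with None so every row shares one column set."""
    keys = sorted({k for row in rows for k in row})
    return [{k: row.get(k) for k in keys} for row in rows]

rows = union_columns(load_jsonl(
    '{"id": 1, "lang": "python"}\n{"id": 2, "extra": true}'
))
```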

**Sources:**
- Long Code Arena: https://huggingface.co/collections/JetBrains-Research/long-code-arena (OpenReview: aQoUjxlgNE)
- DPAIA EE-Dataset: https://github.com/dpaia/ee-dataset (Java/Spring SWE-bench-style)
- Multi-SWE-bench: ByteDance multilingual SWE-bench (7 languages, 1632 problems across 40 repos)
- SWE-bench Multilingual: Official SWE-bench multilingual extension (42 repos)
- CrossCodeEval: Cross-file code completion (4 languages, Amazon, 9928 problems)
- McEval: Massively multilingual code evaluation (40 languages)
- MultiPL-E: Multi-language HumanEval/MBPP translation (9 languages loaded)
- Defects4J: Classic Java bug-fix benchmark (467 bugs)
- arXiv survey reference: https://arxiv.org/abs/2505.08903

## Total Datasets: 41
- Base (4): REval, CRUXEval, HumanEval+, BigOBench
- Batch 1 (5): MBPP+, ClassEval, LiveCodeBench, DebugBench, HumanEval-X
- Batch 2 (5): SWE-bench Lite, CodeContests, APPS, CanItEdit, MBPP
- Batch 3 (5): SAFIM, BigVul, DiverseVul, PrimeVul, CodeEditorBench
- Batch 4 (3): SWE-bench Verified, CodeSearchNet, Devign
- Batch 5 (6): BigCodeBench, HumanEvalPack, CodeXGLUE Refinement, SWE-bench, CommitBench, EffiBench
- Batch 6 (6): LCA Library-Based Code Gen, LCA Project-Level Completion, LCA Bug Localization, LCA Commit Message Gen, LCA CI Builds Repair, LCA Module Summarization
- Batch 7 (7): DPAIA EE-Dataset, Multi-SWE-bench, SWE-bench Multilingual, CrossCodeEval, McEval, MultiPL-E, Defects4J

## Changelog

- 2026-03-03: Initial benchmark analysis and prioritization complete
- 2026-03-03: Batch 1 complete (MBPP+, ClassEval, LiveCodeBench, DebugBench, HumanEval-X)
- 2026-03-03: Batch 2 complete (SWE-bench Lite, CodeContests, APPS, CanItEdit, MBPP)
- 2026-03-03: Batch 3 complete (SAFIM, BigVul, DiverseVul, PrimeVul, CodeEditorBench)
- 2026-03-03: Batch 4 complete (SWE-bench Verified, CodeSearchNet, Devign)
- 2026-03-03: Fixed APPS loading (refs/convert/parquet), PrimeVul (direct JSONL), CodeEditorBench (per-task JSONL)
- 2026-03-03: All 22 datasets verified loading successfully
- 2026-03-04: Refactored adapters into submodules (adapters/code_generation.py, code_editing.py, code_reasoning.py, vulnerability.py)
- 2026-03-04: Extracted CSS and JS into static/ directory (static/problem.css, static/problem.js)
- 2026-03-04: Added sampling for large datasets (cap at 1000 with seed=42)
- 2026-03-04: Enhanced FIM view (merged code with ground truth highlighting)
- 2026-03-04: Enhanced Before/After view (diff highlighting)
- 2026-03-04: Enhanced SWE-bench diff view (full file with diff chunks)
- 2026-03-04: Batch 5 complete (BigCodeBench, HumanEvalPack, CodeXGLUE Refinement, SWE-bench, CommitBench, EffiBench)
- 2026-03-04: All 28 datasets verified loading successfully
- 2026-03-04: Batch 6 complete (Long Code Arena -- 6 project-level tasks)
- 2026-03-04: Batch 7 complete (DPAIA EE-Dataset, Multi-SWE-bench, SWE-bench Multilingual, CrossCodeEval, McEval, MultiPL-E, Defects4J)
- 2026-03-04: Dropped RepoBench (HF repo has only deprecated loading script, no data files)
- 2026-03-04: Fixed Multi-SWE-bench (load per-repo JSONL files directly instead of `load_dataset`)
- 2026-03-04: Fixed CrossCodeEval (load per-language JSONL files directly, inconsistent columns across files)
- 2026-03-04: Fixed Defects4J (split="train" not "test", fields: bug_id/func_before/func_after)
- 2026-03-04: All 41 datasets verified loading successfully