egor-bogomolov committed on
Commit 9f85fac
1 Parent(s): 3c743c5

Add 13 new benchmark datasets (batches 6-8)


Long Code Arena (6 tasks):
- Library-Based Code Gen, Project-Level Completion,
Bug Localization, Commit Message Gen,
CI Builds Repair, Module Summarization
Additional datasets:
- DPAIA EE-Dataset, Multi-SWE-bench, SWE-bench Multilingual,
CrossCodeEval, Defects4J, McEval, MultiPL-E
Key implementation details:
- CrossCodeEval: load per-language JSONL directly (inconsistent HF columns)
- Multi-SWE-bench: load individual JSONL files (40 files across languages)
- Defects4J: use train split, map bug_id/func_before/func_after fields
- RepoBench dropped (no data files, deprecated loading script)
- LCA CI logs: head+tail trimming (first/last 10k chars by line)
- LCA large fields: explicit trim markers showing original vs trimmed size
- GitHub repo/commit links for all LCA tasks where data is available
- Total datasets: 41 (up from 28)
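The per-language JSONL loading noted above (CrossCodeEval, Multi-SWE-bench) can be sketched roughly as follows; the helper name and file layout are hypothetical illustrations, not the commit's actual loader:

```python
import json
import tempfile
from pathlib import Path


def load_jsonl_files(paths):
    """Load rows from several JSONL files into one list.

    Columns may differ between files (as with CrossCodeEval), so rows
    are kept as plain dicts instead of being forced into one schema.
    """
    rows = []
    for path in paths:
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if line:
                    rows.append(json.loads(line))
    return rows


# Demo with two files whose rows have different columns (hypothetical data)
tmp = Path(tempfile.mkdtemp())
(tmp / "python.jsonl").write_text('{"prompt": "a", "language": "python"}\n')
(tmp / "java.jsonl").write_text('{"prompt": "b", "metadata": {"task_id": "j/1"}}\n')
rows = load_jsonl_files(sorted(tmp.glob("*.jsonl")))
```

Keeping rows as plain dicts is what lets a single adapter tolerate the inconsistent columns across language files.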

PROGRESS.md CHANGED
@@ -87,13 +87,56 @@ CoderEval, NaturalCodeBench, DevEval, RunBugRun, Defects4J, ConDefects, FixEval,
 8. **Fill-in-the-Middle view** -- prefix + [HOLE] + suffix (SAFIM)
 9. **Vulnerability view** -- vulnerable/patched code + CWE labels (BigVul, DiverseVul, PrimeVul, Devign)
 
-## Total Datasets: 28
+### Batch 6 — Long Code Arena (6 project-level tasks)
+| Benchmark | Slug | Status | HF Dataset | View Type |
+|-----------|------|--------|------------|-----------|
+| LCA Library-Based Code Gen | `lca-libcodegen` | Done | `JetBrains-Research/lca-library-based-code-generation` | Simple |
+| LCA Project-Level Completion | `lca-codecompletion` | Done | `JetBrains-Research/lca-project-level-code-completion` | Simple |
+| LCA Bug Localization | `lca-buglocalization` | Done | `JetBrains-Research/lca-bug-localization` | Diff |
+| LCA Commit Message Gen | `lca-commitmsg` | Done | `JetBrains-Research/lca-commit-message-generation` | Diff |
+| LCA CI Builds Repair | `lca-cirepair` | Done | `JetBrains-Research/lca-ci-builds-repair` | Diff |
+| LCA Module Summarization | `lca-modulesumm` | Done | `JetBrains-Research/lca-module-summarization` | Simple |
+
+**New adapter module:** `adapters/long_code_arena.py` — all 6 Long Code Arena project-level tasks.
+
+### Batch 7 — dpaia & Additional Benchmarks (7 datasets)
+| Benchmark | Slug | Status | Source | View Type |
+|-----------|------|--------|--------|-----------|
+| DPAIA EE-Dataset | `dpaia-ee` | Done | `github.com/dpaia/ee-dataset` (JSON) | Diff (SWE-bench style) |
+| Multi-SWE-bench | `multiswebench` | Done | `ByteDance-Seed/Multi-SWE-bench` (JSONL) | Diff |
+| SWE-bench Multilingual | `swebenchmultilingual` | Done | `SWE-bench/SWE-bench_Multilingual` | Diff |
+| CrossCodeEval | `crosscodeeval` | Done | `Vincentvmt/CrossCodeEval` (JSONL) | Fill-in-the-Middle |
+| McEval | `mceval` | Done | `Multilingual-Multimodal-NLP/McEval` | Simple |
+| MultiPL-E | `multiple` | Done | `nuprl/MultiPL-E` | Multi-language |
+| Defects4J | `defects4j` | Done | `rufimelo/defects4j` | Before/After |
+
+### Dropped from Batch 7
+| Benchmark | Reason |
+|-----------|--------|
+| RepoBench | HF repo has only a deprecated loading script (`repobench-p.py`), no actual data files |
+
+**New adapter module:** `adapters/additional.py` — dpaia EE-Dataset, Multi-SWE-bench, SWE-bench Multilingual, CrossCodeEval, McEval, MultiPL-E, Defects4J.
+
+**Sources:**
+- Long Code Arena: https://huggingface.co/collections/JetBrains-Research/long-code-arena (OpenReview: aQoUjxlgNE)
+- DPAIA EE-Dataset: https://github.com/dpaia/ee-dataset (Java/Spring SWE-bench-style)
+- Multi-SWE-bench: ByteDance multilingual SWE-bench (7 languages, 1632 problems across 40 repos)
+- SWE-bench Multilingual: Official SWE-bench multilingual extension (42 repos)
+- CrossCodeEval: Cross-file code completion (4 languages, Amazon, 9928 problems)
+- McEval: Massively multilingual code evaluation (40 languages)
+- MultiPL-E: Multi-language HumanEval/MBPP translation (9 languages loaded)
+- Defects4J: Classic Java bug-fix benchmark (467 bugs)
+- Arxiv survey reference: https://arxiv.org/abs/2505.08903
+
+## Total Datasets: 41
 Base (4): REval, CRUXEval, HumanEval+, BigOBench
 Batch 1 (5): MBPP+, ClassEval, LiveCodeBench, DebugBench, HumanEval-X
 Batch 2 (5): SWE-bench Lite, CodeContests, APPS, CanItEdit, MBPP
 Batch 3 (5): SAFIM, BigVul, DiverseVul, PrimeVul, CodeEditorBench
 Batch 4 (3): SWE-bench Verified, CodeSearchNet, Devign
 Batch 5 (6): BigCodeBench, HumanEvalPack, CodeXGLUE Refinement, SWE-bench, CommitBench, EffiBench
+Batch 6 (6): LCA Library-Based Code Gen, LCA Project-Level Completion, LCA Bug Localization, LCA Commit Message Gen, LCA CI Builds Repair, LCA Module Summarization
+Batch 7 (7): DPAIA EE-Dataset, Multi-SWE-bench, SWE-bench Multilingual, CrossCodeEval, McEval, MultiPL-E, Defects4J
 
 ## Changelog
 
@@ -112,3 +155,10 @@ Batch 5 (6): BigCodeBench, HumanEvalPack, CodeXGLUE Refinement, SWE-bench, Commi
 - 2026-03-04: Enhanced SWE-bench diff view (full file with diff chunks)
 - 2026-03-04: Batch 5 complete (BigCodeBench, HumanEvalPack, CodeXGLUE Refinement, SWE-bench, CommitBench, EffiBench)
 - 2026-03-04: All 28 datasets verified loading successfully
+- 2026-03-04: Batch 6 complete (Long Code Arena — 6 project-level tasks)
+- 2026-03-04: Batch 7 complete (dpaia EE-Dataset, Multi-SWE-bench, SWE-bench Multilingual, CrossCodeEval, McEval, MultiPL-E, Defects4J)
+- 2026-03-04: Dropped RepoBench (HF repo has only deprecated loading script, no data files)
+- 2026-03-04: Fixed Multi-SWE-bench (load per-repo JSONL files directly instead of `load_dataset`)
+- 2026-03-04: Fixed CrossCodeEval (load per-language JSONL files directly, inconsistent columns across files)
+- 2026-03-04: Fixed Defects4J (split="train" not "test", fields: bug_id/func_before/func_after)
+- 2026-03-04: All 41 datasets verified loading successfully
adapters/__init__.py CHANGED
@@ -28,9 +28,23 @@ def _set_helpers(highlight_code_fn, code_offset_fn, extract_test_classes_fn):
     _extract_test_classes = extract_test_classes_fn
 
     # Propagate to submodules so adapters can use them
-    from adapters import code_editing, code_generation, code_reasoning, vulnerability
-
-    for mod in (code_generation, code_editing, code_reasoning, vulnerability):
+    from adapters import (
+        additional,
+        code_editing,
+        code_generation,
+        code_reasoning,
+        long_code_arena,
+        vulnerability,
+    )
+
+    for mod in (
+        code_generation,
+        code_editing,
+        code_reasoning,
+        vulnerability,
+        long_code_arena,
+        additional,
+    ):
         mod._highlight_code = highlight_code_fn
         mod._code_offset = code_offset_fn
         mod._extract_test_classes = extract_test_classes_fn
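The helper-injection pattern in the hunk above can be reduced to a minimal, self-contained sketch; the stand-in modules and trivial highlighter below are hypothetical, not the real adapters package:

```python
import types


def make_adapter_module(name: str) -> types.ModuleType:
    """Create a stand-in adapter submodule with an uninjected helper slot."""
    mod = types.ModuleType(name)
    mod._highlight_code = None  # placeholder until _set_helpers-style injection
    return mod


code_generation = make_adapter_module("code_generation")
long_code_arena = make_adapter_module("long_code_arena")


def set_helpers(highlight_code_fn):
    """Propagate a shared helper into each submodule, as the diff above does."""
    for mod in (code_generation, long_code_arena):
        mod._highlight_code = highlight_code_fn


# Inject a trivial highlighter; every submodule now shares the same callable
set_helpers(lambda code, **kw: f"<pre>{code}</pre>")
```

The point of the pattern is that submodules declare `_highlight_code = None` at import time and receive the real implementation later, avoiding a circular import between the adapters and the app that owns the highlighter.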
adapters/additional.py ADDED
@@ -0,0 +1,575 @@
+"""Additional benchmark adapters (dpaia EE-Dataset, Multi-SWE-bench, SWE-bench
+Multilingual, CrossCodeEval, RepoBench, McEval, MultiPL-E, Defects4J)."""
+
+from __future__ import annotations
+
+import json
+from typing import Any
+
+from adapters import DatasetAdapter
+from adapters.code_editing import SWEBenchLiteAdapter
+
+# Injected at runtime by _set_helpers()
+_highlight_code = None
+_code_offset = None
+_extract_test_classes = None
+
+
+# ---------------------------------------------------------------------------
+# dpaia Enterprise Evaluation Dataset
+# (GitHub: dpaia/ee-dataset — SWE-bench-style format for Java/Spring)
+# ---------------------------------------------------------------------------
+
+
+class DPAIAEEDatasetAdapter(DatasetAdapter):
+    slug = "dpaia-ee"
+    display_name = "DPAIA EE-Dataset"
+    has_ground_truth = False
+    has_tasks = False
+
+    def __init__(self, rows: list[dict[str, Any]]):
+        self._rows = rows
+
+    def problem_count(self) -> int:
+        return len(self._rows)
+
+    def get_problem_summary(self, idx: int) -> dict[str, Any]:
+        row = self._rows[idx]
+        tags = row.get("tags", [])
+        tag_str = ", ".join(tags[:3]) if isinstance(tags, list) else str(tags)
+        return {
+            "idx": idx,
+            "task_id": row.get("instance_id", str(idx)),
+            "entry_point": row.get("repo", f"dpaia_{idx}"),
+            "num_inputs": 0,
+            "source": tag_str or "DPAIA",
+        }
+
+    def get_problem_detail(self, idx: int) -> dict[str, Any]:
+        row = self._rows[idx]
+        patch = row.get("patch", "")
+        test_patch = row.get("test_patch", "")
+        fail_to_pass = row.get("FAIL_TO_PASS", [])
+        if isinstance(fail_to_pass, str):
+            try:
+                fail_to_pass = json.loads(fail_to_pass)
+            except (json.JSONDecodeError, TypeError):
+                fail_to_pass = [fail_to_pass]
+        pass_to_pass = row.get("PASS_TO_PASS", [])
+        if isinstance(pass_to_pass, str):
+            try:
+                pass_to_pass = json.loads(pass_to_pass)
+            except (json.JSONDecodeError, TypeError):
+                pass_to_pass = [pass_to_pass]
+
+        instance_id = row.get("instance_id", str(idx))
+        repo = row.get("repo", "")
+
+        return {
+            "idx": idx,
+            "task_id": instance_id,
+            "entry_point": repo or f"dpaia_{idx}",
+            "code": patch,
+            "highlighted_code": "",
+            "inputs": [],
+            "outputs": [],
+            "test": None,
+            "tasks": [],
+            "source": ", ".join(row.get("tags", [])[:3])
+            if isinstance(row.get("tags"), list)
+            else "DPAIA",
+            "has_ground_truth": False,
+            "has_tasks": False,
+            "description": row.get("problem_statement", ""),
+            "patch": patch,
+            "test_patch": test_patch,
+            "fail_to_pass": fail_to_pass,
+            "pass_to_pass": pass_to_pass,
+            "repo": repo,
+            "base_commit": row.get("base_commit", ""),
+        }
+
+
+# ---------------------------------------------------------------------------
+# Multi-SWE-bench (HuggingFace: ByteDance-Seed/Multi-SWE-bench)
+# Multilingual SWE-bench spanning 7 languages
+# ---------------------------------------------------------------------------
+
+
+class MultiSWEBenchAdapter(DatasetAdapter):
+    slug = "multiswebench"
+    display_name = "Multi-SWE-bench"
+    has_ground_truth = False
+    has_tasks = False
+
+    def __init__(self, rows: list[dict[str, Any]]):
+        self._rows = rows
+
+    def problem_count(self) -> int:
+        return len(self._rows)
+
+    def get_problem_summary(self, idx: int) -> dict[str, Any]:
+        row = self._rows[idx]
+        instance_id = row.get("instance_id", str(idx))
+        org = row.get("org", "")
+        repo = row.get("repo", "")
+        full_repo = f"{org}/{repo}" if org and repo else repo
+        return {
+            "idx": idx,
+            "task_id": instance_id,
+            "entry_point": instance_id.split("__")[-1] if instance_id else f"mswe_{idx}",
+            "num_inputs": 0,
+            "source": row.get("_language", full_repo or "unknown"),
+        }
+
+    def get_problem_detail(self, idx: int) -> dict[str, Any]:
+        row = self._rows[idx]
+        patch = row.get("fix_patch", "")
+        instance_id = row.get("instance_id", str(idx))
+        org = row.get("org", "")
+        repo_name = row.get("repo", "")
+        full_repo = f"{org}/{repo_name}" if org and repo_name else repo_name
+        lang = row.get("_language", "")
+        number = row.get("number", "")
+
+        # Build description from title + body
+        title = row.get("title", "")
+        body = row.get("body", "")
+        description = title
+        if body:
+            description += "\n\n" + body
+
+        links: dict[str, str] = {}
+        if full_repo:
+            links["repo_url"] = f"https://github.com/{full_repo}"
+        if number and full_repo:
+            links["issue_url"] = f"https://github.com/{full_repo}/pull/{number}"
+
+        return {
+            "idx": idx,
+            "task_id": instance_id,
+            "entry_point": instance_id.split("__")[-1] if instance_id else f"mswe_{idx}",
+            "code": patch,
+            "highlighted_code": "",
+            "inputs": [],
+            "outputs": [],
+            "test": None,
+            "tasks": [],
+            "source": lang or full_repo,
+            "has_ground_truth": False,
+            "has_tasks": False,
+            "description": description,
+            "patch": patch,
+            "test_patch": row.get("test_patch", ""),
+            "fail_to_pass": [],
+            "pass_to_pass": [],
+            "repo": full_repo,
+            "hints": row.get("hints", ""),
+            **links,
+        }
+
+
+# ---------------------------------------------------------------------------
+# SWE-bench Multilingual (HuggingFace: SWE-bench/SWE-bench_Multilingual)
+# 300 tasks across 42 repos in multiple languages
+# ---------------------------------------------------------------------------
+
+
+class SWEBenchMultilingualAdapter(SWEBenchLiteAdapter):
+    slug = "swebenchmultilingual"
+    display_name = "SWE-bench Multilingual"
+
+
+# ---------------------------------------------------------------------------
+# CrossCodeEval (HuggingFace: Vincentvmt/CrossCodeEval or amazon-science/cceval)
+# Cross-file code completion in 4 languages
+# ---------------------------------------------------------------------------
+
+
+class CrossCodeEvalAdapter(DatasetAdapter):
+    slug = "crosscodeeval"
+    display_name = "CrossCodeEval"
+    has_ground_truth = False
+    has_tasks = False
+
+    def __init__(self, rows: list[dict[str, Any]]):
+        self._rows = rows
+
+    def problem_count(self) -> int:
+        return len(self._rows)
+
+    @staticmethod
+    def _get_metadata(row: dict, key: str, default: str = "") -> str:
+        """Extract a value from the nested metadata dict."""
+        meta = row.get("metadata", {})
+        if isinstance(meta, dict):
+            return meta.get(key, default)
+        return default
+
+    def get_problem_summary(self, idx: int) -> dict[str, Any]:
+        row = self._rows[idx]
+        task_id = self._get_metadata(row, "task_id", str(idx))
+        return {
+            "idx": idx,
+            "task_id": task_id,
+            "entry_point": task_id.rsplit("/", 1)[-1] if task_id else f"cceval_{idx}",
+            "num_inputs": 0,
+            "source": row.get("language", "unknown"),
+        }
+
+    def get_problem_detail(self, idx: int) -> dict[str, Any]:
+        row = self._rows[idx]
+        prompt = row.get("prompt", "")
+        reference = row.get("groundtruth", "")
+        right_context = row.get("right_context", "")
+        lang = row.get("language", "python")
+        lang_key = lang.lower()
+
+        task_id = self._get_metadata(row, "task_id", str(idx))
+
+        # Build a FIM-style display: prompt with hole, then merged view
+        display_code = prompt + "\n/* [HOLE] */\n" + right_context
+        merged_code = prompt + reference + right_context if reference else prompt + right_context
+
+        before_hole = prompt
+        gt_start_line = before_hole.count("\n") + 1
+        gt_line_count = reference.count("\n") + (1 if reference else 0)
+        gt_end_line = gt_start_line + gt_line_count - 1
+
+        return {
+            "idx": idx,
+            "task_id": task_id,
+            "entry_point": task_id.rsplit("/", 1)[-1] if task_id else f"cceval_{idx}",
+            "code": display_code,
+            "highlighted_code": _highlight_code(display_code, language=lang_key),
+            "inputs": [],
+            "outputs": [],
+            "test": None,
+            "tasks": [],
+            "source": lang,
+            "has_ground_truth": False,
+            "has_tasks": False,
+            "fim_prefix": prompt,
+            "fim_ground_truth": reference,
+            "fim_ground_truth_highlighted": _highlight_code(reference, language=lang_key)
+            if reference
+            else "",
+            "fim_merged_code": merged_code,
+            "fim_merged_highlighted": _highlight_code(
+                merged_code,
+                highlight_lines=list(range(gt_start_line, gt_end_line + 1)),
+                language=lang_key,
+            )
+            if merged_code
+            else "",
+            "fim_gt_start_line": gt_start_line,
+            "fim_gt_end_line": gt_end_line,
+            "language": lang,
+        }
+
+
+# ---------------------------------------------------------------------------
+# RepoBench (HuggingFace: tianyang/repobench-p)
+# Repository-level code completion across Python and Java
+# ---------------------------------------------------------------------------
+
+
+class RepoBenchAdapter(DatasetAdapter):
+    slug = "repobench"
+    display_name = "RepoBench"
+    has_ground_truth = False
+    has_tasks = False
+
+    def __init__(self, rows: list[dict[str, Any]]):
+        self._rows = rows
+
+    def problem_count(self) -> int:
+        return len(self._rows)
+
+    def get_problem_summary(self, idx: int) -> dict[str, Any]:
+        row = self._rows[idx]
+        return {
+            "idx": idx,
+            "task_id": str(row.get("repo_name", idx)),
+            "entry_point": row.get("file_path", f"repobench_{idx}").rsplit("/", 1)[-1],
+            "num_inputs": 0,
+            "source": row.get("language", row.get("_setting", "unknown")),
+        }
+
+    def get_problem_detail(self, idx: int) -> dict[str, Any]:
+        row = self._rows[idx]
+        # RepoBench has context code and a next_line to predict
+        context = row.get("all_code", row.get("context", ""))
+        next_line = row.get("next_line", row.get("gold_snippet_code", ""))
+        lang = row.get("language", "python")
+        lang_key = lang.lower()
+
+        display_code = context + "\n/* [HOLE] */\n" if context else ""
+        merged_code = context + "\n" + next_line if context and next_line else context
+
+        gt_start_line = context.count("\n") + 2 if context else 1
+        gt_line_count = next_line.count("\n") + 1 if next_line else 0
+        gt_end_line = gt_start_line + gt_line_count - 1
+
+        return {
+            "idx": idx,
+            "task_id": str(row.get("repo_name", idx)),
+            "entry_point": row.get("file_path", f"repobench_{idx}").rsplit("/", 1)[-1],
+            "code": display_code,
+            "highlighted_code": _highlight_code(display_code, language=lang_key)
+            if display_code
+            else "",
+            "inputs": [],
+            "outputs": [],
+            "test": None,
+            "tasks": [],
+            "source": row.get("_setting", lang),
+            "has_ground_truth": False,
+            "has_tasks": False,
+            "fim_prefix": context,
+            "fim_ground_truth": next_line,
+            "fim_ground_truth_highlighted": _highlight_code(next_line, language=lang_key)
+            if next_line
+            else "",
+            "fim_merged_code": merged_code,
+            "fim_merged_highlighted": _highlight_code(
+                merged_code,
+                highlight_lines=list(range(gt_start_line, gt_end_line + 1)),
+                language=lang_key,
+            )
+            if merged_code
+            else "",
+            "fim_gt_start_line": gt_start_line,
+            "fim_gt_end_line": gt_end_line,
+            "language": lang,
+        }
+
+
+# ---------------------------------------------------------------------------
+# McEval (HuggingFace: Multilingual-Multimodal-NLP/McEval)
+# Massively multilingual code evaluation — 40 languages, 16K samples
+# ---------------------------------------------------------------------------
+
+
+class McEvalAdapter(DatasetAdapter):
+    slug = "mceval"
+    display_name = "McEval"
+    has_ground_truth = False
+    has_tasks = False
+
+    def __init__(self, hf_dataset):
+        self._ds = hf_dataset
+
+    def problem_count(self) -> int:
+        return len(self._ds)
+
+    def get_problem_summary(self, idx: int) -> dict[str, Any]:
+        row = self._ds[idx]
+        return {
+            "idx": idx,
+            "task_id": row.get("task_id", str(idx)),
+            "entry_point": row.get("entry_point", row.get("task_id", f"mceval_{idx}")),
+            "num_inputs": 0,
+            "source": row.get("language", "unknown"),
+        }
+
+    def get_problem_detail(self, idx: int) -> dict[str, Any]:
+        row = self._ds[idx]
+        prompt = row.get("prompt", "")
+        canonical = row.get("canonical_solution", "")
+        code = prompt + canonical
+        lang = row.get("language", "python")
+        lang_key = lang.lower()
+        # Map some known language names to Pygments lexer names
+        lang_map = {
+            "c++": "cpp",
+            "c#": "csharp",
+            "objective-c": "objectivec",
+            "visual basic": "vb.net",
+            "typescript": "typescript",
+        }
+        lang_key = lang_map.get(lang_key, lang_key)
+
+        return {
+            "idx": idx,
+            "task_id": row.get("task_id", str(idx)),
+            "entry_point": row.get("entry_point", row.get("task_id", f"mceval_{idx}")),
+            "code": code,
+            "highlighted_code": _highlight_code(code, language=lang_key),
+            "inputs": [],
+            "outputs": [],
+            "test": row.get("test", ""),
+            "tasks": [],
+            "source": lang,
+            "has_ground_truth": False,
+            "has_tasks": False,
+            "description": row.get("prompt", ""),
+            "language": lang,
+        }
+
+
+# ---------------------------------------------------------------------------
+# MultiPL-E (HuggingFace: nuprl/MultiPL-E)
+# Multi-language translated HumanEval/MBPP — 22 languages
+# ---------------------------------------------------------------------------
+
+
+class MultiPLEAdapter(DatasetAdapter):
+    slug = "multiple"
+    display_name = "MultiPL-E"
+    has_ground_truth = False
+    has_tasks = False
+
+    # Languages we load (subset of 22 available)
+    LANGUAGES = ["py", "cpp", "java", "js", "ts", "go", "rs", "cs", "rb", "lua"]
+
+    _LANG_LABELS = {
+        "py": "Python",
+        "cpp": "C++",
+        "java": "Java",
+        "js": "JavaScript",
+        "ts": "TypeScript",
+        "go": "Go",
+        "rs": "Rust",
+        "cs": "C#",
+        "rb": "Ruby",
+        "lua": "Lua",
+    }
+    _LANG_PYGMENTS = {
+        "py": "python",
+        "cpp": "cpp",
+        "java": "java",
+        "js": "javascript",
+        "ts": "typescript",
+        "go": "go",
+        "rs": "rust",
+        "cs": "csharp",
+        "rb": "ruby",
+        "lua": "lua",
+    }
+
+    def __init__(self, datasets_by_lang: dict[str, Any]):
+        self._by_lang = datasets_by_lang
+        first_lang = next(iter(self._by_lang))
+        self._count = len(self._by_lang[first_lang])
+
+    def problem_count(self) -> int:
+        return self._count
+
+    def get_problem_summary(self, idx: int) -> dict[str, Any]:
+        first_lang = next(iter(self._by_lang))
+        row = self._by_lang[first_lang][idx]
+        return {
+            "idx": idx,
+            "task_id": row.get("name", str(idx)),
+            "entry_point": row.get("name", f"multiple_{idx}"),
+            "num_inputs": len(self._by_lang),
+            "source": "MultiPL-E",
+        }
+
+    def get_problem_detail(self, idx: int) -> dict[str, Any]:
+        first_lang = next(iter(self._by_lang))
+        row = self._by_lang[first_lang][idx]
+
+        lang_solutions = []
+        for lang in self._by_lang:
+            lrow = self._by_lang[lang][idx]
+            prompt = lrow.get("prompt", "")
+            # MultiPL-E stores tests but may not have canonical solutions
+            tests = lrow.get("tests", "")
+            lang_key = self._LANG_PYGMENTS.get(lang, lang)
+            lang_label = self._LANG_LABELS.get(lang, lang)
+            lang_solutions.append(
+                {
+                    "language": lang,
+                    "language_label": lang_label,
+                    "code": prompt,
+                    "highlighted_code": _highlight_code(prompt, language=lang_key),
+                    "test": tests,
+                }
+            )
+
+        py_row = self._by_lang.get("py", self._by_lang[first_lang])[idx]
+        default_code = py_row.get("prompt", "")
+
+        return {
+            "idx": idx,
+            "task_id": row.get("name", str(idx)),
+            "entry_point": row.get("name", f"multiple_{idx}"),
+            "code": default_code,
+            "highlighted_code": _highlight_code(default_code),
+            "inputs": [],
+            "outputs": [],
+            "test": py_row.get("tests", ""),
+            "tasks": [],
+            "source": "MultiPL-E",
+            "has_ground_truth": False,
+            "has_tasks": False,
+            "lang_solutions": lang_solutions,
+        }
+
+
+# ---------------------------------------------------------------------------
+# Defects4J (HuggingFace: rufimelo/defects4j)
+# Java bug-fix benchmark — 854 real bugs from open-source projects
+# ---------------------------------------------------------------------------
+
+
+class Defects4JAdapter(DatasetAdapter):
+    slug = "defects4j"
+    display_name = "Defects4J"
+    has_ground_truth = False
+    has_tasks = False
+
+    def __init__(self, hf_dataset):
+        self._ds = hf_dataset
+
+    def problem_count(self) -> int:
+        return len(self._ds)
+
+    @staticmethod
+    def _project_from_bug_id(bug_id: str) -> str:
+        """Extract project name from bug_id like 'Compress-35'."""
+        return bug_id.rsplit("-", 1)[0] if "-" in bug_id else bug_id
+
+    def get_problem_summary(self, idx: int) -> dict[str, Any]:
+        row = self._ds[idx]
+        bug_id = row.get("bug_id", str(idx))
+        project = self._project_from_bug_id(bug_id)
+        return {
+            "idx": idx,
+            "task_id": bug_id,
+            "entry_point": project,
+            "num_inputs": 0,
+            "source": project,
+        }
+
+    def get_problem_detail(self, idx: int) -> dict[str, Any]:
+        row = self._ds[idx]
+        bug_id = row.get("bug_id", str(idx))
+        project = self._project_from_bug_id(bug_id)
+        buggy = row.get("func_before", "")
+        fixed = row.get("func_after", "")
+        return {
+            "idx": idx,
+            "task_id": bug_id,
+            "entry_point": project,
+            "code": fixed,
+            "highlighted_code": _highlight_code(fixed, language="java") if fixed else "",
+            "inputs": [],
+            "outputs": [],
+            "test": None,
+            "tasks": [],
+            "source": project,
+            "has_ground_truth": False,
+            "has_tasks": False,
+            "description": "",
+            "buggy_code": buggy,
+            "buggy_highlighted_code": _highlight_code(buggy, language="java") if buggy else "",
+            "fixed_code": fixed,
+            "fixed_highlighted_code": _highlight_code(fixed, language="java") if fixed else "",
+            "bug_category": "Bug Fix",
+            "bug_subtype": project,
+            "bug_explanation": "",
+            "language": "Java",
+        }
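As a standalone sketch of the FIM ground-truth line arithmetic used by `CrossCodeEvalAdapter` above (the helper name here is hypothetical): because the reference is appended directly to the prefix with no separating newline, the highlighted range starts on the prefix's last line.

```python
def fim_ground_truth_range(prefix: str, reference: str) -> tuple[int, int]:
    """Return the 1-indexed, inclusive line range the reference occupies
    when appended directly to the prefix, mirroring the adapter above."""
    start = prefix.count("\n") + 1  # prefix's last line, where the reference begins
    count = reference.count("\n") + (1 if reference else 0)
    return start, start + count - 1


# "a\nb\nc" + "x\ny" merges into lines: "a", "b", "cx", "y" — the
# reference spans lines 3 through 4
start, end = fim_ground_truth_range("a\nb\nc", "x\ny")
```

An empty reference yields an empty range (`end < start`), which the adapter's `highlight_lines=list(range(start, end + 1))` handles naturally as no highlighted lines.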
adapters/long_code_arena.py ADDED
@@ -0,0 +1,503 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Long Code Arena benchmark adapters (6 project-level tasks).
2
+
3
+ All datasets from: https://huggingface.co/collections/JetBrains-Research/long-code-arena
4
+ """
5
+
6
+ from __future__ import annotations
7
+
8
+ import json
9
+ from typing import Any
10
+
11
+ from adapters import DatasetAdapter
12
+
13
+ # Injected at runtime by _set_helpers()
14
+ _highlight_code = None
15
+ _code_offset = None
16
+ _extract_test_classes = None
17
+
18
+ # ---------------------------------------------------------------------------
19
+ # Shared helpers
20
+ # ---------------------------------------------------------------------------
21
+
22
+ _CODE_TRIM_LIMIT = 50_000 # chars for code / diff fields
23
+ _DESC_TRIM_LIMIT = 5_000 # chars for description / log fields
24
+
25
+
26
+ def _trim(text: str, limit: int, label: str = "Content") -> str:
27
+ """Return *text* unchanged if short enough, otherwise trim with an explicit marker."""
28
+ if len(text) <= limit:
29
+ return text
30
+ return (
31
+ text[:limit]
32
+ + f"\n\n--- {label} trimmed: showing {limit:,} of {len(text):,} characters ---"
33
+ )
34
+
35
+
36
+ _LOG_HEAD_LIMIT = 10_000 # chars budget for head part of CI log
37
+ _LOG_TAIL_LIMIT = 10_000 # chars budget for tail part of CI log
38
+
39
+
40
+ def _trim_head_tail(text: str, label: str = "Content") -> str:
41
+ """Show first ~10k chars and last ~10k chars (snapped to line boundaries)."""
42
+ if len(text) <= _LOG_HEAD_LIMIT + _LOG_TAIL_LIMIT:
43
+ return text
44
+
45
+ # Head: find the last newline within the budget
46
+ head_end = text.rfind("\n", 0, _LOG_HEAD_LIMIT)
47
+ if head_end <= 0:
48
+ head_end = _LOG_HEAD_LIMIT
49
+ head = text[:head_end]
50
+
51
+ # Tail: find the first newline after the cut point
52
+ tail_start = text.find("\n", len(text) - _LOG_TAIL_LIMIT)
53
+ if tail_start < 0 or tail_start >= len(text):
54
+ tail_start = len(text) - _LOG_TAIL_LIMIT
55
+ tail = text[tail_start:]
56
+
57
+ total_lines = text.count("\n") + 1
58
+ head_lines = head.count("\n") + 1
59
+ tail_lines = tail.count("\n") + 1
60
+ omitted = total_lines - head_lines - tail_lines
61
+
62
+ return (
63
+ head
64
+ + f"\n\n--- {label} trimmed: showing first {head_lines:,} and last"
65
+ f" {tail_lines:,} lines ({omitted:,} lines omitted,"
66
+ f" {len(text):,} chars total) ---\n\n"
67
+ + tail
68
+ )
69
+
70
+
71
+def _lca_repo_url(repo_slug: str) -> str:
+    """Convert an LCA-style repo slug to a GitHub URL.
+
+    LCA datasets use either ``owner__name`` (double underscore) or
+    ``owner/name`` (slash) depending on the task.
+    """
+    if not repo_slug:
+        return ""
+    # Normalise double-underscore to slash
+    ghname = repo_slug.replace("__", "/", 1) if "__" in repo_slug else repo_slug
+    return f"https://github.com/{ghname}"
+
+
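A standalone copy of the slug normalisation, with hypothetical slugs (not taken from the datasets) to show both accepted forms:

```python
# Standalone copy of _lca_repo_url; only the FIRST "__" is treated as the
# owner/name separator, so later double underscores stay literal.
def lca_repo_url(repo_slug: str) -> str:
    if not repo_slug:
        return ""
    ghname = repo_slug.replace("__", "/", 1) if "__" in repo_slug else repo_slug
    return f"https://github.com/{ghname}"

print(lca_repo_url("JetBrains__kotlin"))    # https://github.com/JetBrains/kotlin
print(lca_repo_url("octocat/hello-world"))  # https://github.com/octocat/hello-world
```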
+# ---------------------------------------------------------------------------
+# LCA Library-Based Code Generation
+# (HuggingFace: JetBrains-Research/lca-library-based-code-generation)
+# ---------------------------------------------------------------------------
+
+
+class LCALibCodeGenAdapter(DatasetAdapter):
+    slug = "lca-libcodegen"
+    display_name = "LCA Library-Based Code Gen"
+    has_ground_truth = False
+    has_tasks = False
+
+    def __init__(self, hf_dataset):
+        self._ds = hf_dataset
+
+    def problem_count(self) -> int:
+        return len(self._ds)
+
+    def get_problem_summary(self, idx: int) -> dict[str, Any]:
+        row = self._ds[idx]
+        return {
+            "idx": idx,
+            "task_id": row.get("repo_full_name", str(idx)),
+            "entry_point": row.get("repo_name", f"lca_libgen_{idx}"),
+            "num_inputs": row.get("n_unique_apis", 0),
+            "source": row.get("repo_owner", "LCA"),
+        }
+
+    def get_problem_detail(self, idx: int) -> dict[str, Any]:
+        row = self._ds[idx]
+        reference = row.get("clean_reference", row.get("reference", ""))
+        unique_apis = list(row.get("unique_apis", []))
+        repo_slug = row.get("repo_full_name", "")
+        return {
+            "idx": idx,
+            "task_id": repo_slug or str(idx),
+            "entry_point": row.get("repo_name", f"lca_libgen_{idx}"),
+            "code": reference,
+            "highlighted_code": _highlight_code(reference),
+            "inputs": [],
+            "outputs": [],
+            "test": None,
+            "tasks": [],
+            "source": row.get("repo_owner", "LCA"),
+            "has_ground_truth": False,
+            "has_tasks": False,
+            "description": row.get("instruction", ""),
+            "unique_apis": unique_apis,
+            "n_unique_apis": row.get("n_unique_apis", 0),
+            "repo_url": _lca_repo_url(repo_slug),
+        }
+
+
+# ---------------------------------------------------------------------------
+# LCA Project-Level Code Completion
+# (HuggingFace: JetBrains-Research/lca-project-level-code-completion)
+# ---------------------------------------------------------------------------
+
+
+class LCACodeCompletionAdapter(DatasetAdapter):
+    slug = "lca-codecompletion"
+    display_name = "LCA Project-Level Completion"
+    has_ground_truth = False
+    has_tasks = False
+
+    def __init__(self, rows: list[dict[str, Any]]):
+        self._rows = rows
+
+    def problem_count(self) -> int:
+        return len(self._rows)
+
+    def get_problem_summary(self, idx: int) -> dict[str, Any]:
+        row = self._rows[idx]
+        completion_file = row.get("completion_file", {})
+        filename = completion_file.get("filename", "") if isinstance(completion_file, dict) else ""
+        return {
+            "idx": idx,
+            "task_id": row.get("repo", str(idx)),
+            "entry_point": filename.rsplit("/", 1)[-1] if filename else f"completion_{idx}",
+            "num_inputs": 0,
+            "source": row.get("_context_size", "LCA"),
+        }
+
+    def get_problem_detail(self, idx: int) -> dict[str, Any]:
+        row = self._rows[idx]
+        completion_file = row.get("completion_file", {})
+        if isinstance(completion_file, dict):
+            filename = completion_file.get("filename", "")
+            content = completion_file.get("content", "")
+        else:
+            filename = ""
+            content = ""
+
+        completion_lines = row.get("completion_lines", {})
+        if isinstance(completion_lines, dict):
+            committed = completion_lines.get("committed", [])
+        else:
+            committed = []
+
+        lang = "python"
+        if filename:
+            ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
+            ext_map = {
+                "py": "python",
+                "java": "java",
+                "kt": "kotlin",
+                "js": "javascript",
+                "ts": "typescript",
+                "cpp": "cpp",
+                "c": "c",
+                "go": "go",
+                "rs": "rust",
+                "rb": "ruby",
+            }
+            lang = ext_map.get(ext, "python")
+
+        repo_slug = row.get("repo", "")
+        commit_hash = row.get("commit_hash", "")
+        repo_url = _lca_repo_url(repo_slug)
+        commit_url = f"{repo_url}/commit/{commit_hash}" if repo_url and commit_hash else ""
+
+        return {
+            "idx": idx,
+            "task_id": repo_slug or str(idx),
+            "entry_point": filename.rsplit("/", 1)[-1] if filename else f"completion_{idx}",
+            "code": content,
+            "highlighted_code": _highlight_code(content, language=lang) if content else "",
+            "inputs": [],
+            "outputs": [],
+            "test": None,
+            "tasks": [],
+            "source": row.get("_context_size", "LCA"),
+            "has_ground_truth": False,
+            "has_tasks": False,
+            "description": f"File: {filename}\nCommit: {commit_hash[:12]}",
+            "completion_lines_committed": committed,
+            "language": lang,
+            "repo_url": repo_url,
+            "commit_url": commit_url,
+        }
+
+
+# ---------------------------------------------------------------------------
+# LCA Bug Localization
+# (HuggingFace: JetBrains-Research/lca-bug-localization)
+# ---------------------------------------------------------------------------
+
+
+class LCABugLocalizationAdapter(DatasetAdapter):
+    slug = "lca-buglocalization"
+    display_name = "LCA Bug Localization"
+    has_ground_truth = False
+    has_tasks = False
+
+    def __init__(self, hf_dataset):
+        self._ds = hf_dataset
+
+    def problem_count(self) -> int:
+        return len(self._ds)
+
+    def get_problem_summary(self, idx: int) -> dict[str, Any]:
+        row = self._ds[idx]
+        return {
+            "idx": idx,
+            "task_id": row.get("text_id", str(idx)),
+            "entry_point": f"{row.get('repo_owner', '')}/{row.get('repo_name', '')}",
+            "num_inputs": row.get("changed_files_count", 0),
+            "source": row.get("repo_language", "unknown"),
+        }
+
+    def get_problem_detail(self, idx: int) -> dict[str, Any]:
+        row = self._ds[idx]
+        diff = row.get("diff", "")
+        repo_owner = row.get("repo_owner", "")
+        repo_name = row.get("repo_name", "")
+        repo = f"{repo_owner}/{repo_name}" if repo_owner and repo_name else ""
+        issue_url = row.get("issue_url", "")
+        pull_url = row.get("pull_url", "")
+
+        return {
+            "idx": idx,
+            "task_id": row.get("text_id", str(idx)),
+            "entry_point": repo or f"bug_{idx}",
+            "code": diff,
+            "highlighted_code": "",
+            "inputs": [],
+            "outputs": [],
+            "test": None,
+            "tasks": [],
+            "source": row.get("repo_language", "unknown"),
+            "has_ground_truth": False,
+            "has_tasks": False,
+            "description": row.get("issue_title", "")
+            + ("\n\n" + row.get("issue_body", "") if row.get("issue_body") else ""),
+            "patch": diff,
+            "repo": repo,
+            "repo_url": f"https://github.com/{repo}" if repo else "",
+            "issue_url": issue_url,
+            "commit_url": pull_url,
+        }
+
+
+# ---------------------------------------------------------------------------
+# LCA Commit Message Generation
+# (HuggingFace: JetBrains-Research/lca-commit-message-generation)
+# ---------------------------------------------------------------------------
+
+
+class LCACommitMsgGenAdapter(DatasetAdapter):
+    slug = "lca-commitmsg"
+    display_name = "LCA Commit Message Gen"
+    has_ground_truth = False
+    has_tasks = False
+
+    def __init__(self, hf_dataset):
+        self._ds = hf_dataset
+
+    def problem_count(self) -> int:
+        return len(self._ds)
+
+    def get_problem_summary(self, idx: int) -> dict[str, Any]:
+        row = self._ds[idx]
+        mods = row.get("mods", [])
+        n_files = len(mods) if isinstance(mods, list) else 0
+        return {
+            "idx": idx,
+            "task_id": row.get("hash", str(idx))[:12],
+            "entry_point": row.get("repo", f"commit_{idx}"),
+            "num_inputs": n_files,
+            "source": row.get("license", "LCA")[:20],
+        }
+
+    def get_problem_detail(self, idx: int) -> dict[str, Any]:
+        row = self._ds[idx]
+        message = row.get("message", "")
+        mods = row.get("mods", [])
+
+        # Build a unified diff from all modifications
+        diff_parts = []
+        if isinstance(mods, list):
+            for mod in mods:
+                if isinstance(mod, dict):
+                    old_path = mod.get("old_path", "")
+                    new_path = mod.get("new_path", "")
+                    mod_diff = mod.get("diff", "")
+                    if mod_diff:
+                        diff_parts.append(
+                            f"diff --git a/{old_path} b/{new_path}\n"
+                            f"--- a/{old_path}\n"
+                            f"+++ b/{new_path}\n"
+                            f"{mod_diff}"
+                        )
+        combined_diff = "\n".join(diff_parts)
+        trimmed_diff = _trim(combined_diff, _CODE_TRIM_LIMIT, "Diff")
+
+        repo_slug = row.get("repo", "")
+        commit_hash = row.get("hash", "")
+        repo_url = _lca_repo_url(repo_slug)
+        commit_url = f"{repo_url}/commit/{commit_hash}" if repo_url and commit_hash else ""
+
+        return {
+            "idx": idx,
+            "task_id": (commit_hash or str(idx))[:12],
+            "entry_point": repo_slug or f"commit_{idx}",
+            "code": trimmed_diff,
+            "highlighted_code": "",
+            "inputs": [],
+            "outputs": [],
+            "test": None,
+            "tasks": [],
+            "source": row.get("license", "LCA")[:20],
+            "has_ground_truth": False,
+            "has_tasks": False,
+            "description": message,
+            "patch": trimmed_diff,
+            "repo": repo_slug,
+            "repo_url": repo_url,
+            "commit_url": commit_url,
+            "commit_hash": commit_hash,
+        }
+
+
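The commit-message adapter synthesizes `diff --git` headers because each LCA `mods` entry stores only the per-file hunk text. A self-contained sketch on a made-up modification record (the paths and hunk below are invented for illustration):

```python
# One made-up LCA-style "mods" entry: paths plus a bare hunk, no git header.
mods = [
    {
        "old_path": "src/app.py",
        "new_path": "src/app.py",
        "diff": "@@ -1 +1 @@\n-print('hi')\n+print('hello')",
    },
]

# Prepend the standard git header to each hunk, then join the files.
diff_parts = []
for mod in mods:
    diff_parts.append(
        f"diff --git a/{mod['old_path']} b/{mod['new_path']}\n"
        f"--- a/{mod['old_path']}\n"
        f"+++ b/{mod['new_path']}\n"
        f"{mod['diff']}"
    )
combined = "\n".join(diff_parts)
print(combined)
```

The result renders like ordinary `git diff` output, so the same diff viewer can be reused for this task.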
+# ---------------------------------------------------------------------------
+# LCA CI Builds Repair
+# (HuggingFace: JetBrains-Research/lca-ci-builds-repair)
+# ---------------------------------------------------------------------------
+
+
+class LCACIRepairAdapter(DatasetAdapter):
+    slug = "lca-cirepair"
+    display_name = "LCA CI Builds Repair"
+    has_ground_truth = False
+    has_tasks = False
+
+    def __init__(self, hf_dataset):
+        self._ds = hf_dataset
+
+    def problem_count(self) -> int:
+        return len(self._ds)
+
+    def get_problem_summary(self, idx: int) -> dict[str, Any]:
+        row = self._ds[idx]
+        repo = f"{row.get('repo_owner', '')}/{row.get('repo_name', '')}"
+        return {
+            "idx": idx,
+            "task_id": str(row.get("id", idx)),
+            "entry_point": repo,
+            "num_inputs": 0,
+            "source": f"difficulty-{row.get('difficulty', '?')}",
+        }
+
+    def get_problem_detail(self, idx: int) -> dict[str, Any]:
+        row = self._ds[idx]
+        diff = row.get("diff", "")
+        trimmed_diff = _trim(diff, _CODE_TRIM_LIMIT, "Diff")
+        repo_owner = row.get("repo_owner", "")
+        repo_name = row.get("repo_name", "")
+        repo = f"{repo_owner}/{repo_name}" if repo_owner and repo_name else ""
+        commit_link = row.get("commit_link", "")
+
+        # Extract log text; it can be several MB, so trim explicitly
+        logs = row.get("logs", [])
+        log_text = ""
+        if isinstance(logs, list):
+            for entry in logs:
+                if isinstance(entry, dict):
+                    step = entry.get("step_name", "")
+                    log = entry.get("log", "")
+                    log_text += f"=== {step} ===\n{log}\n\n"
+        trimmed_log = _trim_head_tail(log_text, "CI log")
+
+        return {
+            "idx": idx,
+            "task_id": str(row.get("id", idx)),
+            "entry_point": repo or f"ci_{idx}",
+            "code": trimmed_diff,
+            "highlighted_code": "",
+            "inputs": [],
+            "outputs": [],
+            "test": None,
+            "tasks": [],
+            "source": f"difficulty-{row.get('difficulty', '?')}",
+            "has_ground_truth": False,
+            "has_tasks": False,
+            "description": f"Workflow: {row.get('workflow_name', '')}\n"
+            f"Branch: {row.get('head_branch', '')}\n"
+            f"Contributor: {row.get('contributor', '')}\n\n"
+            f"CI Log:\n{trimmed_log}",
+            "patch": trimmed_diff,
+            "repo": repo,
+            "repo_url": f"https://github.com/{repo}" if repo else "",
+            "commit_url": commit_link,
+        }
+
+
+# ---------------------------------------------------------------------------
+# LCA Module Summarization
+# (HuggingFace: JetBrains-Research/lca-module-summarization)
+# ---------------------------------------------------------------------------
+
+
+class LCAModuleSummarizationAdapter(DatasetAdapter):
+    slug = "lca-modulesumm"
+    display_name = "LCA Module Summarization"
+    has_ground_truth = False
+    has_tasks = False
+
+    def __init__(self, hf_dataset):
+        self._ds = hf_dataset
+
+    def problem_count(self) -> int:
+        return len(self._ds)
+
+    def get_problem_summary(self, idx: int) -> dict[str, Any]:
+        row = self._ds[idx]
+        return {
+            "idx": idx,
+            "task_id": row.get("docfile_name", str(idx)),
+            "entry_point": row.get("repo", f"module_{idx}"),
+            "num_inputs": 0,
+            "source": row.get("doc_type", "LCA"),
+        }
+
+    def get_problem_detail(self, idx: int) -> dict[str, Any]:
+        row = self._ds[idx]
+        target_text = row.get("target_text", "")
+        # Code context can be extremely large (up to 23 MB); trim with explicit marker
+        code_context = row.get("relevant_code_context", "")
+        trimmed_code = _trim(code_context, _CODE_TRIM_LIMIT, "Code context")
+
+        relevant_files = row.get("relevant_code_files", [])
+        if isinstance(relevant_files, str):
+            try:
+                relevant_files = json.loads(relevant_files)
+            except (json.JSONDecodeError, TypeError):
+                relevant_files = [relevant_files]
+
+        repo_slug = row.get("repo", "")
+        repo_url = _lca_repo_url(repo_slug)
+        trimmed_target = _trim(target_text, _DESC_TRIM_LIMIT, "Target documentation")
+
+        return {
+            "idx": idx,
+            "task_id": row.get("docfile_name", str(idx)),
+            "entry_point": repo_slug or f"module_{idx}",
+            "code": trimmed_code,
+            "highlighted_code": _highlight_code(trimmed_code) if trimmed_code else "",
+            "inputs": [],
+            "outputs": [],
+            "test": None,
+            "tasks": [],
+            "source": row.get("doc_type", "LCA"),
+            "has_ground_truth": False,
+            "has_tasks": False,
+            "description": f"Intent: {row.get('intent', '')}\n\n"
+            f"Doc file: {row.get('path_to_docfile', '')}\n"
+            f"Relevant files: {', '.join(relevant_files) if isinstance(relevant_files, list) else ''}\n\n"
+            f"Target documentation:\n{trimmed_target}",
+            "repo_url": repo_url,
+        }
adapters/registration.py CHANGED
@@ -7,6 +7,15 @@ import random
 from typing import Any
 
 from adapters import REGISTRY
+from adapters.additional import (
+    CrossCodeEvalAdapter,
+    Defects4JAdapter,
+    DPAIAEEDatasetAdapter,
+    McEvalAdapter,
+    MultiPLEAdapter,
+    MultiSWEBenchAdapter,
+    SWEBenchMultilingualAdapter,
+)
 from adapters.code_editing import (
     CanItEditAdapter,
     CodeEditorBenchAdapter,
@@ -38,6 +47,14 @@ from adapters.code_reasoning import (
     HumanEvalXAdapter,
     SAFIMAdapter,
 )
+from adapters.long_code_arena import (
+    LCABugLocalizationAdapter,
+    LCACIRepairAdapter,
+    LCACodeCompletionAdapter,
+    LCACommitMsgGenAdapter,
+    LCALibCodeGenAdapter,
+    LCAModuleSummarizationAdapter,
+)
 from adapters.vulnerability import (
     BigVulAdapter,
     DevignAdapter,
@@ -408,3 +425,216 @@ def register_hf_datasets() -> None:
         print(f"Loaded EffiBench: {len(effibench)} problems")
     except Exception as e:
         print(f"Warning: could not load EffiBench: {e}")
+
+    # --- Long Code Arena datasets (6 tasks) ---
+
+    try:
+        lca_libgen = load_dataset(
+            "JetBrains-Research/lca-library-based-code-generation", split="test"
+        )
+        REGISTRY["lca-libcodegen"] = LCALibCodeGenAdapter(lca_libgen)
+        print(f"Loaded LCA Library-Based Code Gen: {len(lca_libgen)} problems")
+    except Exception as e:
+        print(f"Warning: could not load LCA Library-Based Code Gen: {e}")
+
+    try:
+        # Load small and medium context sizes only (large/huge are multi-GB)
+        lca_cc_rows: list[dict[str, Any]] = []
+        for ctx in ("small_context", "medium_context"):
+            try:
+                ds = load_dataset(
+                    "JetBrains-Research/lca-project-level-code-completion", ctx, split="test"
+                )
+                for i in range(len(ds)):
+                    row = dict(ds[i])
+                    row["_context_size"] = ctx
+                    lca_cc_rows.append(row)
+            except Exception:
+                pass
+        if lca_cc_rows:
+            lca_cc_sampled = _sample_list(lca_cc_rows)
+            adapter = LCACodeCompletionAdapter(lca_cc_sampled)
+            adapter.total_count = len(lca_cc_rows)
+            REGISTRY["lca-codecompletion"] = adapter
+            print(
+                f"Loaded LCA Project-Level Completion: "
+                f"{len(lca_cc_sampled)} problems (of {len(lca_cc_rows)})"
+            )
+        else:
+            print("Warning: could not load any LCA Code Completion context sizes")
+    except Exception as e:
+        print(f"Warning: could not load LCA Project-Level Completion: {e}")
+
+    try:
+        # Merge all language subsets (py, java, kt) using test split
+        lca_bl_all = None
+        for lang_subset in ("py", "java", "kt"):
+            try:
+                ds = load_dataset(
+                    "JetBrains-Research/lca-bug-localization", lang_subset, split="test"
+                )
+                if lca_bl_all is None:
+                    from datasets import concatenate_datasets
+
+                    lca_bl_all = ds
+                else:
+                    lca_bl_all = concatenate_datasets([lca_bl_all, ds])
+            except Exception:
+                pass
+        if lca_bl_all is not None and len(lca_bl_all) > 0:
+            lca_bl = _sample_hf_dataset(lca_bl_all)
+            adapter = LCABugLocalizationAdapter(lca_bl)
+            adapter.total_count = len(lca_bl_all)
+            REGISTRY["lca-buglocalization"] = adapter
+            print(f"Loaded LCA Bug Localization: {len(lca_bl)} problems (of {len(lca_bl_all)})")
+        else:
+            print("Warning: could not load any LCA Bug Localization language subsets")
+    except Exception as e:
+        print(f"Warning: could not load LCA Bug Localization: {e}")
+
+    try:
+        lca_cmg = load_dataset("JetBrains-Research/lca-commit-message-generation", split="test")
+        REGISTRY["lca-commitmsg"] = LCACommitMsgGenAdapter(lca_cmg)
+        print(f"Loaded LCA Commit Message Gen: {len(lca_cmg)} problems")
+    except Exception as e:
+        print(f"Warning: could not load LCA Commit Message Gen: {e}")
+
+    try:
+        lca_ci = load_dataset("JetBrains-Research/lca-ci-builds-repair", split="test")
+        REGISTRY["lca-cirepair"] = LCACIRepairAdapter(lca_ci)
+        print(f"Loaded LCA CI Builds Repair: {len(lca_ci)} problems")
+    except Exception as e:
+        print(f"Warning: could not load LCA CI Builds Repair: {e}")
+
+    try:
+        lca_ms = load_dataset("JetBrains-Research/lca-module-summarization", split="test")
+        REGISTRY["lca-modulesumm"] = LCAModuleSummarizationAdapter(lca_ms)
+        print(f"Loaded LCA Module Summarization: {len(lca_ms)} problems")
+    except Exception as e:
+        print(f"Warning: could not load LCA Module Summarization: {e}")
+
+    # --- DPAIA Enterprise Evaluation Dataset ---
+
+    try:
+        import urllib.request
+
+        url = "https://raw.githubusercontent.com/dpaia/ee-dataset/main/datasets/java-spring-ee-dataset.json"
+        with urllib.request.urlopen(url) as resp:
+            dpaia_rows = json.loads(resp.read().decode("utf-8"))
+        if dpaia_rows:
+            REGISTRY["dpaia-ee"] = DPAIAEEDatasetAdapter(dpaia_rows)
+            print(f"Loaded DPAIA EE-Dataset: {len(dpaia_rows)} problems")
+    except Exception as e:
+        print(f"Warning: could not load DPAIA EE-Dataset: {e}")
+
+    # --- Multi-SWE-bench (ByteDance, multilingual issue resolving) ---
+    # Dataset has 40 per-repo JSONL files with inconsistent schemas; load directly.
+
+    try:
+        from huggingface_hub import list_repo_files
+
+        mswe_files = list_repo_files("ByteDance-Seed/Multi-SWE-bench", repo_type="dataset")
+        mswe_jsonl = [f for f in mswe_files if f.endswith(".jsonl")]
+        mswe_rows: list[dict[str, Any]] = []
+        for fname in mswe_jsonl:
+            lang_dir = fname.split("/")[0] if "/" in fname else ""
+            try:
+                rows = _load_jsonl_dataset("ByteDance-Seed/Multi-SWE-bench", [fname])
+                for d in rows:
+                    d["_language"] = lang_dir
+                mswe_rows.extend(rows)
+            except Exception:
+                pass
+        if mswe_rows:
+            mswe_sampled = _sample_list(mswe_rows)
+            adapter = MultiSWEBenchAdapter(mswe_sampled)
+            adapter.total_count = len(mswe_rows)
+            REGISTRY["multiswebench"] = adapter
+            print(f"Loaded Multi-SWE-bench: {len(mswe_sampled)} problems (of {len(mswe_rows)})")
+        else:
+            print("Warning: could not load any Multi-SWE-bench JSONL files")
+    except Exception as e:
+        print(f"Warning: could not load Multi-SWE-bench: {e}")
+
+    # --- SWE-bench Multilingual ---
+
+    try:
+        swe_ml = load_dataset("SWE-bench/SWE-bench_Multilingual", split="test")
+        REGISTRY["swebenchmultilingual"] = SWEBenchMultilingualAdapter(swe_ml)
+        print(f"Loaded SWE-bench Multilingual: {len(swe_ml)} problems")
+    except Exception as e:
+        print(f"Warning: could not load SWE-bench Multilingual: {e}")
+
+    # --- CrossCodeEval (cross-file code completion, 4 languages) ---
+    # Dataset has inconsistent columns across files; load only base line_completion.jsonl per lang
+
+    try:
+        cceval_rows: list[dict[str, Any]] = []
+        for lang in ("python", "java", "typescript", "csharp"):
+            try:
+                rows = _load_jsonl_dataset(
+                    "Vincentvmt/CrossCodeEval",
+                    [f"crosscodeeval_data/{lang}/line_completion.jsonl"],
+                )
+                for d in rows:
+                    d["language"] = lang
+                cceval_rows.extend(rows)
+            except Exception:
+                pass
+        if cceval_rows:
+            cceval_sampled = _sample_list(cceval_rows)
+            adapter = CrossCodeEvalAdapter(cceval_sampled)
+            adapter.total_count = len(cceval_rows)
+            REGISTRY["crosscodeeval"] = adapter
+            print(f"Loaded CrossCodeEval: {len(cceval_sampled)} problems (of {len(cceval_rows)})")
+        else:
+            print("Warning: could not load any CrossCodeEval language subsets")
+    except Exception as e:
+        print(f"Warning: could not load CrossCodeEval: {e}")
+
+    # --- McEval (massively multilingual code evaluation, 40 languages) ---
+
+    try:
+        mceval_full = load_dataset("Multilingual-Multimodal-NLP/McEval", "generation", split="test")
+        mceval = _sample_hf_dataset(mceval_full)
+        adapter = McEvalAdapter(mceval)
+        adapter.total_count = len(mceval_full)
+        REGISTRY["mceval"] = adapter
+        print(f"Loaded McEval: {len(mceval)} problems (of {len(mceval_full)})")
+    except Exception as e:
+        print(f"Warning: could not load McEval: {e}")
+
+    # --- MultiPL-E (multilingual HumanEval/MBPP, 22 languages) ---
+
+    try:
+        mple_datasets = {}
+        for lang_ext in MultiPLEAdapter.LANGUAGES:
+            try:
+                mple_datasets[lang_ext] = load_dataset(
+                    "nuprl/MultiPL-E", f"humaneval-{lang_ext}", split="test"
+                )
+            except Exception:
+                pass
+        if mple_datasets:
+            REGISTRY["multiple"] = MultiPLEAdapter(mple_datasets)
+            first = next(iter(mple_datasets))
+            print(
+                f"Loaded MultiPL-E: {len(mple_datasets)} languages, "
+                f"{len(mple_datasets[first])} problems each"
+            )
+        else:
+            print("Warning: could not load any MultiPL-E language subsets")
+    except Exception as e:
+        print(f"Warning: could not load MultiPL-E: {e}")
+
+    # --- Defects4J (Java bug-fix benchmark) ---
+
+    try:
+        d4j = load_dataset("rufimelo/defects4j", split="train")
+        d4j_sampled = _sample_hf_dataset(d4j)
+        adapter = Defects4JAdapter(d4j_sampled)
+        adapter.total_count = len(d4j)
+        REGISTRY["defects4j"] = adapter
+        print(f"Loaded Defects4J: {len(d4j_sampled)} problems (of {len(d4j)})")
+    except Exception as e:
+        print(f"Warning: could not load Defects4J: {e}")
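Every registration above follows the same guarded pattern: load, register, report, and swallow failures per dataset so one broken source never blocks the rest. A self-contained sketch of that pattern (the registry, loader, and dataset names here are stand-ins, not the real adapters or `load_dataset`):

```python
# Stand-in registry and loader to illustrate the per-dataset try/except pattern.
REGISTRY: dict[str, object] = {}

def load_fake_dataset(name: str) -> list[dict]:
    # Simulates a loader that can fail for some sources.
    if name == "broken":
        raise RuntimeError("no data files")
    return [{"task_id": "t1"}, {"task_id": "t2"}]

def register(slug: str, name: str) -> None:
    # Each dataset gets its own try/except, so a single failure only
    # prints a warning and the remaining datasets still register.
    try:
        ds = load_fake_dataset(name)
        REGISTRY[slug] = ds
        print(f"Loaded {name}: {len(ds)} problems")
    except Exception as e:
        print(f"Warning: could not load {name}: {e}")

register("ok", "good")
register("bad", "broken")
print(sorted(REGISTRY))  # ['ok'] -- only the dataset that loaded is registered
```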