Add 13 new benchmark datasets (batches 6-8)
Long Code Arena (6 tasks):
- Library-Based Code Gen, Project-Level Completion,
Bug Localization, Commit Message Gen,
CI Builds Repair, Module Summarization
Additional datasets:
- DPAIA EE-Dataset, Multi-SWE-bench, SWE-bench Multilingual,
CrossCodeEval, Defects4J, McEval, MultiPL-E
Key implementation details:
- CrossCodeEval: load per-language JSONL directly (inconsistent HF columns)
- Multi-SWE-bench: load individual JSONL files (40 files across languages)
- Defects4J: use train split, map bug_id/func_before/func_after fields
- RepoBench dropped (no data files, deprecated loading script)
- LCA CI logs: head+tail trimming (first/last 10k chars by line)
- LCA large fields: explicit trim markers showing original vs trimmed size
- GitHub repo/commit links for all LCA tasks where data is available
- Total datasets: 41 (up from 28)
- PROGRESS.md +51 -1
- adapters/__init__.py +17 -3
- adapters/additional.py +575 -0
- adapters/long_code_arena.py +503 -0
- adapters/registration.py +230 -0
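The "head+tail trimming (first/last 10k chars by line)" with "explicit trim markers showing original vs trimmed size" noted above can be sketched as follows. This is a minimal stand-alone sketch, not the adapter's actual helper; the function name and marker wording are illustrative.

```python
def trim_head_tail(text: str, limit: int = 10_000) -> str:
    """Keep roughly the first and last `limit` characters, cutting on
    line boundaries, with a marker showing original vs trimmed size."""
    if len(text) <= 2 * limit:
        return text
    lines = text.splitlines(keepends=True)
    head, used = [], 0
    for line in lines:  # accumulate whole lines from the top
        if used + len(line) > limit:
            break
        head.append(line)
        used += len(line)
    tail, used = [], 0
    for line in reversed(lines):  # and whole lines from the bottom
        if used + len(line) > limit:
            break
        tail.append(line)
        used += len(line)
    tail.reverse()
    kept = sum(len(l) for l in head) + sum(len(l) for l in tail)
    marker = f"\n... [trimmed: {len(text):,} chars -> {kept:,} chars kept] ...\n"
    return "".join(head) + marker + "".join(tail)
```

Trimming on line boundaries keeps CI log lines intact, so the head still shows the build setup and the tail still shows the failing step.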
PROGRESS.md (+51 -1):

@@ -87,13 +87,56 @@ CoderEval, NaturalCodeBench, DevEval, RunBugRun, Defects4J, ConDefects, FixEval,
 8. **Fill-in-the-Middle view** -- prefix + [HOLE] + suffix (SAFIM)
 9. **Vulnerability view** -- vulnerable/patched code + CWE labels (BigVul, DiverseVul, PrimeVul, Devign)
 
-## Total Datasets: 28
+### Batch 6 — Long Code Arena (6 project-level tasks)
+| Benchmark | Slug | Status | HF Dataset | View Type |
+|-----------|------|--------|------------|-----------|
+| LCA Library-Based Code Gen | `lca-libcodegen` | Done | `JetBrains-Research/lca-library-based-code-generation` | Simple |
+| LCA Project-Level Completion | `lca-codecompletion` | Done | `JetBrains-Research/lca-project-level-code-completion` | Simple |
+| LCA Bug Localization | `lca-buglocalization` | Done | `JetBrains-Research/lca-bug-localization` | Diff |
+| LCA Commit Message Gen | `lca-commitmsg` | Done | `JetBrains-Research/lca-commit-message-generation` | Diff |
+| LCA CI Builds Repair | `lca-cirepair` | Done | `JetBrains-Research/lca-ci-builds-repair` | Diff |
+| LCA Module Summarization | `lca-modulesumm` | Done | `JetBrains-Research/lca-module-summarization` | Simple |
+
+**New adapter module:** `adapters/long_code_arena.py` — all 6 Long Code Arena project-level tasks.
+
+### Batch 7 — dpaia & Additional Benchmarks (7 datasets)
+| Benchmark | Slug | Status | Source | View Type |
+|-----------|------|--------|--------|-----------|
+| DPAIA EE-Dataset | `dpaia-ee` | Done | `github.com/dpaia/ee-dataset` (JSON) | Diff (SWE-bench style) |
+| Multi-SWE-bench | `multiswebench` | Done | `ByteDance-Seed/Multi-SWE-bench` (JSONL) | Diff |
+| SWE-bench Multilingual | `swebenchmultilingual` | Done | `SWE-bench/SWE-bench_Multilingual` | Diff |
+| CrossCodeEval | `crosscodeeval` | Done | `Vincentvmt/CrossCodeEval` (JSONL) | Fill-in-the-Middle |
+| McEval | `mceval` | Done | `Multilingual-Multimodal-NLP/McEval` | Simple |
+| MultiPL-E | `multiple` | Done | `nuprl/MultiPL-E` | Multi-language |
+| Defects4J | `defects4j` | Done | `rufimelo/defects4j` | Before/After |
+
+### Dropped from Batch 7
+| Benchmark | Reason |
+|-----------|--------|
+| RepoBench | HF repo has only a deprecated loading script (`repobench-p.py`), no actual data files |
+
+**New adapter module:** `adapters/additional.py` — dpaia EE-Dataset, Multi-SWE-bench, SWE-bench Multilingual, CrossCodeEval, McEval, MultiPL-E, Defects4J.
+
+**Sources:**
+- Long Code Arena: https://huggingface.co/collections/JetBrains-Research/long-code-arena (OpenReview: aQoUjxlgNE)
+- DPAIA EE-Dataset: https://github.com/dpaia/ee-dataset (Java/Spring SWE-bench-style)
+- Multi-SWE-bench: ByteDance multilingual SWE-bench (7 languages, 1632 problems across 40 repos)
+- SWE-bench Multilingual: Official SWE-bench multilingual extension (42 repos)
+- CrossCodeEval: Cross-file code completion (4 languages, Amazon, 9928 problems)
+- McEval: Massively multilingual code evaluation (40 languages)
+- MultiPL-E: Multi-language HumanEval/MBPP translation (9 languages loaded)
+- Defects4J: Classic Java bug-fix benchmark (467 bugs)
+- Arxiv survey reference: https://arxiv.org/abs/2505.08903
+
+## Total Datasets: 41
 Base (4): REval, CRUXEval, HumanEval+, BigOBench
 Batch 1 (5): MBPP+, ClassEval, LiveCodeBench, DebugBench, HumanEval-X
 Batch 2 (5): SWE-bench Lite, CodeContests, APPS, CanItEdit, MBPP
 Batch 3 (5): SAFIM, BigVul, DiverseVul, PrimeVul, CodeEditorBench
 Batch 4 (3): SWE-bench Verified, CodeSearchNet, Devign
 Batch 5 (6): BigCodeBench, HumanEvalPack, CodeXGLUE Refinement, SWE-bench, CommitBench, EffiBench
+Batch 6 (6): LCA Library-Based Code Gen, LCA Project-Level Completion, LCA Bug Localization, LCA Commit Message Gen, LCA CI Builds Repair, LCA Module Summarization
+Batch 7 (7): DPAIA EE-Dataset, Multi-SWE-bench, SWE-bench Multilingual, CrossCodeEval, McEval, MultiPL-E, Defects4J
 
 ## Changelog
 
@@ -112,3 +155,10 @@ Batch 5 (6): BigCodeBench, HumanEvalPack, CodeXGLUE Refinement, SWE-bench, Commi
 - 2026-03-04: Enhanced SWE-bench diff view (full file with diff chunks)
 - 2026-03-04: Batch 5 complete (BigCodeBench, HumanEvalPack, CodeXGLUE Refinement, SWE-bench, CommitBench, EffiBench)
 - 2026-03-04: All 28 datasets verified loading successfully
+- 2026-03-04: Batch 6 complete (Long Code Arena — 6 project-level tasks)
+- 2026-03-04: Batch 7 complete (dpaia EE-Dataset, Multi-SWE-bench, SWE-bench Multilingual, CrossCodeEval, McEval, MultiPL-E, Defects4J)
+- 2026-03-04: Dropped RepoBench (HF repo has only deprecated loading script, no data files)
+- 2026-03-04: Fixed Multi-SWE-bench (load per-repo JSONL files directly instead of `load_dataset`)
+- 2026-03-04: Fixed CrossCodeEval (load per-language JSONL files directly, inconsistent columns across files)
+- 2026-03-04: Fixed Defects4J (split="train" not "test", fields: bug_id/func_before/func_after)
+- 2026-03-04: All 41 datasets verified loading successfully
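The CrossCodeEval and Multi-SWE-bench fixes above boil down to reading each JSONL file directly and tagging rows per language, since `load_dataset` rejects files whose columns differ. A minimal stand-alone sketch of that pattern (field names here are illustrative, not the datasets' schemas):

```python
import io
import json


def load_jsonl_rows(files: dict[str, io.TextIOBase]) -> list[dict]:
    """Read one JSONL stream per language into a single row list,
    tolerating inconsistent columns across files."""
    rows: list[dict] = []
    for language, stream in files.items():
        for line in stream:
            line = line.strip()
            if not line:
                continue  # skip blank lines between records
            row = json.loads(line)
            row["_language"] = language  # uniform key regardless of columns
            rows.append(row)
    return rows
```

Because each row is just a dict, files with extra or missing fields merge cleanly; adapters then use `row.get(...)` with defaults, as the code below does.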
adapters/__init__.py (+17 -3):

@@ -28,9 +28,23 @@ def _set_helpers(highlight_code_fn, code_offset_fn, extract_test_classes_fn):
     _extract_test_classes = extract_test_classes_fn
 
     # Propagate to submodules so adapters can use them
-    from adapters import
-
-
+    from adapters import (
+        additional,
+        code_editing,
+        code_generation,
+        code_reasoning,
+        long_code_arena,
+        vulnerability,
+    )
+
+    for mod in (
+        code_generation,
+        code_editing,
+        code_reasoning,
+        vulnerability,
+        long_code_arena,
+        additional,
+    ):
         mod._highlight_code = highlight_code_fn
         mod._code_offset = code_offset_fn
         mod._extract_test_classes = extract_test_classes_fn
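The `_set_helpers` hunk injects shared helpers into each adapter submodule at runtime instead of importing them at module load. The same pattern can be exercised in isolation; the `SimpleNamespace` objects here are stand-ins for the real submodules, and `set_helpers` is a simplified one-helper version of `_set_helpers`:

```python
from types import SimpleNamespace

# Stand-ins for adapter submodules; the real code imports actual modules.
code_generation = SimpleNamespace()
long_code_arena = SimpleNamespace()


def set_helpers(highlight_code_fn):
    """Propagate a shared helper into each submodule (simplified sketch)."""
    for mod in (code_generation, long_code_arena):
        mod._highlight_code = highlight_code_fn


set_helpers(lambda code, **kwargs: f"<pre>{code}</pre>")
```

Late injection like this avoids a circular import: the submodules can be imported before the helpers (which may depend on the main app) exist.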
adapters/additional.py (+575 -0):

@@ -0,0 +1,575 @@
+"""Additional benchmark adapters (dpaia EE-Dataset, Multi-SWE-bench, SWE-bench
+Multilingual, CrossCodeEval, RepoBench, McEval, MultiPL-E, Defects4J)."""
+
+from __future__ import annotations
+
+import json
+from typing import Any
+
+from adapters import DatasetAdapter
+from adapters.code_editing import SWEBenchLiteAdapter
+
+# Injected at runtime by _set_helpers()
+_highlight_code = None
+_code_offset = None
+_extract_test_classes = None
+
+
+# ---------------------------------------------------------------------------
+# dpaia Enterprise Evaluation Dataset
+# (GitHub: dpaia/ee-dataset — SWE-bench-style format for Java/Spring)
+# ---------------------------------------------------------------------------
+
+
+class DPAIAEEDatasetAdapter(DatasetAdapter):
+    slug = "dpaia-ee"
+    display_name = "DPAIA EE-Dataset"
+    has_ground_truth = False
+    has_tasks = False
+
+    def __init__(self, rows: list[dict[str, Any]]):
+        self._rows = rows
+
+    def problem_count(self) -> int:
+        return len(self._rows)
+
+    def get_problem_summary(self, idx: int) -> dict[str, Any]:
+        row = self._rows[idx]
+        tags = row.get("tags", [])
+        tag_str = ", ".join(tags[:3]) if isinstance(tags, list) else str(tags)
+        return {
+            "idx": idx,
+            "task_id": row.get("instance_id", str(idx)),
+            "entry_point": row.get("repo", f"dpaia_{idx}"),
+            "num_inputs": 0,
+            "source": tag_str or "DPAIA",
+        }
+
+    def get_problem_detail(self, idx: int) -> dict[str, Any]:
+        row = self._rows[idx]
+        patch = row.get("patch", "")
+        test_patch = row.get("test_patch", "")
+        fail_to_pass = row.get("FAIL_TO_PASS", [])
+        if isinstance(fail_to_pass, str):
+            try:
+                fail_to_pass = json.loads(fail_to_pass)
+            except (json.JSONDecodeError, TypeError):
+                fail_to_pass = [fail_to_pass]
+        pass_to_pass = row.get("PASS_TO_PASS", [])
+        if isinstance(pass_to_pass, str):
+            try:
+                pass_to_pass = json.loads(pass_to_pass)
+            except (json.JSONDecodeError, TypeError):
+                pass_to_pass = [pass_to_pass]
+
+        instance_id = row.get("instance_id", str(idx))
+        repo = row.get("repo", "")
+
+        return {
+            "idx": idx,
+            "task_id": instance_id,
+            "entry_point": repo or f"dpaia_{idx}",
+            "code": patch,
+            "highlighted_code": "",
+            "inputs": [],
+            "outputs": [],
+            "test": None,
+            "tasks": [],
+            "source": ", ".join(row.get("tags", [])[:3])
+            if isinstance(row.get("tags"), list)
+            else "DPAIA",
+            "has_ground_truth": False,
+            "has_tasks": False,
+            "description": row.get("problem_statement", ""),
+            "patch": patch,
+            "test_patch": test_patch,
+            "fail_to_pass": fail_to_pass,
+            "pass_to_pass": pass_to_pass,
+            "repo": repo,
+            "base_commit": row.get("base_commit", ""),
+        }
+
+
+# ---------------------------------------------------------------------------
+# Multi-SWE-bench (HuggingFace: ByteDance-Seed/Multi-SWE-bench)
+# Multilingual SWE-bench spanning 7 languages
+# ---------------------------------------------------------------------------
+
+
+class MultiSWEBenchAdapter(DatasetAdapter):
+    slug = "multiswebench"
+    display_name = "Multi-SWE-bench"
+    has_ground_truth = False
+    has_tasks = False
+
+    def __init__(self, rows: list[dict[str, Any]]):
+        self._rows = rows
+
+    def problem_count(self) -> int:
+        return len(self._rows)
+
+    def get_problem_summary(self, idx: int) -> dict[str, Any]:
+        row = self._rows[idx]
+        instance_id = row.get("instance_id", str(idx))
+        org = row.get("org", "")
+        repo = row.get("repo", "")
+        full_repo = f"{org}/{repo}" if org and repo else repo
+        return {
+            "idx": idx,
+            "task_id": instance_id,
+            "entry_point": instance_id.split("__")[-1] if instance_id else f"mswe_{idx}",
+            "num_inputs": 0,
+            "source": row.get("_language", full_repo or "unknown"),
+        }
+
+    def get_problem_detail(self, idx: int) -> dict[str, Any]:
+        row = self._rows[idx]
+        patch = row.get("fix_patch", "")
+        instance_id = row.get("instance_id", str(idx))
+        org = row.get("org", "")
+        repo_name = row.get("repo", "")
+        full_repo = f"{org}/{repo_name}" if org and repo_name else repo_name
+        lang = row.get("_language", "")
+        number = row.get("number", "")
+
+        # Build description from title + body
+        title = row.get("title", "")
+        body = row.get("body", "")
+        description = title
+        if body:
+            description += "\n\n" + body
+
+        links: dict[str, str] = {}
+        if full_repo:
+            links["repo_url"] = f"https://github.com/{full_repo}"
+        if number and full_repo:
+            links["issue_url"] = f"https://github.com/{full_repo}/pull/{number}"
+
+        return {
+            "idx": idx,
+            "task_id": instance_id,
+            "entry_point": instance_id.split("__")[-1] if instance_id else f"mswe_{idx}",
+            "code": patch,
+            "highlighted_code": "",
+            "inputs": [],
+            "outputs": [],
+            "test": None,
+            "tasks": [],
+            "source": lang or full_repo,
+            "has_ground_truth": False,
+            "has_tasks": False,
+            "description": description,
+            "patch": patch,
+            "test_patch": row.get("test_patch", ""),
+            "fail_to_pass": [],
+            "pass_to_pass": [],
+            "repo": full_repo,
+            "hints": row.get("hints", ""),
+            **links,
+        }
+
+
+# ---------------------------------------------------------------------------
+# SWE-bench Multilingual (HuggingFace: SWE-bench/SWE-bench_Multilingual)
+# 300 tasks across 42 repos in multiple languages
+# ---------------------------------------------------------------------------
+
+
+class SWEBenchMultilingualAdapter(SWEBenchLiteAdapter):
+    slug = "swebenchmultilingual"
+    display_name = "SWE-bench Multilingual"
+
+
+# ---------------------------------------------------------------------------
+# CrossCodeEval (HuggingFace: Vincentvmt/CrossCodeEval or amazon-science/cceval)
+# Cross-file code completion in 4 languages
+# ---------------------------------------------------------------------------
+
+
+class CrossCodeEvalAdapter(DatasetAdapter):
+    slug = "crosscodeeval"
+    display_name = "CrossCodeEval"
+    has_ground_truth = False
+    has_tasks = False
+
+    def __init__(self, rows: list[dict[str, Any]]):
+        self._rows = rows
+
+    def problem_count(self) -> int:
+        return len(self._rows)
+
+    @staticmethod
+    def _get_metadata(row: dict, key: str, default: str = "") -> str:
+        """Extract a value from the nested metadata dict."""
+        meta = row.get("metadata", {})
+        if isinstance(meta, dict):
+            return meta.get(key, default)
+        return default
+
+    def get_problem_summary(self, idx: int) -> dict[str, Any]:
+        row = self._rows[idx]
+        task_id = self._get_metadata(row, "task_id", str(idx))
+        return {
+            "idx": idx,
+            "task_id": task_id,
+            "entry_point": task_id.rsplit("/", 1)[-1] if task_id else f"cceval_{idx}",
+            "num_inputs": 0,
+            "source": row.get("language", "unknown"),
+        }
+
+    def get_problem_detail(self, idx: int) -> dict[str, Any]:
+        row = self._rows[idx]
+        prompt = row.get("prompt", "")
+        reference = row.get("groundtruth", "")
+        right_context = row.get("right_context", "")
+        lang = row.get("language", "python")
+        lang_key = lang.lower()
+
+        task_id = self._get_metadata(row, "task_id", str(idx))
+
+        # Build a FIM-style display: prompt with hole, then merged view
+        display_code = prompt + "\n/* [HOLE] */\n" + right_context
+        merged_code = prompt + reference + right_context if reference else prompt + right_context
+
+        before_hole = prompt
+        gt_start_line = before_hole.count("\n") + 1
+        gt_line_count = reference.count("\n") + (1 if reference else 0)
+        gt_end_line = gt_start_line + gt_line_count - 1
+
+        return {
+            "idx": idx,
+            "task_id": task_id,
+            "entry_point": task_id.rsplit("/", 1)[-1] if task_id else f"cceval_{idx}",
+            "code": display_code,
+            "highlighted_code": _highlight_code(display_code, language=lang_key),
+            "inputs": [],
+            "outputs": [],
+            "test": None,
+            "tasks": [],
+            "source": lang,
+            "has_ground_truth": False,
+            "has_tasks": False,
+            "fim_prefix": prompt,
+            "fim_ground_truth": reference,
+            "fim_ground_truth_highlighted": _highlight_code(reference, language=lang_key)
+            if reference
+            else "",
+            "fim_merged_code": merged_code,
+            "fim_merged_highlighted": _highlight_code(
+                merged_code,
+                highlight_lines=list(range(gt_start_line, gt_end_line + 1)),
+                language=lang_key,
+            )
+            if merged_code
+            else "",
+            "fim_gt_start_line": gt_start_line,
+            "fim_gt_end_line": gt_end_line,
+            "language": lang,
+        }
+
+
+# ---------------------------------------------------------------------------
+# RepoBench (HuggingFace: tianyang/repobench-p)
+# Repository-level code completion across Python and Java
+# ---------------------------------------------------------------------------
+
+
+class RepoBenchAdapter(DatasetAdapter):
+    slug = "repobench"
+    display_name = "RepoBench"
+    has_ground_truth = False
+    has_tasks = False
+
+    def __init__(self, rows: list[dict[str, Any]]):
+        self._rows = rows
+
+    def problem_count(self) -> int:
+        return len(self._rows)
+
+    def get_problem_summary(self, idx: int) -> dict[str, Any]:
+        row = self._rows[idx]
+        return {
+            "idx": idx,
+            "task_id": str(row.get("repo_name", idx)),
+            "entry_point": row.get("file_path", f"repobench_{idx}").rsplit("/", 1)[-1],
+            "num_inputs": 0,
+            "source": row.get("language", row.get("_setting", "unknown")),
+        }
+
+    def get_problem_detail(self, idx: int) -> dict[str, Any]:
+        row = self._rows[idx]
+        # RepoBench has context code and a next_line to predict
+        context = row.get("all_code", row.get("context", ""))
+        next_line = row.get("next_line", row.get("gold_snippet_code", ""))
+        lang = row.get("language", "python")
+        lang_key = lang.lower()
+
+        display_code = context + "\n/* [HOLE] */\n" if context else ""
+        merged_code = context + "\n" + next_line if context and next_line else context
+
+        gt_start_line = context.count("\n") + 2 if context else 1
+        gt_line_count = next_line.count("\n") + 1 if next_line else 0
+        gt_end_line = gt_start_line + gt_line_count - 1
+
+        return {
+            "idx": idx,
+            "task_id": str(row.get("repo_name", idx)),
+            "entry_point": row.get("file_path", f"repobench_{idx}").rsplit("/", 1)[-1],
+            "code": display_code,
+            "highlighted_code": _highlight_code(display_code, language=lang_key)
+            if display_code
+            else "",
+            "inputs": [],
+            "outputs": [],
+            "test": None,
+            "tasks": [],
+            "source": row.get("_setting", lang),
+            "has_ground_truth": False,
+            "has_tasks": False,
+            "fim_prefix": context,
+            "fim_ground_truth": next_line,
+            "fim_ground_truth_highlighted": _highlight_code(next_line, language=lang_key)
+            if next_line
+            else "",
+            "fim_merged_code": merged_code,
+            "fim_merged_highlighted": _highlight_code(
+                merged_code,
+                highlight_lines=list(range(gt_start_line, gt_end_line + 1)),
+                language=lang_key,
+            )
+            if merged_code
+            else "",
+            "fim_gt_start_line": gt_start_line,
+            "fim_gt_end_line": gt_end_line,
+            "language": lang,
+        }
+
+
+# ---------------------------------------------------------------------------
+# McEval (HuggingFace: Multilingual-Multimodal-NLP/McEval)
+# Massively multilingual code evaluation — 40 languages, 16K samples
+# ---------------------------------------------------------------------------
+
+
+class McEvalAdapter(DatasetAdapter):
+    slug = "mceval"
+    display_name = "McEval"
+    has_ground_truth = False
+    has_tasks = False
+
+    def __init__(self, hf_dataset):
+        self._ds = hf_dataset
+
+    def problem_count(self) -> int:
+        return len(self._ds)
+
+    def get_problem_summary(self, idx: int) -> dict[str, Any]:
+        row = self._ds[idx]
+        return {
+            "idx": idx,
+            "task_id": row.get("task_id", str(idx)),
+            "entry_point": row.get("entry_point", row.get("task_id", f"mceval_{idx}")),
+            "num_inputs": 0,
+            "source": row.get("language", "unknown"),
+        }
+
+    def get_problem_detail(self, idx: int) -> dict[str, Any]:
+        row = self._ds[idx]
+        prompt = row.get("prompt", "")
+        canonical = row.get("canonical_solution", "")
+        code = prompt + canonical
+        lang = row.get("language", "python")
+        lang_key = lang.lower()
+        # Map some known language names to Pygments lexer names
+        lang_map = {
+            "c++": "cpp",
+            "c#": "csharp",
+            "objective-c": "objectivec",
+            "visual basic": "vb.net",
+            "typescript": "typescript",
+        }
+        lang_key = lang_map.get(lang_key, lang_key)
+
+        return {
+            "idx": idx,
+            "task_id": row.get("task_id", str(idx)),
+            "entry_point": row.get("entry_point", row.get("task_id", f"mceval_{idx}")),
+            "code": code,
+            "highlighted_code": _highlight_code(code, language=lang_key),
+            "inputs": [],
+            "outputs": [],
+            "test": row.get("test", ""),
+            "tasks": [],
+            "source": lang,
+            "has_ground_truth": False,
+            "has_tasks": False,
+            "description": row.get("prompt", ""),
+            "language": lang,
+        }
+
+
+# ---------------------------------------------------------------------------
+# MultiPL-E (HuggingFace: nuprl/MultiPL-E)
+# Multi-language translated HumanEval/MBPP — 22 languages
+# ---------------------------------------------------------------------------
+
+
+class MultiPLEAdapter(DatasetAdapter):
+    slug = "multiple"
+    display_name = "MultiPL-E"
+    has_ground_truth = False
+    has_tasks = False
+
+    # Languages we load (subset of 22 available)
+    LANGUAGES = ["py", "cpp", "java", "js", "ts", "go", "rs", "cs", "rb", "lua"]
+
+    _LANG_LABELS = {
+        "py": "Python",
+        "cpp": "C++",
+        "java": "Java",
+        "js": "JavaScript",
+        "ts": "TypeScript",
+        "go": "Go",
+        "rs": "Rust",
+        "cs": "C#",
+        "rb": "Ruby",
+        "lua": "Lua",
+    }
+    _LANG_PYGMENTS = {
+        "py": "python",
+        "cpp": "cpp",
+        "java": "java",
+        "js": "javascript",
+        "ts": "typescript",
+        "go": "go",
+        "rs": "rust",
+        "cs": "csharp",
+        "rb": "ruby",
+        "lua": "lua",
+    }
+
+    def __init__(self, datasets_by_lang: dict[str, Any]):
+        self._by_lang = datasets_by_lang
+        first_lang = next(iter(self._by_lang))
+        self._count = len(self._by_lang[first_lang])
+
+    def problem_count(self) -> int:
+        return self._count
+
+    def get_problem_summary(self, idx: int) -> dict[str, Any]:
+        first_lang = next(iter(self._by_lang))
+        row = self._by_lang[first_lang][idx]
+        return {
+            "idx": idx,
+            "task_id": row.get("name", str(idx)),
+            "entry_point": row.get("name", f"multiple_{idx}"),
+            "num_inputs": len(self._by_lang),
+            "source": "MultiPL-E",
+        }
+
+    def get_problem_detail(self, idx: int) -> dict[str, Any]:
+        first_lang = next(iter(self._by_lang))
+        row = self._by_lang[first_lang][idx]
+
+        lang_solutions = []
+        for lang in self._by_lang:
+            lrow = self._by_lang[lang][idx]
+            prompt = lrow.get("prompt", "")
+            # MultiPL-E stores tests but may not have canonical solutions
+            tests = lrow.get("tests", "")
+            lang_key = self._LANG_PYGMENTS.get(lang, lang)
+            lang_label = self._LANG_LABELS.get(lang, lang)
+            lang_solutions.append(
+                {
+                    "language": lang,
+                    "language_label": lang_label,
+                    "code": prompt,
+                    "highlighted_code": _highlight_code(prompt, language=lang_key),
+                    "test": tests,
+                }
+            )
+
+        py_row = self._by_lang.get("py", self._by_lang[first_lang])[idx]
+        default_code = py_row.get("prompt", "")
+
+        return {
+            "idx": idx,
+            "task_id": row.get("name", str(idx)),
+            "entry_point": row.get("name", f"multiple_{idx}"),
+            "code": default_code,
+            "highlighted_code": _highlight_code(default_code),
+            "inputs": [],
+            "outputs": [],
+            "test": py_row.get("tests", ""),
+            "tasks": [],
+            "source": "MultiPL-E",
+            "has_ground_truth": False,
+            "has_tasks": False,
+            "lang_solutions": lang_solutions,
+        }
+
+
+# ---------------------------------------------------------------------------
+# Defects4J (HuggingFace: rufimelo/defects4j)
+# Java bug-fix benchmark — 854 real bugs from open-source projects
+# ---------------------------------------------------------------------------
+
+
+class Defects4JAdapter(DatasetAdapter):
+    slug = "defects4j"
+    display_name = "Defects4J"
+    has_ground_truth = False
+    has_tasks = False
+
+    def __init__(self, hf_dataset):
+        self._ds = hf_dataset
+
+    def problem_count(self) -> int:
+        return len(self._ds)
+
+    @staticmethod
+    def _project_from_bug_id(bug_id: str) -> str:
+        """Extract project name from bug_id like 'Compress-35'."""
+        return bug_id.rsplit("-", 1)[0] if "-" in bug_id else bug_id
+
+    def get_problem_summary(self, idx: int) -> dict[str, Any]:
+        row = self._ds[idx]
+        bug_id = row.get("bug_id", str(idx))
+        project = self._project_from_bug_id(bug_id)
+        return {
+            "idx": idx,
+            "task_id": bug_id,
+            "entry_point": project,
+            "num_inputs": 0,
+            "source": project,
+        }
+
+    def get_problem_detail(self, idx: int) -> dict[str, Any]:
+        row = self._ds[idx]
|
| 549 |
+
bug_id = row.get("bug_id", str(idx))
|
| 550 |
+
project = self._project_from_bug_id(bug_id)
|
| 551 |
+
buggy = row.get("func_before", "")
|
| 552 |
+
fixed = row.get("func_after", "")
|
| 553 |
+
return {
|
| 554 |
+
"idx": idx,
|
| 555 |
+
"task_id": bug_id,
|
| 556 |
+
"entry_point": project,
|
| 557 |
+
"code": fixed,
|
| 558 |
+
"highlighted_code": _highlight_code(fixed, language="java") if fixed else "",
|
| 559 |
+
"inputs": [],
|
| 560 |
+
"outputs": [],
|
| 561 |
+
"test": None,
|
| 562 |
+
"tasks": [],
|
| 563 |
+
"source": project,
|
| 564 |
+
"has_ground_truth": False,
|
| 565 |
+
"has_tasks": False,
|
| 566 |
+
"description": "",
|
| 567 |
+
"buggy_code": buggy,
|
| 568 |
+
"buggy_highlighted_code": _highlight_code(buggy, language="java") if buggy else "",
|
| 569 |
+
"fixed_code": fixed,
|
| 570 |
+
"fixed_highlighted_code": _highlight_code(fixed, language="java") if fixed else "",
|
| 571 |
+
"bug_category": "Bug Fix",
|
| 572 |
+
"bug_subtype": project,
|
| 573 |
+
"bug_explanation": "",
|
| 574 |
+
"language": "Java",
|
| 575 |
+
}
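The bug-id parsing above is the only field mapping Defects4J needs beyond the raw columns. A standalone sketch of that helper (re-implemented here outside the adapter, just for illustration):

```python
# Standalone sketch mirroring Defects4JAdapter._project_from_bug_id.
# rsplit("-", 1) splits on the LAST hyphen, so project names that
# themselves contain hyphens would survive intact.
def project_from_bug_id(bug_id: str) -> str:
    """Extract the project name from an id like 'Compress-35'."""
    return bug_id.rsplit("-", 1)[0] if "-" in bug_id else bug_id


print(project_from_bug_id("Compress-35"))        # Compress
print(project_from_bug_id("JacksonDatabind-1"))  # JacksonDatabind
print(project_from_bug_id("Chart"))              # Chart (no '-': unchanged)
```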
@@ -0,0 +1,503 @@ (new file: adapters/long_code_arena.py)
"""Long Code Arena benchmark adapters (6 project-level tasks).

All datasets from: https://huggingface.co/collections/JetBrains-Research/long-code-arena
"""

from __future__ import annotations

import json
from typing import Any

from adapters import DatasetAdapter

# Injected at runtime by _set_helpers()
_highlight_code = None
_code_offset = None
_extract_test_classes = None

# ---------------------------------------------------------------------------
# Shared helpers
# ---------------------------------------------------------------------------

_CODE_TRIM_LIMIT = 50_000  # chars for code / diff fields
_DESC_TRIM_LIMIT = 5_000  # chars for description / log fields


def _trim(text: str, limit: int, label: str = "Content") -> str:
    """Return *text* unchanged if short enough, otherwise trim with an explicit marker."""
    if len(text) <= limit:
        return text
    return (
        text[:limit]
        + f"\n\n--- {label} trimmed: showing {limit:,} of {len(text):,} characters ---"
    )


_LOG_HEAD_LIMIT = 10_000  # chars budget for head part of CI log
_LOG_TAIL_LIMIT = 10_000  # chars budget for tail part of CI log


def _trim_head_tail(text: str, label: str = "Content") -> str:
    """Show first ~10k chars and last ~10k chars (snapped to line boundaries)."""
    if len(text) <= _LOG_HEAD_LIMIT + _LOG_TAIL_LIMIT:
        return text

    # Head: find the last newline within the budget
    head_end = text.rfind("\n", 0, _LOG_HEAD_LIMIT)
    if head_end <= 0:
        head_end = _LOG_HEAD_LIMIT
    head = text[:head_end]

    # Tail: find the first newline after the cut point
    tail_start = text.find("\n", len(text) - _LOG_TAIL_LIMIT)
    if tail_start < 0 or tail_start >= len(text):
        tail_start = len(text) - _LOG_TAIL_LIMIT
    tail = text[tail_start:]

    total_lines = text.count("\n") + 1
    head_lines = head.count("\n") + 1
    tail_lines = tail.count("\n") + 1
    omitted = total_lines - head_lines - tail_lines

    return (
        head
        + f"\n\n--- {label} trimmed: showing first {head_lines:,} and last"
        f" {tail_lines:,} lines ({omitted:,} lines omitted,"
        f" {len(text):,} chars total) ---\n\n"
        + tail
    )


def _lca_repo_url(repo_slug: str) -> str:
    """Convert an LCA-style repo slug to a GitHub URL.

    LCA datasets use either ``owner__name`` (double underscore) or
    ``owner/name`` (slash) depending on the task.
    """
    if not repo_slug:
        return ""
    # Normalise double-underscore to slash
    ghname = repo_slug.replace("__", "/", 1) if "__" in repo_slug else repo_slug
    return f"https://github.com/{ghname}"


# ---------------------------------------------------------------------------
# LCA Library-Based Code Generation
# (HuggingFace: JetBrains-Research/lca-library-based-code-generation)
# ---------------------------------------------------------------------------


class LCALibCodeGenAdapter(DatasetAdapter):
    slug = "lca-libcodegen"
    display_name = "LCA Library-Based Code Gen"
    has_ground_truth = False
    has_tasks = False

    def __init__(self, hf_dataset):
        self._ds = hf_dataset

    def problem_count(self) -> int:
        return len(self._ds)

    def get_problem_summary(self, idx: int) -> dict[str, Any]:
        row = self._ds[idx]
        return {
            "idx": idx,
            "task_id": row.get("repo_full_name", str(idx)),
            "entry_point": row.get("repo_name", f"lca_libgen_{idx}"),
            "num_inputs": row.get("n_unique_apis", 0),
            "source": row.get("repo_owner", "LCA"),
        }

    def get_problem_detail(self, idx: int) -> dict[str, Any]:
        row = self._ds[idx]
        reference = row.get("clean_reference", row.get("reference", ""))
        unique_apis = list(row.get("unique_apis", []))
        repo_slug = row.get("repo_full_name", "")
        return {
            "idx": idx,
            "task_id": repo_slug or str(idx),
            "entry_point": row.get("repo_name", f"lca_libgen_{idx}"),
            "code": reference,
            "highlighted_code": _highlight_code(reference),
            "inputs": [],
            "outputs": [],
            "test": None,
            "tasks": [],
            "source": row.get("repo_owner", "LCA"),
            "has_ground_truth": False,
            "has_tasks": False,
            "description": row.get("instruction", ""),
            "unique_apis": unique_apis,
            "n_unique_apis": row.get("n_unique_apis", 0),
            "repo_url": _lca_repo_url(repo_slug),
        }


# ---------------------------------------------------------------------------
# LCA Project-Level Code Completion
# (HuggingFace: JetBrains-Research/lca-project-level-code-completion)
# ---------------------------------------------------------------------------


class LCACodeCompletionAdapter(DatasetAdapter):
    slug = "lca-codecompletion"
    display_name = "LCA Project-Level Completion"
    has_ground_truth = False
    has_tasks = False

    def __init__(self, rows: list[dict[str, Any]]):
        self._rows = rows

    def problem_count(self) -> int:
        return len(self._rows)

    def get_problem_summary(self, idx: int) -> dict[str, Any]:
        row = self._rows[idx]
        completion_file = row.get("completion_file", {})
        filename = completion_file.get("filename", "") if isinstance(completion_file, dict) else ""
        return {
            "idx": idx,
            "task_id": row.get("repo", str(idx)),
            "entry_point": filename.rsplit("/", 1)[-1] if filename else f"completion_{idx}",
            "num_inputs": 0,
            "source": row.get("_context_size", "LCA"),
        }

    def get_problem_detail(self, idx: int) -> dict[str, Any]:
        row = self._rows[idx]
        completion_file = row.get("completion_file", {})
        if isinstance(completion_file, dict):
            filename = completion_file.get("filename", "")
            content = completion_file.get("content", "")
        else:
            filename = ""
            content = ""

        completion_lines = row.get("completion_lines", {})
        if isinstance(completion_lines, dict):
            committed = completion_lines.get("committed", [])
        else:
            committed = []

        lang = "python"
        if filename:
            ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
            ext_map = {
                "py": "python",
                "java": "java",
                "kt": "kotlin",
                "js": "javascript",
                "ts": "typescript",
                "cpp": "cpp",
                "c": "c",
                "go": "go",
                "rs": "rust",
                "rb": "ruby",
            }
            lang = ext_map.get(ext, "python")

        repo_slug = row.get("repo", "")
        commit_hash = row.get("commit_hash", "")
        repo_url = _lca_repo_url(repo_slug)
        commit_url = f"{repo_url}/commit/{commit_hash}" if repo_url and commit_hash else ""

        return {
            "idx": idx,
            "task_id": repo_slug or str(idx),
            "entry_point": filename.rsplit("/", 1)[-1] if filename else f"completion_{idx}",
            "code": content,
            "highlighted_code": _highlight_code(content, language=lang) if content else "",
            "inputs": [],
            "outputs": [],
            "test": None,
            "tasks": [],
            "source": row.get("_context_size", "LCA"),
            "has_ground_truth": False,
            "has_tasks": False,
            "description": f"File: {filename}\nCommit: {commit_hash[:12]}",
            "completion_lines_committed": committed,
            "language": lang,
            "repo_url": repo_url,
            "commit_url": commit_url,
        }


# ---------------------------------------------------------------------------
# LCA Bug Localization
# (HuggingFace: JetBrains-Research/lca-bug-localization)
# ---------------------------------------------------------------------------


class LCABugLocalizationAdapter(DatasetAdapter):
    slug = "lca-buglocalization"
    display_name = "LCA Bug Localization"
    has_ground_truth = False
    has_tasks = False

    def __init__(self, hf_dataset):
        self._ds = hf_dataset

    def problem_count(self) -> int:
        return len(self._ds)

    def get_problem_summary(self, idx: int) -> dict[str, Any]:
        row = self._ds[idx]
        return {
            "idx": idx,
            "task_id": row.get("text_id", str(idx)),
            "entry_point": f"{row.get('repo_owner', '')}/{row.get('repo_name', '')}",
            "num_inputs": row.get("changed_files_count", 0),
            "source": row.get("repo_language", "unknown"),
        }

    def get_problem_detail(self, idx: int) -> dict[str, Any]:
        row = self._ds[idx]
        diff = row.get("diff", "")
        repo_owner = row.get("repo_owner", "")
        repo_name = row.get("repo_name", "")
        repo = f"{repo_owner}/{repo_name}" if repo_owner and repo_name else ""
        issue_url = row.get("issue_url", "")
        pull_url = row.get("pull_url", "")

        return {
            "idx": idx,
            "task_id": row.get("text_id", str(idx)),
            "entry_point": repo or f"bug_{idx}",
            "code": diff,
            "highlighted_code": "",
            "inputs": [],
            "outputs": [],
            "test": None,
            "tasks": [],
            "source": row.get("repo_language", "unknown"),
            "has_ground_truth": False,
            "has_tasks": False,
            "description": row.get("issue_title", "")
            + ("\n\n" + row.get("issue_body", "") if row.get("issue_body") else ""),
            "patch": diff,
            "repo": repo,
            "repo_url": f"https://github.com/{repo}" if repo else "",
            "issue_url": issue_url,
            "commit_url": pull_url,
        }


# ---------------------------------------------------------------------------
# LCA Commit Message Generation
# (HuggingFace: JetBrains-Research/lca-commit-message-generation)
# ---------------------------------------------------------------------------


class LCACommitMsgGenAdapter(DatasetAdapter):
    slug = "lca-commitmsg"
    display_name = "LCA Commit Message Gen"
    has_ground_truth = False
    has_tasks = False

    def __init__(self, hf_dataset):
        self._ds = hf_dataset

    def problem_count(self) -> int:
        return len(self._ds)

    def get_problem_summary(self, idx: int) -> dict[str, Any]:
        row = self._ds[idx]
        mods = row.get("mods", [])
        n_files = len(mods) if isinstance(mods, list) else 0
        return {
            "idx": idx,
            "task_id": row.get("hash", str(idx))[:12],
            "entry_point": row.get("repo", f"commit_{idx}"),
            "num_inputs": n_files,
            "source": row.get("license", "LCA")[:20],
        }

    def get_problem_detail(self, idx: int) -> dict[str, Any]:
        row = self._ds[idx]
        message = row.get("message", "")
        mods = row.get("mods", [])

        # Build a unified diff from all modifications
        diff_parts = []
        if isinstance(mods, list):
            for mod in mods:
                if isinstance(mod, dict):
                    old_path = mod.get("old_path", "")
                    new_path = mod.get("new_path", "")
                    mod_diff = mod.get("diff", "")
                    if mod_diff:
                        diff_parts.append(
                            f"diff --git a/{old_path} b/{new_path}\n"
                            f"--- a/{old_path}\n"
                            f"+++ b/{new_path}\n"
                            f"{mod_diff}"
                        )
        combined_diff = "\n".join(diff_parts)
        trimmed_diff = _trim(combined_diff, _CODE_TRIM_LIMIT, "Diff")

        repo_slug = row.get("repo", "")
        commit_hash = row.get("hash", "")
        repo_url = _lca_repo_url(repo_slug)
        commit_url = f"{repo_url}/commit/{commit_hash}" if repo_url and commit_hash else ""

        return {
            "idx": idx,
            "task_id": (commit_hash or str(idx))[:12],
            "entry_point": repo_slug or f"commit_{idx}",
            "code": trimmed_diff,
            "highlighted_code": "",
            "inputs": [],
            "outputs": [],
            "test": None,
            "tasks": [],
            "source": row.get("license", "LCA")[:20],
            "has_ground_truth": False,
            "has_tasks": False,
            "description": message,
            "patch": trimmed_diff,
            "repo": repo_slug,
            "repo_url": repo_url,
            "commit_url": commit_url,
            "commit_hash": commit_hash,
        }


# ---------------------------------------------------------------------------
# LCA CI Builds Repair
# (HuggingFace: JetBrains-Research/lca-ci-builds-repair)
# ---------------------------------------------------------------------------


class LCACIRepairAdapter(DatasetAdapter):
    slug = "lca-cirepair"
    display_name = "LCA CI Builds Repair"
    has_ground_truth = False
    has_tasks = False

    def __init__(self, hf_dataset):
        self._ds = hf_dataset

    def problem_count(self) -> int:
        return len(self._ds)

    def get_problem_summary(self, idx: int) -> dict[str, Any]:
        row = self._ds[idx]
        repo = f"{row.get('repo_owner', '')}/{row.get('repo_name', '')}"
        return {
            "idx": idx,
            "task_id": str(row.get("id", idx)),
            "entry_point": repo,
            "num_inputs": 0,
            "source": f"difficulty-{row.get('difficulty', '?')}",
        }

    def get_problem_detail(self, idx: int) -> dict[str, Any]:
        row = self._ds[idx]
        diff = row.get("diff", "")
        trimmed_diff = _trim(diff, _CODE_TRIM_LIMIT, "Diff")
        repo_owner = row.get("repo_owner", "")
        repo_name = row.get("repo_name", "")
        repo = f"{repo_owner}/{repo_name}" if repo_owner and repo_name else ""
        commit_link = row.get("commit_link", "")

        # Extract log text — can be several MB; trim explicitly
        logs = row.get("logs", [])
        log_text = ""
        if isinstance(logs, list):
            for entry in logs:
                if isinstance(entry, dict):
                    step = entry.get("step_name", "")
                    log = entry.get("log", "")
                    log_text += f"=== {step} ===\n{log}\n\n"
        trimmed_log = _trim_head_tail(log_text, "CI log")

        return {
            "idx": idx,
            "task_id": str(row.get("id", idx)),
            "entry_point": repo or f"ci_{idx}",
            "code": trimmed_diff,
            "highlighted_code": "",
            "inputs": [],
            "outputs": [],
            "test": None,
            "tasks": [],
            "source": f"difficulty-{row.get('difficulty', '?')}",
            "has_ground_truth": False,
            "has_tasks": False,
            "description": f"Workflow: {row.get('workflow_name', '')}\n"
            f"Branch: {row.get('head_branch', '')}\n"
            f"Contributor: {row.get('contributor', '')}\n\n"
            f"CI Log:\n{trimmed_log}",
            "patch": trimmed_diff,
            "repo": repo,
            "repo_url": f"https://github.com/{repo}" if repo else "",
            "commit_url": commit_link,
        }


# ---------------------------------------------------------------------------
# LCA Module Summarization
# (HuggingFace: JetBrains-Research/lca-module-summarization)
# ---------------------------------------------------------------------------


class LCAModuleSummarizationAdapter(DatasetAdapter):
    slug = "lca-modulesumm"
    display_name = "LCA Module Summarization"
    has_ground_truth = False
    has_tasks = False

    def __init__(self, hf_dataset):
        self._ds = hf_dataset

    def problem_count(self) -> int:
        return len(self._ds)

    def get_problem_summary(self, idx: int) -> dict[str, Any]:
        row = self._ds[idx]
        return {
            "idx": idx,
            "task_id": row.get("docfile_name", str(idx)),
            "entry_point": row.get("repo", f"module_{idx}"),
            "num_inputs": 0,
            "source": row.get("doc_type", "LCA"),
        }

    def get_problem_detail(self, idx: int) -> dict[str, Any]:
        row = self._ds[idx]
        target_text = row.get("target_text", "")
        # Code context can be extremely large (up to 23 MB); trim with explicit marker
        code_context = row.get("relevant_code_context", "")
        trimmed_code = _trim(code_context, _CODE_TRIM_LIMIT, "Code context")

        relevant_files = row.get("relevant_code_files", [])
        if isinstance(relevant_files, str):
            try:
                relevant_files = json.loads(relevant_files)
            except (json.JSONDecodeError, TypeError):
                relevant_files = [relevant_files]

        repo_slug = row.get("repo", "")
        repo_url = _lca_repo_url(repo_slug)
        trimmed_target = _trim(target_text, _DESC_TRIM_LIMIT, "Target documentation")

        return {
            "idx": idx,
            "task_id": row.get("docfile_name", str(idx)),
            "entry_point": repo_slug or f"module_{idx}",
            "code": trimmed_code,
            "highlighted_code": _highlight_code(trimmed_code) if trimmed_code else "",
            "inputs": [],
            "outputs": [],
            "test": None,
            "tasks": [],
            "source": row.get("doc_type", "LCA"),
            "has_ground_truth": False,
            "has_tasks": False,
            "description": f"Intent: {row.get('intent', '')}\n\n"
            f"Doc file: {row.get('path_to_docfile', '')}\n"
            f"Relevant files: {', '.join(relevant_files) if isinstance(relevant_files, list) else ''}\n\n"
            f"Target documentation:\n{trimmed_target}",
            "repo_url": repo_url,
        }
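The head+tail trimming that `_trim_head_tail` applies to CI logs (keep the start and end of the log, snapped to line boundaries, with an explicit marker for what was cut) can be sketched standalone like this; the limits are shrunk from 10,000 to 10 characters here purely to make the demo small:

```python
# Standalone sketch of head+tail log trimming, mirroring _trim_head_tail.
# Tiny limits (10 chars instead of 10,000) so the demo fits on screen.
HEAD_LIMIT = 10
TAIL_LIMIT = 10


def trim_head_tail(text: str) -> str:
    if len(text) <= HEAD_LIMIT + TAIL_LIMIT:
        return text
    # Snap the head cut to the last newline inside the budget
    head_end = text.rfind("\n", 0, HEAD_LIMIT)
    if head_end <= 0:
        head_end = HEAD_LIMIT
    # Snap the tail cut to the first newline after the budget boundary
    tail_start = text.find("\n", len(text) - TAIL_LIMIT)
    if tail_start < 0:
        tail_start = len(text) - TAIL_LIMIT
    head, tail = text[:head_end], text[tail_start:]
    omitted = (text.count("\n") + 1) - (head.count("\n") + 1) - (tail.count("\n") + 1)
    return head + f"\n--- {omitted} lines omitted ---\n" + tail


log = "\n".join(f"line {i}" for i in range(200))
print(trim_head_tail(log))  # line 0, an omission marker, then line 199
```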

@@ -7,6 +7,15 @@ import random
 from typing import Any
 
 from adapters import REGISTRY
+from adapters.additional import (
+    CrossCodeEvalAdapter,
+    Defects4JAdapter,
+    DPAIAEEDatasetAdapter,
+    McEvalAdapter,
+    MultiPLEAdapter,
+    MultiSWEBenchAdapter,
+    SWEBenchMultilingualAdapter,
+)
 from adapters.code_editing import (
     CanItEditAdapter,
     CodeEditorBenchAdapter,
@@ -38,6 +47,14 @@ from adapters.code_reasoning import (
     HumanEvalXAdapter,
     SAFIMAdapter,
 )
+from adapters.long_code_arena import (
+    LCABugLocalizationAdapter,
+    LCACIRepairAdapter,
+    LCACodeCompletionAdapter,
+    LCACommitMsgGenAdapter,
+    LCALibCodeGenAdapter,
+    LCAModuleSummarizationAdapter,
+)
 from adapters.vulnerability import (
     BigVulAdapter,
     DevignAdapter,
@@ -408,3 +425,216 @@ def register_hf_datasets() -> None:
         print(f"Loaded EffiBench: {len(effibench)} problems")
     except Exception as e:
         print(f"Warning: could not load EffiBench: {e}")
+
+    # --- Long Code Arena datasets (6 tasks) ---
+
+    try:
+        lca_libgen = load_dataset(
+            "JetBrains-Research/lca-library-based-code-generation", split="test"
+        )
+        REGISTRY["lca-libcodegen"] = LCALibCodeGenAdapter(lca_libgen)
+        print(f"Loaded LCA Library-Based Code Gen: {len(lca_libgen)} problems")
+    except Exception as e:
+        print(f"Warning: could not load LCA Library-Based Code Gen: {e}")
+
+    try:
+        # Load small and medium context sizes only (large/huge are multi-GB)
+        lca_cc_rows: list[dict[str, Any]] = []
+        for ctx in ("small_context", "medium_context"):
+            try:
+                ds = load_dataset(
+                    "JetBrains-Research/lca-project-level-code-completion", ctx, split="test"
+                )
+                for i in range(len(ds)):
+                    row = dict(ds[i])
+                    row["_context_size"] = ctx
+                    lca_cc_rows.append(row)
+            except Exception:
+                pass
+        if lca_cc_rows:
+            lca_cc_sampled = _sample_list(lca_cc_rows)
+            adapter = LCACodeCompletionAdapter(lca_cc_sampled)
+            adapter.total_count = len(lca_cc_rows)
+            REGISTRY["lca-codecompletion"] = adapter
+            print(
+                f"Loaded LCA Project-Level Completion: "
+                f"{len(lca_cc_sampled)} problems (of {len(lca_cc_rows)})"
+            )
+        else:
+            print("Warning: could not load any LCA Code Completion context sizes")
+    except Exception as e:
+        print(f"Warning: could not load LCA Project-Level Completion: {e}")
+
+    try:
+        # Merge all language subsets (py, java, kt) using test split
+        lca_bl_all = None
+        for lang_subset in ("py", "java", "kt"):
+            try:
+                ds = load_dataset(
+                    "JetBrains-Research/lca-bug-localization", lang_subset, split="test"
+                )
+                if lca_bl_all is None:
+                    from datasets import concatenate_datasets
+
+                    lca_bl_all = ds
+                else:
+                    lca_bl_all = concatenate_datasets([lca_bl_all, ds])
+            except Exception:
+                pass
+        if lca_bl_all is not None and len(lca_bl_all) > 0:
+            lca_bl = _sample_hf_dataset(lca_bl_all)
+            adapter = LCABugLocalizationAdapter(lca_bl)
+            adapter.total_count = len(lca_bl_all)
+            REGISTRY["lca-buglocalization"] = adapter
+            print(f"Loaded LCA Bug Localization: {len(lca_bl)} problems (of {len(lca_bl_all)})")
+        else:
+            print("Warning: could not load any LCA Bug Localization language subsets")
+    except Exception as e:
+        print(f"Warning: could not load LCA Bug Localization: {e}")
+
+    try:
+        lca_cmg = load_dataset("JetBrains-Research/lca-commit-message-generation", split="test")
+        REGISTRY["lca-commitmsg"] = LCACommitMsgGenAdapter(lca_cmg)
+        print(f"Loaded LCA Commit Message Gen: {len(lca_cmg)} problems")
+    except Exception as e:
+        print(f"Warning: could not load LCA Commit Message Gen: {e}")
+
+    try:
+        lca_ci = load_dataset("JetBrains-Research/lca-ci-builds-repair", split="test")
+        REGISTRY["lca-cirepair"] = LCACIRepairAdapter(lca_ci)
+        print(f"Loaded LCA CI Builds Repair: {len(lca_ci)} problems")
+    except Exception as e:
+        print(f"Warning: could not load LCA CI Builds Repair: {e}")
+
+    try:
+        lca_ms = load_dataset("JetBrains-Research/lca-module-summarization", split="test")
+        REGISTRY["lca-modulesumm"] = LCAModuleSummarizationAdapter(lca_ms)
+        print(f"Loaded LCA Module Summarization: {len(lca_ms)} problems")
+    except Exception as e:
+        print(f"Warning: could not load LCA Module Summarization: {e}")
+
+    # --- DPAIA Enterprise Evaluation Dataset ---
+
+    try:
+        import urllib.request
+
+        url = "https://raw.githubusercontent.com/dpaia/ee-dataset/main/datasets/java-spring-ee-dataset.json"
+        with urllib.request.urlopen(url) as resp:
+            dpaia_rows = json.loads(resp.read().decode("utf-8"))
+        if dpaia_rows:
+            REGISTRY["dpaia-ee"] = DPAIAEEDatasetAdapter(dpaia_rows)
+            print(f"Loaded DPAIA EE-Dataset: {len(dpaia_rows)} problems")
+    except Exception as e:
+        print(f"Warning: could not load DPAIA EE-Dataset: {e}")
|
| 529 |
+
|
| 530 |
+
# --- Multi-SWE-bench (ByteDance, multilingual issue resolving) ---
|
| 531 |
+
# Dataset has 40 per-repo JSONL files with inconsistent schemas; load directly.
|
| 532 |
+
|
| 533 |
+
try:
|
| 534 |
+
from huggingface_hub import list_repo_files
|
| 535 |
+
|
| 536 |
+
mswe_files = list_repo_files("ByteDance-Seed/Multi-SWE-bench", repo_type="dataset")
|
| 537 |
+
mswe_jsonl = [f for f in mswe_files if f.endswith(".jsonl")]
|
| 538 |
+
mswe_rows: list[dict[str, Any]] = []
|
| 539 |
+
for fname in mswe_jsonl:
|
| 540 |
+
lang_dir = fname.split("/")[0] if "/" in fname else ""
|
| 541 |
+
try:
|
| 542 |
+
rows = _load_jsonl_dataset("ByteDance-Seed/Multi-SWE-bench", [fname])
|
| 543 |
+
for d in rows:
|
| 544 |
+
d["_language"] = lang_dir
|
| 545 |
+
mswe_rows.extend(rows)
|
| 546 |
+
except Exception:
|
| 547 |
+
pass
|
| 548 |
+
if mswe_rows:
|
| 549 |
+
mswe_sampled = _sample_list(mswe_rows)
|
| 550 |
+
adapter = MultiSWEBenchAdapter(mswe_sampled)
|
| 551 |
+
adapter.total_count = len(mswe_rows)
|
| 552 |
+
REGISTRY["multiswebench"] = adapter
|
| 553 |
+
print(f"Loaded Multi-SWE-bench: {len(mswe_sampled)} problems (of {len(mswe_rows)})")
|
| 554 |
+
else:
|
| 555 |
+
print("Warning: could not load any Multi-SWE-bench JSONL files")
|
| 556 |
+
except Exception as e:
|
| 557 |
+
print(f"Warning: could not load Multi-SWE-bench: {e}")
|
| 558 |
+
|
| 559 |
+
# --- SWE-bench Multilingual ---
|
| 560 |
+
|
| 561 |
+
try:
|
| 562 |
+
swe_ml = load_dataset("SWE-bench/SWE-bench_Multilingual", split="test")
|
| 563 |
+
REGISTRY["swebenchmultilingual"] = SWEBenchMultilingualAdapter(swe_ml)
|
| 564 |
+
print(f"Loaded SWE-bench Multilingual: {len(swe_ml)} problems")
|
| 565 |
+
except Exception as e:
|
| 566 |
+
print(f"Warning: could not load SWE-bench Multilingual: {e}")
|
| 567 |
+
|
| 568 |
+
# --- CrossCodeEval (cross-file code completion, 4 languages) ---
|
| 569 |
+
# Dataset has inconsistent columns across files; load only base line_completion.jsonl per lang
|
| 570 |
+
|
| 571 |
+
try:
|
| 572 |
+
cceval_rows: list[dict[str, Any]] = []
|
| 573 |
+
for lang in ("python", "java", "typescript", "csharp"):
|
| 574 |
+
try:
|
| 575 |
+
rows = _load_jsonl_dataset(
|
| 576 |
+
"Vincentvmt/CrossCodeEval",
|
| 577 |
+
[f"crosscodeeval_data/{lang}/line_completion.jsonl"],
|
| 578 |
+
)
|
| 579 |
+
for d in rows:
|
| 580 |
+
d["language"] = lang
|
| 581 |
+
cceval_rows.extend(rows)
|
| 582 |
+
except Exception:
|
| 583 |
+
pass
|
| 584 |
+
if cceval_rows:
|
| 585 |
+
cceval_sampled = _sample_list(cceval_rows)
|
| 586 |
+
adapter = CrossCodeEvalAdapter(cceval_sampled)
|
| 587 |
+
adapter.total_count = len(cceval_rows)
|
| 588 |
+
REGISTRY["crosscodeeval"] = adapter
|
| 589 |
+
print(f"Loaded CrossCodeEval: {len(cceval_sampled)} problems (of {len(cceval_rows)})")
|
| 590 |
+
else:
|
| 591 |
+
print("Warning: could not load any CrossCodeEval language subsets")
|
| 592 |
+
except Exception as e:
|
| 593 |
+
print(f"Warning: could not load CrossCodeEval: {e}")
|
| 594 |
+
|
| 595 |
+
# --- McEval (massively multilingual code evaluation, 40 languages) ---
|
| 596 |
+
|
| 597 |
+
try:
|
| 598 |
+
mceval_full = load_dataset("Multilingual-Multimodal-NLP/McEval", "generation", split="test")
|
| 599 |
+
mceval = _sample_hf_dataset(mceval_full)
|
| 600 |
+
adapter = McEvalAdapter(mceval)
|
| 601 |
+
adapter.total_count = len(mceval_full)
|
| 602 |
+
REGISTRY["mceval"] = adapter
|
| 603 |
+
print(f"Loaded McEval: {len(mceval)} problems (of {len(mceval_full)})")
|
| 604 |
+
except Exception as e:
|
| 605 |
+
print(f"Warning: could not load McEval: {e}")
|
| 606 |
+
|
| 607 |
+
# --- MultiPL-E (multilingual HumanEval/MBPP, 22 languages) ---
|
| 608 |
+
|
| 609 |
+
try:
|
| 610 |
+
mple_datasets = {}
|
| 611 |
+
for lang_ext in MultiPLEAdapter.LANGUAGES:
|
| 612 |
+
try:
|
| 613 |
+
mple_datasets[lang_ext] = load_dataset(
|
| 614 |
+
"nuprl/MultiPL-E", f"humaneval-{lang_ext}", split="test"
|
| 615 |
+
)
|
| 616 |
+
except Exception:
|
| 617 |
+
pass
|
| 618 |
+
if mple_datasets:
|
| 619 |
+
REGISTRY["multiple"] = MultiPLEAdapter(mple_datasets)
|
| 620 |
+
first = next(iter(mple_datasets))
|
| 621 |
+
print(
|
| 622 |
+
f"Loaded MultiPL-E: {len(mple_datasets)} languages, "
|
| 623 |
+
f"{len(mple_datasets[first])} problems each"
|
| 624 |
+
)
|
| 625 |
+
else:
|
| 626 |
+
print("Warning: could not load any MultiPL-E language subsets")
|
| 627 |
+
except Exception as e:
|
| 628 |
+
print(f"Warning: could not load MultiPL-E: {e}")
|
| 629 |
+
|
| 630 |
+
# --- Defects4J (Java bug-fix benchmark) ---
|
| 631 |
+
|
| 632 |
+
try:
|
| 633 |
+
d4j = load_dataset("rufimelo/defects4j", split="train")
|
| 634 |
+
d4j_sampled = _sample_hf_dataset(d4j)
|
| 635 |
+
adapter = Defects4JAdapter(d4j_sampled)
|
| 636 |
+
adapter.total_count = len(d4j)
|
| 637 |
+
REGISTRY["defects4j"] = adapter
|
| 638 |
+
print(f"Loaded Defects4J: {len(d4j_sampled)} problems (of {len(d4j)})")
|
| 639 |
+
except Exception as e:
|
| 640 |
+
print(f"Warning: could not load Defects4J: {e}")
|