egor-bogomolov committed on
Commit 9f85fac
1 Parent(s): 3c743c5

Add 13 new benchmark datasets (batches 6-8)


Long Code Arena (6 tasks):
- Library-Based Code Gen, Project-Level Completion,
Bug Localization, Commit Message Gen,
CI Builds Repair, Module Summarization
Additional datasets:
- DPAIA EE-Dataset, Multi-SWE-bench, SWE-bench Multilingual,
CrossCodeEval, Defects4J, McEval, MultiPL-E
Key implementation details:
- CrossCodeEval: load per-language JSONL directly (inconsistent HF columns)
- Multi-SWE-bench: load individual JSONL files (40 files across languages)
- Defects4J: use train split, map bug_id/func_before/func_after fields
- RepoBench dropped (no data files, deprecated loading script)
- LCA CI logs: head+tail trimming (first/last 10k chars by line)
- LCA large fields: explicit trim markers showing original vs trimmed size
- GitHub repo/commit links for all LCA tasks where data is available
- Total datasets: 41 (up from 28)
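The per-language JSONL loading noted above (CrossCodeEval, Multi-SWE-bench) can be sketched roughly as follows; the helper name and file layout are hypothetical illustrations, not the commit's actual loader:

```python
import json
import tempfile
from pathlib import Path


def load_jsonl_files(paths):
    """Load rows from several JSONL files into one list.

    Columns may differ between files (as with CrossCodeEval), so rows
    are kept as plain dicts instead of being forced into one schema.
    """
    rows = []
    for path in paths:
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if line:
                    rows.append(json.loads(line))
    return rows


# Demo with two files whose rows have different columns (hypothetical data)
tmp = Path(tempfile.mkdtemp())
(tmp / "python.jsonl").write_text('{"prompt": "a", "language": "python"}\n')
(tmp / "java.jsonl").write_text('{"prompt": "b", "metadata": {"task_id": "j/1"}}\n')
rows = load_jsonl_files(sorted(tmp.glob("*.jsonl")))
```

Keeping rows as plain dicts is what lets a single adapter tolerate the inconsistent columns across language files.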

PROGRESS.md CHANGED
@@ -87,13 +87,56 @@ CoderEval, NaturalCodeBench, DevEval, RunBugRun, Defects4J, ConDefects, FixEval,
 8. **Fill-in-the-Middle view** -- prefix + [HOLE] + suffix (SAFIM)
 9. **Vulnerability view** -- vulnerable/patched code + CWE labels (BigVul, DiverseVul, PrimeVul, Devign)
 
-## Total Datasets: 28
+### Batch 6 — Long Code Arena (6 project-level tasks)
+| Benchmark | Slug | Status | HF Dataset | View Type |
+|-----------|------|--------|------------|-----------|
+| LCA Library-Based Code Gen | `lca-libcodegen` | Done | `JetBrains-Research/lca-library-based-code-generation` | Simple |
+| LCA Project-Level Completion | `lca-codecompletion` | Done | `JetBrains-Research/lca-project-level-code-completion` | Simple |
+| LCA Bug Localization | `lca-buglocalization` | Done | `JetBrains-Research/lca-bug-localization` | Diff |
+| LCA Commit Message Gen | `lca-commitmsg` | Done | `JetBrains-Research/lca-commit-message-generation` | Diff |
+| LCA CI Builds Repair | `lca-cirepair` | Done | `JetBrains-Research/lca-ci-builds-repair` | Diff |
+| LCA Module Summarization | `lca-modulesumm` | Done | `JetBrains-Research/lca-module-summarization` | Simple |
+
+**New adapter module:** `adapters/long_code_arena.py` — all 6 Long Code Arena project-level tasks.
+
+### Batch 7 — dpaia & Additional Benchmarks (7 datasets)
+| Benchmark | Slug | Status | Source | View Type |
+|-----------|------|--------|--------|-----------|
+| DPAIA EE-Dataset | `dpaia-ee` | Done | `github.com/dpaia/ee-dataset` (JSON) | Diff (SWE-bench style) |
+| Multi-SWE-bench | `multiswebench` | Done | `ByteDance-Seed/Multi-SWE-bench` (JSONL) | Diff |
+| SWE-bench Multilingual | `swebenchmultilingual` | Done | `SWE-bench/SWE-bench_Multilingual` | Diff |
+| CrossCodeEval | `crosscodeeval` | Done | `Vincentvmt/CrossCodeEval` (JSONL) | Fill-in-the-Middle |
+| McEval | `mceval` | Done | `Multilingual-Multimodal-NLP/McEval` | Simple |
+| MultiPL-E | `multiple` | Done | `nuprl/MultiPL-E` | Multi-language |
+| Defects4J | `defects4j` | Done | `rufimelo/defects4j` | Before/After |
+
+### Dropped from Batch 7
+| Benchmark | Reason |
+|-----------|--------|
+| RepoBench | HF repo has only a deprecated loading script (`repobench-p.py`), no actual data files |
+
+**New adapter module:** `adapters/additional.py` — dpaia EE-Dataset, Multi-SWE-bench, SWE-bench Multilingual, CrossCodeEval, McEval, MultiPL-E, Defects4J.
+
+**Sources:**
+- Long Code Arena: https://huggingface.co/collections/JetBrains-Research/long-code-arena (OpenReview: aQoUjxlgNE)
+- DPAIA EE-Dataset: https://github.com/dpaia/ee-dataset (Java/Spring SWE-bench-style)
+- Multi-SWE-bench: ByteDance multilingual SWE-bench (7 languages, 1632 problems across 40 repos)
+- SWE-bench Multilingual: Official SWE-bench multilingual extension (42 repos)
+- CrossCodeEval: Cross-file code completion (4 languages, Amazon, 9928 problems)
+- McEval: Massively multilingual code evaluation (40 languages)
+- MultiPL-E: Multi-language HumanEval/MBPP translation (9 languages loaded)
+- Defects4J: Classic Java bug-fix benchmark (467 bugs)
+- Arxiv survey reference: https://arxiv.org/abs/2505.08903
+
+## Total Datasets: 41
 Base (4): REval, CRUXEval, HumanEval+, BigOBench
 Batch 1 (5): MBPP+, ClassEval, LiveCodeBench, DebugBench, HumanEval-X
 Batch 2 (5): SWE-bench Lite, CodeContests, APPS, CanItEdit, MBPP
 Batch 3 (5): SAFIM, BigVul, DiverseVul, PrimeVul, CodeEditorBench
 Batch 4 (3): SWE-bench Verified, CodeSearchNet, Devign
 Batch 5 (6): BigCodeBench, HumanEvalPack, CodeXGLUE Refinement, SWE-bench, CommitBench, EffiBench
+Batch 6 (6): LCA Library-Based Code Gen, LCA Project-Level Completion, LCA Bug Localization, LCA Commit Message Gen, LCA CI Builds Repair, LCA Module Summarization
+Batch 7 (7): DPAIA EE-Dataset, Multi-SWE-bench, SWE-bench Multilingual, CrossCodeEval, McEval, MultiPL-E, Defects4J
 
 ## Changelog
 
@@ -112,3 +155,10 @@ Batch 5 (6): BigCodeBench, HumanEvalPack, CodeXGLUE Refinement, SWE-bench, Commi
 - 2026-03-04: Enhanced SWE-bench diff view (full file with diff chunks)
 - 2026-03-04: Batch 5 complete (BigCodeBench, HumanEvalPack, CodeXGLUE Refinement, SWE-bench, CommitBench, EffiBench)
 - 2026-03-04: All 28 datasets verified loading successfully
+- 2026-03-04: Batch 6 complete (Long Code Arena — 6 project-level tasks)
+- 2026-03-04: Batch 7 complete (dpaia EE-Dataset, Multi-SWE-bench, SWE-bench Multilingual, CrossCodeEval, McEval, MultiPL-E, Defects4J)
+- 2026-03-04: Dropped RepoBench (HF repo has only deprecated loading script, no data files)
+- 2026-03-04: Fixed Multi-SWE-bench (load per-repo JSONL files directly instead of `load_dataset`)
+- 2026-03-04: Fixed CrossCodeEval (load per-language JSONL files directly, inconsistent columns across files)
+- 2026-03-04: Fixed Defects4J (split="train" not "test", fields: bug_id/func_before/func_after)
+- 2026-03-04: All 41 datasets verified loading successfully
adapters/__init__.py CHANGED
@@ -28,9 +28,23 @@ def _set_helpers(highlight_code_fn, code_offset_fn, extract_test_classes_fn):
     _extract_test_classes = extract_test_classes_fn
 
     # Propagate to submodules so adapters can use them
-    from adapters import code_editing, code_generation, code_reasoning, vulnerability
-
-    for mod in (code_generation, code_editing, code_reasoning, vulnerability):
+    from adapters import (
+        additional,
+        code_editing,
+        code_generation,
+        code_reasoning,
+        long_code_arena,
+        vulnerability,
+    )
+
+    for mod in (
+        code_generation,
+        code_editing,
+        code_reasoning,
+        vulnerability,
+        long_code_arena,
+        additional,
+    ):
         mod._highlight_code = highlight_code_fn
         mod._code_offset = code_offset_fn
         mod._extract_test_classes = extract_test_classes_fn
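The helper-injection pattern in the hunk above can be reduced to a minimal, self-contained sketch; the stand-in modules and trivial highlighter below are hypothetical, not the real adapters package:

```python
import types


def make_adapter_module(name: str) -> types.ModuleType:
    """Create a stand-in adapter submodule with an uninjected helper slot."""
    mod = types.ModuleType(name)
    mod._highlight_code = None  # placeholder until _set_helpers-style injection
    return mod


code_generation = make_adapter_module("code_generation")
long_code_arena = make_adapter_module("long_code_arena")


def set_helpers(highlight_code_fn):
    """Propagate a shared helper into each submodule, as the diff above does."""
    for mod in (code_generation, long_code_arena):
        mod._highlight_code = highlight_code_fn


# Inject a trivial highlighter; every submodule now shares the same callable
set_helpers(lambda code, **kw: f"<pre>{code}</pre>")
```

The point of the pattern is that submodules declare `_highlight_code = None` at import time and receive the real implementation later, avoiding a circular import between the adapters and the app that owns the highlighter.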
adapters/additional.py ADDED
@@ -0,0 +1,575 @@
+"""Additional benchmark adapters (dpaia EE-Dataset, Multi-SWE-bench, SWE-bench
+Multilingual, CrossCodeEval, RepoBench, McEval, MultiPL-E, Defects4J)."""
+
+from __future__ import annotations
+
+import json
+from typing import Any
+
+from adapters import DatasetAdapter
+from adapters.code_editing import SWEBenchLiteAdapter
+
+# Injected at runtime by _set_helpers()
+_highlight_code = None
+_code_offset = None
+_extract_test_classes = None
+
+
+# ---------------------------------------------------------------------------
+# dpaia Enterprise Evaluation Dataset
+# (GitHub: dpaia/ee-dataset — SWE-bench-style format for Java/Spring)
+# ---------------------------------------------------------------------------
+
+
+class DPAIAEEDatasetAdapter(DatasetAdapter):
+    slug = "dpaia-ee"
+    display_name = "DPAIA EE-Dataset"
+    has_ground_truth = False
+    has_tasks = False
+
+    def __init__(self, rows: list[dict[str, Any]]):
+        self._rows = rows
+
+    def problem_count(self) -> int:
+        return len(self._rows)
+
+    def get_problem_summary(self, idx: int) -> dict[str, Any]:
+        row = self._rows[idx]
+        tags = row.get("tags", [])
+        tag_str = ", ".join(tags[:3]) if isinstance(tags, list) else str(tags)
+        return {
+            "idx": idx,
+            "task_id": row.get("instance_id", str(idx)),
+            "entry_point": row.get("repo", f"dpaia_{idx}"),
+            "num_inputs": 0,
+            "source": tag_str or "DPAIA",
+        }
+
+    def get_problem_detail(self, idx: int) -> dict[str, Any]:
+        row = self._rows[idx]
+        patch = row.get("patch", "")
+        test_patch = row.get("test_patch", "")
+        fail_to_pass = row.get("FAIL_TO_PASS", [])
+        if isinstance(fail_to_pass, str):
+            try:
+                fail_to_pass = json.loads(fail_to_pass)
+            except (json.JSONDecodeError, TypeError):
+                fail_to_pass = [fail_to_pass]
+        pass_to_pass = row.get("PASS_TO_PASS", [])
+        if isinstance(pass_to_pass, str):
+            try:
+                pass_to_pass = json.loads(pass_to_pass)
+            except (json.JSONDecodeError, TypeError):
+                pass_to_pass = [pass_to_pass]
+
+        instance_id = row.get("instance_id", str(idx))
+        repo = row.get("repo", "")
+
+        return {
+            "idx": idx,
+            "task_id": instance_id,
+            "entry_point": repo or f"dpaia_{idx}",
+            "code": patch,
+            "highlighted_code": "",
+            "inputs": [],
+            "outputs": [],
+            "test": None,
+            "tasks": [],
+            "source": ", ".join(row.get("tags", [])[:3])
+            if isinstance(row.get("tags"), list)
+            else "DPAIA",
+            "has_ground_truth": False,
+            "has_tasks": False,
+            "description": row.get("problem_statement", ""),
+            "patch": patch,
+            "test_patch": test_patch,
+            "fail_to_pass": fail_to_pass,
+            "pass_to_pass": pass_to_pass,
+            "repo": repo,
+            "base_commit": row.get("base_commit", ""),
+        }
+
+
+# ---------------------------------------------------------------------------
+# Multi-SWE-bench (HuggingFace: ByteDance-Seed/Multi-SWE-bench)
+# Multilingual SWE-bench spanning 7 languages
+# ---------------------------------------------------------------------------
+
+
+class MultiSWEBenchAdapter(DatasetAdapter):
+    slug = "multiswebench"
+    display_name = "Multi-SWE-bench"
+    has_ground_truth = False
+    has_tasks = False
+
+    def __init__(self, rows: list[dict[str, Any]]):
+        self._rows = rows
+
+    def problem_count(self) -> int:
+        return len(self._rows)
+
+    def get_problem_summary(self, idx: int) -> dict[str, Any]:
+        row = self._rows[idx]
+        instance_id = row.get("instance_id", str(idx))
+        org = row.get("org", "")
+        repo = row.get("repo", "")
+        full_repo = f"{org}/{repo}" if org and repo else repo
+        return {
+            "idx": idx,
+            "task_id": instance_id,
+            "entry_point": instance_id.split("__")[-1] if instance_id else f"mswe_{idx}",
+            "num_inputs": 0,
+            "source": row.get("_language", full_repo or "unknown"),
+        }
+
+    def get_problem_detail(self, idx: int) -> dict[str, Any]:
+        row = self._rows[idx]
+        patch = row.get("fix_patch", "")
+        instance_id = row.get("instance_id", str(idx))
+        org = row.get("org", "")
+        repo_name = row.get("repo", "")
+        full_repo = f"{org}/{repo_name}" if org and repo_name else repo_name
+        lang = row.get("_language", "")
+        number = row.get("number", "")
+
+        # Build description from title + body
+        title = row.get("title", "")
+        body = row.get("body", "")
+        description = title
+        if body:
+            description += "\n\n" + body
+
+        links: dict[str, str] = {}
+        if full_repo:
+            links["repo_url"] = f"https://github.com/{full_repo}"
+        if number and full_repo:
+            links["issue_url"] = f"https://github.com/{full_repo}/pull/{number}"
+
+        return {
+            "idx": idx,
+            "task_id": instance_id,
+            "entry_point": instance_id.split("__")[-1] if instance_id else f"mswe_{idx}",
+            "code": patch,
+            "highlighted_code": "",
+            "inputs": [],
+            "outputs": [],
+            "test": None,
+            "tasks": [],
+            "source": lang or full_repo,
+            "has_ground_truth": False,
+            "has_tasks": False,
+            "description": description,
+            "patch": patch,
+            "test_patch": row.get("test_patch", ""),
+            "fail_to_pass": [],
+            "pass_to_pass": [],
+            "repo": full_repo,
+            "hints": row.get("hints", ""),
+            **links,
+        }
+
+
+# ---------------------------------------------------------------------------
+# SWE-bench Multilingual (HuggingFace: SWE-bench/SWE-bench_Multilingual)
+# 300 tasks across 42 repos in multiple languages
+# ---------------------------------------------------------------------------
+
+
+class SWEBenchMultilingualAdapter(SWEBenchLiteAdapter):
+    slug = "swebenchmultilingual"
+    display_name = "SWE-bench Multilingual"
+
+
+# ---------------------------------------------------------------------------
+# CrossCodeEval (HuggingFace: Vincentvmt/CrossCodeEval or amazon-science/cceval)
+# Cross-file code completion in 4 languages
+# ---------------------------------------------------------------------------
+
+
+class CrossCodeEvalAdapter(DatasetAdapter):
+    slug = "crosscodeeval"
+    display_name = "CrossCodeEval"
+    has_ground_truth = False
+    has_tasks = False
+
+    def __init__(self, rows: list[dict[str, Any]]):
+        self._rows = rows
+
+    def problem_count(self) -> int:
+        return len(self._rows)
+
+    @staticmethod
+    def _get_metadata(row: dict, key: str, default: str = "") -> str:
+        """Extract a value from the nested metadata dict."""
+        meta = row.get("metadata", {})
+        if isinstance(meta, dict):
+            return meta.get(key, default)
+        return default
+
+    def get_problem_summary(self, idx: int) -> dict[str, Any]:
+        row = self._rows[idx]
+        task_id = self._get_metadata(row, "task_id", str(idx))
+        return {
+            "idx": idx,
+            "task_id": task_id,
+            "entry_point": task_id.rsplit("/", 1)[-1] if task_id else f"cceval_{idx}",
+            "num_inputs": 0,
+            "source": row.get("language", "unknown"),
+        }
+
+    def get_problem_detail(self, idx: int) -> dict[str, Any]:
+        row = self._rows[idx]
+        prompt = row.get("prompt", "")
+        reference = row.get("groundtruth", "")
+        right_context = row.get("right_context", "")
+        lang = row.get("language", "python")
+        lang_key = lang.lower()
+
+        task_id = self._get_metadata(row, "task_id", str(idx))
+
+        # Build a FIM-style display: prompt with hole, then merged view
+        display_code = prompt + "\n/* [HOLE] */\n" + right_context
+        merged_code = prompt + reference + right_context if reference else prompt + right_context
+
+        before_hole = prompt
+        gt_start_line = before_hole.count("\n") + 1
+        gt_line_count = reference.count("\n") + (1 if reference else 0)
+        gt_end_line = gt_start_line + gt_line_count - 1
+
+        return {
+            "idx": idx,
+            "task_id": task_id,
+            "entry_point": task_id.rsplit("/", 1)[-1] if task_id else f"cceval_{idx}",
+            "code": display_code,
+            "highlighted_code": _highlight_code(display_code, language=lang_key),
+            "inputs": [],
+            "outputs": [],
+            "test": None,
+            "tasks": [],
+            "source": lang,
+            "has_ground_truth": False,
+            "has_tasks": False,
+            "fim_prefix": prompt,
+            "fim_ground_truth": reference,
+            "fim_ground_truth_highlighted": _highlight_code(reference, language=lang_key)
+            if reference
+            else "",
+            "fim_merged_code": merged_code,
+            "fim_merged_highlighted": _highlight_code(
+                merged_code,
+                highlight_lines=list(range(gt_start_line, gt_end_line + 1)),
+                language=lang_key,
+            )
+            if merged_code
+            else "",
+            "fim_gt_start_line": gt_start_line,
+            "fim_gt_end_line": gt_end_line,
+            "language": lang,
+        }
+
+
+# ---------------------------------------------------------------------------
+# RepoBench (HuggingFace: tianyang/repobench-p)
+# Repository-level code completion across Python and Java
+# ---------------------------------------------------------------------------
+
+
+class RepoBenchAdapter(DatasetAdapter):
+    slug = "repobench"
+    display_name = "RepoBench"
+    has_ground_truth = False
+    has_tasks = False
+
+    def __init__(self, rows: list[dict[str, Any]]):
+        self._rows = rows
+
+    def problem_count(self) -> int:
+        return len(self._rows)
+
+    def get_problem_summary(self, idx: int) -> dict[str, Any]:
+        row = self._rows[idx]
+        return {
+            "idx": idx,
+            "task_id": str(row.get("repo_name", idx)),
+            "entry_point": row.get("file_path", f"repobench_{idx}").rsplit("/", 1)[-1],
+            "num_inputs": 0,
+            "source": row.get("language", row.get("_setting", "unknown")),
+        }
+
+    def get_problem_detail(self, idx: int) -> dict[str, Any]:
+        row = self._rows[idx]
+        # RepoBench has context code and a next_line to predict
+        context = row.get("all_code", row.get("context", ""))
+        next_line = row.get("next_line", row.get("gold_snippet_code", ""))
+        lang = row.get("language", "python")
+        lang_key = lang.lower()
+
+        display_code = context + "\n/* [HOLE] */\n" if context else ""
+        merged_code = context + "\n" + next_line if context and next_line else context
+
+        gt_start_line = context.count("\n") + 2 if context else 1
+        gt_line_count = next_line.count("\n") + 1 if next_line else 0
+        gt_end_line = gt_start_line + gt_line_count - 1
+
+        return {
+            "idx": idx,
+            "task_id": str(row.get("repo_name", idx)),
+            "entry_point": row.get("file_path", f"repobench_{idx}").rsplit("/", 1)[-1],
+            "code": display_code,
+            "highlighted_code": _highlight_code(display_code, language=lang_key)
+            if display_code
+            else "",
+            "inputs": [],
+            "outputs": [],
+            "test": None,
+            "tasks": [],
+            "source": row.get("_setting", lang),
+            "has_ground_truth": False,
+            "has_tasks": False,
+            "fim_prefix": context,
+            "fim_ground_truth": next_line,
+            "fim_ground_truth_highlighted": _highlight_code(next_line, language=lang_key)
+            if next_line
+            else "",
+            "fim_merged_code": merged_code,
+            "fim_merged_highlighted": _highlight_code(
+                merged_code,
+                highlight_lines=list(range(gt_start_line, gt_end_line + 1)),
+                language=lang_key,
+            )
+            if merged_code
+            else "",
+            "fim_gt_start_line": gt_start_line,
+            "fim_gt_end_line": gt_end_line,
+            "language": lang,
+        }
+
+
+# ---------------------------------------------------------------------------
+# McEval (HuggingFace: Multilingual-Multimodal-NLP/McEval)
+# Massively multilingual code evaluation — 40 languages, 16K samples
+# ---------------------------------------------------------------------------
+
+
+class McEvalAdapter(DatasetAdapter):
+    slug = "mceval"
+    display_name = "McEval"
+    has_ground_truth = False
+    has_tasks = False
+
+    def __init__(self, hf_dataset):
+        self._ds = hf_dataset
+
+    def problem_count(self) -> int:
+        return len(self._ds)
+
+    def get_problem_summary(self, idx: int) -> dict[str, Any]:
+        row = self._ds[idx]
+        return {
+            "idx": idx,
+            "task_id": row.get("task_id", str(idx)),
+            "entry_point": row.get("entry_point", row.get("task_id", f"mceval_{idx}")),
+            "num_inputs": 0,
+            "source": row.get("language", "unknown"),
+        }
+
+    def get_problem_detail(self, idx: int) -> dict[str, Any]:
+        row = self._ds[idx]
+        prompt = row.get("prompt", "")
+        canonical = row.get("canonical_solution", "")
+        code = prompt + canonical
+        lang = row.get("language", "python")
+        lang_key = lang.lower()
+        # Map some known language names to Pygments lexer names
+        lang_map = {
+            "c++": "cpp",
+            "c#": "csharp",
+            "objective-c": "objectivec",
+            "visual basic": "vb.net",
+            "typescript": "typescript",
+        }
+        lang_key = lang_map.get(lang_key, lang_key)
+
+        return {
+            "idx": idx,
+            "task_id": row.get("task_id", str(idx)),
+            "entry_point": row.get("entry_point", row.get("task_id", f"mceval_{idx}")),
+            "code": code,
+            "highlighted_code": _highlight_code(code, language=lang_key),
+            "inputs": [],
+            "outputs": [],
+            "test": row.get("test", ""),
+            "tasks": [],
+            "source": lang,
+            "has_ground_truth": False,
+            "has_tasks": False,
+            "description": row.get("prompt", ""),
+            "language": lang,
+        }
+
+
+# ---------------------------------------------------------------------------
+# MultiPL-E (HuggingFace: nuprl/MultiPL-E)
+# Multi-language translated HumanEval/MBPP — 22 languages
+# ---------------------------------------------------------------------------
+
+
+class MultiPLEAdapter(DatasetAdapter):
+    slug = "multiple"
+    display_name = "MultiPL-E"
+    has_ground_truth = False
+    has_tasks = False
+
+    # Languages we load (subset of 22 available)
+    LANGUAGES = ["py", "cpp", "java", "js", "ts", "go", "rs", "cs", "rb", "lua"]
+
+    _LANG_LABELS = {
+        "py": "Python",
+        "cpp": "C++",
+        "java": "Java",
+        "js": "JavaScript",
+        "ts": "TypeScript",
+        "go": "Go",
+        "rs": "Rust",
+        "cs": "C#",
+        "rb": "Ruby",
+        "lua": "Lua",
+    }
+    _LANG_PYGMENTS = {
+        "py": "python",
+        "cpp": "cpp",
+        "java": "java",
+        "js": "javascript",
+        "ts": "typescript",
+        "go": "go",
+        "rs": "rust",
+        "cs": "csharp",
+        "rb": "ruby",
+        "lua": "lua",
+    }
+
+    def __init__(self, datasets_by_lang: dict[str, Any]):
+        self._by_lang = datasets_by_lang
+        first_lang = next(iter(self._by_lang))
+        self._count = len(self._by_lang[first_lang])
+
+    def problem_count(self) -> int:
+        return self._count
+
+    def get_problem_summary(self, idx: int) -> dict[str, Any]:
+        first_lang = next(iter(self._by_lang))
+        row = self._by_lang[first_lang][idx]
+        return {
+            "idx": idx,
+            "task_id": row.get("name", str(idx)),
+            "entry_point": row.get("name", f"multiple_{idx}"),
+            "num_inputs": len(self._by_lang),
+            "source": "MultiPL-E",
+        }
+
+    def get_problem_detail(self, idx: int) -> dict[str, Any]:
+        first_lang = next(iter(self._by_lang))
+        row = self._by_lang[first_lang][idx]
+
+        lang_solutions = []
+        for lang in self._by_lang:
+            lrow = self._by_lang[lang][idx]
+            prompt = lrow.get("prompt", "")
+            # MultiPL-E stores tests but may not have canonical solutions
+            tests = lrow.get("tests", "")
+            lang_key = self._LANG_PYGMENTS.get(lang, lang)
+            lang_label = self._LANG_LABELS.get(lang, lang)
+            lang_solutions.append(
+                {
+                    "language": lang,
+                    "language_label": lang_label,
+                    "code": prompt,
+                    "highlighted_code": _highlight_code(prompt, language=lang_key),
+                    "test": tests,
+                }
+            )
+
+        py_row = self._by_lang.get("py", self._by_lang[first_lang])[idx]
+        default_code = py_row.get("prompt", "")
+
+        return {
+            "idx": idx,
+            "task_id": row.get("name", str(idx)),
+            "entry_point": row.get("name", f"multiple_{idx}"),
+            "code": default_code,
+            "highlighted_code": _highlight_code(default_code),
+            "inputs": [],
+            "outputs": [],
+            "test": py_row.get("tests", ""),
+            "tasks": [],
+            "source": "MultiPL-E",
+            "has_ground_truth": False,
+            "has_tasks": False,
+            "lang_solutions": lang_solutions,
+        }
+
+
+# ---------------------------------------------------------------------------
+# Defects4J (HuggingFace: rufimelo/defects4j)
+# Java bug-fix benchmark — 854 real bugs from open-source projects
+# ---------------------------------------------------------------------------
+
+
+class Defects4JAdapter(DatasetAdapter):
+    slug = "defects4j"
+    display_name = "Defects4J"
+    has_ground_truth = False
+    has_tasks = False
+
+    def __init__(self, hf_dataset):
+        self._ds = hf_dataset
+
+    def problem_count(self) -> int:
+        return len(self._ds)
+
+    @staticmethod
+    def _project_from_bug_id(bug_id: str) -> str:
+        """Extract project name from bug_id like 'Compress-35'."""
+        return bug_id.rsplit("-", 1)[0] if "-" in bug_id else bug_id
+
+    def get_problem_summary(self, idx: int) -> dict[str, Any]:
+        row = self._ds[idx]
+        bug_id = row.get("bug_id", str(idx))
+        project = self._project_from_bug_id(bug_id)
+        return {
+            "idx": idx,
+            "task_id": bug_id,
+            "entry_point": project,
+            "num_inputs": 0,
+            "source": project,
+        }
+
+    def get_problem_detail(self, idx: int) -> dict[str, Any]:
+        row = self._ds[idx]
+        bug_id = row.get("bug_id", str(idx))
+        project = self._project_from_bug_id(bug_id)
+        buggy = row.get("func_before", "")
+        fixed = row.get("func_after", "")
+        return {
+            "idx": idx,
+            "task_id": bug_id,
+            "entry_point": project,
+            "code": fixed,
+            "highlighted_code": _highlight_code(fixed, language="java") if fixed else "",
+            "inputs": [],
+            "outputs": [],
+            "test": None,
+            "tasks": [],
+            "source": project,
+            "has_ground_truth": False,
+            "has_tasks": False,
+            "description": "",
+            "buggy_code": buggy,
+            "buggy_highlighted_code": _highlight_code(buggy, language="java") if buggy else "",
+            "fixed_code": fixed,
+            "fixed_highlighted_code": _highlight_code(fixed, language="java") if fixed else "",
+            "bug_category": "Bug Fix",
+            "bug_subtype": project,
+            "bug_explanation": "",
+            "language": "Java",
+        }
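As a standalone sketch of the FIM ground-truth line arithmetic used by `CrossCodeEvalAdapter` above (the helper name here is hypothetical): because the reference is appended directly to the prefix with no separating newline, the highlighted range starts on the prefix's last line.

```python
def fim_ground_truth_range(prefix: str, reference: str) -> tuple[int, int]:
    """Return the 1-indexed, inclusive line range the reference occupies
    when appended directly to the prefix, mirroring the adapter above."""
    start = prefix.count("\n") + 1  # prefix's last line, where the reference begins
    count = reference.count("\n") + (1 if reference else 0)
    return start, start + count - 1


# "a\nb\nc" + "x\ny" merges into lines: "a", "b", "cx", "y" — the
# reference spans lines 3 through 4
start, end = fim_ground_truth_range("a\nb\nc", "x\ny")
```

An empty reference yields an empty range (`end < start`), which the adapter's `highlight_lines=list(range(start, end + 1))` handles naturally as no highlighted lines.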
adapters/long_code_arena.py ADDED
@@ -0,0 +1,503 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Long Code Arena benchmark adapters (6 project-level tasks).
2
+
3
+ All datasets from: https://huggingface.co/collections/JetBrains-Research/long-code-arena
4
+ """
5
+
6
+ from __future__ import annotations
7
+
8
+ import json
9
+ from typing import Any
10
+
11
+ from adapters import DatasetAdapter
12
+
13
+ # Injected at runtime by _set_helpers()
14
+ _highlight_code = None
15
+ _code_offset = None
16
+ _extract_test_classes = None
17
+
18
+ # ---------------------------------------------------------------------------
19
+ # Shared helpers
20
+ # ---------------------------------------------------------------------------
21
+
22
+ _CODE_TRIM_LIMIT = 50_000 # chars for code / diff fields
23
+ _DESC_TRIM_LIMIT = 5_000 # chars for description / log fields
24
+
25
+
26
+ def _trim(text: str, limit: int, label: str = "Content") -> str:
27
+ """Return *text* unchanged if short enough, otherwise trim with an explicit marker."""
28
+ if len(text) <= limit:
29
+ return text
30
+ return (
31
+ text[:limit]
32
+ + f"\n\n--- {label} trimmed: showing {limit:,} of {len(text):,} characters ---"
33
+ )
34
+
35
+
36
+ _LOG_HEAD_LIMIT = 10_000 # chars budget for head part of CI log
37
+ _LOG_TAIL_LIMIT = 10_000 # chars budget for tail part of CI log
38
+
39
+
40
+ def _trim_head_tail(text: str, label: str = "Content") -> str:
41
+ """Show first ~10k chars and last ~10k chars (snapped to line boundaries)."""
42
+ if len(text) <= _LOG_HEAD_LIMIT + _LOG_TAIL_LIMIT:
43
+ return text
44
+
45
+ # Head: find the last newline within the budget
46
+ head_end = text.rfind("\n", 0, _LOG_HEAD_LIMIT)
47
+ if head_end <= 0:
48
+ head_end = _LOG_HEAD_LIMIT
49
+ head = text[:head_end]
50
+
51
+ # Tail: find the first newline after the cut point
52
+ tail_start = text.find("\n", len(text) - _LOG_TAIL_LIMIT)
53
+ if tail_start < 0 or tail_start >= len(text):
54
+ tail_start = len(text) - _LOG_TAIL_LIMIT
55
+ tail = text[tail_start:]
56
+
57
+ total_lines = text.count("\n") + 1
58
+ head_lines = head.count("\n") + 1
59
+ tail_lines = tail.count("\n") + 1
60
+ omitted = total_lines - head_lines - tail_lines
61
+
62
+ return (
63
+ head
64
+ + f"\n\n--- {label} trimmed: showing first {head_lines:,} and last"
65
+ f" {tail_lines:,} lines ({omitted:,} lines omitted,"
66
+ f" {len(text):,} chars total) ---\n\n"
67
+ + tail
68
+ )
69
+
70
+
71
+def _lca_repo_url(repo_slug: str) -> str:
+    """Convert an LCA-style repo slug to a GitHub URL.
+
+    LCA datasets use either ``owner__name`` (double underscore) or
+    ``owner/name`` (slash) depending on the task.
+    """
+    if not repo_slug:
+        return ""
+    # Normalise double-underscore to slash
+    ghname = repo_slug.replace("__", "/", 1) if "__" in repo_slug else repo_slug
+    return f"https://github.com/{ghname}"
+
+
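A standalone copy of the slug normalisation, with hypothetical slugs (not taken from the datasets) to show both accepted forms:

```python
# Standalone copy of _lca_repo_url; only the FIRST "__" is treated as the
# owner/name separator, so later double underscores stay literal.
def lca_repo_url(repo_slug: str) -> str:
    if not repo_slug:
        return ""
    ghname = repo_slug.replace("__", "/", 1) if "__" in repo_slug else repo_slug
    return f"https://github.com/{ghname}"

print(lca_repo_url("JetBrains__kotlin"))    # https://github.com/JetBrains/kotlin
print(lca_repo_url("octocat/hello-world"))  # https://github.com/octocat/hello-world
```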
+# ---------------------------------------------------------------------------
+# LCA Library-Based Code Generation
+# (HuggingFace: JetBrains-Research/lca-library-based-code-generation)
+# ---------------------------------------------------------------------------
+
+
+class LCALibCodeGenAdapter(DatasetAdapter):
+    slug = "lca-libcodegen"
+    display_name = "LCA Library-Based Code Gen"
+    has_ground_truth = False
+    has_tasks = False
+
+    def __init__(self, hf_dataset):
+        self._ds = hf_dataset
+
+    def problem_count(self) -> int:
+        return len(self._ds)
+
+    def get_problem_summary(self, idx: int) -> dict[str, Any]:
+        row = self._ds[idx]
+        return {
+            "idx": idx,
+            "task_id": row.get("repo_full_name", str(idx)),
+            "entry_point": row.get("repo_name", f"lca_libgen_{idx}"),
+            "num_inputs": row.get("n_unique_apis", 0),
+            "source": row.get("repo_owner", "LCA"),
+        }
+
+    def get_problem_detail(self, idx: int) -> dict[str, Any]:
+        row = self._ds[idx]
+        reference = row.get("clean_reference", row.get("reference", ""))
+        unique_apis = list(row.get("unique_apis", []))
+        repo_slug = row.get("repo_full_name", "")
+        return {
+            "idx": idx,
+            "task_id": repo_slug or str(idx),
+            "entry_point": row.get("repo_name", f"lca_libgen_{idx}"),
+            "code": reference,
+            "highlighted_code": _highlight_code(reference),
+            "inputs": [],
+            "outputs": [],
+            "test": None,
+            "tasks": [],
+            "source": row.get("repo_owner", "LCA"),
+            "has_ground_truth": False,
+            "has_tasks": False,
+            "description": row.get("instruction", ""),
+            "unique_apis": unique_apis,
+            "n_unique_apis": row.get("n_unique_apis", 0),
+            "repo_url": _lca_repo_url(repo_slug),
+        }
+
+
+# ---------------------------------------------------------------------------
+# LCA Project-Level Code Completion
+# (HuggingFace: JetBrains-Research/lca-project-level-code-completion)
+# ---------------------------------------------------------------------------
+
+
+class LCACodeCompletionAdapter(DatasetAdapter):
+    slug = "lca-codecompletion"
+    display_name = "LCA Project-Level Completion"
+    has_ground_truth = False
+    has_tasks = False
+
+    def __init__(self, rows: list[dict[str, Any]]):
+        self._rows = rows
+
+    def problem_count(self) -> int:
+        return len(self._rows)
+
+    def get_problem_summary(self, idx: int) -> dict[str, Any]:
+        row = self._rows[idx]
+        completion_file = row.get("completion_file", {})
+        filename = completion_file.get("filename", "") if isinstance(completion_file, dict) else ""
+        return {
+            "idx": idx,
+            "task_id": row.get("repo", str(idx)),
+            "entry_point": filename.rsplit("/", 1)[-1] if filename else f"completion_{idx}",
+            "num_inputs": 0,
+            "source": row.get("_context_size", "LCA"),
+        }
+
+    def get_problem_detail(self, idx: int) -> dict[str, Any]:
+        row = self._rows[idx]
+        completion_file = row.get("completion_file", {})
+        if isinstance(completion_file, dict):
+            filename = completion_file.get("filename", "")
+            content = completion_file.get("content", "")
+        else:
+            filename = ""
+            content = ""
+
+        completion_lines = row.get("completion_lines", {})
+        if isinstance(completion_lines, dict):
+            committed = completion_lines.get("committed", [])
+        else:
+            committed = []
+
+        lang = "python"
+        if filename:
+            ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
+            ext_map = {
+                "py": "python",
+                "java": "java",
+                "kt": "kotlin",
+                "js": "javascript",
+                "ts": "typescript",
+                "cpp": "cpp",
+                "c": "c",
+                "go": "go",
+                "rs": "rust",
+                "rb": "ruby",
+            }
+            lang = ext_map.get(ext, "python")
+
+        repo_slug = row.get("repo", "")
+        commit_hash = row.get("commit_hash", "")
+        repo_url = _lca_repo_url(repo_slug)
+        commit_url = f"{repo_url}/commit/{commit_hash}" if repo_url and commit_hash else ""
+
+        return {
+            "idx": idx,
+            "task_id": repo_slug or str(idx),
+            "entry_point": filename.rsplit("/", 1)[-1] if filename else f"completion_{idx}",
+            "code": content,
+            "highlighted_code": _highlight_code(content, language=lang) if content else "",
+            "inputs": [],
+            "outputs": [],
+            "test": None,
+            "tasks": [],
+            "source": row.get("_context_size", "LCA"),
+            "has_ground_truth": False,
+            "has_tasks": False,
+            "description": f"File: {filename}\nCommit: {commit_hash[:12]}",
+            "completion_lines_committed": committed,
+            "language": lang,
+            "repo_url": repo_url,
+            "commit_url": commit_url,
+        }
+
+
+# ---------------------------------------------------------------------------
+# LCA Bug Localization
+# (HuggingFace: JetBrains-Research/lca-bug-localization)
+# ---------------------------------------------------------------------------
+
+
+class LCABugLocalizationAdapter(DatasetAdapter):
+    slug = "lca-buglocalization"
+    display_name = "LCA Bug Localization"
+    has_ground_truth = False
+    has_tasks = False
+
+    def __init__(self, hf_dataset):
+        self._ds = hf_dataset
+
+    def problem_count(self) -> int:
+        return len(self._ds)
+
+    def get_problem_summary(self, idx: int) -> dict[str, Any]:
+        row = self._ds[idx]
+        return {
+            "idx": idx,
+            "task_id": row.get("text_id", str(idx)),
+            "entry_point": f"{row.get('repo_owner', '')}/{row.get('repo_name', '')}",
+            "num_inputs": row.get("changed_files_count", 0),
+            "source": row.get("repo_language", "unknown"),
+        }
+
+    def get_problem_detail(self, idx: int) -> dict[str, Any]:
+        row = self._ds[idx]
+        diff = row.get("diff", "")
+        repo_owner = row.get("repo_owner", "")
+        repo_name = row.get("repo_name", "")
+        repo = f"{repo_owner}/{repo_name}" if repo_owner and repo_name else ""
+        issue_url = row.get("issue_url", "")
+        pull_url = row.get("pull_url", "")
+
+        return {
+            "idx": idx,
+            "task_id": row.get("text_id", str(idx)),
+            "entry_point": repo or f"bug_{idx}",
+            "code": diff,
+            "highlighted_code": "",
+            "inputs": [],
+            "outputs": [],
+            "test": None,
+            "tasks": [],
+            "source": row.get("repo_language", "unknown"),
+            "has_ground_truth": False,
+            "has_tasks": False,
+            "description": row.get("issue_title", "")
+            + ("\n\n" + row.get("issue_body", "") if row.get("issue_body") else ""),
+            "patch": diff,
+            "repo": repo,
+            "repo_url": f"https://github.com/{repo}" if repo else "",
+            "issue_url": issue_url,
+            "commit_url": pull_url,
+        }
+
+
+# ---------------------------------------------------------------------------
+# LCA Commit Message Generation
+# (HuggingFace: JetBrains-Research/lca-commit-message-generation)
+# ---------------------------------------------------------------------------
+
+
+class LCACommitMsgGenAdapter(DatasetAdapter):
+    slug = "lca-commitmsg"
+    display_name = "LCA Commit Message Gen"
+    has_ground_truth = False
+    has_tasks = False
+
+    def __init__(self, hf_dataset):
+        self._ds = hf_dataset
+
+    def problem_count(self) -> int:
+        return len(self._ds)
+
+    def get_problem_summary(self, idx: int) -> dict[str, Any]:
+        row = self._ds[idx]
+        mods = row.get("mods", [])
+        n_files = len(mods) if isinstance(mods, list) else 0
+        return {
+            "idx": idx,
+            "task_id": row.get("hash", str(idx))[:12],
+            "entry_point": row.get("repo", f"commit_{idx}"),
+            "num_inputs": n_files,
+            "source": row.get("license", "LCA")[:20],
+        }
+
+    def get_problem_detail(self, idx: int) -> dict[str, Any]:
+        row = self._ds[idx]
+        message = row.get("message", "")
+        mods = row.get("mods", [])
+
+        # Build a unified diff from all modifications
+        diff_parts = []
+        if isinstance(mods, list):
+            for mod in mods:
+                if isinstance(mod, dict):
+                    old_path = mod.get("old_path", "")
+                    new_path = mod.get("new_path", "")
+                    mod_diff = mod.get("diff", "")
+                    if mod_diff:
+                        diff_parts.append(
+                            f"diff --git a/{old_path} b/{new_path}\n"
+                            f"--- a/{old_path}\n"
+                            f"+++ b/{new_path}\n"
+                            f"{mod_diff}"
+                        )
+        combined_diff = "\n".join(diff_parts)
+        trimmed_diff = _trim(combined_diff, _CODE_TRIM_LIMIT, "Diff")
+
+        repo_slug = row.get("repo", "")
+        commit_hash = row.get("hash", "")
+        repo_url = _lca_repo_url(repo_slug)
+        commit_url = f"{repo_url}/commit/{commit_hash}" if repo_url and commit_hash else ""
+
+        return {
+            "idx": idx,
+            "task_id": (commit_hash or str(idx))[:12],
+            "entry_point": repo_slug or f"commit_{idx}",
+            "code": trimmed_diff,
+            "highlighted_code": "",
+            "inputs": [],
+            "outputs": [],
+            "test": None,
+            "tasks": [],
+            "source": row.get("license", "LCA")[:20],
+            "has_ground_truth": False,
+            "has_tasks": False,
+            "description": message,
+            "patch": trimmed_diff,
+            "repo": repo_slug,
+            "repo_url": repo_url,
+            "commit_url": commit_url,
+            "commit_hash": commit_hash,
+        }
+
+
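The commit-message adapter synthesizes `diff --git` headers because each LCA `mods` entry stores only the per-file hunk text. A self-contained sketch on a made-up modification record (the paths and hunk below are invented for illustration):

```python
# One made-up LCA-style "mods" entry: paths plus a bare hunk, no git header.
mods = [
    {
        "old_path": "src/app.py",
        "new_path": "src/app.py",
        "diff": "@@ -1 +1 @@\n-print('hi')\n+print('hello')",
    },
]

# Prepend the standard git header to each hunk, then join the files.
diff_parts = []
for mod in mods:
    diff_parts.append(
        f"diff --git a/{mod['old_path']} b/{mod['new_path']}\n"
        f"--- a/{mod['old_path']}\n"
        f"+++ b/{mod['new_path']}\n"
        f"{mod['diff']}"
    )
combined = "\n".join(diff_parts)
print(combined)
```

The result renders like ordinary `git diff` output, so the same diff viewer can be reused for this task.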
+# ---------------------------------------------------------------------------
+# LCA CI Builds Repair
+# (HuggingFace: JetBrains-Research/lca-ci-builds-repair)
+# ---------------------------------------------------------------------------
+
+
+class LCACIRepairAdapter(DatasetAdapter):
+    slug = "lca-cirepair"
+    display_name = "LCA CI Builds Repair"
+    has_ground_truth = False
+    has_tasks = False
+
+    def __init__(self, hf_dataset):
+        self._ds = hf_dataset
+
+    def problem_count(self) -> int:
+        return len(self._ds)
+
+    def get_problem_summary(self, idx: int) -> dict[str, Any]:
+        row = self._ds[idx]
+        repo = f"{row.get('repo_owner', '')}/{row.get('repo_name', '')}"
+        return {
+            "idx": idx,
+            "task_id": str(row.get("id", idx)),
+            "entry_point": repo,
+            "num_inputs": 0,
+            "source": f"difficulty-{row.get('difficulty', '?')}",
+        }
+
+    def get_problem_detail(self, idx: int) -> dict[str, Any]:
+        row = self._ds[idx]
+        diff = row.get("diff", "")
+        trimmed_diff = _trim(diff, _CODE_TRIM_LIMIT, "Diff")
+        repo_owner = row.get("repo_owner", "")
+        repo_name = row.get("repo_name", "")
+        repo = f"{repo_owner}/{repo_name}" if repo_owner and repo_name else ""
+        commit_link = row.get("commit_link", "")
+
+        # Extract log text; it can be several MB, so trim explicitly
+        logs = row.get("logs", [])
+        log_text = ""
+        if isinstance(logs, list):
+            for entry in logs:
+                if isinstance(entry, dict):
+                    step = entry.get("step_name", "")
+                    log = entry.get("log", "")
+                    log_text += f"=== {step} ===\n{log}\n\n"
+        trimmed_log = _trim_head_tail(log_text, "CI log")
+
+        return {
+            "idx": idx,
+            "task_id": str(row.get("id", idx)),
+            "entry_point": repo or f"ci_{idx}",
+            "code": trimmed_diff,
+            "highlighted_code": "",
+            "inputs": [],
+            "outputs": [],
+            "test": None,
+            "tasks": [],
+            "source": f"difficulty-{row.get('difficulty', '?')}",
+            "has_ground_truth": False,
+            "has_tasks": False,
+            "description": f"Workflow: {row.get('workflow_name', '')}\n"
+            f"Branch: {row.get('head_branch', '')}\n"
+            f"Contributor: {row.get('contributor', '')}\n\n"
+            f"CI Log:\n{trimmed_log}",
+            "patch": trimmed_diff,
+            "repo": repo,
+            "repo_url": f"https://github.com/{repo}" if repo else "",
+            "commit_url": commit_link,
+        }
+
+
+# ---------------------------------------------------------------------------
+# LCA Module Summarization
+# (HuggingFace: JetBrains-Research/lca-module-summarization)
+# ---------------------------------------------------------------------------
+
+
+class LCAModuleSummarizationAdapter(DatasetAdapter):
+    slug = "lca-modulesumm"
+    display_name = "LCA Module Summarization"
+    has_ground_truth = False
+    has_tasks = False
+
+    def __init__(self, hf_dataset):
+        self._ds = hf_dataset
+
+    def problem_count(self) -> int:
+        return len(self._ds)
+
+    def get_problem_summary(self, idx: int) -> dict[str, Any]:
+        row = self._ds[idx]
+        return {
+            "idx": idx,
+            "task_id": row.get("docfile_name", str(idx)),
+            "entry_point": row.get("repo", f"module_{idx}"),
+            "num_inputs": 0,
+            "source": row.get("doc_type", "LCA"),
+        }
+
+    def get_problem_detail(self, idx: int) -> dict[str, Any]:
+        row = self._ds[idx]
+        target_text = row.get("target_text", "")
+        # Code context can be extremely large (up to 23 MB); trim with explicit marker
+        code_context = row.get("relevant_code_context", "")
+        trimmed_code = _trim(code_context, _CODE_TRIM_LIMIT, "Code context")
+
+        relevant_files = row.get("relevant_code_files", [])
+        if isinstance(relevant_files, str):
+            try:
+                relevant_files = json.loads(relevant_files)
+            except (json.JSONDecodeError, TypeError):
+                relevant_files = [relevant_files]
+
+        repo_slug = row.get("repo", "")
+        repo_url = _lca_repo_url(repo_slug)
+        trimmed_target = _trim(target_text, _DESC_TRIM_LIMIT, "Target documentation")
+
+        return {
+            "idx": idx,
+            "task_id": row.get("docfile_name", str(idx)),
+            "entry_point": repo_slug or f"module_{idx}",
+            "code": trimmed_code,
+            "highlighted_code": _highlight_code(trimmed_code) if trimmed_code else "",
+            "inputs": [],
+            "outputs": [],
+            "test": None,
+            "tasks": [],
+            "source": row.get("doc_type", "LCA"),
+            "has_ground_truth": False,
+            "has_tasks": False,
+            "description": f"Intent: {row.get('intent', '')}\n\n"
+            f"Doc file: {row.get('path_to_docfile', '')}\n"
+            f"Relevant files: {', '.join(relevant_files) if isinstance(relevant_files, list) else ''}\n\n"
+            f"Target documentation:\n{trimmed_target}",
+            "repo_url": repo_url,
+        }
adapters/registration.py CHANGED
@@ -7,6 +7,15 @@ import random
 from typing import Any
 
 from adapters import REGISTRY
+from adapters.additional import (
+    CrossCodeEvalAdapter,
+    Defects4JAdapter,
+    DPAIAEEDatasetAdapter,
+    McEvalAdapter,
+    MultiPLEAdapter,
+    MultiSWEBenchAdapter,
+    SWEBenchMultilingualAdapter,
+)
 from adapters.code_editing import (
     CanItEditAdapter,
     CodeEditorBenchAdapter,
@@ -38,6 +47,14 @@ from adapters.code_reasoning import (
     HumanEvalXAdapter,
     SAFIMAdapter,
 )
+from adapters.long_code_arena import (
+    LCABugLocalizationAdapter,
+    LCACIRepairAdapter,
+    LCACodeCompletionAdapter,
+    LCACommitMsgGenAdapter,
+    LCALibCodeGenAdapter,
+    LCAModuleSummarizationAdapter,
+)
 from adapters.vulnerability import (
     BigVulAdapter,
     DevignAdapter,
@@ -408,3 +425,216 @@ def register_hf_datasets() -> None:
         print(f"Loaded EffiBench: {len(effibench)} problems")
     except Exception as e:
         print(f"Warning: could not load EffiBench: {e}")
+
+    # --- Long Code Arena datasets (6 tasks) ---
+
+    try:
+        lca_libgen = load_dataset(
+            "JetBrains-Research/lca-library-based-code-generation", split="test"
+        )
+        REGISTRY["lca-libcodegen"] = LCALibCodeGenAdapter(lca_libgen)
+        print(f"Loaded LCA Library-Based Code Gen: {len(lca_libgen)} problems")
+    except Exception as e:
+        print(f"Warning: could not load LCA Library-Based Code Gen: {e}")
+
+    try:
+        # Load small and medium context sizes only (large/huge are multi-GB)
+        lca_cc_rows: list[dict[str, Any]] = []
+        for ctx in ("small_context", "medium_context"):
+            try:
+                ds = load_dataset(
+                    "JetBrains-Research/lca-project-level-code-completion", ctx, split="test"
+                )
+                for i in range(len(ds)):
+                    row = dict(ds[i])
+                    row["_context_size"] = ctx
+                    lca_cc_rows.append(row)
+            except Exception:
+                pass
+        if lca_cc_rows:
+            lca_cc_sampled = _sample_list(lca_cc_rows)
+            adapter = LCACodeCompletionAdapter(lca_cc_sampled)
+            adapter.total_count = len(lca_cc_rows)
+            REGISTRY["lca-codecompletion"] = adapter
+            print(
+                f"Loaded LCA Project-Level Completion: "
+                f"{len(lca_cc_sampled)} problems (of {len(lca_cc_rows)})"
+            )
+        else:
+            print("Warning: could not load any LCA Code Completion context sizes")
+    except Exception as e:
+        print(f"Warning: could not load LCA Project-Level Completion: {e}")
+
+    try:
+        # Merge all language subsets (py, java, kt) using test split
+        lca_bl_all = None
+        for lang_subset in ("py", "java", "kt"):
+            try:
+                ds = load_dataset(
+                    "JetBrains-Research/lca-bug-localization", lang_subset, split="test"
+                )
+                if lca_bl_all is None:
+                    from datasets import concatenate_datasets
+
+                    lca_bl_all = ds
+                else:
+                    lca_bl_all = concatenate_datasets([lca_bl_all, ds])
+            except Exception:
+                pass
+        if lca_bl_all is not None and len(lca_bl_all) > 0:
+            lca_bl = _sample_hf_dataset(lca_bl_all)
+            adapter = LCABugLocalizationAdapter(lca_bl)
+            adapter.total_count = len(lca_bl_all)
+            REGISTRY["lca-buglocalization"] = adapter
+            print(f"Loaded LCA Bug Localization: {len(lca_bl)} problems (of {len(lca_bl_all)})")
+        else:
+            print("Warning: could not load any LCA Bug Localization language subsets")
+    except Exception as e:
+        print(f"Warning: could not load LCA Bug Localization: {e}")
+
+    try:
+        lca_cmg = load_dataset("JetBrains-Research/lca-commit-message-generation", split="test")
+        REGISTRY["lca-commitmsg"] = LCACommitMsgGenAdapter(lca_cmg)
+        print(f"Loaded LCA Commit Message Gen: {len(lca_cmg)} problems")
+    except Exception as e:
+        print(f"Warning: could not load LCA Commit Message Gen: {e}")
+
+    try:
+        lca_ci = load_dataset("JetBrains-Research/lca-ci-builds-repair", split="test")
+        REGISTRY["lca-cirepair"] = LCACIRepairAdapter(lca_ci)
+        print(f"Loaded LCA CI Builds Repair: {len(lca_ci)} problems")
+    except Exception as e:
+        print(f"Warning: could not load LCA CI Builds Repair: {e}")
+
+    try:
+        lca_ms = load_dataset("JetBrains-Research/lca-module-summarization", split="test")
+        REGISTRY["lca-modulesumm"] = LCAModuleSummarizationAdapter(lca_ms)
+        print(f"Loaded LCA Module Summarization: {len(lca_ms)} problems")
+    except Exception as e:
+        print(f"Warning: could not load LCA Module Summarization: {e}")
+
+    # --- DPAIA Enterprise Evaluation Dataset ---
+
+    try:
+        import urllib.request
+
+        url = "https://raw.githubusercontent.com/dpaia/ee-dataset/main/datasets/java-spring-ee-dataset.json"
+        with urllib.request.urlopen(url) as resp:
+            dpaia_rows = json.loads(resp.read().decode("utf-8"))
+        if dpaia_rows:
+            REGISTRY["dpaia-ee"] = DPAIAEEDatasetAdapter(dpaia_rows)
+            print(f"Loaded DPAIA EE-Dataset: {len(dpaia_rows)} problems")
+    except Exception as e:
+        print(f"Warning: could not load DPAIA EE-Dataset: {e}")
+
+    # --- Multi-SWE-bench (ByteDance, multilingual issue resolving) ---
+    # Dataset has 40 per-repo JSONL files with inconsistent schemas; load directly.
+
+    try:
+        from huggingface_hub import list_repo_files
+
+        mswe_files = list_repo_files("ByteDance-Seed/Multi-SWE-bench", repo_type="dataset")
+        mswe_jsonl = [f for f in mswe_files if f.endswith(".jsonl")]
+        mswe_rows: list[dict[str, Any]] = []
+        for fname in mswe_jsonl:
+            lang_dir = fname.split("/")[0] if "/" in fname else ""
+            try:
+                rows = _load_jsonl_dataset("ByteDance-Seed/Multi-SWE-bench", [fname])
+                for d in rows:
+                    d["_language"] = lang_dir
+                mswe_rows.extend(rows)
+            except Exception:
+                pass
+        if mswe_rows:
+            mswe_sampled = _sample_list(mswe_rows)
+            adapter = MultiSWEBenchAdapter(mswe_sampled)
+            adapter.total_count = len(mswe_rows)
+            REGISTRY["multiswebench"] = adapter
+            print(f"Loaded Multi-SWE-bench: {len(mswe_sampled)} problems (of {len(mswe_rows)})")
+        else:
+            print("Warning: could not load any Multi-SWE-bench JSONL files")
+    except Exception as e:
+        print(f"Warning: could not load Multi-SWE-bench: {e}")
+
+    # --- SWE-bench Multilingual ---
+
+    try:
+        swe_ml = load_dataset("SWE-bench/SWE-bench_Multilingual", split="test")
+        REGISTRY["swebenchmultilingual"] = SWEBenchMultilingualAdapter(swe_ml)
+        print(f"Loaded SWE-bench Multilingual: {len(swe_ml)} problems")
+    except Exception as e:
+        print(f"Warning: could not load SWE-bench Multilingual: {e}")
+
+    # --- CrossCodeEval (cross-file code completion, 4 languages) ---
+    # Dataset has inconsistent columns across files; load only base line_completion.jsonl per lang
+
+    try:
+        cceval_rows: list[dict[str, Any]] = []
+        for lang in ("python", "java", "typescript", "csharp"):
+            try:
+                rows = _load_jsonl_dataset(
+                    "Vincentvmt/CrossCodeEval",
+                    [f"crosscodeeval_data/{lang}/line_completion.jsonl"],
+                )
+                for d in rows:
+                    d["language"] = lang
+                cceval_rows.extend(rows)
+            except Exception:
+                pass
+        if cceval_rows:
+            cceval_sampled = _sample_list(cceval_rows)
+            adapter = CrossCodeEvalAdapter(cceval_sampled)
+            adapter.total_count = len(cceval_rows)
+            REGISTRY["crosscodeeval"] = adapter
+            print(f"Loaded CrossCodeEval: {len(cceval_sampled)} problems (of {len(cceval_rows)})")
+        else:
+            print("Warning: could not load any CrossCodeEval language subsets")
+    except Exception as e:
+        print(f"Warning: could not load CrossCodeEval: {e}")
+
+    # --- McEval (massively multilingual code evaluation, 40 languages) ---
+
+    try:
+        mceval_full = load_dataset("Multilingual-Multimodal-NLP/McEval", "generation", split="test")
+        mceval = _sample_hf_dataset(mceval_full)
+        adapter = McEvalAdapter(mceval)
+        adapter.total_count = len(mceval_full)
+        REGISTRY["mceval"] = adapter
+        print(f"Loaded McEval: {len(mceval)} problems (of {len(mceval_full)})")
+    except Exception as e:
+        print(f"Warning: could not load McEval: {e}")
+
+    # --- MultiPL-E (multilingual HumanEval/MBPP, 22 languages) ---
+
+    try:
+        mple_datasets = {}
+        for lang_ext in MultiPLEAdapter.LANGUAGES:
+            try:
+                mple_datasets[lang_ext] = load_dataset(
+                    "nuprl/MultiPL-E", f"humaneval-{lang_ext}", split="test"
+                )
+            except Exception:
+                pass
+        if mple_datasets:
+            REGISTRY["multiple"] = MultiPLEAdapter(mple_datasets)
+            first = next(iter(mple_datasets))
+            print(
+                f"Loaded MultiPL-E: {len(mple_datasets)} languages, "
+                f"{len(mple_datasets[first])} problems each"
+            )
+        else:
+            print("Warning: could not load any MultiPL-E language subsets")
+    except Exception as e:
+        print(f"Warning: could not load MultiPL-E: {e}")
+
+    # --- Defects4J (Java bug-fix benchmark) ---
+
+    try:
+        d4j = load_dataset("rufimelo/defects4j", split="train")
+        d4j_sampled = _sample_hf_dataset(d4j)
+        adapter = Defects4JAdapter(d4j_sampled)
+        adapter.total_count = len(d4j)
+        REGISTRY["defects4j"] = adapter
+        print(f"Loaded Defects4J: {len(d4j_sampled)} problems (of {len(d4j)})")
+    except Exception as e:
+        print(f"Warning: could not load Defects4J: {e}")
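Every registration above follows the same guarded pattern: load, register, report, and swallow failures per dataset so one broken source never blocks the rest. A self-contained sketch of that pattern (the registry, loader, and dataset names here are stand-ins, not the real adapters or `load_dataset`):

```python
# Stand-in registry and loader to illustrate the per-dataset try/except pattern.
REGISTRY: dict[str, object] = {}

def load_fake_dataset(name: str) -> list[dict]:
    # Simulates a loader that can fail for some sources.
    if name == "broken":
        raise RuntimeError("no data files")
    return [{"task_id": "t1"}, {"task_id": "t2"}]

def register(slug: str, name: str) -> None:
    # Each dataset gets its own try/except, so a single failure only
    # prints a warning and the remaining datasets still register.
    try:
        ds = load_fake_dataset(name)
        REGISTRY[slug] = ds
        print(f"Loaded {name}: {len(ds)} problems")
    except Exception as e:
        print(f"Warning: could not load {name}: {e}")

register("ok", "good")
register("bad", "broken")
print(sorted(REGISTRY))  # ['ok'] -- only the dataset that loaded is registered
```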