Commit `9f85fac` (egor-bogomolov): Add 13 new benchmark datasets (batches 6-8)
# Benchmark Integration Progress
## Status: Batches 1-7 Complete (41 datasets)
## Batch Plan
### Batch 1 (Highest Priority -- Easy HF, High Influence)
| Benchmark | Slug | Status | HF Dataset | View Type |
|-----------|------|--------|------------|-----------|
| MBPP+ | `mbppplus` | Done | `evalplus/mbppplus` | Simple |
| ClassEval | `classeval` | Done | `FudanSELab/ClassEval` | Simple |
| LiveCodeBench | `livecodebench` | Done | `livecodebench/code_generation_lite` | Simple |
| DebugBench | `debugbench` | Done | `Rtian/DebugBench` | Before/After |
| HumanEval-X | `humanevalx` | Done | `THUDM/humaneval-x` | Multi-language |
**Refactoring done:** Multi-language syntax highlighting via `get_lexer_by_name()`. Before/after code diff view. Multi-language tab view.
### Batch 2
| Benchmark | Slug | Status | HF Dataset | View Type |
|-----------|------|--------|------------|-----------|
| SWE-bench Lite | `swebenchlite` | Done | `princeton-nlp/SWE-bench_Lite` | Diff |
| CodeContests | `codecontests` | Done | `deepmind/code_contests` | Multi-solution |
| APPS | `apps` | Done | `codeparrot/apps` | Multi-solution / Simple |
| CanItEdit | `canitedit` | Done | `nuprl/CanItEdit` | Before/After |
| MBPP | `mbpp` | Done | `google-research-datasets/mbpp` | Simple |
**New views:** Unified diff view for SWE-bench patches. Multi-solution view extended to show language labels for CodeContests.
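The unified diff view can be approximated with the standard library alone; `unified_diff_view` below is a hypothetical helper sketching the idea, not the app's actual function:

```python
import difflib

def unified_diff_view(before: str, after: str, path: str = "file.py") -> str:
    """Build a unified diff string for before/after file contents,
    using SWE-bench-style a/ and b/ path prefixes."""
    lines = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(lines)
```

The resulting text can be fed straight into any diff-highlighting front end, since it follows the standard `---`/`+++`/`@@` hunk format.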
### Batch 3
| Benchmark | Slug | Status | HF Dataset | View Type |
|-----------|------|--------|------------|-----------|
| SAFIM | `safim` | Done | `gonglinyuan/safim` | Fill-in-the-Middle |
| BigVul | `bigvul` | Done | `bstee615/bigvul` | Vulnerability |
| DiverseVul | `diversevul` | Done | `claudios/DiverseVul` | Vulnerability |
| PrimeVul | `primevul` | Done | `starsofchance/PrimeVul` | Vulnerability |
| CodeEditorBench | `codeeditorbench` | Done | `m-a-p/CodeEditorBench` | Before/After |
**New views:** Fill-in-the-Middle view showing code with [HOLE] marker and ground truth. Vulnerability view with CWE badges and vulnerable/patched code comparison.
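At heart the FIM view is string assembly: show the masked prompt and a merged version with the ground truth marked. A minimal sketch (the function name and `<mark>` markup are illustrative, not the app's actual templates):

```python
import html

HOLE = "[HOLE]"

def render_fim(prefix: str, ground_truth: str, suffix: str) -> dict:
    """Build both FIM panes: the [HOLE]-masked prompt, and the merged
    code with the ground-truth span wrapped for highlighting."""
    return {
        "prompt": prefix + HOLE + suffix,
        "merged": prefix + "<mark>" + html.escape(ground_truth) + "</mark>" + suffix,
    }
```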
### Batch 4
| Benchmark | Slug | Status | HF Dataset | View Type |
|-----------|------|--------|------------|-----------|
| SWE-bench Verified | `swebenchverified` | Done | `princeton-nlp/SWE-bench_Verified` | Diff |
| CodeSearchNet | `codesearchnet` | Done | `code-search-net/code_search_net` | Simple |
| Devign | `devign` | Done | `google/code_x_glue_cc_defect_detection` | Vulnerability |
### Dropped from original plan
| Benchmark | Reason |
|-----------|--------|
| DS-1000 | Complex library-specific format, limited visualization value |
| RepoBench | Repo-level context too complex for per-problem viewing |
| MultiPL-E | 22 languages but same problems as HumanEval/MBPP already covered (later integrated in Batch 7) |
| McEval | Very large (40 languages), complex format (later integrated in Batch 7) |
| xCodeEval | Very large (25M rows), 7 tasks, too complex |
| CrossVul | Similar to DiverseVul/BigVul, diminishing returns |
### Batch 5
| Benchmark | Slug | Status | HF Dataset | View Type |
|-----------|------|--------|------------|-----------|
| BigCodeBench | `bigcodebench` | Done | `bigcode/bigcodebench` | Simple |
| HumanEvalPack | `humanevalpack` | Done | `bigcode/humanevalpack` | Multi-language + Before/After |
| CodeXGLUE Refinement | `codexgluerefinement` | Done | `google/code_x_glue_cc_code_refinement` | Before/After |
| SWE-bench | `swebenchfull` | Done | `princeton-nlp/SWE-bench` | Diff |
| CommitBench | `commitbench` | Done | `Maxscha/commitbench` | Diff |
| EffiBench | `effibench` | Done | `DONG19/EffiBench` | Simple |
**New views:** Multi-language view with canonical/buggy code toggle (HumanEvalPack). CommitBench reuses diff view. CodeXGLUE Refinement uses before/after Java view.
### Deferred (GitHub-only or complex infrastructure)
CoderEval, NaturalCodeBench, DevEval, RunBugRun, ConDefects, FixEval, TransCoder, AVATAR, TypeEvalPy, VJBench, SVEN, PyTER. (Defects4J was originally deferred here but was later integrated in Batch 7.)
## Architecture Decisions
### Multi-language Support
- `highlight_code()` in `app.py` accepts `language` parameter (default: `"python"`)
- Uses `get_lexer_by_name()` from Pygments for automatic lexer selection
- Adapters pass language when calling `_highlight_code(code, language=...)`
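A minimal sketch of this helper, assuming a fallback to the Python lexer when an adapter passes an unknown alias (the fallback behavior is an assumption, not documented above):

```python
from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import get_lexer_by_name
from pygments.util import ClassNotFound

def highlight_code(code: str, language: str = "python") -> str:
    """Render code as HTML via Pygments, selecting the lexer by language
    alias and falling back to Python when the alias is unknown."""
    try:
        lexer = get_lexer_by_name(language)
    except ClassNotFound:
        lexer = get_lexer_by_name("python")  # assumed fallback
    return highlight(code, lexer, HtmlFormatter(nowrap=True))
```

`get_lexer_by_name()` accepts the short aliases HF datasets typically carry (`"cpp"`, `"java"`, `"go"`, ...), which is what makes the multi-language views cheap to support.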
### View Types Implemented
1. **BigOBench view** -- multiple solutions with complexity badges
2. **Simple view** -- code + inputs/outputs + test suite (HumanEval+, MBPP+, MBPP, ClassEval, LiveCodeBench, APPS, CodeSearchNet)
3. **CRUXEval view** -- given/predict task selector
4. **DREval view** -- full interactive view with coverage, arrows, ground truth
5. **Before/After view** -- side-by-side buggy/fixed code (DebugBench, CanItEdit, CodeEditorBench)
6. **Multi-language view** -- same problem in multiple languages (HumanEval-X, HumanEvalPack)
7. **Diff view** -- unified diff patch visualization (SWE-bench Lite, SWE-bench Verified, SWE-bench, CommitBench)
8. **Fill-in-the-Middle view** -- prefix + [HOLE] + suffix (SAFIM)
9. **Vulnerability view** -- vulnerable/patched code + CWE labels (BigVul, DiverseVul, PrimeVul, Devign)
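One way to wire view types to renderers is a small registry that adapters declare into; this is a hypothetical sketch of such a dispatch layer, not the app's actual wiring:

```python
from typing import Callable, Dict

# Hypothetical registry: maps a view-type name to its renderer.
VIEW_RENDERERS: Dict[str, Callable[[dict], str]] = {}

def register_view(view_type: str):
    """Decorator that registers a renderer under a view-type name."""
    def wrap(fn: Callable[[dict], str]) -> Callable[[dict], str]:
        VIEW_RENDERERS[view_type] = fn
        return fn
    return wrap

@register_view("simple")
def render_simple(problem: dict) -> str:
    return problem["code"]

@register_view("before_after")
def render_before_after(problem: dict) -> str:
    return problem["buggy"] + "\n---\n" + problem["fixed"]

def render_problem(problem: dict) -> str:
    """Dispatch on the problem's declared view type, defaulting to 'simple'."""
    renderer = VIEW_RENDERERS.get(problem.get("view", "simple"), render_simple)
    return renderer(problem)
```

New view types (FIM, vulnerability, diff, ...) then plug in without touching the dispatch code.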
## Batch Plan (continued)
### Batch 6 -- Long Code Arena (6 project-level tasks)
| Benchmark | Slug | Status | HF Dataset | View Type |
|-----------|------|--------|------------|-----------|
| LCA Library-Based Code Gen | `lca-libcodegen` | Done | `JetBrains-Research/lca-library-based-code-generation` | Simple |
| LCA Project-Level Completion | `lca-codecompletion` | Done | `JetBrains-Research/lca-project-level-code-completion` | Simple |
| LCA Bug Localization | `lca-buglocalization` | Done | `JetBrains-Research/lca-bug-localization` | Diff |
| LCA Commit Message Gen | `lca-commitmsg` | Done | `JetBrains-Research/lca-commit-message-generation` | Diff |
| LCA CI Builds Repair | `lca-cirepair` | Done | `JetBrains-Research/lca-ci-builds-repair` | Diff |
| LCA Module Summarization | `lca-modulesumm` | Done | `JetBrains-Research/lca-module-summarization` | Simple |
**New adapter module:** `adapters/long_code_arena.py` -- all 6 Long Code Arena project-level tasks.
### Batch 7 -- DPAIA & Additional Benchmarks (7 datasets)
| Benchmark | Slug | Status | Source | View Type |
|-----------|------|--------|--------|-----------|
| DPAIA EE-Dataset | `dpaia-ee` | Done | `github.com/dpaia/ee-dataset` (JSON) | Diff (SWE-bench style) |
| Multi-SWE-bench | `multiswebench` | Done | `ByteDance-Seed/Multi-SWE-bench` (JSONL) | Diff |
| SWE-bench Multilingual | `swebenchmultilingual` | Done | `SWE-bench/SWE-bench_Multilingual` | Diff |
| CrossCodeEval | `crosscodeeval` | Done | `Vincentvmt/CrossCodeEval` (JSONL) | Fill-in-the-Middle |
| McEval | `mceval` | Done | `Multilingual-Multimodal-NLP/McEval` | Simple |
| MultiPL-E | `multiple` | Done | `nuprl/MultiPL-E` | Multi-language |
| Defects4J | `defects4j` | Done | `rufimelo/defects4j` | Before/After |
### Dropped from Batch 7
| Benchmark | Reason |
|-----------|--------|
| RepoBench | HF repo has only a deprecated loading script (`repobench-p.py`), no actual data files |
**New adapter module:** `adapters/additional.py` -- DPAIA EE-Dataset, Multi-SWE-bench, SWE-bench Multilingual, CrossCodeEval, McEval, MultiPL-E, Defects4J.
**Sources:**
- Long Code Arena: https://huggingface.co/collections/JetBrains-Research/long-code-arena (OpenReview: aQoUjxlgNE)
- DPAIA EE-Dataset: https://github.com/dpaia/ee-dataset (Java/Spring SWE-bench-style)
- Multi-SWE-bench: ByteDance multilingual SWE-bench (7 languages, 1632 problems across 40 repos)
- SWE-bench Multilingual: Official SWE-bench multilingual extension (42 repos)
- CrossCodeEval: Cross-file code completion (4 languages, Amazon, 9928 problems)
- McEval: Massively multilingual code evaluation (40 languages)
- MultiPL-E: Multi-language HumanEval/MBPP translation (9 languages loaded)
- Defects4J: Classic Java bug-fix benchmark (467 bugs)
- Arxiv survey reference: https://arxiv.org/abs/2505.08903
## Total Datasets: 41
Base (4): REval, CRUXEval, HumanEval+, BigOBench
Batch 1 (5): MBPP+, ClassEval, LiveCodeBench, DebugBench, HumanEval-X
Batch 2 (5): SWE-bench Lite, CodeContests, APPS, CanItEdit, MBPP
Batch 3 (5): SAFIM, BigVul, DiverseVul, PrimeVul, CodeEditorBench
Batch 4 (3): SWE-bench Verified, CodeSearchNet, Devign
Batch 5 (6): BigCodeBench, HumanEvalPack, CodeXGLUE Refinement, SWE-bench, CommitBench, EffiBench
Batch 6 (6): LCA Library-Based Code Gen, LCA Project-Level Completion, LCA Bug Localization, LCA Commit Message Gen, LCA CI Builds Repair, LCA Module Summarization
Batch 7 (7): DPAIA EE-Dataset, Multi-SWE-bench, SWE-bench Multilingual, CrossCodeEval, McEval, MultiPL-E, Defects4J
## Changelog
- 2026-03-03: Initial benchmark analysis and prioritization complete
- 2026-03-03: Batch 1 complete (MBPP+, ClassEval, LiveCodeBench, DebugBench, HumanEval-X)
- 2026-03-03: Batch 2 complete (SWE-bench Lite, CodeContests, APPS, CanItEdit, MBPP)
- 2026-03-03: Batch 3 complete (SAFIM, BigVul, DiverseVul, PrimeVul, CodeEditorBench)
- 2026-03-03: Batch 4 complete (SWE-bench Verified, CodeSearchNet, Devign)
- 2026-03-03: Fixed APPS loading (refs/convert/parquet), PrimeVul (direct JSONL), CodeEditorBench (per-task JSONL)
- 2026-03-03: All 22 datasets verified loading successfully
- 2026-03-04: Refactored adapters into submodules (adapters/code_generation.py, code_editing.py, code_reasoning.py, vulnerability.py)
- 2026-03-04: Extracted CSS and JS into static/ directory (static/problem.css, static/problem.js)
- 2026-03-04: Added sampling for large datasets (cap at 1000 with seed=42)
- 2026-03-04: Enhanced FIM view (merged code with ground truth highlighting)
- 2026-03-04: Enhanced Before/After view (diff highlighting)
- 2026-03-04: Enhanced SWE-bench diff view (full file with diff chunks)
- 2026-03-04: Batch 5 complete (BigCodeBench, HumanEvalPack, CodeXGLUE Refinement, SWE-bench, CommitBench, EffiBench)
- 2026-03-04: All 28 datasets verified loading successfully
- 2026-03-04: Batch 6 complete (Long Code Arena -- 6 project-level tasks)
- 2026-03-04: Batch 7 complete (DPAIA EE-Dataset, Multi-SWE-bench, SWE-bench Multilingual, CrossCodeEval, McEval, MultiPL-E, Defects4J)
- 2026-03-04: Dropped RepoBench (HF repo has only deprecated loading script, no data files)
- 2026-03-04: Fixed Multi-SWE-bench (load per-repo JSONL files directly instead of `load_dataset`)
- 2026-03-04: Fixed CrossCodeEval (load per-language JSONL files directly, inconsistent columns across files)
- 2026-03-04: Fixed Defects4J (split="train" not "test", fields: bug_id/func_before/func_after)
- 2026-03-04: All 41 datasets verified loading successfully
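The large-dataset sampling noted in the changelog (cap at 1000, seed=42) can be sketched as follows; `sample_problems` is a hypothetical helper and the real implementation may differ:

```python
import random

SAMPLE_CAP = 1000
SAMPLE_SEED = 42

def sample_problems(problems: list, cap: int = SAMPLE_CAP, seed: int = SAMPLE_SEED) -> list:
    """Deterministically down-sample a large dataset to `cap` items,
    preserving the original ordering so problem IDs stay stable."""
    if len(problems) <= cap:
        return list(problems)
    rng = random.Random(seed)  # fixed seed -> same sample on every load
    picked = rng.sample(range(len(problems)), cap)
    return [problems[i] for i in sorted(picked)]
```

Seeding a local `random.Random` (rather than the global RNG) keeps the sample reproducible regardless of what else in the app consumes randomness.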