egor-bogomolov committed on
Commit 9a8a9c5 · 1 parent: f3f0934

Add 28 benchmark datasets with rich visualization views

Datasets (28 total):
- Code Generation: REval, HumanEval+, MBPP+, MBPP, ClassEval, LiveCodeBench,
APPS, CodeContests, BigOBench, BigCodeBench, EffiBench, CodeSearchNet
- Code Reasoning: CRUXEval, HumanEvalPack (6 langs)
- Code Editing: SWE-bench Lite/Verified/Full, DebugBench, CanItEdit,
CodeEditorBench, CodeXGLUE Refinement, CommitBench
- Code Completion: SAFIM, HumanEval-X (5 langs)
- Vulnerability Detection: BigVul, DiverseVul, PrimeVul, Devign

View types:
- Simple view (code + inputs/outputs + tests)
- Before/After view with diff highlighting (DebugBench, CanItEdit, etc.)
- GitHub-style diff view with per-file sections and repo/issue/commit links (SWE-bench, CommitBench)
- Multi-language tabs (HumanEval-X, HumanEvalPack with canonical/buggy toggle)
- Fill-in-the-Middle view with inline hole markers (SAFIM)
- Vulnerability view with CWE badges (BigVul, DiverseVul, PrimeVul, Devign)
- Multi-solution view with complexity badges (BigOBench, CodeContests, APPS)

Architecture:
- Refactored to adapters/ package (code_generation, code_editing, code_reasoning, vulnerability)
- Extracted CSS/JS to static/problem.css and static/problem.js
- Deterministic random sampling (seed=42, cap=1000) for large datasets
- Dataset dropdown shows original size for sampled datasets (e.g. '1000 of 33050')
- Compact stats bar with total count and top 5 source tags
- SWE-bench: GitHub-style per-file diff sections with repository/issue/commit links
- SAFIM: inline answer placement at TODO markers instead of end-of-file

.gitignore CHANGED
@@ -69,3 +69,6 @@ dmypy.json
 
 # Ruff
 .ruff_cache/
+
+# AIR
+.air/
CLAUDE.md CHANGED
@@ -22,10 +22,14 @@
    - Port: 7860 (default), configurable via PORT env var
    - Debug mode: controlled by FLASK_DEBUG env var
 
-2. **dataset_adapters.py** - Dataset adapter system
-   - `DatasetAdapter` base class with common interface
-   - Concrete adapters for: DREval, CRUXEval, HumanEval+, BigOBench
-   - Registry pattern (`REGISTRY` dict) for dataset management
+2. **adapters/** - Dataset adapter system (modular package)
+   - `__init__.py` - `DatasetAdapter` base class, `REGISTRY` dict, `_set_helpers()` injection
+   - `code_generation.py` - REval, HumanEval+, MBPP+, MBPP, ClassEval, LiveCodeBench, APPS, CodeContests, BigOBench, CodeSearchNet, BigCodeBench, EffiBench
+   - `code_editing.py` - SWE-bench Lite/Verified/Full, DebugBench, CanItEdit, CodeEditorBench, CodeXGLUE Refinement, CommitBench
+   - `code_reasoning.py` - CRUXEval, SAFIM, HumanEval-X, HumanEvalPack
+   - `vulnerability.py` - BigVul, DiverseVul, PrimeVul, Devign
+   - `registration.py` - `register_hf_datasets()`, sampling helpers, JSONL loading
+   - 28 concrete adapters total
    - Each adapter normalizes dataset-specific formats to common API
 
 3. **templates/** - Jinja2 HTML templates
@@ -33,9 +37,13 @@
    - `index.html` - Problem list view with filtering
    - `problem.html` - Problem detail view with syntax highlighting
 
-4. **requirements.txt** / **pyproject.toml** - Dependencies
+4. **static/** - Frontend assets
+   - `problem.css` - Problem detail page styles
+   - `problem.js` - Problem detail page JavaScript (view rendering, diff, FIM, multi-language)
+
+5. **requirements.txt** / **pyproject.toml** - Dependencies
    - Core: flask, pygments
-   - Optional HF: datasets (for CRUXEval, HumanEval+, BigOBench)
+   - Optional HF: datasets, huggingface_hub (for all 28 benchmark datasets)
    - Dev: ruff
 
 ### Data Flow
@@ -59,7 +67,7 @@ User Request → Flask Route → Dataset Adapter → API Response → Template/J
 
 ### Python Files
 - **app.py**: Main entry point, Flask routes, ground truth logic
-- **dataset_adapters.py**: Adapter implementations for all datasets
+- **adapters/**: Adapter package (see Architecture above)
 - **ground_truth_loader.py**: (parent dir) Loads execution traces for DREval
 - **dynamics.py**: (parent dir) Contains `Nil` singleton for missing values
 
@@ -75,25 +83,21 @@ User Request → Flask Route → Dataset Adapter → API Response → Template/J
 
 ## Key Functionalities
 
-### 1. Dataset Support
-
-**DREval** (primary dataset):
-- 328 problems (164 HumanEval + 164 ClassEval)
-- Ground truth execution traces available
-- Tasks: Coverage, Path, State, Output predictions
-- Test inputs with expected outputs
-
-**CRUXEval** (HuggingFace):
-- Input/Output prediction tasks
-- Single function execution reasoning
-
-**HumanEval+** (HuggingFace):
-- Extended HumanEval with additional tests
-- No execution traces
-
-**BigOBench** (HuggingFace):
-- Algorithm complexity analysis
-- Multiple solutions per problem with time/space complexity labels
+### 1. Dataset Support (28 datasets)
+
+**Code Generation**: REval (154), HumanEval+ (164), MBPP+ (378), MBPP (500), ClassEval (100), LiveCodeBench (1000), APPS (1000), CodeContests (165), BigOBench (556), BigCodeBench (1140), EffiBench (1000)
+
+**Code Reasoning**: CRUXEval (800), HumanEvalPack (6x164)
+
+**Code Editing**: SWE-bench Lite (300), SWE-bench Verified (500), SWE-bench (1000), DebugBench (1000), CanItEdit (105), CodeEditorBench (1000), CodeXGLUE Refinement (1000), CommitBench (1000)
+
+**Code Completion/Translation**: SAFIM (1000), HumanEval-X (5x164), CodeSearchNet (1000)
+
+**Vulnerability Detection**: BigVul (1000), DiverseVul (1000), PrimeVul (1000), Devign (1000)
+
+Note: Large datasets are sampled down to 1000 entries (seed=42) for fast browsing.
+
+REval is the primary dataset with ground truth execution traces. All other datasets are loaded from HuggingFace Hub.
 
 ### 2. Problem Browsing
 
@@ -320,32 +324,80 @@ When making changes, verify:
 - **datasets**: HuggingFace datasets (>=2.14.0, optional)
 - **ruff**: Linting and formatting (>=0.8.0, dev)
 
-### Data Sources
-- **DREval**: Local JSONL files in data/ directory
-- **CRUXEval**: cruxeval-org/cruxeval (HuggingFace Hub)
-- **HumanEval+**: evalplus/humanevalplus (HuggingFace Hub)
-- **BigOBench**: facebook/BigOBench (HuggingFace Hub)
-
-## Future Enhancements (Not Implemented)
-
-Potential areas for improvement:
-- User authentication and saved preferences
-- Export functionality (PDF, CSV)
-- Comparison view for multiple solutions
-- Interactive debugging/stepping through execution
-- Code editing and re-evaluation
-- Dataset upload functionality
-- Performance metrics visualization
+### Data Sources (all HuggingFace Hub)
+- **REval**: JetBrains-Research/REval
+- **CRUXEval**: cruxeval-org/cruxeval
+- **HumanEval+**: evalplus/humanevalplus
+- **BigOBench**: facebook/BigOBench
+- **MBPP+**: evalplus/mbppplus
+- **ClassEval**: FudanSELab/ClassEval
+- **LiveCodeBench**: livecodebench/code_generation_lite (via `_load_jsonl_dataset`)
+- **DebugBench**: Rtian/DebugBench
+- **HumanEval-X**: THUDM/humaneval-x (via `_load_jsonl_dataset`)
+- **SWE-bench Lite**: princeton-nlp/SWE-bench_Lite
+- **SWE-bench Verified**: princeton-nlp/SWE-bench_Verified
+- **SWE-bench**: princeton-nlp/SWE-bench
+- **CodeContests**: deepmind/code_contests
+- **APPS**: codeparrot/apps (via `refs/convert/parquet` revision)
+- **CanItEdit**: nuprl/CanItEdit
+- **MBPP**: google-research-datasets/mbpp
+- **SAFIM**: gonglinyuan/safim
+- **BigVul**: bstee615/bigvul
+- **DiverseVul**: claudios/DiverseVul
+- **PrimeVul**: starsofchance/PrimeVul (via direct JSONL loading)
+- **CodeEditorBench**: m-a-p/CodeEditorBench (via `_load_jsonl_dataset` per task type)
+- **CodeSearchNet**: code-search-net/code_search_net
+- **Devign**: google/code_x_glue_cc_defect_detection
+- **BigCodeBench**: bigcode/bigcodebench
+- **HumanEvalPack**: bigcode/humanevalpack (per-language configs)
+- **CodeXGLUE Refinement**: google/code_x_glue_cc_code_refinement
+- **CommitBench**: Maxscha/commitbench
+- **EffiBench**: DONG19/EffiBench
+
+## Benchmark Expansion
+
+### Progress Tracking
+See `PROGRESS.md` for detailed batch plan and status.
+See `benchmarks_analysis.csv` for full analysis of 35+ benchmarks.
+
+### Multi-language Syntax Highlighting
+The `highlight_code()` function in `app.py` accepts an optional `language` parameter
+(default: `"python"`). Supported languages are mapped via `LEXER_MAP` to Pygments lexers.
+Adapters pass the language when calling `_highlight_code(code, language=...)`.
+
+### View Types
+The problem detail page (`problem.html`) supports several view types, dispatched in `renderProblem()`:
+1. **BigOBench view** — multiple solutions with complexity badges
+2. **Simple view** — code + inputs/outputs + test suite (HumanEval+, MBPP+, MBPP, ClassEval, BigCodeBench, EffiBench)
+3. **CRUXEval view** — given/predict task selector
+4. **DREval view** — full interactive view with coverage, arrows, ground truth
+5. **Before/After view** — side-by-side buggy/fixed code (DebugBench, CanItEdit, CodeEditorBench, CodeXGLUE Refinement)
+6. **Multi-language view** — same problem in multiple languages (HumanEval-X, HumanEvalPack with canonical/buggy toggle)
+7. **Diff view** — patch visualization (SWE-bench Lite, SWE-bench Verified, SWE-bench, CommitBench)
+8. **Fill-in-the-Middle view** — prefix + [HOLE] + suffix (SAFIM)
+9. **Vulnerability view** — vulnerable/patched code + CWE labels (BigVul, DiverseVul, PrimeVul, Devign)
+
+### Adding New Datasets (Updated)
+1. Create adapter class in the appropriate `adapters/` submodule inheriting from `DatasetAdapter`
+2. Implement: `problem_count()`, `get_problem_summary()`, `get_problem_detail()`
+3. Set class attributes: `slug`, `display_name`, `has_ground_truth`, `has_tasks`
+4. Import adapter in `adapters/registration.py` and add registration in `register_hf_datasets()` with try/except
+5. If new language: ensure `LEXER_MAP` in `app.py` has the needed lexer
+6. If new view type: add rendering branch in `static/problem.js` `renderProblem()`
+7. Add badge color in `base.html` CSS
+8. Test: `/api/<slug>/problems` and `/api/<slug>/problem/<idx>`
 
 ## Related Documentation
 
 - **README.md**: User-facing documentation, installation instructions
+- **PROGRESS.md**: Batch integration progress and architecture decisions
+- **benchmarks_analysis.csv**: Full benchmark analysis with prioritization
 - **pyproject.toml**: Package metadata, dependencies, ruff configuration
 - **Dockerfile**: Container deployment configuration (if present)
 - **requirements.txt**: Pip-format dependency list
 
 ---
 
-**Last Updated**: 2026-03-02
-**Project Status**: Active Development
+**Last Updated**: 2026-03-04
+**Project Status**: Active Development — Benchmark Expansion Phase
 **Primary Maintainer**: Egor Bogomolov
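The multi-language highlighting described in the CLAUDE.md changes above (a `LEXER_MAP` resolved through Pygments) can be sketched roughly as follows. The map contents are an illustrative subset; the real `LEXER_MAP` in `app.py` may differ:

```python
from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import get_lexer_by_name
from pygments.util import ClassNotFound

# Illustrative subset; the real LEXER_MAP in app.py may list more aliases.
LEXER_MAP = {
    "python": "python",
    "cpp": "cpp",
    "java": "java",
    "go": "go",
    "javascript": "javascript",
}


def highlight_code(code: str, language: str = "python") -> str:
    """Return highlighted HTML, falling back to plain text on unknown languages."""
    try:
        lexer = get_lexer_by_name(LEXER_MAP.get(language, language))
    except ClassNotFound:
        lexer = get_lexer_by_name("text")  # unknown alias: no colorization
    return highlight(code, lexer, HtmlFormatter())
```

Falling back to the `text` lexer instead of raising keeps a single bad language tag from breaking a whole problem page.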
PROGRESS.md ADDED
@@ -0,0 +1,114 @@
+# Benchmark Integration Progress
+
+## Status: Batches 1-5 Complete
+
+## Batch Plan
+
+### Batch 1 (Highest Priority -- Easy HF, High Influence)
+| Benchmark | Slug | Status | HF Dataset | View Type |
+|-----------|------|--------|------------|-----------|
+| MBPP+ | `mbppplus` | Done | `evalplus/mbppplus` | Simple |
+| ClassEval | `classeval` | Done | `FudanSELab/ClassEval` | Simple |
+| LiveCodeBench | `livecodebench` | Done | `livecodebench/code_generation_lite` | Simple |
+| DebugBench | `debugbench` | Done | `Rtian/DebugBench` | Before/After |
+| HumanEval-X | `humanevalx` | Done | `THUDM/humaneval-x` | Multi-language |
+
+**Refactoring done:** Multi-language syntax highlighting via `get_lexer_by_name()`. Before/after code diff view. Multi-language tab view.
+
+### Batch 2
+| Benchmark | Slug | Status | HF Dataset | View Type |
+|-----------|------|--------|------------|-----------|
+| SWE-bench Lite | `swebenchlite` | Done | `princeton-nlp/SWE-bench_Lite` | Diff |
+| CodeContests | `codecontests` | Done | `deepmind/code_contests` | Multi-solution |
+| APPS | `apps` | Done | `codeparrot/apps` | Multi-solution / Simple |
+| CanItEdit | `canitedit` | Done | `nuprl/CanItEdit` | Before/After |
+| MBPP | `mbpp` | Done | `google-research-datasets/mbpp` | Simple |
+
+**New views:** Unified diff view for SWE-bench patches. Multi-solution view extended to show language labels for CodeContests.
+
+### Batch 3
+| Benchmark | Slug | Status | HF Dataset | View Type |
+|-----------|------|--------|------------|-----------|
+| SAFIM | `safim` | Done | `gonglinyuan/safim` | Fill-in-the-Middle |
+| BigVul | `bigvul` | Done | `bstee615/bigvul` | Vulnerability |
+| DiverseVul | `diversevul` | Done | `claudios/DiverseVul` | Vulnerability |
+| PrimeVul | `primevul` | Done | `starsofchance/PrimeVul` | Vulnerability |
+| CodeEditorBench | `codeeditorbench` | Done | `m-a-p/CodeEditorBench` | Before/After |
+
+**New views:** Fill-in-the-Middle view showing code with [HOLE] marker and ground truth. Vulnerability view with CWE badges and vulnerable/patched code comparison.
+
+### Batch 4
+| Benchmark | Slug | Status | HF Dataset | View Type |
+|-----------|------|--------|------------|-----------|
+| SWE-bench Verified | `swebenchverified` | Done | `princeton-nlp/SWE-bench_Verified` | Diff |
+| CodeSearchNet | `codesearchnet` | Done | `code-search-net/code_search_net` | Simple |
+| Devign | `devign` | Done | `google/code_x_glue_cc_defect_detection` | Vulnerability |
+
+### Dropped from original plan
+| Benchmark | Reason |
+|-----------|--------|
+| DS-1000 | Complex library-specific format, limited visualization value |
+| RepoBench | Repo-level context too complex for per-problem viewing |
+| MultiPL-E | 22 languages but same problems as HumanEval/MBPP already covered |
+| McEval | Very large (40 languages), complex format |
+| xCodeEval | Very large (25M rows), 7 tasks, too complex |
+| CrossVul | Similar to DiverseVul/BigVul, diminishing returns |
+
+### Batch 5
+| Benchmark | Slug | Status | HF Dataset | View Type |
+|-----------|------|--------|------------|-----------|
+| BigCodeBench | `bigcodebench` | Done | `bigcode/bigcodebench` | Simple |
+| HumanEvalPack | `humanevalpack` | Done | `bigcode/humanevalpack` | Multi-language + Before/After |
+| CodeXGLUE Refinement | `codexgluerefinement` | Done | `google/code_x_glue_cc_code_refinement` | Before/After |
+| SWE-bench | `swebenchfull` | Done | `princeton-nlp/SWE-bench` | Diff |
+| CommitBench | `commitbench` | Done | `Maxscha/commitbench` | Diff |
+| EffiBench | `effibench` | Done | `DONG19/EffiBench` | Simple |
+
+**New views:** Multi-language view with canonical/buggy code toggle (HumanEvalPack). CommitBench reuses diff view. CodeXGLUE Refinement uses before/after Java view.
+
+### Deferred (GitHub-only or complex infrastructure)
+CoderEval, NaturalCodeBench, DevEval, RunBugRun, Defects4J, ConDefects, FixEval, TransCoder, AVATAR, TypeEvalPy, VJBench, SVEN, PyTER
+
+## Architecture Decisions
+
+### Multi-language Support
+- `highlight_code()` in `app.py` accepts `language` parameter (default: `"python"`)
+- Uses `get_lexer_by_name()` from Pygments for automatic lexer selection
+- Adapters pass language when calling `_highlight_code(code, language=...)`
+
+### View Types Implemented
+1. **BigOBench view** -- multiple solutions with complexity badges
+2. **Simple view** -- code + inputs/outputs + test suite (HumanEval+, MBPP+, MBPP, ClassEval, LiveCodeBench, APPS, CodeSearchNet)
+3. **CRUXEval view** -- given/predict task selector
+4. **DREval view** -- full interactive view with coverage, arrows, ground truth
+5. **Before/After view** -- side-by-side buggy/fixed code (DebugBench, CanItEdit, CodeEditorBench)
+6. **Multi-language view** -- same problem in multiple languages (HumanEval-X, HumanEvalPack)
+7. **Diff view** -- unified diff patch visualization (SWE-bench Lite, SWE-bench Verified, SWE-bench, CommitBench)
+8. **Fill-in-the-Middle view** -- prefix + [HOLE] + suffix (SAFIM)
+9. **Vulnerability view** -- vulnerable/patched code + CWE labels (BigVul, DiverseVul, PrimeVul, Devign)
+
+## Total Datasets: 28
+Base (4): REval, CRUXEval, HumanEval+, BigOBench
+Batch 1 (5): MBPP+, ClassEval, LiveCodeBench, DebugBench, HumanEval-X
+Batch 2 (5): SWE-bench Lite, CodeContests, APPS, CanItEdit, MBPP
+Batch 3 (5): SAFIM, BigVul, DiverseVul, PrimeVul, CodeEditorBench
+Batch 4 (3): SWE-bench Verified, CodeSearchNet, Devign
+Batch 5 (6): BigCodeBench, HumanEvalPack, CodeXGLUE Refinement, SWE-bench, CommitBench, EffiBench
+
+## Changelog
+
+- 2026-03-03: Initial benchmark analysis and prioritization complete
+- 2026-03-03: Batch 1 complete (MBPP+, ClassEval, LiveCodeBench, DebugBench, HumanEval-X)
+- 2026-03-03: Batch 2 complete (SWE-bench Lite, CodeContests, APPS, CanItEdit, MBPP)
+- 2026-03-03: Batch 3 complete (SAFIM, BigVul, DiverseVul, PrimeVul, CodeEditorBench)
+- 2026-03-03: Batch 4 complete (SWE-bench Verified, CodeSearchNet, Devign)
+- 2026-03-03: Fixed APPS loading (refs/convert/parquet), PrimeVul (direct JSONL), CodeEditorBench (per-task JSONL)
+- 2026-03-03: All 22 datasets verified loading successfully
+- 2026-03-04: Refactored adapters into submodules (adapters/code_generation.py, code_editing.py, code_reasoning.py, vulnerability.py)
+- 2026-03-04: Extracted CSS and JS into static/ directory (static/problem.css, static/problem.js)
+- 2026-03-04: Added sampling for large datasets (cap at 1000 with seed=42)
+- 2026-03-04: Enhanced FIM view (merged code with ground truth highlighting)
+- 2026-03-04: Enhanced Before/After view (diff highlighting)
+- 2026-03-04: Enhanced SWE-bench diff view (full file with diff chunks)
+- 2026-03-04: Batch 5 complete (BigCodeBench, HumanEvalPack, CodeXGLUE Refinement, SWE-bench, CommitBench, EffiBench)
+- 2026-03-04: All 28 datasets verified loading successfully
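The enhanced FIM view in the changelog above (merged code with ground-truth highlighting) amounts to splicing the answer into the hole. A minimal sketch, with `merge_fim` as an illustrative name rather than the actual helper in `static/problem.js` or the SAFIM adapter:

```python
def merge_fim(prefix: str, suffix: str, ground_truth: str,
              marker: str = "[HOLE]") -> tuple[str, str]:
    """Return (code_with_hole, merged_code) for a fill-in-the-middle task.

    code_with_hole shows the task as the model sees it; merged_code
    splices the ground-truth answer in so the viewer can highlight it.
    """
    code_with_hole = prefix + marker + suffix
    merged_code = prefix + ground_truth + suffix
    return code_with_hole, merged_code
```

The viewer can then highlight the `ground_truth` span inside `merged_code`, since its offset is simply `len(prefix)`.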
README.md CHANGED
@@ -11,14 +11,55 @@ pinned: false
 
 A web-based interface for browsing and manually inspecting individual datapoints from popular ML4SE (Machine Learning for Software Engineering) benchmark datasets.
 
-## Supported Datasets
-
-| Dataset | Description |
-|---------|-------------|
-| **REval** | Dynamic reasoning evaluation with execution traces and ground truth variable states |
-| **CRUXEval** | Input/output prediction tasks for single-function execution reasoning |
-| **HumanEval+** | Extended HumanEval with additional tests |
-| **BigOBench** | Algorithm complexity analysis with time/space complexity labels |
+## Supported Datasets (28)
+
+### Code Generation
+| Dataset | Source | View Type |
+|---------|--------|-----------|
+| **HumanEval+** | evalplus/humanevalplus | Simple |
+| **MBPP+** | evalplus/mbppplus | Simple |
+| **MBPP** | google-research-datasets/mbpp | Simple |
+| **ClassEval** | FudanSELab/ClassEval | Simple |
+| **LiveCodeBench** | livecodebench/code_generation_lite | Simple |
+| **APPS** | codeparrot/apps | Multi-solution |
+| **CodeContests** | deepmind/code_contests | Multi-solution |
+| **BigOBench** | facebook/BigOBench | Complexity badges |
+| **BigCodeBench** | bigcode/bigcodebench | Simple |
+| **EffiBench** | DONG19/EffiBench | Simple |
+
+### Code Reasoning & Evaluation
+| Dataset | Source | View Type |
+|---------|--------|-----------|
+| **REval** | JetBrains-Research/REval | Interactive (coverage, arrows, ground truth) |
+| **CRUXEval** | cruxeval-org/cruxeval | Given/Predict task selector |
+| **HumanEvalPack** | bigcode/humanevalpack | Multi-language + buggy/canonical |
+
+### Code Editing & Debugging
+| Dataset | Source | View Type |
+|---------|--------|-----------|
+| **SWE-bench Lite** | princeton-nlp/SWE-bench_Lite | Unified diff |
+| **SWE-bench Verified** | princeton-nlp/SWE-bench_Verified | Unified diff |
+| **SWE-bench** | princeton-nlp/SWE-bench | Unified diff |
+| **DebugBench** | Rtian/DebugBench | Before/After |
+| **CanItEdit** | nuprl/CanItEdit | Before/After |
+| **CodeEditorBench** | m-a-p/CodeEditorBench | Before/After |
+| **CodeXGLUE Refinement** | google/code_x_glue_cc_code_refinement | Before/After |
+| **CommitBench** | Maxscha/commitbench | Unified diff |
+
+### Code Completion & Translation
+| Dataset | Source | View Type |
+|---------|--------|-----------|
+| **SAFIM** | gonglinyuan/safim | Fill-in-the-Middle |
+| **HumanEval-X** | THUDM/humaneval-x | Multi-language tabs |
+| **CodeSearchNet** | code-search-net/code_search_net | Simple |
+
+### Vulnerability Detection
+| Dataset | Source | View Type |
+|---------|--------|-----------|
+| **BigVul** | bstee615/bigvul | Vulnerability (CWE badges) |
+| **DiverseVul** | claudios/DiverseVul | Vulnerability |
+| **PrimeVul** | starsofchance/PrimeVul | Vulnerability |
+| **Devign** | google/code_x_glue_cc_defect_detection | Vulnerability |
 
 ## Installation & Usage
 
@@ -45,7 +86,7 @@ uv run ruff format .
 
 ### Adding a New Dataset
 
-1. Create an adapter class in `dataset_adapters.py` inheriting from `DatasetAdapter`
+1. Create an adapter class in the appropriate `adapters/` submodule inheriting from `DatasetAdapter`
 2. Implement required methods: `problem_count()`, `get_problem_summary()`, `get_problem_detail()`
-3. Register the adapter in the `REGISTRY`
+3. Register the adapter in `adapters/registration.py`
 4. Test: `/api/<slug>/problems` and `/api/<slug>/problem/<idx>`
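Registration in `adapters/registration.py` wraps each dataset load in try/except (per the CLAUDE.md notes in this commit), so one failing download doesn't take down the whole viewer. A hedged sketch of that pattern; `_register`, `_DemoAdapter`, and `_broken_loader` are illustrative names, and the real loaders would call `datasets.load_dataset(...)`:

```python
REGISTRY: dict = {}  # stand-in for adapters.REGISTRY


def _register(slug, loader):
    """Try to build one adapter; skip (with a note) if loading fails."""
    try:
        REGISTRY[slug] = loader()
    except Exception as exc:  # e.g. dataset offline or schema changed
        print(f"skipping {slug}: {exc}")


class _DemoAdapter:
    """Illustrative adapter exposing the minimal counting API."""

    def __init__(self, rows):
        self._rows = rows

    def problem_count(self):
        return len(self._rows)


def _broken_loader():
    raise RuntimeError("no network")  # simulates a failed HF download


def register_hf_datasets():
    _register("demo", lambda: _DemoAdapter([1, 2, 3]))
    _register("broken", _broken_loader)  # logged and skipped, not fatal
```

After `register_hf_datasets()` runs, only the adapters that loaded successfully appear in `REGISTRY`, which is what the dataset dropdown is built from.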
adapters/__init__.py ADDED
@@ -0,0 +1,82 @@
+"""
+Dataset adapters for the ML4SE Benchmark Viewer.
+
+Each adapter normalises a different benchmark dataset into a common API shape
+so the Flask routes and templates can handle them uniformly.
+
+The REGISTRY dict maps slug strings (used in URLs) to adapter instances.
+"""
+
+from __future__ import annotations
+
+from typing import Any
+
+# ---------------------------------------------------------------------------
+# Helper function stubs – injected at runtime by app.py via _set_helpers()
+# ---------------------------------------------------------------------------
+
+_highlight_code = None
+_code_offset = None
+_extract_test_classes = None
+
+
+def _set_helpers(highlight_code_fn, code_offset_fn, extract_test_classes_fn):
+    """Called once by app.py to inject helper functions."""
+    global _highlight_code, _code_offset, _extract_test_classes
+    _highlight_code = highlight_code_fn
+    _code_offset = code_offset_fn
+    _extract_test_classes = extract_test_classes_fn
+
+    # Propagate to submodules so adapters can use them
+    from adapters import code_editing, code_generation, code_reasoning, vulnerability
+
+    for mod in (code_generation, code_editing, code_reasoning, vulnerability):
+        mod._highlight_code = highlight_code_fn
+        mod._code_offset = code_offset_fn
+        mod._extract_test_classes = extract_test_classes_fn
+
+
+# ---------------------------------------------------------------------------
+# Registry
+# ---------------------------------------------------------------------------
+
+REGISTRY: dict[str, DatasetAdapter] = {}
+
+
+# ---------------------------------------------------------------------------
+# Base class
+# ---------------------------------------------------------------------------
+
+
+class DatasetAdapter:
+    slug: str = ""
+    display_name: str = ""
+    has_ground_truth: bool = False
+    has_tasks: bool = False
+    total_count: int | None = None  # original size before sampling (None = not sampled)
+
+    def problem_count(self) -> int:
+        raise NotImplementedError
+
+    def get_problem_summary(self, idx: int) -> dict[str, Any]:
+        raise NotImplementedError
+
+    def get_problem_detail(self, idx: int) -> dict[str, Any]:
+        raise NotImplementedError
+
+    def get_ground_truth(self, idx: int, input_idx: int) -> dict[str, Any]:
+        return {"status": "unavailable", "message": "Ground truth not available for this dataset"}
+
+
+# ---------------------------------------------------------------------------
+# Re-export registration entry point
+# ---------------------------------------------------------------------------
+
+from adapters.registration import register_hf_datasets  # noqa: E402, F401
+
+__all__ = [
+    "REGISTRY",
+    "DatasetAdapter",
+    "_set_helpers",
+    "register_hf_datasets",
+]
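A minimal concrete adapter against the `DatasetAdapter` base class above might look like this. `ListAdapter` is a hypothetical in-memory example for illustration (a trimmed stand-in base class is redefined here so the sketch is self-contained); real adapters wrap a HuggingFace dataset instead of a list:

```python
from typing import Any


class DatasetAdapter:  # trimmed stand-in for adapters.DatasetAdapter above
    slug = ""
    display_name = ""
    has_ground_truth = False
    has_tasks = False


class ListAdapter(DatasetAdapter):
    """Illustrative adapter over an in-memory list of problem dicts."""

    slug = "demo"
    display_name = "Demo"

    def __init__(self, problems: list[dict]):
        self._problems = problems

    def problem_count(self) -> int:
        return len(self._problems)

    def get_problem_summary(self, idx: int) -> dict[str, Any]:
        row = self._problems[idx]
        return {"idx": idx, "task_id": row["id"], "num_inputs": 0, "source": "demo"}

    def get_problem_detail(self, idx: int) -> dict[str, Any]:
        row = self._problems[idx]
        return {"idx": idx, "task_id": row["id"], "code": row["code"]}
```

The three methods mirror the two API endpoints: summaries feed `/api/<slug>/problems`, details feed `/api/<slug>/problem/<idx>`.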
adapters/code_editing.py ADDED
@@ -0,0 +1,403 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Code editing benchmark adapters (SWE-bench, DebugBench, CanItEdit, CodeEditorBench)."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import json
6
+ from typing import Any
7
+
8
+ from adapters import DatasetAdapter
9
+
10
+ # Injected at runtime by _set_helpers()
11
+ _highlight_code = None
12
+ _code_offset = None
13
+ _extract_test_classes = None
14
+
15
+
16
+ # ---------------------------------------------------------------------------
17
+ # SWE-bench Lite adapter (HuggingFace: princeton-nlp/SWE-bench_Lite)
18
+ # ---------------------------------------------------------------------------
19
+
20
+
21
+ class SWEBenchLiteAdapter(DatasetAdapter):
22
+ slug = "swebenchlite"
23
+ display_name = "SWE-bench Lite"
24
+ has_ground_truth = False
25
+ has_tasks = False
26
+
27
+ def __init__(self, hf_dataset):
28
+ self._ds = hf_dataset
29
+
30
+ def problem_count(self) -> int:
31
+ return len(self._ds)
32
+
33
+ def get_problem_summary(self, idx: int) -> dict[str, Any]:
34
+ row = self._ds[idx]
35
+ return {
36
+ "idx": idx,
37
+ "task_id": row["instance_id"],
38
+ "entry_point": row["instance_id"].split("__")[-1],
39
+ "num_inputs": 0,
40
+ "source": row["repo"],
41
+ }
42
+
43
+ @staticmethod
44
+ def _github_links(instance_id: str, repo: str, base_commit: str) -> dict[str, str]:
45
+ """Build GitHub URLs from SWE-bench instance metadata."""
46
+ links: dict[str, str] = {}
47
+ if repo:
48
+ links["repo_url"] = f"https://github.com/{repo}"
49
+ # instance_id format: "repo__issue-number" e.g. "astropy__astropy-12907"
50
+ parts = instance_id.rsplit("-", 1)
51
+ if len(parts) == 2 and parts[1].isdigit() and repo:
52
+ links["issue_url"] = f"https://github.com/{repo}/issues/{parts[1]}"
53
+ if base_commit and repo:
54
+ links["commit_url"] = f"https://github.com/{repo}/commit/{base_commit}"
55
+ return links
56
+
+     def get_problem_detail(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         patch = row["patch"]
+         fail_to_pass = json.loads(row["FAIL_TO_PASS"]) if row["FAIL_TO_PASS"] else []
+         pass_to_pass = json.loads(row["PASS_TO_PASS"]) if row["PASS_TO_PASS"] else []
+         instance_id = row["instance_id"]
+         repo = row["repo"]
+         base_commit = row.get("base_commit", "")
+         return {
+             "idx": idx,
+             "task_id": instance_id,
+             "entry_point": instance_id.split("__")[-1],
+             "code": patch,
+             "highlighted_code": "",
+             "inputs": [],
+             "outputs": [],
+             "test": None,
+             "tasks": [],
+             "source": repo,
+             "has_ground_truth": False,
+             "has_tasks": False,
+             "description": row["problem_statement"],
+             "patch": patch,
+             "test_patch": row.get("test_patch", ""),
+             "fail_to_pass": fail_to_pass,
+             "pass_to_pass": pass_to_pass,
+             "hints": row.get("hints_text", ""),
+             "repo": repo,
+             "base_commit": base_commit,
+             "version": row.get("version", ""),
+             "created_at": row.get("created_at", ""),
+             **self._github_links(instance_id, repo, base_commit),
+         }
+
+
+ # ---------------------------------------------------------------------------
+ # SWE-bench Verified adapter (HuggingFace: princeton-nlp/SWE-bench_Verified)
+ # ---------------------------------------------------------------------------
+
+
+ class SWEBenchVerifiedAdapter(SWEBenchLiteAdapter):
+     slug = "swebenchverified"
+     display_name = "SWE-bench Verified"
+
+
+ class SWEBenchFullAdapter(SWEBenchLiteAdapter):
+     slug = "swebenchfull"
+     display_name = "SWE-bench"
+
+
+ # ---------------------------------------------------------------------------
+ # DebugBench adapter (HuggingFace: Rtian/DebugBench)
+ # ---------------------------------------------------------------------------
+
+
+ class DebugBenchAdapter(DatasetAdapter):
+     slug = "debugbench"
+     display_name = "DebugBench"
+     has_ground_truth = False
+     has_tasks = False
+
+     def __init__(self, hf_dataset):
+         self._ds = hf_dataset
+
+     def problem_count(self) -> int:
+         return len(self._ds)
+
+     def get_problem_summary(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         return {
+             "idx": idx,
+             "task_id": row["slug"],
+             "entry_point": row["slug"],
+             "num_inputs": len(row["examples"]),
+             "source": f"{row['language']}/{row['category']}",
+         }
+
+     def get_problem_detail(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         lang = row["language"]
+         buggy = row["buggy_code"]
+         fixed = row["solution"]
+         return {
+             "idx": idx,
+             "task_id": row["slug"],
+             "entry_point": row["slug"],
+             "code": fixed,
+             "highlighted_code": _highlight_code(fixed, language=lang),
+             "inputs": [],
+             "outputs": [],
+             "test": None,
+             "tasks": [],
+             "source": f"{lang}/{row['category']}",
+             "has_ground_truth": False,
+             "has_tasks": False,
+             "description": row["question"],
+             "language": lang,
+             "buggy_code": buggy,
+             "buggy_highlighted_code": _highlight_code(buggy, language=lang),
+             "fixed_code": fixed,
+             "fixed_highlighted_code": _highlight_code(fixed, language=lang),
+             "bug_category": row["category"],
+             "bug_subtype": row["subtype"],
+             "bug_explanation": row["bug_explanation"],
+             "difficulty": row["level"],
+             "examples": list(row["examples"]),
+         }
+
+
+ # ---------------------------------------------------------------------------
+ # CanItEdit adapter (HuggingFace: nuprl/CanItEdit)
+ # ---------------------------------------------------------------------------
+
+
+ class CanItEditAdapter(DatasetAdapter):
+     slug = "canitedit"
+     display_name = "CanItEdit"
+     has_ground_truth = False
+     has_tasks = False
+
+     def __init__(self, hf_dataset):
+         self._ds = hf_dataset
+
+     def problem_count(self) -> int:
+         return len(self._ds)
+
+     def get_problem_summary(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         taxonomy = row.get("taxonomy", {})
+         change_kind = taxonomy.get("change_kind", "") if isinstance(taxonomy, dict) else ""
+         return {
+             "idx": idx,
+             "task_id": row.get("full_name", str(row.get("id", idx))),
+             "entry_point": row.get("name", f"edit_{idx}"),
+             "num_inputs": 0,
+             "source": change_kind or "CanItEdit",
+         }
+
+     def get_problem_detail(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         before = row["before"]
+         after = row["after"]
+         taxonomy = row.get("taxonomy", {})
+         if not isinstance(taxonomy, dict):
+             taxonomy = {}
+         return {
+             "idx": idx,
+             "task_id": row.get("full_name", str(row.get("id", idx))),
+             "entry_point": row.get("name", f"edit_{idx}"),
+             "code": after,
+             "highlighted_code": _highlight_code(after),
+             "inputs": [],
+             "outputs": [],
+             "test": row.get("tests", ""),
+             "tasks": [],
+             "source": taxonomy.get("change_kind", "CanItEdit"),
+             "has_ground_truth": False,
+             "has_tasks": False,
+             "description": row.get("instruction_descriptive", ""),
+             "buggy_code": before,
+             "buggy_highlighted_code": _highlight_code(before),
+             "fixed_code": after,
+             "fixed_highlighted_code": _highlight_code(after),
+             "bug_category": taxonomy.get("change_kind", ""),
+             "bug_subtype": taxonomy.get("topic", ""),
+             "bug_explanation": row.get("instruction_lazy", ""),
+         }
+
+
+ # ---------------------------------------------------------------------------
+ # CodeEditorBench adapter (HuggingFace: m-a-p/CodeEditorBench)
+ # ---------------------------------------------------------------------------
+
+
+ class CodeEditorBenchAdapter(DatasetAdapter):
+     slug = "codeeditorbench"
+     display_name = "CodeEditorBench"
+     has_ground_truth = False
+     has_tasks = False
+
+     def __init__(self, rows: list[dict[str, Any]]):
+         self._rows = rows
+
+     def problem_count(self) -> int:
+         return len(self._rows)
+
+     def get_problem_summary(self, idx: int) -> dict[str, Any]:
+         row = self._rows[idx]
+         return {
+             "idx": idx,
+             "task_id": str(row.get("idx", idx)),
+             "entry_point": row.get("title", f"problem_{idx}"),
+             "num_inputs": 0,
+             "source": row.get("_task_type", "unknown"),
+         }
+
+     def get_problem_detail(self, idx: int) -> dict[str, Any]:
+         row = self._rows[idx]
+         task_type = row.get("_task_type", "unknown")
+         lang = row.get("code_language", row.get("source_lang", "python")) or "python"
+         lang_key = lang.lower()
+
+         if task_type == "code_debug":
+             buggy = row.get("incorrect_solutions", "")
+             fixed = row.get("solutions", "")
+         elif task_type == "code_translate":
+             buggy = row.get("source_code", "")
+             fixed = row.get("solutions", row.get("source_code", ""))
+         elif task_type == "code_polishment":
+             buggy = row.get("source_code", "")
+             fixed = row.get("solutions", row.get("source_code", ""))
+         else:  # code_switch
+             buggy = row.get("similar_source_code", row.get("source_code", ""))
+             fixed = row.get("solutions", row.get("source_code", ""))
+
+         return {
+             "idx": idx,
+             "task_id": str(row.get("idx", idx)),
+             "entry_point": row.get("title", f"problem_{idx}"),
+             "code": fixed,
+             "highlighted_code": _highlight_code(fixed, language=lang_key) if fixed else "",
+             "inputs": [],
+             "outputs": [],
+             "test": None,
+             "tasks": [],
+             "source": task_type,
+             "has_ground_truth": False,
+             "has_tasks": False,
+             "description": "",
+             "buggy_code": buggy,
+             "buggy_highlighted_code": _highlight_code(buggy, language=lang_key) if buggy else "",
+             "fixed_code": fixed,
+             "fixed_highlighted_code": _highlight_code(fixed, language=lang_key) if fixed else "",
+             "bug_category": task_type,
+             "bug_subtype": row.get("difficulty", ""),
+             "bug_explanation": "",
+             "difficulty": row.get("difficulty", ""),
+             "language": lang,
+         }
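The four-way dispatch above can be condensed into a small standalone sketch; field names follow the adapter, and the row dicts below are toy inputs, not real CodeEditorBench records:

```python
def select_pair(row: dict) -> tuple[str, str]:
    """Pick the (before, after) code pair for a CodeEditorBench task type."""
    task_type = row.get("_task_type", "unknown")
    if task_type == "code_debug":
        return row.get("incorrect_solutions", ""), row.get("solutions", "")
    if task_type in ("code_translate", "code_polishment"):
        return row.get("source_code", ""), row.get("solutions", row.get("source_code", ""))
    # code_switch (and anything unrecognised) starts from the similar snippet
    return (row.get("similar_source_code", row.get("source_code", "")),
            row.get("solutions", row.get("source_code", "")))

pair = select_pair({"_task_type": "code_debug",
                    "incorrect_solutions": "x = 1", "solutions": "x = 2"})
print(pair)  # ('x = 1', 'x = 2')
```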
+
+
+ # ---------------------------------------------------------------------------
+ # CodeXGLUE Code Refinement adapter (HuggingFace: google/code_x_glue_cc_code_refinement)
+ # ---------------------------------------------------------------------------
+
+
+ class CodeXGLUERefinementAdapter(DatasetAdapter):
+     slug = "codexgluerefinement"
+     display_name = "CodeXGLUE Code Refinement"
+     has_ground_truth = False
+     has_tasks = False
+
+     def __init__(self, hf_dataset):
+         self._ds = hf_dataset
+
+     def problem_count(self) -> int:
+         return len(self._ds)
+
+     def get_problem_summary(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         return {
+             "idx": idx,
+             "task_id": str(row.get("id", idx)),
+             "entry_point": f"refinement_{row.get('id', idx)}",
+             "num_inputs": 0,
+             "source": "CodeXGLUE",
+         }
+
+     def get_problem_detail(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         buggy = row.get("buggy", "")
+         fixed = row.get("fixed", "")
+         return {
+             "idx": idx,
+             "task_id": str(row.get("id", idx)),
+             "entry_point": f"refinement_{row.get('id', idx)}",
+             "code": fixed,
+             "highlighted_code": _highlight_code(fixed, language="java") if fixed else "",
+             "inputs": [],
+             "outputs": [],
+             "test": None,
+             "tasks": [],
+             "source": "CodeXGLUE",
+             "has_ground_truth": False,
+             "has_tasks": False,
+             "description": "",
+             "buggy_code": buggy,
+             "buggy_highlighted_code": _highlight_code(buggy, language="java") if buggy else "",
+             "fixed_code": fixed,
+             "fixed_highlighted_code": _highlight_code(fixed, language="java") if fixed else "",
+             "bug_category": "Code Refinement",
+             "bug_subtype": "",
+             "bug_explanation": "",
+             "language": "Java",
+         }
+
+
+ # ---------------------------------------------------------------------------
+ # CommitBench adapter (HuggingFace: Maxscha/commitbench)
+ # ---------------------------------------------------------------------------
+
+
+ class CommitBenchAdapter(DatasetAdapter):
+     slug = "commitbench"
+     display_name = "CommitBench"
+     has_ground_truth = False
+     has_tasks = False
+
+     def __init__(self, hf_dataset):
+         self._ds = hf_dataset
+
+     def problem_count(self) -> int:
+         return len(self._ds)
+
+     def get_problem_summary(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         return {
+             "idx": idx,
+             "task_id": row.get("hash", str(idx))[:12],
+             "entry_point": row.get("project", f"commit_{idx}"),
+             "num_inputs": 0,
+             "source": row.get("diff_languages", "unknown"),
+         }
+
+     def get_problem_detail(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         diff = row.get("diff", "")
+         message = row.get("message", "")
+         return {
+             "idx": idx,
+             "task_id": row.get("hash", str(idx))[:12],
+             "entry_point": row.get("project", f"commit_{idx}"),
+             "code": diff,
+             "highlighted_code": "",
+             "inputs": [],
+             "outputs": [],
+             "test": None,
+             "tasks": [],
+             "source": row.get("diff_languages", "unknown"),
+             "has_ground_truth": False,
+             "has_tasks": False,
+             "description": message,
+             "patch": diff,
+             "repo": row.get("project", ""),
+             "commit_hash": row.get("hash", ""),
+             "diff_languages": row.get("diff_languages", ""),
+         }
dataset_adapters.py → adapters/code_generation.py RENAMED
@@ -1,65 +1,24 @@
- """
- Dataset adapters for the ML4SE Benchmark Viewer.
-
- Each adapter normalises a different benchmark dataset into a common API shape
- so the Flask routes and templates can handle them uniformly.
-
- The REGISTRY dict maps slug strings (used in URLs) to adapter instances.
- """

  from __future__ import annotations

  import json
  from typing import Any

- # These are imported from app.py at registration time to avoid circular imports.
  _highlight_code = None
  _code_offset = None
  _extract_test_classes = None


- def _set_helpers(highlight_code_fn, code_offset_fn, extract_test_classes_fn):
-     """Called once by app.py to inject helper functions."""
-     global _highlight_code, _code_offset, _extract_test_classes
-     _highlight_code = highlight_code_fn
-     _code_offset = code_offset_fn
-     _extract_test_classes = extract_test_classes_fn
-
-
- # ---------------------------------------------------------------------------
- # Registry
- # ---------------------------------------------------------------------------
-
- REGISTRY: dict[str, "DatasetAdapter"] = {}
-
-
- # ---------------------------------------------------------------------------
- # Base class
- # ---------------------------------------------------------------------------
-
- class DatasetAdapter:
-     slug: str = ""
-     display_name: str = ""
-     has_ground_truth: bool = False
-     has_tasks: bool = False
-
-     def problem_count(self) -> int:
-         raise NotImplementedError
-
-     def get_problem_summary(self, idx: int) -> dict[str, Any]:
-         raise NotImplementedError
-
-     def get_problem_detail(self, idx: int) -> dict[str, Any]:
-         raise NotImplementedError
-
-     def get_ground_truth(self, idx: int, input_idx: int) -> dict[str, Any]:
-         return {"status": "unavailable", "message": "Ground truth not available for this dataset"}
-
-
  # ---------------------------------------------------------------------------
  # REval adapter (HuggingFace: JetBrains-Research/REval)
  # ---------------------------------------------------------------------------

  def _format_typed_value(val: dict) -> str:
      """Convert a {__type__, __value__} dict from REval states into a Python repr string."""
      t = val.get("__type__")
@@ -85,11 +44,9 @@ class REvalAdapter(DatasetAdapter):

      def __init__(self, problems_ds, tasks_ds, executions_ds, states_ds):
          self._problems = problems_ds
-         # Build task lookup: task_id → parsed tasks JSON
          self._tasks: dict[str, list] = {}
          for row in tasks_ds:
              self._tasks[row["task_id"]] = json.loads(row["tasks"])
-         # Build execution lookup: (task_id, input_idx) → row
          self._executions: dict[tuple[str, int], dict] = {}
          for row in executions_ds:
              self._executions[(row["task_id"], row["input_idx"])] = {
@@ -97,7 +54,6 @@ class REvalAdapter(DatasetAdapter):
                  "trace": row["trace"],
                  "coverage": row["coverage"],
              }
-         # Build states lookup: (task_id, input_idx) → parsed states JSON
          self._states: dict[tuple[str, int], list] = {}
          for row in states_ds:
              self._states[(row["task_id"], row["input_idx"])] = json.loads(row["states"])
@@ -154,7 +110,7 @@ class REvalAdapter(DatasetAdapter):
          for item in adjusted_items:
              if "lineno" in item:
                  task_lines.add(item["lineno"])
-         task_info["task_lines"] = sorted(list(task_lines))

          tasks_info.append(task_info)

@@ -195,11 +151,9 @@ class REvalAdapter(DatasetAdapter):
          code = problem["code"]
          offset = _code_offset(code)

-         # Coverage: convert 0-indexed (original) → 1-indexed (stripped display)
          coverage_1indexed = [ln + 1 - offset for ln in exec_rec["coverage"]]
          total_lines = len(code[offset:].splitlines())

-         # Get task items for this input_idx
          task_list = self._tasks.get(task_id, [])
          task_items = []
          for t in task_list:
@@ -207,15 +161,12 @@ class REvalAdapter(DatasetAdapter):
              task_items = t.get("task", [])
              break

-         # Get states for this (task_id, input_idx)
          states_list = self._states.get((task_id, input_idx), [])

-         # Resolve variable answers for each task item
          variable_answers = []
          for item in task_items:
-             lineno = item["lineno"]  # 1-indexed relative to original code
              var = item["var"]
-             # Collect all values of this variable at this line across the trace
              values = []
              for s in states_list:
                  if s["lineno"] == lineno and var in s.get("locals", {}):
@@ -226,7 +177,6 @@ class REvalAdapter(DatasetAdapter):
              elif len(values) == 1:
                  answer_str = _format_typed_value(values[0])
              else:
-                 # Deduplicate by formatted string to avoid showing identical values
                  seen = []
                  for v in values:
                      fmt = _format_typed_value(v)
@@ -234,13 +184,14 @@ class REvalAdapter(DatasetAdapter):
                          seen.append(fmt)
                  answer_str = "[" + ", ".join(seen) + "]" if len(seen) > 1 else seen[0]

-             variable_answers.append({
-                 "lineno": lineno - offset,
-                 "var": var,
-                 "answer_str": answer_str,
-             })

-         # Resolve next lines from trace for arrow visualization
          trace = exec_rec["trace"]
          next_lines_answers = []
          processed_linenos: set[int] = set()
@@ -253,10 +204,12 @@ class REvalAdapter(DatasetAdapter):
              for i, ln in enumerate(trace):
                  if ln == lineno and i + 1 < len(trace):
                      nexts.add(trace[i + 1])
-             next_lines_answers.append({
-                 "lineno": lineno,
-                 "next_lines": sorted(nexts) if nexts else [-1],
-             })

          return {
              "status": "ok",
@@ -267,72 +220,11 @@ class REvalAdapter(DatasetAdapter):
          }

- # ---------------------------------------------------------------------------
- # CRUXEval adapter (HuggingFace: cruxeval-org/cruxeval)
- # ---------------------------------------------------------------------------
-
- class CRUXEvalAdapter(DatasetAdapter):
-     slug = "cruxeval"
-     display_name = "CRUXEval"
-     has_ground_truth = False
-     has_tasks = True
-
-     def __init__(self, hf_dataset):
-         self._ds = hf_dataset
-
-     def problem_count(self) -> int:
-         return len(self._ds)
-
-     def get_problem_summary(self, idx: int) -> dict[str, Any]:
-         row = self._ds[idx]
-         return {
-             "idx": idx,
-             "task_id": row["id"],
-             "entry_point": "f",
-             "num_inputs": 1,
-             "source": "CRUXEval",
-         }
-
-     def get_problem_detail(self, idx: int) -> dict[str, Any]:
-         row = self._ds[idx]
-         code = row["code"]
-         return {
-             "idx": idx,
-             "task_id": row["id"],
-             "entry_point": "f",
-             "code": code,
-             "highlighted_code": _highlight_code(code),
-             "inputs": [row["input"]],
-             "outputs": [row["output"]],
-             "test": None,
-             "tasks": [
-                 {
-                     "name": "Output Prediction",
-                     "description": "Given the code and input, predict the output.",
-                     "given": "input",
-                     "predict": "output",
-                     "input": row["input"],
-                     "output": row["output"],
-                 },
-                 {
-                     "name": "Input Prediction",
-                     "description": "Given the code and output, predict the input.",
-                     "given": "output",
-                     "predict": "input",
-                     "input": row["input"],
-                     "output": row["output"],
-                 },
-             ],
-             "source": "CRUXEval",
-             "has_ground_truth": False,
-             "has_tasks": True,
-         }
-
-
  # ---------------------------------------------------------------------------
  # HumanEval+ adapter (HuggingFace: evalplus/humanevalplus)
  # ---------------------------------------------------------------------------

  class HumanEvalPlusAdapter(DatasetAdapter):
      slug = "humanevalplus"
      display_name = "HumanEval+"
@@ -378,6 +270,7 @@ class HumanEvalPlusAdapter(DatasetAdapter):
  # BigOBench adapter (HuggingFace: facebook/BigOBench)
  # ---------------------------------------------------------------------------

  class BigOBenchAdapter(DatasetAdapter):
      slug = "bigobench"
      display_name = "BigOBench"
@@ -404,13 +297,15 @@ class BigOBenchAdapter(DatasetAdapter):
          prob = self._problems[idx]
          solutions = []
          for sol in prob["solutions"]:
-             solutions.append({
-                 "solution_id": sol["solution_id"],
-                 "code": sol["solution_code"],
-                 "highlighted_code": _highlight_code(sol["solution_code"]),
-                 "time_complexity": sol.get("time_complexity"),
-                 "space_complexity": sol.get("space_complexity"),
-             })
          return {
              "idx": idx,
              "task_id": prob["problem_id"],
@@ -429,16 +324,9 @@ class BigOBenchAdapter(DatasetAdapter):
          }


- def _merge_bigobench(ds_time, ds_space) -> list[dict[str, Any]]:
-     """Merge time and space complexity test sets by problem_id.
-
-     Groups all solutions under their parent problem. Solutions that appear
-     in both test sets get both complexity labels; otherwise the missing one
-     is None. Returns a list of problem dicts sorted by problem_id.
-     """
-     # First, collect solutions keyed by (problem_id, solution_id)
      solutions: dict[tuple[str, str], dict[str, Any]] = {}
-     # Track problem-level metadata
      problem_meta: dict[str, dict[str, str]] = {}

      for row in ds_time:
@@ -456,10 +344,13 @@ def _merge_bigobench(ds_time, ds_space) -> list[dict[str, Any]]:

      for row in ds_space:
          pid, sid = row["problem_id"], row["solution_id"]
-         problem_meta.setdefault(pid, {
-             "problem_name": row["problem_name"],
-             "description": row["description"],
-         })
          key = (pid, sid)
          if key in solutions:
              solutions[key]["space_complexity"] = row["space_complexity_inferred"]
@@ -471,8 +362,6 @@ def _merge_bigobench(ds_time, ds_space) -> list[dict[str, Any]]:
              "space_complexity": row["space_complexity_inferred"],
          }

-     # Group solutions by problem_id
-     from collections import defaultdict
      by_problem: dict[str, list[dict[str, Any]]] = defaultdict(list)
      for (pid, _sid), sol in solutions.items():
          by_problem[pid].append(sol)
@@ -480,58 +369,537 @@ def _merge_bigobench(ds_time, ds_space) -> list[dict[str, Any]]:
      problems = []
      for pid in sorted(by_problem.keys()):
          meta = problem_meta[pid]
-         problems.append({
-             "problem_id": pid,
-             "problem_name": meta["problem_name"],
-             "description": meta["description"],
-             "solutions": by_problem[pid],
-         })

      return problems


  # ---------------------------------------------------------------------------
- # Registration helpers
  # ---------------------------------------------------------------------------

- def register_hf_datasets() -> None:
-     """Load all HuggingFace datasets."""
-     from datasets import load_dataset
-
-     try:
-         problems = load_dataset("JetBrains-Research/REval", "problems", split="test")
-         tasks = load_dataset("JetBrains-Research/REval", "tasks", split="test")
-         executions = load_dataset("JetBrains-Research/REval", "executions", split="test")
-         states = load_dataset("JetBrains-Research/REval", "states", split="test")
-         REGISTRY["reval"] = REvalAdapter(problems, tasks, executions, states)
-         print(f"Loaded REval: {len(problems)} problems")
-     except Exception as e:
-         print(f"Warning: could not load REval: {e}")
-
-     try:
-         crux = load_dataset("cruxeval-org/cruxeval", split="test")
-         REGISTRY["cruxeval"] = CRUXEvalAdapter(crux)
-         print(f"Loaded CRUXEval: {len(crux)} problems")
-     except Exception as e:
-         print(f"Warning: could not load CRUXEval: {e}")
-
-     try:
-         heplus = load_dataset("evalplus/humanevalplus", split="test")
-         REGISTRY["humanevalplus"] = HumanEvalPlusAdapter(heplus)
-         print(f"Loaded HumanEval+: {len(heplus)} problems")
-     except Exception as e:
-         print(f"Warning: could not load HumanEval+: {e}")
-
-     try:
-         ds_time = load_dataset(
-             "facebook/BigOBench", "time_complexity_test_set.jsonl", split="train"
-         )
-         ds_space = load_dataset(
-             "facebook/BigOBench", "space_complexity_test_set.jsonl", split="train"
-         )
-         merged = _merge_bigobench(ds_time, ds_space)
-         REGISTRY["bigobench"] = BigOBenchAdapter(merged)
-         print(f"Loaded BigOBench: {len(merged)} problems "
-               f"({len(ds_time)} time + {len(ds_space)} space)")
-     except Exception as e:
-         print(f"Warning: could not load BigOBench: {e}")
+ """Code generation benchmark adapters."""

  from __future__ import annotations

  import json
+ from collections import defaultdict
  from typing import Any

+ from adapters import DatasetAdapter
+
+ # Injected at runtime by _set_helpers()
  _highlight_code = None
  _code_offset = None
  _extract_test_classes = None


  # ---------------------------------------------------------------------------
  # REval adapter (HuggingFace: JetBrains-Research/REval)
  # ---------------------------------------------------------------------------

+
  def _format_typed_value(val: dict) -> str:
      """Convert a {__type__, __value__} dict from REval states into a Python repr string."""
      t = val.get("__type__")

      def __init__(self, problems_ds, tasks_ds, executions_ds, states_ds):
          self._problems = problems_ds
          self._tasks: dict[str, list] = {}
          for row in tasks_ds:
              self._tasks[row["task_id"]] = json.loads(row["tasks"])
          self._executions: dict[tuple[str, int], dict] = {}
          for row in executions_ds:
              self._executions[(row["task_id"], row["input_idx"])] = {
                  "trace": row["trace"],
                  "coverage": row["coverage"],
              }
          self._states: dict[tuple[str, int], list] = {}
          for row in states_ds:
              self._states[(row["task_id"], row["input_idx"])] = json.loads(row["states"])

          for item in adjusted_items:
              if "lineno" in item:
                  task_lines.add(item["lineno"])
+         task_info["task_lines"] = sorted(task_lines)

          tasks_info.append(task_info)

          code = problem["code"]
          offset = _code_offset(code)

          coverage_1indexed = [ln + 1 - offset for ln in exec_rec["coverage"]]
          total_lines = len(code[offset:].splitlines())

          task_list = self._tasks.get(task_id, [])
          task_items = []
          for t in task_list:
              task_items = t.get("task", [])
              break

          states_list = self._states.get((task_id, input_idx), [])

          variable_answers = []
          for item in task_items:
+             lineno = item["lineno"]
              var = item["var"]
              values = []
              for s in states_list:
                  if s["lineno"] == lineno and var in s.get("locals", {}):

              elif len(values) == 1:
                  answer_str = _format_typed_value(values[0])
              else:
                  seen = []
                  for v in values:
                      fmt = _format_typed_value(v)
                          seen.append(fmt)
                  answer_str = "[" + ", ".join(seen) + "]" if len(seen) > 1 else seen[0]

+             variable_answers.append(
+                 {
+                     "lineno": lineno - offset,
+                     "var": var,
+                     "answer_str": answer_str,
+                 }
+             )
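The deduplication step above collapses repeated variable values by their formatted representation while preserving first-seen order. A standalone sketch (with `repr` standing in for the adapter's `_format_typed_value` helper, and assuming a non-empty value list, since the adapter handles the empty case in an earlier branch):

```python
def dedupe_answer(values, format_value=repr):
    """Collapse values that format identically, preserving order of first appearance."""
    seen = []
    for v in values:
        fmt = format_value(v)
        if fmt not in seen:
            seen.append(fmt)
    # Multiple distinct values are shown as a bracketed list, a single one as-is.
    return "[" + ", ".join(seen) + "]" if len(seen) > 1 else seen[0]

print(dedupe_answer([1, 1, 2]))  # [1, 2]
print(dedupe_answer([5, 5]))     # 5
```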
          trace = exec_rec["trace"]
          next_lines_answers = []
          processed_linenos: set[int] = set()

              for i, ln in enumerate(trace):
                  if ln == lineno and i + 1 < len(trace):
                      nexts.add(trace[i + 1])
+             next_lines_answers.append(
+                 {
+                     "lineno": lineno,
+                     "next_lines": sorted(nexts) if nexts else [-1],
+                 }
+             )

          return {
              "status": "ok",
          }


  # ---------------------------------------------------------------------------
  # HumanEval+ adapter (HuggingFace: evalplus/humanevalplus)
  # ---------------------------------------------------------------------------

+
  class HumanEvalPlusAdapter(DatasetAdapter):
      slug = "humanevalplus"
      display_name = "HumanEval+"

  # BigOBench adapter (HuggingFace: facebook/BigOBench)
  # ---------------------------------------------------------------------------

+
  class BigOBenchAdapter(DatasetAdapter):
      slug = "bigobench"
      display_name = "BigOBench"

          prob = self._problems[idx]
          solutions = []
          for sol in prob["solutions"]:
+             solutions.append(
+                 {
+                     "solution_id": sol["solution_id"],
+                     "code": sol["solution_code"],
+                     "highlighted_code": _highlight_code(sol["solution_code"]),
+                     "time_complexity": sol.get("time_complexity"),
+                     "space_complexity": sol.get("space_complexity"),
+                 }
+             )
          return {
              "idx": idx,
              "task_id": prob["problem_id"],
          }


+ def merge_bigobench(ds_time, ds_space) -> list[dict[str, Any]]:
+     """Merge time and space complexity test sets by problem_id."""
      solutions: dict[tuple[str, str], dict[str, Any]] = {}
      problem_meta: dict[str, dict[str, str]] = {}

      for row in ds_time:

      for row in ds_space:
          pid, sid = row["problem_id"], row["solution_id"]
+         problem_meta.setdefault(
+             pid,
+             {
+                 "problem_name": row["problem_name"],
+                 "description": row["description"],
+             },
+         )
          key = (pid, sid)
          if key in solutions:
              solutions[key]["space_complexity"] = row["space_complexity_inferred"]
              "space_complexity": row["space_complexity_inferred"],
          }

      by_problem: dict[str, list[dict[str, Any]]] = defaultdict(list)
      for (pid, _sid), sol in solutions.items():
          by_problem[pid].append(sol)

      problems = []
      for pid in sorted(by_problem.keys()):
          meta = problem_meta[pid]
+         problems.append(
+             {
+                 "problem_id": pid,
+                 "problem_name": meta["problem_name"],
+                 "description": meta["description"],
+                 "solutions": by_problem[pid],
+             }
+         )

      return problems
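The merge keyed on `(problem_id, solution_id)` can be exercised with toy rows. This is a simplified sketch, not the adapter's function: the row field names (`time`, `space`) and the returned dict shape are illustrative stand-ins for the BigOBench columns, and a solution seen in only one test set keeps `None` for the missing complexity label:

```python
from collections import defaultdict

def merge_by_problem(time_rows, space_rows):
    # Key solutions by (problem_id, solution_id) so a solution present in
    # both test sets ends up with both complexity labels.
    solutions = {}
    for row in time_rows:
        solutions[(row["problem_id"], row["solution_id"])] = {
            "solution_id": row["solution_id"],
            "time_complexity": row["time"],
            "space_complexity": None,
        }
    for row in space_rows:
        key = (row["problem_id"], row["solution_id"])
        if key in solutions:
            solutions[key]["space_complexity"] = row["space"]
        else:
            solutions[key] = {
                "solution_id": row["solution_id"],
                "time_complexity": None,
                "space_complexity": row["space"],
            }
    # Group back under the parent problem, sorted by problem_id.
    by_problem = defaultdict(list)
    for (pid, _sid), sol in solutions.items():
        by_problem[pid].append(sol)
    return {pid: by_problem[pid] for pid in sorted(by_problem)}

merged = merge_by_problem(
    [{"problem_id": "p1", "solution_id": "s1", "time": "O(n)"}],
    [{"problem_id": "p1", "solution_id": "s1", "space": "O(1)"},
     {"problem_id": "p1", "solution_id": "s2", "space": "O(n)"}],
)
print(len(merged["p1"]))  # 2
```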

  # ---------------------------------------------------------------------------
+ # MBPP+ adapter (HuggingFace: evalplus/mbppplus)
  # ---------------------------------------------------------------------------

+
+ class MBPPPlusAdapter(DatasetAdapter):
+     slug = "mbppplus"
+     display_name = "MBPP+"
+     has_ground_truth = False
+     has_tasks = False
+
+     def __init__(self, hf_dataset):
+         self._ds = hf_dataset
+
+     def problem_count(self) -> int:
+         return len(self._ds)
+
+     def get_problem_summary(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         return {
+             "idx": idx,
+             "task_id": str(row["task_id"]),
+             "entry_point": row["prompt"][:60].replace("\n", " ").strip(),
+             "num_inputs": len(row["test_list"]),
+             "source": "MBPP+",
+         }
+
+     def get_problem_detail(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         code = row["code"]
+         return {
+             "idx": idx,
+             "task_id": str(row["task_id"]),
+             "entry_point": row["prompt"][:60].replace("\n", " ").strip(),
+             "code": code,
+             "highlighted_code": _highlight_code(code),
+             "inputs": [],
+             "outputs": [],
+             "test": "\n".join(row["test_list"]),
+             "tasks": [],
+             "source": "MBPP+",
+             "has_ground_truth": False,
+             "has_tasks": False,
+             "description": row["prompt"],
+         }
+
+
+ # ---------------------------------------------------------------------------
+ # ClassEval adapter (HuggingFace: FudanSELab/ClassEval)
+ # ---------------------------------------------------------------------------
+
+
+ class ClassEvalAdapter(DatasetAdapter):
+     slug = "classeval"
+     display_name = "ClassEval"
+     has_ground_truth = False
+     has_tasks = False
+
+     def __init__(self, hf_dataset):
+         self._ds = hf_dataset
+
+     def problem_count(self) -> int:
+         return len(self._ds)
+
+     def get_problem_summary(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         return {
+             "idx": idx,
+             "task_id": row["task_id"],
+             "entry_point": row["class_name"],
+             "num_inputs": len(row["methods_info"]),
+             "source": "ClassEval",
+         }
+
+     def get_problem_detail(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         code = row["solution_code"]
+         return {
+             "idx": idx,
+             "task_id": row["task_id"],
+             "entry_point": row["class_name"],
+             "code": code,
+             "highlighted_code": _highlight_code(code),
+             "inputs": [],
+             "outputs": [],
+             "test": row["test"],
+             "tasks": [],
+             "source": "ClassEval",
+             "has_ground_truth": False,
+             "has_tasks": False,
+             "description": row["class_description"],
+             "skeleton": row["skeleton"],
+         }
+
+
+ # ---------------------------------------------------------------------------
+ # LiveCodeBench adapter (HuggingFace: livecodebench/code_generation_lite)
+ # ---------------------------------------------------------------------------
+
+
+ class LiveCodeBenchAdapter(DatasetAdapter):
+     slug = "livecodebench"
+     display_name = "LiveCodeBench"
+     has_ground_truth = False
+     has_tasks = False
+
+     def __init__(self, hf_dataset):
+         self._ds = hf_dataset
+
+     def problem_count(self) -> int:
+         return len(self._ds)
+
+     def get_problem_summary(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         return {
+             "idx": idx,
+             "task_id": row["question_id"],
+             "entry_point": row["question_title"],
+             "num_inputs": 0,
+             "source": row["platform"],
+         }
+
+     def get_problem_detail(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         test_cases = []
+         try:
+             test_cases = json.loads(row["public_test_cases"]) if row["public_test_cases"] else []
+         except (json.JSONDecodeError, TypeError):
+             pass
+
+         inputs = [tc.get("input", "") for tc in test_cases]
+         outputs = [tc.get("output", "") for tc in test_cases]
+
+         starter = row.get("starter_code", "") or ""
+         code = starter if starter.strip() else ""
+
+         return {
+             "idx": idx,
+             "task_id": row["question_id"],
+             "entry_point": row["question_title"],
+             "code": code,
+             "highlighted_code": _highlight_code(code) if code else "",
526
+ "inputs": inputs,
527
+ "outputs": outputs,
528
+ "test": None,
529
+ "tasks": [],
530
+ "source": row["platform"],
531
+ "has_ground_truth": False,
532
+ "has_tasks": False,
533
+ "description": row["question_content"],
534
+ "difficulty": row.get("difficulty", ""),
535
+ "contest_date": row.get("contest_date", ""),
536
+ }
537
+
538
+
539
+ # ---------------------------------------------------------------------------
540
+ # CodeContests adapter (HuggingFace: deepmind/code_contests)
541
+ # ---------------------------------------------------------------------------
542
+
543
+ _CC_LANG_NAMES = {0: "Unknown", 1: "Python 2", 2: "C++", 3: "Python 3", 4: "Java"}
544
+
545
+
546
+ class CodeContestsAdapter(DatasetAdapter):
547
+ slug = "codecontests"
548
+ display_name = "CodeContests"
549
+ has_ground_truth = False
550
+ has_tasks = False
551
+
552
+ _DIFFICULTY_NAMES = {
553
+ 0: "Unknown",
554
+ 1: "Easy",
555
+ 2: "Medium",
556
+ 3: "Hard",
557
+ 4: "Harder",
558
+ 5: "Hardest",
559
+ 6: "External",
560
+ }
561
+ _SOURCE_NAMES = {
562
+ 0: "Unknown",
563
+ 1: "CodeChef",
564
+ 2: "Codeforces",
565
+ 3: "HackerEarth",
566
+ 4: "CodeJam",
567
+ 5: "AtCoder",
568
+ 6: "Aizu",
569
+ }
570
+
571
+ def __init__(self, hf_dataset):
572
+ self._ds = hf_dataset
573
+
574
+ def problem_count(self) -> int:
575
+ return len(self._ds)
576
+
577
+ def get_problem_summary(self, idx: int) -> dict[str, Any]:
578
+ row = self._ds[idx]
579
+ source_int = row.get("source", 0)
580
+ source_name = self._SOURCE_NAMES.get(source_int, "Unknown")
581
+ return {
582
+ "idx": idx,
583
+ "task_id": row["name"],
584
+ "entry_point": row["name"],
585
+ "num_inputs": len(row.get("public_tests", {}).get("input", [])),
586
+ "source": source_name,
587
+ }
588
+
589
+ def get_problem_detail(self, idx: int) -> dict[str, Any]:
590
+ row = self._ds[idx]
591
+ source_int = row.get("source", 0)
592
+ source_name = self._SOURCE_NAMES.get(source_int, "Unknown")
593
+ diff_int = row.get("difficulty", 0)
594
+ diff_name = self._DIFFICULTY_NAMES.get(diff_int, "Unknown")
595
+
596
+ sols_data = row.get("solutions", {})
597
+ sol_langs = sols_data.get("language", [])
598
+ sol_codes = sols_data.get("solution", [])
599
+ solutions = []
600
+ for i, code in enumerate(sol_codes[:10]):
601
+ lang_int = sol_langs[i] if i < len(sol_langs) else 0
602
+ lang_name = _CC_LANG_NAMES.get(lang_int, "Unknown")
603
+ lang_key = {1: "python", 2: "cpp", 3: "python", 4: "java"}.get(lang_int, "python")
604
+ solutions.append(
605
+ {
606
+ "solution_id": f"sol_{i}",
607
+ "code": code,
608
+ "highlighted_code": _highlight_code(code, language=lang_key),
609
+ "language": lang_name,
610
+ }
611
+ )
612
+
613
+ pub_tests = row.get("public_tests", {})
614
+ inputs = pub_tests.get("input", [])
615
+ outputs = pub_tests.get("output", [])
616
+ tags = list(row.get("cf_tags", []))
617
+
618
+ return {
619
+ "idx": idx,
620
+ "task_id": row["name"],
621
+ "entry_point": row["name"],
622
+ "code": solutions[0]["code"] if solutions else "",
623
+ "highlighted_code": solutions[0]["highlighted_code"] if solutions else "",
624
+ "inputs": inputs,
625
+ "outputs": outputs,
626
+ "test": None,
627
+ "tasks": [],
628
+ "source": source_name,
629
+ "has_ground_truth": False,
630
+ "has_tasks": False,
631
+ "description": row["description"],
632
+ "difficulty": diff_name,
633
+ "solutions": solutions,
634
+ "cf_rating": row.get("cf_rating", 0),
635
+ "tags": tags,
636
+ }
637
+
638
+
639
+ # ---------------------------------------------------------------------------
640
+ # APPS adapter (HuggingFace: codeparrot/apps)
641
+ # ---------------------------------------------------------------------------
642
+
643
+
644
+ class APPSAdapter(DatasetAdapter):
645
+ slug = "apps"
646
+ display_name = "APPS"
647
+ has_ground_truth = False
648
+ has_tasks = False
649
+
650
+ def __init__(self, hf_dataset):
651
+ self._ds = hf_dataset
652
+
653
+ def problem_count(self) -> int:
654
+ return len(self._ds)
655
+
656
+ def get_problem_summary(self, idx: int) -> dict[str, Any]:
657
+ row = self._ds[idx]
658
+ return {
659
+ "idx": idx,
660
+ "task_id": str(row["problem_id"]),
661
+ "entry_point": row["question"][:60].replace("\n", " ").strip(),
662
+ "num_inputs": 0,
663
+ "source": row.get("difficulty", "unknown"),
664
+ }
665
+
666
+ def get_problem_detail(self, idx: int) -> dict[str, Any]:
667
+ row = self._ds[idx]
668
+ solutions = []
669
+ if row.get("solutions"):
670
+ try:
671
+ sol_list = json.loads(row["solutions"])
672
+ for i, code in enumerate(sol_list[:5]):
673
+ solutions.append(
674
+ {
675
+ "solution_id": f"sol_{i}",
676
+ "code": code,
677
+ "highlighted_code": _highlight_code(code),
678
+ }
679
+ )
680
+ except (json.JSONDecodeError, TypeError):
681
+ pass
682
+
683
+ inputs, outputs = [], []
684
+ if row.get("input_output"):
685
+ try:
686
+ io = json.loads(row["input_output"])
687
+ inputs = io.get("inputs", [])
688
+ outputs = io.get("outputs", [])
689
+ except (json.JSONDecodeError, TypeError):
690
+ pass
691
+
692
+ code = solutions[0]["code"] if solutions else (row.get("starter_code") or "")
693
+ return {
694
+ "idx": idx,
695
+ "task_id": str(row["problem_id"]),
696
+ "entry_point": row["question"][:60].replace("\n", " ").strip(),
697
+ "code": code,
698
+ "highlighted_code": _highlight_code(code) if code else "",
699
+ "inputs": inputs[:5],
700
+ "outputs": outputs[:5],
701
+ "test": None,
702
+ "tasks": [],
703
+ "source": row.get("difficulty", "unknown"),
704
+ "has_ground_truth": False,
705
+ "has_tasks": False,
706
+ "description": row["question"],
707
+ "difficulty": row.get("difficulty", ""),
708
+ "solutions": solutions if len(solutions) > 1 else [],
709
+ "url": row.get("url", ""),
710
+ "starter_code": row.get("starter_code", ""),
711
+ }
712
+
713
+
714
+ # ---------------------------------------------------------------------------
715
+ # MBPP adapter (HuggingFace: google-research-datasets/mbpp)
716
+ # ---------------------------------------------------------------------------
717
+
718
+
719
+ class MBPPAdapter(DatasetAdapter):
720
+ slug = "mbpp"
721
+ display_name = "MBPP"
722
+ has_ground_truth = False
723
+ has_tasks = False
724
+
725
+ def __init__(self, hf_dataset):
726
+ self._ds = hf_dataset
727
+
728
+ def problem_count(self) -> int:
729
+ return len(self._ds)
730
+
731
+ def get_problem_summary(self, idx: int) -> dict[str, Any]:
732
+ row = self._ds[idx]
733
+ return {
734
+ "idx": idx,
735
+ "task_id": str(row["task_id"]),
736
+ "entry_point": row["text"][:60].replace("\n", " ").strip(),
737
+ "num_inputs": len(row.get("test_list", [])),
738
+ "source": "MBPP",
739
+ }
740
+
741
+ def get_problem_detail(self, idx: int) -> dict[str, Any]:
742
+ row = self._ds[idx]
743
+ code = row["code"]
744
+ test_list = row.get("test_list", [])
745
+ challenge_tests = row.get("challenge_test_list", [])
746
+ all_tests = test_list + challenge_tests
747
+ return {
748
+ "idx": idx,
749
+ "task_id": str(row["task_id"]),
750
+ "entry_point": row["text"][:60].replace("\n", " ").strip(),
751
+ "code": code,
752
+ "highlighted_code": _highlight_code(code),
753
+ "inputs": [],
754
+ "outputs": [],
755
+ "test": "\n".join(all_tests),
756
+ "tasks": [],
757
+ "source": "MBPP",
758
+ "has_ground_truth": False,
759
+ "has_tasks": False,
760
+ "description": row["text"],
761
+ }
762
+
763
+
764
+ # ---------------------------------------------------------------------------
765
+ # CodeSearchNet adapter (HuggingFace: code-search-net/code_search_net)
766
+ # ---------------------------------------------------------------------------
767
+
768
+
769
+ class CodeSearchNetAdapter(DatasetAdapter):
770
+ slug = "codesearchnet"
771
+ display_name = "CodeSearchNet"
772
+ has_ground_truth = False
773
+ has_tasks = False
774
+
775
+ def __init__(self, hf_dataset):
776
+ self._ds = hf_dataset
777
+
778
+ def problem_count(self) -> int:
779
+ return len(self._ds)
780
+
781
+ def get_problem_summary(self, idx: int) -> dict[str, Any]:
782
+ row = self._ds[idx]
783
+ return {
784
+ "idx": idx,
785
+ "task_id": row.get("func_name", str(idx)),
786
+ "entry_point": row.get("func_name", f"csn_{idx}"),
787
+ "num_inputs": 0,
788
+ "source": row.get("language", "unknown"),
789
+ }
790
+
791
+ def get_problem_detail(self, idx: int) -> dict[str, Any]:
792
+ row = self._ds[idx]
793
+ code = row.get("func_code_string", "")
794
+ lang = row.get("language", "python")
795
+ return {
796
+ "idx": idx,
797
+ "task_id": row.get("func_name", str(idx)),
798
+ "entry_point": row.get("func_name", f"csn_{idx}"),
799
+ "code": code,
800
+ "highlighted_code": _highlight_code(code, language=lang),
801
+ "inputs": [],
802
+ "outputs": [],
803
+ "test": None,
804
+ "tasks": [],
805
+ "source": lang,
806
+ "has_ground_truth": False,
807
+ "has_tasks": False,
808
+ "description": row.get("func_documentation_string", ""),
809
+ }
810
+
811
+
812
+ # ---------------------------------------------------------------------------
813
+ # BigCodeBench adapter (HuggingFace: bigcode/bigcodebench)
814
+ # ---------------------------------------------------------------------------
815
+
816
+
817
+ class BigCodeBenchAdapter(DatasetAdapter):
818
+ slug = "bigcodebench"
819
+ display_name = "BigCodeBench"
820
+ has_ground_truth = False
821
+ has_tasks = False
822
+
823
+ def __init__(self, hf_dataset):
824
+ self._ds = hf_dataset
825
+
826
+ def problem_count(self) -> int:
827
+ return len(self._ds)
828
+
829
+ def get_problem_summary(self, idx: int) -> dict[str, Any]:
830
+ row = self._ds[idx]
831
+ return {
832
+ "idx": idx,
833
+ "task_id": row["task_id"],
834
+ "entry_point": row.get("entry_point", "task_func"),
835
+ "num_inputs": 0,
836
+ "source": "BigCodeBench",
837
+ }
838
+
839
+ def get_problem_detail(self, idx: int) -> dict[str, Any]:
840
+ row = self._ds[idx]
841
+ code = row.get("code_prompt", "") + row.get("canonical_solution", "")
842
+ libs = row.get("libs", "")
843
+ return {
844
+ "idx": idx,
845
+ "task_id": row["task_id"],
846
+ "entry_point": row.get("entry_point", "task_func"),
847
+ "code": code,
848
+ "highlighted_code": _highlight_code(code),
849
+ "inputs": [],
850
+ "outputs": [],
851
+ "test": row.get("test", ""),
852
+ "tasks": [],
853
+ "source": "BigCodeBench",
854
+ "has_ground_truth": False,
855
+ "has_tasks": False,
856
+ "description": row.get("complete_prompt", ""),
857
+ "libs": libs,
858
+ }
859
+
860
+
861
+ # ---------------------------------------------------------------------------
862
+ # EffiBench adapter (HuggingFace: DONG19/EffiBench)
863
+ # ---------------------------------------------------------------------------
864
+
865
+
866
+ class EffiBenchAdapter(DatasetAdapter):
867
+ slug = "effibench"
868
+ display_name = "EffiBench"
869
+ has_ground_truth = False
870
+ has_tasks = False
871
+
872
+ def __init__(self, hf_dataset):
873
+ self._ds = hf_dataset
874
+
875
+ def problem_count(self) -> int:
876
+ return len(self._ds)
877
+
878
+ def get_problem_summary(self, idx: int) -> dict[str, Any]:
879
+ row = self._ds[idx]
880
+ return {
881
+ "idx": idx,
882
+ "task_id": str(row.get("problem_idx", idx)),
883
+ "entry_point": row.get("task_name", f"effibench_{idx}"),
884
+ "num_inputs": 0,
885
+ "source": "EffiBench",
886
+ }
887
+
888
+ def get_problem_detail(self, idx: int) -> dict[str, Any]:
889
+ row = self._ds[idx]
890
+ code = row.get("canonical_solution", "")
891
+ return {
892
+ "idx": idx,
893
+ "task_id": str(row.get("problem_idx", idx)),
894
+ "entry_point": row.get("task_name", f"effibench_{idx}"),
895
+ "code": code,
896
+ "highlighted_code": _highlight_code(code),
897
+ "inputs": [],
898
+ "outputs": [],
899
+ "test": row.get("test_case", ""),
900
+ "tasks": [],
901
+ "source": "EffiBench",
902
+ "has_ground_truth": False,
903
+ "has_tasks": False,
904
+ "description": row.get("markdown_description", row.get("description", "")),
905
+ }
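All of the adapters above expose the same three-method interface (`problem_count`, `get_problem_summary`, `get_problem_detail`). A minimal sketch of how a registry consumer might drive one; `_StubAdapter` is a hypothetical stand-in (the real `DatasetAdapter` base class and `_highlight_code` helper live elsewhere in the package):

```python
from typing import Any


class _StubAdapter:
    """Hypothetical stand-in mirroring the adapter interface used above."""

    display_name = "Stub"

    def __init__(self, rows: list[dict[str, Any]]):
        self._ds = rows

    def problem_count(self) -> int:
        return len(self._ds)

    def get_problem_summary(self, idx: int) -> dict[str, Any]:
        # Same shape as the real adapters: idx + task_id + source tag
        row = self._ds[idx]
        return {"idx": idx, "task_id": str(row["task_id"]), "source": "Stub"}


adapter = _StubAdapter([{"task_id": 1}, {"task_id": 2}])
summaries = [adapter.get_problem_summary(i) for i in range(adapter.problem_count())]
```

The dropdown/list views only ever touch `get_problem_summary`; `get_problem_detail` (with highlighting) is deferred until a single problem is opened.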
adapters/code_reasoning.py ADDED
@@ -0,0 +1,366 @@
+"""Code reasoning / completion benchmark adapters (CRUXEval, SAFIM, HumanEval-X)."""
+
+from __future__ import annotations
+
+import re
+from typing import Any
+
+from adapters import DatasetAdapter
+
+# Injected at runtime by _set_helpers()
+_highlight_code = None
+_code_offset = None
+_extract_test_classes = None
+
+
+# ---------------------------------------------------------------------------
+# CRUXEval adapter (HuggingFace: cruxeval-org/cruxeval)
+# ---------------------------------------------------------------------------
+
+
+class CRUXEvalAdapter(DatasetAdapter):
+    slug = "cruxeval"
+    display_name = "CRUXEval"
+    has_ground_truth = False
+    has_tasks = True
+
+    def __init__(self, hf_dataset):
+        self._ds = hf_dataset
+
+    def problem_count(self) -> int:
+        return len(self._ds)
+
+    def get_problem_summary(self, idx: int) -> dict[str, Any]:
+        row = self._ds[idx]
+        return {
+            "idx": idx,
+            "task_id": row["id"],
+            "entry_point": "f",
+            "num_inputs": 1,
+            "source": "CRUXEval",
+        }
+
+    def get_problem_detail(self, idx: int) -> dict[str, Any]:
+        row = self._ds[idx]
+        code = row["code"]
+        return {
+            "idx": idx,
+            "task_id": row["id"],
+            "entry_point": "f",
+            "code": code,
+            "highlighted_code": _highlight_code(code),
+            "inputs": [row["input"]],
+            "outputs": [row["output"]],
+            "test": None,
+            "tasks": [
+                {
+                    "name": "Output Prediction",
+                    "description": "Given the code and input, predict the output.",
+                    "given": "input",
+                    "predict": "output",
+                    "input": row["input"],
+                    "output": row["output"],
+                },
+                {
+                    "name": "Input Prediction",
+                    "description": "Given the code and output, predict the input.",
+                    "given": "output",
+                    "predict": "input",
+                    "input": row["input"],
+                    "output": row["output"],
+                },
+            ],
+            "source": "CRUXEval",
+            "has_ground_truth": False,
+            "has_tasks": True,
+        }
+
+
+# ---------------------------------------------------------------------------
+# SAFIM adapter (HuggingFace: gonglinyuan/safim)
+# ---------------------------------------------------------------------------
+
+
+class SAFIMAdapter(DatasetAdapter):
+    slug = "safim"
+    display_name = "SAFIM"
+    has_ground_truth = False
+    has_tasks = False
+
+    def __init__(self, hf_dataset):
+        self._ds = hf_dataset
+
+    def problem_count(self) -> int:
+        return len(self._ds)
+
+    def get_problem_summary(self, idx: int) -> dict[str, Any]:
+        row = self._ds[idx]
+        return {
+            "idx": idx,
+            "task_id": row.get("task_id", str(idx)),
+            "entry_point": row.get("task_id", f"safim_{idx}"),
+            "num_inputs": 0,
+            "source": row.get("lang", "unknown"),
+        }
+
+    # Patterns that mark where the completion should be inserted
+    _HOLE_MARKERS = [
+        "{{completion}}",
+        "/* TODO: Your code here */",
+        "// TODO: Your code here",
+        "# TODO: Your code here",
+    ]
+
+    def _find_hole_marker(self, prompt: str) -> str | None:
+        """Return the first matching hole marker found in the prompt, or None."""
+        for marker in self._HOLE_MARKERS:
+            if marker in prompt:
+                return marker
+        return None
+
+    def get_problem_detail(self, idx: int) -> dict[str, Any]:
+        row = self._ds[idx]
+        prompt = row.get("prompt", "")
+        ground_truth = row.get("ground_truth", "")
+        lang = row.get("lang", "python")
+
+        marker = self._find_hole_marker(prompt)
+
+        if marker:
+            display_code = prompt.replace(marker, "/* [HOLE] */")
+            before_hole = prompt.split(marker)[0]
+            merged_code = prompt.replace(marker, ground_truth)
+        else:
+            display_code = prompt + "\n/* [HOLE] */\n"
+            before_hole = prompt + "\n"
+            merged_code = prompt + "\n" + ground_truth + "\n"
+
+        # Compute 1-indexed line range of the inserted ground truth
+        gt_start_line = before_hole.count("\n") + 1
+        gt_line_count = ground_truth.count("\n") + (1 if ground_truth else 0)
+        gt_end_line = gt_start_line + gt_line_count - 1
+
+        lang_key = {"Python": "python", "Java": "java", "C++": "cpp", "C#": "csharp"}.get(
+            lang, lang.lower()
+        )
+
+        return {
+            "idx": idx,
+            "task_id": row.get("task_id", str(idx)),
+            "entry_point": row.get("task_id", f"safim_{idx}"),
+            "code": display_code,
+            "highlighted_code": _highlight_code(display_code, language=lang_key),
+            "inputs": [],
+            "outputs": [],
+            "test": None,
+            "tasks": [],
+            "source": lang,
+            "has_ground_truth": False,
+            "has_tasks": False,
+            "fim_prefix": prompt,
+            "fim_ground_truth": ground_truth,
+            "fim_ground_truth_highlighted": _highlight_code(ground_truth, language=lang_key),
+            "fim_merged_code": merged_code,
+            "fim_merged_highlighted": _highlight_code(
+                merged_code,
+                highlight_lines=list(range(gt_start_line, gt_end_line + 1)),
+                language=lang_key,
+            ),
+            "fim_gt_start_line": gt_start_line,
+            "fim_gt_end_line": gt_end_line,
+            "language": lang,
+        }
+
+
+# ---------------------------------------------------------------------------
+# Shared helpers for HumanEval-X / HumanEvalPack
+# ---------------------------------------------------------------------------
+
+
+def _extract_func_name(declaration: str) -> str:
+    """Extract the function/method name from a code declaration string."""
+    m = re.search(r"def\s+(\w+)\s*\(", declaration)
+    if m:
+        return m.group(1)
+    m = re.search(r"(\w+)\s*\(", declaration)
+    if m:
+        return m.group(1)
+    return ""
+
+
+# ---------------------------------------------------------------------------
+# HumanEvalPack adapter (HuggingFace: bigcode/humanevalpack)
+# ---------------------------------------------------------------------------
+
+
+class HumanEvalPackAdapter(DatasetAdapter):
+    slug = "humanevalpack"
+    display_name = "HumanEvalPack"
+    has_ground_truth = False
+    has_tasks = False
+
+    LANGUAGES = ["python", "js", "cpp", "go", "java", "rust"]
+
+    def __init__(self, datasets_by_lang: dict[str, Any]):
+        self._by_lang = datasets_by_lang
+        first_lang = next(iter(self._by_lang))
+        self._count = len(self._by_lang[first_lang])
+
+    def problem_count(self) -> int:
+        return self._count
+
+    def get_problem_summary(self, idx: int) -> dict[str, Any]:
+        first_lang = next(iter(self._by_lang))
+        row = self._by_lang[first_lang][idx]
+        return {
+            "idx": idx,
+            "task_id": row["task_id"],
+            "entry_point": row.get("entry_point", f"problem_{idx}"),
+            "num_inputs": len(self._by_lang),
+            "source": "HumanEvalPack",
+        }
+
+    def get_problem_detail(self, idx: int) -> dict[str, Any]:
+        first_lang = next(iter(self._by_lang))
+        row = self._by_lang[first_lang][idx]
+
+        lang_labels = {
+            "python": "Python",
+            "js": "JavaScript",
+            "cpp": "C++",
+            "go": "Go",
+            "java": "Java",
+            "rust": "Rust",
+        }
+        lang_pygments = {
+            "python": "python",
+            "js": "javascript",
+            "cpp": "cpp",
+            "go": "go",
+            "java": "java",
+            "rust": "rust",
+        }
+
+        lang_solutions = []
+        for lang in self.LANGUAGES:
+            if lang not in self._by_lang:
+                continue
+            lrow = self._by_lang[lang][idx]
+            canonical = lrow.get("prompt", "") + lrow.get("canonical_solution", "")
+            buggy = lrow.get("prompt", "") + lrow.get("buggy_solution", "")
+            lang_key = lang_pygments.get(lang, lang)
+            lang_solutions.append(
+                {
+                    "language": lang,
+                    "language_label": lang_labels.get(lang, lang),
+                    "code": canonical,
+                    "highlighted_code": _highlight_code(canonical, language=lang_key),
+                    "buggy_code": buggy,
+                    "buggy_highlighted_code": _highlight_code(buggy, language=lang_key),
+                    "test": lrow.get("test", ""),
+                    "example_test": lrow.get("example_test", ""),
+                    "bug_type": lrow.get("bug_type", ""),
+                    "failure_symptoms": lrow.get("failure_symptoms", ""),
+                }
+            )
+
+        py_row = self._by_lang.get("python", self._by_lang[first_lang])[idx]
+        default_code = py_row.get("prompt", "") + py_row.get("canonical_solution", "")
+
+        return {
+            "idx": idx,
+            "task_id": row["task_id"],
+            "entry_point": row.get("entry_point", f"problem_{idx}"),
+            "code": default_code,
+            "highlighted_code": _highlight_code(default_code),
+            "inputs": [],
+            "outputs": [],
+            "test": py_row.get("test", ""),
+            "tasks": [],
+            "source": "HumanEvalPack",
+            "has_ground_truth": False,
+            "has_tasks": False,
+            "description": row.get("instruction", row.get("docstring", "")),
+            "lang_solutions": lang_solutions,
+            "bug_type": py_row.get("bug_type", ""),
+            "failure_symptoms": py_row.get("failure_symptoms", ""),
+        }
+
+
+# ---------------------------------------------------------------------------
+# HumanEval-X adapter (HuggingFace: THUDM/humaneval-x)
+# ---------------------------------------------------------------------------
+
+
+class HumanEvalXAdapter(DatasetAdapter):
+    slug = "humanevalx"
+    display_name = "HumanEval-X"
+    has_ground_truth = False
+    has_tasks = False
+
+    LANGUAGES = ["python", "cpp", "java", "go", "js"]
+
+    def __init__(self, datasets_by_lang: dict[str, Any]):
+        """datasets_by_lang maps language name -> HF dataset split."""
+        self._by_lang = datasets_by_lang
+        first_lang = next(iter(self._by_lang))
+        self._count = len(self._by_lang[first_lang])
+
+    def problem_count(self) -> int:
+        return self._count
+
+    def get_problem_summary(self, idx: int) -> dict[str, Any]:
+        first_lang = next(iter(self._by_lang))
+        row = self._by_lang[first_lang][idx]
+        task_id = row["task_id"].split("/")[-1]
+        decl = row.get("declaration", row.get("prompt", ""))
+        entry = _extract_func_name(decl) or f"problem_{task_id}"
+        return {
+            "idx": idx,
+            "task_id": f"HumanEval/{task_id}",
+            "entry_point": entry,
+            "num_inputs": len(self._by_lang),
+            "source": "HumanEval-X",
+        }
+
+    def get_problem_detail(self, idx: int) -> dict[str, Any]:
+        first_lang = next(iter(self._by_lang))
+        row = self._by_lang[first_lang][idx]
+        task_id = row["task_id"].split("/")[-1]
+        decl = row.get("declaration", row.get("prompt", ""))
+        entry = _extract_func_name(decl) or f"problem_{task_id}"
+
+        lang_solutions = []
+        for lang in self.LANGUAGES:
+            if lang not in self._by_lang:
+                continue
+            lrow = self._by_lang[lang][idx]
+            code = lrow["prompt"] + lrow["canonical_solution"]
+            lang_solutions.append(
+                {
+                    "language": lang,
+                    "code": code,
+                    "highlighted_code": _highlight_code(code, language=lang),
+                    "test": lrow.get("test", ""),
+                    "example_test": lrow.get("example_test", ""),
+                }
+            )
+
+        py_row = self._by_lang.get("python", self._by_lang[first_lang])[idx]
+        default_code = py_row["prompt"] + py_row["canonical_solution"]
+
+        return {
+            "idx": idx,
+            "task_id": f"HumanEval/{task_id}",
+            "entry_point": entry,
+            "code": default_code,
+            "highlighted_code": _highlight_code(default_code),
+            "inputs": [],
+            "outputs": [],
+            "test": py_row.get("test", ""),
+            "tasks": [],
+            "source": "HumanEval-X",
+            "has_ground_truth": False,
+            "has_tasks": False,
+            "lang_solutions": lang_solutions,
+        }
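The 1-indexed line-range arithmetic used by `SAFIMAdapter.get_problem_detail` when splicing the ground truth into the hole can be exercised on its own. This is a standalone sketch of that computation, not the adapter itself; `fim_merge` is a hypothetical helper name:

```python
def fim_merge(prompt: str, marker: str, ground_truth: str) -> tuple[str, int, int]:
    """Replace the hole marker and return (merged_code, gt_start_line, gt_end_line).

    Line numbers are 1-indexed, matching the highlight_lines convention above.
    """
    before_hole = prompt.split(marker)[0]
    merged = prompt.replace(marker, ground_truth)
    # The ground truth starts on the line after the newlines preceding the hole
    gt_start = before_hole.count("\n") + 1
    gt_lines = ground_truth.count("\n") + (1 if ground_truth else 0)
    return merged, gt_start, gt_start + gt_lines - 1


merged, start, end = fim_merge("a\nb\n{{completion}}\nc", "{{completion}}", "x\ny")
# The two ground-truth lines "x" and "y" land on lines 3-4 of the merged code
```

This mirrors why the adapter highlights `range(gt_start_line, gt_end_line + 1)` in the merged view.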
adapters/registration.py ADDED
@@ -0,0 +1,410 @@
+"""Dataset registration — loads all HuggingFace datasets into the adapter registry."""
+
+from __future__ import annotations
+
+import json
+import random
+from typing import Any
+
+from adapters import REGISTRY
+from adapters.code_editing import (
+    CanItEditAdapter,
+    CodeEditorBenchAdapter,
+    CodeXGLUERefinementAdapter,
+    CommitBenchAdapter,
+    DebugBenchAdapter,
+    SWEBenchFullAdapter,
+    SWEBenchLiteAdapter,
+    SWEBenchVerifiedAdapter,
+)
+from adapters.code_generation import (
+    APPSAdapter,
+    BigCodeBenchAdapter,
+    BigOBenchAdapter,
+    ClassEvalAdapter,
+    CodeContestsAdapter,
+    CodeSearchNetAdapter,
+    EffiBenchAdapter,
+    HumanEvalPlusAdapter,
+    LiveCodeBenchAdapter,
+    MBPPAdapter,
+    MBPPPlusAdapter,
+    REvalAdapter,
+    merge_bigobench,
+)
+from adapters.code_reasoning import (
+    CRUXEvalAdapter,
+    HumanEvalPackAdapter,
+    HumanEvalXAdapter,
+    SAFIMAdapter,
+)
+from adapters.vulnerability import (
+    BigVulAdapter,
+    DevignAdapter,
+    DiverseVulAdapter,
+    PrimeVulAdapter,
+)
+
+# ---------------------------------------------------------------------------
+# Sampling: cap large datasets at MAX_DISPLAY_SAMPLES for fast browsing
+# ---------------------------------------------------------------------------
+
+MAX_DISPLAY_SAMPLES = 1000
+_SAMPLE_SEED = 42
+
+
+def _sample_indices(total: int) -> list[int]:
+    """Return a sorted list of up to MAX_DISPLAY_SAMPLES random indices."""
+    if total <= MAX_DISPLAY_SAMPLES:
+        return list(range(total))
+    rng = random.Random(_SAMPLE_SEED)
+    return sorted(rng.sample(range(total), MAX_DISPLAY_SAMPLES))
+
+
+def _sample_hf_dataset(ds):
+    """Return a HuggingFace dataset (or subset) with at most MAX_DISPLAY_SAMPLES rows."""
+    if len(ds) <= MAX_DISPLAY_SAMPLES:
+        return ds
+    indices = _sample_indices(len(ds))
+    return ds.select(indices)
+
+
+def _sample_list(rows: list) -> list:
+    """Return a list with at most MAX_DISPLAY_SAMPLES items."""
+    if len(rows) <= MAX_DISPLAY_SAMPLES:
+        return rows
+    indices = _sample_indices(len(rows))
+    return [rows[i] for i in indices]
+
+
+def _load_jsonl_dataset(repo_id: str, filenames: list[str]) -> list[dict[str, Any]]:
+    """Download JSONL files from a HuggingFace dataset repo and return as a list of dicts.
+
+    This bypasses the ``datasets`` library when the repo uses deprecated loading scripts.
+    """
+    from huggingface_hub import hf_hub_download
+
+    rows: list[dict[str, Any]] = []
+    for fname in filenames:
+        path = hf_hub_download(repo_id, fname, repo_type="dataset")
+        with open(path) as f:
+            for line in f:
+                line = line.strip()
+                if line:
+                    rows.append(json.loads(line))
+    return rows
+
+
+def register_hf_datasets() -> None:
+    """Load all HuggingFace datasets into :data:`REGISTRY`."""
+    from datasets import load_dataset
+
+    # --- Base datasets ---
+
+    try:
+        problems = load_dataset("JetBrains-Research/REval", "problems", split="test")
+        tasks = load_dataset("JetBrains-Research/REval", "tasks", split="test")
+        executions = load_dataset("JetBrains-Research/REval", "executions", split="test")
+        states = load_dataset("JetBrains-Research/REval", "states", split="test")
+        REGISTRY["reval"] = REvalAdapter(problems, tasks, executions, states)
+        print(f"Loaded REval: {len(problems)} problems")
+    except Exception as e:
+        print(f"Warning: could not load REval: {e}")
+
+    try:
+        crux = load_dataset("cruxeval-org/cruxeval", split="test")
+        REGISTRY["cruxeval"] = CRUXEvalAdapter(crux)
+        print(f"Loaded CRUXEval: {len(crux)} problems")
+    except Exception as e:
+        print(f"Warning: could not load CRUXEval: {e}")
+
+    try:
+        heplus = load_dataset("evalplus/humanevalplus", split="test")
+        REGISTRY["humanevalplus"] = HumanEvalPlusAdapter(heplus)
+        print(f"Loaded HumanEval+: {len(heplus)} problems")
+    except Exception as e:
+        print(f"Warning: could not load HumanEval+: {e}")
+
+    try:
+        ds_time = load_dataset(
+            "facebook/BigOBench", "time_complexity_test_set.jsonl", split="train"
+        )
+        ds_space = load_dataset(
+            "facebook/BigOBench", "space_complexity_test_set.jsonl", split="train"
+        )
+        merged = merge_bigobench(ds_time, ds_space)
+        REGISTRY["bigobench"] = BigOBenchAdapter(merged)
+        print(
+            f"Loaded BigOBench: {len(merged)} problems "
+            f"({len(ds_time)} time + {len(ds_space)} space)"
+        )
+    except Exception as e:
+        print(f"Warning: could not load BigOBench: {e}")
+
+    # --- Batch 1 datasets ---
+
+    try:
+        mbppplus = load_dataset("evalplus/mbppplus", split="test")
+        REGISTRY["mbppplus"] = MBPPPlusAdapter(mbppplus)
+        print(f"Loaded MBPP+: {len(mbppplus)} problems")
+    except Exception as e:
+        print(f"Warning: could not load MBPP+: {e}")
+
+    try:
+        classeval = load_dataset("FudanSELab/ClassEval", split="test")
+        REGISTRY["classeval"] = ClassEvalAdapter(classeval)
+        print(f"Loaded ClassEval: {len(classeval)} problems")
+    except Exception as e:
+        print(f"Warning: could not load ClassEval: {e}")
+
+    try:
+        lcb = _load_jsonl_dataset(
+            "livecodebench/code_generation_lite",
+            [
+                "test.jsonl",
+                "test2.jsonl",
+                "test3.jsonl",
+                "test4.jsonl",
+                "test5.jsonl",
+                "test6.jsonl",
+            ],
+        )
+        lcb_sampled = _sample_list(lcb)
+        adapter = LiveCodeBenchAdapter(lcb_sampled)
+        adapter.total_count = len(lcb)
+        REGISTRY["livecodebench"] = adapter
+        print(f"Loaded LiveCodeBench: {len(lcb_sampled)} problems (of {len(lcb)})")
+    except Exception as e:
+        print(f"Warning: could not load LiveCodeBench: {e}")
+
+    try:
+        debugbench_full = load_dataset("Rtian/DebugBench", split="test")
+        debugbench = _sample_hf_dataset(debugbench_full)
+        adapter = DebugBenchAdapter(debugbench)
+        adapter.total_count = len(debugbench_full)
+        REGISTRY["debugbench"] = adapter
+        print(f"Loaded DebugBench: {len(debugbench)} problems (of {len(debugbench_full)})")
+    except Exception as e:
+        print(f"Warning: could not load DebugBench: {e}")
+
+    try:
+        hx_datasets = {}
+        for lang in HumanEvalXAdapter.LANGUAGES:
+            hx_datasets[lang] = _load_jsonl_dataset(
+                "THUDM/humaneval-x",
+                [f"data/{lang}/data/humaneval.jsonl"],
+            )
+        REGISTRY["humanevalx"] = HumanEvalXAdapter(hx_datasets)
198
+ print(
199
+ f"Loaded HumanEval-X: {len(hx_datasets)} languages, "
200
+ f"{len(hx_datasets[next(iter(hx_datasets))])} problems each"
201
+ )
202
+ except Exception as e:
203
+ print(f"Warning: could not load HumanEval-X: {e}")
204
+
205
+ # --- Batch 2 datasets ---
206
+
207
+ try:
208
+ swe = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")
209
+ REGISTRY["swebenchlite"] = SWEBenchLiteAdapter(swe)
210
+ print(f"Loaded SWE-bench Lite: {len(swe)} problems")
211
+ except Exception as e:
212
+ print(f"Warning: could not load SWE-bench Lite: {e}")
213
+
214
+ try:
215
+ cc = load_dataset("deepmind/code_contests", split="test")
216
+ REGISTRY["codecontests"] = CodeContestsAdapter(cc)
217
+ print(f"Loaded CodeContests: {len(cc)} problems")
218
+ except Exception as e:
219
+ print(f"Warning: could not load CodeContests: {e}")
220
+
221
+ try:
222
+ apps_full = load_dataset(
223
+ "codeparrot/apps",
224
+ "default",
225
+ split="test",
226
+ revision="refs/convert/parquet",
227
+ )
228
+ apps = _sample_hf_dataset(apps_full)
229
+ adapter = APPSAdapter(apps)
230
+ adapter.total_count = len(apps_full)
231
+ REGISTRY["apps"] = adapter
232
+ print(f"Loaded APPS: {len(apps)} problems (of {len(apps_full)})")
233
+ except Exception as e:
234
+ print(f"Warning: could not load APPS: {e}")
235
+
236
+ try:
237
+ cie = load_dataset("nuprl/CanItEdit", split="test")
238
+ REGISTRY["canitedit"] = CanItEditAdapter(cie)
239
+ print(f"Loaded CanItEdit: {len(cie)} problems")
240
+ except Exception as e:
241
+ print(f"Warning: could not load CanItEdit: {e}")
242
+
243
+ try:
244
+ mbpp = load_dataset("google-research-datasets/mbpp", "full", split="test")
245
+ REGISTRY["mbpp"] = MBPPAdapter(mbpp)
246
+ print(f"Loaded MBPP: {len(mbpp)} problems")
247
+ except Exception as e:
248
+ print(f"Warning: could not load MBPP: {e}")
249
+
250
+ # --- Batch 3 datasets ---
251
+
252
+ try:
253
+ safim_full = load_dataset("gonglinyuan/safim", "block", split="test")
254
+ safim = _sample_hf_dataset(safim_full)
255
+ adapter = SAFIMAdapter(safim)
256
+ adapter.total_count = len(safim_full)
257
+ REGISTRY["safim"] = adapter
258
+ print(f"Loaded SAFIM: {len(safim)} problems (of {len(safim_full)})")
259
+ except Exception as e:
260
+ print(f"Warning: could not load SAFIM: {e}")
261
+
262
+ try:
263
+ bigvul_full = load_dataset("bstee615/bigvul", split="test")
264
+ bigvul = _sample_hf_dataset(bigvul_full)
265
+ adapter = BigVulAdapter(bigvul)
266
+ adapter.total_count = len(bigvul_full)
267
+ REGISTRY["bigvul"] = adapter
268
+ print(f"Loaded BigVul: {len(bigvul)} problems (of {len(bigvul_full)})")
269
+ except Exception as e:
270
+ print(f"Warning: could not load BigVul: {e}")
271
+
272
+ try:
273
+ diversevul_full = load_dataset("claudios/DiverseVul", split="test")
274
+ diversevul = _sample_hf_dataset(diversevul_full)
275
+ adapter = DiverseVulAdapter(diversevul)
276
+ adapter.total_count = len(diversevul_full)
277
+ REGISTRY["diversevul"] = adapter
278
+ print(f"Loaded DiverseVul: {len(diversevul)} problems (of {len(diversevul_full)})")
279
+ except Exception as e:
280
+ print(f"Warning: could not load DiverseVul: {e}")
281
+
282
+ try:
283
+ primevul_full = load_dataset(
284
+ "json",
285
+ data_files="hf://datasets/starsofchance/PrimeVul/primevul_test.jsonl",
286
+ split="train",
287
+ )
288
+ primevul = _sample_hf_dataset(primevul_full)
289
+ adapter = PrimeVulAdapter(primevul)
290
+ adapter.total_count = len(primevul_full)
291
+ REGISTRY["primevul"] = adapter
292
+ print(f"Loaded PrimeVul: {len(primevul)} problems (of {len(primevul_full)})")
293
+ except Exception as e:
294
+ print(f"Warning: could not load PrimeVul: {e}")
295
+
296
+ try:
297
+ ceb_rows: list[dict[str, Any]] = []
298
+ ceb_files = [
299
+ ("code_debug", ["code_debug_primary.jsonl", "code_debug_plus.jsonl"]),
300
+ ("code_translate", ["code_translate_primary.jsonl", "code_translate_plus.jsonl"]),
301
+ ("code_polishment", ["code_polishment_primary.jsonl", "code_polishment_plus.jsonl"]),
302
+ ("code_switch", ["code_switch_primary.jsonl", "code_switch_plus.jsonl"]),
303
+ ]
304
+ for task_type, filenames in ceb_files:
305
+ try:
306
+ rows = _load_jsonl_dataset("m-a-p/CodeEditorBench", filenames)
307
+ for d in rows:
308
+ d["_task_type"] = task_type
309
+ if "difficulty" in d:
310
+ d["difficulty"] = str(d["difficulty"])
311
+ ceb_rows.extend(rows)
312
+ except Exception:
313
+ pass # skip task types that fail
314
+ if ceb_rows:
315
+ ceb_sampled = _sample_list(ceb_rows)
316
+ adapter = CodeEditorBenchAdapter(ceb_sampled)
317
+ adapter.total_count = len(ceb_rows)
318
+ REGISTRY["codeeditorbench"] = adapter
319
+ print(f"Loaded CodeEditorBench: {len(ceb_sampled)} problems (of {len(ceb_rows)})")
320
+ else:
321
+ print("Warning: could not load any CodeEditorBench task types")
322
+ except Exception as e:
323
+ print(f"Warning: could not load CodeEditorBench: {e}")
324
+
325
+ # --- Batch 4 datasets ---
326
+
327
+ try:
328
+ swe_v = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
329
+ REGISTRY["swebenchverified"] = SWEBenchVerifiedAdapter(swe_v)
330
+ print(f"Loaded SWE-bench Verified: {len(swe_v)} problems")
331
+ except Exception as e:
332
+ print(f"Warning: could not load SWE-bench Verified: {e}")
333
+
334
+ try:
335
+ csn_full = load_dataset("code-search-net/code_search_net", "python", split="test")
336
+ csn = _sample_hf_dataset(csn_full)
337
+ adapter = CodeSearchNetAdapter(csn)
338
+ adapter.total_count = len(csn_full)
339
+ REGISTRY["codesearchnet"] = adapter
340
+ print(f"Loaded CodeSearchNet: {len(csn)} problems (of {len(csn_full)})")
341
+ except Exception as e:
342
+ print(f"Warning: could not load CodeSearchNet: {e}")
343
+
344
+ try:
345
+ devign_full = load_dataset("google/code_x_glue_cc_defect_detection", split="test")
346
+ devign = _sample_hf_dataset(devign_full)
347
+ adapter = DevignAdapter(devign)
348
+ adapter.total_count = len(devign_full)
349
+ REGISTRY["devign"] = adapter
350
+ print(f"Loaded Devign: {len(devign)} problems (of {len(devign_full)})")
351
+ except Exception as e:
352
+ print(f"Warning: could not load Devign: {e}")
353
+
354
+ # --- Batch 5 datasets ---
355
+
356
+ try:
357
+ bcb = load_dataset("bigcode/bigcodebench", split="v0.1.4")
358
+ REGISTRY["bigcodebench"] = BigCodeBenchAdapter(bcb)
359
+ print(f"Loaded BigCodeBench: {len(bcb)} problems")
360
+ except Exception as e:
361
+ print(f"Warning: could not load BigCodeBench: {e}")
362
+
363
+ try:
364
+ hep_datasets = {}
365
+ for lang in HumanEvalPackAdapter.LANGUAGES:
366
+ hep_datasets[lang] = load_dataset("bigcode/humanevalpack", lang, split="test")
367
+ REGISTRY["humanevalpack"] = HumanEvalPackAdapter(hep_datasets)
368
+ print(
369
+ f"Loaded HumanEvalPack: {len(hep_datasets)} languages, "
370
+ f"{len(hep_datasets[next(iter(hep_datasets))])} problems each"
371
+ )
372
+ except Exception as e:
373
+ print(f"Warning: could not load HumanEvalPack: {e}")
374
+
375
+ try:
376
+ cxr_full = load_dataset("google/code_x_glue_cc_code_refinement", "medium", split="test")
377
+ cxr = _sample_hf_dataset(cxr_full)
378
+ adapter = CodeXGLUERefinementAdapter(cxr)
379
+ adapter.total_count = len(cxr_full)
380
+ REGISTRY["codexgluerefinement"] = adapter
381
+ print(f"Loaded CodeXGLUE Code Refinement: {len(cxr)} problems (of {len(cxr_full)})")
382
+ except Exception as e:
383
+ print(f"Warning: could not load CodeXGLUE Code Refinement: {e}")
384
+
385
+ try:
386
+ swe_full_ds = load_dataset("princeton-nlp/SWE-bench", split="test")
387
+ swe_full = _sample_hf_dataset(swe_full_ds)
388
+ adapter = SWEBenchFullAdapter(swe_full)
389
+ adapter.total_count = len(swe_full_ds)
390
+ REGISTRY["swebenchfull"] = adapter
391
+ print(f"Loaded SWE-bench: {len(swe_full)} problems (of {len(swe_full_ds)})")
392
+ except Exception as e:
393
+ print(f"Warning: could not load SWE-bench: {e}")
394
+
395
+ try:
396
+ cb_full = load_dataset("Maxscha/commitbench", split="test")
397
+ cb = _sample_hf_dataset(cb_full)
398
+ adapter = CommitBenchAdapter(cb)
399
+ adapter.total_count = len(cb_full)
400
+ REGISTRY["commitbench"] = adapter
401
+ print(f"Loaded CommitBench: {len(cb)} problems (of {len(cb_full)})")
402
+ except Exception as e:
403
+ print(f"Warning: could not load CommitBench: {e}")
404
+
405
+ try:
406
+ effibench = load_dataset("DONG19/EffiBench", split="train")
407
+ REGISTRY["effibench"] = EffiBenchAdapter(effibench)
408
+ print(f"Loaded EffiBench: {len(effibench)} problems")
409
+ except Exception as e:
410
+ print(f"Warning: could not load EffiBench: {e}")
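The `_sample_list` and `_sample_hf_dataset` helpers called throughout the function above are defined elsewhere in the module and not shown in this hunk. A minimal standalone sketch of the deterministic sampling described in the commit notes ("seed=42, cap=1000"; the name `sample_rows` and both constants are assumptions here, not the module's actual code) could look like:

```python
import random

SAMPLE_SEED = 42    # assumed from the commit notes ("seed=42")
SAMPLE_CAP = 1000   # assumed from the commit notes ("cap=1000")


def sample_rows(rows, cap=SAMPLE_CAP, seed=SAMPLE_SEED):
    """Deterministically sample at most `cap` rows, preserving original order.

    Small datasets pass through untouched; large ones get the same subset on
    every process restart because the RNG is seeded with a fixed value.
    """
    if len(rows) <= cap:
        return list(rows)
    rng = random.Random(seed)
    # Sample indices, then sort so the subset keeps the dataset's ordering.
    idxs = sorted(rng.sample(range(len(rows)), cap))
    return [rows[i] for i in idxs]
```

Sorting the sampled indices is what lets the `total_count` dropdown label ("1000 of 33050") coexist with a stable, order-preserving problem list.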
adapters/vulnerability.py ADDED
@@ -0,0 +1,245 @@
+ """Vulnerability detection benchmark adapters (BigVul, DiverseVul, PrimeVul, Devign)."""
+
+ from __future__ import annotations
+
+ from typing import Any
+
+ from adapters import DatasetAdapter
+
+ # Injected at runtime by _set_helpers()
+ _highlight_code = None
+ _code_offset = None
+ _extract_test_classes = None
+
+
+ # ---------------------------------------------------------------------------
+ # BigVul adapter (HuggingFace: bstee615/bigvul)
+ # ---------------------------------------------------------------------------
+
+
+ class BigVulAdapter(DatasetAdapter):
+     slug = "bigvul"
+     display_name = "BigVul"
+     has_ground_truth = False
+     has_tasks = False
+
+     def __init__(self, hf_dataset):
+         self._ds = hf_dataset
+
+     def problem_count(self) -> int:
+         return len(self._ds)
+
+     def get_problem_summary(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         return {
+             "idx": idx,
+             "task_id": row.get("CVE_ID", str(idx)),
+             "entry_point": row.get("CVE_ID", f"bigvul_{idx}"),
+             "num_inputs": 0,
+             "source": row.get("CWE_ID", "unknown"),
+         }
+
+     def get_problem_detail(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         vuln_code = row.get("func_before", "")
+         fixed_code = row.get("func_after", "")
+         lang = row.get("lang", "c")
+         lang_key = {"C": "c", "Java": "java", "PHP": "php"}.get(lang, "c")
+         return {
+             "idx": idx,
+             "task_id": row.get("CVE_ID", str(idx)),
+             "entry_point": row.get("CVE_ID", f"bigvul_{idx}"),
+             "code": fixed_code,
+             "highlighted_code": _highlight_code(fixed_code, language=lang_key),
+             "inputs": [],
+             "outputs": [],
+             "test": None,
+             "tasks": [],
+             "source": row.get("CWE_ID", "unknown"),
+             "has_ground_truth": False,
+             "has_tasks": False,
+             "description": row.get("commit_message", ""),
+             "vulnerable_code": vuln_code,
+             "vulnerable_highlighted_code": _highlight_code(vuln_code, language=lang_key),
+             "patched_code": fixed_code,
+             "patched_highlighted_code": _highlight_code(fixed_code, language=lang_key),
+             "cwe_id": row.get("CWE_ID", ""),
+             "cve_id": row.get("CVE_ID", ""),
+             "project": row.get("project", ""),
+             "language": lang,
+             "is_vulnerable": bool(row.get("vul", 0)),
+         }
+
+
+ # ---------------------------------------------------------------------------
+ # DiverseVul adapter (HuggingFace: claudios/DiverseVul)
+ # ---------------------------------------------------------------------------
+
+
+ class DiverseVulAdapter(DatasetAdapter):
+     slug = "diversevul"
+     display_name = "DiverseVul"
+     has_ground_truth = False
+     has_tasks = False
+
+     def __init__(self, hf_dataset):
+         self._ds = hf_dataset
+
+     def problem_count(self) -> int:
+         return len(self._ds)
+
+     def get_problem_summary(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         cwe_list = row.get("cwe", [])
+         cwe_label = cwe_list[0] if cwe_list else "unknown"
+         label = "Vulnerable" if row.get("target", 0) == 1 else "Patched"
+         return {
+             "idx": idx,
+             "task_id": row.get("commit_id", str(idx))[:12],
+             "entry_point": row.get("project", f"diversevul_{idx}"),
+             "num_inputs": 0,
+             "source": f"{label}/{cwe_label}",
+         }
+
+     def get_problem_detail(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         code = row.get("func", "")
+         cwe_list = list(row.get("cwe", []))
+         is_vuln = row.get("target", 0) == 1
+         return {
+             "idx": idx,
+             "task_id": row.get("commit_id", str(idx))[:12],
+             "entry_point": row.get("project", f"diversevul_{idx}"),
+             "code": code,
+             "highlighted_code": _highlight_code(code, language="c"),
+             "inputs": [],
+             "outputs": [],
+             "test": None,
+             "tasks": [],
+             "source": "Vulnerable" if is_vuln else "Patched",
+             "has_ground_truth": False,
+             "has_tasks": False,
+             "description": row.get("message", ""),
+             "vulnerable_code": code if is_vuln else "",
+             "vulnerable_highlighted_code": _highlight_code(code, language="c") if is_vuln else "",
+             "patched_code": code if not is_vuln else "",
+             "patched_highlighted_code": (
+                 _highlight_code(code, language="c") if not is_vuln else ""
+             ),
+             "cwe_id": ", ".join(cwe_list) if cwe_list else "",
+             "project": row.get("project", ""),
+             "language": "C/C++",
+             "is_vulnerable": is_vuln,
+         }
+
+
+ # ---------------------------------------------------------------------------
+ # PrimeVul adapter (HuggingFace: starsofchance/PrimeVul)
+ # ---------------------------------------------------------------------------
+
+
+ class PrimeVulAdapter(DatasetAdapter):
+     slug = "primevul"
+     display_name = "PrimeVul"
+     has_ground_truth = False
+     has_tasks = False
+
+     def __init__(self, hf_dataset):
+         self._ds = hf_dataset
+
+     def problem_count(self) -> int:
+         return len(self._ds)
+
+     def get_problem_summary(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         label = "Vulnerable" if row.get("target", 0) == 1 else "Patched"
+         return {
+             "idx": idx,
+             "task_id": row.get("commit_id", str(idx))[:12],
+             "entry_point": row.get("project", f"primevul_{idx}"),
+             "num_inputs": 0,
+             "source": label,
+         }
+
+     def get_problem_detail(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         code = row.get("func", "")
+         is_vuln = row.get("target", 0) == 1
+         cwe_list = list(row.get("cwe", []))
+         return {
+             "idx": idx,
+             "task_id": row.get("commit_id", str(idx))[:12],
+             "entry_point": row.get("project", f"primevul_{idx}"),
+             "code": code,
+             "highlighted_code": _highlight_code(code, language="c"),
+             "inputs": [],
+             "outputs": [],
+             "test": None,
+             "tasks": [],
+             "source": "Vulnerable" if is_vuln else "Patched",
+             "has_ground_truth": False,
+             "has_tasks": False,
+             "description": row.get("commit_message", ""),
+             "vulnerable_code": code if is_vuln else "",
+             "vulnerable_highlighted_code": _highlight_code(code, language="c") if is_vuln else "",
+             "patched_code": code if not is_vuln else "",
+             "patched_highlighted_code": (
+                 _highlight_code(code, language="c") if not is_vuln else ""
+             ),
+             "cwe_id": ", ".join(cwe_list) if cwe_list else "",
+             "project": row.get("project", ""),
+             "language": "C/C++",
+             "is_vulnerable": is_vuln,
+         }
+
+
+ # ---------------------------------------------------------------------------
+ # Devign adapter (HuggingFace: google/code_x_glue_cc_defect_detection)
+ # ---------------------------------------------------------------------------
+
+
+ class DevignAdapter(DatasetAdapter):
+     slug = "devign"
+     display_name = "Devign"
+     has_ground_truth = False
+     has_tasks = False
+
+     def __init__(self, hf_dataset):
+         self._ds = hf_dataset
+
+     def problem_count(self) -> int:
+         return len(self._ds)
+
+     def get_problem_summary(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         label = "Vulnerable" if row.get("target", 0) == 1 else "Clean"
+         return {
+             "idx": idx,
+             "task_id": str(row.get("commit_id", idx))[:12],
+             "entry_point": row.get("project", f"devign_{idx}"),
+             "num_inputs": 0,
+             "source": label,
+         }
+
+     def get_problem_detail(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         code = row.get("func", "")
+         is_vuln = row.get("target", 0) == 1
+         return {
+             "idx": idx,
+             "task_id": str(row.get("commit_id", idx))[:12],
+             "entry_point": row.get("project", f"devign_{idx}"),
+             "code": code,
+             "highlighted_code": _highlight_code(code, language="c"),
+             "inputs": [],
+             "outputs": [],
+             "test": None,
+             "tasks": [],
+             "source": "Vulnerable" if is_vuln else "Clean",
+             "has_ground_truth": False,
+             "has_tasks": False,
+             "description": row.get("commit_message", ""),
+             "is_vulnerable": is_vuln,
+             "project": row.get("project", ""),
+             "language": "C",
+         }
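All four adapters map the raw `target` / `cwe` row fields onto the same badge strings shown in the problem list. That shared rule can be restated as a tiny standalone helper (`vuln_summary_label` is a hypothetical name introduced here for illustration, not part of the module):

```python
def vuln_summary_label(row: dict, clean_label: str = "Patched") -> str:
    """Render the list-view badge for a vulnerability-detection row.

    target == 1 means the function is the vulnerable version; anything else
    is the fixed/clean version. Devign would pass clean_label="Clean".
    The first CWE (when present) is appended, as DiverseVul does.
    """
    label = "Vulnerable" if row.get("target", 0) == 1 else clean_label
    cwe_list = list(row.get("cwe", []))
    cwe = cwe_list[0] if cwe_list else "unknown"
    return f"{label}/{cwe}"
```

This is only a restatement for readability; the adapters above inline the same two-branch logic per dataset.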
app.py CHANGED
@@ -12,7 +12,7 @@ import os
  from flask import Flask, jsonify, render_template, request
  from pygments import highlight
  from pygments.formatters import HtmlFormatter
- from pygments.lexers import PythonLexer
+ from pygments.lexers import PythonLexer, get_lexer_by_name

  app = Flask(__name__)

@@ -34,14 +34,16 @@ def _extract_test_classes(test_code: str, cls_name: str) -> list:
      lines = test_code.splitlines(keepends=True)
      prefix = f"{cls_name}Test"
      result = []
      for node in tree.body:  # top-level definitions, preserves source order
          if isinstance(node, _ast.ClassDef) and node.name.startswith(prefix):
              start = node.lineno - 1  # ast lineno is 1-indexed
              end = node.end_lineno  # end_lineno is inclusive; slice is exclusive
-             result.append({
-                 "name": node.name,
-                 "code": "".join(lines[start:end]),
-             })
+             result.append(
+                 {
+                     "name": node.name,
+                     "code": "".join(lines[start:end]),
+                 }
+             )
      return result

@@ -49,20 +51,22 @@ def _code_offset(code: str) -> int:
      """Number of leading newlines that Pygments will strip."""
      offset = 0
      for ch in code:
-         if ch == '\n':
+         if ch == "\n":
              offset += 1
          else:
              break
      return offset


- def highlight_code(code, highlight_lines=None):
+ def highlight_code(code, highlight_lines=None, language="python"):
      """
-     Syntax highlight Python code with optional line highlighting.
+     Syntax highlight code with optional line highlighting.

      Args:
-         code: The Python code to highlight
+         code: The source code to highlight
          highlight_lines: List of line numbers (1-indexed) to highlight
+         language: Programming language name (default: "python"); any alias
+             accepted by get_lexer_by_name, otherwise Python is assumed.

      Returns:
          HTML string with syntax highlighted code
@@ -70,7 +74,11 @@ def highlight_code(code, highlight_lines=None):
      formatter = HtmlFormatter(
          linenos="table", cssclass="source", hl_lines=highlight_lines or [], linenostart=1
      )
-     return highlight(code, PythonLexer(), formatter)
+     try:
+         lexer = get_lexer_by_name(language.lower())
+     except Exception:
+         lexer = PythonLexer()
+     return highlight(code, lexer, formatter)


  def get_css():
@@ -82,7 +90,7 @@ def get_css():
  # Dataset adapter registration
  # ---------------------------------------------------------------------------

- from dataset_adapters import REGISTRY, _set_helpers, register_hf_datasets
+ from adapters import REGISTRY, _set_helpers, register_hf_datasets  # noqa: E402

  # Inject helper functions into the adapters module (avoids circular imports)
  _set_helpers(highlight_code, _code_offset, _extract_test_classes)
@@ -100,6 +108,7 @@ def _get_adapter(dataset_slug: str):
  # Routes
  # ---------------------------------------------------------------------------

+
  @app.route("/")
  def index():
      """Main page showing list of all benchmark problems."""
@@ -109,15 +118,18 @@ def index():
  @app.route("/api/datasets")
  def get_datasets():
      """Return list of available datasets for the UI dataset selector."""
-     return jsonify([
-         {
-             "slug": slug,
-             "display_name": adapter.display_name,
-             "problem_count": adapter.problem_count(),
-             "has_ground_truth": adapter.has_ground_truth,
-         }
-         for slug, adapter in REGISTRY.items()
-     ])
+     return jsonify(
+         [
+             {
+                 "slug": slug,
+                 "display_name": adapter.display_name,
+                 "problem_count": adapter.problem_count(),
+                 "total_count": adapter.total_count,
+                 "has_ground_truth": adapter.has_ground_truth,
+             }
+             for slug, adapter in REGISTRY.items()
+         ]
+     )


  @app.route("/api/<dataset_slug>/problems")
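The `highlight_code` change above swaps the hard-coded `PythonLexer()` for `get_lexer_by_name(language.lower())`, catching any lookup failure and falling back to Python. The same degrade-gracefully pattern can be shown without a pygments dependency (`resolve_lexer_name` and the toy `KNOWN_LEXERS` set are hypothetical; the real app relies on pygments' own alias registry):

```python
# Toy stand-in for pygments' alias registry, for illustration only.
KNOWN_LEXERS = frozenset({"python", "c", "cpp", "java", "php", "go", "javascript"})


def resolve_lexer_name(language):
    """Lower-case the requested name; unknown or missing names degrade to "python"."""
    name = (language or "python").lower()
    return name if name in KNOWN_LEXERS else "python"
```

Falling back rather than raising matters here because adapters pass through dataset-supplied language tags, which are not guaranteed to be valid pygments aliases.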
benchmarks_analysis.csv ADDED
@@ -0,0 +1,38 @@
+ benchmark,category,year,size,languages,hf_dataset_id,data_access,visualization_complexity,influence,priority_score,batch,notes
+ MBPP+,Code Generation,2023,399,Python,evalplus/mbppplus,easy,simple,high,9,1,Natural companion to HumanEval+; same EvalPlus ecosystem
+ ClassEval,Code Generation,2023,100 classes (410 methods),Python,FudanSELab/ClassEval,easy,moderate,high,9,1,Class-level code generation with test classes
+ LiveCodeBench,Code Generation,2024,1055+,Python,livecodebench/code_generation_lite,easy,moderate,high,9,1,Continuously updated; contamination-free; high community interest
+ DebugBench,Code Editing/Debugging,2024,4253,"C++, Java, Python",Rtian/DebugBench,easy,moderate,high,8,1,Buggy code with implanted bugs; 4 categories; 18 minor types
+ HumanEval-X,Code Translation,2022,820 (164x5),"Python, C++, Java, JS, Go",THUDM/humaneval-x,easy,moderate,high,8,1,Same 164 problems in 5 languages with test cases
+ SWE-bench Lite,Code Editing,2024,300,Python,princeton-nlp/SWE-bench_Lite,easy,complex,very high,8,2,GitHub issue resolution; extremely high-profile
+ CodeContests,Code Generation,2022,13328,"C++, Python, Java",deepmind/code_contests,easy,moderate,high,8,2,AlphaCode benchmark; competitive programming
+ APPS,Code Generation,2021,10000,Python,codeparrot/apps,easy,moderate,high,7,2,Large-scale coding problems at 3 difficulty levels
+ CanItEdit,Code Editing,2023,105,Python,nuprl/CanItEdit,easy,simple,medium,7,2,Before/after code editing with dual instruction types
+ MBPP,Code Generation,2021,974,Python,google-research-datasets/mbpp,easy,simple,high,7,2,Original MBPP; foundational benchmark
+ DS-1000,Code Generation,2023,1000,Python,xlangai/DS-1000,easy,moderate,high,7,3,Data science library-specific problems (NumPy/Pandas/etc.)
+ CodeEditorBench,Code Editing,2024,7961,Multiple,m-a-p/CodeEditorBench,easy,moderate,medium,7,3,4 editing scenarios: debug/translate/polish/requirement switch
+ SAFIM,Code Completion,2024,17720,"Python, Java, C++, C#",gonglinyuan/safim,easy,moderate,medium,7,3,Syntax-aware fill-in-the-middle; 3 subtasks
+ BigVul,Vulnerability Detection,2020,190000,C/C++,bstee615/bigvul,easy,moderate,medium,6,3,CVE-linked vulnerability detection; 91 CWE types
+ RepoBench,Code Completion,2023,10000+,"Python, Java",tianyang/repobench-c,easy,complex,medium,6,3,Repo-level code completion with 3 sub-tasks
+ MultiPL-E,Code Generation/Translation,2023,HumanEval+MBPP in 22 langs,22 languages,nuprl/MultiPL-E,easy,moderate,medium,6,4,Translations of HumanEval/MBPP to 22 languages
+ DiverseVul,Vulnerability Detection,2023,350000+,C/C++,claudios/DiverseVul,easy,simple,medium,6,4,Large-scale vulnerability detection; 150 CWEs
+ PrimeVul,Vulnerability Detection,2024,236000+,C/C++,starsofchance/PrimeVul,easy,simple,medium,6,4,Highest quality labels for vuln detection
+ McEval,Code Generation,2024,16000,40 languages,Multilingual-Multimodal-NLP/McEval,easy,complex,medium,6,4,Massive language coverage
+ CodeSearchNet,Code Search/Summarization,2019,2000000,"Python, JS, Ruby, Go, Java, PHP",code-search-net/code_search_net,easy,moderate,medium,6,4,Foundational code search benchmark
+ xCodeEval,Multi-task,2023,25000000,11-17 languages,NTU-NLP-sg/xCodeEval,easy,very complex,medium,5,5,7 tasks; very large; complex format
+ Devign,Vulnerability Detection,2019,20756,C,google/code_x_glue_cc_defect_detection,easy,simple,medium,5,5,Function-level vulnerability identification
+ CrossVul,Vulnerability Detection,2021,9313,40+ languages,hitoshura25/crossvul,easy,simple,medium,5,5,Cross-language vulnerability detection
+ SWE-bench Verified,Code Editing,2024,500,Python,princeton-nlp/SWE-bench_Verified,easy,complex,high,5,5,Curated subset of SWE-bench
+ CoderEval,Code Generation,2023,460,"Python, Java",N/A (GitHub only),medium,complex,medium,4,deferred,Requires project-level context
+ NaturalCodeBench,Code Generation,2024,402,"Python, Java",N/A (GitHub only),medium,moderate,medium,4,deferred,Only dev set released (140 problems)
+ DevEval,Code Generation,2024,1874,Python,N/A (GitHub only),medium,complex,medium,4,deferred,Repository-level; complex dependencies
+ RunBugRun,Program Repair,2023,450000+,9 languages,N/A (GitHub/SQLite),hard,complex,medium,3,deferred,SQLite format; complex infrastructure
+ Defects4J,Program Repair,2014,854,Java,N/A (GitHub only),hard,very complex,high,3,deferred,Requires Java tooling; full project repos
+ ConDefects,Program Repair,2023,2879,"Java, Python",N/A (GitHub only),medium,moderate,medium,3,deferred,AtCoder buggy/fixed pairs
+ FixEval,Program Repair,2023,varies,"Python, Java",N/A (GitHub only),medium,moderate,low,3,deferred,Competitive programming fixes
+ TransCoder,Code Translation,2020,852,"Java, Python, C++",N/A (GitHub only),medium,moderate,medium,3,deferred,Facebook Research; unsupervised translation
+ AVATAR,Code Translation,2021,9515,"Java, Python",N/A (GitHub only),medium,moderate,low,3,deferred,Parallel Java-Python corpus
+ TypeEvalPy,Type Inference,2023,154,Python,N/A (GitHub only),medium,moderate,low,3,deferred,Niche; type inference evaluation
+ VJBench,Vulnerability Repair,2023,42,Java,N/A (GitHub only),hard,complex,low,2,deferred,Very small; requires Java tooling
+ SVEN,Vulnerability Detection,2023,1606,C/C++,N/A (GitHub only),medium,moderate,low,2,deferred,Small; security hardening focus
+ PyTER,Type Error Repair,2022,93,Python,N/A (Figshare),hard,complex,low,2,deferred,Very small; niche
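The CSV's `batch` column drives the rollout order described in the commit message (batches 1-5 plus deferred entries). A stdlib-only sketch of tallying it (`batch_counts` and the three-row `SAMPLE` excerpt are illustrative, not part of the repo):

```python
import csv
import io
from collections import Counter

# Tiny excerpt with a subset of the real header; the full file has 37 data rows.
SAMPLE = """benchmark,category,batch
MBPP+,Code Generation,1
SWE-bench Lite,Code Editing,2
Defects4J,Program Repair,deferred
"""


def batch_counts(csv_text: str) -> Counter:
    """Count how many benchmarks fall into each rollout batch."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["batch"] for row in reader)
```

Run against the full `benchmarks_analysis.csv`, this kind of tally is what separates the 28 shipped datasets from the deferred ones.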
static/problem.css ADDED
@@ -0,0 +1,587 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
+ .problem-header {
+ display: flex;
+ justify-content: space-between;
+ align-items: center;
+ margin-bottom: 15px;
+ }
+
+ .problem-meta {
+ margin-bottom: 20px;
+ }
+
+ .meta-item {
+ display: inline-block;
+ margin-right: 15px;
+ margin-bottom: 10px;
+ }
+
+ .meta-label {
+ font-weight: 600;
+ color: #7f8c8d;
+ margin-right: 5px;
+ }
+
+ .meta-value {
+ color: #2c3e50;
+ }
+
+ .task-selector {
+ margin: 20px 0;
+ display: flex;
+ gap: 10px;
+ flex-wrap: wrap;
+ }
+
+ .task-btn {
+ padding: 10px 20px;
+ background: #ecf0f1;
+ border: 2px solid transparent;
+ border-radius: 4px;
+ cursor: pointer;
+ transition: all 0.3s;
+ font-size: 0.95rem;
+ }
+
+ .task-btn:hover {
+ background: #bdc3c7;
+ }
+
+ .task-btn.active {
+ background: #3498db;
+ color: white;
+ border-color: #2980b9;
+ }
+
+ .task-details {
+ margin-top: 20px;
+ }
+
+ .task-section {
+ margin-bottom: 25px;
+ padding: 15px;
+ background: #f8f9fa;
+ border-left: 4px solid #3498db;
+ border-radius: 4px;
+ }
+
+ .task-section h3 {
+ margin-bottom: 10px;
+ color: #2c3e50;
+ font-size: 1.1rem;
+ }
+
+ .code-block {
+ background: #f8f9fa;
+ padding: 15px;
+ border-radius: 4px;
+ overflow-x: auto;
+ font-family: 'Monaco', 'Menlo', 'Ubuntu Mono', monospace;
+ font-size: 0.9rem;
+ border: 1px solid #e1e4e8;
+ }
+
+ .task-items-list {
+ list-style: none;
+ }
+
+ .task-items-list li {
+ padding: 10px;
+ margin-bottom: 8px;
+ background: white;
+ border-radius: 4px;
+ border: 1px solid #e1e4e8;
+ }
+
+ .line-ref {
+ display: inline-block;
+ padding: 2px 8px;
+ background: #3498db;
+ color: white;
+ border-radius: 3px;
+ font-family: monospace;
+ font-size: 0.85rem;
+ margin-right: 8px;
+ }
+
+ .var-name {
+ display: inline-block;
+ padding: 2px 8px;
+ background: #9b59b6;
+ color: white;
+ border-radius: 3px;
+ font-family: monospace;
+ font-size: 0.85rem;
+ }
+
+ .io-section {
+ display: grid;
+ grid-template-columns: 1fr 1fr;
+ gap: 15px;
+ }
+
+ @media (max-width: 768px) {
+ .io-section {
+ grid-template-columns: 1fr;
+ }
+ }
+
+ .navigation-hint {
+ margin-top: 20px;
+ padding: 15px;
+ background: #e8f4f8;
+ border-radius: 4px;
+ color: #2c3e50;
+ font-size: 0.9rem;
+ }
+
+ .test-code-section {
+ margin-top: 20px;
+ }
+
+ /* Inline task visualization */
+ .code-with-tasks {
+ position: relative;
+ }
+
+ .task-marker {
+ display: inline-block;
+ margin-left: 10px;
+ padding: 2px 8px;
+ background: #9b59b6;
+ color: white;
+ border-radius: 3px;
+ font-size: 0.75rem;
+ font-weight: 600;
+ cursor: crosshair;
+ }
+
+ /* Coverage coloring on lineno spans */
+ td.linenos .normal.line-executed {
+ background-color: #d4edda !important;
+ color: #155724 !important;
+ }
+
+ td.linenos .normal.line-not-executed {
+ background-color: #f8d7da !important;
+ color: #721c24 !important;
+ }
+
+ /* Coverage legend */
+ .coverage-legend {
+ margin: 10px 0;
+ padding: 10px 15px;
+ background: #f8f9fa;
+ border-left: 4px solid #28a745;
+ border-radius: 4px;
+ font-size: 0.85rem;
+ display: none;
+ }
+
+ .coverage-legend-item {
+ display: inline-block;
+ margin-right: 18px;
+ }
+
+ .coverage-swatch {
+ display: inline-block;
+ width: 12px;
+ height: 12px;
+ border-radius: 2px;
+ margin-right: 4px;
+ vertical-align: middle;
+ }
+
+ /* Ground truth answer badge */
+ .gt-answer {
+ display: inline-block;
+ margin-left: 10px;
+ padding: 2px 8px;
+ background: #17a2b8;
+ color: white;
+ border-radius: 3px;
+ font-family: monospace;
+ font-size: 0.82rem;
+ font-weight: 600;
+ }
+
+ .gt-answer.loading {
+ background: #6c757d;
+ }
+
+ /* SVG arrow overlay */
+ #arrow-overlay {
+ position: absolute;
+ top: 0;
+ left: 0;
+ width: 100%;
+ height: 100%;
+ pointer-events: none;
+ overflow: visible;
+ z-index: 10;
+ }
+
+ .exec-arrow {
+ fill: none;
+ stroke: #e67e22;
+ stroke-width: 2.5;
+ stroke-dasharray: none;
+ opacity: 0.9;
+ }
+
+ .exec-arrow-head {
+ fill: #e67e22;
+ opacity: 0.9;
+ }
+
+ /* CRUXEval answer highlight */
+ .crux-answer {
+ border-left: 4px solid #17a2b8 !important;
+ background: #e8f6f8 !important;
+ }
+
+ /* Before/after diff view */
+ .diff-container {
+ display: grid;
+ grid-template-columns: 1fr 1fr;
+ gap: 20px;
+ }
+
+ @media (max-width: 1024px) {
+ .diff-container {
+ grid-template-columns: 1fr;
+ }
+ }
+
+ .diff-panel {
+ overflow-x: auto;
+ }
+
+ .diff-panel h3 {
+ margin-bottom: 10px;
+ font-size: 1.1rem;
+ }
+
+ .diff-panel h3 .diff-label-buggy {
+ color: #e74c3c;
+ }
+
+ .diff-panel h3 .diff-label-fixed {
+ color: #27ae60;
+ }
+
+ .bug-info {
+ margin-bottom: 15px;
+ padding: 12px 15px;
+ border-left: 4px solid #e74c3c;
+ background: #fdf2f2;
+ border-radius: 4px;
+ }
+
+ .bug-info .bug-category {
+ display: inline-block;
+ padding: 2px 8px;
+ background: #e74c3c;
+ color: white;
+ border-radius: 3px;
+ font-size: 0.82rem;
+ font-weight: 600;
+ margin-right: 8px;
+ }
+
+ .bug-info .bug-subtype {
+ display: inline-block;
+ padding: 2px 8px;
+ background: #c0392b;
+ color: white;
+ border-radius: 3px;
+ font-size: 0.82rem;
+ font-weight: 600;
+ }
+
+ /* Multi-language view */
+ .lang-tabs {
+ display: flex;
+ gap: 0;
+ border-bottom: 2px solid #e1e4e8;
+ margin-bottom: 0;
+ }
+
+ .lang-tab {
+ padding: 10px 20px;
+ background: #f8f9fa;
+ border: 1px solid #e1e4e8;
+ border-bottom: none;
+ cursor: pointer;
+ font-size: 0.95rem;
+ font-weight: 500;
+ transition: all 0.2s;
+ border-radius: 4px 4px 0 0;
+ margin-right: 2px;
+ }
+
+ .lang-tab:hover {
+ background: #e8f4f8;
+ }
+
+ .lang-tab.active {
+ background: white;
+ border-bottom: 2px solid white;
+ margin-bottom: -2px;
+ color: #3498db;
+ font-weight: 600;
+ }
+
+ .lang-code-panel {
+ display: none;
+ }
+
+ .lang-code-panel.active {
+ display: block;
+ }
+
+ /* BigOBench complexity display */
+ .complexity-badges {
+ display: flex;
+ gap: 20px;
+ flex-wrap: wrap;
+ }
+
+ .complexity-item {
+ display: flex;
+ align-items: center;
+ gap: 10px;
+ }
+
+ .complexity-label {
+ font-weight: 600;
+ color: #7f8c8d;
+ font-size: 0.95rem;
+ }
+
+ .complexity-value {
+ display: inline-block;
+ padding: 6px 16px;
+ background: #2c3e50;
+ color: #f1c40f;
+ border-radius: 4px;
+ font-family: 'Monaco', 'Menlo', 'Ubuntu Mono', monospace;
+ font-size: 1.1rem;
+ font-weight: 600;
+ }
+
+ /* Diff view (GitHub-style table with line numbers) */
+ .diff-view {
+ font-family: 'Monaco', 'Menlo', 'Ubuntu Mono', monospace;
+ font-size: 0.85rem;
+ line-height: 1.5;
+ overflow-x: auto;
+ border: 1px solid #e1e4e8;
+ border-radius: 4px;
+ }
+
+ .diff-table {
+ border-collapse: collapse;
+ width: 100%;
+ }
+
+ .diff-table td {
+ padding: 0 8px;
+ white-space: pre;
+ vertical-align: top;
+ }
+
+ .diff-ln {
+ width: 1%;
+ min-width: 40px;
+ color: #959da5;
+ text-align: right;
+ user-select: none;
+ font-size: 0.8rem;
+ padding: 0 6px !important;
+ border-right: 1px solid #e1e4e8;
+ }
+
+ .diff-tr-add td { background: #e6ffec; }
+ .diff-td-add { color: #24292e; }
+ .diff-tr-add .diff-ln { background: #ccffd8; color: #22863a; }
+
+ .diff-tr-del td { background: #ffebe9; }
+ .diff-td-del { color: #24292e; }
+ .diff-tr-del .diff-ln { background: #ffd7d5; color: #cb2431; }
+
+ .diff-tr-ctx td { background: white; }
+ .diff-td-ctx { color: #586069; }
+
+ .diff-tr-hunk td {
+ background: #f1f8ff;
+ color: #0366d6;
+ font-weight: 600;
+ padding: 4px 8px;
+ }
+
+ .diff-tr-header td {
+ background: #fafbfc;
+ color: #6a737d;
+ font-weight: 600;
+ padding: 4px 8px;
+ border-bottom: 1px solid #e1e4e8;
+ }
+
+ /* Diff file sections (GitHub-style per-file headers) */
+ .diff-file-section {
+ margin-bottom: 16px;
+ border: 1px solid #d0d7de;
+ border-radius: 6px;
+ overflow: hidden;
+ }
+
+ .diff-file-section .diff-view {
+ border: none;
+ border-radius: 0;
+ }
+
+ .diff-file-header {
+ display: flex;
+ justify-content: space-between;
+ align-items: center;
+ padding: 8px 12px;
+ background: #f6f8fa;
+ border-bottom: 1px solid #d0d7de;
+ font-family: 'Monaco', 'Menlo', 'Ubuntu Mono', monospace;
+ font-size: 0.85rem;
+ }
+
+ .diff-file-path {
+ color: #24292f;
+ font-weight: 600;
+ word-break: break-all;
+ }
+
+ .diff-file-stats {
+ white-space: nowrap;
+ margin-left: 12px;
+ font-size: 0.8rem;
+ }
+
+ .diff-stat-add { color: #1a7f37; font-weight: 600; }
+ .diff-stat-del { color: #cf222e; font-weight: 600; margin-left: 6px; }
+
+ /* GitHub links bar */
+ .gh-links-bar {
+ display: flex;
+ gap: 12px;
+ align-items: center;
+ flex-wrap: wrap;
+ }
+
+ .gh-link {
+ display: inline-block;
+ padding: 6px 14px;
+ background: #f6f8fa;
+ border: 1px solid #d0d7de;
+ border-radius: 6px;
+ color: #0969da;
+ text-decoration: none;
+ font-size: 0.9rem;
+ font-weight: 500;
+ transition: background 0.15s, border-color 0.15s;
+ }
+
+ .gh-link:hover {
+ background: #ddf4ff;
+ border-color: #0969da;
+ }
+
+ /* Issue / problem statement */
+ .issue-statement {
+ line-height: 1.7;
+ padding: 10px;
+ white-space: pre-wrap;
+ word-wrap: break-word;
+ max-height: 500px;
+ overflow-y: auto;
+ background: #f8f9fa;
+ border: 1px solid #e1e4e8;
+ border-radius: 4px;
+ font-size: 0.9rem;
+ }
+
+ .test-id-list {
+ list-style: none;
+ padding: 0;
+ }
+
+ .test-id-list li {
+ padding: 4px 8px;
+ margin-bottom: 4px;
+ background: #f8f9fa;
+ border-radius: 3px;
+ font-family: monospace;
+ font-size: 0.82rem;
+ border-left: 3px solid #e74c3c;
+ }
+
+ .test-id-list li.pass-to-pass {
+ border-left-color: #27ae60;
+ }
+
+ /* Fill-in-the-Middle (SAFIM) view */
+ .fim-hole-marker {
+ display: inline-block;
+ padding: 4px 16px;
+ background: #e74c3c;
+ color: white;
+ border-radius: 4px;
+ font-family: monospace;
+ font-weight: 600;
+ font-size: 0.9rem;
+ margin: 4px 0;
+ }
+
+ .fim-answer {
+ padding: 15px;
+ background: #e8f6e8;
+ border-left: 4px solid #27ae60;
+ border-radius: 4px;
+ font-family: monospace;
+ font-size: 0.9rem;
+ }
+
+ .fim-merged-legend {
+ margin: 8px 0;
+ padding: 6px 12px;
+ background: #f8f9fa;
+ border-radius: 4px;
+ font-size: 0.85rem;
+ color: #555;
+ }
+
+ /* Vulnerability view */
+ .vuln-status {
+ display: inline-block;
+ padding: 4px 12px;
+ border-radius: 4px;
+ font-size: 0.85rem;
+ font-weight: 600;
+ }
+
+ .vuln-status-vulnerable {
+ background: #e74c3c;
+ color: white;
+ }
+
+ .vuln-status-patched {
+ background: #27ae60;
+ color: white;
+ }
+
+ .cwe-badge {
+ display: inline-block;
+ padding: 4px 12px;
+ background: #2c3e50;
+ color: #e74c3c;
+ border-radius: 4px;
+ font-family: monospace;
+ font-size: 0.85rem;
+ font-weight: 600;
+ }
static/problem.js ADDED
@@ -0,0 +1,1313 @@
+ /* global problemIdx, datasetSlug, datasetName, hasGroundTruth, hasTasks */
+
+ function badgeClass(source) {
+ return 'badge-' + source.toLowerCase().replace(/[^a-z0-9]/g, '');
+ }
+
+ async function loadProblem() {
+ try {
+ const response = await fetch(`/api/${datasetSlug}/problem/${problemIdx}`);
+ const problem = await response.json();
+
+ if (problem.error) {
+ document.getElementById('problem-content').innerHTML =
+ '<div class="card"><p style="color: red;">Error: ' + problem.error + '</p></div>';
+ return;
+ }
+
+ renderProblem(problem);
+ } catch (error) {
+ document.getElementById('problem-content').innerHTML =
+ '<div class="card"><p style="color: red;">Error loading problem: ' + error.message + '</p></div>';
+ }
+ }
+
+ function renderProblem(problem) {
+ const container = document.getElementById('problem-content');
+
+ // Main problem info card (shared by all datasets)
+ let html = `
+ <div class="card">
+ <div class="problem-header">
+ <h2>${escapeHtml(problem.entry_point)}</h2>
+ <span class="badge ${badgeClass(problem.source)}">${escapeHtml(problem.source)}</span>
+ </div>
+ <div class="problem-meta">
+ <div class="meta-item">
+ <span class="meta-label">Task ID:</span>
+ <span class="meta-value">${escapeHtml(problem.task_id)}</span>
+ </div>
+ <div class="meta-item">
+ <span class="meta-label">Index:</span>
+ <span class="meta-value">${problem.idx}</span>
+ </div>
+ <div class="meta-item">
+ <span class="meta-label">Dataset:</span>
+ <span class="meta-value">${escapeHtml(datasetName)}</span>
+ </div>
+ ${problem.inputs.length > 0 ? `
+ <div class="meta-item">
+ <span class="meta-label">Test Inputs:</span>
+ <span class="meta-value">${problem.inputs.length}</span>
+ </div>` : ''}
+ </div>
+ </div>
+ `;
+
+ // --- BigOBench view (problem description + per-solution code & complexity) ---
+ if (problem.solutions && problem.solutions.length > 0) {
+ // Problem description
+ if (problem.description) {
+ html += `
+ <div class="card">
+ <h2>Problem Statement</h2>
+ <pre class="code-block" style="white-space: pre-wrap;">${escapeHtml(problem.description)}</pre>
+ </div>
+ `;
+ }
+
+ // Each solution: code + complexity/language badges
+ problem.solutions.forEach((sol, i) => {
+ html += `
+ <div class="card">
+ <h2>Solution ${i + 1} <span style="font-size:0.8rem;color:#7f8c8d;font-weight:400;">${escapeHtml(sol.solution_id)}</span></h2>
+ <div class="complexity-badges" style="margin-bottom: 15px;">
+ `;
+ if (sol.language) {
+ html += `
+ <div class="complexity-item">
+ <span class="complexity-label">Language</span>
+ <span class="badge badge-info">${escapeHtml(sol.language)}</span>
+ </div>`;
+ }
+ if (sol.time_complexity) {
+ html += `
+ <div class="complexity-item">
+ <span class="complexity-label">Time</span>
+ <span class="complexity-value">${escapeHtml(sol.time_complexity)}</span>
+ </div>`;
+ }
+ if (sol.space_complexity) {
+ html += `
+ <div class="complexity-item">
+ <span class="complexity-label">Space</span>
+ <span class="complexity-value">${escapeHtml(sol.space_complexity)}</span>
+ </div>`;
+ }
+ html += `
+ </div>
+ <div class="code-with-tasks">
+ ${sol.highlighted_code}
+ </div>
+ </div>
+ `;
+ });
+
+ // Navigation hint
+ html += `
+ <div class="navigation-hint">
+ <strong>Tip:</strong> Use the Previous/Next buttons at the top to browse through problems,
+ or return to the list view to filter by dataset source or search by name.
+ </div>
+ `;
+
+ container.innerHTML = html;
+ window.currentProblem = problem;
+ return;
+ }
+
+ // --- DebugBench before/after view (buggy → fixed) ---
+ if (problem.buggy_code !== undefined && problem.fixed_code !== undefined) {
+ // Problem description
+ if (problem.description) {
+ html += `
+ <div class="card">
+ <h2>Problem Statement</h2>
+ <pre class="code-block" style="white-space: pre-wrap;">${escapeHtml(problem.description)}</pre>
+ </div>
+ `;
+ }
+
+ // Bug info
+ html += `
+ <div class="card">
+ <h2>Bug Information</h2>
+ <div class="bug-info">
+ <span class="bug-category">${escapeHtml(problem.bug_category || '')}</span>
+ <span class="bug-subtype">${escapeHtml(problem.bug_subtype || '')}</span>
+ </div>
+ <p style="margin-top: 10px;">${escapeHtml(problem.bug_explanation || '')}</p>
+ `;
+ if (problem.difficulty) {
+ html += `<p style="margin-top: 8px; color: #7f8c8d;">Difficulty: <strong>${escapeHtml(problem.difficulty)}</strong></p>`;
+ }
+ html += `</div>`;
+
+ // Unified diff view of buggy → fixed
+ const unifiedDiff = computeUnifiedDiff(problem.buggy_code, problem.fixed_code);
+ html += `
+ <div class="card">
+ <h2>Changes</h2>
+ <div class="diff-view">${renderComputedDiff(unifiedDiff)}</div>
+ </div>
+ `;
+
+ // Side-by-side buggy/fixed code
+ html += `
+ <div class="card">
+ <h2>Full Code Comparison</h2>
+ <div class="diff-container">
+ <div class="diff-panel">
+ <h3><span class="diff-label-buggy">Before</span></h3>
+ <div class="code-with-tasks">${problem.buggy_highlighted_code}</div>
+ </div>
+ <div class="diff-panel">
+ <h3><span class="diff-label-fixed">After</span></h3>
+ <div class="code-with-tasks">${problem.fixed_highlighted_code}</div>
+ </div>
+ </div>
+ </div>
+ `;
+
+ // Examples
+ if (problem.examples && problem.examples.length > 0) {
+ html += `<div class="card"><h2>Examples</h2>`;
+ problem.examples.forEach((ex, i) => {
+ html += `<pre class="code-block" style="margin-bottom: 10px; white-space: pre-wrap;">${escapeHtml(ex)}</pre>`;
+ });
+ html += `</div>`;
+ }
+
+ html += `
+ <div class="navigation-hint">
+ <strong>Tip:</strong> Use the Previous/Next buttons at the top to browse through problems,
+ or return to the list view to filter by dataset source or search by name.
+ </div>
+ `;
+
+ container.innerHTML = html;
+ window.currentProblem = problem;
+ return;
+ }
+
+ // --- HumanEval-X / HumanEvalPack multi-language view ---
+ if (problem.lang_solutions && problem.lang_solutions.length > 0) {
+ // Check if this is HumanEvalPack (has buggy_code in solutions)
+ const hasBuggy = problem.lang_solutions.some(sol => sol.buggy_code);
+
+ // Bug info (HumanEvalPack only)
+ if (hasBuggy && (problem.bug_type || problem.failure_symptoms)) {
+ html += `
+ <div class="card">
+ <h2>Bug Information</h2>
+ <div class="bug-info">
+ ${problem.bug_type ? `<span class="bug-category">${escapeHtml(problem.bug_type)}</span>` : ''}
+ ${problem.failure_symptoms ? `<span class="bug-subtype">${escapeHtml(problem.failure_symptoms)}</span>` : ''}
+ </div>
+ </div>
+ `;
+ }
+
+ // Language tabs with code panels
+ html += `
+ <div class="card">
+ <h2>Source Code</h2>
+ `;
+
+ // Buggy/Canonical toggle for HumanEvalPack
+ if (hasBuggy) {
+ html += `
+ <div class="lang-tabs" id="code-mode-tabs" style="margin-bottom: 10px;">
+ <button class="lang-tab active" onclick="toggleCodeMode('canonical')">Canonical</button>
+ <button class="lang-tab" onclick="toggleCodeMode('buggy')">Buggy</button>
+ </div>
+ `;
+ }
+
+ html += `<div class="lang-tabs" id="lang-tabs">`;
+ problem.lang_solutions.forEach((sol, i) => {
+ const label = sol.language_label || sol.language;
+ html += `<button class="lang-tab ${i === 0 ? 'active' : ''}" onclick="showLangTab(${i})">${escapeHtml(label)}</button>`;
+ });
+ html += `</div>`;
+
+ problem.lang_solutions.forEach((sol, i) => {
+ html += `
+ <div class="lang-code-panel ${i === 0 ? 'active' : ''}" id="lang-panel-${i}">
+ <div class="code-with-tasks" id="lang-code-canonical-${i}">${sol.highlighted_code}</div>
+ ${sol.buggy_code ? `<div class="code-with-tasks" id="lang-code-buggy-${i}" style="display:none;">${sol.buggy_highlighted_code}</div>` : ''}
+ </div>
+ `;
+ });
+ html += `</div>`;
+
+ // Test suite for current language
+ html += `<div class="card" id="lang-test-container">`;
+ if (problem.lang_solutions[0].test) {
+ html += `<h2>Test Suite</h2><pre class="code-block">${escapeHtml(problem.lang_solutions[0].test)}</pre>`;
+ }
+ html += `</div>`;
+
+ // Description
+ if (problem.description) {
+ html += `
+ <div class="card">
+ <h2>Problem Description</h2>
+ <div style="padding: 10px; line-height: 1.6; white-space: pre-wrap;">${escapeHtml(problem.description)}</div>
+ </div>
+ `;
+ }
+
+ html += `
+ <div class="navigation-hint">
+ <strong>Tip:</strong> Use the Previous/Next buttons at the top to browse through problems,
+ or return to the list view to filter by dataset source or search by name.
+ </div>
+ `;
+
+ container.innerHTML = html;
+ window.currentProblem = problem;
+ window._currentCodeMode = 'canonical';
+ return;
+ }
+
+ // --- SWE-bench / CommitBench diff view (unified diff patch) ---
+ if (problem.patch !== undefined) {
+ // GitHub links bar (SWE-bench variants)
+ const ghLinks = [];
+ if (problem.repo_url) ghLinks.push(`<a href="${escapeHtml(problem.repo_url)}" target="_blank" class="gh-link">Repository</a>`);
+ if (problem.issue_url) ghLinks.push(`<a href="${escapeHtml(problem.issue_url)}" target="_blank" class="gh-link">Issue</a>`);
+ if (problem.commit_url) ghLinks.push(`<a href="${escapeHtml(problem.commit_url)}" target="_blank" class="gh-link">Base Commit</a>`);
+ if (ghLinks.length > 0) {
+ html += `<div class="card gh-links-bar">${ghLinks.join('')}</div>`;
+ }
+
+ // Metadata badges (version, date)
+ const metaBadges = [];
+ if (problem.version) metaBadges.push(`<span class="badge badge-info">v${escapeHtml(problem.version)}</span>`);
+ if (problem.created_at) metaBadges.push(`<span class="badge badge-info">${escapeHtml(problem.created_at.split('T')[0])}</span>`);
+ if (problem.commit_hash) metaBadges.push(`<span class="badge badge-info">${escapeHtml(problem.commit_hash.substring(0, 12))}</span>`);
+ if (problem.diff_languages) metaBadges.push(`<span class="badge badge-info">${escapeHtml(problem.diff_languages)}</span>`);
+ if (metaBadges.length > 0) {
+ html += `<div style="margin-bottom: 15px;">${metaBadges.join(' ')}</div>`;
+ }
+
+ // Problem statement (issue text / commit message)
+ if (problem.description) {
+ html += `
+ <div class="card">
+ <h2>${problem.issue_url ? 'Issue Description' : 'Description'}</h2>
+ <div class="issue-statement">${escapeHtml(problem.description)}</div>
+ </div>
+ `;
+ }
+
+ // Render unified diff with per-file sections
+ html += renderDiffFiles(problem.patch, 'Solution Patch');
+
+ // Test patch if available
+ if (problem.test_patch) {
+ html += renderDiffFiles(problem.test_patch, 'Test Patch');
+ }
+
+ // Failing tests
+ if (problem.fail_to_pass && problem.fail_to_pass.length > 0) {
+ html += `<div class="card"><h2>Tests: Fail → Pass</h2><ul class="test-id-list">`;
+ problem.fail_to_pass.forEach(t => {
+ html += `<li>${escapeHtml(t)}</li>`;
+ });
+ html += `</ul></div>`;
+ }
+
+ // Hints
+ if (problem.hints) {
+ html += `
+ <div class="card">
+ <h2>Hints</h2>
+ <div class="issue-statement">${escapeHtml(problem.hints)}</div>
+ </div>
+ `;
+ }
+
+ html += `
+ <div class="navigation-hint">
+ <strong>Tip:</strong> Use the Previous/Next buttons at the top to browse through problems,
+ or return to the list view to filter by dataset source or search by name.
+ </div>
+ `;
+
+ container.innerHTML = html;
+ window.currentProblem = problem;
+ return;
+ }
+
+ // --- SAFIM Fill-in-the-Middle view ---
+ if (problem.fim_ground_truth !== undefined) {
+ // Tab bar: "With Gap" | "Completed" | "Completion Only"
+ html += `
+ <div class="card">
+ <h2>Fill-in-the-Middle</h2>
+ <div class="lang-tabs" id="fim-tabs">
+ <button class="lang-tab" onclick="showFimTab(0)">With Gap</button>
+ <button class="lang-tab active" onclick="showFimTab(1)">Completed</button>
+ <button class="lang-tab" onclick="showFimTab(2)">Completion Only</button>
+ </div>
+ <div class="lang-code-panel" id="fim-panel-0">
+ <div class="code-with-tasks">${problem.highlighted_code}</div>
+ </div>
+ <div class="lang-code-panel active" id="fim-panel-1">
+ <div class="fim-merged-legend">
+ <span class="coverage-swatch" style="background:#ffffcc; border:1px solid #ccc;"></span>
+ Inserted completion (lines ${problem.fim_gt_start_line}&ndash;${problem.fim_gt_end_line})
+ </div>
+ <div class="code-with-tasks">${problem.fim_merged_highlighted}</div>
+ </div>
+ <div class="lang-code-panel" id="fim-panel-2">
+ <div class="fim-answer">${problem.fim_ground_truth_highlighted || escapeHtml(problem.fim_ground_truth)}</div>
+ </div>
+ </div>
+ `;
+
+ html += `
+ <div class="navigation-hint">
+ <strong>Tip:</strong> Use the Previous/Next buttons at the top to browse through problems,
+ or return to the list view to filter by dataset source or search by name.
+ </div>
+ `;
+ container.innerHTML = html;
+ window.currentProblem = problem;
+ return;
+ }
+
+ // --- Vulnerability view (BigVul, DiverseVul, PrimeVul) ---
+ if (problem.vulnerable_code !== undefined || problem.is_vulnerable !== undefined) {
+ // Vulnerability status and CWE info
+ const isVuln = problem.is_vulnerable;
+ html += `
+ <div class="card">
+ <h2>Vulnerability Information</h2>
+ <div style="margin-bottom: 10px;">
+ <span class="vuln-status ${isVuln ? 'vuln-status-vulnerable' : 'vuln-status-patched'}">
+ ${isVuln ? 'Vulnerable' : 'Patched'}
+ </span>
+ ${problem.cwe_id ? `<span class="cwe-badge">${escapeHtml(problem.cwe_id)}</span>` : ''}
+ ${problem.cve_id ? `<span class="badge badge-info">${escapeHtml(problem.cve_id)}</span>` : ''}
+ ${problem.project ? `<span class="badge badge-info">${escapeHtml(problem.project)}</span>` : ''}
+ </div>
+ ${problem.description ? `<p style="margin-top: 10px; color: #555;">${escapeHtml(problem.description).substring(0, 500)}</p>` : ''}
+ </div>
+ `;
+
+ // Show code with vuln/patched side-by-side if both available
+ if (problem.vulnerable_code && problem.patched_code) {
+ const vulnDiff = computeUnifiedDiff(problem.vulnerable_code, problem.patched_code);
+ html += `
+ <div class="card">
+ <h2>Changes</h2>
+ <div class="diff-view">${renderComputedDiff(vulnDiff)}</div>
+ </div>
+ `;
+ html += `
+ <div class="card">
+ <h2>Full Code Comparison</h2>
+ <div class="diff-container">
+ <div class="diff-panel">
+ <h3><span class="diff-label-buggy">Vulnerable</span></h3>
+ <div class="code-with-tasks">${problem.vulnerable_highlighted_code}</div>
+ </div>
+ <div class="diff-panel">
+ <h3><span class="diff-label-fixed">Patched</span></h3>
+ <div class="code-with-tasks">${problem.patched_highlighted_code}</div>
+ </div>
+ </div>
+ </div>
+ `;
+ } else {
+ // Single code view
+ html += `
+ <div class="card">
+ <h2>Source Code</h2>
+ <div class="code-with-tasks">${problem.highlighted_code}</div>
+ </div>
+ `;
+ }
+
+ html += `
+ <div class="navigation-hint">
+ <strong>Tip:</strong> Use the Previous/Next buttons at the top to browse through problems,
+ or return to the list view to filter by dataset source or search by name.
+ </div>
+ `;
+ container.innerHTML = html;
+ window.currentProblem = problem;
+ return;
+ }
+
+ // Source Code card
+ html += `
+ <div class="card">
+ <h2>Source Code</h2>
+ <div class="code-with-tasks" id="code-container">
+ ${problem.highlighted_code}
+ </div>
+ </div>
+ `;
+
+ // --- Non-DREval (simple) view ---
+ if (!hasTasks) {
+ // Show description if available (e.g. LiveCodeBench, MBPP+, ClassEval)
+ if (problem.description) {
+ html += `
+ <div class="card">
+ <h2>Problem Description</h2>
+ <div style="padding: 10px; line-height: 1.6; white-space: pre-wrap;">${escapeHtml(problem.description)}</div>
+ </div>
+ `;
+ }
+
+ // Show difficulty, contest date, tags, rating if available
+ if (problem.difficulty || problem.contest_date || problem.tags || problem.cf_rating) {
+ let metaHtml = '';
+ if (problem.difficulty) {
+ metaHtml += `<span class="badge badge-info">Difficulty: ${escapeHtml(problem.difficulty)}</span>`;
+ }
+ if (problem.cf_rating) {
+ metaHtml += `<span class="badge badge-info">Rating: ${problem.cf_rating}</span>`;
+ }
+ if (problem.contest_date) {
+ metaHtml += `<span class="badge badge-info">Date: ${escapeHtml(problem.contest_date.split('T')[0])}</span>`;
+ }
+ if (problem.tags && problem.tags.length > 0) {
+ problem.tags.forEach(tag => {
+ metaHtml += `<span class="badge badge-info">${escapeHtml(tag)}</span>`;
+ });
+ }
+ html += `<div style="margin-bottom: 15px;">${metaHtml}</div>`;
+ }
+
+ // Show inputs/outputs if available
+ if (problem.inputs && problem.inputs.length > 0) {
+ html += `<div class="card"><h2>Inputs &amp; Outputs</h2>`;
+ problem.inputs.forEach((inp, i) => {
+ const out = (problem.outputs && problem.outputs[i]) || '';
+ html += `
+ <div class="io-section" style="margin-bottom: 15px;">
+ <div class="task-section">
+ <h3>Input ${i + 1}</h3>
+ <pre class="code-block">${escapeHtml(inp)}</pre>
+ </div>
+ <div class="task-section">
+ <h3>Output</h3>
+ <pre class="code-block">${escapeHtml(out)}</pre>
+ </div>
+ </div>
+ `;
+ });
+ html += `</div>`;
+ }
+
+ // Show test suite if available
+ if (problem.test) {
+ html += `
+ <div class="card">
+ <h2>Test Suite</h2>
+ <pre class="code-block">${escapeHtml(problem.test)}</pre>
+ </div>
+ `;
+ }
519
+ // Navigation hint
520
+ html += `
521
+ <div class="navigation-hint">
522
+ <strong>Tip:</strong> Use the Previous/Next buttons at the top to browse through problems,
523
+ or return to the list view to filter by dataset source or search by name.
524
+ </div>
525
+ `;
526
+
527
+ container.innerHTML = html;
528
+ window.currentProblem = problem;
529
+ return;
530
+ }
531
+
+ // --- CRUXEval task view (tasks have given/predict fields, no task_items) ---
+ if (problem.tasks.length > 0 && problem.tasks[0].given !== undefined) {
+ // Task selector
+ html += `
+ <div class="card">
+ <h2>Tasks</h2>
+ <div class="task-selector" id="task-selector">
+ `;
+ problem.tasks.forEach((task, idx) => {
+ html += `
+ <button class="task-btn ${idx === 0 ? 'active' : ''}"
+ onclick="showCruxTask(${idx})">
+ ${escapeHtml(task.name)}
+ </button>
+ `;
+ });
+ html += `
+ </div>
+ <div id="task-content"></div>
+ </div>
+ `;
+
+ // Navigation hint
+ html += `
+ <div class="navigation-hint">
+ <strong>Tip:</strong> Use the Previous/Next buttons at the top to browse through problems,
+ or return to the list view to filter by dataset source or search by name.
+ </div>
+ `;
+
+ container.innerHTML = html;
+ window.currentProblem = problem;
+ showCruxTask(0);
+ return;
+ }
+
+ // --- DREval (full) view with tasks, coverage, arrows ---
+ // Rebuild html cleanly with coverage legend and SVG overlay
+ html = `
+ <div class="card">
+ <div class="problem-header">
+ <h2>${escapeHtml(problem.entry_point)}</h2>
+ <span class="badge ${badgeClass(problem.source)}">${escapeHtml(problem.source)}</span>
+ </div>
+ <div class="problem-meta">
+ <div class="meta-item">
+ <span class="meta-label">Task ID:</span>
+ <span class="meta-value">${escapeHtml(problem.task_id)}</span>
+ </div>
+ <div class="meta-item">
+ <span class="meta-label">Index:</span>
+ <span class="meta-value">${problem.idx}</span>
+ </div>
+ <div class="meta-item">
+ <span class="meta-label">Dataset:</span>
+ <span class="meta-value">${escapeHtml(datasetName)}</span>
+ </div>
+ <div class="meta-item">
+ <span class="meta-label">Test Inputs:</span>
+ <span class="meta-value">${problem.inputs.length}</span>
+ </div>
+ </div>
+ </div>
+
+ <div class="card">
+ <h2>Source Code</h2>
+ <div class="coverage-legend" id="coverage-legend">
+ <strong>Coverage:</strong>
+ <span class="coverage-legend-item">
+ <span class="coverage-swatch" style="background:#d4edda; border:1px solid #28a745;"></span>
+ Executed
+ </span>
+ <span class="coverage-legend-item">
+ <span class="coverage-swatch" style="background:#f8d7da; border:1px solid #dc3545;"></span>
+ Not executed
+ </span>
+ </div>
+ <div class="code-with-tasks" id="code-container">
+ ${problem.highlighted_code}
+ <svg id="arrow-overlay" xmlns="http://www.w3.org/2000/svg">
+ <defs>
+ <marker id="arrowhead" markerWidth="8" markerHeight="6"
+ refX="8" refY="3" orient="auto">
+ <polygon points="0 0, 8 3, 0 6" class="exec-arrow-head"/>
+ </marker>
+ </defs>
+ </svg>
+ </div>
+ </div>
+ `;
+
+ // Task selector
+ html += `
+ <div class="card">
+ <h2>Test Cases & Tasks</h2>
+ <p>Select a test input to view associated reasoning tasks:</p>
+ <div class="task-selector" id="task-selector">
+ `;
+
+ problem.tasks.forEach((task, idx) => {
+ html += `
+ <button class="task-btn ${idx === 0 ? 'active' : ''}"
+ onclick="showTask(${idx})">
+ Input ${task.input_idx + 1}
+ </button>
+ `;
+ });
+
+ html += `
+ </div>
+ <div id="task-content"></div>
+ </div>
+ `;
+
+ // Navigation hint
+ html += `
+ <div class="navigation-hint">
+ <strong>Tip:</strong> Use the Previous/Next buttons at the top to browse through problems,
+ or return to the list view to filter by dataset source or search by name.
+ </div>
+ `;
+
+ container.innerHTML = html;
+
+ // Store problem data globally
+ window.currentProblem = problem;
+
+ // Show first task by default
+ showTask(0);
+ }
+
+ function injectTaskMarkers(taskItems) {
+ const codePre = document.querySelector('.source .code pre');
+
+ // Save the pristine original innerHTML once, before any modification.
+ if (codePre && !window._codePreOriginalHtml) {
+ window._codePreOriginalHtml = codePre.innerHTML;
+ }
+
+ // Invalidate span cache (rebuilt lazily on next arrow draw)
+ window._linenoSpanCache = null;
+
+ // Store current task items so applyCoverage can re-add markers after wrapping.
+ window._currentTaskItems = taskItems || [];
+
+ // Reset code pre to original, then add markers from scratch.
+ if (codePre && window._codePreOriginalHtml) {
+ codePre.innerHTML = window._codePreOriginalHtml;
+ }
+
+ if (!taskItems || taskItems.length === 0) {
+ return;
+ }
+
+ // Group tasks by line number
+ const tasksByLine = {};
+ taskItems.forEach(item => {
+ if (!tasksByLine[item.lineno]) tasksByLine[item.lineno] = [];
+ tasksByLine[item.lineno].push(item.var);
+ });
+
+ // Inject task marker badges into the code pre
+ if (!codePre) return;
+ const codeLines = codePre.innerHTML.split('\n');
+ codePre.innerHTML = codeLines.map((line, idx) => {
+ const lineNum = idx + 1;
+ if (tasksByLine[lineNum] && line.trim() !== '') {
+ const vars = tasksByLine[lineNum];
+ return line + `<span class="task-marker" data-lineno="${lineNum}" data-vars="${escapeHtml(vars.join(', '))}">${escapeHtml(vars.join(', '))}</span>`;
+ }
+ return line;
+ }).join('\n');
+ }
+
+ function applyCoverage(coverageSet, totalLines) {
+ // Remove previous coverage classes from lineno spans.
+ document.querySelectorAll('td.linenos .normal').forEach(el => {
+ el.classList.remove('line-executed', 'line-not-executed');
+ });
+
+ if (!coverageSet) {
+ const legend = document.getElementById('coverage-legend');
+ if (legend) legend.style.display = 'none';
+ return;
+ }
+
+ const legend = document.getElementById('coverage-legend');
+ if (legend) legend.style.display = 'block';
+
+ // Color lineno spans only.
+ document.querySelectorAll('td.linenos .normal').forEach(span => {
+ const lineNum = parseInt(span.textContent.trim());
+ if (!isNaN(lineNum) && lineNum <= totalLines) {
+ span.classList.add(coverageSet.has(lineNum) ? 'line-executed' : 'line-not-executed');
+ }
+ });
+ }
+
+ // Global map: lineno -> list of next line numbers (1-indexed; -1 = end of trace)
+ window._nextLinesMap = {};
+
+ async function loadAndApplyGroundTruth(problemIdx, inputIdx, taskItems) {
+ // Show "loading" placeholders on all task items
+ taskItems.forEach(item => {
+ const el = document.getElementById(`gt-${item.lineno}-${item.var}`);
+ if (el) { el.textContent = '…'; el.className = 'gt-answer loading'; }
+ });
+
+ // Clear next-lines data from previous input
+ window._nextLinesMap = {};
+
+ try {
+ const resp = await fetch(`/api/${datasetSlug}/problem/${problemIdx}/ground_truth/${inputIdx}`);
+ const gt = await resp.json();
+
+ if (gt.status !== 'ok') {
+ taskItems.forEach(item => {
+ const el = document.getElementById(`gt-${item.lineno}-${item.var}`);
+ if (el) { el.textContent = gt.status === 'error' ? '(exec error)' : '(unavailable)'; el.className = 'gt-answer'; }
+ });
+ applyCoverage(null, 0);
+ return;
+ }
+
+ // Apply coverage highlighting
+ const coverageSet = new Set(gt.coverage);
+ applyCoverage(coverageSet, gt.total_lines);
+
+ // Fill in variable answers
+ const answerMap = {};
+ gt.variable_answers.forEach(a => {
+ answerMap[`${a.lineno}-${a.var}`] = a.answer_str;
+ });
+ taskItems.forEach(item => {
+ const el = document.getElementById(`gt-${item.lineno}-${item.var}`);
+ if (el) {
+ const answer = answerMap[`${item.lineno}-${item.var}`] || '(not available)';
+ el.textContent = answer;
+ el.className = 'gt-answer';
+ }
+ });
+
+ // Store next-lines data for arrow visualization
+ if (gt.next_lines_answers) {
+ gt.next_lines_answers.forEach(a => {
+ window._nextLinesMap[a.lineno] = a.next_lines;
+ });
+ }
+
+ // Attach hover handlers to task-marker spans now that we have next-lines data
+ attachArrowHoverHandlers();
+
+ } catch (e) {
+ taskItems.forEach(item => {
+ const el = document.getElementById(`gt-${item.lineno}-${item.var}`);
+ if (el) { el.textContent = '(error)'; el.className = 'gt-answer'; }
+ });
+ }
+ }
+
+ // Cache of lineNum → DOM span, rebuilt whenever injectTaskMarkers runs.
+ window._linenoSpanCache = null;
+
+ function buildLinenoSpanCache(container) {
+ const cache = {};
+ container.querySelectorAll('td.linenos .normal').forEach(span => {
+ const n = parseInt(span.textContent.trim());
+ if (!isNaN(n)) cache[n] = span;
+ });
+ window._linenoSpanCache = cache;
+ }
+
+ /**
+ * Get the bounding rect of the lineno span for a given 1-indexed line number,
+ * relative to the code container element. Uses a cached span map.
+ */
+ function getLinenoSpanRect(lineNum, container) {
+ if (!window._linenoSpanCache) buildLinenoSpanCache(container);
+ const span = window._linenoSpanCache[lineNum];
+ if (!span) return null;
+ const spanRect = span.getBoundingClientRect();
+ const containerRect = container.getBoundingClientRect();
+ return {
+ top: spanRect.top - containerRect.top + container.scrollTop,
+ bottom: spanRect.bottom - containerRect.top + container.scrollTop,
+ left: spanRect.left - containerRect.left,
+ right: spanRect.right - containerRect.left,
+ width: spanRect.width,
+ height: spanRect.height,
+ midY: (spanRect.top + spanRect.bottom) / 2 - containerRect.top + container.scrollTop,
+ };
+ }
+
+ /**
+ * Draw arrows from sourceLine to each of the targetLines in the SVG overlay.
+ * Lines are 1-indexed. -1 means "end of execution" (no arrow drawn).
+ */
+ function drawArrows(sourceLineNum, targetLineNums) {
+ const container = document.getElementById('code-container');
+ const svg = document.getElementById('arrow-overlay');
+ if (!container || !svg) return;
+
+ // Remove previous arrows (but keep defs)
+ svg.querySelectorAll('.arrow-path').forEach(el => el.remove());
+
+ const srcRect = getLinenoSpanRect(sourceLineNum, container);
+ if (!srcRect) return;
+
+ // Update SVG height to match container
+ svg.setAttribute('height', container.scrollHeight);
+
+ targetLineNums.forEach(targetLineNum => {
+ if (targetLineNum === -1) return; // end of trace — no arrow
+
+ const dstRect = getLinenoSpanRect(targetLineNum, container);
+ if (!dstRect) return;
+
+ // Start point: right edge of source lineno span, vertically centered
+ const x1 = srcRect.right + 2;
+ const y1 = srcRect.midY;
+
+ // End point: right edge of target lineno span, vertically centered
+ const x2 = dstRect.right + 2;
+ const y2 = dstRect.midY;
+
+ // Horizontal offset for the bezier control points — curves to the right
+ const curveOffset = Math.max(30, Math.abs(y2 - y1) * 0.4);
+
+ // Cubic bezier: both control points extend to the right of the lineno column
+ const cx1 = x1 + curveOffset;
+ const cy1 = y1;
+ const cx2 = x2 + curveOffset;
+ const cy2 = y2;
+
+ const path = document.createElementNS('http://www.w3.org/2000/svg', 'path');
+ path.setAttribute('d', `M ${x1} ${y1} C ${cx1} ${cy1}, ${cx2} ${cy2}, ${x2} ${y2}`);
+ path.setAttribute('class', 'exec-arrow arrow-path');
+ path.setAttribute('marker-end', 'url(#arrowhead)');
+ svg.appendChild(path);
+ });
+ }
+
+ /**
+ * Clear all arrows from the SVG overlay.
+ */
+ function clearArrows() {
+ const svg = document.getElementById('arrow-overlay');
+ if (svg) {
+ svg.querySelectorAll('.arrow-path').forEach(el => el.remove());
+ }
+ }
+
+ // AbortController for the current set of marker hover listeners.
+ let _markerListenersAbort = null;
+
+ /**
+ * Attach mouseenter/mouseleave handlers to all .task-marker spans so that
+ * hovering shows execution-flow arrows to next lines.
+ */
+ function attachArrowHoverHandlers() {
+ // Cancel any previously attached listeners without touching the DOM.
+ if (_markerListenersAbort) _markerListenersAbort.abort();
+ _markerListenersAbort = new AbortController();
+ const { signal } = _markerListenersAbort;
+
+ document.querySelectorAll('.task-marker').forEach(marker => {
+ marker.addEventListener('mouseenter', () => {
+ const lineNum = parseInt(marker.dataset.lineno);
+ if (!lineNum) return;
+ const nextLines = window._nextLinesMap[lineNum];
+ if (nextLines && nextLines.length > 0) {
+ drawArrows(lineNum, nextLines);
+ }
+ }, { signal });
+
+ marker.addEventListener('mouseleave', () => {
+ clearArrows();
+ }, { signal });
+ });
+ }
+
+ function showCruxTask(taskIdx) {
+ const problem = window.currentProblem;
+ const task = problem.tasks[taskIdx];
+
+ // Update active button
+ document.querySelectorAll('.task-btn').forEach((btn, idx) => {
+ btn.classList.toggle('active', idx === taskIdx);
+ });
+
+ const givenLabel = task.given === 'input' ? 'Input (given)' : 'Output (given)';
+ const predictLabel = task.predict === 'output' ? 'Output (predict)' : 'Input (predict)';
+ const givenValue = task.given === 'input' ? task.input : task.output;
+ const predictValue = task.predict === 'output' ? task.output : task.input;
+
+ const html = `
+ <div class="task-details">
+ <div class="task-section">
+ <p style="margin-bottom: 12px; color: #7f8c8d;">${escapeHtml(task.description)}</p>
+ </div>
+ <div class="io-section">
+ <div class="task-section">
+ <h3>${escapeHtml(givenLabel)}</h3>
+ <pre class="code-block">${escapeHtml(givenValue)}</pre>
+ </div>
+ <div class="task-section">
+ <h3>${escapeHtml(predictLabel)}</h3>
+ <pre class="code-block crux-answer">${escapeHtml(predictValue)}</pre>
+ </div>
+ </div>
+ </div>
+ `;
+
+ document.getElementById('task-content').innerHTML = html;
+ }
+
+ function showTask(taskIdx) {
+ const problem = window.currentProblem;
+ const task = problem.tasks[taskIdx];
+
+ // Update active button
+ const buttons = document.querySelectorAll('.task-btn');
+ buttons.forEach((btn, idx) => {
+ btn.classList.toggle('active', idx === taskIdx);
+ });
+
+ // Inject task markers into the code
+ injectTaskMarkers(task.task_items);
+
+ // Clear previous coverage while new one loads
+ applyCoverage(null, 0);
+
+ // Render task content
+ const ioSection = task.test_class_code
+ ? `<div class="io-section">
+ <div class="task-section">
+ <h3>Input</h3>
+ <pre class="code-block">${escapeHtml(task.input)}</pre>
+ </div>
+ </div>
+ <div class="task-section">
+ <h3>Test Class &mdash; <code>${escapeHtml(task.test_class_name)}</code></h3>
+ <pre class="code-block">${escapeHtml(task.test_class_code)}</pre>
+ </div>`
+ : `<div class="io-section">
+ <div class="task-section">
+ <h3>Input</h3>
+ <pre class="code-block">${escapeHtml(task.input)}</pre>
+ </div>
+ <div class="task-section">
+ <h3>Expected Output</h3>
+ <pre class="code-block">${escapeHtml(task.output)}</pre>
+ </div>
+ </div>`;
+
+ let html = `
+ <div class="task-details">
+ ${ioSection}
+ `;
+
+ // Show task items with ground truth answer placeholders
+ if (task.task_items && task.task_items.length > 0) {
+ html += `
+ <div class="task-section">
+ <h3>Reasoning Tasks</h3>
+ <p style="margin-bottom: 10px; color: #7f8c8d;">
+ Variable state at each execution point (correct answer shown in
+ <span style="background:#17a2b8;color:white;padding:1px 6px;border-radius:3px;font-size:0.82rem;">teal</span>):
+ </p>
+ <ul class="task-items-list">
+ `;
+
+ task.task_items.forEach(item => {
+ html += `
+ <li>
+ <span class="line-ref">Line ${item.lineno}</span>
+ <span class="var-name">${escapeHtml(item.var)}</span>
+ <span class="gt-answer loading" id="gt-${item.lineno}-${item.var}">…</span>
+ </li>
+ `;
+ });
+
+ html += `
+ </ul>
+ </div>
+ `;
+ }
+
+ // Show output prediction task if exists
+ if (task.output_pred) {
+ html += `
+ <div class="task-section">
+ <h3>Output Completion Task</h3>
+ <p style="margin-bottom: 10px; color: #7f8c8d;">
+ The model needs to complete this test assertion:
+ </p>
+ <pre class="code-block">${escapeHtml(task.output_pred)}</pre>
+ </div>
+ `;
+ }
+
+ html += `</div>`;
+
+ document.getElementById('task-content').innerHTML = html;
+
+ // Fetch and apply ground truth (coverage + variable answers)
+ if (hasGroundTruth && task.task_items) {
+ loadAndApplyGroundTruth(problem.idx, task.input_idx, task.task_items);
+ }
+ }
+
+ function showLangTab(idx) {
+ document.querySelectorAll('.lang-tab').forEach((tab, i) => {
+ tab.classList.toggle('active', i === idx);
+ });
+ document.querySelectorAll('.lang-code-panel').forEach((panel, i) => {
+ panel.classList.toggle('active', i === idx);
+ });
+ // Update test section
+ const problem = window.currentProblem;
+ if (problem && problem.lang_solutions) {
+ const sol = problem.lang_solutions[idx];
+ const testContainer = document.getElementById('lang-test-container');
+ if (testContainer && sol.test) {
+ testContainer.innerHTML = `<h2>Test Suite</h2><pre class="code-block">${escapeHtml(sol.test)}</pre>`;
+ } else if (testContainer) {
+ testContainer.innerHTML = '';
+ }
+ }
+ }
+
+ function toggleCodeMode(mode) {
+ window._currentCodeMode = mode;
+ const problem = window.currentProblem;
+ if (!problem || !problem.lang_solutions) return;
+
+ // Update mode tabs
+ const modeTabs = document.querySelectorAll('#code-mode-tabs .lang-tab');
+ modeTabs.forEach(tab => {
+ tab.classList.toggle('active', tab.textContent.trim().toLowerCase() === mode);
+ });
+
+ // Toggle visibility of canonical/buggy code in all panels
+ problem.lang_solutions.forEach((sol, i) => {
+ const canonical = document.getElementById('lang-code-canonical-' + i);
+ const buggy = document.getElementById('lang-code-buggy-' + i);
+ if (canonical) canonical.style.display = mode === 'canonical' ? '' : 'none';
+ if (buggy) buggy.style.display = mode === 'buggy' ? '' : 'none';
+ });
+ }
+
+ function showFimTab(idx) {
+ const tabs = document.querySelectorAll('#fim-tabs .lang-tab');
+ tabs.forEach((tab, i) => tab.classList.toggle('active', i === idx));
+ for (let i = 0; i < 3; i++) {
+ const panel = document.getElementById('fim-panel-' + i);
+ if (panel) panel.classList.toggle('active', i === idx);
+ }
+ }
+
+ /**
+ * Split a unified diff into per-file sections and render each with a GitHub-style
+ * file header bar. Returns an HTML string with one card per file.
+ */
+ function renderDiffFiles(diffText, title) {
+ if (!diffText) return '';
+ // Split into per-file chunks by "diff --git" boundaries
+ const files = [];
+ let current = null;
+ diffText.split('\n').forEach(line => {
+ if (line.startsWith('diff --git')) {
+ if (current) files.push(current);
+ // Extract file path from "diff --git a/path b/path"
+ const m = line.match(/^diff --git a\/(.+?) b\/(.+)/);
+ const filePath = m ? m[2] : line;
+ current = { path: filePath, lines: [line] };
+ } else if (current) {
+ current.lines.push(line);
+ } else {
+ // Lines before any diff header — create a default section
+ current = { path: '', lines: [] };
+ current.lines.push(line);
+ }
+ });
+ if (current) files.push(current);
+
+ if (files.length === 0) return '';
+
+ let html = '';
+ if (files.length === 1 && !files[0].path) {
+ // Single unnamed diff — render as before
+ html += `<div class="card"><h2>${escapeHtml(title)}</h2><div class="diff-view">${renderDiff(diffText)}</div></div>`;
+ } else {
+ html += `<div class="card"><h2>${escapeHtml(title)}</h2>`;
+ files.forEach(file => {
+ const diffChunk = file.lines.join('\n');
+ // Count additions/deletions
+ let adds = 0, dels = 0;
+ file.lines.forEach(l => {
+ if (l.startsWith('+') && !l.startsWith('+++')) adds++;
+ if (l.startsWith('-') && !l.startsWith('---')) dels++;
+ });
+ const statsHtml = `<span class="diff-file-stats"><span class="diff-stat-add">+${adds}</span> <span class="diff-stat-del">-${dels}</span></span>`;
+ html += `
+ <div class="diff-file-section">
+ <div class="diff-file-header">
+ <span class="diff-file-path">${escapeHtml(file.path)}</span>
+ ${statsHtml}
+ </div>
+ <div class="diff-view">${renderDiff(diffChunk)}</div>
+ </div>
+ `;
+ });
+ html += `</div>`;
+ }
+ return html;
+ }
+
+ /**
+ * Render a unified diff with line numbers and file headers (GitHub-style).
+ */
+ function renderDiff(diffText) {
+ if (!diffText) return '';
+ const lines = diffText.split('\n');
+ let oldLine = 0, newLine = 0;
+ const rows = [];
+
+ lines.forEach(line => {
+ if (line.startsWith('diff ')) {
+ rows.push(`<tr class="diff-tr-header"><td class="diff-ln"></td><td class="diff-ln"></td><td class="diff-td-header">${escapeHtml(line)}</td></tr>`);
+ return;
+ }
+ if (line.startsWith('---') || line.startsWith('+++')) {
+ rows.push(`<tr class="diff-tr-header"><td class="diff-ln"></td><td class="diff-ln"></td><td class="diff-td-header">${escapeHtml(line)}</td></tr>`);
+ return;
+ }
+ if (line.startsWith('@@')) {
+ // Parse hunk header: @@ -oldStart,oldCount +newStart,newCount @@
+ const m = line.match(/@@ -(\d+)(?:,\d+)? \+(\d+)(?:,\d+)? @@/);
+ if (m) {
+ oldLine = parseInt(m[1]);
+ newLine = parseInt(m[2]);
+ }
+ rows.push(`<tr class="diff-tr-hunk"><td class="diff-ln"></td><td class="diff-ln"></td><td class="diff-td-hunk">${escapeHtml(line)}</td></tr>`);
+ return;
+ }
+ if (line.startsWith('+')) {
+ rows.push(`<tr class="diff-tr-add"><td class="diff-ln"></td><td class="diff-ln">${newLine}</td><td class="diff-td-add">${escapeHtml(line.substring(1))}</td></tr>`);
+ newLine++;
+ } else if (line.startsWith('-')) {
+ rows.push(`<tr class="diff-tr-del"><td class="diff-ln">${oldLine}</td><td class="diff-ln"></td><td class="diff-td-del">${escapeHtml(line.substring(1))}</td></tr>`);
+ oldLine++;
+ } else if (line.startsWith(' ')) {
+ rows.push(`<tr class="diff-tr-ctx"><td class="diff-ln">${oldLine}</td><td class="diff-ln">${newLine}</td><td class="diff-td-ctx">${escapeHtml(line.substring(1))}</td></tr>`);
+ oldLine++;
+ newLine++;
+ } else if (line.trim() === '') {
+ // Empty trailing line
+ } else {
+ rows.push(`<tr class="diff-tr-ctx"><td class="diff-ln">${oldLine}</td><td class="diff-ln">${newLine}</td><td class="diff-td-ctx">${escapeHtml(line)}</td></tr>`);
+ oldLine++;
+ newLine++;
+ }
+ });
+
+ return `<table class="diff-table">${rows.join('')}</table>`;
+ }
+
+ /**
+ * Simple line-by-line diff (LCS-based) between two code strings.
+ * Returns an array of {type: 'context'|'add'|'del', line: string}.
+ */
+ function computeUnifiedDiff(oldText, newText) {
+ const oldLines = (oldText || '').split('\n');
+ const newLines = (newText || '').split('\n');
+
+ // LCS for line sequences
+ const m = oldLines.length, n = newLines.length;
+ // For very large files, just show both in full instead of computing LCS
+ if (m * n > 500000) {
+ const result = [];
+ oldLines.forEach(l => result.push({type: 'del', line: l}));
+ newLines.forEach(l => result.push({type: 'add', line: l}));
+ return result;
+ }
+
+ const dp = Array.from({length: m + 1}, () => new Uint16Array(n + 1));
+ for (let i = 1; i <= m; i++) {
+ for (let j = 1; j <= n; j++) {
+ if (oldLines[i - 1] === newLines[j - 1]) {
+ dp[i][j] = dp[i - 1][j - 1] + 1;
+ } else {
+ dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
+ }
+ }
+ }
+
+ // Backtrack to build diff
+ const result = [];
+ let i = m, j = n;
+ while (i > 0 || j > 0) {
+ if (i > 0 && j > 0 && oldLines[i - 1] === newLines[j - 1]) {
+ result.push({type: 'context', line: oldLines[i - 1]});
+ i--; j--;
+ } else if (j > 0 && (i === 0 || dp[i][j - 1] >= dp[i - 1][j])) {
+ result.push({type: 'add', line: newLines[j - 1]});
+ j--;
+ } else {
+ result.push({type: 'del', line: oldLines[i - 1]});
+ i--;
+ }
+ }
+ result.reverse();
+
+ // Compact: only show hunks with context (3 lines around changes)
+ const contextSize = 3;
+ const hasChange = result.map(r => r.type !== 'context');
+ const show = new Uint8Array(result.length);
+ for (let k = 0; k < result.length; k++) {
+ if (hasChange[k]) {
+ for (let c = Math.max(0, k - contextSize); c <= Math.min(result.length - 1, k + contextSize); c++) {
+ show[c] = 1;
+ }
+ }
+ }
+
+ const compacted = [];
+ let lastShown = -1;
+ for (let k = 0; k < result.length; k++) {
+ if (show[k]) {
+ if (lastShown >= 0 && k - lastShown > 1) {
+ compacted.push({type: 'separator', line: '...'});
+ }
+ compacted.push(result[k]);
+ lastShown = k;
+ }
+ }
+
+ return compacted.length > 0 ? compacted : result;
+ }
+
+ /**
+ * Render the output of computeUnifiedDiff into diff HTML with line numbers.
+ */
+ function renderComputedDiff(diffEntries) {
+ let oldLine = 1, newLine = 1;
+ const rows = diffEntries.map(entry => {
+ if (entry.type === 'separator') {
+ return `<tr class="diff-tr-hunk"><td class="diff-ln"></td><td class="diff-ln"></td><td class="diff-td-hunk">${escapeHtml(entry.line)}</td></tr>`;
+ }
+ if (entry.type === 'del') {
+ const row = `<tr class="diff-tr-del"><td class="diff-ln">${oldLine}</td><td class="diff-ln"></td><td class="diff-td-del">${escapeHtml(entry.line)}</td></tr>`;
+ oldLine++;
+ return row;
+ }
+ if (entry.type === 'add') {
+ const row = `<tr class="diff-tr-add"><td class="diff-ln"></td><td class="diff-ln">${newLine}</td><td class="diff-td-add">${escapeHtml(entry.line)}</td></tr>`;
+ newLine++;
+ return row;
+ }
+ // context
+ const row = `<tr class="diff-tr-ctx"><td class="diff-ln">${oldLine}</td><td class="diff-ln">${newLine}</td><td class="diff-td-ctx">${escapeHtml(entry.line)}</td></tr>`;
+ oldLine++;
+ newLine++;
+ return row;
+ });
+ return `<table class="diff-table">${rows.join('')}</table>`;
+ }
+
+ function escapeHtml(text) {
+ if (text === null || text === undefined) return '';
+ const div = document.createElement('div');
+ div.textContent = text;
+ return div.innerHTML;
+ }
+
+ loadProblem();
templates/base.html CHANGED
@@ -92,6 +92,159 @@
  color: white;
  }
 
  .badge-info {
  background: #ecf0f1;
  color: #2c3e50;
@@ -192,6 +345,7 @@
 
  {% block extra_css %}{% endblock %}
  </style>
  </head>
  <body>
  <header>
95
+ .badge-mbpp {
96
+ background: #16a085;
97
+ color: white;
98
+ }
99
+
100
+ .badge-codeforces {
101
+ background: #e74c3c;
102
+ color: white;
103
+ }
104
+
105
+ .badge-leetcode {
106
+ background: #f39c12;
107
+ color: white;
108
+ }
109
+
110
+ .badge-atcoder {
111
+ background: #2ecc71;
112
+ color: white;
113
+ }
114
+
115
+ .badge-cppsyntaxerror, .badge-cppreferenceerror, .badge-cpplogicerror, .badge-cppmultipleerror {
116
+ background: #3498db;
117
+ color: white;
118
+ }
119
+
120
+ .badge-javasyntaxerror, .badge-javareferenceerror, .badge-javalogicerror, .badge-javamultipleerror {
121
+ background: #e67e22;
122
+ color: white;
123
+ }
124
+
125
+ .badge-pythonsyntaxerror, .badge-pythonreferenceerror, .badge-pythonlogicerror, .badge-pythonmultipleerror {
126
+ background: #2ecc71;
127
+ color: white;
128
+ }
129
+
130
+ .badge-humanevalx {
131
+ background: #1abc9c;
132
+ color: white;
133
+ }
134
+
135
+ /* SWE-bench repo badges */
136
+ .badge-djangodjango, .badge-astropyastropy, .badge-matplotlibmatplotlib, .badge-scikitimagescikitimage {
137
+ background: #0d6efd;
138
+ color: white;
139
+ }
140
+
141
+ .badge-sympy, .badge-sympysympy, .badge-pylintdevpylint, .badge-sphinxdocsphinx,
142
+ .badge-palletsflask, .badge-palletsjinja, .badge-pyaborpyabor, .badge-pytestdevpytest {
143
+ background: #6610f2;
144
+ color: white;
145
+ }
146
+
147
+ /* APPS difficulty badges */
148
+ .badge-introductory {
149
+ background: #27ae60;
150
+ color: white;
151
+ }
152
+
153
+ .badge-interview {
154
+ background: #f39c12;
155
+ color: white;
156
+ }
157
+
158
+ .badge-competition {
159
+ background: #e74c3c;
160
+ color: white;
161
+ }
162
+
163
+ /* CanItEdit change kind badges */
164
+ .badge-adaptive {
165
+ background: #3498db;
166
+ color: white;
167
+ }
168
+
169
+ .badge-perfective {
170
+ background: #2ecc71;
171
+ color: white;
172
+ }
173
+
174
+ .badge-corrective {
175
+ background: #e67e22;
176
+ color: white;
177
+ }
178
+
179
+ .badge-canitedit {
180
+ background: #9b59b6;
181
+ color: white;
182
+ }
183
+
184
+ /* CodeContests source badges (extend existing) */
185
+ .badge-codechef {
186
+ background: #5b4638;
187
+ color: white;
188
+ }
189
+
190
+ .badge-codejam {
191
+ background: #4285f4;
192
+ color: white;
193
+ }
194
+
195
+ .badge-hackerearth {
196
+ background: #2c3454;
197
+ color: white;
198
+ }
199
+
200
+ .badge-aizu {
201
+ background: #0089d0;
202
+ color: white;
203
+ }
204
+
205
+ .badge-unknown {
206
+ background: #95a5a6;
207
+ color: white;
208
+ }
209
+
210
+ /* SAFIM language badges */
211
+ .badge-python, .badge-java, .badge-c {
212
+ background: #3498db;
213
+ color: white;
214
+ }
215
+
216
+ /* Vulnerability badges */
217
+ .badge-vulnerable {
218
+ background: #e74c3c;
219
+ color: white;
220
+ }
221
+
222
+ .badge-patched {
223
+ background: #27ae60;
224
+ color: white;
225
+ }
226
+
227
+ /* CodeEditorBench type badges */
228
+ .badge-codedebug {
229
+ background: #e74c3c;
230
+ color: white;
231
+ }
232
+
233
+ .badge-codetranslate {
234
+ background: #3498db;
235
+ color: white;
236
+ }
237
+
238
+ .badge-codepolish {
239
+ background: #2ecc71;
240
+ color: white;
241
+ }
242
+
243
+ .badge-coderequirementswitch {
244
+ background: #9b59b6;
245
+ color: white;
246
+ }
247
+
248
  .badge-info {
249
  background: #ecf0f1;
250
  color: #2c3e50;
 
345
 
346
  {% block extra_css %}{% endblock %}
347
  </style>
348
+ {% block extra_head %}{% endblock %}
349
  </head>
350
  <body>
351
  <header>
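
The badge selectors added above are keyed by source strings lowercased with non-alphanumerics stripped, which is why repo names double up (e.g. `django/django` becomes `djangodjango`). A minimal sketch of that mapping, mirroring the `badgeClass` helper used in the templates:

```javascript
// Maps a source tag to its badge CSS class,
// e.g. "django/django" -> "badge-djangodjango", "LeetCode" -> "badge-leetcode".
function badgeClass(source) {
  return 'badge-' + source.toLowerCase().replace(/[^a-z0-9]/g, '');
}
```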
templates/index.html CHANGED
@@ -84,55 +84,45 @@
84
  color: #7f8c8d;
85
  }
86
 
87
- .stats {
88
  display: flex;
89
- gap: 20px;
90
- margin-bottom: 20px;
91
  flex-wrap: wrap;
92
  }
93
 
94
- .stat-card {
95
- background: white;
96
- border-radius: 8px;
97
- padding: 20px;
98
- box-shadow: 0 2px 4px rgba(0,0,0,0.1);
99
- flex: 1;
100
- min-width: 200px;
101
  }
102
 
103
- .stat-number {
104
- font-size: 2.5rem;
105
- font-weight: 700;
106
- color: #3498db;
107
  }
108
 
109
- .stat-label {
110
- font-size: 0.9rem;
111
  color: #7f8c8d;
112
- margin-top: 5px;
113
  }
114
  </style>
115
  {% endblock %}
116
 
117
  {% block content %}
118
- <div class="stats" id="stats">
119
- <div class="stat-card">
120
- <div class="stat-number" id="total-problems">-</div>
121
- <div class="stat-label">Total Problems</div>
122
- </div>
123
- <div class="stat-card" id="stat-source-a">
124
- <div class="stat-number" id="source-a-count">-</div>
125
- <div class="stat-label" id="source-a-label">Source A</div>
126
- </div>
127
- <div class="stat-card" id="stat-source-b">
128
- <div class="stat-number" id="source-b-count">-</div>
129
- <div class="stat-label" id="source-b-label">Source B</div>
130
- </div>
131
- <div class="stat-card">
132
- <div class="stat-number" id="filtered-count">-</div>
133
- <div class="stat-label">Displayed</div>
134
- </div>
135
- </div>
136
 
137
  <div class="card">
138
  <h2>Filter Problems</h2>
@@ -179,7 +169,10 @@ async function loadDatasets() {
179
  datasets.forEach(ds => {
180
  const opt = document.createElement('option');
181
  opt.value = ds.slug;
182
- opt.textContent = `${ds.display_name} (${ds.problem_count})`;
183
  if (ds.slug === currentDataset) opt.selected = true;
184
  select.appendChild(opt);
185
  });
@@ -236,27 +229,22 @@ function updateStats() {
236
  sources[p.source] = (sources[p.source] || 0) + 1;
237
  });
238
 
239
- document.getElementById('total-problems').textContent = allProblems.length;
240
 
241
- const sourceNames = Object.keys(sources);
242
- const statA = document.getElementById('stat-source-a');
243
- const statB = document.getElementById('stat-source-b');
244
-
245
- if (sourceNames.length >= 1) {
246
- statA.style.display = '';
247
- document.getElementById('source-a-count').textContent = sources[sourceNames[0]];
248
- document.getElementById('source-a-label').textContent = sourceNames[0];
249
- } else {
250
- statA.style.display = 'none';
251
- }
252
-
253
- if (sourceNames.length >= 2) {
254
- statB.style.display = '';
255
- document.getElementById('source-b-count').textContent = sources[sourceNames[1]];
256
- document.getElementById('source-b-label').textContent = sourceNames[1];
257
- } else {
258
- statB.style.display = 'none';
259
  }
 
260
  }
261
 
262
  function badgeClass(source) {
@@ -268,12 +256,9 @@ function renderProblems(problems) {
268
 
269
  if (problems.length === 0) {
270
  container.innerHTML = '<div class="card"><p>No problems match your filters.</p></div>';
271
- document.getElementById('filtered-count').textContent = '0';
272
  return;
273
  }
274
 
275
- document.getElementById('filtered-count').textContent = problems.length;
276
-
277
  const grid = document.createElement('div');
278
  grid.className = 'problems-grid';
279
 
 
84
  color: #7f8c8d;
85
  }
86
 
87
+ .stats-bar {
88
  display: flex;
89
+ align-items: center;
90
+ gap: 16px;
91
+ margin-bottom: 16px;
92
+ font-size: 0.9rem;
93
+ color: #555;
94
  flex-wrap: wrap;
95
  }
96
 
97
+ .stats-total {
98
+ font-weight: 700;
99
+ color: #2c3e50;
100
+ font-size: 0.95rem;
101
  }
102
 
103
+ .stats-sep {
104
+ color: #ccc;
 
 
105
  }
106
 
107
+ .stats-tag {
108
+ display: inline-flex;
109
+ align-items: center;
110
+ gap: 4px;
111
+ }
112
+
113
+ .stats-tag-name {
114
  color: #7f8c8d;
115
+ }
116
+
117
+ .stats-tag-count {
118
+ font-weight: 600;
119
+ color: #2c3e50;
120
  }
121
  </style>
122
  {% endblock %}
123
 
124
  {% block content %}
125
+ <div class="stats-bar" id="stats-bar"></div>
126
 
127
  <div class="card">
128
  <h2>Filter Problems</h2>
 
169
  datasets.forEach(ds => {
170
  const opt = document.createElement('option');
171
  opt.value = ds.slug;
172
+ const countLabel = ds.total_count
173
+ ? `${ds.problem_count} of ${ds.total_count}`
174
+ : `${ds.problem_count}`;
175
+ opt.textContent = `${ds.display_name} (${countLabel})`;
176
  if (ds.slug === currentDataset) opt.selected = true;
177
  select.appendChild(opt);
178
  });
 
229
  sources[p.source] = (sources[p.source] || 0) + 1;
230
  });
231
 
232
+ const sorted = Object.entries(sources)
233
+ .sort((a, b) => b[1] - a[1]);
234
+ const top5 = sorted.slice(0, 5);
235
+ const otherCount = sorted.slice(5).reduce((sum, [, c]) => sum + c, 0);
236
 
237
+ const bar = document.getElementById('stats-bar');
238
+ let html = `<span class="stats-total">Total: ${allProblems.length}</span>`;
239
+ top5.forEach(([name, count]) => {
240
+ html += `<span class="stats-sep">|</span>`;
241
+ html += `<span class="stats-tag"><span class="stats-tag-name">${name}:</span> <span class="stats-tag-count">${count}</span></span>`;
242
+ });
243
+ if (otherCount > 0) {
244
+ html += `<span class="stats-sep">|</span>`;
245
+ html += `<span class="stats-tag"><span class="stats-tag-name">Other:</span> <span class="stats-tag-count">${otherCount}</span></span>`;
246
  }
247
+ bar.innerHTML = html;
248
  }
249
 
250
  function badgeClass(source) {
 
256
 
257
  if (problems.length === 0) {
258
  container.innerHTML = '<div class="card"><p>No problems match your filters.</p></div>';
 
259
  return;
260
  }
261
 
 
 
262
  const grid = document.createElement('div');
263
  grid.className = 'problems-grid';
264
 
templates/problem.html CHANGED
@@ -20,283 +20,11 @@
20
  {% endblock %}
21
 
22
  {% block extra_css %}
23
- <style>
24
- {{ css|safe }}
25
-
26
- .problem-header {
27
- display: flex;
28
- justify-content: space-between;
29
- align-items: center;
30
- margin-bottom: 15px;
31
- }
32
-
33
- .problem-meta {
34
- margin-bottom: 20px;
35
- }
36
-
37
- .meta-item {
38
- display: inline-block;
39
- margin-right: 15px;
40
- margin-bottom: 10px;
41
- }
42
-
43
- .meta-label {
44
- font-weight: 600;
45
- color: #7f8c8d;
46
- margin-right: 5px;
47
- }
48
-
49
- .meta-value {
50
- color: #2c3e50;
51
- }
52
-
53
- .task-selector {
54
- margin: 20px 0;
55
- display: flex;
56
- gap: 10px;
57
- flex-wrap: wrap;
58
- }
59
-
60
- .task-btn {
61
- padding: 10px 20px;
62
- background: #ecf0f1;
63
- border: 2px solid transparent;
64
- border-radius: 4px;
65
- cursor: pointer;
66
- transition: all 0.3s;
67
- font-size: 0.95rem;
68
- }
69
-
70
- .task-btn:hover {
71
- background: #bdc3c7;
72
- }
73
-
74
- .task-btn.active {
75
- background: #3498db;
76
- color: white;
77
- border-color: #2980b9;
78
- }
79
-
80
- .task-details {
81
- margin-top: 20px;
82
- }
83
-
84
- .task-section {
85
- margin-bottom: 25px;
86
- padding: 15px;
87
- background: #f8f9fa;
88
- border-left: 4px solid #3498db;
89
- border-radius: 4px;
90
- }
91
-
92
- .task-section h3 {
93
- margin-bottom: 10px;
94
- color: #2c3e50;
95
- font-size: 1.1rem;
96
- }
97
-
98
- .code-block {
99
- background: #f8f9fa;
100
- padding: 15px;
101
- border-radius: 4px;
102
- overflow-x: auto;
103
- font-family: 'Monaco', 'Menlo', 'Ubuntu Mono', monospace;
104
- font-size: 0.9rem;
105
- border: 1px solid #e1e4e8;
106
- }
107
-
108
- .task-items-list {
109
- list-style: none;
110
- }
111
-
112
- .task-items-list li {
113
- padding: 10px;
114
- margin-bottom: 8px;
115
- background: white;
116
- border-radius: 4px;
117
- border: 1px solid #e1e4e8;
118
- }
119
-
120
- .line-ref {
121
- display: inline-block;
122
- padding: 2px 8px;
123
- background: #3498db;
124
- color: white;
125
- border-radius: 3px;
126
- font-family: monospace;
127
- font-size: 0.85rem;
128
- margin-right: 8px;
129
- }
130
-
131
- .var-name {
132
- display: inline-block;
133
- padding: 2px 8px;
134
- background: #9b59b6;
135
- color: white;
136
- border-radius: 3px;
137
- font-family: monospace;
138
- font-size: 0.85rem;
139
- }
140
-
141
- .io-section {
142
- display: grid;
143
- grid-template-columns: 1fr 1fr;
144
- gap: 15px;
145
- }
146
-
147
- @media (max-width: 768px) {
148
- .io-section {
149
- grid-template-columns: 1fr;
150
- }
151
- }
152
-
153
- .navigation-hint {
154
- margin-top: 20px;
155
- padding: 15px;
156
- background: #e8f4f8;
157
- border-radius: 4px;
158
- color: #2c3e50;
159
- font-size: 0.9rem;
160
- }
161
-
162
- .test-code-section {
163
- margin-top: 20px;
164
- }
165
-
166
- /* Inline task visualization */
167
- .code-with-tasks {
168
- position: relative;
169
- }
170
-
171
- .task-marker {
172
- display: inline-block;
173
- margin-left: 10px;
174
- padding: 2px 8px;
175
- background: #9b59b6;
176
- color: white;
177
- border-radius: 3px;
178
- font-size: 0.75rem;
179
- font-weight: 600;
180
- cursor: crosshair;
181
- }
182
-
183
- /* Coverage coloring on lineno spans.
184
- Pygments emits: td.linenos > div.linenodiv > pre > span.normal
185
- We must match that chain; .source .linenos doesn't work because
186
- the td has class "linenos", not an element named "linenos". */
187
- td.linenos .normal.line-executed {
188
- background-color: #d4edda !important;
189
- color: #155724 !important;
190
- }
191
-
192
- td.linenos .normal.line-not-executed {
193
- background-color: #f8d7da !important;
194
- color: #721c24 !important;
195
- }
196
-
197
- /* Coverage legend */
198
- .coverage-legend {
199
- margin: 10px 0;
200
- padding: 10px 15px;
201
- background: #f8f9fa;
202
- border-left: 4px solid #28a745;
203
- border-radius: 4px;
204
- font-size: 0.85rem;
205
- display: none;
206
- }
207
-
208
- .coverage-legend-item {
209
- display: inline-block;
210
- margin-right: 18px;
211
- }
212
-
213
- .coverage-swatch {
214
- display: inline-block;
215
- width: 12px;
216
- height: 12px;
217
- border-radius: 2px;
218
- margin-right: 4px;
219
- vertical-align: middle;
220
- }
221
-
222
- /* Ground truth answer badge shown next to task items */
223
- .gt-answer {
224
- display: inline-block;
225
- margin-left: 10px;
226
- padding: 2px 8px;
227
- background: #17a2b8;
228
- color: white;
229
- border-radius: 3px;
230
- font-family: monospace;
231
- font-size: 0.82rem;
232
- font-weight: 600;
233
- }
234
-
235
- .gt-answer.loading {
236
- background: #6c757d;
237
- }
238
-
239
- /* SVG arrow overlay positioned over the code container */
240
- #arrow-overlay {
241
- position: absolute;
242
- top: 0;
243
- left: 0;
244
- width: 100%;
245
- height: 100%;
246
- pointer-events: none;
247
- overflow: visible;
248
- z-index: 10;
249
- }
250
-
251
- .exec-arrow {
252
- fill: none;
253
- stroke: #e67e22;
254
- stroke-width: 2.5;
255
- stroke-dasharray: none;
256
- opacity: 0.9;
257
- }
258
-
259
- .exec-arrow-head {
260
- fill: #e67e22;
261
- opacity: 0.9;
262
- }
263
-
264
- /* CRUXEval answer highlight */
265
- .crux-answer {
266
- border-left: 4px solid #17a2b8 !important;
267
- background: #e8f6f8 !important;
268
- }
269
-
270
- /* BigOBench complexity display */
271
- .complexity-badges {
272
- display: flex;
273
- gap: 20px;
274
- flex-wrap: wrap;
275
- }
276
-
277
- .complexity-item {
278
- display: flex;
279
- align-items: center;
280
- gap: 10px;
281
- }
282
-
283
- .complexity-label {
284
- font-weight: 600;
285
- color: #7f8c8d;
286
- font-size: 0.95rem;
287
- }
288
 
289
- .complexity-value {
290
- display: inline-block;
291
- padding: 6px 16px;
292
- background: #2c3e50;
293
- color: #f1c40f;
294
- border-radius: 4px;
295
- font-family: 'Monaco', 'Menlo', 'Ubuntu Mono', monospace;
296
- font-size: 1.1rem;
297
- font-weight: 600;
298
- }
299
- </style>
300
  {% endblock %}
301
 
302
  {% block content %}
@@ -315,701 +43,6 @@ const datasetSlug = {{ dataset_slug|tojson }};
315
  const datasetName = {{ dataset_name|tojson }};
316
  const hasGroundTruth = {{ has_ground_truth|tojson }};
317
  const hasTasks = {{ has_tasks|tojson }};
318
-
319
- function badgeClass(source) {
320
- return 'badge-' + source.toLowerCase().replace(/[^a-z0-9]/g, '');
321
- }
322
-
323
- async function loadProblem() {
324
- try {
325
- const response = await fetch(`/api/${datasetSlug}/problem/${problemIdx}`);
326
- const problem = await response.json();
327
-
328
- if (problem.error) {
329
- document.getElementById('problem-content').innerHTML =
330
- '<div class="card"><p style="color: red;">Error: ' + problem.error + '</p></div>';
331
- return;
332
- }
333
-
334
- renderProblem(problem);
335
- } catch (error) {
336
- document.getElementById('problem-content').innerHTML =
337
- '<div class="card"><p style="color: red;">Error loading problem: ' + error.message + '</p></div>';
338
- }
339
- }
340
-
341
- function renderProblem(problem) {
342
- const container = document.getElementById('problem-content');
343
-
344
- // Main problem info card (shared by all datasets)
345
- let html = `
346
- <div class="card">
347
- <div class="problem-header">
348
- <h2>${escapeHtml(problem.entry_point)}</h2>
349
- <span class="badge ${badgeClass(problem.source)}">${escapeHtml(problem.source)}</span>
350
- </div>
351
- <div class="problem-meta">
352
- <div class="meta-item">
353
- <span class="meta-label">Task ID:</span>
354
- <span class="meta-value">${escapeHtml(problem.task_id)}</span>
355
- </div>
356
- <div class="meta-item">
357
- <span class="meta-label">Index:</span>
358
- <span class="meta-value">${problem.idx}</span>
359
- </div>
360
- <div class="meta-item">
361
- <span class="meta-label">Dataset:</span>
362
- <span class="meta-value">${escapeHtml(datasetName)}</span>
363
- </div>
364
- ${problem.inputs.length > 0 ? `
365
- <div class="meta-item">
366
- <span class="meta-label">Test Inputs:</span>
367
- <span class="meta-value">${problem.inputs.length}</span>
368
- </div>` : ''}
369
- </div>
370
- </div>
371
- `;
372
-
373
- // --- BigOBench view (problem description + per-solution code & complexity) ---
374
- if (problem.solutions && problem.solutions.length > 0) {
375
- // Problem description
376
- if (problem.description) {
377
- html += `
378
- <div class="card">
379
- <h2>Problem Statement</h2>
380
- <pre class="code-block" style="white-space: pre-wrap;">${escapeHtml(problem.description)}</pre>
381
- </div>
382
- `;
383
- }
384
-
385
- // Each solution: code + complexity
386
- problem.solutions.forEach((sol, i) => {
387
- html += `
388
- <div class="card">
389
- <h2>Solution ${i + 1} <span style="font-size:0.8rem;color:#7f8c8d;font-weight:400;">${escapeHtml(sol.solution_id)}</span></h2>
390
- <div class="complexity-badges" style="margin-bottom: 15px;">
391
- `;
392
- if (sol.time_complexity) {
393
- html += `
394
- <div class="complexity-item">
395
- <span class="complexity-label">Time</span>
396
- <span class="complexity-value">${escapeHtml(sol.time_complexity)}</span>
397
- </div>`;
398
- }
399
- if (sol.space_complexity) {
400
- html += `
401
- <div class="complexity-item">
402
- <span class="complexity-label">Space</span>
403
- <span class="complexity-value">${escapeHtml(sol.space_complexity)}</span>
404
- </div>`;
405
- }
406
- html += `
407
- </div>
408
- <div class="code-with-tasks">
409
- ${sol.highlighted_code}
410
- </div>
411
- </div>
412
- `;
413
- });
414
-
415
- // Navigation hint
416
- html += `
417
- <div class="navigation-hint">
418
- <strong>Tip:</strong> Use the Previous/Next buttons at the top to browse through problems,
419
- or return to the list view to filter by dataset source or search by name.
420
- </div>
421
- `;
422
-
423
- container.innerHTML = html;
424
- window.currentProblem = problem;
425
- return;
426
- }
427
-
428
- // Source Code card
429
- html += `
430
- <div class="card">
431
- <h2>Source Code</h2>
432
- <div class="code-with-tasks" id="code-container">
433
- ${problem.highlighted_code}
434
- </div>
435
- </div>
436
- `;
437
-
438
- // --- Non-DREval (simple) view ---
439
- if (!hasTasks) {
440
- // Show inputs/outputs if available
441
- if (problem.inputs && problem.inputs.length > 0) {
442
- html += `<div class="card"><h2>Inputs &amp; Outputs</h2>`;
443
- problem.inputs.forEach((inp, i) => {
444
- const out = (problem.outputs && problem.outputs[i]) || '';
445
- html += `
446
- <div class="io-section" style="margin-bottom: 15px;">
447
- <div class="task-section">
448
- <h3>Input ${i + 1}</h3>
449
- <pre class="code-block">${escapeHtml(inp)}</pre>
450
- </div>
451
- <div class="task-section">
452
- <h3>Output</h3>
453
- <pre class="code-block">${escapeHtml(out)}</pre>
454
- </div>
455
- </div>
456
- `;
457
- });
458
- html += `</div>`;
459
- }
460
-
461
- // Show test suite if available
462
- if (problem.test) {
463
- html += `
464
- <div class="card">
465
- <h2>Test Suite</h2>
466
- <pre class="code-block">${escapeHtml(problem.test)}</pre>
467
- </div>
468
- `;
469
- }
470
-
471
- // Navigation hint
472
- html += `
473
- <div class="navigation-hint">
474
- <strong>Tip:</strong> Use the Previous/Next buttons at the top to browse through problems,
475
- or return to the list view to filter by dataset source or search by name.
476
- </div>
477
- `;
478
-
479
- container.innerHTML = html;
480
- window.currentProblem = problem;
481
- return;
482
- }
483
-
484
- // --- CRUXEval task view (tasks have given/predict fields, no task_items) ---
485
- if (problem.tasks.length > 0 && problem.tasks[0].given !== undefined) {
486
- // Task selector
487
- html += `
488
- <div class="card">
489
- <h2>Tasks</h2>
490
- <div class="task-selector" id="task-selector">
491
- `;
492
- problem.tasks.forEach((task, idx) => {
493
- html += `
494
- <button class="task-btn ${idx === 0 ? 'active' : ''}"
495
- onclick="showCruxTask(${idx})">
496
- ${escapeHtml(task.name)}
497
- </button>
498
- `;
499
- });
500
- html += `
501
- </div>
502
- <div id="task-content"></div>
503
- </div>
504
- `;
505
-
506
- // Navigation hint
507
- html += `
508
- <div class="navigation-hint">
509
- <strong>Tip:</strong> Use the Previous/Next buttons at the top to browse through problems,
510
- or return to the list view to filter by dataset source or search by name.
511
- </div>
512
- `;
513
-
514
- container.innerHTML = html;
515
- window.currentProblem = problem;
516
- showCruxTask(0);
517
- return;
518
- }
519
-
520
- // --- DREval (full) view with tasks, coverage, arrows ---
521
- // Rebuild html cleanly with coverage legend and SVG overlay
522
- html = `
523
- <div class="card">
524
- <div class="problem-header">
525
- <h2>${escapeHtml(problem.entry_point)}</h2>
526
- <span class="badge ${badgeClass(problem.source)}">${escapeHtml(problem.source)}</span>
527
- </div>
528
- <div class="problem-meta">
529
- <div class="meta-item">
530
- <span class="meta-label">Task ID:</span>
531
- <span class="meta-value">${escapeHtml(problem.task_id)}</span>
532
- </div>
533
- <div class="meta-item">
534
- <span class="meta-label">Index:</span>
535
- <span class="meta-value">${problem.idx}</span>
536
- </div>
537
- <div class="meta-item">
538
- <span class="meta-label">Dataset:</span>
539
- <span class="meta-value">${escapeHtml(datasetName)}</span>
540
- </div>
541
- <div class="meta-item">
542
- <span class="meta-label">Test Inputs:</span>
543
- <span class="meta-value">${problem.inputs.length}</span>
544
- </div>
545
- </div>
546
- </div>
547
-
548
- <div class="card">
549
- <h2>Source Code</h2>
550
- <div class="coverage-legend" id="coverage-legend">
551
- <strong>Coverage:</strong>
552
- <span class="coverage-legend-item">
553
- <span class="coverage-swatch" style="background:#d4edda; border:1px solid #28a745;"></span>
554
- Executed
555
- </span>
556
- <span class="coverage-legend-item">
557
- <span class="coverage-swatch" style="background:#f8d7da; border:1px solid #dc3545;"></span>
558
- Not executed
559
- </span>
560
- </div>
561
- <div class="code-with-tasks" id="code-container">
562
- ${problem.highlighted_code}
563
- <svg id="arrow-overlay" xmlns="http://www.w3.org/2000/svg">
564
- <defs>
565
- <marker id="arrowhead" markerWidth="8" markerHeight="6"
566
- refX="8" refY="3" orient="auto">
567
- <polygon points="0 0, 8 3, 0 6" class="exec-arrow-head"/>
568
- </marker>
569
- </defs>
570
- </svg>
571
- </div>
572
- </div>
573
- `;
574
-
575
- // Task selector
576
- html += `
577
- <div class="card">
578
- <h2>Test Cases & Tasks</h2>
579
- <p>Select a test input to view associated reasoning tasks:</p>
580
- <div class="task-selector" id="task-selector">
581
- `;
582
-
583
- problem.tasks.forEach((task, idx) => {
584
- html += `
585
- <button class="task-btn ${idx === 0 ? 'active' : ''}"
586
- onclick="showTask(${idx})">
587
- Input ${task.input_idx + 1}
588
- </button>
589
- `;
590
- });
591
-
592
- html += `
593
- </div>
594
- <div id="task-content"></div>
595
- </div>
596
- `;
597
-
598
- // Navigation hint
599
- html += `
600
- <div class="navigation-hint">
601
- <strong>Tip:</strong> Use the Previous/Next buttons at the top to browse through problems,
602
- or return to the list view to filter by dataset source or search by name.
603
- </div>
604
- `;
605
-
606
- container.innerHTML = html;
607
-
608
- // Store problem data globally
609
- window.currentProblem = problem;
610
-
611
- // Show first task by default
612
- showTask(0);
613
- }
614
-
615
- function injectTaskMarkers(taskItems) {
616
- const codePre = document.querySelector('.source .code pre');
617
-
618
- // Save the pristine original innerHTML once, before any modification.
619
- if (codePre && !window._codePreOriginalHtml) {
620
- window._codePreOriginalHtml = codePre.innerHTML;
621
- }
622
-
623
- // Invalidate span cache (rebuilt lazily on next arrow draw)
624
- window._linenoSpanCache = null;
625
-
626
- // Store current task items so applyCoverage can re-add markers after wrapping.
627
- window._currentTaskItems = taskItems || [];
628
-
629
- // Reset code pre to original, then add markers from scratch.
630
- if (codePre && window._codePreOriginalHtml) {
631
- codePre.innerHTML = window._codePreOriginalHtml;
632
- }
633
-
634
- if (!taskItems || taskItems.length === 0) {
635
- return;
636
- }
637
-
638
- // Group tasks by line number
639
- const tasksByLine = {};
640
- taskItems.forEach(item => {
641
- if (!tasksByLine[item.lineno]) tasksByLine[item.lineno] = [];
642
- tasksByLine[item.lineno].push(item.var);
643
- });
644
-
645
- // Inject task marker badges into the code pre
646
- if (!codePre) return;
647
- const codeLines = codePre.innerHTML.split('\n');
648
- codePre.innerHTML = codeLines.map((line, idx) => {
649
- const lineNum = idx + 1;
650
- if (tasksByLine[lineNum] && line.trim() !== '') {
651
- const vars = tasksByLine[lineNum];
652
- return line + `<span class="task-marker" data-lineno="${lineNum}" data-vars="${escapeHtml(vars.join(', '))}">${escapeHtml(vars.join(', '))}</span>`;
653
- }
654
- return line;
655
- }).join('\n');
656
-
657
- }
658
-
659
- function applyCoverage(coverageSet, totalLines) {
660
- // Remove previous coverage classes from lineno spans.
661
- // Pygments structure: td.linenos > div.linenodiv > pre > span.normal
662
- // These are individual elements — adding/removing classes has no layout impact.
663
- document.querySelectorAll('td.linenos .normal').forEach(el => {
664
- el.classList.remove('line-executed', 'line-not-executed');
665
- });
666
-
667
- if (!coverageSet) {
668
- const legend = document.getElementById('coverage-legend');
669
- if (legend) legend.style.display = 'none';
670
- return;
671
- }
672
-
673
- const legend = document.getElementById('coverage-legend');
674
- if (legend) legend.style.display = 'block';
675
-
676
- // Color lineno spans only. We never touch codePre.innerHTML here so:
677
- // 1. The table layout is never disturbed (no alignment issue).
678
- // 2. Task markers injected by injectTaskMarkers are left untouched.
679
- document.querySelectorAll('td.linenos .normal').forEach(span => {
680
- const lineNum = parseInt(span.textContent.trim());
681
- if (!isNaN(lineNum) && lineNum <= totalLines) {
682
- span.classList.add(coverageSet.has(lineNum) ? 'line-executed' : 'line-not-executed');
683
- }
684
- });
685
- }
686
-
687
- // Global map: lineno -> list of next line numbers (1-indexed; -1 = end of trace)
688
- window._nextLinesMap = {};
689
-
690
- async function loadAndApplyGroundTruth(problemIdx, inputIdx, taskItems) {
691
- // Show "loading" placeholders on all task items
692
- taskItems.forEach(item => {
693
- const el = document.getElementById(`gt-${item.lineno}-${item.var}`);
694
- if (el) { el.textContent = '…'; el.className = 'gt-answer loading'; }
695
- });
696
-
697
- // Clear next-lines data from previous input
698
- window._nextLinesMap = {};
699
-
700
- try {
701
- const resp = await fetch(`/api/${datasetSlug}/problem/${problemIdx}/ground_truth/${inputIdx}`);
702
- const gt = await resp.json();
703
-
704
- if (gt.status !== 'ok') {
705
- taskItems.forEach(item => {
706
- const el = document.getElementById(`gt-${item.lineno}-${item.var}`);
707
- if (el) { el.textContent = gt.status === 'error' ? '(exec error)' : '(unavailable)'; el.className = 'gt-answer'; }
708
- });
709
- applyCoverage(null, 0);
710
- return;
711
- }
712
-
713
- // Apply coverage highlighting
714
- const coverageSet = new Set(gt.coverage);
715
- applyCoverage(coverageSet, gt.total_lines);
716
-
717
- // Fill in variable answers
718
- const answerMap = {};
719
- gt.variable_answers.forEach(a => {
720
- answerMap[`${a.lineno}-${a.var}`] = a.answer_str;
721
- });
722
- taskItems.forEach(item => {
723
- const el = document.getElementById(`gt-${item.lineno}-${item.var}`);
724
- if (el) {
725
- const answer = answerMap[`${item.lineno}-${item.var}`] || '(not available)';
726
- el.textContent = answer;
727
- el.className = 'gt-answer';
728
- }
729
- });
730
-
731
- // Store next-lines data for arrow visualization
732
- if (gt.next_lines_answers) {
733
- gt.next_lines_answers.forEach(a => {
734
- window._nextLinesMap[a.lineno] = a.next_lines;
735
- });
736
- }
737
-
738
- // Attach hover handlers to task-marker spans now that we have next-lines data
739
- attachArrowHoverHandlers();
740
-
741
- } catch (e) {
742
- taskItems.forEach(item => {
743
- const el = document.getElementById(`gt-${item.lineno}-${item.var}`);
744
- if (el) { el.textContent = '(error)'; el.className = 'gt-answer'; }
745
- });
746
- }
747
- }
748
-
749
- // Cache of lineNum → DOM span, rebuilt whenever injectTaskMarkers runs.
750
- window._linenoSpanCache = null;
751
-
752
- function buildLinenoSpanCache(container) {
753
- const cache = {};
754
- container.querySelectorAll('td.linenos .normal').forEach(span => {
755
- const n = parseInt(span.textContent.trim());
756
- if (!isNaN(n)) cache[n] = span;
757
- });
758
- window._linenoSpanCache = cache;
759
- }
760
-
761
- /**
762
- * Get the bounding rect of the lineno span for a given 1-indexed line number,
763
- * relative to the code container element. Uses a cached span map.
764
- */
765
- function getLinenoSpanRect(lineNum, container) {
766
- if (!window._linenoSpanCache) buildLinenoSpanCache(container);
767
- const span = window._linenoSpanCache[lineNum];
768
- if (!span) return null;
769
-   const spanRect = span.getBoundingClientRect();
-   const containerRect = container.getBoundingClientRect();
-   return {
-     top: spanRect.top - containerRect.top + container.scrollTop,
-     bottom: spanRect.bottom - containerRect.top + container.scrollTop,
-     left: spanRect.left - containerRect.left,
-     right: spanRect.right - containerRect.left,
-     width: spanRect.width,
-     height: spanRect.height,
-     midY: (spanRect.top + spanRect.bottom) / 2 - containerRect.top + container.scrollTop,
-   };
- }
-
- /**
-  * Draw arrows from sourceLine to each of the targetLines in the SVG overlay.
-  * Lines are 1-indexed. -1 means "end of execution" (no arrow drawn).
-  */
- function drawArrows(sourceLineNum, targetLineNums) {
-   const container = document.getElementById('code-container');
-   const svg = document.getElementById('arrow-overlay');
-   if (!container || !svg) return;
-
-   // Remove previous arrows (but keep defs)
-   svg.querySelectorAll('.arrow-path').forEach(el => el.remove());
-
-   const srcRect = getLinenoSpanRect(sourceLineNum, container);
-   if (!srcRect) return;
-
-   // Update SVG height to match container
-   svg.setAttribute('height', container.scrollHeight);
-
-   targetLineNums.forEach(targetLineNum => {
-     if (targetLineNum === -1) return; // end of trace — no arrow
-
-     const dstRect = getLinenoSpanRect(targetLineNum, container);
-     if (!dstRect) return;
-
-     // Start point: right edge of source lineno span, vertically centered
-     const x1 = srcRect.right + 2;
-     const y1 = srcRect.midY;
-
-     // End point: right edge of target lineno span, vertically centered
-     const x2 = dstRect.right + 2;
-     const y2 = dstRect.midY;
-
-     // Horizontal offset for the bezier control points — curves to the right
-     const curveOffset = Math.max(30, Math.abs(y2 - y1) * 0.4);
-
-     // Cubic bezier: both control points extend to the right of the lineno column
-     const cx1 = x1 + curveOffset;
-     const cy1 = y1;
-     const cx2 = x2 + curveOffset;
-     const cy2 = y2;
-
-     const path = document.createElementNS('http://www.w3.org/2000/svg', 'path');
-     path.setAttribute('d', `M ${x1} ${y1} C ${cx1} ${cy1}, ${cx2} ${cy2}, ${x2} ${y2}`);
-     path.setAttribute('class', 'exec-arrow arrow-path');
-     path.setAttribute('marker-end', 'url(#arrowhead)');
-     svg.appendChild(path);
-   });
- }
-
- /**
-  * Clear all arrows from the SVG overlay.
-  */
- function clearArrows() {
-   const svg = document.getElementById('arrow-overlay');
-   if (svg) {
-     svg.querySelectorAll('.arrow-path').forEach(el => el.remove());
-   }
- }
-
- // AbortController for the current set of marker hover listeners.
- let _markerListenersAbort = null;
-
- /**
-  * Attach mouseenter/mouseleave handlers to all .task-marker spans so that
-  * hovering shows execution-flow arrows to next lines.
-  */
- function attachArrowHoverHandlers() {
-   // Cancel any previously attached listeners without touching the DOM.
-   if (_markerListenersAbort) _markerListenersAbort.abort();
-   _markerListenersAbort = new AbortController();
-   const { signal } = _markerListenersAbort;
-
-   document.querySelectorAll('.task-marker').forEach(marker => {
-     marker.addEventListener('mouseenter', () => {
-       const lineNum = parseInt(marker.dataset.lineno);
-       if (!lineNum) return;
-       const nextLines = window._nextLinesMap[lineNum];
-       if (nextLines && nextLines.length > 0) {
-         drawArrows(lineNum, nextLines);
-       }
-     }, { signal });
-
-     marker.addEventListener('mouseleave', () => {
-       clearArrows();
-     }, { signal });
-   });
- }
-
- function showCruxTask(taskIdx) {
-   const problem = window.currentProblem;
-   const task = problem.tasks[taskIdx];
-
-   // Update active button
-   document.querySelectorAll('.task-btn').forEach((btn, idx) => {
-     btn.classList.toggle('active', idx === taskIdx);
-   });
-
-   const givenLabel = task.given === 'input' ? 'Input (given)' : 'Output (given)';
-   const predictLabel = task.predict === 'output' ? 'Output (predict)' : 'Input (predict)';
-   const givenValue = task.given === 'input' ? task.input : task.output;
-   const predictValue = task.predict === 'output' ? task.output : task.input;
-
-   const html = `
-     <div class="task-details">
-       <div class="task-section">
-         <p style="margin-bottom: 12px; color: #7f8c8d;">${escapeHtml(task.description)}</p>
-       </div>
-       <div class="io-section">
-         <div class="task-section">
-           <h3>${escapeHtml(givenLabel)}</h3>
-           <pre class="code-block">${escapeHtml(givenValue)}</pre>
-         </div>
-         <div class="task-section">
-           <h3>${escapeHtml(predictLabel)}</h3>
-           <pre class="code-block crux-answer">${escapeHtml(predictValue)}</pre>
-         </div>
-       </div>
-     </div>
-   `;
-
-   document.getElementById('task-content').innerHTML = html;
- }
-
- function showTask(taskIdx) {
-   const problem = window.currentProblem;
-   const task = problem.tasks[taskIdx];
-
-   // Update active button
-   const buttons = document.querySelectorAll('.task-btn');
-   buttons.forEach((btn, idx) => {
-     if (idx === taskIdx) {
-       btn.classList.add('active');
-     } else {
-       btn.classList.remove('active');
-     }
-   });
-
-   // Inject task markers into the code
-   injectTaskMarkers(task.task_items);
-
-   // Clear previous coverage while new one loads
-   applyCoverage(null, 0);
-
-   // Render task content
-   // For HumanEval: Input + Expected Output side by side.
-   // For ClassEval: Input alone (side by side layout), then Test Class below full-width.
-   const ioSection = task.test_class_code
-     ? `<div class="io-section">
-          <div class="task-section">
-            <h3>Input</h3>
-            <pre class="code-block">${escapeHtml(task.input)}</pre>
-          </div>
-        </div>
-        <div class="task-section">
-          <h3>Test Class &mdash; <code>${escapeHtml(task.test_class_name)}</code></h3>
-          <pre class="code-block">${escapeHtml(task.test_class_code)}</pre>
-        </div>`
-     : `<div class="io-section">
-          <div class="task-section">
-            <h3>Input</h3>
-            <pre class="code-block">${escapeHtml(task.input)}</pre>
-          </div>
-          <div class="task-section">
-            <h3>Expected Output</h3>
-            <pre class="code-block">${escapeHtml(task.output)}</pre>
-          </div>
-        </div>`;
-
-   let html = `
-     <div class="task-details">
-       ${ioSection}
-   `;
-
-   // Show task items with ground truth answer placeholders
-   if (task.task_items && task.task_items.length > 0) {
-     html += `
-       <div class="task-section">
-         <h3>Reasoning Tasks</h3>
-         <p style="margin-bottom: 10px; color: #7f8c8d;">
-           Variable state at each execution point (correct answer shown in
-           <span style="background:#17a2b8;color:white;padding:1px 6px;border-radius:3px;font-size:0.82rem;">teal</span>):
-         </p>
-         <ul class="task-items-list">
-     `;
-
-     task.task_items.forEach(item => {
-       html += `
-         <li>
-           <span class="line-ref">Line ${item.lineno}</span>
-           <span class="var-name">${escapeHtml(item.var)}</span>
-           <span class="gt-answer loading" id="gt-${item.lineno}-${item.var}">…</span>
-         </li>
-       `;
-     });
-
-     html += `
-         </ul>
-       </div>
-     `;
-   }
-
-   // Show output prediction task if exists
-   if (task.output_pred) {
-     html += `
-       <div class="task-section">
-         <h3>Output Completion Task</h3>
-         <p style="margin-bottom: 10px; color: #7f8c8d;">
-           The model needs to complete this test assertion:
-         </p>
-         <pre class="code-block">${escapeHtml(task.output_pred)}</pre>
-       </div>
-     `;
-   }
-
-   html += `</div>`;
-
-   document.getElementById('task-content').innerHTML = html;
-
-   // Fetch and apply ground truth (coverage + variable answers)
-   if (hasGroundTruth && task.task_items) {
-     loadAndApplyGroundTruth(problem.idx, task.input_idx, task.task_items);
-   }
- }
-
- function escapeHtml(text) {
-   if (text === null || text === undefined) return '';
-   const div = document.createElement('div');
-   div.textContent = text;
-   return div.innerHTML;
- }
-
- loadProblem();
  </script>
  {% endblock %}
  {% endblock %}

  {% block extra_css %}
+ {{ css|safe }}
+ {% endblock %}

+ {% block extra_head %}
+ <link rel="stylesheet" href="{{ url_for('static', filename='problem.css') }}">
  {% endblock %}

  {% block content %}

  const datasetName = {{ dataset_name|tojson }};
  const hasGroundTruth = {{ has_ground_truth|tojson }};
  const hasTasks = {{ has_tasks|tojson }};

  </script>
+ <script src="{{ url_for('static', filename='problem.js') }}"></script>
  {% endblock %}
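The hover-arrow geometry removed above (now living in static/problem.js) reduces to a cubic Bézier whose two control points bow out to the right of the line-number column. A minimal, DOM-free sketch of that path computation, runnable in Node; the function name `bezierArrowPath` is illustrative, not a name used in the repo:

```javascript
// Build the SVG path string for an execution-flow arrow between two
// line-number spans, mirroring the control-point math in drawArrows().
// (x1, y1) and (x2, y2) are the anchor points at the right edge of the
// source and target lineno spans.
function bezierArrowPath(x1, y1, x2, y2) {
  // The bulge grows with vertical distance between the two lines,
  // with a 30px floor so short hops still curve visibly.
  const curveOffset = Math.max(30, Math.abs(y2 - y1) * 0.4);
  const cx1 = x1 + curveOffset; // control point 1: level with the source
  const cx2 = x2 + curveOffset; // control point 2: level with the target
  return `M ${x1} ${y1} C ${cx1} ${y1}, ${cx2} ${y2}, ${x2} ${y2}`;
}

// Vertical distance 90 → offset 36 (0.4 * 90 beats the 30px floor)
console.log(bezierArrowPath(40, 10, 40, 100));
```

Because both control points share the source and target y-coordinates respectively, the curve leaves and re-enters the gutter horizontally, which is what makes the arrows read as "jump right, come back" in the rendered view.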