egor-bogomolov committed on
Commit 9a8a9c5 · 1 parent: f3f0934

Add 28 benchmark datasets with rich visualization views

Datasets (28 total):
- Code Generation: REval, HumanEval+, MBPP+, MBPP, ClassEval, LiveCodeBench,
APPS, CodeContests, BigOBench, BigCodeBench, EffiBench, CodeSearchNet
- Code Reasoning: CRUXEval, HumanEvalPack (6 langs)
- Code Editing: SWE-bench Lite/Verified/Full, DebugBench, CanItEdit,
CodeEditorBench, CodeXGLUE Refinement, CommitBench
- Code Completion: SAFIM, HumanEval-X (5 langs)
- Vulnerability Detection: BigVul, DiverseVul, PrimeVul, Devign

View types:
- Simple view (code + inputs/outputs + tests)
- Before/After view with diff highlighting (DebugBench, CanItEdit, etc.)
- GitHub-style diff view with per-file sections and repo/issue/commit links (SWE-bench, CommitBench)
- Multi-language tabs (HumanEval-X, HumanEvalPack with canonical/buggy toggle)
- Fill-in-the-Middle view with inline hole markers (SAFIM)
- Vulnerability view with CWE badges (BigVul, DiverseVul, PrimeVul, Devign)
- Multi-solution view with complexity badges (BigOBench, CodeContests, APPS)

Architecture:
- Refactored to adapters/ package (code_generation, code_editing, code_reasoning, vulnerability)
- Extracted CSS/JS to static/problem.css and static/problem.js
- Deterministic random sampling (seed=42, cap=1000) for large datasets
- Dataset dropdown shows original size for sampled datasets (e.g. '1000 of 33050')
- Compact stats bar with total count and top 5 source tags
- SWE-bench: GitHub-style per-file diff sections with repository/issue/commit links
- SAFIM: inline answer placement at TODO markers instead of end-of-file

.gitignore CHANGED
@@ -69,3 +69,6 @@ dmypy.json
 
 # Ruff
 .ruff_cache/
+
+# AIR
+.air/
CLAUDE.md CHANGED
@@ -22,10 +22,14 @@
    - Port: 7860 (default), configurable via PORT env var
    - Debug mode: controlled by FLASK_DEBUG env var
 
-2. **dataset_adapters.py** - Dataset adapter system
-   - `DatasetAdapter` base class with common interface
-   - Concrete adapters for: DREval, CRUXEval, HumanEval+, BigOBench
-   - Registry pattern (`REGISTRY` dict) for dataset management
+2. **adapters/** - Dataset adapter system (modular package)
+   - `__init__.py` - `DatasetAdapter` base class, `REGISTRY` dict, `_set_helpers()` injection
+   - `code_generation.py` - REval, HumanEval+, MBPP+, MBPP, ClassEval, LiveCodeBench, APPS, CodeContests, BigOBench, CodeSearchNet, BigCodeBench, EffiBench
+   - `code_editing.py` - SWE-bench Lite/Verified/Full, DebugBench, CanItEdit, CodeEditorBench, CodeXGLUE Refinement, CommitBench
+   - `code_reasoning.py` - CRUXEval, SAFIM, HumanEval-X, HumanEvalPack
+   - `vulnerability.py` - BigVul, DiverseVul, PrimeVul, Devign
+   - `registration.py` - `register_hf_datasets()`, sampling helpers, JSONL loading
+   - 28 concrete adapters total
    - Each adapter normalizes dataset-specific formats to common API
 
 3. **templates/** - Jinja2 HTML templates
@@ -33,9 +37,13 @@
    - `index.html` - Problem list view with filtering
    - `problem.html` - Problem detail view with syntax highlighting
 
-4. **requirements.txt** / **pyproject.toml** - Dependencies
+4. **static/** - Frontend assets
+   - `problem.css` - Problem detail page styles
+   - `problem.js` - Problem detail page JavaScript (view rendering, diff, FIM, multi-language)
+
+5. **requirements.txt** / **pyproject.toml** - Dependencies
    - Core: flask, pygments
-   - Optional HF: datasets (for CRUXEval, HumanEval+, BigOBench)
+   - Optional HF: datasets, huggingface_hub (for all 28 benchmark datasets)
    - Dev: ruff
 
 ### Data Flow
@@ -59,7 +67,7 @@ User Request → Flask Route → Dataset Adapter → API Response → Template/J
 
 ### Python Files
 - **app.py**: Main entry point, Flask routes, ground truth logic
-- **dataset_adapters.py**: Adapter implementations for all datasets
+- **adapters/**: Adapter package (see Architecture above)
 - **ground_truth_loader.py**: (parent dir) Loads execution traces for DREval
 - **dynamics.py**: (parent dir) Contains `Nil` singleton for missing values
 
@@ -75,25 +83,21 @@ User Request → Flask Route → Dataset Adapter → API Response → Template/J
 
 ## Key Functionalities
 
-### 1. Dataset Support
-
-**DREval** (primary dataset):
-- 328 problems (164 HumanEval + 164 ClassEval)
-- Ground truth execution traces available
-- Tasks: Coverage, Path, State, Output predictions
-- Test inputs with expected outputs
-
-**CRUXEval** (HuggingFace):
-- Input/Output prediction tasks
-- Single function execution reasoning
-
-**HumanEval+** (HuggingFace):
-- Extended HumanEval with additional tests
-- No execution traces
-
-**BigOBench** (HuggingFace):
-- Algorithm complexity analysis
-- Multiple solutions per problem with time/space complexity labels
+### 1. Dataset Support (28 datasets)
+
+**Code Generation**: REval (154), HumanEval+ (164), MBPP+ (378), MBPP (500), ClassEval (100), LiveCodeBench (1000), APPS (1000), CodeContests (165), BigOBench (556), BigCodeBench (1140), EffiBench (1000)
+
+**Code Reasoning**: CRUXEval (800), HumanEvalPack (6x164)
+
+**Code Editing**: SWE-bench Lite (300), SWE-bench Verified (500), SWE-bench (1000), DebugBench (1000), CanItEdit (105), CodeEditorBench (1000), CodeXGLUE Refinement (1000), CommitBench (1000)
+
+**Code Completion/Translation**: SAFIM (1000), HumanEval-X (5x164), CodeSearchNet (1000)
+
+**Vulnerability Detection**: BigVul (1000), DiverseVul (1000), PrimeVul (1000), Devign (1000)
+
+Note: Large datasets are sampled down to 1000 entries (seed=42) for fast browsing.
+
+REval is the primary dataset with ground truth execution traces. All other datasets are loaded from HuggingFace Hub.
 
 ### 2. Problem Browsing
 
@@ -320,32 +324,80 @@ When making changes, verify:
 - **datasets**: HuggingFace datasets (>=2.14.0, optional)
 - **ruff**: Linting and formatting (>=0.8.0, dev)
 
-### Data Sources
-- **DREval**: Local JSONL files in data/ directory
-- **CRUXEval**: cruxeval-org/cruxeval (HuggingFace Hub)
-- **HumanEval+**: evalplus/humanevalplus (HuggingFace Hub)
-- **BigOBench**: facebook/BigOBench (HuggingFace Hub)
-
-## Future Enhancements (Not Implemented)
-
-Potential areas for improvement:
-- User authentication and saved preferences
-- Export functionality (PDF, CSV)
-- Comparison view for multiple solutions
-- Interactive debugging/stepping through execution
-- Code editing and re-evaluation
-- Dataset upload functionality
-- Performance metrics visualization
+### Data Sources (all HuggingFace Hub)
+- **REval**: JetBrains-Research/REval
+- **CRUXEval**: cruxeval-org/cruxeval
+- **HumanEval+**: evalplus/humanevalplus
+- **BigOBench**: facebook/BigOBench
+- **MBPP+**: evalplus/mbppplus
+- **ClassEval**: FudanSELab/ClassEval
+- **LiveCodeBench**: livecodebench/code_generation_lite (via `_load_jsonl_dataset`)
+- **DebugBench**: Rtian/DebugBench
+- **HumanEval-X**: THUDM/humaneval-x (via `_load_jsonl_dataset`)
+- **SWE-bench Lite**: princeton-nlp/SWE-bench_Lite
+- **SWE-bench Verified**: princeton-nlp/SWE-bench_Verified
+- **SWE-bench**: princeton-nlp/SWE-bench
+- **CodeContests**: deepmind/code_contests
+- **APPS**: codeparrot/apps (via `refs/convert/parquet` revision)
+- **CanItEdit**: nuprl/CanItEdit
+- **MBPP**: google-research-datasets/mbpp
+- **SAFIM**: gonglinyuan/safim
+- **BigVul**: bstee615/bigvul
+- **DiverseVul**: claudios/DiverseVul
+- **PrimeVul**: starsofchance/PrimeVul (via direct JSONL loading)
+- **CodeEditorBench**: m-a-p/CodeEditorBench (via `_load_jsonl_dataset` per task type)
+- **CodeSearchNet**: code-search-net/code_search_net
+- **Devign**: google/code_x_glue_cc_defect_detection
+- **BigCodeBench**: bigcode/bigcodebench
+- **HumanEvalPack**: bigcode/humanevalpack (per-language configs)
+- **CodeXGLUE Refinement**: google/code_x_glue_cc_code_refinement
+- **CommitBench**: Maxscha/commitbench
+- **EffiBench**: DONG19/EffiBench
+
+## Benchmark Expansion
+
+### Progress Tracking
+See `PROGRESS.md` for detailed batch plan and status.
+See `benchmarks_analysis.csv` for full analysis of 35+ benchmarks.
+
+### Multi-language Syntax Highlighting
+The `highlight_code()` function in `app.py` accepts an optional `language` parameter
+(default: `"python"`). Supported languages are mapped via `LEXER_MAP` to Pygments lexers.
+Adapters pass the language when calling `_highlight_code(code, language=...)`.
+
+### View Types
+The problem detail page (`problem.html`) supports several view types, dispatched in `renderProblem()`:
+1. **BigOBench view** — multiple solutions with complexity badges
+2. **Simple view** — code + inputs/outputs + test suite (HumanEval+, MBPP+, MBPP, ClassEval, BigCodeBench, EffiBench)
+3. **CRUXEval view** — given/predict task selector
+4. **DREval view** — full interactive view with coverage, arrows, ground truth
+5. **Before/After view** — side-by-side buggy/fixed code (DebugBench, CanItEdit, CodeEditorBench, CodeXGLUE Refinement)
+6. **Multi-language view** — same problem in multiple languages (HumanEval-X, HumanEvalPack with canonical/buggy toggle)
+7. **Diff view** — patch visualization (SWE-bench Lite, SWE-bench Verified, SWE-bench, CommitBench)
+8. **Fill-in-the-Middle view** — prefix + [HOLE] + suffix (SAFIM)
+9. **Vulnerability view** — vulnerable/patched code + CWE labels (BigVul, DiverseVul, PrimeVul, Devign)
+
+### Adding New Datasets (Updated)
+1. Create adapter class in the appropriate `adapters/` submodule inheriting from `DatasetAdapter`
+2. Implement: `problem_count()`, `get_problem_summary()`, `get_problem_detail()`
+3. Set class attributes: `slug`, `display_name`, `has_ground_truth`, `has_tasks`
+4. Import adapter in `adapters/registration.py` and add registration in `register_hf_datasets()` with try/except
+5. If new language: ensure `LEXER_MAP` in `app.py` has the needed lexer
+6. If new view type: add rendering branch in `static/problem.js` `renderProblem()`
+7. Add badge color in `base.html` CSS
+8. Test: `/api/<slug>/problems` and `/api/<slug>/problem/<idx>`
 
 ## Related Documentation
 
 - **README.md**: User-facing documentation, installation instructions
+- **PROGRESS.md**: Batch integration progress and architecture decisions
+- **benchmarks_analysis.csv**: Full benchmark analysis with prioritization
 - **pyproject.toml**: Package metadata, dependencies, ruff configuration
 - **Dockerfile**: Container deployment configuration (if present)
 - **requirements.txt**: Pip-format dependency list
 
 ---
 
-**Last Updated**: 2026-03-02
-**Project Status**: Active Development
+**Last Updated**: 2026-03-04
+**Project Status**: Active Development — Benchmark Expansion Phase
 **Primary Maintainer**: Egor Bogomolov
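The multi-language highlighting described in the CLAUDE.md changes above (a `LEXER_MAP` resolved through Pygments) can be sketched roughly as follows. The map contents are an illustrative subset; the real `LEXER_MAP` in `app.py` may differ:

```python
from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import get_lexer_by_name
from pygments.util import ClassNotFound

# Illustrative subset; the real LEXER_MAP in app.py may list more aliases.
LEXER_MAP = {
    "python": "python",
    "cpp": "cpp",
    "java": "java",
    "go": "go",
    "javascript": "javascript",
}


def highlight_code(code: str, language: str = "python") -> str:
    """Return highlighted HTML, falling back to plain text on unknown languages."""
    try:
        lexer = get_lexer_by_name(LEXER_MAP.get(language, language))
    except ClassNotFound:
        lexer = get_lexer_by_name("text")  # unknown alias: no colorization
    return highlight(code, lexer, HtmlFormatter())
```

Falling back to the `text` lexer instead of raising keeps a single bad language tag from breaking a whole problem page.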
PROGRESS.md ADDED
@@ -0,0 +1,114 @@
+# Benchmark Integration Progress
+
+## Status: Batches 1-5 Complete
+
+## Batch Plan
+
+### Batch 1 (Highest Priority -- Easy HF, High Influence)
+| Benchmark | Slug | Status | HF Dataset | View Type |
+|-----------|------|--------|------------|-----------|
+| MBPP+ | `mbppplus` | Done | `evalplus/mbppplus` | Simple |
+| ClassEval | `classeval` | Done | `FudanSELab/ClassEval` | Simple |
+| LiveCodeBench | `livecodebench` | Done | `livecodebench/code_generation_lite` | Simple |
+| DebugBench | `debugbench` | Done | `Rtian/DebugBench` | Before/After |
+| HumanEval-X | `humanevalx` | Done | `THUDM/humaneval-x` | Multi-language |
+
+**Refactoring done:** Multi-language syntax highlighting via `get_lexer_by_name()`. Before/after code diff view. Multi-language tab view.
+
+### Batch 2
+| Benchmark | Slug | Status | HF Dataset | View Type |
+|-----------|------|--------|------------|-----------|
+| SWE-bench Lite | `swebenchlite` | Done | `princeton-nlp/SWE-bench_Lite` | Diff |
+| CodeContests | `codecontests` | Done | `deepmind/code_contests` | Multi-solution |
+| APPS | `apps` | Done | `codeparrot/apps` | Multi-solution / Simple |
+| CanItEdit | `canitedit` | Done | `nuprl/CanItEdit` | Before/After |
+| MBPP | `mbpp` | Done | `google-research-datasets/mbpp` | Simple |
+
+**New views:** Unified diff view for SWE-bench patches. Multi-solution view extended to show language labels for CodeContests.
+
+### Batch 3
+| Benchmark | Slug | Status | HF Dataset | View Type |
+|-----------|------|--------|------------|-----------|
+| SAFIM | `safim` | Done | `gonglinyuan/safim` | Fill-in-the-Middle |
+| BigVul | `bigvul` | Done | `bstee615/bigvul` | Vulnerability |
+| DiverseVul | `diversevul` | Done | `claudios/DiverseVul` | Vulnerability |
+| PrimeVul | `primevul` | Done | `starsofchance/PrimeVul` | Vulnerability |
+| CodeEditorBench | `codeeditorbench` | Done | `m-a-p/CodeEditorBench` | Before/After |
+
+**New views:** Fill-in-the-Middle view showing code with [HOLE] marker and ground truth. Vulnerability view with CWE badges and vulnerable/patched code comparison.
+
+### Batch 4
+| Benchmark | Slug | Status | HF Dataset | View Type |
+|-----------|------|--------|------------|-----------|
+| SWE-bench Verified | `swebenchverified` | Done | `princeton-nlp/SWE-bench_Verified` | Diff |
+| CodeSearchNet | `codesearchnet` | Done | `code-search-net/code_search_net` | Simple |
+| Devign | `devign` | Done | `google/code_x_glue_cc_defect_detection` | Vulnerability |
+
+### Dropped from original plan
+| Benchmark | Reason |
+|-----------|--------|
+| DS-1000 | Complex library-specific format, limited visualization value |
+| RepoBench | Repo-level context too complex for per-problem viewing |
+| MultiPL-E | 22 languages but same problems as HumanEval/MBPP already covered |
+| McEval | Very large (40 languages), complex format |
+| xCodeEval | Very large (25M rows), 7 tasks, too complex |
+| CrossVul | Similar to DiverseVul/BigVul, diminishing returns |
+
+### Batch 5
+| Benchmark | Slug | Status | HF Dataset | View Type |
+|-----------|------|--------|------------|-----------|
+| BigCodeBench | `bigcodebench` | Done | `bigcode/bigcodebench` | Simple |
+| HumanEvalPack | `humanevalpack` | Done | `bigcode/humanevalpack` | Multi-language + Before/After |
+| CodeXGLUE Refinement | `codexgluerefinement` | Done | `google/code_x_glue_cc_code_refinement` | Before/After |
+| SWE-bench | `swebenchfull` | Done | `princeton-nlp/SWE-bench` | Diff |
+| CommitBench | `commitbench` | Done | `Maxscha/commitbench` | Diff |
+| EffiBench | `effibench` | Done | `DONG19/EffiBench` | Simple |
+
+**New views:** Multi-language view with canonical/buggy code toggle (HumanEvalPack). CommitBench reuses diff view. CodeXGLUE Refinement uses before/after Java view.
+
+### Deferred (GitHub-only or complex infrastructure)
+CoderEval, NaturalCodeBench, DevEval, RunBugRun, Defects4J, ConDefects, FixEval, TransCoder, AVATAR, TypeEvalPy, VJBench, SVEN, PyTER
+
+## Architecture Decisions
+
+### Multi-language Support
+- `highlight_code()` in `app.py` accepts `language` parameter (default: `"python"`)
+- Uses `get_lexer_by_name()` from Pygments for automatic lexer selection
+- Adapters pass language when calling `_highlight_code(code, language=...)`
+
+### View Types Implemented
+1. **BigOBench view** -- multiple solutions with complexity badges
+2. **Simple view** -- code + inputs/outputs + test suite (HumanEval+, MBPP+, MBPP, ClassEval, LiveCodeBench, APPS, CodeSearchNet)
+3. **CRUXEval view** -- given/predict task selector
+4. **DREval view** -- full interactive view with coverage, arrows, ground truth
+5. **Before/After view** -- side-by-side buggy/fixed code (DebugBench, CanItEdit, CodeEditorBench)
+6. **Multi-language view** -- same problem in multiple languages (HumanEval-X, HumanEvalPack)
+7. **Diff view** -- unified diff patch visualization (SWE-bench Lite, SWE-bench Verified, SWE-bench, CommitBench)
+8. **Fill-in-the-Middle view** -- prefix + [HOLE] + suffix (SAFIM)
+9. **Vulnerability view** -- vulnerable/patched code + CWE labels (BigVul, DiverseVul, PrimeVul, Devign)
+
+## Total Datasets: 28
+Base (4): REval, CRUXEval, HumanEval+, BigOBench
+Batch 1 (5): MBPP+, ClassEval, LiveCodeBench, DebugBench, HumanEval-X
+Batch 2 (5): SWE-bench Lite, CodeContests, APPS, CanItEdit, MBPP
+Batch 3 (5): SAFIM, BigVul, DiverseVul, PrimeVul, CodeEditorBench
+Batch 4 (3): SWE-bench Verified, CodeSearchNet, Devign
+Batch 5 (6): BigCodeBench, HumanEvalPack, CodeXGLUE Refinement, SWE-bench, CommitBench, EffiBench
+
+## Changelog
+
+- 2026-03-03: Initial benchmark analysis and prioritization complete
+- 2026-03-03: Batch 1 complete (MBPP+, ClassEval, LiveCodeBench, DebugBench, HumanEval-X)
+- 2026-03-03: Batch 2 complete (SWE-bench Lite, CodeContests, APPS, CanItEdit, MBPP)
+- 2026-03-03: Batch 3 complete (SAFIM, BigVul, DiverseVul, PrimeVul, CodeEditorBench)
+- 2026-03-03: Batch 4 complete (SWE-bench Verified, CodeSearchNet, Devign)
+- 2026-03-03: Fixed APPS loading (refs/convert/parquet), PrimeVul (direct JSONL), CodeEditorBench (per-task JSONL)
+- 2026-03-03: All 22 datasets verified loading successfully
+- 2026-03-04: Refactored adapters into submodules (adapters/code_generation.py, code_editing.py, code_reasoning.py, vulnerability.py)
+- 2026-03-04: Extracted CSS and JS into static/ directory (static/problem.css, static/problem.js)
+- 2026-03-04: Added sampling for large datasets (cap at 1000 with seed=42)
+- 2026-03-04: Enhanced FIM view (merged code with ground truth highlighting)
+- 2026-03-04: Enhanced Before/After view (diff highlighting)
+- 2026-03-04: Enhanced SWE-bench diff view (full file with diff chunks)
+- 2026-03-04: Batch 5 complete (BigCodeBench, HumanEvalPack, CodeXGLUE Refinement, SWE-bench, CommitBench, EffiBench)
+- 2026-03-04: All 28 datasets verified loading successfully
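The enhanced FIM view in the changelog above (merged code with ground-truth highlighting) amounts to splicing the answer into the hole. A minimal sketch, with `merge_fim` as an illustrative name rather than the actual helper in `static/problem.js` or the SAFIM adapter:

```python
def merge_fim(prefix: str, suffix: str, ground_truth: str,
              marker: str = "[HOLE]") -> tuple[str, str]:
    """Return (code_with_hole, merged_code) for a fill-in-the-middle task.

    code_with_hole shows the task as the model sees it; merged_code
    splices the ground-truth answer in so the viewer can highlight it.
    """
    code_with_hole = prefix + marker + suffix
    merged_code = prefix + ground_truth + suffix
    return code_with_hole, merged_code
```

The viewer can then highlight the `ground_truth` span inside `merged_code`, since its offset is simply `len(prefix)`.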
README.md CHANGED
@@ -11,14 +11,55 @@ pinned: false
 
 A web-based interface for browsing and manually inspecting individual datapoints from popular ML4SE (Machine Learning for Software Engineering) benchmark datasets.
 
-## Supported Datasets
-
-| Dataset | Description |
-|---------|-------------|
-| **REval** | Dynamic reasoning evaluation with execution traces and ground truth variable states |
-| **CRUXEval** | Input/output prediction tasks for single-function execution reasoning |
-| **HumanEval+** | Extended HumanEval with additional tests |
-| **BigOBench** | Algorithm complexity analysis with time/space complexity labels |
+## Supported Datasets (28)
+
+### Code Generation
+| Dataset | Source | View Type |
+|---------|--------|-----------|
+| **HumanEval+** | evalplus/humanevalplus | Simple |
+| **MBPP+** | evalplus/mbppplus | Simple |
+| **MBPP** | google-research-datasets/mbpp | Simple |
+| **ClassEval** | FudanSELab/ClassEval | Simple |
+| **LiveCodeBench** | livecodebench/code_generation_lite | Simple |
+| **APPS** | codeparrot/apps | Multi-solution |
+| **CodeContests** | deepmind/code_contests | Multi-solution |
+| **BigOBench** | facebook/BigOBench | Complexity badges |
+| **BigCodeBench** | bigcode/bigcodebench | Simple |
+| **EffiBench** | DONG19/EffiBench | Simple |
+
+### Code Reasoning & Evaluation
+| Dataset | Source | View Type |
+|---------|--------|-----------|
+| **REval** | JetBrains-Research/REval | Interactive (coverage, arrows, ground truth) |
+| **CRUXEval** | cruxeval-org/cruxeval | Given/Predict task selector |
+| **HumanEvalPack** | bigcode/humanevalpack | Multi-language + buggy/canonical |
+
+### Code Editing & Debugging
+| Dataset | Source | View Type |
+|---------|--------|-----------|
+| **SWE-bench Lite** | princeton-nlp/SWE-bench_Lite | Unified diff |
+| **SWE-bench Verified** | princeton-nlp/SWE-bench_Verified | Unified diff |
+| **SWE-bench** | princeton-nlp/SWE-bench | Unified diff |
+| **DebugBench** | Rtian/DebugBench | Before/After |
+| **CanItEdit** | nuprl/CanItEdit | Before/After |
+| **CodeEditorBench** | m-a-p/CodeEditorBench | Before/After |
+| **CodeXGLUE Refinement** | google/code_x_glue_cc_code_refinement | Before/After |
+| **CommitBench** | Maxscha/commitbench | Unified diff |
+
+### Code Completion & Translation
+| Dataset | Source | View Type |
+|---------|--------|-----------|
+| **SAFIM** | gonglinyuan/safim | Fill-in-the-Middle |
+| **HumanEval-X** | THUDM/humaneval-x | Multi-language tabs |
+| **CodeSearchNet** | code-search-net/code_search_net | Simple |
+
+### Vulnerability Detection
+| Dataset | Source | View Type |
+|---------|--------|-----------|
+| **BigVul** | bstee615/bigvul | Vulnerability (CWE badges) |
+| **DiverseVul** | claudios/DiverseVul | Vulnerability |
+| **PrimeVul** | starsofchance/PrimeVul | Vulnerability |
+| **Devign** | google/code_x_glue_cc_defect_detection | Vulnerability |
 
 ## Installation & Usage
 
@@ -45,7 +86,7 @@ uv run ruff format .
 
 ### Adding a New Dataset
 
-1. Create an adapter class in `dataset_adapters.py` inheriting from `DatasetAdapter`
+1. Create an adapter class in the appropriate `adapters/` submodule inheriting from `DatasetAdapter`
 2. Implement required methods: `problem_count()`, `get_problem_summary()`, `get_problem_detail()`
-3. Register the adapter in the `REGISTRY`
+3. Register the adapter in `adapters/registration.py`
 4. Test: `/api/<slug>/problems` and `/api/<slug>/problem/<idx>`
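Registration in `adapters/registration.py` wraps each dataset load in try/except (per the CLAUDE.md notes in this commit), so one failing download doesn't take down the whole viewer. A hedged sketch of that pattern; `_register`, `_DemoAdapter`, and `_broken_loader` are illustrative names, and the real loaders would call `datasets.load_dataset(...)`:

```python
REGISTRY: dict = {}  # stand-in for adapters.REGISTRY


def _register(slug, loader):
    """Try to build one adapter; skip (with a note) if loading fails."""
    try:
        REGISTRY[slug] = loader()
    except Exception as exc:  # e.g. dataset offline or schema changed
        print(f"skipping {slug}: {exc}")


class _DemoAdapter:
    """Illustrative adapter exposing the minimal counting API."""

    def __init__(self, rows):
        self._rows = rows

    def problem_count(self):
        return len(self._rows)


def _broken_loader():
    raise RuntimeError("no network")  # simulates a failed HF download


def register_hf_datasets():
    _register("demo", lambda: _DemoAdapter([1, 2, 3]))
    _register("broken", _broken_loader)  # logged and skipped, not fatal
```

After `register_hf_datasets()` runs, only the adapters that loaded successfully appear in `REGISTRY`, which is what the dataset dropdown is built from.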
adapters/__init__.py ADDED
@@ -0,0 +1,82 @@
+"""
+Dataset adapters for the ML4SE Benchmark Viewer.
+
+Each adapter normalises a different benchmark dataset into a common API shape
+so the Flask routes and templates can handle them uniformly.
+
+The REGISTRY dict maps slug strings (used in URLs) to adapter instances.
+"""
+
+from __future__ import annotations
+
+from typing import Any
+
+# ---------------------------------------------------------------------------
+# Helper function stubs – injected at runtime by app.py via _set_helpers()
+# ---------------------------------------------------------------------------
+
+_highlight_code = None
+_code_offset = None
+_extract_test_classes = None
+
+
+def _set_helpers(highlight_code_fn, code_offset_fn, extract_test_classes_fn):
+    """Called once by app.py to inject helper functions."""
+    global _highlight_code, _code_offset, _extract_test_classes
+    _highlight_code = highlight_code_fn
+    _code_offset = code_offset_fn
+    _extract_test_classes = extract_test_classes_fn
+
+    # Propagate to submodules so adapters can use them
+    from adapters import code_editing, code_generation, code_reasoning, vulnerability
+
+    for mod in (code_generation, code_editing, code_reasoning, vulnerability):
+        mod._highlight_code = highlight_code_fn
+        mod._code_offset = code_offset_fn
+        mod._extract_test_classes = extract_test_classes_fn
+
+
+# ---------------------------------------------------------------------------
+# Registry
+# ---------------------------------------------------------------------------
+
+REGISTRY: dict[str, DatasetAdapter] = {}
+
+
+# ---------------------------------------------------------------------------
+# Base class
+# ---------------------------------------------------------------------------
+
+
+class DatasetAdapter:
+    slug: str = ""
+    display_name: str = ""
+    has_ground_truth: bool = False
+    has_tasks: bool = False
+    total_count: int | None = None  # original size before sampling (None = not sampled)
+
+    def problem_count(self) -> int:
+        raise NotImplementedError
+
+    def get_problem_summary(self, idx: int) -> dict[str, Any]:
+        raise NotImplementedError
+
+    def get_problem_detail(self, idx: int) -> dict[str, Any]:
+        raise NotImplementedError
+
+    def get_ground_truth(self, idx: int, input_idx: int) -> dict[str, Any]:
+        return {"status": "unavailable", "message": "Ground truth not available for this dataset"}
+
+
+# ---------------------------------------------------------------------------
+# Re-export registration entry point
+# ---------------------------------------------------------------------------
+
+from adapters.registration import register_hf_datasets  # noqa: E402, F401
+
+__all__ = [
+    "REGISTRY",
+    "DatasetAdapter",
+    "_set_helpers",
+    "register_hf_datasets",
+]
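A minimal concrete adapter against the `DatasetAdapter` base class above might look like this. `ListAdapter` is a hypothetical in-memory example for illustration (a trimmed stand-in base class is redefined here so the sketch is self-contained); real adapters wrap a HuggingFace dataset instead of a list:

```python
from typing import Any


class DatasetAdapter:  # trimmed stand-in for adapters.DatasetAdapter above
    slug = ""
    display_name = ""
    has_ground_truth = False
    has_tasks = False


class ListAdapter(DatasetAdapter):
    """Illustrative adapter over an in-memory list of problem dicts."""

    slug = "demo"
    display_name = "Demo"

    def __init__(self, problems: list[dict]):
        self._problems = problems

    def problem_count(self) -> int:
        return len(self._problems)

    def get_problem_summary(self, idx: int) -> dict[str, Any]:
        row = self._problems[idx]
        return {"idx": idx, "task_id": row["id"], "num_inputs": 0, "source": "demo"}

    def get_problem_detail(self, idx: int) -> dict[str, Any]:
        row = self._problems[idx]
        return {"idx": idx, "task_id": row["id"], "code": row["code"]}
```

The three methods mirror the two API endpoints: summaries feed `/api/<slug>/problems`, details feed `/api/<slug>/problem/<idx>`.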
adapters/code_editing.py ADDED
@@ -0,0 +1,403 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Code editing benchmark adapters (SWE-bench, DebugBench, CanItEdit, CodeEditorBench)."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import json
6
+ from typing import Any
7
+
8
+ from adapters import DatasetAdapter
9
+
10
+ # Injected at runtime by _set_helpers()
11
+ _highlight_code = None
12
+ _code_offset = None
13
+ _extract_test_classes = None
14
+
15
+
16
+ # ---------------------------------------------------------------------------
17
+ # SWE-bench Lite adapter (HuggingFace: princeton-nlp/SWE-bench_Lite)
18
+ # ---------------------------------------------------------------------------
19
+
20
+
21
+ class SWEBenchLiteAdapter(DatasetAdapter):
22
+ slug = "swebenchlite"
23
+ display_name = "SWE-bench Lite"
24
+ has_ground_truth = False
25
+ has_tasks = False
26
+
27
+ def __init__(self, hf_dataset):
28
+ self._ds = hf_dataset
29
+
30
+ def problem_count(self) -> int:
31
+ return len(self._ds)
32
+
33
+ def get_problem_summary(self, idx: int) -> dict[str, Any]:
34
+ row = self._ds[idx]
35
+ return {
36
+ "idx": idx,
37
+ "task_id": row["instance_id"],
38
+ "entry_point": row["instance_id"].split("__")[-1],
39
+ "num_inputs": 0,
40
+ "source": row["repo"],
41
+ }
42
+
43
+ @staticmethod
44
+ def _github_links(instance_id: str, repo: str, base_commit: str) -> dict[str, str]:
45
+ """Build GitHub URLs from SWE-bench instance metadata."""
46
+ links: dict[str, str] = {}
47
+ if repo:
48
+ links["repo_url"] = f"https://github.com/{repo}"
49
+ # instance_id format: "repo__issue-number" e.g. "astropy__astropy-12907"
50
+ parts = instance_id.rsplit("-", 1)
51
+ if len(parts) == 2 and parts[1].isdigit() and repo:
52
+ links["issue_url"] = f"https://github.com/{repo}/issues/{parts[1]}"
53
+ if base_commit and repo:
54
+ links["commit_url"] = f"https://github.com/{repo}/commit/{base_commit}"
55
+ return links
56
+
+     def get_problem_detail(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         patch = row["patch"]
+         fail_to_pass = json.loads(row["FAIL_TO_PASS"]) if row["FAIL_TO_PASS"] else []
+         pass_to_pass = json.loads(row["PASS_TO_PASS"]) if row["PASS_TO_PASS"] else []
+         instance_id = row["instance_id"]
+         repo = row["repo"]
+         base_commit = row.get("base_commit", "")
+         return {
+             "idx": idx,
+             "task_id": instance_id,
+             "entry_point": instance_id.split("__")[-1],
+             "code": patch,
+             "highlighted_code": "",
+             "inputs": [],
+             "outputs": [],
+             "test": None,
+             "tasks": [],
+             "source": repo,
+             "has_ground_truth": False,
+             "has_tasks": False,
+             "description": row["problem_statement"],
+             "patch": patch,
+             "test_patch": row.get("test_patch", ""),
+             "fail_to_pass": fail_to_pass,
+             "pass_to_pass": pass_to_pass,
+             "hints": row.get("hints_text", ""),
+             "repo": repo,
+             "base_commit": base_commit,
+             "version": row.get("version", ""),
+             "created_at": row.get("created_at", ""),
+             **self._github_links(instance_id, repo, base_commit),
+         }
+
+
+ # ---------------------------------------------------------------------------
+ # SWE-bench Verified adapter (HuggingFace: princeton-nlp/SWE-bench_Verified)
+ # ---------------------------------------------------------------------------
+
+
+ class SWEBenchVerifiedAdapter(SWEBenchLiteAdapter):
+     slug = "swebenchverified"
+     display_name = "SWE-bench Verified"
+
+
+ class SWEBenchFullAdapter(SWEBenchLiteAdapter):
+     slug = "swebenchfull"
+     display_name = "SWE-bench"
+
+
+ # ---------------------------------------------------------------------------
+ # DebugBench adapter (HuggingFace: Rtian/DebugBench)
+ # ---------------------------------------------------------------------------
+
+
+ class DebugBenchAdapter(DatasetAdapter):
+     slug = "debugbench"
+     display_name = "DebugBench"
+     has_ground_truth = False
+     has_tasks = False
+
+     def __init__(self, hf_dataset):
+         self._ds = hf_dataset
+
+     def problem_count(self) -> int:
+         return len(self._ds)
+
+     def get_problem_summary(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         return {
+             "idx": idx,
+             "task_id": row["slug"],
+             "entry_point": row["slug"],
+             "num_inputs": len(row["examples"]),
+             "source": f"{row['language']}/{row['category']}",
+         }
+
+     def get_problem_detail(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         lang = row["language"]
+         buggy = row["buggy_code"]
+         fixed = row["solution"]
+         return {
+             "idx": idx,
+             "task_id": row["slug"],
+             "entry_point": row["slug"],
+             "code": fixed,
+             "highlighted_code": _highlight_code(fixed, language=lang),
+             "inputs": [],
+             "outputs": [],
+             "test": None,
+             "tasks": [],
+             "source": f"{lang}/{row['category']}",
+             "has_ground_truth": False,
+             "has_tasks": False,
+             "description": row["question"],
+             "language": lang,
+             "buggy_code": buggy,
+             "buggy_highlighted_code": _highlight_code(buggy, language=lang),
+             "fixed_code": fixed,
+             "fixed_highlighted_code": _highlight_code(fixed, language=lang),
+             "bug_category": row["category"],
+             "bug_subtype": row["subtype"],
+             "bug_explanation": row["bug_explanation"],
+             "difficulty": row["level"],
+             "examples": list(row["examples"]),
+         }
+
+
+ # ---------------------------------------------------------------------------
+ # CanItEdit adapter (HuggingFace: nuprl/CanItEdit)
+ # ---------------------------------------------------------------------------
+
+
+ class CanItEditAdapter(DatasetAdapter):
+     slug = "canitedit"
+     display_name = "CanItEdit"
+     has_ground_truth = False
+     has_tasks = False
+
+     def __init__(self, hf_dataset):
+         self._ds = hf_dataset
+
+     def problem_count(self) -> int:
+         return len(self._ds)
+
+     def get_problem_summary(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         taxonomy = row.get("taxonomy", {})
+         change_kind = taxonomy.get("change_kind", "") if isinstance(taxonomy, dict) else ""
+         return {
+             "idx": idx,
+             "task_id": row.get("full_name", str(row.get("id", idx))),
+             "entry_point": row.get("name", f"edit_{idx}"),
+             "num_inputs": 0,
+             "source": change_kind or "CanItEdit",
+         }
+
+     def get_problem_detail(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         before = row["before"]
+         after = row["after"]
+         taxonomy = row.get("taxonomy", {})
+         if not isinstance(taxonomy, dict):
+             taxonomy = {}
+         return {
+             "idx": idx,
+             "task_id": row.get("full_name", str(row.get("id", idx))),
+             "entry_point": row.get("name", f"edit_{idx}"),
+             "code": after,
+             "highlighted_code": _highlight_code(after),
+             "inputs": [],
+             "outputs": [],
+             "test": row.get("tests", ""),
+             "tasks": [],
+             "source": taxonomy.get("change_kind", "CanItEdit"),
+             "has_ground_truth": False,
+             "has_tasks": False,
+             "description": row.get("instruction_descriptive", ""),
+             "buggy_code": before,
+             "buggy_highlighted_code": _highlight_code(before),
+             "fixed_code": after,
+             "fixed_highlighted_code": _highlight_code(after),
+             "bug_category": taxonomy.get("change_kind", ""),
+             "bug_subtype": taxonomy.get("topic", ""),
+             "bug_explanation": row.get("instruction_lazy", ""),
+         }
+
+
+ # ---------------------------------------------------------------------------
+ # CodeEditorBench adapter (HuggingFace: m-a-p/CodeEditorBench)
+ # ---------------------------------------------------------------------------
+
+
+ class CodeEditorBenchAdapter(DatasetAdapter):
+     slug = "codeeditorbench"
+     display_name = "CodeEditorBench"
+     has_ground_truth = False
+     has_tasks = False
+
+     def __init__(self, rows: list[dict[str, Any]]):
+         self._rows = rows
+
+     def problem_count(self) -> int:
+         return len(self._rows)
+
+     def get_problem_summary(self, idx: int) -> dict[str, Any]:
+         row = self._rows[idx]
+         return {
+             "idx": idx,
+             "task_id": str(row.get("idx", idx)),
+             "entry_point": row.get("title", f"problem_{idx}"),
+             "num_inputs": 0,
+             "source": row.get("_task_type", "unknown"),
+         }
+
+     def get_problem_detail(self, idx: int) -> dict[str, Any]:
+         row = self._rows[idx]
+         task_type = row.get("_task_type", "unknown")
+         lang = row.get("code_language", row.get("source_lang", "python")) or "python"
+         lang_key = lang.lower()
+
+         if task_type == "code_debug":
+             buggy = row.get("incorrect_solutions", "")
+             fixed = row.get("solutions", "")
+         elif task_type == "code_translate":
+             buggy = row.get("source_code", "")
+             fixed = row.get("solutions", row.get("source_code", ""))
+         elif task_type == "code_polishment":
+             buggy = row.get("source_code", "")
+             fixed = row.get("solutions", row.get("source_code", ""))
+         else:  # code_switch
+             buggy = row.get("similar_source_code", row.get("source_code", ""))
+             fixed = row.get("solutions", row.get("source_code", ""))
+
+         return {
+             "idx": idx,
+             "task_id": str(row.get("idx", idx)),
+             "entry_point": row.get("title", f"problem_{idx}"),
+             "code": fixed,
+             "highlighted_code": _highlight_code(fixed, language=lang_key) if fixed else "",
+             "inputs": [],
+             "outputs": [],
+             "test": None,
+             "tasks": [],
+             "source": task_type,
+             "has_ground_truth": False,
+             "has_tasks": False,
+             "description": "",
+             "buggy_code": buggy,
+             "buggy_highlighted_code": _highlight_code(buggy, language=lang_key) if buggy else "",
+             "fixed_code": fixed,
+             "fixed_highlighted_code": _highlight_code(fixed, language=lang_key) if fixed else "",
+             "bug_category": task_type,
+             "bug_subtype": row.get("difficulty", ""),
+             "bug_explanation": "",
+             "difficulty": row.get("difficulty", ""),
+             "language": lang,
+         }
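The four-way dispatch above can be condensed into a small standalone sketch; field names follow the adapter, and the row dicts below are toy inputs, not real CodeEditorBench records:

```python
def select_pair(row: dict) -> tuple[str, str]:
    """Pick the (before, after) code pair for a CodeEditorBench task type."""
    task_type = row.get("_task_type", "unknown")
    if task_type == "code_debug":
        return row.get("incorrect_solutions", ""), row.get("solutions", "")
    if task_type in ("code_translate", "code_polishment"):
        return row.get("source_code", ""), row.get("solutions", row.get("source_code", ""))
    # code_switch (and anything unrecognised) starts from the similar snippet
    return (row.get("similar_source_code", row.get("source_code", "")),
            row.get("solutions", row.get("source_code", "")))

pair = select_pair({"_task_type": "code_debug",
                    "incorrect_solutions": "x = 1", "solutions": "x = 2"})
print(pair)  # ('x = 1', 'x = 2')
```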
+
+
+ # ---------------------------------------------------------------------------
+ # CodeXGLUE Code Refinement adapter (HuggingFace: google/code_x_glue_cc_code_refinement)
+ # ---------------------------------------------------------------------------
+
+
+ class CodeXGLUERefinementAdapter(DatasetAdapter):
+     slug = "codexgluerefinement"
+     display_name = "CodeXGLUE Code Refinement"
+     has_ground_truth = False
+     has_tasks = False
+
+     def __init__(self, hf_dataset):
+         self._ds = hf_dataset
+
+     def problem_count(self) -> int:
+         return len(self._ds)
+
+     def get_problem_summary(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         return {
+             "idx": idx,
+             "task_id": str(row.get("id", idx)),
+             "entry_point": f"refinement_{row.get('id', idx)}",
+             "num_inputs": 0,
+             "source": "CodeXGLUE",
+         }
+
+     def get_problem_detail(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         buggy = row.get("buggy", "")
+         fixed = row.get("fixed", "")
+         return {
+             "idx": idx,
+             "task_id": str(row.get("id", idx)),
+             "entry_point": f"refinement_{row.get('id', idx)}",
+             "code": fixed,
+             "highlighted_code": _highlight_code(fixed, language="java") if fixed else "",
+             "inputs": [],
+             "outputs": [],
+             "test": None,
+             "tasks": [],
+             "source": "CodeXGLUE",
+             "has_ground_truth": False,
+             "has_tasks": False,
+             "description": "",
+             "buggy_code": buggy,
+             "buggy_highlighted_code": _highlight_code(buggy, language="java") if buggy else "",
+             "fixed_code": fixed,
+             "fixed_highlighted_code": _highlight_code(fixed, language="java") if fixed else "",
+             "bug_category": "Code Refinement",
+             "bug_subtype": "",
+             "bug_explanation": "",
+             "language": "Java",
+         }
+
+
+ # ---------------------------------------------------------------------------
+ # CommitBench adapter (HuggingFace: Maxscha/commitbench)
+ # ---------------------------------------------------------------------------
+
+
+ class CommitBenchAdapter(DatasetAdapter):
+     slug = "commitbench"
+     display_name = "CommitBench"
+     has_ground_truth = False
+     has_tasks = False
+
+     def __init__(self, hf_dataset):
+         self._ds = hf_dataset
+
+     def problem_count(self) -> int:
+         return len(self._ds)
+
+     def get_problem_summary(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         return {
+             "idx": idx,
+             "task_id": row.get("hash", str(idx))[:12],
+             "entry_point": row.get("project", f"commit_{idx}"),
+             "num_inputs": 0,
+             "source": row.get("diff_languages", "unknown"),
+         }
+
+     def get_problem_detail(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         diff = row.get("diff", "")
+         message = row.get("message", "")
+         return {
+             "idx": idx,
+             "task_id": row.get("hash", str(idx))[:12],
+             "entry_point": row.get("project", f"commit_{idx}"),
+             "code": diff,
+             "highlighted_code": "",
+             "inputs": [],
+             "outputs": [],
+             "test": None,
+             "tasks": [],
+             "source": row.get("diff_languages", "unknown"),
+             "has_ground_truth": False,
+             "has_tasks": False,
+             "description": message,
+             "patch": diff,
+             "repo": row.get("project", ""),
+             "commit_hash": row.get("hash", ""),
+             "diff_languages": row.get("diff_languages", ""),
+         }
dataset_adapters.py → adapters/code_generation.py RENAMED
@@ -1,65 +1,24 @@
- """
- Dataset adapters for the ML4SE Benchmark Viewer.
-
- Each adapter normalises a different benchmark dataset into a common API shape
- so the Flask routes and templates can handle them uniformly.
-
- The REGISTRY dict maps slug strings (used in URLs) to adapter instances.
- """

  from __future__ import annotations

  import json
  from typing import Any

- # These are imported from app.py at registration time to avoid circular imports.
  _highlight_code = None
  _code_offset = None
  _extract_test_classes = None


- def _set_helpers(highlight_code_fn, code_offset_fn, extract_test_classes_fn):
-     """Called once by app.py to inject helper functions."""
-     global _highlight_code, _code_offset, _extract_test_classes
-     _highlight_code = highlight_code_fn
-     _code_offset = code_offset_fn
-     _extract_test_classes = extract_test_classes_fn
-
-
- # ---------------------------------------------------------------------------
- # Registry
- # ---------------------------------------------------------------------------
-
- REGISTRY: dict[str, "DatasetAdapter"] = {}
-
-
- # ---------------------------------------------------------------------------
- # Base class
- # ---------------------------------------------------------------------------
-
- class DatasetAdapter:
-     slug: str = ""
-     display_name: str = ""
-     has_ground_truth: bool = False
-     has_tasks: bool = False
-
-     def problem_count(self) -> int:
-         raise NotImplementedError
-
-     def get_problem_summary(self, idx: int) -> dict[str, Any]:
-         raise NotImplementedError
-
-     def get_problem_detail(self, idx: int) -> dict[str, Any]:
-         raise NotImplementedError
-
-     def get_ground_truth(self, idx: int, input_idx: int) -> dict[str, Any]:
-         return {"status": "unavailable", "message": "Ground truth not available for this dataset"}
-
-
  # ---------------------------------------------------------------------------
  # REval adapter (HuggingFace: JetBrains-Research/REval)
  # ---------------------------------------------------------------------------

  def _format_typed_value(val: dict) -> str:
      """Convert a {__type__, __value__} dict from REval states into a Python repr string."""
      t = val.get("__type__")
@@ -85,11 +44,9 @@ class REvalAdapter(DatasetAdapter):

      def __init__(self, problems_ds, tasks_ds, executions_ds, states_ds):
          self._problems = problems_ds
-         # Build task lookup: task_id → parsed tasks JSON
          self._tasks: dict[str, list] = {}
          for row in tasks_ds:
              self._tasks[row["task_id"]] = json.loads(row["tasks"])
-         # Build execution lookup: (task_id, input_idx) → row
          self._executions: dict[tuple[str, int], dict] = {}
          for row in executions_ds:
              self._executions[(row["task_id"], row["input_idx"])] = {
@@ -97,7 +54,6 @@ class REvalAdapter(DatasetAdapter):
                  "trace": row["trace"],
                  "coverage": row["coverage"],
              }
-         # Build states lookup: (task_id, input_idx) → parsed states JSON
          self._states: dict[tuple[str, int], list] = {}
          for row in states_ds:
              self._states[(row["task_id"], row["input_idx"])] = json.loads(row["states"])
@@ -154,7 +110,7 @@ class REvalAdapter(DatasetAdapter):
          for item in adjusted_items:
              if "lineno" in item:
                  task_lines.add(item["lineno"])
-         task_info["task_lines"] = sorted(list(task_lines))

          tasks_info.append(task_info)

@@ -195,11 +151,9 @@ class REvalAdapter(DatasetAdapter):
          code = problem["code"]
          offset = _code_offset(code)

-         # Coverage: convert 0-indexed (original) → 1-indexed (stripped display)
          coverage_1indexed = [ln + 1 - offset for ln in exec_rec["coverage"]]
          total_lines = len(code[offset:].splitlines())

-         # Get task items for this input_idx
          task_list = self._tasks.get(task_id, [])
          task_items = []
          for t in task_list:
@@ -207,15 +161,12 @@ class REvalAdapter(DatasetAdapter):
              task_items = t.get("task", [])
              break

-         # Get states for this (task_id, input_idx)
          states_list = self._states.get((task_id, input_idx), [])

-         # Resolve variable answers for each task item
          variable_answers = []
          for item in task_items:
-             lineno = item["lineno"]  # 1-indexed relative to original code
              var = item["var"]
-             # Collect all values of this variable at this line across the trace
              values = []
              for s in states_list:
                  if s["lineno"] == lineno and var in s.get("locals", {}):
@@ -226,7 +177,6 @@ class REvalAdapter(DatasetAdapter):
              elif len(values) == 1:
                  answer_str = _format_typed_value(values[0])
              else:
-                 # Deduplicate by formatted string to avoid showing identical values
                  seen = []
                  for v in values:
                      fmt = _format_typed_value(v)
@@ -234,13 +184,14 @@ class REvalAdapter(DatasetAdapter):
                          seen.append(fmt)
                  answer_str = "[" + ", ".join(seen) + "]" if len(seen) > 1 else seen[0]

-             variable_answers.append({
-                 "lineno": lineno - offset,
-                 "var": var,
-                 "answer_str": answer_str,
-             })

-         # Resolve next lines from trace for arrow visualization
          trace = exec_rec["trace"]
          next_lines_answers = []
          processed_linenos: set[int] = set()
@@ -253,10 +204,12 @@ class REvalAdapter(DatasetAdapter):
              for i, ln in enumerate(trace):
                  if ln == lineno and i + 1 < len(trace):
                      nexts.add(trace[i + 1])
-             next_lines_answers.append({
-                 "lineno": lineno,
-                 "next_lines": sorted(nexts) if nexts else [-1],
-             })

          return {
              "status": "ok",
@@ -267,72 +220,11 @@ class REvalAdapter(DatasetAdapter):
          }

- # ---------------------------------------------------------------------------
- # CRUXEval adapter (HuggingFace: cruxeval-org/cruxeval)
- # ---------------------------------------------------------------------------
-
- class CRUXEvalAdapter(DatasetAdapter):
-     slug = "cruxeval"
-     display_name = "CRUXEval"
-     has_ground_truth = False
-     has_tasks = True
-
-     def __init__(self, hf_dataset):
-         self._ds = hf_dataset
-
-     def problem_count(self) -> int:
-         return len(self._ds)
-
-     def get_problem_summary(self, idx: int) -> dict[str, Any]:
-         row = self._ds[idx]
-         return {
-             "idx": idx,
-             "task_id": row["id"],
-             "entry_point": "f",
-             "num_inputs": 1,
-             "source": "CRUXEval",
-         }
-
-     def get_problem_detail(self, idx: int) -> dict[str, Any]:
-         row = self._ds[idx]
-         code = row["code"]
-         return {
-             "idx": idx,
-             "task_id": row["id"],
-             "entry_point": "f",
-             "code": code,
-             "highlighted_code": _highlight_code(code),
-             "inputs": [row["input"]],
-             "outputs": [row["output"]],
-             "test": None,
-             "tasks": [
-                 {
-                     "name": "Output Prediction",
-                     "description": "Given the code and input, predict the output.",
-                     "given": "input",
-                     "predict": "output",
-                     "input": row["input"],
-                     "output": row["output"],
-                 },
-                 {
-                     "name": "Input Prediction",
-                     "description": "Given the code and output, predict the input.",
-                     "given": "output",
-                     "predict": "input",
-                     "input": row["input"],
-                     "output": row["output"],
-                 },
-             ],
-             "source": "CRUXEval",
-             "has_ground_truth": False,
-             "has_tasks": True,
-         }
-
-
  # ---------------------------------------------------------------------------
  # HumanEval+ adapter (HuggingFace: evalplus/humanevalplus)
  # ---------------------------------------------------------------------------

  class HumanEvalPlusAdapter(DatasetAdapter):
      slug = "humanevalplus"
      display_name = "HumanEval+"
@@ -378,6 +270,7 @@ class HumanEvalPlusAdapter(DatasetAdapter):
  # BigOBench adapter (HuggingFace: facebook/BigOBench)
  # ---------------------------------------------------------------------------

  class BigOBenchAdapter(DatasetAdapter):
      slug = "bigobench"
      display_name = "BigOBench"
@@ -404,13 +297,15 @@ class BigOBenchAdapter(DatasetAdapter):
          prob = self._problems[idx]
          solutions = []
          for sol in prob["solutions"]:
-             solutions.append({
-                 "solution_id": sol["solution_id"],
-                 "code": sol["solution_code"],
-                 "highlighted_code": _highlight_code(sol["solution_code"]),
-                 "time_complexity": sol.get("time_complexity"),
-                 "space_complexity": sol.get("space_complexity"),
-             })
          return {
              "idx": idx,
              "task_id": prob["problem_id"],
@@ -429,16 +324,9 @@ class BigOBenchAdapter(DatasetAdapter):
          }


- def _merge_bigobench(ds_time, ds_space) -> list[dict[str, Any]]:
-     """Merge time and space complexity test sets by problem_id.
-
-     Groups all solutions under their parent problem. Solutions that appear
-     in both test sets get both complexity labels; otherwise the missing one
-     is None. Returns a list of problem dicts sorted by problem_id.
-     """
-     # First, collect solutions keyed by (problem_id, solution_id)
      solutions: dict[tuple[str, str], dict[str, Any]] = {}
-     # Track problem-level metadata
      problem_meta: dict[str, dict[str, str]] = {}

      for row in ds_time:
@@ -456,10 +344,13 @@ def _merge_bigobench(ds_time, ds_space) -> list[dict[str, Any]]:

      for row in ds_space:
          pid, sid = row["problem_id"], row["solution_id"]
-         problem_meta.setdefault(pid, {
-             "problem_name": row["problem_name"],
-             "description": row["description"],
-         })
          key = (pid, sid)
          if key in solutions:
              solutions[key]["space_complexity"] = row["space_complexity_inferred"]
@@ -471,8 +362,6 @@ def _merge_bigobench(ds_time, ds_space) -> list[dict[str, Any]]:
              "space_complexity": row["space_complexity_inferred"],
          }

-     # Group solutions by problem_id
-     from collections import defaultdict
      by_problem: dict[str, list[dict[str, Any]]] = defaultdict(list)
      for (pid, _sid), sol in solutions.items():
          by_problem[pid].append(sol)
@@ -480,58 +369,537 @@ def _merge_bigobench(ds_time, ds_space) -> list[dict[str, Any]]:
      problems = []
      for pid in sorted(by_problem.keys()):
          meta = problem_meta[pid]
-         problems.append({
-             "problem_id": pid,
-             "problem_name": meta["problem_name"],
-             "description": meta["description"],
-             "solutions": by_problem[pid],
-         })

      return problems


  # ---------------------------------------------------------------------------
- # Registration helpers
  # ---------------------------------------------------------------------------

- def register_hf_datasets() -> None:
-     """Load all HuggingFace datasets."""
-     from datasets import load_dataset
-
-     try:
-         problems = load_dataset("JetBrains-Research/REval", "problems", split="test")
-         tasks = load_dataset("JetBrains-Research/REval", "tasks", split="test")
-         executions = load_dataset("JetBrains-Research/REval", "executions", split="test")
-         states = load_dataset("JetBrains-Research/REval", "states", split="test")
-         REGISTRY["reval"] = REvalAdapter(problems, tasks, executions, states)
-         print(f"Loaded REval: {len(problems)} problems")
-     except Exception as e:
-         print(f"Warning: could not load REval: {e}")
-
-     try:
-         crux = load_dataset("cruxeval-org/cruxeval", split="test")
-         REGISTRY["cruxeval"] = CRUXEvalAdapter(crux)
-         print(f"Loaded CRUXEval: {len(crux)} problems")
-     except Exception as e:
-         print(f"Warning: could not load CRUXEval: {e}")
-
-     try:
-         heplus = load_dataset("evalplus/humanevalplus", split="test")
-         REGISTRY["humanevalplus"] = HumanEvalPlusAdapter(heplus)
-         print(f"Loaded HumanEval+: {len(heplus)} problems")
-     except Exception as e:
-         print(f"Warning: could not load HumanEval+: {e}")
-
-     try:
-         ds_time = load_dataset(
-             "facebook/BigOBench", "time_complexity_test_set.jsonl", split="train"
-         )
-         ds_space = load_dataset(
-             "facebook/BigOBench", "space_complexity_test_set.jsonl", split="train"
-         )
-         merged = _merge_bigobench(ds_time, ds_space)
-         REGISTRY["bigobench"] = BigOBenchAdapter(merged)
-         print(f"Loaded BigOBench: {len(merged)} problems "
-               f"({len(ds_time)} time + {len(ds_space)} space)")
-     except Exception as e:
-         print(f"Warning: could not load BigOBench: {e}")
+ """Code generation benchmark adapters."""

  from __future__ import annotations

  import json
+ from collections import defaultdict
  from typing import Any

+ from adapters import DatasetAdapter
+
+ # Injected at runtime by _set_helpers()
  _highlight_code = None
  _code_offset = None
  _extract_test_classes = None


  # ---------------------------------------------------------------------------
  # REval adapter (HuggingFace: JetBrains-Research/REval)
  # ---------------------------------------------------------------------------

+
  def _format_typed_value(val: dict) -> str:
      """Convert a {__type__, __value__} dict from REval states into a Python repr string."""
      t = val.get("__type__")

      def __init__(self, problems_ds, tasks_ds, executions_ds, states_ds):
          self._problems = problems_ds
          self._tasks: dict[str, list] = {}
          for row in tasks_ds:
              self._tasks[row["task_id"]] = json.loads(row["tasks"])
          self._executions: dict[tuple[str, int], dict] = {}
          for row in executions_ds:
              self._executions[(row["task_id"], row["input_idx"])] = {
                  "trace": row["trace"],
                  "coverage": row["coverage"],
              }
          self._states: dict[tuple[str, int], list] = {}
          for row in states_ds:
              self._states[(row["task_id"], row["input_idx"])] = json.loads(row["states"])

          for item in adjusted_items:
              if "lineno" in item:
                  task_lines.add(item["lineno"])
+         task_info["task_lines"] = sorted(task_lines)

          tasks_info.append(task_info)

          code = problem["code"]
          offset = _code_offset(code)

          coverage_1indexed = [ln + 1 - offset for ln in exec_rec["coverage"]]
          total_lines = len(code[offset:].splitlines())

          task_list = self._tasks.get(task_id, [])
          task_items = []
          for t in task_list:
              task_items = t.get("task", [])
              break

          states_list = self._states.get((task_id, input_idx), [])

          variable_answers = []
          for item in task_items:
+             lineno = item["lineno"]
              var = item["var"]
              values = []
              for s in states_list:
                  if s["lineno"] == lineno and var in s.get("locals", {}):

              elif len(values) == 1:
                  answer_str = _format_typed_value(values[0])
              else:
                  seen = []
                  for v in values:
                      fmt = _format_typed_value(v)
                          seen.append(fmt)
                  answer_str = "[" + ", ".join(seen) + "]" if len(seen) > 1 else seen[0]

+             variable_answers.append(
+                 {
+                     "lineno": lineno - offset,
+                     "var": var,
+                     "answer_str": answer_str,
+                 }
+             )
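The deduplication step above collapses repeated variable values by their formatted representation while preserving first-seen order. A standalone sketch (with `repr` standing in for the adapter's `_format_typed_value` helper, and assuming a non-empty value list, since the adapter handles the empty case in an earlier branch):

```python
def dedupe_answer(values, format_value=repr):
    """Collapse values that format identically, preserving order of first appearance."""
    seen = []
    for v in values:
        fmt = format_value(v)
        if fmt not in seen:
            seen.append(fmt)
    # Multiple distinct values are shown as a bracketed list, a single one as-is.
    return "[" + ", ".join(seen) + "]" if len(seen) > 1 else seen[0]

print(dedupe_answer([1, 1, 2]))  # [1, 2]
print(dedupe_answer([5, 5]))     # 5
```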
          trace = exec_rec["trace"]
          next_lines_answers = []
          processed_linenos: set[int] = set()

              for i, ln in enumerate(trace):
                  if ln == lineno and i + 1 < len(trace):
                      nexts.add(trace[i + 1])
+             next_lines_answers.append(
+                 {
+                     "lineno": lineno,
+                     "next_lines": sorted(nexts) if nexts else [-1],
+                 }
+             )

          return {
              "status": "ok",
          }


  # ---------------------------------------------------------------------------
  # HumanEval+ adapter (HuggingFace: evalplus/humanevalplus)
  # ---------------------------------------------------------------------------

+
  class HumanEvalPlusAdapter(DatasetAdapter):
      slug = "humanevalplus"
      display_name = "HumanEval+"

  # BigOBench adapter (HuggingFace: facebook/BigOBench)
  # ---------------------------------------------------------------------------

+
  class BigOBenchAdapter(DatasetAdapter):
      slug = "bigobench"
      display_name = "BigOBench"

          prob = self._problems[idx]
          solutions = []
          for sol in prob["solutions"]:
+             solutions.append(
+                 {
+                     "solution_id": sol["solution_id"],
+                     "code": sol["solution_code"],
+                     "highlighted_code": _highlight_code(sol["solution_code"]),
+                     "time_complexity": sol.get("time_complexity"),
+                     "space_complexity": sol.get("space_complexity"),
+                 }
+             )
          return {
              "idx": idx,
              "task_id": prob["problem_id"],
          }


+ def merge_bigobench(ds_time, ds_space) -> list[dict[str, Any]]:
+     """Merge time and space complexity test sets by problem_id."""
      solutions: dict[tuple[str, str], dict[str, Any]] = {}
      problem_meta: dict[str, dict[str, str]] = {}

      for row in ds_time:

      for row in ds_space:
          pid, sid = row["problem_id"], row["solution_id"]
+         problem_meta.setdefault(
+             pid,
+             {
+                 "problem_name": row["problem_name"],
+                 "description": row["description"],
+             },
+         )
          key = (pid, sid)
          if key in solutions:
              solutions[key]["space_complexity"] = row["space_complexity_inferred"]
              "space_complexity": row["space_complexity_inferred"],
          }

      by_problem: dict[str, list[dict[str, Any]]] = defaultdict(list)
      for (pid, _sid), sol in solutions.items():
          by_problem[pid].append(sol)

      problems = []
      for pid in sorted(by_problem.keys()):
          meta = problem_meta[pid]
+         problems.append(
+             {
+                 "problem_id": pid,
+                 "problem_name": meta["problem_name"],
+                 "description": meta["description"],
+                 "solutions": by_problem[pid],
+             }
+         )

      return problems
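The merge keyed on `(problem_id, solution_id)` can be exercised with toy rows. This is a simplified sketch, not the adapter's function: the row field names (`time`, `space`) and the returned dict shape are illustrative stand-ins for the BigOBench columns, and a solution seen in only one test set keeps `None` for the missing complexity label:

```python
from collections import defaultdict

def merge_by_problem(time_rows, space_rows):
    # Key solutions by (problem_id, solution_id) so a solution present in
    # both test sets ends up with both complexity labels.
    solutions = {}
    for row in time_rows:
        solutions[(row["problem_id"], row["solution_id"])] = {
            "solution_id": row["solution_id"],
            "time_complexity": row["time"],
            "space_complexity": None,
        }
    for row in space_rows:
        key = (row["problem_id"], row["solution_id"])
        if key in solutions:
            solutions[key]["space_complexity"] = row["space"]
        else:
            solutions[key] = {
                "solution_id": row["solution_id"],
                "time_complexity": None,
                "space_complexity": row["space"],
            }
    # Group back under the parent problem, sorted by problem_id.
    by_problem = defaultdict(list)
    for (pid, _sid), sol in solutions.items():
        by_problem[pid].append(sol)
    return {pid: by_problem[pid] for pid in sorted(by_problem)}

merged = merge_by_problem(
    [{"problem_id": "p1", "solution_id": "s1", "time": "O(n)"}],
    [{"problem_id": "p1", "solution_id": "s1", "space": "O(1)"},
     {"problem_id": "p1", "solution_id": "s2", "space": "O(n)"}],
)
print(len(merged["p1"]))  # 2
```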

  # ---------------------------------------------------------------------------
+ # MBPP+ adapter (HuggingFace: evalplus/mbppplus)
  # ---------------------------------------------------------------------------

+
+ class MBPPPlusAdapter(DatasetAdapter):
+     slug = "mbppplus"
+     display_name = "MBPP+"
+     has_ground_truth = False
+     has_tasks = False
+
+     def __init__(self, hf_dataset):
+         self._ds = hf_dataset
+
+     def problem_count(self) -> int:
+         return len(self._ds)
+
+     def get_problem_summary(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         return {
+             "idx": idx,
+             "task_id": str(row["task_id"]),
+             "entry_point": row["prompt"][:60].replace("\n", " ").strip(),
+             "num_inputs": len(row["test_list"]),
+             "source": "MBPP+",
+         }
+
+     def get_problem_detail(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         code = row["code"]
+         return {
+             "idx": idx,
+             "task_id": str(row["task_id"]),
+             "entry_point": row["prompt"][:60].replace("\n", " ").strip(),
+             "code": code,
+             "highlighted_code": _highlight_code(code),
+             "inputs": [],
+             "outputs": [],
+             "test": "\n".join(row["test_list"]),
+             "tasks": [],
+             "source": "MBPP+",
+             "has_ground_truth": False,
+             "has_tasks": False,
+             "description": row["prompt"],
+         }
+
+
+ # ---------------------------------------------------------------------------
+ # ClassEval adapter (HuggingFace: FudanSELab/ClassEval)
+ # ---------------------------------------------------------------------------
+
+
+ class ClassEvalAdapter(DatasetAdapter):
+     slug = "classeval"
+     display_name = "ClassEval"
+     has_ground_truth = False
+     has_tasks = False
+
+     def __init__(self, hf_dataset):
+         self._ds = hf_dataset
+
+     def problem_count(self) -> int:
+         return len(self._ds)
+
+     def get_problem_summary(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         return {
+             "idx": idx,
+             "task_id": row["task_id"],
+             "entry_point": row["class_name"],
+             "num_inputs": len(row["methods_info"]),
+             "source": "ClassEval",
+         }
+
+     def get_problem_detail(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         code = row["solution_code"]
+         return {
+             "idx": idx,
+             "task_id": row["task_id"],
+             "entry_point": row["class_name"],
+             "code": code,
+             "highlighted_code": _highlight_code(code),
+             "inputs": [],
+             "outputs": [],
+             "test": row["test"],
+             "tasks": [],
+             "source": "ClassEval",
+             "has_ground_truth": False,
+             "has_tasks": False,
+             "description": row["class_description"],
+             "skeleton": row["skeleton"],
+         }
+
+
+ # ---------------------------------------------------------------------------
+ # LiveCodeBench adapter (HuggingFace: livecodebench/code_generation_lite)
+ # ---------------------------------------------------------------------------
+
+
+ class LiveCodeBenchAdapter(DatasetAdapter):
+     slug = "livecodebench"
+     display_name = "LiveCodeBench"
+     has_ground_truth = False
+     has_tasks = False
+
+     def __init__(self, hf_dataset):
+         self._ds = hf_dataset
+
+     def problem_count(self) -> int:
+         return len(self._ds)
+
+     def get_problem_summary(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         return {
+             "idx": idx,
+             "task_id": row["question_id"],
+             "entry_point": row["question_title"],
+             "num_inputs": 0,
+             "source": row["platform"],
+         }
+
+     def get_problem_detail(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         test_cases = []
+         try:
+             test_cases = json.loads(row["public_test_cases"]) if row["public_test_cases"] else []
+         except (json.JSONDecodeError, TypeError):
+             pass
+
+         inputs = [tc.get("input", "") for tc in test_cases]
+         outputs = [tc.get("output", "") for tc in test_cases]
+
+         starter = row.get("starter_code", "") or ""
+         code = starter if starter.strip() else ""
+
+         return {
+             "idx": idx,
+             "task_id": row["question_id"],
+             "entry_point": row["question_title"],
+             "code": code,
+             "highlighted_code": _highlight_code(code) if code else "",
526
+ "inputs": inputs,
527
+ "outputs": outputs,
528
+ "test": None,
529
+ "tasks": [],
530
+ "source": row["platform"],
531
+ "has_ground_truth": False,
532
+ "has_tasks": False,
533
+ "description": row["question_content"],
534
+ "difficulty": row.get("difficulty", ""),
535
+ "contest_date": row.get("contest_date", ""),
536
+ }
537
+
538
+
539
+ # ---------------------------------------------------------------------------
540
+ # CodeContests adapter (HuggingFace: deepmind/code_contests)
541
+ # ---------------------------------------------------------------------------
542
+
543
+ _CC_LANG_NAMES = {0: "Unknown", 1: "Python 2", 2: "C++", 3: "Python 3", 4: "Java"}
544
+
545
+
546
+ class CodeContestsAdapter(DatasetAdapter):
547
+ slug = "codecontests"
548
+ display_name = "CodeContests"
549
+ has_ground_truth = False
550
+ has_tasks = False
551
+
552
+ _DIFFICULTY_NAMES = {
553
+ 0: "Unknown",
554
+ 1: "Easy",
555
+ 2: "Medium",
556
+ 3: "Hard",
557
+ 4: "Harder",
558
+ 5: "Hardest",
559
+ 6: "External",
560
+ }
561
+ _SOURCE_NAMES = {
562
+ 0: "Unknown",
563
+ 1: "CodeChef",
564
+ 2: "Codeforces",
565
+ 3: "HackerEarth",
566
+ 4: "CodeJam",
567
+ 5: "AtCoder",
568
+ 6: "Aizu",
569
+ }
570
+
571
+ def __init__(self, hf_dataset):
572
+ self._ds = hf_dataset
573
+
574
+ def problem_count(self) -> int:
575
+ return len(self._ds)
576
+
577
+ def get_problem_summary(self, idx: int) -> dict[str, Any]:
578
+ row = self._ds[idx]
579
+ source_int = row.get("source", 0)
580
+ source_name = self._SOURCE_NAMES.get(source_int, "Unknown")
581
+ return {
582
+ "idx": idx,
583
+ "task_id": row["name"],
584
+ "entry_point": row["name"],
585
+ "num_inputs": len(row.get("public_tests", {}).get("input", [])),
586
+ "source": source_name,
587
+ }
588
+
589
+ def get_problem_detail(self, idx: int) -> dict[str, Any]:
590
+ row = self._ds[idx]
591
+ source_int = row.get("source", 0)
592
+ source_name = self._SOURCE_NAMES.get(source_int, "Unknown")
593
+ diff_int = row.get("difficulty", 0)
594
+ diff_name = self._DIFFICULTY_NAMES.get(diff_int, "Unknown")
595
+
596
+ sols_data = row.get("solutions", {})
597
+ sol_langs = sols_data.get("language", [])
598
+ sol_codes = sols_data.get("solution", [])
599
+ solutions = []
600
+ for i, code in enumerate(sol_codes[:10]):
601
+ lang_int = sol_langs[i] if i < len(sol_langs) else 0
602
+ lang_name = _CC_LANG_NAMES.get(lang_int, "Unknown")
603
+ lang_key = {1: "python", 2: "cpp", 3: "python", 4: "java"}.get(lang_int, "python")
604
+ solutions.append(
605
+ {
606
+ "solution_id": f"sol_{i}",
607
+ "code": code,
608
+ "highlighted_code": _highlight_code(code, language=lang_key),
609
+ "language": lang_name,
610
+ }
611
+ )
612
+
613
+ pub_tests = row.get("public_tests", {})
614
+ inputs = pub_tests.get("input", [])
615
+ outputs = pub_tests.get("output", [])
616
+ tags = list(row.get("cf_tags", []))
617
+
618
+ return {
619
+ "idx": idx,
620
+ "task_id": row["name"],
621
+ "entry_point": row["name"],
622
+ "code": solutions[0]["code"] if solutions else "",
623
+ "highlighted_code": solutions[0]["highlighted_code"] if solutions else "",
624
+ "inputs": inputs,
625
+ "outputs": outputs,
626
+ "test": None,
627
+ "tasks": [],
628
+ "source": source_name,
629
+ "has_ground_truth": False,
630
+ "has_tasks": False,
631
+ "description": row["description"],
632
+ "difficulty": diff_name,
633
+ "solutions": solutions,
634
+ "cf_rating": row.get("cf_rating", 0),
635
+ "tags": tags,
636
+ }
637
+
638
+
639
+ # ---------------------------------------------------------------------------
640
+ # APPS adapter (HuggingFace: codeparrot/apps)
641
+ # ---------------------------------------------------------------------------
642
+
643
+
644
+ class APPSAdapter(DatasetAdapter):
645
+ slug = "apps"
646
+ display_name = "APPS"
647
+ has_ground_truth = False
648
+ has_tasks = False
649
+
650
+ def __init__(self, hf_dataset):
651
+ self._ds = hf_dataset
652
+
653
+ def problem_count(self) -> int:
654
+ return len(self._ds)
655
+
656
+ def get_problem_summary(self, idx: int) -> dict[str, Any]:
657
+ row = self._ds[idx]
658
+ return {
659
+ "idx": idx,
660
+ "task_id": str(row["problem_id"]),
661
+ "entry_point": row["question"][:60].replace("\n", " ").strip(),
662
+ "num_inputs": 0,
663
+ "source": row.get("difficulty", "unknown"),
664
+ }
665
+
666
+ def get_problem_detail(self, idx: int) -> dict[str, Any]:
667
+ row = self._ds[idx]
668
+ solutions = []
669
+ if row.get("solutions"):
670
+ try:
671
+ sol_list = json.loads(row["solutions"])
672
+ for i, code in enumerate(sol_list[:5]):
673
+ solutions.append(
674
+ {
675
+ "solution_id": f"sol_{i}",
676
+ "code": code,
677
+ "highlighted_code": _highlight_code(code),
678
+ }
679
+ )
680
+ except (json.JSONDecodeError, TypeError):
681
+ pass
682
+
683
+ inputs, outputs = [], []
684
+ if row.get("input_output"):
685
+ try:
686
+ io = json.loads(row["input_output"])
687
+ inputs = io.get("inputs", [])
688
+ outputs = io.get("outputs", [])
689
+ except (json.JSONDecodeError, TypeError):
690
+ pass
691
+
692
+ code = solutions[0]["code"] if solutions else (row.get("starter_code") or "")
693
+ return {
694
+ "idx": idx,
695
+ "task_id": str(row["problem_id"]),
696
+ "entry_point": row["question"][:60].replace("\n", " ").strip(),
697
+ "code": code,
698
+ "highlighted_code": _highlight_code(code) if code else "",
699
+ "inputs": inputs[:5],
700
+ "outputs": outputs[:5],
701
+ "test": None,
702
+ "tasks": [],
703
+ "source": row.get("difficulty", "unknown"),
704
+ "has_ground_truth": False,
705
+ "has_tasks": False,
706
+ "description": row["question"],
707
+ "difficulty": row.get("difficulty", ""),
708
+ "solutions": solutions if len(solutions) > 1 else [],
709
+ "url": row.get("url", ""),
710
+ "starter_code": row.get("starter_code", ""),
711
+ }
712
+
713
+
714
+ # ---------------------------------------------------------------------------
715
+ # MBPP adapter (HuggingFace: google-research-datasets/mbpp)
716
+ # ---------------------------------------------------------------------------
717
+
718
+
719
+ class MBPPAdapter(DatasetAdapter):
720
+ slug = "mbpp"
721
+ display_name = "MBPP"
722
+ has_ground_truth = False
723
+ has_tasks = False
724
+
725
+ def __init__(self, hf_dataset):
726
+ self._ds = hf_dataset
727
+
728
+ def problem_count(self) -> int:
729
+ return len(self._ds)
730
+
731
+ def get_problem_summary(self, idx: int) -> dict[str, Any]:
732
+ row = self._ds[idx]
733
+ return {
734
+ "idx": idx,
735
+ "task_id": str(row["task_id"]),
736
+ "entry_point": row["text"][:60].replace("\n", " ").strip(),
737
+ "num_inputs": len(row.get("test_list", [])),
738
+ "source": "MBPP",
739
+ }
740
+
741
+ def get_problem_detail(self, idx: int) -> dict[str, Any]:
742
+ row = self._ds[idx]
743
+ code = row["code"]
744
+ test_list = row.get("test_list", [])
745
+ challenge_tests = row.get("challenge_test_list", [])
746
+ all_tests = test_list + challenge_tests
747
+ return {
748
+ "idx": idx,
749
+ "task_id": str(row["task_id"]),
750
+ "entry_point": row["text"][:60].replace("\n", " ").strip(),
751
+ "code": code,
752
+ "highlighted_code": _highlight_code(code),
753
+ "inputs": [],
754
+ "outputs": [],
755
+ "test": "\n".join(all_tests),
756
+ "tasks": [],
757
+ "source": "MBPP",
758
+ "has_ground_truth": False,
759
+ "has_tasks": False,
760
+ "description": row["text"],
761
+ }
762
+
763
+
764
+ # ---------------------------------------------------------------------------
765
+ # CodeSearchNet adapter (HuggingFace: code-search-net/code_search_net)
766
+ # ---------------------------------------------------------------------------
767
+
768
+
769
+ class CodeSearchNetAdapter(DatasetAdapter):
770
+ slug = "codesearchnet"
771
+ display_name = "CodeSearchNet"
772
+ has_ground_truth = False
773
+ has_tasks = False
774
+
775
+ def __init__(self, hf_dataset):
776
+ self._ds = hf_dataset
777
+
778
+ def problem_count(self) -> int:
779
+ return len(self._ds)
780
+
781
+ def get_problem_summary(self, idx: int) -> dict[str, Any]:
782
+ row = self._ds[idx]
783
+ return {
784
+ "idx": idx,
785
+ "task_id": row.get("func_name", str(idx)),
786
+ "entry_point": row.get("func_name", f"csn_{idx}"),
787
+ "num_inputs": 0,
788
+ "source": row.get("language", "unknown"),
789
+ }
790
+
791
+ def get_problem_detail(self, idx: int) -> dict[str, Any]:
792
+ row = self._ds[idx]
793
+ code = row.get("func_code_string", "")
794
+ lang = row.get("language", "python")
795
+ return {
796
+ "idx": idx,
797
+ "task_id": row.get("func_name", str(idx)),
798
+ "entry_point": row.get("func_name", f"csn_{idx}"),
799
+ "code": code,
800
+ "highlighted_code": _highlight_code(code, language=lang),
801
+ "inputs": [],
802
+ "outputs": [],
803
+ "test": None,
804
+ "tasks": [],
805
+ "source": lang,
806
+ "has_ground_truth": False,
807
+ "has_tasks": False,
808
+ "description": row.get("func_documentation_string", ""),
809
+ }
810
+
811
+
812
+ # ---------------------------------------------------------------------------
813
+ # BigCodeBench adapter (HuggingFace: bigcode/bigcodebench)
814
+ # ---------------------------------------------------------------------------
815
+
816
+
817
+ class BigCodeBenchAdapter(DatasetAdapter):
818
+ slug = "bigcodebench"
819
+ display_name = "BigCodeBench"
820
+ has_ground_truth = False
821
+ has_tasks = False
822
+
823
+ def __init__(self, hf_dataset):
824
+ self._ds = hf_dataset
825
+
826
+ def problem_count(self) -> int:
827
+ return len(self._ds)
828
+
829
+ def get_problem_summary(self, idx: int) -> dict[str, Any]:
830
+ row = self._ds[idx]
831
+ return {
832
+ "idx": idx,
833
+ "task_id": row["task_id"],
834
+ "entry_point": row.get("entry_point", "task_func"),
835
+ "num_inputs": 0,
836
+ "source": "BigCodeBench",
837
+ }
838
+
839
+ def get_problem_detail(self, idx: int) -> dict[str, Any]:
840
+ row = self._ds[idx]
841
+ code = row.get("code_prompt", "") + row.get("canonical_solution", "")
842
+ libs = row.get("libs", "")
843
+ return {
844
+ "idx": idx,
845
+ "task_id": row["task_id"],
846
+ "entry_point": row.get("entry_point", "task_func"),
847
+ "code": code,
848
+ "highlighted_code": _highlight_code(code),
849
+ "inputs": [],
850
+ "outputs": [],
851
+ "test": row.get("test", ""),
852
+ "tasks": [],
853
+ "source": "BigCodeBench",
854
+ "has_ground_truth": False,
855
+ "has_tasks": False,
856
+ "description": row.get("complete_prompt", ""),
857
+ "libs": libs,
858
+ }
859
+
860
+
861
+ # ---------------------------------------------------------------------------
862
+ # EffiBench adapter (HuggingFace: DONG19/EffiBench)
863
+ # ---------------------------------------------------------------------------
864
+
865
+
866
+ class EffiBenchAdapter(DatasetAdapter):
867
+ slug = "effibench"
868
+ display_name = "EffiBench"
869
+ has_ground_truth = False
870
+ has_tasks = False
871
+
872
+ def __init__(self, hf_dataset):
873
+ self._ds = hf_dataset
874
+
875
+ def problem_count(self) -> int:
876
+ return len(self._ds)
877
+
878
+ def get_problem_summary(self, idx: int) -> dict[str, Any]:
879
+ row = self._ds[idx]
880
+ return {
881
+ "idx": idx,
882
+ "task_id": str(row.get("problem_idx", idx)),
883
+ "entry_point": row.get("task_name", f"effibench_{idx}"),
884
+ "num_inputs": 0,
885
+ "source": "EffiBench",
886
+ }
887
+
888
+ def get_problem_detail(self, idx: int) -> dict[str, Any]:
889
+ row = self._ds[idx]
890
+ code = row.get("canonical_solution", "")
891
+ return {
892
+ "idx": idx,
893
+ "task_id": str(row.get("problem_idx", idx)),
894
+ "entry_point": row.get("task_name", f"effibench_{idx}"),
895
+ "code": code,
896
+ "highlighted_code": _highlight_code(code),
897
+ "inputs": [],
898
+ "outputs": [],
899
+ "test": row.get("test_case", ""),
900
+ "tasks": [],
901
+ "source": "EffiBench",
902
+ "has_ground_truth": False,
903
+ "has_tasks": False,
904
+ "description": row.get("markdown_description", row.get("description", "")),
905
+ }
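All of the adapters above expose the same three-method interface (`problem_count`, `get_problem_summary`, `get_problem_detail`). A minimal sketch of how a registry consumer might drive one; `_StubAdapter` is a hypothetical stand-in (the real `DatasetAdapter` base class and `_highlight_code` helper live elsewhere in the package):

```python
from typing import Any


class _StubAdapter:
    """Hypothetical stand-in mirroring the adapter interface used above."""

    display_name = "Stub"

    def __init__(self, rows: list[dict[str, Any]]):
        self._ds = rows

    def problem_count(self) -> int:
        return len(self._ds)

    def get_problem_summary(self, idx: int) -> dict[str, Any]:
        # Same shape as the real adapters: idx + task_id + source tag
        row = self._ds[idx]
        return {"idx": idx, "task_id": str(row["task_id"]), "source": "Stub"}


adapter = _StubAdapter([{"task_id": 1}, {"task_id": 2}])
summaries = [adapter.get_problem_summary(i) for i in range(adapter.problem_count())]
```

The dropdown/list views only ever touch `get_problem_summary`; `get_problem_detail` (with highlighting) is deferred until a single problem is opened.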
adapters/code_reasoning.py ADDED
@@ -0,0 +1,366 @@
+"""Code reasoning / completion benchmark adapters (CRUXEval, SAFIM, HumanEval-X)."""
+
+from __future__ import annotations
+
+import re
+from typing import Any
+
+from adapters import DatasetAdapter
+
+# Injected at runtime by _set_helpers()
+_highlight_code = None
+_code_offset = None
+_extract_test_classes = None
+
+
+# ---------------------------------------------------------------------------
+# CRUXEval adapter (HuggingFace: cruxeval-org/cruxeval)
+# ---------------------------------------------------------------------------
+
+
+class CRUXEvalAdapter(DatasetAdapter):
+    slug = "cruxeval"
+    display_name = "CRUXEval"
+    has_ground_truth = False
+    has_tasks = True
+
+    def __init__(self, hf_dataset):
+        self._ds = hf_dataset
+
+    def problem_count(self) -> int:
+        return len(self._ds)
+
+    def get_problem_summary(self, idx: int) -> dict[str, Any]:
+        row = self._ds[idx]
+        return {
+            "idx": idx,
+            "task_id": row["id"],
+            "entry_point": "f",
+            "num_inputs": 1,
+            "source": "CRUXEval",
+        }
+
+    def get_problem_detail(self, idx: int) -> dict[str, Any]:
+        row = self._ds[idx]
+        code = row["code"]
+        return {
+            "idx": idx,
+            "task_id": row["id"],
+            "entry_point": "f",
+            "code": code,
+            "highlighted_code": _highlight_code(code),
+            "inputs": [row["input"]],
+            "outputs": [row["output"]],
+            "test": None,
+            "tasks": [
+                {
+                    "name": "Output Prediction",
+                    "description": "Given the code and input, predict the output.",
+                    "given": "input",
+                    "predict": "output",
+                    "input": row["input"],
+                    "output": row["output"],
+                },
+                {
+                    "name": "Input Prediction",
+                    "description": "Given the code and output, predict the input.",
+                    "given": "output",
+                    "predict": "input",
+                    "input": row["input"],
+                    "output": row["output"],
+                },
+            ],
+            "source": "CRUXEval",
+            "has_ground_truth": False,
+            "has_tasks": True,
+        }
+
+
+# ---------------------------------------------------------------------------
+# SAFIM adapter (HuggingFace: gonglinyuan/safim)
+# ---------------------------------------------------------------------------
+
+
+class SAFIMAdapter(DatasetAdapter):
+    slug = "safim"
+    display_name = "SAFIM"
+    has_ground_truth = False
+    has_tasks = False
+
+    def __init__(self, hf_dataset):
+        self._ds = hf_dataset
+
+    def problem_count(self) -> int:
+        return len(self._ds)
+
+    def get_problem_summary(self, idx: int) -> dict[str, Any]:
+        row = self._ds[idx]
+        return {
+            "idx": idx,
+            "task_id": row.get("task_id", str(idx)),
+            "entry_point": row.get("task_id", f"safim_{idx}"),
+            "num_inputs": 0,
+            "source": row.get("lang", "unknown"),
+        }
+
+    # Patterns that mark where the completion should be inserted
+    _HOLE_MARKERS = [
+        "{{completion}}",
+        "/* TODO: Your code here */",
+        "// TODO: Your code here",
+        "# TODO: Your code here",
+    ]
+
+    def _find_hole_marker(self, prompt: str) -> str | None:
+        """Return the first matching hole marker found in the prompt, or None."""
+        for marker in self._HOLE_MARKERS:
+            if marker in prompt:
+                return marker
+        return None
+
+    def get_problem_detail(self, idx: int) -> dict[str, Any]:
+        row = self._ds[idx]
+        prompt = row.get("prompt", "")
+        ground_truth = row.get("ground_truth", "")
+        lang = row.get("lang", "python")
+
+        marker = self._find_hole_marker(prompt)
+
+        if marker:
+            display_code = prompt.replace(marker, "/* [HOLE] */")
+            before_hole = prompt.split(marker)[0]
+            merged_code = prompt.replace(marker, ground_truth)
+        else:
+            display_code = prompt + "\n/* [HOLE] */\n"
+            before_hole = prompt + "\n"
+            merged_code = prompt + "\n" + ground_truth + "\n"
+
+        # Compute 1-indexed line range of the inserted ground truth
+        gt_start_line = before_hole.count("\n") + 1
+        gt_line_count = ground_truth.count("\n") + (1 if ground_truth else 0)
+        gt_end_line = gt_start_line + gt_line_count - 1
+
+        lang_key = {"Python": "python", "Java": "java", "C++": "cpp", "C#": "csharp"}.get(
+            lang, lang.lower()
+        )
+
+        return {
+            "idx": idx,
+            "task_id": row.get("task_id", str(idx)),
+            "entry_point": row.get("task_id", f"safim_{idx}"),
+            "code": display_code,
+            "highlighted_code": _highlight_code(display_code, language=lang_key),
+            "inputs": [],
+            "outputs": [],
+            "test": None,
+            "tasks": [],
+            "source": lang,
+            "has_ground_truth": False,
+            "has_tasks": False,
+            "fim_prefix": prompt,
+            "fim_ground_truth": ground_truth,
+            "fim_ground_truth_highlighted": _highlight_code(ground_truth, language=lang_key),
+            "fim_merged_code": merged_code,
+            "fim_merged_highlighted": _highlight_code(
+                merged_code,
+                highlight_lines=list(range(gt_start_line, gt_end_line + 1)),
+                language=lang_key,
+            ),
+            "fim_gt_start_line": gt_start_line,
+            "fim_gt_end_line": gt_end_line,
+            "language": lang,
+        }
+
+
+# ---------------------------------------------------------------------------
+# Shared helpers for HumanEval-X / HumanEvalPack
+# ---------------------------------------------------------------------------
+
+
+def _extract_func_name(declaration: str) -> str:
+    """Extract the function/method name from a code declaration string."""
+    m = re.search(r"def\s+(\w+)\s*\(", declaration)
+    if m:
+        return m.group(1)
+    m = re.search(r"(\w+)\s*\(", declaration)
+    if m:
+        return m.group(1)
+    return ""
+
+
+# ---------------------------------------------------------------------------
+# HumanEvalPack adapter (HuggingFace: bigcode/humanevalpack)
+# ---------------------------------------------------------------------------
+
+
+class HumanEvalPackAdapter(DatasetAdapter):
+    slug = "humanevalpack"
+    display_name = "HumanEvalPack"
+    has_ground_truth = False
+    has_tasks = False
+
+    LANGUAGES = ["python", "js", "cpp", "go", "java", "rust"]
+
+    def __init__(self, datasets_by_lang: dict[str, Any]):
+        self._by_lang = datasets_by_lang
+        first_lang = next(iter(self._by_lang))
+        self._count = len(self._by_lang[first_lang])
+
+    def problem_count(self) -> int:
+        return self._count
+
+    def get_problem_summary(self, idx: int) -> dict[str, Any]:
+        first_lang = next(iter(self._by_lang))
+        row = self._by_lang[first_lang][idx]
+        return {
+            "idx": idx,
+            "task_id": row["task_id"],
+            "entry_point": row.get("entry_point", f"problem_{idx}"),
+            "num_inputs": len(self._by_lang),
+            "source": "HumanEvalPack",
+        }
+
+    def get_problem_detail(self, idx: int) -> dict[str, Any]:
+        first_lang = next(iter(self._by_lang))
+        row = self._by_lang[first_lang][idx]
+
+        lang_labels = {
+            "python": "Python",
+            "js": "JavaScript",
+            "cpp": "C++",
+            "go": "Go",
+            "java": "Java",
+            "rust": "Rust",
+        }
+        lang_pygments = {
+            "python": "python",
+            "js": "javascript",
+            "cpp": "cpp",
+            "go": "go",
+            "java": "java",
+            "rust": "rust",
+        }
+
+        lang_solutions = []
+        for lang in self.LANGUAGES:
+            if lang not in self._by_lang:
+                continue
+            lrow = self._by_lang[lang][idx]
+            canonical = lrow.get("prompt", "") + lrow.get("canonical_solution", "")
+            buggy = lrow.get("prompt", "") + lrow.get("buggy_solution", "")
+            lang_key = lang_pygments.get(lang, lang)
+            lang_solutions.append(
+                {
+                    "language": lang,
+                    "language_label": lang_labels.get(lang, lang),
+                    "code": canonical,
+                    "highlighted_code": _highlight_code(canonical, language=lang_key),
+                    "buggy_code": buggy,
+                    "buggy_highlighted_code": _highlight_code(buggy, language=lang_key),
+                    "test": lrow.get("test", ""),
+                    "example_test": lrow.get("example_test", ""),
+                    "bug_type": lrow.get("bug_type", ""),
+                    "failure_symptoms": lrow.get("failure_symptoms", ""),
+                }
+            )
+
+        py_row = self._by_lang.get("python", self._by_lang[first_lang])[idx]
+        default_code = py_row.get("prompt", "") + py_row.get("canonical_solution", "")
+
+        return {
+            "idx": idx,
+            "task_id": row["task_id"],
+            "entry_point": row.get("entry_point", f"problem_{idx}"),
+            "code": default_code,
+            "highlighted_code": _highlight_code(default_code),
+            "inputs": [],
+            "outputs": [],
+            "test": py_row.get("test", ""),
+            "tasks": [],
+            "source": "HumanEvalPack",
+            "has_ground_truth": False,
+            "has_tasks": False,
+            "description": row.get("instruction", row.get("docstring", "")),
+            "lang_solutions": lang_solutions,
+            "bug_type": py_row.get("bug_type", ""),
+            "failure_symptoms": py_row.get("failure_symptoms", ""),
+        }
+
+
+# ---------------------------------------------------------------------------
+# HumanEval-X adapter (HuggingFace: THUDM/humaneval-x)
+# ---------------------------------------------------------------------------
+
+
+class HumanEvalXAdapter(DatasetAdapter):
+    slug = "humanevalx"
+    display_name = "HumanEval-X"
+    has_ground_truth = False
+    has_tasks = False
+
+    LANGUAGES = ["python", "cpp", "java", "go", "js"]
+
+    def __init__(self, datasets_by_lang: dict[str, Any]):
+        """datasets_by_lang maps language name -> HF dataset split."""
+        self._by_lang = datasets_by_lang
+        first_lang = next(iter(self._by_lang))
+        self._count = len(self._by_lang[first_lang])
+
+    def problem_count(self) -> int:
+        return self._count
+
+    def get_problem_summary(self, idx: int) -> dict[str, Any]:
+        first_lang = next(iter(self._by_lang))
+        row = self._by_lang[first_lang][idx]
+        task_id = row["task_id"].split("/")[-1]
+        decl = row.get("declaration", row.get("prompt", ""))
+        entry = _extract_func_name(decl) or f"problem_{task_id}"
+        return {
+            "idx": idx,
+            "task_id": f"HumanEval/{task_id}",
+            "entry_point": entry,
+            "num_inputs": len(self._by_lang),
+            "source": "HumanEval-X",
+        }
+
+    def get_problem_detail(self, idx: int) -> dict[str, Any]:
+        first_lang = next(iter(self._by_lang))
+        row = self._by_lang[first_lang][idx]
+        task_id = row["task_id"].split("/")[-1]
+        decl = row.get("declaration", row.get("prompt", ""))
+        entry = _extract_func_name(decl) or f"problem_{task_id}"
+
+        lang_solutions = []
+        for lang in self.LANGUAGES:
+            if lang not in self._by_lang:
+                continue
+            lrow = self._by_lang[lang][idx]
+            code = lrow["prompt"] + lrow["canonical_solution"]
+            lang_solutions.append(
+                {
+                    "language": lang,
+                    "code": code,
+                    "highlighted_code": _highlight_code(code, language=lang),
+                    "test": lrow.get("test", ""),
+                    "example_test": lrow.get("example_test", ""),
+                }
+            )
+
+        py_row = self._by_lang.get("python", self._by_lang[first_lang])[idx]
+        default_code = py_row["prompt"] + py_row["canonical_solution"]
+
+        return {
+            "idx": idx,
+            "task_id": f"HumanEval/{task_id}",
+            "entry_point": entry,
+            "code": default_code,
+            "highlighted_code": _highlight_code(default_code),
+            "inputs": [],
+            "outputs": [],
+            "test": py_row.get("test", ""),
+            "tasks": [],
+            "source": "HumanEval-X",
+            "has_ground_truth": False,
+            "has_tasks": False,
+            "lang_solutions": lang_solutions,
+        }
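The 1-indexed line-range arithmetic used by `SAFIMAdapter.get_problem_detail` when splicing the ground truth into the hole can be exercised on its own. This is a standalone sketch of that computation, not the adapter itself; `fim_merge` is a hypothetical helper name:

```python
def fim_merge(prompt: str, marker: str, ground_truth: str) -> tuple[str, int, int]:
    """Replace the hole marker and return (merged_code, gt_start_line, gt_end_line).

    Line numbers are 1-indexed, matching the highlight_lines convention above.
    """
    before_hole = prompt.split(marker)[0]
    merged = prompt.replace(marker, ground_truth)
    # The ground truth starts on the line after the newlines preceding the hole
    gt_start = before_hole.count("\n") + 1
    gt_lines = ground_truth.count("\n") + (1 if ground_truth else 0)
    return merged, gt_start, gt_start + gt_lines - 1


merged, start, end = fim_merge("a\nb\n{{completion}}\nc", "{{completion}}", "x\ny")
# The two ground-truth lines "x" and "y" land on lines 3-4 of the merged code
```

This mirrors why the adapter highlights `range(gt_start_line, gt_end_line + 1)` in the merged view.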
adapters/registration.py ADDED
@@ -0,0 +1,410 @@
+"""Dataset registration — loads all HuggingFace datasets into the adapter registry."""
+
+from __future__ import annotations
+
+import json
+import random
+from typing import Any
+
+from adapters import REGISTRY
+from adapters.code_editing import (
+    CanItEditAdapter,
+    CodeEditorBenchAdapter,
+    CodeXGLUERefinementAdapter,
+    CommitBenchAdapter,
+    DebugBenchAdapter,
+    SWEBenchFullAdapter,
+    SWEBenchLiteAdapter,
+    SWEBenchVerifiedAdapter,
+)
+from adapters.code_generation import (
+    APPSAdapter,
+    BigCodeBenchAdapter,
+    BigOBenchAdapter,
+    ClassEvalAdapter,
+    CodeContestsAdapter,
+    CodeSearchNetAdapter,
+    EffiBenchAdapter,
+    HumanEvalPlusAdapter,
+    LiveCodeBenchAdapter,
+    MBPPAdapter,
+    MBPPPlusAdapter,
+    REvalAdapter,
+    merge_bigobench,
+)
+from adapters.code_reasoning import (
+    CRUXEvalAdapter,
+    HumanEvalPackAdapter,
+    HumanEvalXAdapter,
+    SAFIMAdapter,
+)
+from adapters.vulnerability import (
+    BigVulAdapter,
+    DevignAdapter,
+    DiverseVulAdapter,
+    PrimeVulAdapter,
+)
+
+# ---------------------------------------------------------------------------
+# Sampling: cap large datasets at MAX_DISPLAY_SAMPLES for fast browsing
+# ---------------------------------------------------------------------------
+
+MAX_DISPLAY_SAMPLES = 1000
+_SAMPLE_SEED = 42
+
+
+def _sample_indices(total: int) -> list[int]:
+    """Return a sorted list of up to MAX_DISPLAY_SAMPLES random indices."""
+    if total <= MAX_DISPLAY_SAMPLES:
+        return list(range(total))
+    rng = random.Random(_SAMPLE_SEED)
+    return sorted(rng.sample(range(total), MAX_DISPLAY_SAMPLES))
+
+
+def _sample_hf_dataset(ds):
+    """Return a HuggingFace dataset (or subset) with at most MAX_DISPLAY_SAMPLES rows."""
+    if len(ds) <= MAX_DISPLAY_SAMPLES:
+        return ds
+    indices = _sample_indices(len(ds))
+    return ds.select(indices)
+
+
+def _sample_list(rows: list) -> list:
+    """Return a list with at most MAX_DISPLAY_SAMPLES items."""
+    if len(rows) <= MAX_DISPLAY_SAMPLES:
+        return rows
+    indices = _sample_indices(len(rows))
+    return [rows[i] for i in indices]
+
+
+def _load_jsonl_dataset(repo_id: str, filenames: list[str]) -> list[dict[str, Any]]:
+    """Download JSONL files from a HuggingFace dataset repo and return as a list of dicts.
+
+    This bypasses the ``datasets`` library when the repo uses deprecated loading scripts.
+    """
+    from huggingface_hub import hf_hub_download
+
+    rows: list[dict[str, Any]] = []
+    for fname in filenames:
+        path = hf_hub_download(repo_id, fname, repo_type="dataset")
+        with open(path) as f:
+            for line in f:
+                line = line.strip()
+                if line:
+                    rows.append(json.loads(line))
+    return rows
+
+
+def register_hf_datasets() -> None:
+    """Load all HuggingFace datasets into :data:`REGISTRY`."""
+    from datasets import load_dataset
+
+    # --- Base datasets ---
+
+    try:
+        problems = load_dataset("JetBrains-Research/REval", "problems", split="test")
+        tasks = load_dataset("JetBrains-Research/REval", "tasks", split="test")
+        executions = load_dataset("JetBrains-Research/REval", "executions", split="test")
+        states = load_dataset("JetBrains-Research/REval", "states", split="test")
+        REGISTRY["reval"] = REvalAdapter(problems, tasks, executions, states)
+        print(f"Loaded REval: {len(problems)} problems")
+    except Exception as e:
+        print(f"Warning: could not load REval: {e}")
+
+    try:
+        crux = load_dataset("cruxeval-org/cruxeval", split="test")
+        REGISTRY["cruxeval"] = CRUXEvalAdapter(crux)
+        print(f"Loaded CRUXEval: {len(crux)} problems")
+    except Exception as e:
+        print(f"Warning: could not load CRUXEval: {e}")
+
+    try:
+        heplus = load_dataset("evalplus/humanevalplus", split="test")
+        REGISTRY["humanevalplus"] = HumanEvalPlusAdapter(heplus)
+        print(f"Loaded HumanEval+: {len(heplus)} problems")
+    except Exception as e:
+        print(f"Warning: could not load HumanEval+: {e}")
+
+    try:
+        ds_time = load_dataset(
+            "facebook/BigOBench", "time_complexity_test_set.jsonl", split="train"
+        )
+        ds_space = load_dataset(
+            "facebook/BigOBench", "space_complexity_test_set.jsonl", split="train"
+        )
+        merged = merge_bigobench(ds_time, ds_space)
+        REGISTRY["bigobench"] = BigOBenchAdapter(merged)
+        print(
+            f"Loaded BigOBench: {len(merged)} problems "
+            f"({len(ds_time)} time + {len(ds_space)} space)"
+        )
+    except Exception as e:
+        print(f"Warning: could not load BigOBench: {e}")
+
+    # --- Batch 1 datasets ---
+
+    try:
+        mbppplus = load_dataset("evalplus/mbppplus", split="test")
+        REGISTRY["mbppplus"] = MBPPPlusAdapter(mbppplus)
+        print(f"Loaded MBPP+: {len(mbppplus)} problems")
+    except Exception as e:
+        print(f"Warning: could not load MBPP+: {e}")
+
+    try:
+        classeval = load_dataset("FudanSELab/ClassEval", split="test")
+        REGISTRY["classeval"] = ClassEvalAdapter(classeval)
+        print(f"Loaded ClassEval: {len(classeval)} problems")
+    except Exception as e:
+        print(f"Warning: could not load ClassEval: {e}")
+
+    try:
+        lcb = _load_jsonl_dataset(
+            "livecodebench/code_generation_lite",
+            [
+                "test.jsonl",
+                "test2.jsonl",
+                "test3.jsonl",
+                "test4.jsonl",
+                "test5.jsonl",
+                "test6.jsonl",
+            ],
+        )
+        lcb_sampled = _sample_list(lcb)
+        adapter = LiveCodeBenchAdapter(lcb_sampled)
+        adapter.total_count = len(lcb)
+        REGISTRY["livecodebench"] = adapter
+        print(f"Loaded LiveCodeBench: {len(lcb_sampled)} problems (of {len(lcb)})")
+    except Exception as e:
+        print(f"Warning: could not load LiveCodeBench: {e}")
+
+    try:
+        debugbench_full = load_dataset("Rtian/DebugBench", split="test")
+        debugbench = _sample_hf_dataset(debugbench_full)
+        adapter = DebugBenchAdapter(debugbench)
+        adapter.total_count = len(debugbench_full)
+        REGISTRY["debugbench"] = adapter
+        print(f"Loaded DebugBench: {len(debugbench)} problems (of {len(debugbench_full)})")
+    except Exception as e:
+        print(f"Warning: could not load DebugBench: {e}")
+
+    try:
+        hx_datasets = {}
+        for lang in HumanEvalXAdapter.LANGUAGES:
+            hx_datasets[lang] = _load_jsonl_dataset(
+                "THUDM/humaneval-x",
+                [f"data/{lang}/data/humaneval.jsonl"],
+            )
+        REGISTRY["humanevalx"] = HumanEvalXAdapter(hx_datasets)
198
+ print(
199
+ f"Loaded HumanEval-X: {len(hx_datasets)} languages, "
200
+ f"{len(hx_datasets[next(iter(hx_datasets))])} problems each"
201
+ )
202
+ except Exception as e:
203
+ print(f"Warning: could not load HumanEval-X: {e}")
204
+
205
+ # --- Batch 2 datasets ---
206
+
207
+ try:
208
+ swe = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")
209
+ REGISTRY["swebenchlite"] = SWEBenchLiteAdapter(swe)
210
+ print(f"Loaded SWE-bench Lite: {len(swe)} problems")
211
+ except Exception as e:
212
+ print(f"Warning: could not load SWE-bench Lite: {e}")
213
+
214
+ try:
215
+ cc = load_dataset("deepmind/code_contests", split="test")
216
+ REGISTRY["codecontests"] = CodeContestsAdapter(cc)
217
+ print(f"Loaded CodeContests: {len(cc)} problems")
218
+ except Exception as e:
219
+ print(f"Warning: could not load CodeContests: {e}")
220
+
221
+ try:
222
+ apps_full = load_dataset(
223
+ "codeparrot/apps",
224
+ "default",
225
+ split="test",
226
+ revision="refs/convert/parquet",
227
+ )
228
+ apps = _sample_hf_dataset(apps_full)
229
+ adapter = APPSAdapter(apps)
230
+ adapter.total_count = len(apps_full)
231
+ REGISTRY["apps"] = adapter
232
+ print(f"Loaded APPS: {len(apps)} problems (of {len(apps_full)})")
233
+ except Exception as e:
234
+ print(f"Warning: could not load APPS: {e}")
235
+
236
+ try:
237
+ cie = load_dataset("nuprl/CanItEdit", split="test")
238
+ REGISTRY["canitedit"] = CanItEditAdapter(cie)
239
+ print(f"Loaded CanItEdit: {len(cie)} problems")
240
+ except Exception as e:
241
+ print(f"Warning: could not load CanItEdit: {e}")
242
+
243
+ try:
244
+ mbpp = load_dataset("google-research-datasets/mbpp", "full", split="test")
245
+ REGISTRY["mbpp"] = MBPPAdapter(mbpp)
246
+ print(f"Loaded MBPP: {len(mbpp)} problems")
247
+ except Exception as e:
248
+ print(f"Warning: could not load MBPP: {e}")
249
+
250
+ # --- Batch 3 datasets ---
251
+
252
+ try:
253
+ safim_full = load_dataset("gonglinyuan/safim", "block", split="test")
254
+ safim = _sample_hf_dataset(safim_full)
255
+ adapter = SAFIMAdapter(safim)
256
+ adapter.total_count = len(safim_full)
257
+ REGISTRY["safim"] = adapter
258
+ print(f"Loaded SAFIM: {len(safim)} problems (of {len(safim_full)})")
259
+ except Exception as e:
260
+ print(f"Warning: could not load SAFIM: {e}")
261
+
262
+ try:
263
+ bigvul_full = load_dataset("bstee615/bigvul", split="test")
264
+ bigvul = _sample_hf_dataset(bigvul_full)
265
+ adapter = BigVulAdapter(bigvul)
266
+ adapter.total_count = len(bigvul_full)
267
+ REGISTRY["bigvul"] = adapter
268
+ print(f"Loaded BigVul: {len(bigvul)} problems (of {len(bigvul_full)})")
269
+ except Exception as e:
270
+ print(f"Warning: could not load BigVul: {e}")
271
+
272
+ try:
273
+ diversevul_full = load_dataset("claudios/DiverseVul", split="test")
274
+ diversevul = _sample_hf_dataset(diversevul_full)
275
+ adapter = DiverseVulAdapter(diversevul)
276
+ adapter.total_count = len(diversevul_full)
277
+ REGISTRY["diversevul"] = adapter
278
+ print(f"Loaded DiverseVul: {len(diversevul)} problems (of {len(diversevul_full)})")
279
+ except Exception as e:
280
+ print(f"Warning: could not load DiverseVul: {e}")
281
+
282
+ try:
283
+ primevul_full = load_dataset(
284
+ "json",
285
+ data_files="hf://datasets/starsofchance/PrimeVul/primevul_test.jsonl",
286
+ split="train",
287
+ )
288
+ primevul = _sample_hf_dataset(primevul_full)
289
+ adapter = PrimeVulAdapter(primevul)
290
+ adapter.total_count = len(primevul_full)
291
+ REGISTRY["primevul"] = adapter
292
+ print(f"Loaded PrimeVul: {len(primevul)} problems (of {len(primevul_full)})")
293
+ except Exception as e:
294
+ print(f"Warning: could not load PrimeVul: {e}")
295
+
296
+ try:
297
+ ceb_rows: list[dict[str, Any]] = []
298
+ ceb_files = [
299
+ ("code_debug", ["code_debug_primary.jsonl", "code_debug_plus.jsonl"]),
300
+ ("code_translate", ["code_translate_primary.jsonl", "code_translate_plus.jsonl"]),
301
+ ("code_polishment", ["code_polishment_primary.jsonl", "code_polishment_plus.jsonl"]),
302
+ ("code_switch", ["code_switch_primary.jsonl", "code_switch_plus.jsonl"]),
303
+ ]
304
+ for task_type, filenames in ceb_files:
305
+ try:
306
+ rows = _load_jsonl_dataset("m-a-p/CodeEditorBench", filenames)
307
+ for d in rows:
308
+ d["_task_type"] = task_type
309
+ if "difficulty" in d:
310
+ d["difficulty"] = str(d["difficulty"])
311
+ ceb_rows.extend(rows)
312
+ except Exception:
313
+ pass # skip task types that fail
314
+ if ceb_rows:
315
+ ceb_sampled = _sample_list(ceb_rows)
316
+ adapter = CodeEditorBenchAdapter(ceb_sampled)
317
+ adapter.total_count = len(ceb_rows)
318
+ REGISTRY["codeeditorbench"] = adapter
319
+ print(f"Loaded CodeEditorBench: {len(ceb_sampled)} problems (of {len(ceb_rows)})")
320
+ else:
321
+ print("Warning: could not load any CodeEditorBench task types")
322
+ except Exception as e:
323
+ print(f"Warning: could not load CodeEditorBench: {e}")
324
+
325
+ # --- Batch 4 datasets ---
326
+
327
+ try:
328
+ swe_v = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
329
+ REGISTRY["swebenchverified"] = SWEBenchVerifiedAdapter(swe_v)
330
+ print(f"Loaded SWE-bench Verified: {len(swe_v)} problems")
331
+ except Exception as e:
332
+ print(f"Warning: could not load SWE-bench Verified: {e}")
333
+
334
+ try:
335
+ csn_full = load_dataset("code-search-net/code_search_net", "python", split="test")
336
+ csn = _sample_hf_dataset(csn_full)
337
+ adapter = CodeSearchNetAdapter(csn)
338
+ adapter.total_count = len(csn_full)
339
+ REGISTRY["codesearchnet"] = adapter
340
+ print(f"Loaded CodeSearchNet: {len(csn)} problems (of {len(csn_full)})")
341
+ except Exception as e:
342
+ print(f"Warning: could not load CodeSearchNet: {e}")
343
+
344
+ try:
345
+ devign_full = load_dataset("google/code_x_glue_cc_defect_detection", split="test")
346
+ devign = _sample_hf_dataset(devign_full)
347
+ adapter = DevignAdapter(devign)
348
+ adapter.total_count = len(devign_full)
349
+ REGISTRY["devign"] = adapter
350
+ print(f"Loaded Devign: {len(devign)} problems (of {len(devign_full)})")
351
+ except Exception as e:
352
+ print(f"Warning: could not load Devign: {e}")
353
+
354
+ # --- Batch 5 datasets ---
355
+
356
+ try:
357
+ bcb = load_dataset("bigcode/bigcodebench", split="v0.1.4")
358
+ REGISTRY["bigcodebench"] = BigCodeBenchAdapter(bcb)
359
+ print(f"Loaded BigCodeBench: {len(bcb)} problems")
360
+ except Exception as e:
361
+ print(f"Warning: could not load BigCodeBench: {e}")
362
+
363
+ try:
364
+ hep_datasets = {}
365
+ for lang in HumanEvalPackAdapter.LANGUAGES:
366
+ hep_datasets[lang] = load_dataset("bigcode/humanevalpack", lang, split="test")
367
+ REGISTRY["humanevalpack"] = HumanEvalPackAdapter(hep_datasets)
368
+ print(
369
+ f"Loaded HumanEvalPack: {len(hep_datasets)} languages, "
370
+ f"{len(hep_datasets[next(iter(hep_datasets))])} problems each"
371
+ )
372
+ except Exception as e:
373
+ print(f"Warning: could not load HumanEvalPack: {e}")
374
+
375
+ try:
376
+ cxr_full = load_dataset("google/code_x_glue_cc_code_refinement", "medium", split="test")
377
+ cxr = _sample_hf_dataset(cxr_full)
378
+ adapter = CodeXGLUERefinementAdapter(cxr)
379
+ adapter.total_count = len(cxr_full)
380
+ REGISTRY["codexgluerefinement"] = adapter
381
+ print(f"Loaded CodeXGLUE Code Refinement: {len(cxr)} problems (of {len(cxr_full)})")
382
+ except Exception as e:
383
+ print(f"Warning: could not load CodeXGLUE Code Refinement: {e}")
384
+
385
+ try:
386
+ swe_full_ds = load_dataset("princeton-nlp/SWE-bench", split="test")
387
+ swe_full = _sample_hf_dataset(swe_full_ds)
388
+ adapter = SWEBenchFullAdapter(swe_full)
389
+ adapter.total_count = len(swe_full_ds)
390
+ REGISTRY["swebenchfull"] = adapter
391
+ print(f"Loaded SWE-bench: {len(swe_full)} problems (of {len(swe_full_ds)})")
392
+ except Exception as e:
393
+ print(f"Warning: could not load SWE-bench: {e}")
394
+
395
+ try:
396
+ cb_full = load_dataset("Maxscha/commitbench", split="test")
397
+ cb = _sample_hf_dataset(cb_full)
398
+ adapter = CommitBenchAdapter(cb)
399
+ adapter.total_count = len(cb_full)
400
+ REGISTRY["commitbench"] = adapter
401
+ print(f"Loaded CommitBench: {len(cb)} problems (of {len(cb_full)})")
402
+ except Exception as e:
403
+ print(f"Warning: could not load CommitBench: {e}")
404
+
405
+ try:
406
+ effibench = load_dataset("DONG19/EffiBench", split="train")
407
+ REGISTRY["effibench"] = EffiBenchAdapter(effibench)
408
+ print(f"Loaded EffiBench: {len(effibench)} problems")
409
+ except Exception as e:
410
+ print(f"Warning: could not load EffiBench: {e}")
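The `_sample_list` and `_sample_hf_dataset` helpers called throughout the function above are defined elsewhere in the module and not shown in this hunk. A minimal standalone sketch of the deterministic sampling described in the commit notes ("seed=42, cap=1000"; the name `sample_rows` and both constants are assumptions here, not the module's actual code) could look like:

```python
import random

SAMPLE_SEED = 42    # assumed from the commit notes ("seed=42")
SAMPLE_CAP = 1000   # assumed from the commit notes ("cap=1000")


def sample_rows(rows, cap=SAMPLE_CAP, seed=SAMPLE_SEED):
    """Deterministically sample at most `cap` rows, preserving original order.

    Small datasets pass through untouched; large ones get the same subset on
    every process restart because the RNG is seeded with a fixed value.
    """
    if len(rows) <= cap:
        return list(rows)
    rng = random.Random(seed)
    # Sample indices, then sort so the subset keeps the dataset's ordering.
    idxs = sorted(rng.sample(range(len(rows)), cap))
    return [rows[i] for i in idxs]
```

Sorting the sampled indices is what lets the `total_count` dropdown label ("1000 of 33050") coexist with a stable, order-preserving problem list.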
adapters/vulnerability.py ADDED
@@ -0,0 +1,245 @@
+ """Vulnerability detection benchmark adapters (BigVul, DiverseVul, PrimeVul, Devign)."""
+
+ from __future__ import annotations
+
+ from typing import Any
+
+ from adapters import DatasetAdapter
+
+ # Injected at runtime by _set_helpers()
+ _highlight_code = None
+ _code_offset = None
+ _extract_test_classes = None
+
+
+ # ---------------------------------------------------------------------------
+ # BigVul adapter (HuggingFace: bstee615/bigvul)
+ # ---------------------------------------------------------------------------
+
+
+ class BigVulAdapter(DatasetAdapter):
+     slug = "bigvul"
+     display_name = "BigVul"
+     has_ground_truth = False
+     has_tasks = False
+
+     def __init__(self, hf_dataset):
+         self._ds = hf_dataset
+
+     def problem_count(self) -> int:
+         return len(self._ds)
+
+     def get_problem_summary(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         return {
+             "idx": idx,
+             "task_id": row.get("CVE_ID", str(idx)),
+             "entry_point": row.get("CVE_ID", f"bigvul_{idx}"),
+             "num_inputs": 0,
+             "source": row.get("CWE_ID", "unknown"),
+         }
+
+     def get_problem_detail(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         vuln_code = row.get("func_before", "")
+         fixed_code = row.get("func_after", "")
+         lang = row.get("lang", "c")
+         lang_key = {"C": "c", "Java": "java", "PHP": "php"}.get(lang, "c")
+         return {
+             "idx": idx,
+             "task_id": row.get("CVE_ID", str(idx)),
+             "entry_point": row.get("CVE_ID", f"bigvul_{idx}"),
+             "code": fixed_code,
+             "highlighted_code": _highlight_code(fixed_code, language=lang_key),
+             "inputs": [],
+             "outputs": [],
+             "test": None,
+             "tasks": [],
+             "source": row.get("CWE_ID", "unknown"),
+             "has_ground_truth": False,
+             "has_tasks": False,
+             "description": row.get("commit_message", ""),
+             "vulnerable_code": vuln_code,
+             "vulnerable_highlighted_code": _highlight_code(vuln_code, language=lang_key),
+             "patched_code": fixed_code,
+             "patched_highlighted_code": _highlight_code(fixed_code, language=lang_key),
+             "cwe_id": row.get("CWE_ID", ""),
+             "cve_id": row.get("CVE_ID", ""),
+             "project": row.get("project", ""),
+             "language": lang,
+             "is_vulnerable": bool(row.get("vul", 0)),
+         }
+
+
+ # ---------------------------------------------------------------------------
+ # DiverseVul adapter (HuggingFace: claudios/DiverseVul)
+ # ---------------------------------------------------------------------------
+
+
+ class DiverseVulAdapter(DatasetAdapter):
+     slug = "diversevul"
+     display_name = "DiverseVul"
+     has_ground_truth = False
+     has_tasks = False
+
+     def __init__(self, hf_dataset):
+         self._ds = hf_dataset
+
+     def problem_count(self) -> int:
+         return len(self._ds)
+
+     def get_problem_summary(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         cwe_list = row.get("cwe", [])
+         cwe_label = cwe_list[0] if cwe_list else "unknown"
+         label = "Vulnerable" if row.get("target", 0) == 1 else "Patched"
+         return {
+             "idx": idx,
+             "task_id": row.get("commit_id", str(idx))[:12],
+             "entry_point": row.get("project", f"diversevul_{idx}"),
+             "num_inputs": 0,
+             "source": f"{label}/{cwe_label}",
+         }
+
+     def get_problem_detail(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         code = row.get("func", "")
+         cwe_list = list(row.get("cwe", []))
+         is_vuln = row.get("target", 0) == 1
+         return {
+             "idx": idx,
+             "task_id": row.get("commit_id", str(idx))[:12],
+             "entry_point": row.get("project", f"diversevul_{idx}"),
+             "code": code,
+             "highlighted_code": _highlight_code(code, language="c"),
+             "inputs": [],
+             "outputs": [],
+             "test": None,
+             "tasks": [],
+             "source": "Vulnerable" if is_vuln else "Patched",
+             "has_ground_truth": False,
+             "has_tasks": False,
+             "description": row.get("message", ""),
+             "vulnerable_code": code if is_vuln else "",
+             "vulnerable_highlighted_code": _highlight_code(code, language="c") if is_vuln else "",
+             "patched_code": code if not is_vuln else "",
+             "patched_highlighted_code": (
+                 _highlight_code(code, language="c") if not is_vuln else ""
+             ),
+             "cwe_id": ", ".join(cwe_list) if cwe_list else "",
+             "project": row.get("project", ""),
+             "language": "C/C++",
+             "is_vulnerable": is_vuln,
+         }
+
+
+ # ---------------------------------------------------------------------------
+ # PrimeVul adapter (HuggingFace: starsofchance/PrimeVul)
+ # ---------------------------------------------------------------------------
+
+
+ class PrimeVulAdapter(DatasetAdapter):
+     slug = "primevul"
+     display_name = "PrimeVul"
+     has_ground_truth = False
+     has_tasks = False
+
+     def __init__(self, hf_dataset):
+         self._ds = hf_dataset
+
+     def problem_count(self) -> int:
+         return len(self._ds)
+
+     def get_problem_summary(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         label = "Vulnerable" if row.get("target", 0) == 1 else "Patched"
+         return {
+             "idx": idx,
+             "task_id": row.get("commit_id", str(idx))[:12],
+             "entry_point": row.get("project", f"primevul_{idx}"),
+             "num_inputs": 0,
+             "source": label,
+         }
+
+     def get_problem_detail(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         code = row.get("func", "")
+         is_vuln = row.get("target", 0) == 1
+         cwe_list = list(row.get("cwe", []))
+         return {
+             "idx": idx,
+             "task_id": row.get("commit_id", str(idx))[:12],
+             "entry_point": row.get("project", f"primevul_{idx}"),
+             "code": code,
+             "highlighted_code": _highlight_code(code, language="c"),
+             "inputs": [],
+             "outputs": [],
+             "test": None,
+             "tasks": [],
+             "source": "Vulnerable" if is_vuln else "Patched",
+             "has_ground_truth": False,
+             "has_tasks": False,
+             "description": row.get("commit_message", ""),
+             "vulnerable_code": code if is_vuln else "",
+             "vulnerable_highlighted_code": _highlight_code(code, language="c") if is_vuln else "",
+             "patched_code": code if not is_vuln else "",
+             "patched_highlighted_code": (
+                 _highlight_code(code, language="c") if not is_vuln else ""
+             ),
+             "cwe_id": ", ".join(cwe_list) if cwe_list else "",
+             "project": row.get("project", ""),
+             "language": "C/C++",
+             "is_vulnerable": is_vuln,
+         }
+
+
+ # ---------------------------------------------------------------------------
+ # Devign adapter (HuggingFace: google/code_x_glue_cc_defect_detection)
+ # ---------------------------------------------------------------------------
+
+
+ class DevignAdapter(DatasetAdapter):
+     slug = "devign"
+     display_name = "Devign"
+     has_ground_truth = False
+     has_tasks = False
+
+     def __init__(self, hf_dataset):
+         self._ds = hf_dataset
+
+     def problem_count(self) -> int:
+         return len(self._ds)
+
+     def get_problem_summary(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         label = "Vulnerable" if row.get("target", 0) == 1 else "Clean"
+         return {
+             "idx": idx,
+             "task_id": str(row.get("commit_id", idx))[:12],
+             "entry_point": row.get("project", f"devign_{idx}"),
+             "num_inputs": 0,
+             "source": label,
+         }
+
+     def get_problem_detail(self, idx: int) -> dict[str, Any]:
+         row = self._ds[idx]
+         code = row.get("func", "")
+         is_vuln = row.get("target", 0) == 1
+         return {
+             "idx": idx,
+             "task_id": str(row.get("commit_id", idx))[:12],
+             "entry_point": row.get("project", f"devign_{idx}"),
+             "code": code,
+             "highlighted_code": _highlight_code(code, language="c"),
+             "inputs": [],
+             "outputs": [],
+             "test": None,
+             "tasks": [],
+             "source": "Vulnerable" if is_vuln else "Clean",
+             "has_ground_truth": False,
+             "has_tasks": False,
+             "description": row.get("commit_message", ""),
+             "is_vulnerable": is_vuln,
+             "project": row.get("project", ""),
+             "language": "C",
+         }
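All four adapters map the raw `target` / `cwe` row fields onto the same badge strings shown in the problem list. That shared rule can be restated as a tiny standalone helper (`vuln_summary_label` is a hypothetical name introduced here for illustration, not part of the module):

```python
def vuln_summary_label(row: dict, clean_label: str = "Patched") -> str:
    """Render the list-view badge for a vulnerability-detection row.

    target == 1 means the function is the vulnerable version; anything else
    is the fixed/clean version. Devign would pass clean_label="Clean".
    The first CWE (when present) is appended, as DiverseVul does.
    """
    label = "Vulnerable" if row.get("target", 0) == 1 else clean_label
    cwe_list = list(row.get("cwe", []))
    cwe = cwe_list[0] if cwe_list else "unknown"
    return f"{label}/{cwe}"
```

This is only a restatement for readability; the adapters above inline the same two-branch logic per dataset.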
app.py CHANGED
@@ -12,7 +12,7 @@ import os
  from flask import Flask, jsonify, render_template, request
  from pygments import highlight
  from pygments.formatters import HtmlFormatter
- from pygments.lexers import PythonLexer
+ from pygments.lexers import PythonLexer, get_lexer_by_name

  app = Flask(__name__)

@@ -34,14 +34,16 @@ def _extract_test_classes(test_code: str, cls_name: str) -> list:
      lines = test_code.splitlines(keepends=True)
      prefix = f"{cls_name}Test"
      result = []
      for node in tree.body:  # top-level definitions, preserves source order
          if isinstance(node, _ast.ClassDef) and node.name.startswith(prefix):
              start = node.lineno - 1  # ast lineno is 1-indexed
              end = node.end_lineno  # end_lineno is inclusive; slice is exclusive
-             result.append({
-                 "name": node.name,
-                 "code": "".join(lines[start:end]),
-             })
+             result.append(
+                 {
+                     "name": node.name,
+                     "code": "".join(lines[start:end]),
+                 }
+             )
      return result

@@ -49,20 +51,22 @@ def _code_offset(code: str) -> int:
      """Number of leading newlines that Pygments will strip."""
      offset = 0
      for ch in code:
-         if ch == '\n':
+         if ch == "\n":
              offset += 1
          else:
              break
      return offset


- def highlight_code(code, highlight_lines=None):
+ def highlight_code(code, highlight_lines=None, language="python"):
      """
-     Syntax highlight Python code with optional line highlighting.
+     Syntax highlight code with optional line highlighting.

      Args:
-         code: The Python code to highlight
+         code: The source code to highlight
          highlight_lines: List of line numbers (1-indexed) to highlight
+         language: Programming language name (default: "python"); any alias
+             accepted by get_lexer_by_name, otherwise Python is assumed.

      Returns:
          HTML string with syntax highlighted code
@@ -70,7 +74,11 @@ def highlight_code(code, highlight_lines=None):
      formatter = HtmlFormatter(
          linenos="table", cssclass="source", hl_lines=highlight_lines or [], linenostart=1
      )
-     return highlight(code, PythonLexer(), formatter)
+     try:
+         lexer = get_lexer_by_name(language.lower())
+     except Exception:
+         lexer = PythonLexer()
+     return highlight(code, lexer, formatter)


  def get_css():
@@ -82,7 +90,7 @@ def get_css():
  # Dataset adapter registration
  # ---------------------------------------------------------------------------

- from dataset_adapters import REGISTRY, _set_helpers, register_hf_datasets
+ from adapters import REGISTRY, _set_helpers, register_hf_datasets  # noqa: E402

  # Inject helper functions into the adapters module (avoids circular imports)
  _set_helpers(highlight_code, _code_offset, _extract_test_classes)
@@ -100,6 +108,7 @@ def _get_adapter(dataset_slug: str):
  # Routes
  # ---------------------------------------------------------------------------

+
  @app.route("/")
  def index():
      """Main page showing list of all benchmark problems."""
@@ -109,15 +118,18 @@ def index():
  @app.route("/api/datasets")
  def get_datasets():
      """Return list of available datasets for the UI dataset selector."""
-     return jsonify([
-         {
-             "slug": slug,
-             "display_name": adapter.display_name,
-             "problem_count": adapter.problem_count(),
-             "has_ground_truth": adapter.has_ground_truth,
-         }
-         for slug, adapter in REGISTRY.items()
-     ])
+     return jsonify(
+         [
+             {
+                 "slug": slug,
+                 "display_name": adapter.display_name,
+                 "problem_count": adapter.problem_count(),
+                 "total_count": adapter.total_count,
+                 "has_ground_truth": adapter.has_ground_truth,
+             }
+             for slug, adapter in REGISTRY.items()
+         ]
+     )


  @app.route("/api/<dataset_slug>/problems")
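The `highlight_code` change above swaps the hard-coded `PythonLexer()` for `get_lexer_by_name(language.lower())`, catching any lookup failure and falling back to Python. The same degrade-gracefully pattern can be shown without a pygments dependency (`resolve_lexer_name` and the toy `KNOWN_LEXERS` set are hypothetical; the real app relies on pygments' own alias registry):

```python
# Toy stand-in for pygments' alias registry, for illustration only.
KNOWN_LEXERS = frozenset({"python", "c", "cpp", "java", "php", "go", "javascript"})


def resolve_lexer_name(language):
    """Lower-case the requested name; unknown or missing names degrade to "python"."""
    name = (language or "python").lower()
    return name if name in KNOWN_LEXERS else "python"
```

Falling back rather than raising matters here because adapters pass through dataset-supplied language tags, which are not guaranteed to be valid pygments aliases.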
benchmarks_analysis.csv ADDED
@@ -0,0 +1,38 @@
+ benchmark,category,year,size,languages,hf_dataset_id,data_access,visualization_complexity,influence,priority_score,batch,notes
+ MBPP+,Code Generation,2023,399,Python,evalplus/mbppplus,easy,simple,high,9,1,Natural companion to HumanEval+; same EvalPlus ecosystem
+ ClassEval,Code Generation,2023,100 classes (410 methods),Python,FudanSELab/ClassEval,easy,moderate,high,9,1,Class-level code generation with test classes
+ LiveCodeBench,Code Generation,2024,1055+,Python,livecodebench/code_generation_lite,easy,moderate,high,9,1,Continuously updated; contamination-free; high community interest
+ DebugBench,Code Editing/Debugging,2024,4253,"C++, Java, Python",Rtian/DebugBench,easy,moderate,high,8,1,Buggy code with implanted bugs; 4 categories; 18 minor types
+ HumanEval-X,Code Translation,2022,820 (164x5),"Python, C++, Java, JS, Go",THUDM/humaneval-x,easy,moderate,high,8,1,Same 164 problems in 5 languages with test cases
+ SWE-bench Lite,Code Editing,2024,300,Python,princeton-nlp/SWE-bench_Lite,easy,complex,very high,8,2,GitHub issue resolution; extremely high-profile
+ CodeContests,Code Generation,2022,13328,"C++, Python, Java",deepmind/code_contests,easy,moderate,high,8,2,AlphaCode benchmark; competitive programming
+ APPS,Code Generation,2021,10000,Python,codeparrot/apps,easy,moderate,high,7,2,Large-scale coding problems at 3 difficulty levels
+ CanItEdit,Code Editing,2023,105,Python,nuprl/CanItEdit,easy,simple,medium,7,2,Before/after code editing with dual instruction types
+ MBPP,Code Generation,2021,974,Python,google-research-datasets/mbpp,easy,simple,high,7,2,Original MBPP; foundational benchmark
+ DS-1000,Code Generation,2023,1000,Python,xlangai/DS-1000,easy,moderate,high,7,3,Data science library-specific problems (NumPy/Pandas/etc.)
+ CodeEditorBench,Code Editing,2024,7961,Multiple,m-a-p/CodeEditorBench,easy,moderate,medium,7,3,4 editing scenarios: debug/translate/polish/requirement switch
+ SAFIM,Code Completion,2024,17720,"Python, Java, C++, C#",gonglinyuan/safim,easy,moderate,medium,7,3,Syntax-aware fill-in-the-middle; 3 subtasks
+ BigVul,Vulnerability Detection,2020,190000,C/C++,bstee615/bigvul,easy,moderate,medium,6,3,CVE-linked vulnerability detection; 91 CWE types
+ RepoBench,Code Completion,2023,10000+,"Python, Java",tianyang/repobench-c,easy,complex,medium,6,3,Repo-level code completion with 3 sub-tasks
+ MultiPL-E,Code Generation/Translation,2023,HumanEval+MBPP in 22 langs,22 languages,nuprl/MultiPL-E,easy,moderate,medium,6,4,Translations of HumanEval/MBPP to 22 languages
+ DiverseVul,Vulnerability Detection,2023,350000+,C/C++,claudios/DiverseVul,easy,simple,medium,6,4,Large-scale vulnerability detection; 150 CWEs
+ PrimeVul,Vulnerability Detection,2024,236000+,C/C++,starsofchance/PrimeVul,easy,simple,medium,6,4,Highest quality labels for vuln detection
+ McEval,Code Generation,2024,16000,40 languages,Multilingual-Multimodal-NLP/McEval,easy,complex,medium,6,4,Massive language coverage
+ CodeSearchNet,Code Search/Summarization,2019,2000000,"Python, JS, Ruby, Go, Java, PHP",code-search-net/code_search_net,easy,moderate,medium,6,4,Foundational code search benchmark
+ xCodeEval,Multi-task,2023,25000000,11-17 languages,NTU-NLP-sg/xCodeEval,easy,very complex,medium,5,5,7 tasks; very large; complex format
+ Devign,Vulnerability Detection,2019,20756,C,google/code_x_glue_cc_defect_detection,easy,simple,medium,5,5,Function-level vulnerability identification
+ CrossVul,Vulnerability Detection,2021,9313,40+ languages,hitoshura25/crossvul,easy,simple,medium,5,5,Cross-language vulnerability detection
+ SWE-bench Verified,Code Editing,2024,500,Python,princeton-nlp/SWE-bench_Verified,easy,complex,high,5,5,Curated subset of SWE-bench
+ CoderEval,Code Generation,2023,460,"Python, Java",N/A (GitHub only),medium,complex,medium,4,deferred,Requires project-level context
+ NaturalCodeBench,Code Generation,2024,402,"Python, Java",N/A (GitHub only),medium,moderate,medium,4,deferred,Only dev set released (140 problems)
+ DevEval,Code Generation,2024,1874,Python,N/A (GitHub only),medium,complex,medium,4,deferred,Repository-level; complex dependencies
+ RunBugRun,Program Repair,2023,450000+,9 languages,N/A (GitHub/SQLite),hard,complex,medium,3,deferred,SQLite format; complex infrastructure
+ Defects4J,Program Repair,2014,854,Java,N/A (GitHub only),hard,very complex,high,3,deferred,Requires Java tooling; full project repos
+ ConDefects,Program Repair,2023,2879,"Java, Python",N/A (GitHub only),medium,moderate,medium,3,deferred,AtCoder buggy/fixed pairs
+ FixEval,Program Repair,2023,varies,"Python, Java",N/A (GitHub only),medium,moderate,low,3,deferred,Competitive programming fixes
+ TransCoder,Code Translation,2020,852,"Java, Python, C++",N/A (GitHub only),medium,moderate,medium,3,deferred,Facebook Research; unsupervised translation
+ AVATAR,Code Translation,2021,9515,"Java, Python",N/A (GitHub only),medium,moderate,low,3,deferred,Parallel Java-Python corpus
+ TypeEvalPy,Type Inference,2023,154,Python,N/A (GitHub only),medium,moderate,low,3,deferred,Niche; type inference evaluation
+ VJBench,Vulnerability Repair,2023,42,Java,N/A (GitHub only),hard,complex,low,2,deferred,Very small; requires Java tooling
+ SVEN,Vulnerability Detection,2023,1606,C/C++,N/A (GitHub only),medium,moderate,low,2,deferred,Small; security hardening focus
+ PyTER,Type Error Repair,2022,93,Python,N/A (Figshare),hard,complex,low,2,deferred,Very small; niche
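The CSV's `batch` column drives the rollout order described in the commit message (batches 1-5 plus deferred entries). A stdlib-only sketch of tallying it (`batch_counts` and the three-row `SAMPLE` excerpt are illustrative, not part of the repo):

```python
import csv
import io
from collections import Counter

# Tiny excerpt with a subset of the real header; the full file has 37 data rows.
SAMPLE = """benchmark,category,batch
MBPP+,Code Generation,1
SWE-bench Lite,Code Editing,2
Defects4J,Program Repair,deferred
"""


def batch_counts(csv_text: str) -> Counter:
    """Count how many benchmarks fall into each rollout batch."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["batch"] for row in reader)
```

Run against the full `benchmarks_analysis.csv`, this kind of tally is what separates the 28 shipped datasets from the deferred ones.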
static/problem.css ADDED
@@ -0,0 +1,587 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
+ .problem-header {
+ display: flex;
+ justify-content: space-between;
+ align-items: center;
+ margin-bottom: 15px;
+ }
+
+ .problem-meta {
+ margin-bottom: 20px;
+ }
+
+ .meta-item {
+ display: inline-block;
+ margin-right: 15px;
+ margin-bottom: 10px;
+ }
+
+ .meta-label {
+ font-weight: 600;
+ color: #7f8c8d;
+ margin-right: 5px;
+ }
+
+ .meta-value {
+ color: #2c3e50;
+ }
+
+ .task-selector {
+ margin: 20px 0;
+ display: flex;
+ gap: 10px;
+ flex-wrap: wrap;
+ }
+
+ .task-btn {
+ padding: 10px 20px;
+ background: #ecf0f1;
+ border: 2px solid transparent;
+ border-radius: 4px;
+ cursor: pointer;
+ transition: all 0.3s;
+ font-size: 0.95rem;
+ }
+
+ .task-btn:hover {
+ background: #bdc3c7;
+ }
+
+ .task-btn.active {
+ background: #3498db;
+ color: white;
+ border-color: #2980b9;
+ }
+
+ .task-details {
+ margin-top: 20px;
+ }
+
+ .task-section {
+ margin-bottom: 25px;
+ padding: 15px;
+ background: #f8f9fa;
+ border-left: 4px solid #3498db;
+ border-radius: 4px;
+ }
+
+ .task-section h3 {
+ margin-bottom: 10px;
+ color: #2c3e50;
+ font-size: 1.1rem;
+ }
+
+ .code-block {
+ background: #f8f9fa;
+ padding: 15px;
+ border-radius: 4px;
+ overflow-x: auto;
+ font-family: 'Monaco', 'Menlo', 'Ubuntu Mono', monospace;
+ font-size: 0.9rem;
+ border: 1px solid #e1e4e8;
+ }
+
+ .task-items-list {
+ list-style: none;
+ }
+
+ .task-items-list li {
+ padding: 10px;
+ margin-bottom: 8px;
+ background: white;
+ border-radius: 4px;
+ border: 1px solid #e1e4e8;
+ }
+
+ .line-ref {
+ display: inline-block;
+ padding: 2px 8px;
+ background: #3498db;
+ color: white;
+ border-radius: 3px;
+ font-family: monospace;
+ font-size: 0.85rem;
+ margin-right: 8px;
+ }
+
+ .var-name {
+ display: inline-block;
+ padding: 2px 8px;
+ background: #9b59b6;
+ color: white;
+ border-radius: 3px;
+ font-family: monospace;
+ font-size: 0.85rem;
+ }
+
+ .io-section {
+ display: grid;
+ grid-template-columns: 1fr 1fr;
+ gap: 15px;
+ }
+
+ @media (max-width: 768px) {
+ .io-section {
+ grid-template-columns: 1fr;
+ }
+ }
+
+ .navigation-hint {
+ margin-top: 20px;
+ padding: 15px;
+ background: #e8f4f8;
+ border-radius: 4px;
+ color: #2c3e50;
+ font-size: 0.9rem;
+ }
+
+ .test-code-section {
+ margin-top: 20px;
+ }
+
+ /* Inline task visualization */
+ .code-with-tasks {
+ position: relative;
+ }
+
+ .task-marker {
+ display: inline-block;
+ margin-left: 10px;
+ padding: 2px 8px;
+ background: #9b59b6;
+ color: white;
+ border-radius: 3px;
+ font-size: 0.75rem;
+ font-weight: 600;
+ cursor: crosshair;
+ }
+
+ /* Coverage coloring on lineno spans */
+ td.linenos .normal.line-executed {
+ background-color: #d4edda !important;
+ color: #155724 !important;
+ }
+
+ td.linenos .normal.line-not-executed {
+ background-color: #f8d7da !important;
+ color: #721c24 !important;
+ }
+
+ /* Coverage legend */
+ .coverage-legend {
+ margin: 10px 0;
+ padding: 10px 15px;
+ background: #f8f9fa;
+ border-left: 4px solid #28a745;
+ border-radius: 4px;
+ font-size: 0.85rem;
+ display: none;
+ }
+
+ .coverage-legend-item {
+ display: inline-block;
+ margin-right: 18px;
+ }
+
+ .coverage-swatch {
+ display: inline-block;
+ width: 12px;
+ height: 12px;
+ border-radius: 2px;
+ margin-right: 4px;
+ vertical-align: middle;
+ }
+
+ /* Ground truth answer badge */
+ .gt-answer {
+ display: inline-block;
+ margin-left: 10px;
+ padding: 2px 8px;
+ background: #17a2b8;
+ color: white;
+ border-radius: 3px;
+ font-family: monospace;
+ font-size: 0.82rem;
+ font-weight: 600;
+ }
+
+ .gt-answer.loading {
+ background: #6c757d;
+ }
+
+ /* SVG arrow overlay */
+ #arrow-overlay {
+ position: absolute;
+ top: 0;
+ left: 0;
+ width: 100%;
+ height: 100%;
+ pointer-events: none;
+ overflow: visible;
+ z-index: 10;
+ }
+
+ .exec-arrow {
+ fill: none;
+ stroke: #e67e22;
+ stroke-width: 2.5;
+ stroke-dasharray: none;
+ opacity: 0.9;
+ }
+
+ .exec-arrow-head {
+ fill: #e67e22;
+ opacity: 0.9;
+ }
+
+ /* CRUXEval answer highlight */
+ .crux-answer {
+ border-left: 4px solid #17a2b8 !important;
+ background: #e8f6f8 !important;
+ }
+
+ /* Before/after diff view */
+ .diff-container {
+ display: grid;
+ grid-template-columns: 1fr 1fr;
+ gap: 20px;
+ }
+
+ @media (max-width: 1024px) {
+ .diff-container {
+ grid-template-columns: 1fr;
+ }
+ }
+
+ .diff-panel {
+ overflow-x: auto;
+ }
+
+ .diff-panel h3 {
+ margin-bottom: 10px;
+ font-size: 1.1rem;
+ }
+
+ .diff-panel h3 .diff-label-buggy {
+ color: #e74c3c;
+ }
+
+ .diff-panel h3 .diff-label-fixed {
+ color: #27ae60;
+ }
+
+ .bug-info {
+ margin-bottom: 15px;
+ padding: 12px 15px;
+ border-left: 4px solid #e74c3c;
+ background: #fdf2f2;
+ border-radius: 4px;
+ }
+
+ .bug-info .bug-category {
+ display: inline-block;
+ padding: 2px 8px;
+ background: #e74c3c;
+ color: white;
+ border-radius: 3px;
+ font-size: 0.82rem;
+ font-weight: 600;
+ margin-right: 8px;
+ }
+
+ .bug-info .bug-subtype {
+ display: inline-block;
+ padding: 2px 8px;
+ background: #c0392b;
+ color: white;
+ border-radius: 3px;
+ font-size: 0.82rem;
+ font-weight: 600;
+ }
+
+ /* Multi-language view */
+ .lang-tabs {
+ display: flex;
+ gap: 0;
+ border-bottom: 2px solid #e1e4e8;
+ margin-bottom: 0;
+ }
+
+ .lang-tab {
+ padding: 10px 20px;
+ background: #f8f9fa;
+ border: 1px solid #e1e4e8;
+ border-bottom: none;
+ cursor: pointer;
+ font-size: 0.95rem;
+ font-weight: 500;
+ transition: all 0.2s;
+ border-radius: 4px 4px 0 0;
+ margin-right: 2px;
+ }
+
+ .lang-tab:hover {
+ background: #e8f4f8;
+ }
+
+ .lang-tab.active {
+ background: white;
+ border-bottom: 2px solid white;
+ margin-bottom: -2px;
+ color: #3498db;
+ font-weight: 600;
+ }
+
+ .lang-code-panel {
+ display: none;
+ }
+
+ .lang-code-panel.active {
+ display: block;
+ }
+
+ /* BigOBench complexity display */
+ .complexity-badges {
+ display: flex;
+ gap: 20px;
+ flex-wrap: wrap;
+ }
+
+ .complexity-item {
+ display: flex;
+ align-items: center;
+ gap: 10px;
+ }
+
+ .complexity-label {
+ font-weight: 600;
+ color: #7f8c8d;
+ font-size: 0.95rem;
+ }
+
+ .complexity-value {
+ display: inline-block;
+ padding: 6px 16px;
+ background: #2c3e50;
+ color: #f1c40f;
+ border-radius: 4px;
+ font-family: 'Monaco', 'Menlo', 'Ubuntu Mono', monospace;
+ font-size: 1.1rem;
+ font-weight: 600;
+ }
+
+ /* Diff view (GitHub-style table with line numbers) */
+ .diff-view {
+ font-family: 'Monaco', 'Menlo', 'Ubuntu Mono', monospace;
+ font-size: 0.85rem;
+ line-height: 1.5;
+ overflow-x: auto;
+ border: 1px solid #e1e4e8;
+ border-radius: 4px;
+ }
+
+ .diff-table {
+ border-collapse: collapse;
+ width: 100%;
+ }
+
+ .diff-table td {
+ padding: 0 8px;
+ white-space: pre;
+ vertical-align: top;
+ }
+
+ .diff-ln {
+ width: 1%;
+ min-width: 40px;
+ color: #959da5;
+ text-align: right;
+ user-select: none;
+ font-size: 0.8rem;
+ padding: 0 6px !important;
+ border-right: 1px solid #e1e4e8;
+ }
+
+ .diff-tr-add td { background: #e6ffec; }
+ .diff-td-add { color: #24292e; }
+ .diff-tr-add .diff-ln { background: #ccffd8; color: #22863a; }
+
+ .diff-tr-del td { background: #ffebe9; }
+ .diff-td-del { color: #24292e; }
+ .diff-tr-del .diff-ln { background: #ffd7d5; color: #cb2431; }
+
+ .diff-tr-ctx td { background: white; }
+ .diff-td-ctx { color: #586069; }
+
+ .diff-tr-hunk td {
+ background: #f1f8ff;
+ color: #0366d6;
+ font-weight: 600;
+ padding: 4px 8px;
+ }
+
+ .diff-tr-header td {
+ background: #fafbfc;
+ color: #6a737d;
+ font-weight: 600;
+ padding: 4px 8px;
+ border-bottom: 1px solid #e1e4e8;
+ }
+
+ /* Diff file sections (GitHub-style per-file headers) */
+ .diff-file-section {
+ margin-bottom: 16px;
+ border: 1px solid #d0d7de;
+ border-radius: 6px;
+ overflow: hidden;
+ }
+
+ .diff-file-section .diff-view {
+ border: none;
+ border-radius: 0;
+ }
+
+ .diff-file-header {
+ display: flex;
+ justify-content: space-between;
+ align-items: center;
+ padding: 8px 12px;
+ background: #f6f8fa;
+ border-bottom: 1px solid #d0d7de;
+ font-family: 'Monaco', 'Menlo', 'Ubuntu Mono', monospace;
+ font-size: 0.85rem;
+ }
+
+ .diff-file-path {
+ color: #24292f;
+ font-weight: 600;
+ word-break: break-all;
+ }
+
+ .diff-file-stats {
+ white-space: nowrap;
+ margin-left: 12px;
+ font-size: 0.8rem;
+ }
+
+ .diff-stat-add { color: #1a7f37; font-weight: 600; }
+ .diff-stat-del { color: #cf222e; font-weight: 600; margin-left: 6px; }
+
+ /* GitHub links bar */
+ .gh-links-bar {
+ display: flex;
+ gap: 12px;
+ align-items: center;
+ flex-wrap: wrap;
+ }
+
+ .gh-link {
+ display: inline-block;
+ padding: 6px 14px;
+ background: #f6f8fa;
+ border: 1px solid #d0d7de;
+ border-radius: 6px;
+ color: #0969da;
+ text-decoration: none;
+ font-size: 0.9rem;
+ font-weight: 500;
+ transition: background 0.15s, border-color 0.15s;
+ }
+
+ .gh-link:hover {
+ background: #ddf4ff;
+ border-color: #0969da;
+ }
+
+ /* Issue / problem statement */
+ .issue-statement {
+ line-height: 1.7;
+ padding: 10px;
+ white-space: pre-wrap;
+ word-wrap: break-word;
+ max-height: 500px;
+ overflow-y: auto;
+ background: #f8f9fa;
+ border: 1px solid #e1e4e8;
+ border-radius: 4px;
+ font-size: 0.9rem;
+ }
+
+ .test-id-list {
+ list-style: none;
+ padding: 0;
+ }
+
+ .test-id-list li {
+ padding: 4px 8px;
+ margin-bottom: 4px;
+ background: #f8f9fa;
+ border-radius: 3px;
+ font-family: monospace;
+ font-size: 0.82rem;
+ border-left: 3px solid #e74c3c;
+ }
+
+ .test-id-list li.pass-to-pass {
+ border-left-color: #27ae60;
+ }
+
+ /* Fill-in-the-Middle (SAFIM) view */
+ .fim-hole-marker {
+ display: inline-block;
+ padding: 4px 16px;
+ background: #e74c3c;
+ color: white;
+ border-radius: 4px;
+ font-family: monospace;
+ font-weight: 600;
+ font-size: 0.9rem;
+ margin: 4px 0;
+ }
+
+ .fim-answer {
+ padding: 15px;
+ background: #e8f6e8;
+ border-left: 4px solid #27ae60;
+ border-radius: 4px;
+ font-family: monospace;
+ font-size: 0.9rem;
+ }
+
+ .fim-merged-legend {
+ margin: 8px 0;
+ padding: 6px 12px;
+ background: #f8f9fa;
+ border-radius: 4px;
+ font-size: 0.85rem;
+ color: #555;
+ }
+
+ /* Vulnerability view */
+ .vuln-status {
+ display: inline-block;
+ padding: 4px 12px;
+ border-radius: 4px;
+ font-size: 0.85rem;
+ font-weight: 600;
+ }
+
+ .vuln-status-vulnerable {
+ background: #e74c3c;
+ color: white;
+ }
+
+ .vuln-status-patched {
+ background: #27ae60;
+ color: white;
+ }
+
+ .cwe-badge {
+ display: inline-block;
+ padding: 4px 12px;
+ background: #2c3e50;
+ color: #e74c3c;
+ border-radius: 4px;
+ font-family: monospace;
+ font-size: 0.85rem;
+ font-weight: 600;
+ }
static/problem.js ADDED
@@ -0,0 +1,1313 @@
+ /* global problemIdx, datasetSlug, datasetName, hasGroundTruth, hasTasks */
+
+ function badgeClass(source) {
+ return 'badge-' + source.toLowerCase().replace(/[^a-z0-9]/g, '');
+ }
+
+ async function loadProblem() {
+ try {
+ const response = await fetch(`/api/${datasetSlug}/problem/${problemIdx}`);
+ const problem = await response.json();
+
+ if (problem.error) {
+ document.getElementById('problem-content').innerHTML =
+ '<div class="card"><p style="color: red;">Error: ' + problem.error + '</p></div>';
+ return;
+ }
+
+ renderProblem(problem);
+ } catch (error) {
+ document.getElementById('problem-content').innerHTML =
+ '<div class="card"><p style="color: red;">Error loading problem: ' + error.message + '</p></div>';
+ }
+ }
+
+ function renderProblem(problem) {
+ const container = document.getElementById('problem-content');
+
+ // Main problem info card (shared by all datasets)
+ let html = `
+ <div class="card">
+ <div class="problem-header">
+ <h2>${escapeHtml(problem.entry_point)}</h2>
+ <span class="badge ${badgeClass(problem.source)}">${escapeHtml(problem.source)}</span>
+ </div>
+ <div class="problem-meta">
+ <div class="meta-item">
+ <span class="meta-label">Task ID:</span>
+ <span class="meta-value">${escapeHtml(problem.task_id)}</span>
+ </div>
+ <div class="meta-item">
+ <span class="meta-label">Index:</span>
+ <span class="meta-value">${problem.idx}</span>
+ </div>
+ <div class="meta-item">
+ <span class="meta-label">Dataset:</span>
+ <span class="meta-value">${escapeHtml(datasetName)}</span>
+ </div>
+ ${problem.inputs.length > 0 ? `
+ <div class="meta-item">
+ <span class="meta-label">Test Inputs:</span>
+ <span class="meta-value">${problem.inputs.length}</span>
+ </div>` : ''}
+ </div>
+ </div>
+ `;
+
+ // --- BigOBench view (problem description + per-solution code & complexity) ---
+ if (problem.solutions && problem.solutions.length > 0) {
+ // Problem description
+ if (problem.description) {
+ html += `
+ <div class="card">
+ <h2>Problem Statement</h2>
+ <pre class="code-block" style="white-space: pre-wrap;">${escapeHtml(problem.description)}</pre>
+ </div>
+ `;
+ }
+
+ // Each solution: code + complexity/language badges
+ problem.solutions.forEach((sol, i) => {
+ html += `
+ <div class="card">
+ <h2>Solution ${i + 1} <span style="font-size:0.8rem;color:#7f8c8d;font-weight:400;">${escapeHtml(sol.solution_id)}</span></h2>
+ <div class="complexity-badges" style="margin-bottom: 15px;">
+ `;
+ if (sol.language) {
+ html += `
+ <div class="complexity-item">
+ <span class="complexity-label">Language</span>
+ <span class="badge badge-info">${escapeHtml(sol.language)}</span>
+ </div>`;
+ }
+ if (sol.time_complexity) {
+ html += `
+ <div class="complexity-item">
+ <span class="complexity-label">Time</span>
+ <span class="complexity-value">${escapeHtml(sol.time_complexity)}</span>
+ </div>`;
+ }
+ if (sol.space_complexity) {
+ html += `
+ <div class="complexity-item">
+ <span class="complexity-label">Space</span>
+ <span class="complexity-value">${escapeHtml(sol.space_complexity)}</span>
+ </div>`;
+ }
+ html += `
+ </div>
+ <div class="code-with-tasks">
+ ${sol.highlighted_code}
+ </div>
+ </div>
+ `;
+ });
+
+ // Navigation hint
+ html += `
+ <div class="navigation-hint">
+ <strong>Tip:</strong> Use the Previous/Next buttons at the top to browse through problems,
+ or return to the list view to filter by dataset source or search by name.
+ </div>
+ `;
+
+ container.innerHTML = html;
+ window.currentProblem = problem;
+ return;
+ }
+
+ // --- DebugBench before/after view (buggy → fixed) ---
+ if (problem.buggy_code !== undefined && problem.fixed_code !== undefined) {
+ // Problem description
+ if (problem.description) {
+ html += `
+ <div class="card">
+ <h2>Problem Statement</h2>
+ <pre class="code-block" style="white-space: pre-wrap;">${escapeHtml(problem.description)}</pre>
+ </div>
+ `;
+ }
+
+ // Bug info
+ html += `
+ <div class="card">
+ <h2>Bug Information</h2>
+ <div class="bug-info">
+ <span class="bug-category">${escapeHtml(problem.bug_category || '')}</span>
+ <span class="bug-subtype">${escapeHtml(problem.bug_subtype || '')}</span>
+ </div>
+ <p style="margin-top: 10px;">${escapeHtml(problem.bug_explanation || '')}</p>
+ `;
+ if (problem.difficulty) {
+ html += `<p style="margin-top: 8px; color: #7f8c8d;">Difficulty: <strong>${escapeHtml(problem.difficulty)}</strong></p>`;
+ }
+ html += `</div>`;
+
+ // Unified diff view of buggy → fixed
+ const unifiedDiff = computeUnifiedDiff(problem.buggy_code, problem.fixed_code);
+ html += `
+ <div class="card">
+ <h2>Changes</h2>
+ <div class="diff-view">${renderComputedDiff(unifiedDiff)}</div>
+ </div>
+ `;
+
+ // Side-by-side buggy/fixed code
+ html += `
+ <div class="card">
+ <h2>Full Code Comparison</h2>
+ <div class="diff-container">
+ <div class="diff-panel">
+ <h3><span class="diff-label-buggy">Before</span></h3>
+ <div class="code-with-tasks">${problem.buggy_highlighted_code}</div>
+ </div>
+ <div class="diff-panel">
+ <h3><span class="diff-label-fixed">After</span></h3>
+ <div class="code-with-tasks">${problem.fixed_highlighted_code}</div>
+ </div>
+ </div>
+ </div>
+ `;
+
+ // Examples
+ if (problem.examples && problem.examples.length > 0) {
+ html += `<div class="card"><h2>Examples</h2>`;
+ problem.examples.forEach((ex, i) => {
+ html += `<pre class="code-block" style="margin-bottom: 10px; white-space: pre-wrap;">${escapeHtml(ex)}</pre>`;
+ });
+ html += `</div>`;
+ }
+
+ html += `
+ <div class="navigation-hint">
+ <strong>Tip:</strong> Use the Previous/Next buttons at the top to browse through problems,
+ or return to the list view to filter by dataset source or search by name.
+ </div>
+ `;
+
+ container.innerHTML = html;
+ window.currentProblem = problem;
+ return;
+ }
+
+ // --- HumanEval-X / HumanEvalPack multi-language view ---
+ if (problem.lang_solutions && problem.lang_solutions.length > 0) {
+ // Check if this is HumanEvalPack (has buggy_code in solutions)
+ const hasBuggy = problem.lang_solutions.some(sol => sol.buggy_code);
+
+ // Bug info (HumanEvalPack only)
+ if (hasBuggy && (problem.bug_type || problem.failure_symptoms)) {
+ html += `
+ <div class="card">
+ <h2>Bug Information</h2>
+ <div class="bug-info">
+ ${problem.bug_type ? `<span class="bug-category">${escapeHtml(problem.bug_type)}</span>` : ''}
+ ${problem.failure_symptoms ? `<span class="bug-subtype">${escapeHtml(problem.failure_symptoms)}</span>` : ''}
+ </div>
+ </div>
+ `;
+ }
+
+ // Language tabs with code panels
+ html += `
+ <div class="card">
+ <h2>Source Code</h2>
+ `;
+
+ // Buggy/Canonical toggle for HumanEvalPack
+ if (hasBuggy) {
+ html += `
+ <div class="lang-tabs" id="code-mode-tabs" style="margin-bottom: 10px;">
+ <button class="lang-tab active" onclick="toggleCodeMode('canonical')">Canonical</button>
+ <button class="lang-tab" onclick="toggleCodeMode('buggy')">Buggy</button>
+ </div>
+ `;
+ }
+
+ html += `<div class="lang-tabs" id="lang-tabs">`;
+ problem.lang_solutions.forEach((sol, i) => {
+ const label = sol.language_label || sol.language;
+ html += `<button class="lang-tab ${i === 0 ? 'active' : ''}" onclick="showLangTab(${i})">${escapeHtml(label)}</button>`;
+ });
+ html += `</div>`;
+
+ problem.lang_solutions.forEach((sol, i) => {
+ html += `
+ <div class="lang-code-panel ${i === 0 ? 'active' : ''}" id="lang-panel-${i}">
+ <div class="code-with-tasks" id="lang-code-canonical-${i}">${sol.highlighted_code}</div>
+ ${sol.buggy_code ? `<div class="code-with-tasks" id="lang-code-buggy-${i}" style="display:none;">${sol.buggy_highlighted_code}</div>` : ''}
+ </div>
+ `;
+ });
+ html += `</div>`;
+
+ // Test suite for current language
+ html += `<div class="card" id="lang-test-container">`;
+ if (problem.lang_solutions[0].test) {
+ html += `<h2>Test Suite</h2><pre class="code-block">${escapeHtml(problem.lang_solutions[0].test)}</pre>`;
+ }
+ html += `</div>`;
+
+ // Description
+ if (problem.description) {
+ html += `
+ <div class="card">
+ <h2>Problem Description</h2>
+ <div style="padding: 10px; line-height: 1.6; white-space: pre-wrap;">${escapeHtml(problem.description)}</div>
+ </div>
+ `;
+ }
+
+ html += `
+ <div class="navigation-hint">
+ <strong>Tip:</strong> Use the Previous/Next buttons at the top to browse through problems,
+ or return to the list view to filter by dataset source or search by name.
+ </div>
+ `;
+
+ container.innerHTML = html;
+ window.currentProblem = problem;
+ window._currentCodeMode = 'canonical';
+ return;
+ }
+
+ // --- SWE-bench / CommitBench diff view (unified diff patch) ---
+ if (problem.patch !== undefined) {
+ // GitHub links bar (SWE-bench variants)
+ const ghLinks = [];
+ if (problem.repo_url) ghLinks.push(`<a href="${escapeHtml(problem.repo_url)}" target="_blank" class="gh-link">Repository</a>`);
+ if (problem.issue_url) ghLinks.push(`<a href="${escapeHtml(problem.issue_url)}" target="_blank" class="gh-link">Issue</a>`);
+ if (problem.commit_url) ghLinks.push(`<a href="${escapeHtml(problem.commit_url)}" target="_blank" class="gh-link">Base Commit</a>`);
+ if (ghLinks.length > 0) {
+ html += `<div class="card gh-links-bar">${ghLinks.join('')}</div>`;
+ }
+
+ // Metadata badges (version, date)
+ const metaBadges = [];
+ if (problem.version) metaBadges.push(`<span class="badge badge-info">v${escapeHtml(problem.version)}</span>`);
+ if (problem.created_at) metaBadges.push(`<span class="badge badge-info">${escapeHtml(problem.created_at.split('T')[0])}</span>`);
+ if (problem.commit_hash) metaBadges.push(`<span class="badge badge-info">${escapeHtml(problem.commit_hash.substring(0, 12))}</span>`);
+ if (problem.diff_languages) metaBadges.push(`<span class="badge badge-info">${escapeHtml(problem.diff_languages)}</span>`);
+ if (metaBadges.length > 0) {
+ html += `<div style="margin-bottom: 15px;">${metaBadges.join(' ')}</div>`;
+ }
+
+ // Problem statement (issue text / commit message)
+ if (problem.description) {
+ html += `
+ <div class="card">
+ <h2>${problem.issue_url ? 'Issue Description' : 'Description'}</h2>
+ <div class="issue-statement">${escapeHtml(problem.description)}</div>
+ </div>
+ `;
+ }
+
+ // Render unified diff with per-file sections
+ html += renderDiffFiles(problem.patch, 'Solution Patch');
+
+ // Test patch if available
+ if (problem.test_patch) {
+ html += renderDiffFiles(problem.test_patch, 'Test Patch');
+ }
+
+ // Failing tests
+ if (problem.fail_to_pass && problem.fail_to_pass.length > 0) {
+ html += `<div class="card"><h2>Tests: Fail → Pass</h2><ul class="test-id-list">`;
+ problem.fail_to_pass.forEach(t => {
+ html += `<li>${escapeHtml(t)}</li>`;
+ });
+ html += `</ul></div>`;
+ }
+
+ // Hints
+ if (problem.hints) {
+ html += `
+ <div class="card">
+ <h2>Hints</h2>
+ <div class="issue-statement">${escapeHtml(problem.hints)}</div>
+ </div>
+ `;
+ }
+
+ html += `
+ <div class="navigation-hint">
+ <strong>Tip:</strong> Use the Previous/Next buttons at the top to browse through problems,
+ or return to the list view to filter by dataset source or search by name.
+ </div>
+ `;
+
+ container.innerHTML = html;
+ window.currentProblem = problem;
+ return;
+ }
+
+ // --- SAFIM Fill-in-the-Middle view ---
+ if (problem.fim_ground_truth !== undefined) {
+ // Tab bar: "With Gap" | "Completed" | "Completion Only"
+ html += `
+ <div class="card">
+ <h2>Fill-in-the-Middle</h2>
+ <div class="lang-tabs" id="fim-tabs">
+ <button class="lang-tab" onclick="showFimTab(0)">With Gap</button>
+ <button class="lang-tab active" onclick="showFimTab(1)">Completed</button>
+ <button class="lang-tab" onclick="showFimTab(2)">Completion Only</button>
+ </div>
+ <div class="lang-code-panel" id="fim-panel-0">
+ <div class="code-with-tasks">${problem.highlighted_code}</div>
+ </div>
+ <div class="lang-code-panel active" id="fim-panel-1">
+ <div class="fim-merged-legend">
+ <span class="coverage-swatch" style="background:#ffffcc; border:1px solid #ccc;"></span>
+ Inserted completion (lines ${problem.fim_gt_start_line}&ndash;${problem.fim_gt_end_line})
+ </div>
+ <div class="code-with-tasks">${problem.fim_merged_highlighted}</div>
+ </div>
+ <div class="lang-code-panel" id="fim-panel-2">
+ <div class="fim-answer">${problem.fim_ground_truth_highlighted || escapeHtml(problem.fim_ground_truth)}</div>
+ </div>
+ </div>
+ `;
+
+ html += `
+ <div class="navigation-hint">
+ <strong>Tip:</strong> Use the Previous/Next buttons at the top to browse through problems,
+ or return to the list view to filter by dataset source or search by name.
+ </div>
+ `;
+ container.innerHTML = html;
+ window.currentProblem = problem;
+ return;
+ }
+
+ // --- Vulnerability view (BigVul, DiverseVul, PrimeVul) ---
+ if (problem.vulnerable_code !== undefined || problem.is_vulnerable !== undefined) {
+ // Vulnerability status and CWE info
+ const isVuln = problem.is_vulnerable;
+ html += `
+ <div class="card">
+ <h2>Vulnerability Information</h2>
+ <div style="margin-bottom: 10px;">
+ <span class="vuln-status ${isVuln ? 'vuln-status-vulnerable' : 'vuln-status-patched'}">
+ ${isVuln ? 'Vulnerable' : 'Patched'}
+ </span>
+ ${problem.cwe_id ? `<span class="cwe-badge">${escapeHtml(problem.cwe_id)}</span>` : ''}
+ ${problem.cve_id ? `<span class="badge badge-info">${escapeHtml(problem.cve_id)}</span>` : ''}
+ ${problem.project ? `<span class="badge badge-info">${escapeHtml(problem.project)}</span>` : ''}
+ </div>
+ ${problem.description ? `<p style="margin-top: 10px; color: #555;">${escapeHtml(problem.description).substring(0, 500)}</p>` : ''}
+ </div>
+ `;
+
+ // Show code with vuln/patched side-by-side if both available
+ if (problem.vulnerable_code && problem.patched_code) {
+ const vulnDiff = computeUnifiedDiff(problem.vulnerable_code, problem.patched_code);
+ html += `
+ <div class="card">
+ <h2>Changes</h2>
+ <div class="diff-view">${renderComputedDiff(vulnDiff)}</div>
+ </div>
+ `;
+ html += `
+ <div class="card">
+ <h2>Full Code Comparison</h2>
+ <div class="diff-container">
+ <div class="diff-panel">
+ <h3><span class="diff-label-buggy">Vulnerable</span></h3>
+ <div class="code-with-tasks">${problem.vulnerable_highlighted_code}</div>
+ </div>
+ <div class="diff-panel">
+ <h3><span class="diff-label-fixed">Patched</span></h3>
+ <div class="code-with-tasks">${problem.patched_highlighted_code}</div>
+ </div>
+ </div>
+ </div>
+ `;
+ } else {
+ // Single code view
+ html += `
+ <div class="card">
+ <h2>Source Code</h2>
+ <div class="code-with-tasks">${problem.highlighted_code}</div>
+ </div>
+ `;
+ }
+
+ html += `
+ <div class="navigation-hint">
+ <strong>Tip:</strong> Use the Previous/Next buttons at the top to browse through problems,
+ or return to the list view to filter by dataset source or search by name.
+ </div>
+ `;
+ container.innerHTML = html;
+ window.currentProblem = problem;
+ return;
+ }
+
+ // Source Code card
+ html += `
+ <div class="card">
+ <h2>Source Code</h2>
+ <div class="code-with-tasks" id="code-container">
+ ${problem.highlighted_code}
+ </div>
+ </div>
+ `;
+
+ // --- Non-DREval (simple) view ---
+ if (!hasTasks) {
+ // Show description if available (e.g. LiveCodeBench, MBPP+, ClassEval)
+ if (problem.description) {
+ html += `
+ <div class="card">
+ <h2>Problem Description</h2>
+ <div style="padding: 10px; line-height: 1.6; white-space: pre-wrap;">${escapeHtml(problem.description)}</div>
+ </div>
+ `;
+ }
+
+ // Show difficulty, contest date, tags, rating if available
+ if (problem.difficulty || problem.contest_date || problem.tags || problem.cf_rating) {
+ let metaHtml = '';
+ if (problem.difficulty) {
+ metaHtml += `<span class="badge badge-info">Difficulty: ${escapeHtml(problem.difficulty)}</span>`;
+ }
+ if (problem.cf_rating) {
+ metaHtml += `<span class="badge badge-info">Rating: ${problem.cf_rating}</span>`;
+ }
+ if (problem.contest_date) {
+ metaHtml += `<span class="badge badge-info">Date: ${escapeHtml(problem.contest_date.split('T')[0])}</span>`;
+ }
+ if (problem.tags && problem.tags.length > 0) {
+ problem.tags.forEach(tag => {
+ metaHtml += `<span class="badge badge-info">${escapeHtml(tag)}</span>`;
+ });
+ }
+ html += `<div style="margin-bottom: 15px;">${metaHtml}</div>`;
+ }
+
+ // Show inputs/outputs if available
+ if (problem.inputs && problem.inputs.length > 0) {
+ html += `<div class="card"><h2>Inputs &amp; Outputs</h2>`;
+ problem.inputs.forEach((inp, i) => {
+ const out = (problem.outputs && problem.outputs[i]) || '';
+ html += `
+ <div class="io-section" style="margin-bottom: 15px;">
+ <div class="task-section">
+ <h3>Input ${i + 1}</h3>
+ <pre class="code-block">${escapeHtml(inp)}</pre>
+ </div>
+ <div class="task-section">
+ <h3>Output</h3>
+ <pre class="code-block">${escapeHtml(out)}</pre>
+ </div>
+ </div>
+ `;
+ });
+ html += `</div>`;
+ }
+
+ // Show test suite if available
+ if (problem.test) {
+ html += `
+ <div class="card">
+ <h2>Test Suite</h2>
+ <pre class="code-block">${escapeHtml(problem.test)}</pre>
+ </div>
+ `;
+ }
519
+ // Navigation hint
520
+ html += `
521
+ <div class="navigation-hint">
522
+ <strong>Tip:</strong> Use the Previous/Next buttons at the top to browse through problems,
523
+ or return to the list view to filter by dataset source or search by name.
524
+ </div>
525
+ `;
526
+
527
+ container.innerHTML = html;
528
+ window.currentProblem = problem;
529
+ return;
530
+ }
531
+
+ // --- CRUXEval task view (tasks have given/predict fields, no task_items) ---
+ if (problem.tasks.length > 0 && problem.tasks[0].given !== undefined) {
+ // Task selector
+ html += `
+ <div class="card">
+ <h2>Tasks</h2>
+ <div class="task-selector" id="task-selector">
+ `;
+ problem.tasks.forEach((task, idx) => {
+ html += `
+ <button class="task-btn ${idx === 0 ? 'active' : ''}"
+ onclick="showCruxTask(${idx})">
+ ${escapeHtml(task.name)}
+ </button>
+ `;
+ });
+ html += `
+ </div>
+ <div id="task-content"></div>
+ </div>
+ `;
+
+ // Navigation hint
+ html += `
+ <div class="navigation-hint">
+ <strong>Tip:</strong> Use the Previous/Next buttons at the top to browse through problems,
+ or return to the list view to filter by dataset source or search by name.
+ </div>
+ `;
+
+ container.innerHTML = html;
+ window.currentProblem = problem;
+ showCruxTask(0);
+ return;
+ }
+
+ // --- DREval (full) view with tasks, coverage, arrows ---
+ // Rebuild html cleanly with coverage legend and SVG overlay
+ html = `
+ <div class="card">
+ <div class="problem-header">
+ <h2>${escapeHtml(problem.entry_point)}</h2>
+ <span class="badge ${badgeClass(problem.source)}">${escapeHtml(problem.source)}</span>
+ </div>
+ <div class="problem-meta">
+ <div class="meta-item">
+ <span class="meta-label">Task ID:</span>
+ <span class="meta-value">${escapeHtml(problem.task_id)}</span>
+ </div>
+ <div class="meta-item">
+ <span class="meta-label">Index:</span>
+ <span class="meta-value">${problem.idx}</span>
+ </div>
+ <div class="meta-item">
+ <span class="meta-label">Dataset:</span>
+ <span class="meta-value">${escapeHtml(datasetName)}</span>
+ </div>
+ <div class="meta-item">
+ <span class="meta-label">Test Inputs:</span>
+ <span class="meta-value">${problem.inputs.length}</span>
+ </div>
+ </div>
+ </div>
+
+ <div class="card">
+ <h2>Source Code</h2>
+ <div class="coverage-legend" id="coverage-legend">
+ <strong>Coverage:</strong>
+ <span class="coverage-legend-item">
+ <span class="coverage-swatch" style="background:#d4edda; border:1px solid #28a745;"></span>
+ Executed
+ </span>
+ <span class="coverage-legend-item">
+ <span class="coverage-swatch" style="background:#f8d7da; border:1px solid #dc3545;"></span>
+ Not executed
+ </span>
+ </div>
+ <div class="code-with-tasks" id="code-container">
+ ${problem.highlighted_code}
+ <svg id="arrow-overlay" xmlns="http://www.w3.org/2000/svg">
+ <defs>
+ <marker id="arrowhead" markerWidth="8" markerHeight="6"
+ refX="8" refY="3" orient="auto">
+ <polygon points="0 0, 8 3, 0 6" class="exec-arrow-head"/>
+ </marker>
+ </defs>
+ </svg>
+ </div>
+ </div>
+ `;
+
+ // Task selector
+ html += `
+ <div class="card">
+ <h2>Test Cases & Tasks</h2>
+ <p>Select a test input to view associated reasoning tasks:</p>
+ <div class="task-selector" id="task-selector">
+ `;
+
+ problem.tasks.forEach((task, idx) => {
+ html += `
+ <button class="task-btn ${idx === 0 ? 'active' : ''}"
+ onclick="showTask(${idx})">
+ Input ${task.input_idx + 1}
+ </button>
+ `;
+ });
+
+ html += `
+ </div>
+ <div id="task-content"></div>
+ </div>
+ `;
+
+ // Navigation hint
+ html += `
+ <div class="navigation-hint">
+ <strong>Tip:</strong> Use the Previous/Next buttons at the top to browse through problems,
+ or return to the list view to filter by dataset source or search by name.
+ </div>
+ `;
+
+ container.innerHTML = html;
+
+ // Store problem data globally
+ window.currentProblem = problem;
+
+ // Show first task by default
+ showTask(0);
+ }
+
+ function injectTaskMarkers(taskItems) {
+ const codePre = document.querySelector('.source .code pre');
+
+ // Save the pristine original innerHTML once, before any modification.
+ if (codePre && !window._codePreOriginalHtml) {
+ window._codePreOriginalHtml = codePre.innerHTML;
+ }
+
+ // Invalidate span cache (rebuilt lazily on next arrow draw)
+ window._linenoSpanCache = null;
+
+ // Store current task items so applyCoverage can re-add markers after wrapping.
+ window._currentTaskItems = taskItems || [];
+
+ // Reset code pre to original, then add markers from scratch.
+ if (codePre && window._codePreOriginalHtml) {
+ codePre.innerHTML = window._codePreOriginalHtml;
+ }
+
+ if (!taskItems || taskItems.length === 0) {
+ return;
+ }
+
+ // Group tasks by line number
+ const tasksByLine = {};
+ taskItems.forEach(item => {
+ if (!tasksByLine[item.lineno]) tasksByLine[item.lineno] = [];
+ tasksByLine[item.lineno].push(item.var);
+ });
+
+ // Inject task marker badges into the code pre
+ if (!codePre) return;
+ const codeLines = codePre.innerHTML.split('\n');
+ codePre.innerHTML = codeLines.map((line, idx) => {
+ const lineNum = idx + 1;
+ if (tasksByLine[lineNum] && line.trim() !== '') {
+ const vars = tasksByLine[lineNum];
+ return line + `<span class="task-marker" data-lineno="${lineNum}" data-vars="${escapeHtml(vars.join(', '))}">${escapeHtml(vars.join(', '))}</span>`;
+ }
+ return line;
+ }).join('\n');
+ }
+
+ function applyCoverage(coverageSet, totalLines) {
+ // Remove previous coverage classes from lineno spans.
+ document.querySelectorAll('td.linenos .normal').forEach(el => {
+ el.classList.remove('line-executed', 'line-not-executed');
+ });
+
+ if (!coverageSet) {
+ const legend = document.getElementById('coverage-legend');
+ if (legend) legend.style.display = 'none';
+ return;
+ }
+
+ const legend = document.getElementById('coverage-legend');
+ if (legend) legend.style.display = 'block';
+
+ // Color lineno spans only.
+ document.querySelectorAll('td.linenos .normal').forEach(span => {
+ const lineNum = parseInt(span.textContent.trim());
+ if (!isNaN(lineNum) && lineNum <= totalLines) {
+ span.classList.add(coverageSet.has(lineNum) ? 'line-executed' : 'line-not-executed');
+ }
+ });
+ }
+
+ // Global map: lineno -> list of next line numbers (1-indexed; -1 = end of trace)
+ window._nextLinesMap = {};
+
+ async function loadAndApplyGroundTruth(problemIdx, inputIdx, taskItems) {
+ // Show "loading" placeholders on all task items
+ taskItems.forEach(item => {
+ const el = document.getElementById(`gt-${item.lineno}-${item.var}`);
+ if (el) { el.textContent = '…'; el.className = 'gt-answer loading'; }
+ });
+
+ // Clear next-lines data from previous input
+ window._nextLinesMap = {};
+
+ try {
+ const resp = await fetch(`/api/${datasetSlug}/problem/${problemIdx}/ground_truth/${inputIdx}`);
+ const gt = await resp.json();
+
+ if (gt.status !== 'ok') {
+ taskItems.forEach(item => {
+ const el = document.getElementById(`gt-${item.lineno}-${item.var}`);
+ if (el) { el.textContent = gt.status === 'error' ? '(exec error)' : '(unavailable)'; el.className = 'gt-answer'; }
+ });
+ applyCoverage(null, 0);
+ return;
+ }
+
+ // Apply coverage highlighting
+ const coverageSet = new Set(gt.coverage);
+ applyCoverage(coverageSet, gt.total_lines);
+
+ // Fill in variable answers
+ const answerMap = {};
+ gt.variable_answers.forEach(a => {
+ answerMap[`${a.lineno}-${a.var}`] = a.answer_str;
+ });
+ taskItems.forEach(item => {
+ const el = document.getElementById(`gt-${item.lineno}-${item.var}`);
+ if (el) {
+ const answer = answerMap[`${item.lineno}-${item.var}`] || '(not available)';
+ el.textContent = answer;
+ el.className = 'gt-answer';
+ }
+ });
+
+ // Store next-lines data for arrow visualization
+ if (gt.next_lines_answers) {
+ gt.next_lines_answers.forEach(a => {
+ window._nextLinesMap[a.lineno] = a.next_lines;
+ });
+ }
+
+ // Attach hover handlers to task-marker spans now that we have next-lines data
+ attachArrowHoverHandlers();
+
+ } catch (e) {
+ taskItems.forEach(item => {
+ const el = document.getElementById(`gt-${item.lineno}-${item.var}`);
+ if (el) { el.textContent = '(error)'; el.className = 'gt-answer'; }
+ });
+ }
+ }
+
+ // Cache of lineNum → DOM span, rebuilt whenever injectTaskMarkers runs.
+ window._linenoSpanCache = null;
+
+ function buildLinenoSpanCache(container) {
+ const cache = {};
+ container.querySelectorAll('td.linenos .normal').forEach(span => {
+ const n = parseInt(span.textContent.trim());
+ if (!isNaN(n)) cache[n] = span;
+ });
+ window._linenoSpanCache = cache;
+ }
+
+ /**
+ * Get the bounding rect of the lineno span for a given 1-indexed line number,
+ * relative to the code container element. Uses a cached span map.
+ */
+ function getLinenoSpanRect(lineNum, container) {
+ if (!window._linenoSpanCache) buildLinenoSpanCache(container);
+ const span = window._linenoSpanCache[lineNum];
+ if (!span) return null;
+ const spanRect = span.getBoundingClientRect();
+ const containerRect = container.getBoundingClientRect();
+ return {
+ top: spanRect.top - containerRect.top + container.scrollTop,
+ bottom: spanRect.bottom - containerRect.top + container.scrollTop,
+ left: spanRect.left - containerRect.left,
+ right: spanRect.right - containerRect.left,
+ width: spanRect.width,
+ height: spanRect.height,
+ midY: (spanRect.top + spanRect.bottom) / 2 - containerRect.top + container.scrollTop,
+ };
+ }
+
+ /**
+ * Draw arrows from sourceLine to each of the targetLines in the SVG overlay.
+ * Lines are 1-indexed. -1 means "end of execution" (no arrow drawn).
+ */
+ function drawArrows(sourceLineNum, targetLineNums) {
+ const container = document.getElementById('code-container');
+ const svg = document.getElementById('arrow-overlay');
+ if (!container || !svg) return;
+
+ // Remove previous arrows (but keep defs)
+ svg.querySelectorAll('.arrow-path').forEach(el => el.remove());
+
+ const srcRect = getLinenoSpanRect(sourceLineNum, container);
+ if (!srcRect) return;
+
+ // Update SVG height to match container
+ svg.setAttribute('height', container.scrollHeight);
+
+ targetLineNums.forEach(targetLineNum => {
+ if (targetLineNum === -1) return; // end of trace — no arrow
+
+ const dstRect = getLinenoSpanRect(targetLineNum, container);
+ if (!dstRect) return;
+
+ // Start point: right edge of source lineno span, vertically centered
+ const x1 = srcRect.right + 2;
+ const y1 = srcRect.midY;
+
+ // End point: right edge of target lineno span, vertically centered
+ const x2 = dstRect.right + 2;
+ const y2 = dstRect.midY;
+
+ // Horizontal offset for the bezier control points — curves to the right
+ const curveOffset = Math.max(30, Math.abs(y2 - y1) * 0.4);
+
+ // Cubic bezier: both control points extend to the right of the lineno column
+ const cx1 = x1 + curveOffset;
+ const cy1 = y1;
+ const cx2 = x2 + curveOffset;
+ const cy2 = y2;
+
+ const path = document.createElementNS('http://www.w3.org/2000/svg', 'path');
+ path.setAttribute('d', `M ${x1} ${y1} C ${cx1} ${cy1}, ${cx2} ${cy2}, ${x2} ${y2}`);
+ path.setAttribute('class', 'exec-arrow arrow-path');
+ path.setAttribute('marker-end', 'url(#arrowhead)');
+ svg.appendChild(path);
+ });
+ }
+
+ /**
+ * Clear all arrows from the SVG overlay.
+ */
+ function clearArrows() {
+ const svg = document.getElementById('arrow-overlay');
+ if (svg) {
+ svg.querySelectorAll('.arrow-path').forEach(el => el.remove());
+ }
+ }
+
+ // AbortController for the current set of marker hover listeners.
+ let _markerListenersAbort = null;
+
+ /**
+ * Attach mouseenter/mouseleave handlers to all .task-marker spans so that
+ * hovering shows execution-flow arrows to next lines.
+ */
+ function attachArrowHoverHandlers() {
+ // Cancel any previously attached listeners without touching the DOM.
+ if (_markerListenersAbort) _markerListenersAbort.abort();
+ _markerListenersAbort = new AbortController();
+ const { signal } = _markerListenersAbort;
+
+ document.querySelectorAll('.task-marker').forEach(marker => {
+ marker.addEventListener('mouseenter', () => {
+ const lineNum = parseInt(marker.dataset.lineno);
+ if (!lineNum) return;
+ const nextLines = window._nextLinesMap[lineNum];
+ if (nextLines && nextLines.length > 0) {
+ drawArrows(lineNum, nextLines);
+ }
+ }, { signal });
+
+ marker.addEventListener('mouseleave', () => {
+ clearArrows();
+ }, { signal });
+ });
+ }
+
+ function showCruxTask(taskIdx) {
+ const problem = window.currentProblem;
+ const task = problem.tasks[taskIdx];
+
+ // Update active button
+ document.querySelectorAll('.task-btn').forEach((btn, idx) => {
+ btn.classList.toggle('active', idx === taskIdx);
+ });
+
+ const givenLabel = task.given === 'input' ? 'Input (given)' : 'Output (given)';
+ const predictLabel = task.predict === 'output' ? 'Output (predict)' : 'Input (predict)';
+ const givenValue = task.given === 'input' ? task.input : task.output;
+ const predictValue = task.predict === 'output' ? task.output : task.input;
+
+ const html = `
+ <div class="task-details">
+ <div class="task-section">
+ <p style="margin-bottom: 12px; color: #7f8c8d;">${escapeHtml(task.description)}</p>
+ </div>
+ <div class="io-section">
+ <div class="task-section">
+ <h3>${escapeHtml(givenLabel)}</h3>
+ <pre class="code-block">${escapeHtml(givenValue)}</pre>
+ </div>
+ <div class="task-section">
+ <h3>${escapeHtml(predictLabel)}</h3>
+ <pre class="code-block crux-answer">${escapeHtml(predictValue)}</pre>
+ </div>
+ </div>
+ </div>
+ `;
+
+ document.getElementById('task-content').innerHTML = html;
+ }
+
+ function showTask(taskIdx) {
+ const problem = window.currentProblem;
+ const task = problem.tasks[taskIdx];
+
+ // Update active button
+ const buttons = document.querySelectorAll('.task-btn');
+ buttons.forEach((btn, idx) => {
+ btn.classList.toggle('active', idx === taskIdx);
+ });
+
+ // Inject task markers into the code
+ injectTaskMarkers(task.task_items);
+
+ // Clear previous coverage while new one loads
+ applyCoverage(null, 0);
+
+ // Render task content
+ const ioSection = task.test_class_code
+ ? `<div class="io-section">
+ <div class="task-section">
+ <h3>Input</h3>
+ <pre class="code-block">${escapeHtml(task.input)}</pre>
+ </div>
+ </div>
+ <div class="task-section">
+ <h3>Test Class &mdash; <code>${escapeHtml(task.test_class_name)}</code></h3>
+ <pre class="code-block">${escapeHtml(task.test_class_code)}</pre>
+ </div>`
+ : `<div class="io-section">
+ <div class="task-section">
+ <h3>Input</h3>
+ <pre class="code-block">${escapeHtml(task.input)}</pre>
+ </div>
+ <div class="task-section">
+ <h3>Expected Output</h3>
+ <pre class="code-block">${escapeHtml(task.output)}</pre>
+ </div>
+ </div>`;
+
+ let html = `
+ <div class="task-details">
+ ${ioSection}
+ `;
+
+ // Show task items with ground truth answer placeholders
+ if (task.task_items && task.task_items.length > 0) {
+ html += `
+ <div class="task-section">
+ <h3>Reasoning Tasks</h3>
+ <p style="margin-bottom: 10px; color: #7f8c8d;">
+ Variable state at each execution point (correct answer shown in
+ <span style="background:#17a2b8;color:white;padding:1px 6px;border-radius:3px;font-size:0.82rem;">teal</span>):
+ </p>
+ <ul class="task-items-list">
+ `;
+
+ task.task_items.forEach(item => {
+ html += `
+ <li>
+ <span class="line-ref">Line ${item.lineno}</span>
+ <span class="var-name">${escapeHtml(item.var)}</span>
+ <span class="gt-answer loading" id="gt-${item.lineno}-${item.var}">…</span>
+ </li>
+ `;
+ });
+
+ html += `
+ </ul>
+ </div>
+ `;
+ }
+
+ // Show output prediction task if exists
+ if (task.output_pred) {
+ html += `
+ <div class="task-section">
+ <h3>Output Completion Task</h3>
+ <p style="margin-bottom: 10px; color: #7f8c8d;">
+ The model needs to complete this test assertion:
+ </p>
+ <pre class="code-block">${escapeHtml(task.output_pred)}</pre>
+ </div>
+ `;
+ }
+
+ html += `</div>`;
+
+ document.getElementById('task-content').innerHTML = html;
+
+ // Fetch and apply ground truth (coverage + variable answers)
+ if (hasGroundTruth && task.task_items) {
+ loadAndApplyGroundTruth(problem.idx, task.input_idx, task.task_items);
+ }
+ }
+
+ function showLangTab(idx) {
+ document.querySelectorAll('.lang-tab').forEach((tab, i) => {
+ tab.classList.toggle('active', i === idx);
+ });
+ document.querySelectorAll('.lang-code-panel').forEach((panel, i) => {
+ panel.classList.toggle('active', i === idx);
+ });
+ // Update test section
+ const problem = window.currentProblem;
+ if (problem && problem.lang_solutions) {
+ const sol = problem.lang_solutions[idx];
+ const testContainer = document.getElementById('lang-test-container');
+ if (testContainer && sol.test) {
+ testContainer.innerHTML = `<h2>Test Suite</h2><pre class="code-block">${escapeHtml(sol.test)}</pre>`;
+ } else if (testContainer) {
+ testContainer.innerHTML = '';
+ }
+ }
+ }
+
+ function toggleCodeMode(mode) {
+ window._currentCodeMode = mode;
+ const problem = window.currentProblem;
+ if (!problem || !problem.lang_solutions) return;
+
+ // Update mode tabs
+ const modeTabs = document.querySelectorAll('#code-mode-tabs .lang-tab');
+ modeTabs.forEach(tab => {
+ tab.classList.toggle('active', tab.textContent.trim().toLowerCase() === mode);
+ });
+
+ // Toggle visibility of canonical/buggy code in all panels
+ problem.lang_solutions.forEach((sol, i) => {
+ const canonical = document.getElementById('lang-code-canonical-' + i);
+ const buggy = document.getElementById('lang-code-buggy-' + i);
+ if (canonical) canonical.style.display = mode === 'canonical' ? '' : 'none';
+ if (buggy) buggy.style.display = mode === 'buggy' ? '' : 'none';
+ });
+ }
+
+ function showFimTab(idx) {
+ const tabs = document.querySelectorAll('#fim-tabs .lang-tab');
+ tabs.forEach((tab, i) => tab.classList.toggle('active', i === idx));
+ for (let i = 0; i < 3; i++) {
+ const panel = document.getElementById('fim-panel-' + i);
+ if (panel) panel.classList.toggle('active', i === idx);
+ }
+ }
+
+ /**
+ * Split a unified diff into per-file sections and render each with a GitHub-style
+ * file header bar. Returns an HTML string with one card per file.
+ */
+ function renderDiffFiles(diffText, title) {
+ if (!diffText) return '';
+ // Split into per-file chunks by "diff --git" boundaries
+ const files = [];
+ let current = null;
+ diffText.split('\n').forEach(line => {
+ if (line.startsWith('diff --git')) {
+ if (current) files.push(current);
+ // Extract file path from "diff --git a/path b/path"
+ const m = line.match(/^diff --git a\/(.+?) b\/(.+)/);
+ const filePath = m ? m[2] : line;
+ current = { path: filePath, lines: [line] };
+ } else if (current) {
+ current.lines.push(line);
+ } else {
+ // Lines before any diff header — create a default section
+ current = { path: '', lines: [] };
+ current.lines.push(line);
+ }
+ });
+ if (current) files.push(current);
+
+ if (files.length === 0) return '';
+
+ let html = '';
+ if (files.length === 1 && !files[0].path) {
+ // Single unnamed diff — render as before
+ html += `<div class="card"><h2>${escapeHtml(title)}</h2><div class="diff-view">${renderDiff(diffText)}</div></div>`;
+ } else {
+ html += `<div class="card"><h2>${escapeHtml(title)}</h2>`;
+ files.forEach(file => {
+ const diffChunk = file.lines.join('\n');
+ // Count additions/deletions
+ let adds = 0, dels = 0;
+ file.lines.forEach(l => {
+ if (l.startsWith('+') && !l.startsWith('+++')) adds++;
+ if (l.startsWith('-') && !l.startsWith('---')) dels++;
+ });
+ const statsHtml = `<span class="diff-file-stats"><span class="diff-stat-add">+${adds}</span> <span class="diff-stat-del">-${dels}</span></span>`;
+ html += `
+ <div class="diff-file-section">
+ <div class="diff-file-header">
+ <span class="diff-file-path">${escapeHtml(file.path)}</span>
+ ${statsHtml}
+ </div>
+ <div class="diff-view">${renderDiff(diffChunk)}</div>
+ </div>
+ `;
+ });
+ html += `</div>`;
+ }
+ return html;
+ }
+
+ /**
+ * Render a unified diff with line numbers and file headers (GitHub-style).
+ */
+ function renderDiff(diffText) {
+ if (!diffText) return '';
+ const lines = diffText.split('\n');
+ let oldLine = 0, newLine = 0;
+ const rows = [];
+
+ lines.forEach(line => {
+ if (line.startsWith('diff ')) {
+ rows.push(`<tr class="diff-tr-header"><td class="diff-ln"></td><td class="diff-ln"></td><td class="diff-td-header">${escapeHtml(line)}</td></tr>`);
+ return;
+ }
+ if (line.startsWith('---') || line.startsWith('+++')) {
+ rows.push(`<tr class="diff-tr-header"><td class="diff-ln"></td><td class="diff-ln"></td><td class="diff-td-header">${escapeHtml(line)}</td></tr>`);
+ return;
+ }
+ if (line.startsWith('@@')) {
+ // Parse hunk header: @@ -oldStart,oldCount +newStart,newCount @@
+ const m = line.match(/@@ -(\d+)(?:,\d+)? \+(\d+)(?:,\d+)? @@/);
+ if (m) {
+ oldLine = parseInt(m[1]);
+ newLine = parseInt(m[2]);
+ }
+ rows.push(`<tr class="diff-tr-hunk"><td class="diff-ln"></td><td class="diff-ln"></td><td class="diff-td-hunk">${escapeHtml(line)}</td></tr>`);
+ return;
+ }
+ if (line.startsWith('+')) {
+ rows.push(`<tr class="diff-tr-add"><td class="diff-ln"></td><td class="diff-ln">${newLine}</td><td class="diff-td-add">${escapeHtml(line.substring(1))}</td></tr>`);
+ newLine++;
+ } else if (line.startsWith('-')) {
+ rows.push(`<tr class="diff-tr-del"><td class="diff-ln">${oldLine}</td><td class="diff-ln"></td><td class="diff-td-del">${escapeHtml(line.substring(1))}</td></tr>`);
+ oldLine++;
+ } else if (line.startsWith(' ')) {
+ rows.push(`<tr class="diff-tr-ctx"><td class="diff-ln">${oldLine}</td><td class="diff-ln">${newLine}</td><td class="diff-td-ctx">${escapeHtml(line.substring(1))}</td></tr>`);
+ oldLine++;
+ newLine++;
+ } else if (line.trim() === '') {
+ // Empty trailing line
+ } else {
+ rows.push(`<tr class="diff-tr-ctx"><td class="diff-ln">${oldLine}</td><td class="diff-ln">${newLine}</td><td class="diff-td-ctx">${escapeHtml(line)}</td></tr>`);
+ oldLine++;
+ newLine++;
+ }
+ });
+
+ return `<table class="diff-table">${rows.join('')}</table>`;
+ }
+
+ /**
+ * Simple line-by-line diff (LCS-based) between two code strings.
+ * Returns an array of {type: 'context'|'add'|'del', line: string}.
+ */
+ function computeUnifiedDiff(oldText, newText) {
+ const oldLines = (oldText || '').split('\n');
+ const newLines = (newText || '').split('\n');
+
+ // LCS for line sequences
+ const m = oldLines.length, n = newLines.length;
+ // For very large files, just show both in full instead of computing LCS
+ if (m * n > 500000) {
+ const result = [];
+ oldLines.forEach(l => result.push({type: 'del', line: l}));
+ newLines.forEach(l => result.push({type: 'add', line: l}));
+ return result;
+ }
+
+ const dp = Array.from({length: m + 1}, () => new Uint16Array(n + 1));
+ for (let i = 1; i <= m; i++) {
+ for (let j = 1; j <= n; j++) {
+ if (oldLines[i - 1] === newLines[j - 1]) {
+ dp[i][j] = dp[i - 1][j - 1] + 1;
+ } else {
+ dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
+ }
+ }
+ }
+
+ // Backtrack to build diff
+ const result = [];
+ let i = m, j = n;
+ while (i > 0 || j > 0) {
+ if (i > 0 && j > 0 && oldLines[i - 1] === newLines[j - 1]) {
+ result.push({type: 'context', line: oldLines[i - 1]});
+ i--; j--;
+ } else if (j > 0 && (i === 0 || dp[i][j - 1] >= dp[i - 1][j])) {
+ result.push({type: 'add', line: newLines[j - 1]});
+ j--;
+ } else {
+ result.push({type: 'del', line: oldLines[i - 1]});
+ i--;
+ }
+ }
+ result.reverse();
+
+ // Compact: only show hunks with context (3 lines around changes)
+ const contextSize = 3;
+ const hasChange = result.map(r => r.type !== 'context');
+ const show = new Uint8Array(result.length);
+ for (let k = 0; k < result.length; k++) {
+ if (hasChange[k]) {
+ for (let c = Math.max(0, k - contextSize); c <= Math.min(result.length - 1, k + contextSize); c++) {
+ show[c] = 1;
+ }
+ }
+ }
+
+ const compacted = [];
+ let lastShown = -1;
+ for (let k = 0; k < result.length; k++) {
+ if (show[k]) {
+ if (lastShown >= 0 && k - lastShown > 1) {
+ compacted.push({type: 'separator', line: '...'});
+ }
+ compacted.push(result[k]);
+ lastShown = k;
+ }
+ }
+
+ return compacted.length > 0 ? compacted : result;
+ }
+
+ /**
+ * Render the output of computeUnifiedDiff into diff HTML with line numbers.
+ */
+ function renderComputedDiff(diffEntries) {
+ let oldLine = 1, newLine = 1;
+ const rows = diffEntries.map(entry => {
+ if (entry.type === 'separator') {
+ return `<tr class="diff-tr-hunk"><td class="diff-ln"></td><td class="diff-ln"></td><td class="diff-td-hunk">${escapeHtml(entry.line)}</td></tr>`;
+ }
+ if (entry.type === 'del') {
+ const row = `<tr class="diff-tr-del"><td class="diff-ln">${oldLine}</td><td class="diff-ln"></td><td class="diff-td-del">${escapeHtml(entry.line)}</td></tr>`;
+ oldLine++;
+ return row;
+ }
+ if (entry.type === 'add') {
+ const row = `<tr class="diff-tr-add"><td class="diff-ln"></td><td class="diff-ln">${newLine}</td><td class="diff-td-add">${escapeHtml(entry.line)}</td></tr>`;
+ newLine++;
+ return row;
+ }
+ // context
+ const row = `<tr class="diff-tr-ctx"><td class="diff-ln">${oldLine}</td><td class="diff-ln">${newLine}</td><td class="diff-td-ctx">${escapeHtml(entry.line)}</td></tr>`;
+ oldLine++;
+ newLine++;
+ return row;
+ });
+ return `<table class="diff-table">${rows.join('')}</table>`;
+ }
+
+ function escapeHtml(text) {
+ if (text === null || text === undefined) return '';
+ const div = document.createElement('div');
+ div.textContent = text;
+ return div.innerHTML;
+ }
+
+ loadProblem();
templates/base.html CHANGED
@@ -92,6 +92,159 @@
  color: white;
  }
 
  .badge-info {
  background: #ecf0f1;
  color: #2c3e50;
@@ -192,6 +345,7 @@
 
  {% block extra_css %}{% endblock %}
  </style>
  </head>
  <body>
  <header>
95
+ .badge-mbpp {
96
+ background: #16a085;
97
+ color: white;
98
+ }
99
+
100
+ .badge-codeforces {
101
+ background: #e74c3c;
102
+ color: white;
103
+ }
104
+
105
+ .badge-leetcode {
106
+ background: #f39c12;
107
+ color: white;
108
+ }
109
+
110
+ .badge-atcoder {
111
+ background: #2ecc71;
112
+ color: white;
113
+ }
114
+
115
+ .badge-cppsyntaxerror, .badge-cppreferenceerror, .badge-cpplogicerror, .badge-cppmultipleerror {
116
+ background: #3498db;
117
+ color: white;
118
+ }
119
+
120
+ .badge-javasyntaxerror, .badge-javareferenceerror, .badge-javalogicerror, .badge-javamultipleerror {
121
+ background: #e67e22;
122
+ color: white;
123
+ }
124
+
125
+ .badge-pythonsyntaxerror, .badge-pythonreferenceerror, .badge-pythonlogicerror, .badge-pythonmultipleerror {
126
+ background: #2ecc71;
127
+ color: white;
128
+ }
129
+
130
+ .badge-humanevalx {
131
+ background: #1abc9c;
132
+ color: white;
133
+ }
134
+
135
+ /* SWE-bench repo badges */
136
+ .badge-djangodjango, .badge-astropyastropy, .badge-matplotlibmatplotlib, .badge-scikitimagescikitimage {
137
+ background: #0d6efd;
138
+ color: white;
139
+ }
140
+
141
+ .badge-sympy, .badge-sympysympy, .badge-pylintdevpylint, .badge-sphinxdocsphinx,
142
+ .badge-palletsflask, .badge-palletsjinja, .badge-pyaborpyabor, .badge-pytestdevpytest {
143
+ background: #6610f2;
144
+ color: white;
145
+ }
146
+
147
+ /* APPS difficulty badges */
148
+ .badge-introductory {
149
+ background: #27ae60;
150
+ color: white;
151
+ }
152
+
153
+ .badge-interview {
154
+ background: #f39c12;
155
+ color: white;
156
+ }
157
+
158
+ .badge-competition {
159
+ background: #e74c3c;
160
+ color: white;
161
+ }
162
+
163
+ /* CanItEdit change kind badges */
164
+ .badge-adaptive {
165
+ background: #3498db;
166
+ color: white;
167
+ }
168
+
169
+ .badge-perfective {
170
+ background: #2ecc71;
171
+ color: white;
172
+ }
173
+
174
+ .badge-corrective {
175
+ background: #e67e22;
176
+ color: white;
177
+ }
178
+
179
+ .badge-canitedit {
180
+ background: #9b59b6;
181
+ color: white;
182
+ }
183
+
184
+ /* CodeContests source badges (extend existing) */
185
+ .badge-codechef {
186
+ background: #5b4638;
187
+ color: white;
188
+ }
189
+
190
+ .badge-codejam {
191
+ background: #4285f4;
192
+ color: white;
193
+ }
194
+
195
+ .badge-hackerearth {
196
+ background: #2c3454;
197
+ color: white;
198
+ }
199
+
200
+ .badge-aizu {
201
+ background: #0089d0;
202
+ color: white;
203
+ }
204
+
205
+ .badge-unknown {
206
+ background: #95a5a6;
207
+ color: white;
208
+ }
209
+
210
+ /* SAFIM language badges */
211
+ .badge-python, .badge-java, .badge-c {
212
+ background: #3498db;
213
+ color: white;
214
+ }
215
+
216
+ /* Vulnerability badges */
217
+ .badge-vulnerable {
218
+ background: #e74c3c;
219
+ color: white;
220
+ }
221
+
222
+ .badge-patched {
223
+ background: #27ae60;
224
+ color: white;
225
+ }
226
+
227
+ /* CodeEditorBench type badges */
228
+ .badge-codedebug {
229
+ background: #e74c3c;
230
+ color: white;
231
+ }
232
+
233
+ .badge-codetranslate {
234
+ background: #3498db;
235
+ color: white;
236
+ }
237
+
238
+ .badge-codepolish {
239
+ background: #2ecc71;
240
+ color: white;
241
+ }
242
+
243
+ .badge-coderequirementswitch {
244
+ background: #9b59b6;
245
+ color: white;
246
+ }
247
+
248
  .badge-info {
249
  background: #ecf0f1;
250
  color: #2c3e50;
 
345
 
346
  {% block extra_css %}{% endblock %}
347
  </style>
348
+ {% block extra_head %}{% endblock %}
349
  </head>
350
  <body>
351
  <header>
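
The badge selectors added above are keyed by source strings lowercased with non-alphanumerics stripped, which is why repo names double up (e.g. `django/django` becomes `djangodjango`). A minimal sketch of that mapping, mirroring the `badgeClass` helper used in the templates:

```javascript
// Maps a source tag to its badge CSS class,
// e.g. "django/django" -> "badge-djangodjango", "LeetCode" -> "badge-leetcode".
function badgeClass(source) {
  return 'badge-' + source.toLowerCase().replace(/[^a-z0-9]/g, '');
}
```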
templates/index.html CHANGED
@@ -84,55 +84,45 @@
84
  color: #7f8c8d;
85
  }
86
 
87
- .stats {
88
  display: flex;
89
- gap: 20px;
90
- margin-bottom: 20px;
91
  flex-wrap: wrap;
92
  }
93
 
94
- .stat-card {
95
- background: white;
96
- border-radius: 8px;
97
- padding: 20px;
98
- box-shadow: 0 2px 4px rgba(0,0,0,0.1);
99
- flex: 1;
100
- min-width: 200px;
101
  }
102
 
103
- .stat-number {
104
- font-size: 2.5rem;
105
- font-weight: 700;
106
- color: #3498db;
107
  }
108
 
109
- .stat-label {
110
- font-size: 0.9rem;
111
  color: #7f8c8d;
112
- margin-top: 5px;
113
  }
114
  </style>
115
  {% endblock %}
116
 
117
  {% block content %}
118
- <div class="stats" id="stats">
119
- <div class="stat-card">
120
- <div class="stat-number" id="total-problems">-</div>
121
- <div class="stat-label">Total Problems</div>
122
- </div>
123
- <div class="stat-card" id="stat-source-a">
124
- <div class="stat-number" id="source-a-count">-</div>
125
- <div class="stat-label" id="source-a-label">Source A</div>
126
- </div>
127
- <div class="stat-card" id="stat-source-b">
128
- <div class="stat-number" id="source-b-count">-</div>
129
- <div class="stat-label" id="source-b-label">Source B</div>
130
- </div>
131
- <div class="stat-card">
132
- <div class="stat-number" id="filtered-count">-</div>
133
- <div class="stat-label">Displayed</div>
134
- </div>
135
- </div>
136
 
137
  <div class="card">
138
  <h2>Filter Problems</h2>
@@ -179,7 +169,10 @@ async function loadDatasets() {
179
  datasets.forEach(ds => {
180
  const opt = document.createElement('option');
181
  opt.value = ds.slug;
182
- opt.textContent = `${ds.display_name} (${ds.problem_count})`;
183
  if (ds.slug === currentDataset) opt.selected = true;
184
  select.appendChild(opt);
185
  });
@@ -236,27 +229,22 @@ function updateStats() {
236
  sources[p.source] = (sources[p.source] || 0) + 1;
237
  });
238
 
239
- document.getElementById('total-problems').textContent = allProblems.length;
240
 
241
- const sourceNames = Object.keys(sources);
242
- const statA = document.getElementById('stat-source-a');
243
- const statB = document.getElementById('stat-source-b');
244
-
245
- if (sourceNames.length >= 1) {
246
- statA.style.display = '';
247
- document.getElementById('source-a-count').textContent = sources[sourceNames[0]];
248
- document.getElementById('source-a-label').textContent = sourceNames[0];
249
- } else {
250
- statA.style.display = 'none';
251
- }
252
-
253
- if (sourceNames.length >= 2) {
254
- statB.style.display = '';
255
- document.getElementById('source-b-count').textContent = sources[sourceNames[1]];
256
- document.getElementById('source-b-label').textContent = sourceNames[1];
257
- } else {
258
- statB.style.display = 'none';
259
  }
 
260
  }
261
 
262
  function badgeClass(source) {
@@ -268,12 +256,9 @@ function renderProblems(problems) {
268
 
269
  if (problems.length === 0) {
270
  container.innerHTML = '<div class="card"><p>No problems match your filters.</p></div>';
271
- document.getElementById('filtered-count').textContent = '0';
272
  return;
273
  }
274
 
275
- document.getElementById('filtered-count').textContent = problems.length;
276
-
277
  const grid = document.createElement('div');
278
  grid.className = 'problems-grid';
279
 
 
84
  color: #7f8c8d;
85
  }
86
 
87
+ .stats-bar {
88
  display: flex;
89
+ align-items: center;
90
+ gap: 16px;
91
+ margin-bottom: 16px;
92
+ font-size: 0.9rem;
93
+ color: #555;
94
  flex-wrap: wrap;
95
  }
96
 
97
+ .stats-total {
98
+ font-weight: 700;
99
+ color: #2c3e50;
100
+ font-size: 0.95rem;
101
  }
102
 
103
+ .stats-sep {
104
+ color: #ccc;
 
 
105
  }
106
 
107
+ .stats-tag {
108
+ display: inline-flex;
109
+ align-items: center;
110
+ gap: 4px;
111
+ }
112
+
113
+ .stats-tag-name {
114
  color: #7f8c8d;
115
+ }
116
+
117
+ .stats-tag-count {
118
+ font-weight: 600;
119
+ color: #2c3e50;
120
  }
121
  </style>
122
  {% endblock %}
123
 
124
  {% block content %}
125
+ <div class="stats-bar" id="stats-bar"></div>
126
 
127
  <div class="card">
128
  <h2>Filter Problems</h2>
 
169
  datasets.forEach(ds => {
170
  const opt = document.createElement('option');
171
  opt.value = ds.slug;
172
+ const countLabel = ds.total_count
173
+ ? `${ds.problem_count} of ${ds.total_count}`
174
+ : `${ds.problem_count}`;
175
+ opt.textContent = `${ds.display_name} (${countLabel})`;
176
  if (ds.slug === currentDataset) opt.selected = true;
177
  select.appendChild(opt);
178
  });
 
229
  sources[p.source] = (sources[p.source] || 0) + 1;
230
  });
231
 
232
+ const sorted = Object.entries(sources)
233
+ .sort((a, b) => b[1] - a[1]);
234
+ const top5 = sorted.slice(0, 5);
235
+ const otherCount = sorted.slice(5).reduce((sum, [, c]) => sum + c, 0);
236
 
237
+ const bar = document.getElementById('stats-bar');
238
+ let html = `<span class="stats-total">Total: ${allProblems.length}</span>`;
239
+ top5.forEach(([name, count]) => {
240
+ html += `<span class="stats-sep">|</span>`;
241
+ html += `<span class="stats-tag"><span class="stats-tag-name">${name}:</span> <span class="stats-tag-count">${count}</span></span>`;
242
+ });
243
+ if (otherCount > 0) {
244
+ html += `<span class="stats-sep">|</span>`;
245
+ html += `<span class="stats-tag"><span class="stats-tag-name">Other:</span> <span class="stats-tag-count">${otherCount}</span></span>`;
246
  }
247
+ bar.innerHTML = html;
248
  }
249
 
250
  function badgeClass(source) {
 
256
 
257
  if (problems.length === 0) {
258
  container.innerHTML = '<div class="card"><p>No problems match your filters.</p></div>';
 
259
  return;
260
  }
261
 
 
 
262
  const grid = document.createElement('div');
263
  grid.className = 'problems-grid';
264
 
templates/problem.html CHANGED
@@ -20,283 +20,11 @@
20
  {% endblock %}
21
 
22
  {% block extra_css %}
23
- <style>
24
- {{ css|safe }}
25
-
26
- .problem-header {
27
- display: flex;
28
- justify-content: space-between;
29
- align-items: center;
30
- margin-bottom: 15px;
31
- }
32
-
33
- .problem-meta {
34
- margin-bottom: 20px;
35
- }
36
-
37
- .meta-item {
38
- display: inline-block;
39
- margin-right: 15px;
40
- margin-bottom: 10px;
41
- }
42
-
43
- .meta-label {
44
- font-weight: 600;
45
- color: #7f8c8d;
46
- margin-right: 5px;
47
- }
48
-
49
- .meta-value {
50
- color: #2c3e50;
51
- }
52
-
53
- .task-selector {
54
- margin: 20px 0;
55
- display: flex;
56
- gap: 10px;
57
- flex-wrap: wrap;
58
- }
59
-
60
- .task-btn {
61
- padding: 10px 20px;
62
- background: #ecf0f1;
63
- border: 2px solid transparent;
64
- border-radius: 4px;
65
- cursor: pointer;
66
- transition: all 0.3s;
67
- font-size: 0.95rem;
68
- }
69
-
70
- .task-btn:hover {
71
- background: #bdc3c7;
72
- }
73
-
74
- .task-btn.active {
75
- background: #3498db;
76
- color: white;
77
- border-color: #2980b9;
78
- }
79
-
80
- .task-details {
81
- margin-top: 20px;
82
- }
83
-
84
- .task-section {
85
- margin-bottom: 25px;
86
- padding: 15px;
87
- background: #f8f9fa;
88
- border-left: 4px solid #3498db;
89
- border-radius: 4px;
90
- }
91
-
92
- .task-section h3 {
93
- margin-bottom: 10px;
94
- color: #2c3e50;
95
- font-size: 1.1rem;
96
- }
97
-
98
- .code-block {
99
- background: #f8f9fa;
100
- padding: 15px;
101
- border-radius: 4px;
102
- overflow-x: auto;
103
- font-family: 'Monaco', 'Menlo', 'Ubuntu Mono', monospace;
104
- font-size: 0.9rem;
105
- border: 1px solid #e1e4e8;
106
- }
107
-
108
- .task-items-list {
109
- list-style: none;
110
- }
111
-
112
- .task-items-list li {
113
- padding: 10px;
114
- margin-bottom: 8px;
115
- background: white;
116
- border-radius: 4px;
117
- border: 1px solid #e1e4e8;
118
- }
119
-
120
- .line-ref {
121
- display: inline-block;
122
- padding: 2px 8px;
123
- background: #3498db;
124
- color: white;
125
- border-radius: 3px;
126
- font-family: monospace;
127
- font-size: 0.85rem;
128
- margin-right: 8px;
129
- }
130
-
131
- .var-name {
132
- display: inline-block;
133
- padding: 2px 8px;
134
- background: #9b59b6;
135
- color: white;
136
- border-radius: 3px;
137
- font-family: monospace;
138
- font-size: 0.85rem;
139
- }
140
-
141
- .io-section {
142
- display: grid;
143
- grid-template-columns: 1fr 1fr;
144
- gap: 15px;
145
- }
146
-
147
- @media (max-width: 768px) {
148
- .io-section {
149
- grid-template-columns: 1fr;
150
- }
151
- }
152
-
153
- .navigation-hint {
154
- margin-top: 20px;
155
- padding: 15px;
156
- background: #e8f4f8;
157
- border-radius: 4px;
158
- color: #2c3e50;
159
- font-size: 0.9rem;
160
- }
161
-
162
- .test-code-section {
163
- margin-top: 20px;
164
- }
165
-
166
- /* Inline task visualization */
167
- .code-with-tasks {
168
- position: relative;
169
- }
170
-
171
- .task-marker {
172
- display: inline-block;
173
- margin-left: 10px;
174
- padding: 2px 8px;
175
- background: #9b59b6;
176
- color: white;
177
- border-radius: 3px;
178
- font-size: 0.75rem;
179
- font-weight: 600;
180
- cursor: crosshair;
181
- }
182
-
183
- /* Coverage coloring on lineno spans.
184
- Pygments emits: td.linenos > div.linenodiv > pre > span.normal
185
- We must match that chain; .source .linenos doesn't work because
186
- the td has class "linenos", not an element named "linenos". */
187
- td.linenos .normal.line-executed {
188
- background-color: #d4edda !important;
189
- color: #155724 !important;
190
- }
191
-
192
- td.linenos .normal.line-not-executed {
193
- background-color: #f8d7da !important;
194
- color: #721c24 !important;
195
- }
196
-
197
- /* Coverage legend */
198
- .coverage-legend {
199
- margin: 10px 0;
200
- padding: 10px 15px;
201
- background: #f8f9fa;
202
- border-left: 4px solid #28a745;
203
- border-radius: 4px;
204
- font-size: 0.85rem;
205
- display: none;
206
- }
207
-
208
- .coverage-legend-item {
209
- display: inline-block;
210
- margin-right: 18px;
211
- }
212
-
213
- .coverage-swatch {
214
- display: inline-block;
215
- width: 12px;
216
- height: 12px;
217
- border-radius: 2px;
218
- margin-right: 4px;
219
- vertical-align: middle;
220
- }
221
-
222
- /* Ground truth answer badge shown next to task items */
223
- .gt-answer {
224
- display: inline-block;
225
- margin-left: 10px;
226
- padding: 2px 8px;
227
- background: #17a2b8;
228
- color: white;
229
- border-radius: 3px;
230
- font-family: monospace;
231
- font-size: 0.82rem;
232
- font-weight: 600;
233
- }
234
-
235
- .gt-answer.loading {
236
- background: #6c757d;
237
- }
238
-
239
- /* SVG arrow overlay positioned over the code container */
240
- #arrow-overlay {
241
- position: absolute;
242
- top: 0;
243
- left: 0;
244
- width: 100%;
245
- height: 100%;
246
- pointer-events: none;
247
- overflow: visible;
248
- z-index: 10;
249
- }
250
-
251
- .exec-arrow {
252
- fill: none;
253
- stroke: #e67e22;
254
- stroke-width: 2.5;
255
- stroke-dasharray: none;
256
- opacity: 0.9;
257
- }
258
-
259
- .exec-arrow-head {
260
- fill: #e67e22;
261
- opacity: 0.9;
262
- }
263
-
264
- /* CRUXEval answer highlight */
265
- .crux-answer {
266
- border-left: 4px solid #17a2b8 !important;
267
- background: #e8f6f8 !important;
268
- }
269
-
270
- /* BigOBench complexity display */
271
- .complexity-badges {
272
- display: flex;
273
- gap: 20px;
274
- flex-wrap: wrap;
275
- }
276
-
277
- .complexity-item {
278
- display: flex;
279
- align-items: center;
280
- gap: 10px;
281
- }
282
-
283
- .complexity-label {
284
- font-weight: 600;
285
- color: #7f8c8d;
286
- font-size: 0.95rem;
287
- }
288
 
289
- .complexity-value {
290
- display: inline-block;
291
- padding: 6px 16px;
292
- background: #2c3e50;
293
- color: #f1c40f;
294
- border-radius: 4px;
295
- font-family: 'Monaco', 'Menlo', 'Ubuntu Mono', monospace;
296
- font-size: 1.1rem;
297
- font-weight: 600;
298
- }
299
- </style>
300
  {% endblock %}
301
 
302
  {% block content %}
@@ -315,701 +43,6 @@ const datasetSlug = {{ dataset_slug|tojson }};
315
  const datasetName = {{ dataset_name|tojson }};
316
  const hasGroundTruth = {{ has_ground_truth|tojson }};
317
  const hasTasks = {{ has_tasks|tojson }};
318
-
319
- function badgeClass(source) {
320
- return 'badge-' + source.toLowerCase().replace(/[^a-z0-9]/g, '');
321
- }
322
-
323
- async function loadProblem() {
324
- try {
325
- const response = await fetch(`/api/${datasetSlug}/problem/${problemIdx}`);
326
- const problem = await response.json();
327
-
328
- if (problem.error) {
329
- document.getElementById('problem-content').innerHTML =
330
- '<div class="card"><p style="color: red;">Error: ' + problem.error + '</p></div>';
331
- return;
332
- }
333
-
334
- renderProblem(problem);
335
- } catch (error) {
336
- document.getElementById('problem-content').innerHTML =
337
- '<div class="card"><p style="color: red;">Error loading problem: ' + error.message + '</p></div>';
338
- }
339
- }
340
-
341
- function renderProblem(problem) {
342
- const container = document.getElementById('problem-content');
343
-
344
- // Main problem info card (shared by all datasets)
345
- let html = `
346
- <div class="card">
347
- <div class="problem-header">
348
- <h2>${escapeHtml(problem.entry_point)}</h2>
349
- <span class="badge ${badgeClass(problem.source)}">${escapeHtml(problem.source)}</span>
350
- </div>
351
- <div class="problem-meta">
352
- <div class="meta-item">
353
- <span class="meta-label">Task ID:</span>
354
- <span class="meta-value">${escapeHtml(problem.task_id)}</span>
355
- </div>
356
- <div class="meta-item">
357
- <span class="meta-label">Index:</span>
358
- <span class="meta-value">${problem.idx}</span>
359
- </div>
360
- <div class="meta-item">
361
- <span class="meta-label">Dataset:</span>
362
- <span class="meta-value">${escapeHtml(datasetName)}</span>
363
- </div>
364
- ${problem.inputs.length > 0 ? `
365
- <div class="meta-item">
366
- <span class="meta-label">Test Inputs:</span>
367
- <span class="meta-value">${problem.inputs.length}</span>
368
- </div>` : ''}
369
- </div>
370
- </div>
371
- `;
372
-
373
- // --- BigOBench view (problem description + per-solution code & complexity) ---
374
- if (problem.solutions && problem.solutions.length > 0) {
375
- // Problem description
376
- if (problem.description) {
377
- html += `
378
- <div class="card">
379
- <h2>Problem Statement</h2>
380
- <pre class="code-block" style="white-space: pre-wrap;">${escapeHtml(problem.description)}</pre>
381
- </div>
382
- `;
383
- }
384
-
385
- // Each solution: code + complexity
386
- problem.solutions.forEach((sol, i) => {
387
- html += `
388
- <div class="card">
389
- <h2>Solution ${i + 1} <span style="font-size:0.8rem;color:#7f8c8d;font-weight:400;">${escapeHtml(sol.solution_id)}</span></h2>
390
- <div class="complexity-badges" style="margin-bottom: 15px;">
391
- `;
392
- if (sol.time_complexity) {
393
- html += `
394
- <div class="complexity-item">
395
- <span class="complexity-label">Time</span>
396
- <span class="complexity-value">${escapeHtml(sol.time_complexity)}</span>
397
- </div>`;
398
- }
399
- if (sol.space_complexity) {
400
- html += `
401
- <div class="complexity-item">
402
- <span class="complexity-label">Space</span>
403
- <span class="complexity-value">${escapeHtml(sol.space_complexity)}</span>
404
- </div>`;
405
- }
406
- html += `
407
- </div>
408
- <div class="code-with-tasks">
409
- ${sol.highlighted_code}
410
- </div>
411
- </div>
412
- `;
413
- });
414
-
415
- // Navigation hint
416
- html += `
417
- <div class="navigation-hint">
418
- <strong>Tip:</strong> Use the Previous/Next buttons at the top to browse through problems,
419
- or return to the list view to filter by dataset source or search by name.
420
- </div>
421
- `;
422
-
423
- container.innerHTML = html;
424
- window.currentProblem = problem;
425
- return;
426
- }
427
-
428
- // Source Code card
429
- html += `
430
- <div class="card">
431
- <h2>Source Code</h2>
432
- <div class="code-with-tasks" id="code-container">
433
- ${problem.highlighted_code}
434
- </div>
435
- </div>
436
- `;
437
-
438
- // --- Non-DREval (simple) view ---
439
- if (!hasTasks) {
440
- // Show inputs/outputs if available
441
- if (problem.inputs && problem.inputs.length > 0) {
442
- html += `<div class="card"><h2>Inputs &amp; Outputs</h2>`;
443
- problem.inputs.forEach((inp, i) => {
444
- const out = (problem.outputs && problem.outputs[i]) || '';
445
- html += `
446
- <div class="io-section" style="margin-bottom: 15px;">
447
- <div class="task-section">
448
- <h3>Input ${i + 1}</h3>
449
- <pre class="code-block">${escapeHtml(inp)}</pre>
450
- </div>
451
- <div class="task-section">
452
- <h3>Output</h3>
453
- <pre class="code-block">${escapeHtml(out)}</pre>
454
- </div>
455
- </div>
456
- `;
457
- });
458
- html += `</div>`;
459
- }
460
-
461
- // Show test suite if available
462
- if (problem.test) {
463
- html += `
464
- <div class="card">
465
- <h2>Test Suite</h2>
466
- <pre class="code-block">${escapeHtml(problem.test)}</pre>
467
- </div>
468
- `;
469
- }
470
-
471
- // Navigation hint
472
- html += `
473
- <div class="navigation-hint">
474
- <strong>Tip:</strong> Use the Previous/Next buttons at the top to browse through problems,
475
- or return to the list view to filter by dataset source or search by name.
476
- </div>
477
- `;
478
-
479
- container.innerHTML = html;
480
- window.currentProblem = problem;
481
- return;
482
- }
483
-
484
- // --- CRUXEval task view (tasks have given/predict fields, no task_items) ---
485
- if (problem.tasks.length > 0 && problem.tasks[0].given !== undefined) {
486
- // Task selector
487
- html += `
488
- <div class="card">
489
- <h2>Tasks</h2>
490
- <div class="task-selector" id="task-selector">
491
- `;
492
- problem.tasks.forEach((task, idx) => {
493
- html += `
494
- <button class="task-btn ${idx === 0 ? 'active' : ''}"
495
- onclick="showCruxTask(${idx})">
496
- ${escapeHtml(task.name)}
497
- </button>
498
- `;
499
- });
500
- html += `
501
- </div>
502
- <div id="task-content"></div>
503
- </div>
504
- `;
505
-
506
- // Navigation hint
507
- html += `
508
- <div class="navigation-hint">
509
- <strong>Tip:</strong> Use the Previous/Next buttons at the top to browse through problems,
510
- or return to the list view to filter by dataset source or search by name.
511
- </div>
512
- `;
513
-
514
- container.innerHTML = html;
515
- window.currentProblem = problem;
516
- showCruxTask(0);
517
- return;
518
- }
519
-
520
- // --- DREval (full) view with tasks, coverage, arrows ---
521
- // Rebuild html cleanly with coverage legend and SVG overlay
522
- html = `
523
- <div class="card">
524
- <div class="problem-header">
525
- <h2>${escapeHtml(problem.entry_point)}</h2>
526
- <span class="badge ${badgeClass(problem.source)}">${escapeHtml(problem.source)}</span>
527
- </div>
528
- <div class="problem-meta">
529
- <div class="meta-item">
530
- <span class="meta-label">Task ID:</span>
531
- <span class="meta-value">${escapeHtml(problem.task_id)}</span>
532
- </div>
533
- <div class="meta-item">
534
- <span class="meta-label">Index:</span>
535
- <span class="meta-value">${problem.idx}</span>
536
- </div>
537
- <div class="meta-item">
538
- <span class="meta-label">Dataset:</span>
539
- <span class="meta-value">${escapeHtml(datasetName)}</span>
540
- </div>
541
- <div class="meta-item">
542
- <span class="meta-label">Test Inputs:</span>
543
- <span class="meta-value">${problem.inputs.length}</span>
544
- </div>
545
- </div>
546
- </div>
547
-
548
- <div class="card">
549
- <h2>Source Code</h2>
550
- <div class="coverage-legend" id="coverage-legend">
551
- <strong>Coverage:</strong>
552
- <span class="coverage-legend-item">
553
- <span class="coverage-swatch" style="background:#d4edda; border:1px solid #28a745;"></span>
554
- Executed
555
- </span>
556
- <span class="coverage-legend-item">
557
- <span class="coverage-swatch" style="background:#f8d7da; border:1px solid #dc3545;"></span>
558
- Not executed
559
- </span>
560
- </div>
561
- <div class="code-with-tasks" id="code-container">
562
- ${problem.highlighted_code}
563
- <svg id="arrow-overlay" xmlns="http://www.w3.org/2000/svg">
564
- <defs>
565
- <marker id="arrowhead" markerWidth="8" markerHeight="6"
566
- refX="8" refY="3" orient="auto">
567
- <polygon points="0 0, 8 3, 0 6" class="exec-arrow-head"/>
568
- </marker>
569
- </defs>
570
- </svg>
571
- </div>
572
- </div>
573
- `;
574
-
575
- // Task selector
576
- html += `
577
- <div class="card">
578
- <h2>Test Cases & Tasks</h2>
579
- <p>Select a test input to view associated reasoning tasks:</p>
580
- <div class="task-selector" id="task-selector">
581
- `;
582
-
583
- problem.tasks.forEach((task, idx) => {
584
- html += `
585
- <button class="task-btn ${idx === 0 ? 'active' : ''}"
586
- onclick="showTask(${idx})">
587
- Input ${task.input_idx + 1}
588
- </button>
589
- `;
590
- });
591
-
592
- html += `
593
- </div>
594
- <div id="task-content"></div>
595
- </div>
596
- `;
597
-
598
- // Navigation hint
599
- html += `
600
- <div class="navigation-hint">
601
- <strong>Tip:</strong> Use the Previous/Next buttons at the top to browse through problems,
602
- or return to the list view to filter by dataset source or search by name.
603
- </div>
604
- `;
605
-
606
- container.innerHTML = html;
607
-
608
- // Store problem data globally
609
- window.currentProblem = problem;
610
-
611
- // Show first task by default
612
- showTask(0);
613
- }
614
-
615
- function injectTaskMarkers(taskItems) {
616
- const codePre = document.querySelector('.source .code pre');
617
-
618
- // Save the pristine original innerHTML once, before any modification.
619
- if (codePre && !window._codePreOriginalHtml) {
620
- window._codePreOriginalHtml = codePre.innerHTML;
621
- }
622
-
623
- // Invalidate span cache (rebuilt lazily on next arrow draw)
624
- window._linenoSpanCache = null;
625
-
626
- // Store current task items so applyCoverage can re-add markers after wrapping.
627
- window._currentTaskItems = taskItems || [];
628
-
629
- // Reset code pre to original, then add markers from scratch.
630
- if (codePre && window._codePreOriginalHtml) {
631
- codePre.innerHTML = window._codePreOriginalHtml;
632
- }
633
-
634
- if (!taskItems || taskItems.length === 0) {
635
- return;
636
- }
637
-
638
- // Group tasks by line number
639
- const tasksByLine = {};
640
- taskItems.forEach(item => {
641
- if (!tasksByLine[item.lineno]) tasksByLine[item.lineno] = [];
642
- tasksByLine[item.lineno].push(item.var);
643
- });
644
-
645
- // Inject task marker badges into the code pre
646
- if (!codePre) return;
647
- const codeLines = codePre.innerHTML.split('\n');
648
- codePre.innerHTML = codeLines.map((line, idx) => {
649
- const lineNum = idx + 1;
650
- if (tasksByLine[lineNum] && line.trim() !== '') {
651
- const vars = tasksByLine[lineNum];
652
- return line + `<span class="task-marker" data-lineno="${lineNum}" data-vars="${escapeHtml(vars.join(', '))}">${escapeHtml(vars.join(', '))}</span>`;
653
- }
654
- return line;
655
- }).join('\n');
656
-
657
- }
658
-
659
- function applyCoverage(coverageSet, totalLines) {
660
- // Remove previous coverage classes from lineno spans.
661
- // Pygments structure: td.linenos > div.linenodiv > pre > span.normal
662
- // These are individual elements — adding/removing classes has no layout impact.
663
- document.querySelectorAll('td.linenos .normal').forEach(el => {
664
- el.classList.remove('line-executed', 'line-not-executed');
665
- });
666
-
667
- if (!coverageSet) {
668
- const legend = document.getElementById('coverage-legend');
669
- if (legend) legend.style.display = 'none';
670
- return;
671
- }
672
-
673
- const legend = document.getElementById('coverage-legend');
674
- if (legend) legend.style.display = 'block';
675
-
676
- // Color lineno spans only. We never touch codePre.innerHTML here so:
677
- // 1. The table layout is never disturbed (no alignment issue).
678
- // 2. Task markers injected by injectTaskMarkers are left untouched.
679
- document.querySelectorAll('td.linenos .normal').forEach(span => {
680
- const lineNum = parseInt(span.textContent.trim());
681
- if (!isNaN(lineNum) && lineNum <= totalLines) {
682
- span.classList.add(coverageSet.has(lineNum) ? 'line-executed' : 'line-not-executed');
683
- }
684
- });
685
- }
686
-
687
- // Global map: lineno -> list of next line numbers (1-indexed; -1 = end of trace)
688
- window._nextLinesMap = {};
689
-
690
- async function loadAndApplyGroundTruth(problemIdx, inputIdx, taskItems) {
691
- // Show "loading" placeholders on all task items
692
- taskItems.forEach(item => {
693
- const el = document.getElementById(`gt-${item.lineno}-${item.var}`);
694
- if (el) { el.textContent = '…'; el.className = 'gt-answer loading'; }
695
- });
696
-
697
- // Clear next-lines data from previous input
698
- window._nextLinesMap = {};
699
-
700
- try {
701
- const resp = await fetch(`/api/${datasetSlug}/problem/${problemIdx}/ground_truth/${inputIdx}`);
702
- const gt = await resp.json();
703
-
704
- if (gt.status !== 'ok') {
705
- taskItems.forEach(item => {
706
- const el = document.getElementById(`gt-${item.lineno}-${item.var}`);
707
- if (el) { el.textContent = gt.status === 'error' ? '(exec error)' : '(unavailable)'; el.className = 'gt-answer'; }
708
- });
709
- applyCoverage(null, 0);
710
- return;
711
- }
712
-
713
- // Apply coverage highlighting
714
- const coverageSet = new Set(gt.coverage);
715
- applyCoverage(coverageSet, gt.total_lines);
716
-
717
- // Fill in variable answers
718
- const answerMap = {};
719
- gt.variable_answers.forEach(a => {
720
- answerMap[`${a.lineno}-${a.var}`] = a.answer_str;
721
- });
722
- taskItems.forEach(item => {
723
- const el = document.getElementById(`gt-${item.lineno}-${item.var}`);
724
- if (el) {
725
- const answer = answerMap[`${item.lineno}-${item.var}`] || '(not available)';
726
- el.textContent = answer;
727
- el.className = 'gt-answer';
728
- }
729
- });
730
-
731
- // Store next-lines data for arrow visualization
732
- if (gt.next_lines_answers) {
733
- gt.next_lines_answers.forEach(a => {
734
- window._nextLinesMap[a.lineno] = a.next_lines;
735
- });
736
- }
737
-
738
- // Attach hover handlers to task-marker spans now that we have next-lines data
739
- attachArrowHoverHandlers();
740
-
741
- } catch (e) {
742
- taskItems.forEach(item => {
743
- const el = document.getElementById(`gt-${item.lineno}-${item.var}`);
744
- if (el) { el.textContent = '(error)'; el.className = 'gt-answer'; }
745
- });
746
- }
747
- }
748
-
749
- // Cache of lineNum → DOM span, rebuilt whenever injectTaskMarkers runs.
750
- window._linenoSpanCache = null;
751
-
752
- function buildLinenoSpanCache(container) {
753
- const cache = {};
754
- container.querySelectorAll('td.linenos .normal').forEach(span => {
755
- const n = parseInt(span.textContent.trim());
756
- if (!isNaN(n)) cache[n] = span;
757
- });
758
- window._linenoSpanCache = cache;
759
- }
760
-
761
- /**
762
- * Get the bounding rect of the lineno span for a given 1-indexed line number,
763
- * relative to the code container element. Uses a cached span map.
764
- */
765
- function getLinenoSpanRect(lineNum, container) {
766
- if (!window._linenoSpanCache) buildLinenoSpanCache(container);
767
- const span = window._linenoSpanCache[lineNum];
768
- if (!span) return null;
769
-   const spanRect = span.getBoundingClientRect();
-   const containerRect = container.getBoundingClientRect();
-   return {
-     top: spanRect.top - containerRect.top + container.scrollTop,
-     bottom: spanRect.bottom - containerRect.top + container.scrollTop,
-     left: spanRect.left - containerRect.left,
-     right: spanRect.right - containerRect.left,
-     width: spanRect.width,
-     height: spanRect.height,
-     midY: (spanRect.top + spanRect.bottom) / 2 - containerRect.top + container.scrollTop,
-   };
- }
-
- /**
-  * Draw arrows from sourceLine to each of the targetLines in the SVG overlay.
-  * Lines are 1-indexed. -1 means "end of execution" (no arrow drawn).
-  */
- function drawArrows(sourceLineNum, targetLineNums) {
-   const container = document.getElementById('code-container');
-   const svg = document.getElementById('arrow-overlay');
-   if (!container || !svg) return;
-
-   // Remove previous arrows (but keep defs)
-   svg.querySelectorAll('.arrow-path').forEach(el => el.remove());
-
-   const srcRect = getLinenoSpanRect(sourceLineNum, container);
-   if (!srcRect) return;
-
-   // Update SVG height to match container
-   svg.setAttribute('height', container.scrollHeight);
-
-   targetLineNums.forEach(targetLineNum => {
-     if (targetLineNum === -1) return; // end of trace — no arrow
-
-     const dstRect = getLinenoSpanRect(targetLineNum, container);
-     if (!dstRect) return;
-
-     // Start point: right edge of source lineno span, vertically centered
-     const x1 = srcRect.right + 2;
-     const y1 = srcRect.midY;
-
-     // End point: right edge of target lineno span, vertically centered
-     const x2 = dstRect.right + 2;
-     const y2 = dstRect.midY;
-
-     // Horizontal offset for the bezier control points — curves to the right
-     const curveOffset = Math.max(30, Math.abs(y2 - y1) * 0.4);
-
-     // Cubic bezier: both control points extend to the right of the lineno column
-     const cx1 = x1 + curveOffset;
-     const cy1 = y1;
-     const cx2 = x2 + curveOffset;
-     const cy2 = y2;
-
-     const path = document.createElementNS('http://www.w3.org/2000/svg', 'path');
-     path.setAttribute('d', `M ${x1} ${y1} C ${cx1} ${cy1}, ${cx2} ${cy2}, ${x2} ${y2}`);
-     path.setAttribute('class', 'exec-arrow arrow-path');
-     path.setAttribute('marker-end', 'url(#arrowhead)');
-     svg.appendChild(path);
-   });
- }
-
- /**
-  * Clear all arrows from the SVG overlay.
-  */
- function clearArrows() {
-   const svg = document.getElementById('arrow-overlay');
-   if (svg) {
-     svg.querySelectorAll('.arrow-path').forEach(el => el.remove());
-   }
- }
-
- // AbortController for the current set of marker hover listeners.
- let _markerListenersAbort = null;
-
- /**
-  * Attach mouseenter/mouseleave handlers to all .task-marker spans so that
-  * hovering shows execution-flow arrows to next lines.
-  */
- function attachArrowHoverHandlers() {
-   // Cancel any previously attached listeners without touching the DOM.
-   if (_markerListenersAbort) _markerListenersAbort.abort();
-   _markerListenersAbort = new AbortController();
-   const { signal } = _markerListenersAbort;
-
-   document.querySelectorAll('.task-marker').forEach(marker => {
-     marker.addEventListener('mouseenter', () => {
-       const lineNum = parseInt(marker.dataset.lineno);
-       if (!lineNum) return;
-       const nextLines = window._nextLinesMap[lineNum];
-       if (nextLines && nextLines.length > 0) {
-         drawArrows(lineNum, nextLines);
-       }
-     }, { signal });
-
-     marker.addEventListener('mouseleave', () => {
-       clearArrows();
-     }, { signal });
-   });
- }
-
- function showCruxTask(taskIdx) {
-   const problem = window.currentProblem;
-   const task = problem.tasks[taskIdx];
-
-   // Update active button
-   document.querySelectorAll('.task-btn').forEach((btn, idx) => {
-     btn.classList.toggle('active', idx === taskIdx);
-   });
-
-   const givenLabel = task.given === 'input' ? 'Input (given)' : 'Output (given)';
-   const predictLabel = task.predict === 'output' ? 'Output (predict)' : 'Input (predict)';
-   const givenValue = task.given === 'input' ? task.input : task.output;
-   const predictValue = task.predict === 'output' ? task.output : task.input;
-
-   const html = `
-     <div class="task-details">
-       <div class="task-section">
-         <p style="margin-bottom: 12px; color: #7f8c8d;">${escapeHtml(task.description)}</p>
-       </div>
-       <div class="io-section">
-         <div class="task-section">
-           <h3>${escapeHtml(givenLabel)}</h3>
-           <pre class="code-block">${escapeHtml(givenValue)}</pre>
-         </div>
-         <div class="task-section">
-           <h3>${escapeHtml(predictLabel)}</h3>
-           <pre class="code-block crux-answer">${escapeHtml(predictValue)}</pre>
-         </div>
-       </div>
-     </div>
-   `;
-
-   document.getElementById('task-content').innerHTML = html;
- }
-
- function showTask(taskIdx) {
-   const problem = window.currentProblem;
-   const task = problem.tasks[taskIdx];
-
-   // Update active button
-   const buttons = document.querySelectorAll('.task-btn');
-   buttons.forEach((btn, idx) => {
-     if (idx === taskIdx) {
-       btn.classList.add('active');
-     } else {
-       btn.classList.remove('active');
-     }
-   });
-
-   // Inject task markers into the code
-   injectTaskMarkers(task.task_items);
-
-   // Clear previous coverage while new one loads
-   applyCoverage(null, 0);
-
-   // Render task content
-   // For HumanEval: Input + Expected Output side by side.
-   // For ClassEval: Input alone (side by side layout), then Test Class below full-width.
-   const ioSection = task.test_class_code
-     ? `<div class="io-section">
-          <div class="task-section">
-            <h3>Input</h3>
-            <pre class="code-block">${escapeHtml(task.input)}</pre>
-          </div>
-        </div>
-        <div class="task-section">
-          <h3>Test Class &mdash; <code>${escapeHtml(task.test_class_name)}</code></h3>
-          <pre class="code-block">${escapeHtml(task.test_class_code)}</pre>
-        </div>`
-     : `<div class="io-section">
-          <div class="task-section">
-            <h3>Input</h3>
-            <pre class="code-block">${escapeHtml(task.input)}</pre>
-          </div>
-          <div class="task-section">
-            <h3>Expected Output</h3>
-            <pre class="code-block">${escapeHtml(task.output)}</pre>
-          </div>
-        </div>`;
-
-   let html = `
-     <div class="task-details">
-       ${ioSection}
-   `;
-
-   // Show task items with ground truth answer placeholders
-   if (task.task_items && task.task_items.length > 0) {
-     html += `
-       <div class="task-section">
-         <h3>Reasoning Tasks</h3>
-         <p style="margin-bottom: 10px; color: #7f8c8d;">
-           Variable state at each execution point (correct answer shown in
-           <span style="background:#17a2b8;color:white;padding:1px 6px;border-radius:3px;font-size:0.82rem;">teal</span>):
-         </p>
-         <ul class="task-items-list">
-     `;
-
-     task.task_items.forEach(item => {
-       html += `
-         <li>
-           <span class="line-ref">Line ${item.lineno}</span>
-           <span class="var-name">${escapeHtml(item.var)}</span>
-           <span class="gt-answer loading" id="gt-${item.lineno}-${item.var}">…</span>
-         </li>
-       `;
-     });
-
-     html += `
-         </ul>
-       </div>
-     `;
-   }
-
-   // Show output prediction task if exists
-   if (task.output_pred) {
-     html += `
-       <div class="task-section">
-         <h3>Output Completion Task</h3>
-         <p style="margin-bottom: 10px; color: #7f8c8d;">
-           The model needs to complete this test assertion:
-         </p>
-         <pre class="code-block">${escapeHtml(task.output_pred)}</pre>
-       </div>
-     `;
-   }
-
-   html += `</div>`;
-
-   document.getElementById('task-content').innerHTML = html;
-
-   // Fetch and apply ground truth (coverage + variable answers)
-   if (hasGroundTruth && task.task_items) {
-     loadAndApplyGroundTruth(problem.idx, task.input_idx, task.task_items);
-   }
- }
-
- function escapeHtml(text) {
-   if (text === null || text === undefined) return '';
-   const div = document.createElement('div');
-   div.textContent = text;
-   return div.innerHTML;
- }
-
- loadProblem();
  </script>
  {% endblock %}
  {% endblock %}

  {% block extra_css %}
+ {{ css|safe }}
+ {% endblock %}

+ {% block extra_head %}
+ <link rel="stylesheet" href="{{ url_for('static', filename='problem.css') }}">
  {% endblock %}

  {% block content %}

  const datasetName = {{ dataset_name|tojson }};
  const hasGroundTruth = {{ has_ground_truth|tojson }};
  const hasTasks = {{ has_tasks|tojson }};

  </script>
+ <script src="{{ url_for('static', filename='problem.js') }}"></script>
  {% endblock %}
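The hover-arrow geometry removed above (now living in static/problem.js) reduces to a cubic Bézier whose two control points bow out to the right of the line-number column. A minimal, DOM-free sketch of that path computation, runnable in Node; the function name `bezierArrowPath` is illustrative, not a name used in the repo:

```javascript
// Build the SVG path string for an execution-flow arrow between two
// line-number spans, mirroring the control-point math in drawArrows().
// (x1, y1) and (x2, y2) are the anchor points at the right edge of the
// source and target lineno spans.
function bezierArrowPath(x1, y1, x2, y2) {
  // The bulge grows with vertical distance between the two lines,
  // with a 30px floor so short hops still curve visibly.
  const curveOffset = Math.max(30, Math.abs(y2 - y1) * 0.4);
  const cx1 = x1 + curveOffset; // control point 1: level with the source
  const cx2 = x2 + curveOffset; // control point 2: level with the target
  return `M ${x1} ${y1} C ${cx1} ${y1}, ${cx2} ${y2}, ${x2} ${y2}`;
}

// Vertical distance 90 → offset 36 (0.4 * 90 beats the 30px floor)
console.log(bezierArrowPath(40, 10, 40, 100));
```

Because both control points share the source and target y-coordinates respectively, the curve leaves and re-enters the gutter horizontally, which is what makes the arrows read as "jump right, come back" in the rendered view.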