egor-bogomolov committed
Commit a096bac · 1 Parent(s): b9b3992

Rebrand to ML4SE Benchmark Viewer, make datasets a core dependency

Files changed (9):
  1. .gitignore +71 -0
  2. CLAUDE.md +351 -0
  3. README.md +47 -7
  4. app.py +359 -0
  5. dataset_adapters.py +413 -0
  6. pyproject.toml +38 -0
  7. templates/base.html +211 -0
  8. templates/index.html +330 -0
  9. templates/problem.html +1015 -0
.gitignore ADDED
@@ -0,0 +1,71 @@
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# Virtual environments
venv/
ENV/
env/
.venv

# IDE
.vscode/
.idea/
*.swp
*.swo
*~
.DS_Store

# Flask
instance/
.webassets-cache
*.log

# Environment variables
.env
.env.local

# uv
uv.lock

# AI Assistant context files
CLAUDE.md

# Testing
.pytest_cache/
.coverage
htmlcov/
.tox/
.hypothesis/

# Jupyter Notebook
.ipynb_checkpoints

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Ruff
.ruff_cache/
CLAUDE.md ADDED
@@ -0,0 +1,351 @@
# CLAUDE.md - Project Context for AI Assistants

## Project Overview

**ml4se-evals-visualization** is a web-based visualization tool for browsing and analyzing benchmark datasets for machine learning code evaluation. The primary dataset is DREval (Dynamic Reasoning Evaluation), but the tool supports multiple datasets through an adapter pattern.

**Tech Stack:**
- **Backend**: Flask (Python web framework)
- **Frontend**: Jinja2 templates with vanilla JavaScript
- **Syntax Highlighting**: Pygments
- **Package Management**: uv (modern Python package manager)
- **Linting/Formatting**: ruff

## Architecture

### Core Components

1. **app.py** - Main Flask application
   - API endpoints for dataset listing, problem browsing, and code highlighting
   - Ground truth execution data endpoints (DREval only)
   - Template rendering for index and problem detail pages
   - Port: 7860 (default), configurable via PORT env var
   - Debug mode: controlled by FLASK_DEBUG env var

2. **dataset_adapters.py** - Dataset adapter system
   - `DatasetAdapter` base class with a common interface
   - Concrete adapters for: DREval, CRUXEval, HumanEval+, BigOBench
   - Registry pattern (`REGISTRY` dict) for dataset management
   - Each adapter normalizes dataset-specific formats to the common API

3. **templates/** - Jinja2 HTML templates
   - `base.html` - Base layout (IMPORTANT: `{% block scripts %}` is already wrapped in a `<script>` tag)
   - `index.html` - Problem list view with filtering
   - `problem.html` - Problem detail view with syntax highlighting

4. **requirements.txt** / **pyproject.toml** - Dependencies
   - Core: flask, pygments
   - Optional HF: datasets (for CRUXEval, HumanEval+, BigOBench)
   - Dev: ruff

### Data Flow

```
User Request → Flask Route → Dataset Adapter → API Response → Template/JSON

                   Helper Functions
                 (highlight_code, etc.)
```

### Key Design Patterns

- **Adapter Pattern**: Each dataset type has an adapter implementing the `DatasetAdapter` interface
- **Dependency Injection**: Helper functions are injected into the adapters module via `_set_helpers()`
- **URL Routing**: Supports both default (DREval) and explicit dataset slug routes
  - `/problem/123` → DREval problem 123
  - `/cruxeval/problem/45` → CRUXEval problem 45

## Important Files & Locations

### Python Files
- **app.py**: Main entry point, Flask routes, ground truth logic
- **dataset_adapters.py**: Adapter implementations for all datasets
- **ground_truth_loader.py**: (parent dir) Loads execution traces for DREval
- **dynamics.py**: (parent dir) Contains the `Nil` singleton for missing values

### Data Files
- **data/DREval_data.jsonl**: Main DREval problem data (328 problems)
- **data/DREval_tasks.jsonl**: Task definitions for each problem
- **data/ground_truth/**: Execution traces for ground truth visualization

### Template Files
- **templates/base.html**: Layout with navigation, CSS includes
- **templates/index.html**: Problem list with dataset selector, search, filtering
- **templates/problem.html**: Problem detail with code highlighting, test inputs, ground truth overlay

## Key Functionalities

### 1. Dataset Support

**DREval** (primary dataset):
- 328 problems (164 HumanEval + 164 ClassEval)
- Ground truth execution traces available
- Tasks: Coverage, Path, State, Output predictions
- Test inputs with expected outputs

**CRUXEval** (HuggingFace):
- Input/output prediction tasks
- Single-function execution reasoning

**HumanEval+** (HuggingFace):
- Extended HumanEval with additional tests
- No execution traces

**BigOBench** (HuggingFace):
- Algorithm complexity analysis
- Multiple solutions per problem with time/space complexity labels

### 2. Problem Browsing

- **List View** (`/`): Grid of problem cards with filtering
  - Dataset selector (dropdown)
  - Search by function name or task ID
  - Source filter (HumanEval/ClassEval for DREval)
  - Problem cards show: task_id, entry_point, source badge, input count

- **Detail View** (`/problem/<idx>`): Full problem display
  - Syntax-highlighted code (Pygments)
  - Test input selector (buttons at top)
  - Task item visualization (yellow line highlights, purple variable badges)
  - Previous/Next navigation
  - Ground truth overlay (DREval only, when available)

### 3. Ground Truth Visualization (DREval only)

When ground truth data is available:
- **Coverage overlay**: Executed lines highlighted in green
- **Variable values**: Hovering over purple badges shows actual values
- **Next line arrows**: Hovering over line numbers shows execution flow
- Fetched via `/api/<dataset>/problem/<idx>/ground_truth/<input_idx>`

### 4. Code Highlighting

- **Pygments-based** syntax highlighting with table line numbers
- **Line offset handling**: Leading newlines stripped for display
- **Dynamic highlighting**: Updates when switching test inputs
- **Task line markers**: Yellow highlights for lines with queries
- **Variable badges**: Purple badges showing which variables are queried

## API Endpoints

### Dataset Operations
- `GET /api/datasets` - List all available datasets
- `GET /api/<dataset_slug>/problems` - Get the problem list for a dataset
- `GET /api/<dataset_slug>/problem/<idx>` - Get problem detail
- `GET /api/<dataset_slug>/problem/<idx>/ground_truth/<input_idx>` - Get execution data (DREval only)

### Utility
- `GET /api/css` - Pygments CSS for syntax highlighting
- `GET /api/highlight_code?code=...&lines=...` - Highlight arbitrary code

## Common Development Tasks

### Running the Application

```bash
# Development mode (with debug)
FLASK_DEBUG=true uv run python app.py

# Production mode
uv run python app.py

# Custom port
PORT=8080 uv run python app.py
```

### Installing Dependencies

```bash
# Core dependencies only
uv sync

# With HuggingFace datasets support
uv sync --extra hf

# Dev dependencies (includes ruff)
uv sync --extra dev
```

### Code Quality

```bash
# Format code
uv run ruff format .

# Lint code
uv run ruff check .

# Auto-fix issues
uv run ruff check --fix .
```

### Adding a New Dataset

1. Create an adapter class in `dataset_adapters.py` inheriting from `DatasetAdapter`
2. Implement the required methods: `problem_count()`, `get_problem_summary()`, `get_problem_detail()`
3. Set the class attributes: `slug`, `display_name`, `has_ground_truth`, `has_tasks`
4. Register the adapter in `REGISTRY` (usually in `register_hf_datasets()` or similar)
5. Test the endpoints: `/api/<slug>/problems` and `/api/<slug>/problem/<idx>`

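The steps above can be sketched as a minimal adapter. This is an illustrative sketch, not the project's code: `DatasetAdapter` and `REGISTRY` are stubbed inline so it is self-contained, and `ToyAdapter` with its in-memory records is a hypothetical example.

```python
from typing import Any

# Stand-ins for the names exported by dataset_adapters.py.
REGISTRY: dict[str, "DatasetAdapter"] = {}


class DatasetAdapter:
    slug = ""
    display_name = ""
    has_ground_truth = False
    has_tasks = False


class ToyAdapter(DatasetAdapter):
    """Hypothetical adapter over an in-memory list of problem records."""

    slug = "toy"                 # used in URLs: /api/toy/problems
    display_name = "Toy Dataset"

    def __init__(self, records: list[dict[str, Any]]):
        self._records = records

    def problem_count(self) -> int:
        return len(self._records)

    def get_problem_summary(self, idx: int) -> dict[str, Any]:
        # Just enough for a problem card in the list view.
        return {"idx": idx, "task_id": self._records[idx]["task_id"]}

    def get_problem_detail(self, idx: int) -> dict[str, Any]:
        return dict(self._records[idx])


# Step 4: register the adapter under its slug.
adapter = ToyAdapter([{"task_id": "Toy/0", "code": "print('hi')"}])
REGISTRY[adapter.slug] = adapter
```

Once registered, the generic routes in app.py serve the new dataset with no further changes.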
## Important Implementation Details

### Line Number Indexing

**Critical**: Multiple indexing conventions are used:

1. **Original code** (in data files): 1-indexed, includes leading newlines
2. **Stripped code** (displayed): 1-indexed, leading newlines removed
3. **Coverage data**: 0-indexed relative to the original code
4. **Pygments output**: 1-indexed, starts from the linenostart parameter

**Conversion formula** for task line numbers:
```python
stripped_lineno = original_lineno - offset
# where offset = number of leading newlines
```
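A quick worked example of the conversion. The helper below mirrors `_code_offset()` from app.py; the sample code string is invented for illustration.

```python
def code_offset(code: str) -> int:
    """Count leading newlines, mirroring _code_offset() in app.py."""
    offset = 0
    for ch in code:
        if ch != "\n":
            break
        offset += 1
    return offset


code = "\n\ndef f(x):\n    return x + 1\n"
offset = code_offset(code)            # 2 leading newlines are stripped
original_lineno = 3                   # "def f(x):" in the original code
stripped_lineno = original_lineno - offset
assert stripped_lineno == 1           # first line of the stripped display
```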

### Template JavaScript Rules

**IMPORTANT**: The `{% block scripts %}` in `base.html` is already wrapped in a `<script>` tag, so child templates must NOT add their own `<script>` tags inside it.

❌ **Wrong** (creates nested script tags):
```html
{% block scripts %}
<script>
// Your code
</script>
{% endblock %}
```

✅ **Correct**:
```html
{% block scripts %}
// Your code directly here
{% endblock %}
```

### ClassEval Test Parsing

ClassEval problems have test classes defined in the `test` field. The tool:
1. Parses the test code with `ast.parse()` (safe, no execution)
2. Extracts classes matching the pattern `{entry_point}Test*`
3. Associates each test class with an input_idx
4. Displays the test class code alongside the problem code

### Ground Truth Data Format

Ground truth execution records contain:
- **coverage**: List of 0-indexed line numbers executed
- **variable_values**: Dict mapping (lineno, var) → list of values
- **next_lines**: Dict mapping lineno → list of next line numbers
- **status**: "ok" | "error" indicating execution success
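For illustration, a record in this shape might look as follows. The field names come from the list above; the concrete values are invented, and the coverage conversion repeats the formula from the indexing section.

```python
# Hypothetical ground-truth record (values invented for illustration).
record = {
    "coverage": [0, 1, 3],                      # 0-indexed executed lines
    "variable_values": {(2, "total"): [0, 5]},  # (lineno, var) -> values over time
    "next_lines": {2: [3]},                     # lineno -> possible next lines
    "status": "ok",
}

# Convert coverage to the 1-indexed numbering used by the viewer overlay.
offset = 0  # number of leading newlines stripped from the displayed code
coverage_1indexed = [ln + 1 - offset for ln in record["coverage"]]
assert coverage_1indexed == [1, 2, 4]
```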

## Troubleshooting

### "No such file or directory" error
- Ensure you're in the project root directory
- Check that `data/DREval_data.jsonl` exists
- Verify relative path assumptions in `app.py` (DATA_DIR calculation)

### "Ground truth unavailable"
- Check that `ground_truth_loader.py` exists in the parent directory
- Verify `data/ground_truth/` directory contains execution traces
- Ensure GT_LOADER is not None (check startup logs)

### HuggingFace datasets not loading
- Install optional dependencies: `uv sync --extra hf`
- Check network connectivity (datasets download from the HF hub)
- Review startup logs for specific error messages

### JavaScript not executing
- Check browser console for errors (F12)
- Verify `{% block scripts %}` is not creating nested `<script>` tags
- Hard refresh (Ctrl+F5 or Cmd+Shift+R) to clear cache
- Ensure API endpoints return valid JSON

### Line numbers misaligned
- Check offset calculation in `_code_offset()`
- Verify line number adjustments in adapter `get_problem_detail()`
- Ensure Pygments linenostart matches expected starting line

## File Paths and Navigation

All file references should use the fleet-file:// scheme with absolute paths:

```markdown
[filename](fleet-file://4idaunrhu7c1r7voo46l/Users/Egor.Bogomolov_1/work/ml4se-evals-visualization/app.py?type=file&root=%252F)
```

For line references (always 0-indexed):
```markdown
[filename:12-16](fleet-file://4idaunrhu7c1r7voo46l/Users/Egor.Bogomolov_1/work/ml4se-evals-visualization/app.py?type=file&linesData=%7B%22range%22%3A%7B%22first%22%3A527%2C%22second%22%3A732%7D%2C%22lines%22%3A%7B%22first%22%3A12%2C%22second%22%3A16%7D%7D&root=%252F)
```

## Git Information

- **Current branch**: main
- **Main branch**: main (use for PRs)
- **Recent commit**: b9b3992 "initial commit"
- **Untracked files**: Dockerfile, __pycache__/, app.py, dataset_adapters.py, pyproject.toml, requirements.txt, templates/, uv.lock
- **Modified files**: README.md

## Environment

- **Platform**: macOS (darwin 24.6.0)
- **Python**: 3.8+ required
- **Working directory**: /Users/Egor.Bogomolov_1/work/ml4se-evals-visualization
- **Is git repo**: Yes

## Testing Checklist

When making changes, verify:

- [ ] Flask server starts without errors
- [ ] All datasets appear in the dropdown (if HF dependencies are installed)
- [ ] Problem list loads and displays correctly
- [ ] Search/filter functionality works
- [ ] Problem detail page renders with syntax highlighting
- [ ] Test input selector updates line highlights
- [ ] Ground truth overlay displays (DREval only)
- [ ] Previous/Next navigation works
- [ ] API endpoints return valid JSON
- [ ] No console errors in browser (F12)
- [ ] Code passes ruff checks

## External Dependencies

### Python Packages
- **flask**: Web framework (>=3.0.0)
- **pygments**: Syntax highlighting (>=2.17.2)
- **datasets**: HuggingFace datasets (>=2.14.0, optional)
- **ruff**: Linting and formatting (>=0.8.0, dev)

### Data Sources
- **DREval**: Local JSONL files in the data/ directory
- **CRUXEval**: cruxeval-org/cruxeval (HuggingFace Hub)
- **HumanEval+**: evalplus/humanevalplus (HuggingFace Hub)
- **BigOBench**: facebook/BigOBench (HuggingFace Hub)

## Future Enhancements (Not Implemented)

Potential areas for improvement:
- User authentication and saved preferences
- Export functionality (PDF, CSV)
- Comparison view for multiple solutions
- Interactive debugging/stepping through execution
- Code editing and re-evaluation
- Dataset upload functionality
- Performance metrics visualization

## Related Documentation

- **README.md**: User-facing documentation, installation instructions
- **pyproject.toml**: Package metadata, dependencies, ruff configuration
- **Dockerfile**: Container deployment configuration (if present)
- **requirements.txt**: Pip-format dependency list

---

**Last Updated**: 2026-03-02
**Project Status**: Active Development
**Primary Maintainer**: Egor Bogomolov
README.md CHANGED
@@ -1,12 +1,52 @@
  ---
- title: Ml4se Evals Visualization
- emoji: 🐠
- colorFrom: red
- colorTo: indigo
+ title: ML4SE Benchmark Viewer
+ emoji: 📊
+ colorFrom: blue
+ colorTo: green
  sdk: docker
  pinned: false
- license: apache-2.0
- short_description: Space for inspecting popular ML4SE datasets
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # ML4SE Benchmark Viewer
+
+ A web-based interface for browsing and manually inspecting individual datapoints from popular ML4SE (Machine Learning for Software Engineering) benchmark datasets.
+
+ ## Supported Datasets
+
+ | Dataset | Description |
+ |---------|-------------|
+ | **CRUXEval** | Input/output prediction tasks for single-function execution reasoning |
+ | **HumanEval+** | Extended HumanEval with additional tests |
+ | **BigOBench** | Algorithm complexity analysis with time/space complexity labels |
+
+ **Coming soon:** DREval (Dynamic Reasoning Evaluation)
+
+ ## Installation & Usage
+
+ ```bash
+ # Install dependencies
+ uv sync
+
+ # Run the server (default port: 7860)
+ uv run python app.py
+
+ # Development mode with auto-reload
+ FLASK_DEBUG=true uv run python app.py
+ ```
+
+ Then open http://localhost:7860.
+
+ ## Development
+
+ ```bash
+ # Lint and format
+ uv run ruff check .
+ uv run ruff format .
+ ```
+
+ ### Adding a New Dataset
+
+ 1. Create an adapter class in `dataset_adapters.py` inheriting from `DatasetAdapter`
+ 2. Implement required methods: `problem_count()`, `get_problem_summary()`, `get_problem_detail()`
+ 3. Register the adapter in the `REGISTRY`
+ 4. Test: `/api/<slug>/problems` and `/api/<slug>/problem/<idx>`
app.py ADDED
@@ -0,0 +1,359 @@
"""
ML4SE Benchmark Viewer

A web-based interface for browsing and inspecting individual datapoints
from popular ML4SE benchmark datasets (DREval, CRUXEval, HumanEval+,
BigOBench, and others).
"""

import ast as _ast
import json
import os
import sys
from pathlib import Path

from flask import Flask, jsonify, render_template, request
from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import PythonLexer

app = Flask(__name__)

# Path to data files
DATA_DIR = Path(__file__).parent.parent / "data"
DATA_FILE = DATA_DIR / "DREval_data.jsonl"
TASKS_FILE = DATA_DIR / "DREval_tasks.jsonl"

# Add parent dir to path so we can import ground_truth_loader
sys.path.insert(0, str(Path(__file__).parent.parent))
try:
    from ground_truth_loader import GroundTruthLoader
    GT_LOADER = GroundTruthLoader(data_dir=DATA_DIR / "ground_truth")
except Exception as e:
    print(f"Warning: ground truth loader unavailable: {e}")
    GT_LOADER = None


# Load data at startup
def load_data():
    """Load the benchmark data and tasks."""
    if not DATA_FILE.exists() or not TASKS_FILE.exists():
        raise FileNotFoundError(
            f"Data files not found. Expected files:\n"
            f"  - {DATA_FILE}\n"
            f"  - {TASKS_FILE}\n"
            f"Please ensure you're running from the visualization directory "
            f"and the data files exist in ../data/"
        )

    data_records = []
    with open(DATA_FILE) as f:
        for line in f:
            data_records.append(json.loads(line))

    task_records = []
    with open(TASKS_FILE) as f:
        for line in f:
            task_records.append(json.loads(line))

    return data_records, task_records


try:
    DATA, TASKS = load_data()
except FileNotFoundError as e:
    print(f"Warning: DREval data files not found, DREval dataset will be unavailable: {e}")
    DATA, TASKS = [], []


def _extract_test_classes(test_code: str, cls_name: str) -> list:
    """
    Parse a ClassEval unittest module and return one dict per test class
    in definition order: {"name": ..., "code": ...}.

    Matches top-level classes whose names start with f"{cls_name}Test",
    which is the same pattern used by ClassFactory.create_test_classes().
    Uses ast.parse only — no code execution, safe to call from the web server.
    """
    try:
        tree = _ast.parse(test_code)
    except SyntaxError as e:
        print(f"Warning: SyntaxError parsing test code for {cls_name}: {e}")
        return []
    lines = test_code.splitlines(keepends=True)
    prefix = f"{cls_name}Test"
    result = []
    for node in tree.body:  # top-level definitions, preserves source order
        if isinstance(node, _ast.ClassDef) and node.name.startswith(prefix):
            start = node.lineno - 1  # ast lineno is 1-indexed
            end = node.end_lineno  # end_lineno is inclusive; slice is exclusive
            result.append({
                "name": node.name,
                "code": "".join(lines[start:end]),
            })
    return result


def _code_offset(code: str) -> int:
    """Number of leading newlines that Pygments will strip."""
    offset = 0
    for ch in code:
        if ch == '\n':
            offset += 1
        else:
            break
    return offset


def highlight_code(code, highlight_lines=None):
    """
    Syntax highlight Python code with optional line highlighting.

    Args:
        code: The Python code to highlight
        highlight_lines: List of line numbers (1-indexed) to highlight

    Returns:
        HTML string with syntax highlighted code
    """
    formatter = HtmlFormatter(
        linenos="table", cssclass="source", hl_lines=highlight_lines or [], linenostart=1
    )
    return highlight(code, PythonLexer(), formatter)


def get_css():
    """Get CSS for syntax highlighting."""
    return HtmlFormatter().get_style_defs(".source")


# ---------------------------------------------------------------------------
# Dataset adapter registration
# ---------------------------------------------------------------------------

from dataset_adapters import REGISTRY, _set_helpers, register_dreval, register_hf_datasets

# Inject helper functions into the adapters module (avoids circular imports)
_set_helpers(highlight_code, _code_offset, _extract_test_classes)

# Register DREval only if data is available
if DATA:
    register_dreval(DATA, TASKS, GT_LOADER)

# Optionally register HuggingFace datasets
register_hf_datasets()


def _get_adapter(dataset_slug: str):
    """Return the adapter for the given slug, or None."""
    return REGISTRY.get(dataset_slug)


# ---------------------------------------------------------------------------
# Routes
# ---------------------------------------------------------------------------

@app.route("/")
def index():
    """Main page showing list of all benchmark problems."""
    return render_template("index.html", total_problems=len(DATA))


@app.route("/api/datasets")
def get_datasets():
    """Return list of available datasets for the UI dataset selector."""
    return jsonify([
        {
            "slug": slug,
            "display_name": adapter.display_name,
            "problem_count": adapter.problem_count(),
            "has_ground_truth": adapter.has_ground_truth,
        }
        for slug, adapter in REGISTRY.items()
    ])


@app.route("/api/problems")
@app.route("/api/<dataset_slug>/problems")
def get_problems(dataset_slug="dreval"):
    """API endpoint to get list of all problems for a dataset."""
    adapter = _get_adapter(dataset_slug)
    if adapter is None:
        return jsonify({"error": f"Unknown dataset: {dataset_slug}"}), 404

    problems = [adapter.get_problem_summary(i) for i in range(adapter.problem_count())]
    return jsonify(problems)


@app.route("/api/problem/<int:idx>")
@app.route("/api/<dataset_slug>/problem/<int:idx>")
def get_problem(idx, dataset_slug="dreval"):
    """API endpoint to get detailed information about a specific problem."""
    adapter = _get_adapter(dataset_slug)
    if adapter is None:
        return jsonify({"error": f"Unknown dataset: {dataset_slug}"}), 404

    if not (0 <= idx < adapter.problem_count()):
        return jsonify({"error": "Invalid problem index"}), 404

    try:
        return jsonify(adapter.get_problem_detail(idx))
    except (KeyError, IndexError, ValueError) as exc:
        return jsonify({"error": f"Internal error: {exc}"}), 500


@app.route("/api/highlight_code")
def highlight_code_api():
    """API endpoint to highlight code with specific lines."""
    code = request.args.get("code", "")
    lines_str = request.args.get("lines", "")

    if lines_str:
        try:
            lines = [int(x) for x in lines_str.split(",") if x.strip()]
        except ValueError:
            return jsonify({"error": "Invalid line numbers"}), 400
    else:
        lines = None

    highlighted = highlight_code(code, lines)
    return jsonify({"highlighted_code": highlighted})


@app.route("/problem/<int:idx>")
@app.route("/<dataset_slug>/problem/<int:idx>")
def problem_detail(idx, dataset_slug="dreval"):
    """Page showing detailed view of a specific problem."""
    adapter = _get_adapter(dataset_slug)
    if adapter is None:
        return jsonify({"error": "Unknown dataset"}), 404

    if not (0 <= idx < adapter.problem_count()):
        return jsonify({"error": "Problem not found"}), 404

    return render_template(
        "problem.html",
        idx=idx,
        css=get_css(),
        total_problems=adapter.problem_count(),
        dataset_slug=dataset_slug,
        dataset_name=adapter.display_name,
        has_ground_truth=adapter.has_ground_truth,
        has_tasks=adapter.has_tasks,
    )


@app.route("/api/css")
def get_css_api():
    """API endpoint to get CSS for syntax highlighting."""
    return get_css(), 200, {"Content-Type": "text/css"}


@app.route("/api/problem/<int:idx>/ground_truth/<int:input_idx>")
@app.route("/api/<dataset_slug>/problem/<int:idx>/ground_truth/<int:input_idx>")
def get_ground_truth(idx, input_idx, dataset_slug="dreval"):
    """
    Return ground truth execution data for one (problem, input) pair.

    Ground truth is only available for DREval.

    Response fields:
    - coverage: sorted list of 1-indexed line numbers that were executed
    - variable_answers: [{lineno, var, values, answer_str}] matching task items
    - status: "ok" | "error" | "unavailable"
    """
    if dataset_slug != "dreval":
        return jsonify({"status": "unavailable", "message": "Ground truth only available for DREval"}), 503

    if GT_LOADER is None:
        return jsonify({"status": "unavailable"}), 503

    if not (0 <= idx < len(DATA)):
        return jsonify({"error": "Invalid problem index"}), 404

    problem = DATA[idx]
    task_id = problem["task_id"]

    try:
        exec_rec = GT_LOADER.get_execution(task_id, input_idx)
    except KeyError as e:
        return jsonify({"status": "error", "message": str(e)}), 404

    if exec_rec.get("status") == "error":
        return jsonify({"status": "error", "message": "Execution failed for this input"}), 200

    # Coverage: convert 0-indexed (relative to original code) → 1-indexed
    # (relative to stripped code shown by Pygments).
    # original_0indexed + 1 = original_1indexed;
    # stripped_1indexed = original_1indexed - offset
    code = problem["code"]
    offset = _code_offset(code)
    coverage_1indexed = [ln + 1 - offset for ln in exec_rec["coverage"]]

    # Compute total line count from stripped code
    total_lines = len(code[offset:].splitlines())

    # Determine which task items belong to this input_idx
    task = TASKS[idx]
    task_items = []
    for t in task["tasks"]:
        if t["input_idx"] == input_idx:
            task_items = t.get("task", [])
            break

    # Resolve variable answers for each task item
    variable_answers = []
    for item in task_items:
        lineno = item["lineno"]  # 1-indexed relative to original code
        var = item["var"]
        try:
            # Loader expects original 1-indexed lineno (before stripping)
            values = GT_LOADER.get_variable_values(task_id, input_idx, lineno, var)
        except (KeyError, ValueError):
            values = None

        from dynamics import Nil
        if values is Nil or values is None:
            answer_str = "(not available)"
        elif len(values) == 1:
            answer_str = repr(values[0])
        else:
            answer_str = "[" + ", ".join(repr(v) for v in values) + "]"

        variable_answers.append({
            "lineno": lineno - offset,  # adjusted for stripped code display
            "var": var,
            "answer_str": answer_str,
        })

    # Resolve next lines for each task item (for arrow visualization on hover)
    next_lines_answers = []
    processed_linenos = set()
    for item in task_items:
        lineno = item["lineno"]  # already 1-indexed
        if lineno in processed_linenos:
            continue
        processed_linenos.add(lineno)
        try:
            next_lines = GT_LOADER.get_next_lines(task_id, input_idx, lineno)
        except (KeyError, ValueError) as exc:
            print(f"Warning: get_next_lines({task_id}, {input_idx}, {lineno}) failed: {exc}")
            next_lines = [-1]
        next_lines_answers.append({
            "lineno": lineno,
            "next_lines": next_lines,
        })

    return jsonify({
        "status": "ok",
        "coverage": coverage_1indexed,
        "total_lines": total_lines,
        "variable_answers": variable_answers,
        "next_lines_answers": next_lines_answers,
    })


if __name__ == "__main__":
    debug_mode = os.getenv("FLASK_DEBUG", "false").lower() == "true"
    port = int(os.getenv("PORT", 7860))
    app.run(debug=debug_mode, host="0.0.0.0", port=port)
dataset_adapters.py ADDED
@@ -0,0 +1,413 @@
"""
Dataset adapters for the ML4SE Benchmark Viewer.

Each adapter normalises a different benchmark dataset into a common API shape
so the Flask routes and templates can handle them uniformly.

The REGISTRY dict maps slug strings (used in URLs) to adapter instances.
"""

from __future__ import annotations

from pathlib import Path
from typing import Any

# These are imported from app.py at registration time to avoid circular imports.
_highlight_code = None
_code_offset = None
_extract_test_classes = None


def _set_helpers(highlight_code_fn, code_offset_fn, extract_test_classes_fn):
    """Called once by app.py to inject helper functions."""
    global _highlight_code, _code_offset, _extract_test_classes
    _highlight_code = highlight_code_fn
    _code_offset = code_offset_fn
    _extract_test_classes = extract_test_classes_fn


# ---------------------------------------------------------------------------
# Registry
# ---------------------------------------------------------------------------

REGISTRY: dict[str, "DatasetAdapter"] = {}


# ---------------------------------------------------------------------------
# Base class
# ---------------------------------------------------------------------------

class DatasetAdapter:
    slug: str = ""
    display_name: str = ""
    has_ground_truth: bool = False
    has_tasks: bool = False

    def problem_count(self) -> int:
        raise NotImplementedError

    def get_problem_summary(self, idx: int) -> dict[str, Any]:
        raise NotImplementedError

    def get_problem_detail(self, idx: int) -> dict[str, Any]:
        raise NotImplementedError


# ---------------------------------------------------------------------------
# DREval adapter (wraps existing DATA / TASKS / GT_LOADER globals)
# ---------------------------------------------------------------------------

class DREvalAdapter(DatasetAdapter):
60
+ class DREvalAdapter(DatasetAdapter):
61
+ slug = "dreval"
62
+ display_name = "DREval"
63
+ has_ground_truth = True
64
+ has_tasks = True
65
+
66
+ def __init__(self, data: list, tasks: list, gt_loader):
67
+ self._data = data
68
+ self._tasks = tasks
69
+ self._gt_loader = gt_loader
70
+
71
+ def problem_count(self) -> int:
72
+ return len(self._data)
73
+
74
+ def get_problem_summary(self, idx: int) -> dict[str, Any]:
75
+ record = self._data[idx]
76
+ return {
77
+ "idx": idx,
78
+ "task_id": record["task_id"],
79
+ "entry_point": record["entry_point"],
80
+ "num_inputs": len(record["inputs"]),
81
+ "source": "ClassEval" if record.get("test") is not None else "HumanEval",
82
+ }
83
+
84
+ def get_problem_detail(self, idx: int) -> dict[str, Any]:
85
+ problem = self._data[idx]
86
+ task = self._tasks[idx]
87
+
88
+ code = problem["code"]
89
+ offset = _code_offset(code)
90
+ code = code[offset:]
91
+ highlighted_code = _highlight_code(code)
92
+
93
+ tasks_info = []
94
+ for task_item in task["tasks"]:
95
+ adjusted_items = []
96
+ for item in task_item.get("task", []):
97
+ adj = dict(item)
98
+ if "lineno" in adj:
99
+ adj["lineno"] -= offset
100
+ adjusted_items.append(adj)
101
+
102
+ input_idx = task_item["input_idx"]
103
+ inp = problem["inputs"][input_idx] if input_idx < len(problem["inputs"]) else ""
104
+ out = problem["outputs"][input_idx] if input_idx < len(problem["outputs"]) else ""
105
+
106
+ task_info = {
107
+ "input_idx": input_idx,
108
+ "input": inp,
109
+ "output": out,
110
+ "task_items": adjusted_items,
111
+ }
112
+
113
+ if "output_pred" in task_item:
114
+ task_info["output_pred"] = task_item["output_pred"]
115
+
116
+ task_lines = set()
117
+ for item in adjusted_items:
118
+ if "lineno" in item:
119
+ task_lines.add(item["lineno"])
120
+ task_info["task_lines"] = sorted(list(task_lines))
121
+
122
+ tasks_info.append(task_info)
123
+
124
+ if problem.get("test") is not None:
125
+ tc_list = _extract_test_classes(problem["test"], problem["entry_point"])
126
+ for task_info in tasks_info:
127
+ idx_in_tc = task_info["input_idx"]
128
+ if idx_in_tc < len(tc_list):
129
+ task_info["test_class_name"] = tc_list[idx_in_tc]["name"]
130
+ task_info["test_class_code"] = tc_list[idx_in_tc]["code"]
131
+
132
+ return {
133
+ "idx": idx,
134
+ "task_id": problem["task_id"],
135
+ "entry_point": problem["entry_point"],
136
+ "code": code,
137
+ "highlighted_code": highlighted_code,
138
+ "inputs": problem["inputs"],
139
+ "outputs": problem["outputs"],
140
+ "test": problem.get("test"),
141
+ "tasks": tasks_info,
142
+ "source": "ClassEval" if problem.get("test") is not None else "HumanEval",
143
+ "has_ground_truth": True,
144
+ "has_tasks": True,
145
+ }
146
+
147
+
148
+ # ---------------------------------------------------------------------------
149
+ # CRUXEval adapter (HuggingFace: cruxeval-org/cruxeval)
150
+ # ---------------------------------------------------------------------------
151
+
152
+ class CRUXEvalAdapter(DatasetAdapter):
153
+ slug = "cruxeval"
154
+ display_name = "CRUXEval"
155
+ has_ground_truth = False
156
+ has_tasks = True
157
+
158
+ def __init__(self, hf_dataset):
159
+ self._ds = hf_dataset
160
+
161
+ def problem_count(self) -> int:
162
+ return len(self._ds)
163
+
164
+ def get_problem_summary(self, idx: int) -> dict[str, Any]:
165
+ row = self._ds[idx]
166
+ return {
167
+ "idx": idx,
168
+ "task_id": row["id"],
169
+ "entry_point": "f",
170
+ "num_inputs": 1,
171
+ "source": "CRUXEval",
172
+ }
173
+
174
+ def get_problem_detail(self, idx: int) -> dict[str, Any]:
175
+ row = self._ds[idx]
176
+ code = row["code"]
177
+ return {
178
+ "idx": idx,
179
+ "task_id": row["id"],
180
+ "entry_point": "f",
181
+ "code": code,
182
+ "highlighted_code": _highlight_code(code),
183
+ "inputs": [row["input"]],
184
+ "outputs": [row["output"]],
185
+ "test": None,
186
+ "tasks": [
187
+ {
188
+ "name": "Output Prediction",
189
+ "description": "Given the code and input, predict the output.",
190
+ "given": "input",
191
+ "predict": "output",
192
+ "input": row["input"],
193
+ "output": row["output"],
194
+ },
195
+ {
196
+ "name": "Input Prediction",
197
+ "description": "Given the code and output, predict the input.",
198
+ "given": "output",
199
+ "predict": "input",
200
+ "input": row["input"],
201
+ "output": row["output"],
202
+ },
203
+ ],
204
+ "source": "CRUXEval",
205
+ "has_ground_truth": False,
206
+ "has_tasks": True,
207
+ }
208
+
209
+
210
+ # ---------------------------------------------------------------------------
211
+ # HumanEval+ adapter (HuggingFace: evalplus/humanevalplus)
212
+ # ---------------------------------------------------------------------------
213
+
214
+ class HumanEvalPlusAdapter(DatasetAdapter):
215
+ slug = "humanevalplus"
216
+ display_name = "HumanEval+"
217
+ has_ground_truth = False
218
+ has_tasks = False
219
+
220
+ def __init__(self, hf_dataset):
221
+ self._ds = hf_dataset
222
+
223
+ def problem_count(self) -> int:
224
+ return len(self._ds)
225
+
226
+ def get_problem_summary(self, idx: int) -> dict[str, Any]:
227
+ row = self._ds[idx]
228
+ return {
229
+ "idx": idx,
230
+ "task_id": row["task_id"],
231
+ "entry_point": row["entry_point"],
232
+ "num_inputs": 0,
233
+ "source": "HumanEval+",
234
+ }
235
+
236
+ def get_problem_detail(self, idx: int) -> dict[str, Any]:
237
+ row = self._ds[idx]
238
+ code = row["prompt"] + row["canonical_solution"]
239
+ return {
240
+ "idx": idx,
241
+ "task_id": row["task_id"],
242
+ "entry_point": row["entry_point"],
243
+ "code": code,
244
+ "highlighted_code": _highlight_code(code),
245
+ "inputs": [],
246
+ "outputs": [],
247
+ "test": row["test"],
248
+ "tasks": [],
249
+ "source": "HumanEval+",
250
+ "has_ground_truth": False,
251
+ "has_tasks": False,
252
+ }
253
+
254
+
255
+ # ---------------------------------------------------------------------------
256
+ # BigOBench adapter (HuggingFace: facebook/BigOBench)
257
+ # ---------------------------------------------------------------------------
258
+
259
+ class BigOBenchAdapter(DatasetAdapter):
260
+ slug = "bigobench"
261
+ display_name = "BigOBench"
262
+ has_ground_truth = False
263
+ has_tasks = False
264
+
265
+ def __init__(self, problems: list[dict[str, Any]]):
266
+ self._problems = problems
267
+
268
+ def problem_count(self) -> int:
269
+ return len(self._problems)
270
+
271
+ def get_problem_summary(self, idx: int) -> dict[str, Any]:
272
+ prob = self._problems[idx]
273
+ return {
274
+ "idx": idx,
275
+ "task_id": prob["problem_id"],
276
+ "entry_point": prob["problem_name"],
277
+ "num_inputs": len(prob["solutions"]),
278
+ "source": "BigOBench",
279
+ }
280
+
281
+ def get_problem_detail(self, idx: int) -> dict[str, Any]:
282
+ prob = self._problems[idx]
283
+ solutions = []
284
+ for sol in prob["solutions"]:
285
+ solutions.append({
286
+ "solution_id": sol["solution_id"],
287
+ "code": sol["solution_code"],
288
+ "highlighted_code": _highlight_code(sol["solution_code"]),
289
+ "time_complexity": sol.get("time_complexity"),
290
+ "space_complexity": sol.get("space_complexity"),
291
+ })
292
+ return {
293
+ "idx": idx,
294
+ "task_id": prob["problem_id"],
295
+ "entry_point": prob["problem_name"],
296
+ "code": solutions[0]["code"] if solutions else "",
297
+ "highlighted_code": solutions[0]["highlighted_code"] if solutions else "",
298
+ "inputs": [],
299
+ "outputs": [],
300
+ "test": None,
301
+ "tasks": [],
302
+ "source": "BigOBench",
303
+ "has_ground_truth": False,
304
+ "has_tasks": False,
305
+ "description": prob["description"],
306
+ "solutions": solutions,
307
+ }
308
+
309
+
310
+ def _merge_bigobench(ds_time, ds_space) -> list[dict[str, Any]]:
311
+ """Merge time and space complexity test sets by problem_id.
312
+
313
+ Groups all solutions under their parent problem. Solutions that appear
314
+ in both test sets get both complexity labels; otherwise the missing one
315
+ is None. Returns a list of problem dicts sorted by problem_id.
316
+ """
317
+ # First, collect solutions keyed by (problem_id, solution_id)
318
+ solutions: dict[tuple[str, str], dict[str, Any]] = {}
319
+ # Track problem-level metadata
320
+ problem_meta: dict[str, dict[str, str]] = {}
321
+
322
+ for row in ds_time:
323
+ pid, sid = row["problem_id"], row["solution_id"]
324
+ problem_meta[pid] = {
325
+ "problem_name": row["problem_name"],
326
+ "description": row["description"],
327
+ }
328
+ solutions[(pid, sid)] = {
329
+ "solution_id": sid,
330
+ "solution_code": row["solution_code"],
331
+ "time_complexity": row["time_complexity_inferred"],
332
+ "space_complexity": None,
333
+ }
334
+
335
+ for row in ds_space:
336
+ pid, sid = row["problem_id"], row["solution_id"]
337
+ problem_meta.setdefault(pid, {
338
+ "problem_name": row["problem_name"],
339
+ "description": row["description"],
340
+ })
341
+ key = (pid, sid)
342
+ if key in solutions:
343
+ solutions[key]["space_complexity"] = row["space_complexity_inferred"]
344
+ else:
345
+ solutions[key] = {
346
+ "solution_id": sid,
347
+ "solution_code": row["solution_code"],
348
+ "time_complexity": None,
349
+ "space_complexity": row["space_complexity_inferred"],
350
+ }
351
+
352
+ # Group solutions by problem_id
353
+ from collections import defaultdict
354
+ by_problem: dict[str, list[dict[str, Any]]] = defaultdict(list)
355
+ for (pid, _sid), sol in solutions.items():
356
+ by_problem[pid].append(sol)
357
+
358
+ problems = []
359
+ for pid in sorted(by_problem.keys()):
360
+ meta = problem_meta[pid]
361
+ problems.append({
362
+ "problem_id": pid,
363
+ "problem_name": meta["problem_name"],
364
+ "description": meta["description"],
365
+ "solutions": by_problem[pid],
366
+ })
367
+
368
+ return problems
369
+
370
+
371
+ # ---------------------------------------------------------------------------
372
+ # Registration helpers
373
+ # ---------------------------------------------------------------------------
374
+
375
+ def register_dreval(data: list, tasks: list, gt_loader) -> None:
376
+ """Register the DREval dataset (always available)."""
377
+ REGISTRY["dreval"] = DREvalAdapter(data, tasks, gt_loader)
378
+
379
+
380
+ def register_hf_datasets() -> None:
381
+ """Try to load HuggingFace datasets. Silently skips if `datasets` is not installed."""
382
+ try:
383
+ from datasets import load_dataset
384
+ except ImportError:
385
+ return
386
+
387
+ try:
388
+ crux = load_dataset("cruxeval-org/cruxeval", split="test")
389
+ REGISTRY["cruxeval"] = CRUXEvalAdapter(crux)
390
+ print(f"Loaded CRUXEval: {len(crux)} problems")
391
+ except Exception as e:
392
+ print(f"Warning: could not load CRUXEval: {e}")
393
+
394
+ try:
395
+ heplus = load_dataset("evalplus/humanevalplus", split="test")
396
+ REGISTRY["humanevalplus"] = HumanEvalPlusAdapter(heplus)
397
+ print(f"Loaded HumanEval+: {len(heplus)} problems")
398
+ except Exception as e:
399
+ print(f"Warning: could not load HumanEval+: {e}")
400
+
401
+ try:
402
+ ds_time = load_dataset(
403
+ "facebook/BigOBench", "time_complexity_test_set.jsonl", split="train"
404
+ )
405
+ ds_space = load_dataset(
406
+ "facebook/BigOBench", "space_complexity_test_set.jsonl", split="train"
407
+ )
408
+ merged = _merge_bigobench(ds_time, ds_space)
409
+ REGISTRY["bigobench"] = BigOBenchAdapter(merged)
410
+ print(f"Loaded BigOBench: {len(merged)} problems "
411
+ f"({len(ds_time)} time + {len(ds_space)} space)")
412
+ except Exception as e:
413
+ print(f"Warning: could not load BigOBench: {e}")
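The adapter/registry pattern above can be exercised without downloading any dataset. A minimal sketch (the `ToyAdapter` class and sample rows are illustrative, not part of the commit; the real registry consumer is the `/api/datasets` route in app.py):

```python
from typing import Any

# Same shape as dataset_adapters.REGISTRY: slug -> adapter instance.
REGISTRY: dict[str, "DatasetAdapter"] = {}


class DatasetAdapter:
    slug = ""
    display_name = ""

    def problem_count(self) -> int:
        raise NotImplementedError

    def get_problem_summary(self, idx: int) -> dict[str, Any]:
        raise NotImplementedError


class ToyAdapter(DatasetAdapter):
    # Hypothetical adapter over an in-memory list, standing in for an HF dataset.
    slug = "toy"
    display_name = "Toy"

    def __init__(self, rows):
        self._rows = rows

    def problem_count(self) -> int:
        return len(self._rows)

    def get_problem_summary(self, idx: int) -> dict[str, Any]:
        return {"idx": idx, "task_id": self._rows[idx]["id"], "source": "Toy"}


REGISTRY["toy"] = ToyAdapter([{"id": "Toy/0"}, {"id": "Toy/1"}])

# What a dataset-listing endpoint would serialise from the registry:
listing = [
    {"slug": s, "display_name": a.display_name, "problem_count": a.problem_count()}
    for s, a in REGISTRY.items()
]
print(listing)  # → [{'slug': 'toy', 'display_name': 'Toy', 'problem_count': 2}]
```

Because every adapter returns the same summary/detail shape, the Flask routes and Jinja templates never branch on the dataset type.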
pyproject.toml ADDED
@@ -0,0 +1,38 @@
+ [project]
+ name = "ml4se-bench-viewer"
+ version = "0.1.0"
+ description = "Web-based visualization tool for browsing and inspecting popular ML4SE benchmark datasets"
+ readme = "README.md"
+ requires-python = ">=3.8"
+ dependencies = [
+     "flask>=3.0.0",
+     "pygments>=2.17.2",
+     "datasets>=2.14.0",
+ ]
+
+ [project.optional-dependencies]
+ dev = [
+     "ruff>=0.8.0",
+ ]
+
+ [tool.ruff]
+ line-length = 100
+ target-version = "py38"
+
+ [tool.ruff.lint]
+ select = [
+     "E",   # pycodestyle errors
+     "W",   # pycodestyle warnings
+     "F",   # pyflakes
+     "I",   # isort
+     "B",   # flake8-bugbear
+     "C4",  # flake8-comprehensions
+     "UP",  # pyupgrade
+ ]
+ ignore = [
+     "E501",  # line too long (handled by formatter)
+ ]
+
+ [tool.ruff.format]
+ quote-style = "double"
+ indent-style = "space"
templates/base.html ADDED
@@ -0,0 +1,211 @@
+ <!DOCTYPE html>
+ <html lang="en">
+ <head>
+     <meta charset="UTF-8">
+     <meta name="viewport" content="width=device-width, initial-scale=1.0">
+     <title>{% block title %}ML4SE Benchmark Viewer{% endblock %}</title>
+     <style>
+         * {
+             margin: 0;
+             padding: 0;
+             box-sizing: border-box;
+         }
+
+         body {
+             font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, 'Helvetica Neue', Arial, sans-serif;
+             line-height: 1.6;
+             color: #333;
+             background: #f5f5f5;
+         }
+
+         .container {
+             max-width: 1400px;
+             margin: 0 auto;
+             padding: 20px;
+         }
+
+         header {
+             background: #2c3e50;
+             color: white;
+             padding: 20px 0;
+             margin-bottom: 30px;
+             box-shadow: 0 2px 4px rgba(0,0,0,0.1);
+         }
+
+         header h1 {
+             font-size: 2rem;
+             font-weight: 600;
+         }
+
+         header p {
+             margin-top: 10px;
+             opacity: 0.9;
+         }
+
+         .card {
+             background: white;
+             border-radius: 8px;
+             padding: 20px;
+             margin-bottom: 20px;
+             box-shadow: 0 2px 4px rgba(0,0,0,0.1);
+         }
+
+         .card h2 {
+             font-size: 1.5rem;
+             margin-bottom: 15px;
+             color: #2c3e50;
+             border-bottom: 2px solid #3498db;
+             padding-bottom: 10px;
+         }
+
+         .badge {
+             display: inline-block;
+             padding: 4px 12px;
+             border-radius: 4px;
+             font-size: 0.85rem;
+             font-weight: 600;
+             margin-right: 8px;
+         }
+
+         .badge-humaneval {
+             background: #3498db;
+             color: white;
+         }
+
+         .badge-classeval {
+             background: #9b59b6;
+             color: white;
+         }
+
+         .badge-cruxeval {
+             background: #e67e22;
+             color: white;
+         }
+
+         .badge-humanevalplus {
+             background: #27ae60;
+             color: white;
+         }
+
+         .badge-bigobench {
+             background: #8e44ad;
+             color: white;
+         }
+
+         .badge-info {
+             background: #ecf0f1;
+             color: #2c3e50;
+         }
+
+         /* Code styling */
+         .source {
+             font-family: 'Monaco', 'Menlo', 'Ubuntu Mono', monospace;
+             font-size: 0.9rem;
+             line-height: 1.5;
+             overflow-x: auto;
+         }
+
+         .source table {
+             border-collapse: collapse;
+             width: 100%;
+         }
+
+         .source td.linenos {
+             background: #f8f8f8;
+             color: #999;
+             padding: 0 10px;
+             text-align: right;
+             user-select: none;
+             border-right: 1px solid #ddd;
+         }
+
+         .source td.code {
+             padding-left: 15px;
+         }
+
+         .source .hll {
+             background-color: #ffffcc;
+         }
+
+         pre {
+             margin: 0;
+             white-space: pre;
+             word-wrap: normal;
+         }
+
+         /* Loading spinner */
+         .loading {
+             text-align: center;
+             padding: 40px;
+             color: #7f8c8d;
+         }
+
+         .spinner {
+             border: 4px solid #f3f3f3;
+             border-top: 4px solid #3498db;
+             border-radius: 50%;
+             width: 40px;
+             height: 40px;
+             animation: spin 1s linear infinite;
+             margin: 20px auto;
+         }
+
+         @keyframes spin {
+             0% { transform: rotate(0deg); }
+             100% { transform: rotate(360deg); }
+         }
+
+         /* Button styles */
+         .btn {
+             display: inline-block;
+             padding: 10px 20px;
+             background: #3498db;
+             color: white;
+             text-decoration: none;
+             border-radius: 4px;
+             border: none;
+             cursor: pointer;
+             font-size: 1rem;
+             transition: background 0.3s;
+         }
+
+         .btn:hover {
+             background: #2980b9;
+         }
+
+         .btn-secondary {
+             background: #95a5a6;
+         }
+
+         .btn-secondary:hover {
+             background: #7f8c8d;
+         }
+
+         /* Navigation */
+         .nav-links {
+             margin-top: 20px;
+         }
+
+         .nav-links a {
+             margin-right: 15px;
+         }
+
+         {% block extra_css %}{% endblock %}
+     </style>
+ </head>
+ <body>
+     <header>
+         <div class="container">
+             <h1>ML4SE Benchmark Viewer</h1>
+             <p>Browse and inspect popular ML4SE benchmark datasets</p>
+             {% block header_extra %}{% endblock %}
+         </div>
+     </header>
+
+     <div class="container">
+         {% block content %}{% endblock %}
+     </div>
+
+     {% block scripts %}{% endblock %}
+ </body>
+ </html>
templates/index.html ADDED
@@ -0,0 +1,330 @@
+ {% extends "base.html" %}
+
+ {% block title %}ML4SE Benchmark Viewer - Problem List{% endblock %}
+
+ {% block extra_css %}
+ <style>
+     .filters {
+         margin-bottom: 20px;
+         display: flex;
+         gap: 15px;
+         align-items: center;
+         flex-wrap: wrap;
+     }
+
+     .filter-group {
+         display: flex;
+         align-items: center;
+         gap: 8px;
+     }
+
+     .filter-group label {
+         font-weight: 600;
+         color: #2c3e50;
+     }
+
+     .filter-group select,
+     .filter-group input {
+         padding: 8px 12px;
+         border: 1px solid #ddd;
+         border-radius: 4px;
+         font-size: 0.95rem;
+     }
+
+     .problems-grid {
+         display: grid;
+         grid-template-columns: repeat(auto-fill, minmax(300px, 1fr));
+         gap: 20px;
+     }
+
+     .problem-card {
+         background: white;
+         border-radius: 8px;
+         padding: 20px;
+         box-shadow: 0 2px 4px rgba(0,0,0,0.1);
+         transition: transform 0.2s, box-shadow 0.2s;
+         cursor: pointer;
+         text-decoration: none;
+         color: inherit;
+         display: block;
+     }
+
+     .problem-card:hover {
+         transform: translateY(-2px);
+         box-shadow: 0 4px 8px rgba(0,0,0,0.15);
+     }
+
+     .problem-card-header {
+         margin-bottom: 12px;
+         padding-bottom: 12px;
+         border-bottom: 1px solid #ecf0f1;
+     }
+
+     .problem-card-title {
+         font-size: 1.1rem;
+         font-weight: 600;
+         color: #2c3e50;
+         margin-bottom: 5px;
+     }
+
+     .problem-card-id {
+         font-size: 0.85rem;
+         color: #7f8c8d;
+         font-family: monospace;
+     }
+
+     .problem-card-body {
+         margin: 12px 0;
+     }
+
+     .problem-card-info {
+         display: flex;
+         justify-content: space-between;
+         font-size: 0.9rem;
+         color: #7f8c8d;
+     }
+
+     .stats {
+         display: flex;
+         gap: 20px;
+         margin-bottom: 20px;
+         flex-wrap: wrap;
+     }
+
+     .stat-card {
+         background: white;
+         border-radius: 8px;
+         padding: 20px;
+         box-shadow: 0 2px 4px rgba(0,0,0,0.1);
+         flex: 1;
+         min-width: 200px;
+     }
+
+     .stat-number {
+         font-size: 2.5rem;
+         font-weight: 700;
+         color: #3498db;
+     }
+
+     .stat-label {
+         font-size: 0.9rem;
+         color: #7f8c8d;
+         margin-top: 5px;
+     }
+ </style>
+ {% endblock %}
+
+ {% block content %}
+ <div class="stats" id="stats">
+     <div class="stat-card">
+         <div class="stat-number" id="total-problems">-</div>
+         <div class="stat-label">Total Problems</div>
+     </div>
+     <div class="stat-card" id="stat-source-a">
+         <div class="stat-number" id="source-a-count">-</div>
+         <div class="stat-label" id="source-a-label">Source A</div>
+     </div>
+     <div class="stat-card" id="stat-source-b">
+         <div class="stat-number" id="source-b-count">-</div>
+         <div class="stat-label" id="source-b-label">Source B</div>
+     </div>
+     <div class="stat-card">
+         <div class="stat-number" id="filtered-count">-</div>
+         <div class="stat-label">Displayed</div>
+     </div>
+ </div>
+
+ <div class="card">
+     <h2>Filter Problems</h2>
+     <div class="filters">
+         <div class="filter-group">
+             <label for="dataset-filter">Dataset:</label>
+             <select id="dataset-filter">
+                 <option value="dreval">DREval</option>
+             </select>
+         </div>
+         <div class="filter-group">
+             <label for="source-filter">Source:</label>
+             <select id="source-filter">
+                 <option value="all">All</option>
+             </select>
+         </div>
+         <div class="filter-group">
+             <label for="search-filter">Search:</label>
+             <input type="text" id="search-filter" placeholder="Function name or ID...">
+         </div>
+     </div>
+ </div>
+
+ <div id="problems-container">
+     <div class="loading">
+         <div class="spinner"></div>
+         <p>Loading problems...</p>
+     </div>
+ </div>
+ {% endblock %}
+
+ {% block scripts %}
+ <script>
+     let allProblems = [];
+     // Read dataset from URL query param (e.g. /?dataset=cruxeval), default to dreval
+     let currentDataset = new URLSearchParams(window.location.search).get('dataset') || 'dreval';
+
+     async function loadDatasets() {
+         try {
+             const response = await fetch('/api/datasets');
+             const datasets = await response.json();
+             const select = document.getElementById('dataset-filter');
+             select.innerHTML = '';
+             datasets.forEach(ds => {
+                 const opt = document.createElement('option');
+                 opt.value = ds.slug;
+                 opt.textContent = `${ds.display_name} (${ds.problem_count})`;
+                 if (ds.slug === currentDataset) opt.selected = true;
+                 select.appendChild(opt);
+             });
+         } catch (error) {
+             console.error('Failed to load datasets:', error);
+         }
+     }
+
+     async function loadProblems() {
+         try {
+             document.getElementById('problems-container').innerHTML =
+                 '<div class="loading"><div class="spinner"></div><p>Loading problems...</p></div>';
+             const response = await fetch(`/api/${currentDataset}/problems`);
+             allProblems = await response.json();
+             updateSourceFilter();
+             updateStats();
+             renderProblems(allProblems);
+         } catch (error) {
+             document.getElementById('problems-container').innerHTML =
+                 '<div class="card"><p style="color: red;">Error loading problems: ' + error.message + '</p></div>';
+         }
+     }
+
+     function updateSourceFilter() {
+         const sources = [...new Set(allProblems.map(p => p.source))];
+         const select = document.getElementById('source-filter');
+         const current = select.value;
+         select.innerHTML = '<option value="all">All</option>';
+         sources.forEach(src => {
+             const opt = document.createElement('option');
+             opt.value = src;
+             opt.textContent = src;
+             select.appendChild(opt);
+         });
+         // Restore selection if still valid
+         if (sources.includes(current)) {
+             select.value = current;
+         }
+     }
+
+     function updateStats() {
+         const sources = {};
+         allProblems.forEach(p => {
+             sources[p.source] = (sources[p.source] || 0) + 1;
+         });
+
+         document.getElementById('total-problems').textContent = allProblems.length;
+
+         const sourceNames = Object.keys(sources);
+         const statA = document.getElementById('stat-source-a');
+         const statB = document.getElementById('stat-source-b');
+
+         if (sourceNames.length >= 1) {
+             statA.style.display = '';
+             document.getElementById('source-a-count').textContent = sources[sourceNames[0]];
+             document.getElementById('source-a-label').textContent = sourceNames[0];
+         } else {
+             statA.style.display = 'none';
+         }
+
+         if (sourceNames.length >= 2) {
+             statB.style.display = '';
+             document.getElementById('source-b-count').textContent = sources[sourceNames[1]];
+             document.getElementById('source-b-label').textContent = sourceNames[1];
+         } else {
+             statB.style.display = 'none';
+         }
+     }
+
+     function badgeClass(source) {
+         return 'badge-' + source.toLowerCase().replace(/[^a-z0-9]/g, '');
+     }
+
+     function renderProblems(problems) {
+         const container = document.getElementById('problems-container');
+
+         if (problems.length === 0) {
+             container.innerHTML = '<div class="card"><p>No problems match your filters.</p></div>';
+             document.getElementById('filtered-count').textContent = '0';
+             return;
+         }
+
+         document.getElementById('filtered-count').textContent = problems.length;
+
+         const grid = document.createElement('div');
+         grid.className = 'problems-grid';
+
+         const basePath = currentDataset === 'dreval' ? '' : `/${currentDataset}`;
+
+         problems.forEach(problem => {
+             const card = document.createElement('a');
+             card.className = 'problem-card';
+             card.href = `${basePath}/problem/${problem.idx}`;
+
+             card.innerHTML = `
+                 <div class="problem-card-header">
+                     <div class="problem-card-title">${problem.entry_point}</div>
+                     <div class="problem-card-id">${problem.task_id}</div>
+                 </div>
+                 <div class="problem-card-body">
+                     <span class="badge ${badgeClass(problem.source)}">${problem.source}</span>
+                     <span class="badge badge-info">${problem.num_inputs} inputs</span>
+                 </div>
+                 <div class="problem-card-info">
+                     <span>Index: ${problem.idx}</span>
+                 </div>
+             `;
+
+             grid.appendChild(card);
+         });
+
+         container.innerHTML = '';
+         container.appendChild(grid);
+     }
+
+     function filterProblems() {
+         const sourceFilter = document.getElementById('source-filter').value;
+         const searchFilter = document.getElementById('search-filter').value.toLowerCase();
+
+         let filtered = allProblems;
+
+         if (sourceFilter !== 'all') {
+             filtered = filtered.filter(p => p.source === sourceFilter);
+         }
+
+         if (searchFilter) {
+             filtered = filtered.filter(p =>
+                 p.entry_point.toLowerCase().includes(searchFilter) ||
+                 p.task_id.toLowerCase().includes(searchFilter)
+             );
+         }
+
+         renderProblems(filtered);
+     }
+
+     document.getElementById('dataset-filter').addEventListener('change', (e) => {
+         currentDataset = e.target.value;
+         document.getElementById('source-filter').value = 'all';
+         document.getElementById('search-filter').value = '';
+         loadProblems();
+     });
+     document.getElementById('source-filter').addEventListener('change', filterProblems);
+     document.getElementById('search-filter').addEventListener('input', filterProblems);
+
+     loadDatasets();
+     loadProblems();
+ </script>
+ {% endblock %}
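The index page filters entirely client-side: restrict by source badge, then case-insensitively substring-match the function name or task ID. A server-side Python analogue of that `filterProblems` logic (the function and sample rows here are illustrative, not part of the commit):

```python
def filter_problems(problems, source="all", query=""):
    # Mirror of the index-page JS filter: source === 'all' keeps everything,
    # otherwise match the source badge exactly; a non-empty query must appear
    # (case-insensitively) in entry_point or task_id.
    query = query.lower()
    out = [p for p in problems if source == "all" or p["source"] == source]
    if query:
        out = [
            p for p in out
            if query in p["entry_point"].lower() or query in p["task_id"].lower()
        ]
    return out


problems = [
    {"entry_point": "sort_list", "task_id": "HumanEval/1", "source": "HumanEval"},
    {"entry_point": "Stack.push", "task_id": "ClassEval/7", "source": "ClassEval"},
]
print(filter_problems(problems, query="stack"))  # → the ClassEval row only
```

Keeping the filter this simple means it can run on every keystroke of the search box without debouncing.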
templates/problem.html ADDED
@@ -0,0 +1,1015 @@
+ {% extends "base.html" %}
+
+ {% block title %}Problem {{ idx }} - {{ dataset_name }} - ML4SE Benchmark Viewer{% endblock %}
+
+ {% block header_extra %}
+ <div class="nav-links">
+ <a href="/?dataset={{ dataset_slug }}" class="btn btn-secondary">← Back to List</a>
+ {% set base_path = '/' + dataset_slug if dataset_slug != 'dreval' else '' %}
+ {% if idx > 0 %}
+ <a href="{{ base_path }}/problem/{{ idx - 1 }}" class="btn btn-secondary">← Previous</a>
+ {% else %}
+ <button class="btn btn-secondary" disabled style="opacity: 0.5; cursor: not-allowed;">← Previous</button>
+ {% endif %}
+ {% if idx < total_problems - 1 %}
+ <a href="{{ base_path }}/problem/{{ idx + 1 }}" class="btn btn-secondary">Next →</a>
+ {% else %}
+ <button class="btn btn-secondary" disabled style="opacity: 0.5; cursor: not-allowed;">Next →</button>
+ {% endif %}
+ </div>
+ {% endblock %}
+
+ {% block extra_css %}
+ <style>
+ {{ css|safe }}
+
+ .problem-header {
+ display: flex;
+ justify-content: space-between;
+ align-items: center;
+ margin-bottom: 15px;
+ }
+
+ .problem-meta {
+ margin-bottom: 20px;
+ }
+
+ .meta-item {
+ display: inline-block;
+ margin-right: 15px;
+ margin-bottom: 10px;
+ }
+
+ .meta-label {
+ font-weight: 600;
+ color: #7f8c8d;
+ margin-right: 5px;
+ }
+
+ .meta-value {
+ color: #2c3e50;
+ }
+
+ .task-selector {
+ margin: 20px 0;
+ display: flex;
+ gap: 10px;
+ flex-wrap: wrap;
+ }
+
+ .task-btn {
+ padding: 10px 20px;
+ background: #ecf0f1;
+ border: 2px solid transparent;
+ border-radius: 4px;
+ cursor: pointer;
+ transition: all 0.3s;
+ font-size: 0.95rem;
+ }
+
+ .task-btn:hover {
+ background: #bdc3c7;
+ }
+
+ .task-btn.active {
+ background: #3498db;
+ color: white;
+ border-color: #2980b9;
+ }
+
+ .task-details {
+ margin-top: 20px;
+ }
+
+ .task-section {
+ margin-bottom: 25px;
+ padding: 15px;
+ background: #f8f9fa;
+ border-left: 4px solid #3498db;
+ border-radius: 4px;
+ }
+
+ .task-section h3 {
+ margin-bottom: 10px;
+ color: #2c3e50;
+ font-size: 1.1rem;
+ }
+
+ .code-block {
+ background: #f8f9fa;
+ padding: 15px;
+ border-radius: 4px;
+ overflow-x: auto;
+ font-family: 'Monaco', 'Menlo', 'Ubuntu Mono', monospace;
+ font-size: 0.9rem;
+ border: 1px solid #e1e4e8;
+ }
+
+ .task-items-list {
+ list-style: none;
+ }
+
+ .task-items-list li {
+ padding: 10px;
+ margin-bottom: 8px;
+ background: white;
+ border-radius: 4px;
+ border: 1px solid #e1e4e8;
+ }
+
+ .line-ref {
+ display: inline-block;
+ padding: 2px 8px;
+ background: #3498db;
+ color: white;
+ border-radius: 3px;
+ font-family: monospace;
+ font-size: 0.85rem;
+ margin-right: 8px;
+ }
+
+ .var-name {
+ display: inline-block;
+ padding: 2px 8px;
+ background: #9b59b6;
+ color: white;
+ border-radius: 3px;
+ font-family: monospace;
+ font-size: 0.85rem;
+ }
+
+ .io-section {
+ display: grid;
+ grid-template-columns: 1fr 1fr;
+ gap: 15px;
+ }
+
+ @media (max-width: 768px) {
+ .io-section {
+ grid-template-columns: 1fr;
+ }
+ }
+
+ .navigation-hint {
+ margin-top: 20px;
+ padding: 15px;
+ background: #e8f4f8;
+ border-radius: 4px;
+ color: #2c3e50;
+ font-size: 0.9rem;
+ }
+
+ .test-code-section {
+ margin-top: 20px;
+ }
+
+ /* Inline task visualization */
+ .code-with-tasks {
+ position: relative;
+ }
+
+ .task-marker {
+ display: inline-block;
+ margin-left: 10px;
+ padding: 2px 8px;
+ background: #9b59b6;
+ color: white;
+ border-radius: 3px;
+ font-size: 0.75rem;
+ font-weight: 600;
+ cursor: crosshair;
+ }
+
+ /* Coverage coloring on lineno spans.
+ Pygments emits: td.linenos > div.linenodiv > pre > span.normal
+ We must match that chain; .source .linenos doesn't work because
+ the td has class "linenos", not an element named "linenos". */
+ td.linenos .normal.line-executed {
+ background-color: #d4edda !important;
+ color: #155724 !important;
+ }
+
+ td.linenos .normal.line-not-executed {
+ background-color: #f8d7da !important;
+ color: #721c24 !important;
+ }
+
+ /* Coverage legend */
+ .coverage-legend {
+ margin: 10px 0;
+ padding: 10px 15px;
+ background: #f8f9fa;
+ border-left: 4px solid #28a745;
+ border-radius: 4px;
+ font-size: 0.85rem;
+ display: none;
+ }
+
+ .coverage-legend-item {
+ display: inline-block;
+ margin-right: 18px;
+ }
+
+ .coverage-swatch {
+ display: inline-block;
+ width: 12px;
+ height: 12px;
+ border-radius: 2px;
+ margin-right: 4px;
+ vertical-align: middle;
+ }
+
+ /* Ground truth answer badge shown next to task items */
+ .gt-answer {
+ display: inline-block;
+ margin-left: 10px;
+ padding: 2px 8px;
+ background: #17a2b8;
+ color: white;
+ border-radius: 3px;
+ font-family: monospace;
+ font-size: 0.82rem;
+ font-weight: 600;
+ }
+
+ .gt-answer.loading {
+ background: #6c757d;
+ }
+
+ /* SVG arrow overlay positioned over the code container */
+ #arrow-overlay {
+ position: absolute;
+ top: 0;
+ left: 0;
+ width: 100%;
+ height: 100%;
+ pointer-events: none;
+ overflow: visible;
+ z-index: 10;
+ }
+
+ .exec-arrow {
+ fill: none;
+ stroke: #e67e22;
+ stroke-width: 2.5;
+ stroke-dasharray: none;
+ opacity: 0.9;
+ }
+
+ .exec-arrow-head {
+ fill: #e67e22;
+ opacity: 0.9;
+ }
+
+ /* CRUXEval answer highlight */
+ .crux-answer {
+ border-left: 4px solid #17a2b8 !important;
+ background: #e8f6f8 !important;
+ }
+
+ /* BigOBench complexity display */
+ .complexity-badges {
+ display: flex;
+ gap: 20px;
+ flex-wrap: wrap;
+ }
+
+ .complexity-item {
+ display: flex;
+ align-items: center;
+ gap: 10px;
+ }
+
+ .complexity-label {
+ font-weight: 600;
+ color: #7f8c8d;
+ font-size: 0.95rem;
+ }
+
+ .complexity-value {
+ display: inline-block;
+ padding: 6px 16px;
+ background: #2c3e50;
+ color: #f1c40f;
+ border-radius: 4px;
+ font-family: 'Monaco', 'Menlo', 'Ubuntu Mono', monospace;
+ font-size: 1.1rem;
+ font-weight: 600;
+ }
+ </style>
+ {% endblock %}
+
+ {% block content %}
+ <div id="problem-content">
+ <div class="loading">
+ <div class="spinner"></div>
+ <p>Loading problem details...</p>
+ </div>
+ </div>
+ {% endblock %}
+
+ {% block scripts %}
+ <script>
+ const problemIdx = {{ idx }};
+ const datasetSlug = {{ dataset_slug|tojson }};
+ const datasetName = {{ dataset_name|tojson }};
+ const hasGroundTruth = {{ has_ground_truth|tojson }};
+ const hasTasks = {{ has_tasks|tojson }};
+
+ function badgeClass(source) {
+ return 'badge-' + source.toLowerCase().replace(/[^a-z0-9]/g, '');
+ }
+
+ async function loadProblem() {
+ try {
+ const response = await fetch(`/api/${datasetSlug}/problem/${problemIdx}`);
+ const problem = await response.json();
+
+ if (problem.error) {
+ document.getElementById('problem-content').innerHTML =
+ '<div class="card"><p style="color: red;">Error: ' + problem.error + '</p></div>';
+ return;
+ }
+
+ renderProblem(problem);
+ } catch (error) {
+ document.getElementById('problem-content').innerHTML =
+ '<div class="card"><p style="color: red;">Error loading problem: ' + error.message + '</p></div>';
+ }
+ }
+
+ function renderProblem(problem) {
+ const container = document.getElementById('problem-content');
+
+ // Main problem info card (shared by all datasets)
+ let html = `
+ <div class="card">
+ <div class="problem-header">
+ <h2>${escapeHtml(problem.entry_point)}</h2>
+ <span class="badge ${badgeClass(problem.source)}">${escapeHtml(problem.source)}</span>
+ </div>
+ <div class="problem-meta">
+ <div class="meta-item">
+ <span class="meta-label">Task ID:</span>
+ <span class="meta-value">${escapeHtml(problem.task_id)}</span>
+ </div>
+ <div class="meta-item">
+ <span class="meta-label">Index:</span>
+ <span class="meta-value">${problem.idx}</span>
+ </div>
+ <div class="meta-item">
+ <span class="meta-label">Dataset:</span>
+ <span class="meta-value">${escapeHtml(datasetName)}</span>
+ </div>
+ ${problem.inputs.length > 0 ? `
+ <div class="meta-item">
+ <span class="meta-label">Test Inputs:</span>
+ <span class="meta-value">${problem.inputs.length}</span>
+ </div>` : ''}
+ </div>
+ </div>
+ `;
+
+ // --- BigOBench view (problem description + per-solution code & complexity) ---
+ if (problem.solutions && problem.solutions.length > 0) {
+ // Problem description
+ if (problem.description) {
+ html += `
+ <div class="card">
+ <h2>Problem Statement</h2>
+ <pre class="code-block" style="white-space: pre-wrap;">${escapeHtml(problem.description)}</pre>
+ </div>
+ `;
+ }
+
+ // Each solution: code + complexity
+ problem.solutions.forEach((sol, i) => {
+ html += `
+ <div class="card">
+ <h2>Solution ${i + 1} <span style="font-size:0.8rem;color:#7f8c8d;font-weight:400;">${escapeHtml(sol.solution_id)}</span></h2>
+ <div class="complexity-badges" style="margin-bottom: 15px;">
+ `;
+ if (sol.time_complexity) {
+ html += `
+ <div class="complexity-item">
+ <span class="complexity-label">Time</span>
+ <span class="complexity-value">${escapeHtml(sol.time_complexity)}</span>
+ </div>`;
+ }
+ if (sol.space_complexity) {
+ html += `
+ <div class="complexity-item">
+ <span class="complexity-label">Space</span>
+ <span class="complexity-value">${escapeHtml(sol.space_complexity)}</span>
+ </div>`;
+ }
+ html += `
+ </div>
+ <div class="code-with-tasks">
+ ${sol.highlighted_code}
+ </div>
+ </div>
+ `;
+ });
+
+ // Navigation hint
+ html += `
+ <div class="navigation-hint">
+ <strong>Tip:</strong> Use the Previous/Next buttons at the top to browse through problems,
+ or return to the list view to filter by dataset source or search by name.
+ </div>
+ `;
+
+ container.innerHTML = html;
+ window.currentProblem = problem;
+ return;
+ }
+
+ // Source Code card
+ html += `
+ <div class="card">
+ <h2>Source Code</h2>
+ <div class="code-with-tasks" id="code-container">
+ ${problem.highlighted_code}
+ </div>
+ </div>
+ `;
+
+ // --- Non-DREval (simple) view ---
+ if (!hasTasks) {
+ // Show inputs/outputs if available
+ if (problem.inputs && problem.inputs.length > 0) {
+ html += `<div class="card"><h2>Inputs &amp; Outputs</h2>`;
+ problem.inputs.forEach((inp, i) => {
+ const out = (problem.outputs && problem.outputs[i]) || '';
+ html += `
+ <div class="io-section" style="margin-bottom: 15px;">
+ <div class="task-section">
+ <h3>Input ${i + 1}</h3>
+ <pre class="code-block">${escapeHtml(inp)}</pre>
+ </div>
+ <div class="task-section">
+ <h3>Output</h3>
+ <pre class="code-block">${escapeHtml(out)}</pre>
+ </div>
+ </div>
+ `;
+ });
+ html += `</div>`;
+ }
+
+ // Show test suite if available
+ if (problem.test) {
+ html += `
+ <div class="card">
+ <h2>Test Suite</h2>
+ <pre class="code-block">${escapeHtml(problem.test)}</pre>
+ </div>
+ `;
+ }
+
+ // Navigation hint
+ html += `
+ <div class="navigation-hint">
+ <strong>Tip:</strong> Use the Previous/Next buttons at the top to browse through problems,
+ or return to the list view to filter by dataset source or search by name.
+ </div>
+ `;
+
+ container.innerHTML = html;
+ window.currentProblem = problem;
+ return;
+ }
+
+ // --- CRUXEval task view (tasks have given/predict fields, no task_items) ---
+ if (problem.tasks.length > 0 && problem.tasks[0].given !== undefined) {
+ // Task selector
+ html += `
+ <div class="card">
+ <h2>Tasks</h2>
+ <div class="task-selector" id="task-selector">
+ `;
+ problem.tasks.forEach((task, idx) => {
+ html += `
+ <button class="task-btn ${idx === 0 ? 'active' : ''}"
+ onclick="showCruxTask(${idx})">
+ ${escapeHtml(task.name)}
+ </button>
+ `;
+ });
+ html += `
+ </div>
+ <div id="task-content"></div>
+ </div>
+ `;
+
+ // Navigation hint
+ html += `
+ <div class="navigation-hint">
+ <strong>Tip:</strong> Use the Previous/Next buttons at the top to browse through problems,
+ or return to the list view to filter by dataset source or search by name.
+ </div>
+ `;
+
+ container.innerHTML = html;
+ window.currentProblem = problem;
+ showCruxTask(0);
+ return;
+ }
+
+ // --- DREval (full) view with tasks, coverage, arrows ---
+ // Rebuild html cleanly with coverage legend and SVG overlay
+ html = `
+ <div class="card">
+ <div class="problem-header">
+ <h2>${escapeHtml(problem.entry_point)}</h2>
+ <span class="badge ${badgeClass(problem.source)}">${escapeHtml(problem.source)}</span>
+ </div>
+ <div class="problem-meta">
+ <div class="meta-item">
+ <span class="meta-label">Task ID:</span>
+ <span class="meta-value">${escapeHtml(problem.task_id)}</span>
+ </div>
+ <div class="meta-item">
+ <span class="meta-label">Index:</span>
+ <span class="meta-value">${problem.idx}</span>
+ </div>
+ <div class="meta-item">
+ <span class="meta-label">Dataset:</span>
+ <span class="meta-value">${escapeHtml(datasetName)}</span>
+ </div>
+ <div class="meta-item">
+ <span class="meta-label">Test Inputs:</span>
+ <span class="meta-value">${problem.inputs.length}</span>
+ </div>
+ </div>
+ </div>
+
+ <div class="card">
+ <h2>Source Code</h2>
+ <div class="coverage-legend" id="coverage-legend">
+ <strong>Coverage:</strong>
+ <span class="coverage-legend-item">
+ <span class="coverage-swatch" style="background:#d4edda; border:1px solid #28a745;"></span>
+ Executed
+ </span>
+ <span class="coverage-legend-item">
+ <span class="coverage-swatch" style="background:#f8d7da; border:1px solid #dc3545;"></span>
+ Not executed
+ </span>
+ </div>
+ <div class="code-with-tasks" id="code-container">
+ ${problem.highlighted_code}
+ <svg id="arrow-overlay" xmlns="http://www.w3.org/2000/svg">
+ <defs>
+ <marker id="arrowhead" markerWidth="8" markerHeight="6"
+ refX="8" refY="3" orient="auto">
+ <polygon points="0 0, 8 3, 0 6" class="exec-arrow-head"/>
+ </marker>
+ </defs>
+ </svg>
+ </div>
+ </div>
+ `;
+
+ // Task selector
+ html += `
+ <div class="card">
+ <h2>Test Cases & Tasks</h2>
+ <p>Select a test input to view associated reasoning tasks:</p>
+ <div class="task-selector" id="task-selector">
+ `;
+
+ problem.tasks.forEach((task, idx) => {
+ html += `
+ <button class="task-btn ${idx === 0 ? 'active' : ''}"
+ onclick="showTask(${idx})">
+ Input ${task.input_idx + 1}
+ </button>
+ `;
+ });
+
+ html += `
+ </div>
+ <div id="task-content"></div>
+ </div>
+ `;
+
+ // Navigation hint
+ html += `
+ <div class="navigation-hint">
+ <strong>Tip:</strong> Use the Previous/Next buttons at the top to browse through problems,
+ or return to the list view to filter by dataset source or search by name.
+ </div>
+ `;
+
+ container.innerHTML = html;
+
+ // Store problem data globally
+ window.currentProblem = problem;
+
+ // Show first task by default
+ showTask(0);
+ }
+
+ function injectTaskMarkers(taskItems) {
+ const codePre = document.querySelector('.source .code pre');
+
+ // Save the pristine original innerHTML once, before any modification.
+ if (codePre && !window._codePreOriginalHtml) {
+ window._codePreOriginalHtml = codePre.innerHTML;
+ }
+
+ // Invalidate span cache (rebuilt lazily on next arrow draw)
+ window._linenoSpanCache = null;
+
+ // Store current task items so applyCoverage can re-add markers after wrapping.
+ window._currentTaskItems = taskItems || [];
+
+ // Reset code pre to original, then add markers from scratch.
+ if (codePre && window._codePreOriginalHtml) {
+ codePre.innerHTML = window._codePreOriginalHtml;
+ }
+
+ if (!taskItems || taskItems.length === 0) {
+ return;
+ }
+
+ // Group tasks by line number
+ const tasksByLine = {};
+ taskItems.forEach(item => {
+ if (!tasksByLine[item.lineno]) tasksByLine[item.lineno] = [];
+ tasksByLine[item.lineno].push(item.var);
+ });
+
+ // Inject task marker badges into the code pre
+ if (!codePre) return;
+ const codeLines = codePre.innerHTML.split('\n');
+ codePre.innerHTML = codeLines.map((line, idx) => {
+ const lineNum = idx + 1;
+ if (tasksByLine[lineNum] && line.trim() !== '') {
+ const vars = tasksByLine[lineNum];
+ return line + `<span class="task-marker" data-lineno="${lineNum}" data-vars="${escapeHtml(vars.join(', '))}">${escapeHtml(vars.join(', '))}</span>`;
+ }
+ return line;
+ }).join('\n');
+
+ }
+
+ function applyCoverage(coverageSet, totalLines) {
+ // Remove previous coverage classes from lineno spans.
+ // Pygments structure: td.linenos > div.linenodiv > pre > span.normal
+ // These are individual elements — adding/removing classes has no layout impact.
+ document.querySelectorAll('td.linenos .normal').forEach(el => {
+ el.classList.remove('line-executed', 'line-not-executed');
+ });
+
+ if (!coverageSet) {
+ const legend = document.getElementById('coverage-legend');
+ if (legend) legend.style.display = 'none';
+ return;
+ }
+
+ const legend = document.getElementById('coverage-legend');
+ if (legend) legend.style.display = 'block';
+
+ // Color lineno spans only. We never touch codePre.innerHTML here so:
+ // 1. The table layout is never disturbed (no alignment issue).
+ // 2. Task markers injected by injectTaskMarkers are left untouched.
+ document.querySelectorAll('td.linenos .normal').forEach(span => {
+ const lineNum = parseInt(span.textContent.trim());
+ if (!isNaN(lineNum) && lineNum <= totalLines) {
+ span.classList.add(coverageSet.has(lineNum) ? 'line-executed' : 'line-not-executed');
+ }
+ });
+ }
+
+ // Global map: lineno -> list of next line numbers (1-indexed; -1 = end of trace)
+ window._nextLinesMap = {};
+
+ async function loadAndApplyGroundTruth(problemIdx, inputIdx, taskItems) {
+ // Show "loading" placeholders on all task items
+ taskItems.forEach(item => {
+ const el = document.getElementById(`gt-${item.lineno}-${item.var}`);
+ if (el) { el.textContent = '…'; el.className = 'gt-answer loading'; }
+ });
+
+ // Clear next-lines data from previous input
+ window._nextLinesMap = {};
+
+ try {
+ const resp = await fetch(`/api/${datasetSlug}/problem/${problemIdx}/ground_truth/${inputIdx}`);
+ const gt = await resp.json();
+
+ if (gt.status !== 'ok') {
+ taskItems.forEach(item => {
+ const el = document.getElementById(`gt-${item.lineno}-${item.var}`);
+ if (el) { el.textContent = gt.status === 'error' ? '(exec error)' : '(unavailable)'; el.className = 'gt-answer'; }
+ });
+ applyCoverage(null, 0);
+ return;
+ }
+
+ // Apply coverage highlighting
+ const coverageSet = new Set(gt.coverage);
+ applyCoverage(coverageSet, gt.total_lines);
+
+ // Fill in variable answers
+ const answerMap = {};
+ gt.variable_answers.forEach(a => {
+ answerMap[`${a.lineno}-${a.var}`] = a.answer_str;
+ });
+ taskItems.forEach(item => {
+ const el = document.getElementById(`gt-${item.lineno}-${item.var}`);
+ if (el) {
+ const answer = answerMap[`${item.lineno}-${item.var}`] || '(not available)';
+ el.textContent = answer;
+ el.className = 'gt-answer';
+ }
+ });
+
+ // Store next-lines data for arrow visualization
+ if (gt.next_lines_answers) {
+ gt.next_lines_answers.forEach(a => {
+ window._nextLinesMap[a.lineno] = a.next_lines;
+ });
+ }
+
+ // Attach hover handlers to task-marker spans now that we have next-lines data
+ attachArrowHoverHandlers();
+
+ } catch (e) {
+ taskItems.forEach(item => {
+ const el = document.getElementById(`gt-${item.lineno}-${item.var}`);
+ if (el) { el.textContent = '(error)'; el.className = 'gt-answer'; }
+ });
+ }
+ }
+
+ // Cache of lineNum → DOM span, rebuilt whenever injectTaskMarkers runs.
+ window._linenoSpanCache = null;
+
+ function buildLinenoSpanCache(container) {
+ const cache = {};
+ container.querySelectorAll('td.linenos .normal').forEach(span => {
+ const n = parseInt(span.textContent.trim());
+ if (!isNaN(n)) cache[n] = span;
+ });
+ window._linenoSpanCache = cache;
+ }
+
+ /**
+ * Get the bounding rect of the lineno span for a given 1-indexed line number,
+ * relative to the code container element. Uses a cached span map.
+ */
+ function getLinenoSpanRect(lineNum, container) {
+ if (!window._linenoSpanCache) buildLinenoSpanCache(container);
+ const span = window._linenoSpanCache[lineNum];
+ if (!span) return null;
+ const spanRect = span.getBoundingClientRect();
+ const containerRect = container.getBoundingClientRect();
+ return {
+ top: spanRect.top - containerRect.top + container.scrollTop,
+ bottom: spanRect.bottom - containerRect.top + container.scrollTop,
+ left: spanRect.left - containerRect.left,
+ right: spanRect.right - containerRect.left,
+ width: spanRect.width,
+ height: spanRect.height,
+ midY: (spanRect.top + spanRect.bottom) / 2 - containerRect.top + container.scrollTop,
+ };
+ }
+
+ /**
+ * Draw arrows from sourceLine to each of the targetLines in the SVG overlay.
+ * Lines are 1-indexed. -1 means "end of execution" (no arrow drawn).
+ */
+ function drawArrows(sourceLineNum, targetLineNums) {
+ const container = document.getElementById('code-container');
+ const svg = document.getElementById('arrow-overlay');
+ if (!container || !svg) return;
+
+ // Remove previous arrows (but keep defs)
+ svg.querySelectorAll('.arrow-path').forEach(el => el.remove());
+
+ const srcRect = getLinenoSpanRect(sourceLineNum, container);
+ if (!srcRect) return;
+
+ // Update SVG height to match container
+ svg.setAttribute('height', container.scrollHeight);
+
+ targetLineNums.forEach(targetLineNum => {
+ if (targetLineNum === -1) return; // end of trace — no arrow
+
+ const dstRect = getLinenoSpanRect(targetLineNum, container);
+ if (!dstRect) return;
+
+ // Start point: right edge of source lineno span, vertically centered
+ const x1 = srcRect.right + 2;
+ const y1 = srcRect.midY;
+
+ // End point: right edge of target lineno span, vertically centered
+ const x2 = dstRect.right + 2;
+ const y2 = dstRect.midY;
+
+ // Horizontal offset for the bezier control points — curves to the right
+ const curveOffset = Math.max(30, Math.abs(y2 - y1) * 0.4);
+
+ // Cubic bezier: both control points extend to the right of the lineno column
+ const cx1 = x1 + curveOffset;
+ const cy1 = y1;
+ const cx2 = x2 + curveOffset;
+ const cy2 = y2;
+
+ const path = document.createElementNS('http://www.w3.org/2000/svg', 'path');
+ path.setAttribute('d', `M ${x1} ${y1} C ${cx1} ${cy1}, ${cx2} ${cy2}, ${x2} ${y2}`);
+ path.setAttribute('class', 'exec-arrow arrow-path');
+ path.setAttribute('marker-end', 'url(#arrowhead)');
+ svg.appendChild(path);
+ });
+ }
+
+ /**
+ * Clear all arrows from the SVG overlay.
+ */
+ function clearArrows() {
+ const svg = document.getElementById('arrow-overlay');
+ if (svg) {
+ svg.querySelectorAll('.arrow-path').forEach(el => el.remove());
+ }
+ }
+
+ // AbortController for the current set of marker hover listeners.
+ let _markerListenersAbort = null;
+
+ /**
+ * Attach mouseenter/mouseleave handlers to all .task-marker spans so that
+ * hovering shows execution-flow arrows to next lines.
+ */
+ function attachArrowHoverHandlers() {
+ // Cancel any previously attached listeners without touching the DOM.
+ if (_markerListenersAbort) _markerListenersAbort.abort();
+ _markerListenersAbort = new AbortController();
+ const { signal } = _markerListenersAbort;
+
+ document.querySelectorAll('.task-marker').forEach(marker => {
+ marker.addEventListener('mouseenter', () => {
+ const lineNum = parseInt(marker.dataset.lineno);
+ if (!lineNum) return;
+ const nextLines = window._nextLinesMap[lineNum];
+ if (nextLines && nextLines.length > 0) {
+ drawArrows(lineNum, nextLines);
+ }
+ }, { signal });
+
+ marker.addEventListener('mouseleave', () => {
+ clearArrows();
+ }, { signal });
+ });
+ }
+
+ function showCruxTask(taskIdx) {
+ const problem = window.currentProblem;
+ const task = problem.tasks[taskIdx];
+
+ // Update active button
+ document.querySelectorAll('.task-btn').forEach((btn, idx) => {
+ btn.classList.toggle('active', idx === taskIdx);
+ });
+
+ const givenLabel = task.given === 'input' ? 'Input (given)' : 'Output (given)';
+ const predictLabel = task.predict === 'output' ? 'Output (predict)' : 'Input (predict)';
+ const givenValue = task.given === 'input' ? task.input : task.output;
+ const predictValue = task.predict === 'output' ? task.output : task.input;
+
+ const html = `
+ <div class="task-details">
+ <div class="task-section">
+ <p style="margin-bottom: 12px; color: #7f8c8d;">${escapeHtml(task.description)}</p>
+ </div>
+ <div class="io-section">
+ <div class="task-section">
+ <h3>${escapeHtml(givenLabel)}</h3>
+ <pre class="code-block">${escapeHtml(givenValue)}</pre>
+ </div>
+ <div class="task-section">
+ <h3>${escapeHtml(predictLabel)}</h3>
+ <pre class="code-block crux-answer">${escapeHtml(predictValue)}</pre>
+ </div>
+ </div>
+ </div>
+ `;
+
+ document.getElementById('task-content').innerHTML = html;
+ }
+
+ function showTask(taskIdx) {
+ const problem = window.currentProblem;
+ const task = problem.tasks[taskIdx];
+
+ // Update active button
+ const buttons = document.querySelectorAll('.task-btn');
+ buttons.forEach((btn, idx) => {
+ if (idx === taskIdx) {
+ btn.classList.add('active');
+ } else {
+ btn.classList.remove('active');
+ }
+ });
+
+ // Inject task markers into the code
+ injectTaskMarkers(task.task_items);
+
+ // Clear previous coverage while new one loads
+ applyCoverage(null, 0);
+
+ // Render task content
+ // For HumanEval: Input + Expected Output side by side.
+ // For ClassEval: Input alone (side by side layout), then Test Class below full-width.
+ const ioSection = task.test_class_code
+ ? `<div class="io-section">
+ <div class="task-section">
+ <h3>Input</h3>
+ <pre class="code-block">${escapeHtml(task.input)}</pre>
+ </div>
+ </div>
+ <div class="task-section">
+ <h3>Test Class &mdash; <code>${escapeHtml(task.test_class_name)}</code></h3>
+ <pre class="code-block">${escapeHtml(task.test_class_code)}</pre>
+ </div>`
+ : `<div class="io-section">
+ <div class="task-section">
+ <h3>Input</h3>
+ <pre class="code-block">${escapeHtml(task.input)}</pre>
+ </div>
+ <div class="task-section">
+ <h3>Expected Output</h3>
+ <pre class="code-block">${escapeHtml(task.output)}</pre>
+ </div>
+ </div>`;
+
+ let html = `
+ <div class="task-details">
+ ${ioSection}
+ `;
+
+ // Show task items with ground truth answer placeholders
+ if (task.task_items && task.task_items.length > 0) {
+ html += `
+ <div class="task-section">
+ <h3>Reasoning Tasks</h3>
+ <p style="margin-bottom: 10px; color: #7f8c8d;">
+ Variable state at each execution point (correct answer shown in
+ <span style="background:#17a2b8;color:white;padding:1px 6px;border-radius:3px;font-size:0.82rem;">teal</span>):
+ </p>
+ <ul class="task-items-list">
+ `;
+
+ task.task_items.forEach(item => {
+ html += `
+ <li>
+ <span class="line-ref">Line ${item.lineno}</span>
+ <span class="var-name">${escapeHtml(item.var)}</span>
+ <span class="gt-answer loading" id="gt-${item.lineno}-${item.var}">…</span>
+ </li>
+ `;
+ });
+
+ html += `
+ </ul>
+ </div>
+ `;
+ }
+
+ // Show output prediction task if exists
+ if (task.output_pred) {
+ html += `
+ <div class="task-section">
+ <h3>Output Completion Task</h3>
+ <p style="margin-bottom: 10px; color: #7f8c8d;">
+ The model needs to complete this test assertion:
+ </p>
+ <pre class="code-block">${escapeHtml(task.output_pred)}</pre>
+ </div>
+ `;
+ }
+
+ html += `</div>`;
+
+ document.getElementById('task-content').innerHTML = html;
+
+ // Fetch and apply ground truth (coverage + variable answers)
+ if (hasGroundTruth && task.task_items) {
+ loadAndApplyGroundTruth(problem.idx, task.input_idx, task.task_items);
+ }
+ }
+
+ function escapeHtml(text) {
+ if (text === null || text === undefined) return '';
+ const div = document.createElement('div');
+ div.textContent = text;
+ return div.innerHTML;
+ }
+
+ loadProblem();
+ </script>
+ {% endblock %}
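The `attachArrowHoverHandlers` function in this file relies on `AbortController`-based listener cleanup: all hover handlers are registered against one shared signal, so a single `abort()` detaches every handler before a fresh set is attached. A minimal standalone sketch of that pattern, using Node's built-in `EventTarget` in place of a DOM element (the `'hover'` event name and counter are illustrative):

```javascript
// Every listener registered with the same signal is detached by
// one abort() call, so no per-listener bookkeeping is needed.
const target = new EventTarget();
let fired = 0;

const controller = new AbortController();
target.addEventListener('hover', () => { fired += 1; },
                        { signal: controller.signal });

target.dispatchEvent(new Event('hover')); // handler runs: fired is now 1
controller.abort();                        // detaches the handler
target.dispatchEvent(new Event('hover')); // handler no longer runs

console.log(fired); // → 1
```

A fresh `AbortController` per attach pass avoids duplicate handlers without keeping a reference to each callback, which is why the template swaps `_markerListenersAbort` rather than calling `removeEventListener`.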