egor-bogomolov committed
Commit a096bac · 1 Parent(s): b9b3992

Rebrand to ML4SE Benchmark Viewer, make datasets a core dependency

Files changed (9):
  1. .gitignore +71 -0
  2. CLAUDE.md +351 -0
  3. README.md +47 -7
  4. app.py +359 -0
  5. dataset_adapters.py +413 -0
  6. pyproject.toml +38 -0
  7. templates/base.html +211 -0
  8. templates/index.html +330 -0
  9. templates/problem.html +1015 -0
.gitignore ADDED
@@ -0,0 +1,71 @@
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# Virtual environments
venv/
ENV/
env/
.venv

# IDE
.vscode/
.idea/
*.swp
*.swo
*~
.DS_Store

# Flask
instance/
.webassets-cache
*.log

# Environment variables
.env
.env.local

# uv
uv.lock

# AI Assistant context files
CLAUDE.md

# Testing
.pytest_cache/
.coverage
htmlcov/
.tox/
.hypothesis/

# Jupyter Notebook
.ipynb_checkpoints

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Ruff
.ruff_cache/
CLAUDE.md ADDED
@@ -0,0 +1,351 @@
# CLAUDE.md - Project Context for AI Assistants

## Project Overview

**ml4se-evals-visualization** is a web-based visualization tool for browsing and analyzing benchmark datasets for machine learning code evaluation. The primary dataset is DREval (Dynamic Reasoning Evaluation), but the tool supports multiple datasets through an adapter pattern.

**Tech Stack:**
- **Backend**: Flask (Python web framework)
- **Frontend**: Jinja2 templates with vanilla JavaScript
- **Syntax Highlighting**: Pygments
- **Package Management**: uv (modern Python package manager)
- **Linting/Formatting**: ruff

## Architecture

### Core Components

1. **app.py** - Main Flask application
   - API endpoints for dataset listing, problem browsing, and code highlighting
   - Ground truth execution data endpoints (DREval only)
   - Template rendering for index and problem detail pages
   - Port: 7860 (default), configurable via PORT env var
   - Debug mode: controlled by FLASK_DEBUG env var

2. **dataset_adapters.py** - Dataset adapter system
   - `DatasetAdapter` base class with a common interface
   - Concrete adapters for: DREval, CRUXEval, HumanEval+, BigOBench
   - Registry pattern (`REGISTRY` dict) for dataset management
   - Each adapter normalizes dataset-specific formats to the common API

3. **templates/** - Jinja2 HTML templates
   - `base.html` - Base layout (IMPORTANT: `{% block scripts %}` is already wrapped in a `<script>` tag)
   - `index.html` - Problem list view with filtering
   - `problem.html` - Problem detail view with syntax highlighting

4. **requirements.txt** / **pyproject.toml** - Dependencies
   - Core: flask, pygments
   - Optional HF: datasets (for CRUXEval, HumanEval+, BigOBench)
   - Dev: ruff

### Data Flow

```
User Request → Flask Route → Dataset Adapter → API Response → Template/JSON

                   Helper Functions
                 (highlight_code, etc.)
```

### Key Design Patterns

- **Adapter Pattern**: Each dataset type has an adapter implementing the `DatasetAdapter` interface
- **Dependency Injection**: Helper functions are injected into the adapters module via `_set_helpers()`
- **URL Routing**: Supports both default (DREval) and explicit dataset slug routes
  - `/problem/123` → DREval problem 123
  - `/cruxeval/problem/45` → CRUXEval problem 45

## Important Files & Locations

### Python Files
- **app.py**: Main entry point, Flask routes, ground truth logic
- **dataset_adapters.py**: Adapter implementations for all datasets
- **ground_truth_loader.py**: (parent dir) Loads execution traces for DREval
- **dynamics.py**: (parent dir) Contains the `Nil` singleton for missing values

### Data Files
- **data/DREval_data.jsonl**: Main DREval problem data (328 problems)
- **data/DREval_tasks.jsonl**: Task definitions for each problem
- **data/ground_truth/**: Execution traces for ground truth visualization

### Template Files
- **templates/base.html**: Layout with navigation, CSS includes
- **templates/index.html**: Problem list with dataset selector, search, filtering
- **templates/problem.html**: Problem detail with code highlighting, test inputs, ground truth overlay

## Key Functionalities

### 1. Dataset Support

**DREval** (primary dataset):
- 328 problems (164 HumanEval + 164 ClassEval)
- Ground truth execution traces available
- Tasks: Coverage, Path, State, Output predictions
- Test inputs with expected outputs

**CRUXEval** (HuggingFace):
- Input/output prediction tasks
- Single-function execution reasoning

**HumanEval+** (HuggingFace):
- Extended HumanEval with additional tests
- No execution traces

**BigOBench** (HuggingFace):
- Algorithm complexity analysis
- Multiple solutions per problem with time/space complexity labels

### 2. Problem Browsing

- **List View** (`/`): Grid of problem cards with filtering
  - Dataset selector (dropdown)
  - Search by function name or task ID
  - Source filter (HumanEval/ClassEval for DREval)
  - Problem cards show: task_id, entry_point, source badge, input count

- **Detail View** (`/problem/<idx>`): Full problem display
  - Syntax-highlighted code (Pygments)
  - Test input selector (buttons at top)
  - Task item visualization (yellow line highlights, purple variable badges)
  - Previous/Next navigation
  - Ground truth overlay (DREval only, when available)

### 3. Ground Truth Visualization (DREval only)

When ground truth data is available:
- **Coverage overlay**: Executed lines highlighted in green
- **Variable values**: Hovering over purple badges shows actual values
- **Next line arrows**: Hovering over line numbers shows execution flow
- Fetched via `/api/<dataset>/problem/<idx>/ground_truth/<input_idx>`

### 4. Code Highlighting

- **Pygments-based** syntax highlighting with table line numbers
- **Line offset handling**: Leading newlines stripped for display
- **Dynamic highlighting**: Updates when switching test inputs
- **Task line markers**: Yellow highlights for lines with queries
- **Variable badges**: Purple badges showing which variables are queried

## API Endpoints

### Dataset Operations
- `GET /api/datasets` - List all available datasets
- `GET /api/<dataset_slug>/problems` - Get the problem list for a dataset
- `GET /api/<dataset_slug>/problem/<idx>` - Get problem detail
- `GET /api/<dataset_slug>/problem/<idx>/ground_truth/<input_idx>` - Get execution data (DREval only)

### Utility
- `GET /api/css` - Pygments CSS for syntax highlighting
- `GET /api/highlight_code?code=...&lines=...` - Highlight arbitrary code

## Common Development Tasks

### Running the Application

```bash
# Development mode (with debug)
FLASK_DEBUG=true uv run python app.py

# Production mode
uv run python app.py

# Custom port
PORT=8080 uv run python app.py
```

### Installing Dependencies

```bash
# Core dependencies only
uv sync

# With HuggingFace datasets support
uv sync --extra hf

# Dev dependencies (includes ruff)
uv sync --extra dev
```

### Code Quality

```bash
# Format code
uv run ruff format .

# Lint code
uv run ruff check .

# Auto-fix issues
uv run ruff check --fix .
```

### Adding a New Dataset

1. Create an adapter class in `dataset_adapters.py` inheriting from `DatasetAdapter`
2. Implement the required methods: `problem_count()`, `get_problem_summary()`, `get_problem_detail()`
3. Set the class attributes: `slug`, `display_name`, `has_ground_truth`, `has_tasks`
4. Register the adapter in `REGISTRY` (usually in `register_hf_datasets()` or similar)
5. Test the endpoints: `/api/<slug>/problems` and `/api/<slug>/problem/<idx>`

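The steps above can be sketched as a minimal adapter. This is an illustrative sketch, not the project's code: `DatasetAdapter` and `REGISTRY` are stubbed inline so it is self-contained, and `ToyAdapter` with its in-memory records is a hypothetical example.

```python
from typing import Any

# Stand-ins for the names exported by dataset_adapters.py.
REGISTRY: dict[str, "DatasetAdapter"] = {}


class DatasetAdapter:
    slug = ""
    display_name = ""
    has_ground_truth = False
    has_tasks = False


class ToyAdapter(DatasetAdapter):
    """Hypothetical adapter over an in-memory list of problem records."""

    slug = "toy"                 # used in URLs: /api/toy/problems
    display_name = "Toy Dataset"

    def __init__(self, records: list[dict[str, Any]]):
        self._records = records

    def problem_count(self) -> int:
        return len(self._records)

    def get_problem_summary(self, idx: int) -> dict[str, Any]:
        # Just enough for a problem card in the list view.
        return {"idx": idx, "task_id": self._records[idx]["task_id"]}

    def get_problem_detail(self, idx: int) -> dict[str, Any]:
        return dict(self._records[idx])


# Step 4: register the adapter under its slug.
adapter = ToyAdapter([{"task_id": "Toy/0", "code": "print('hi')"}])
REGISTRY[adapter.slug] = adapter
```

Once registered, the generic routes in app.py serve the new dataset with no further changes.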
## Important Implementation Details

### Line Number Indexing

**Critical**: Multiple indexing conventions are used:

1. **Original code** (in data files): 1-indexed, includes leading newlines
2. **Stripped code** (displayed): 1-indexed, leading newlines removed
3. **Coverage data**: 0-indexed relative to the original code
4. **Pygments output**: 1-indexed, starts from the linenostart parameter

**Conversion formula** for task line numbers:
```python
stripped_lineno = original_lineno - offset
# where offset = number of leading newlines
```
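A quick worked example of the conversion. The helper below mirrors `_code_offset()` from app.py; the sample code string is invented for illustration.

```python
def code_offset(code: str) -> int:
    """Count leading newlines, mirroring _code_offset() in app.py."""
    offset = 0
    for ch in code:
        if ch != "\n":
            break
        offset += 1
    return offset


code = "\n\ndef f(x):\n    return x + 1\n"
offset = code_offset(code)            # 2 leading newlines are stripped
original_lineno = 3                   # "def f(x):" in the original code
stripped_lineno = original_lineno - offset
assert stripped_lineno == 1           # first line of the stripped display
```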

### Template JavaScript Rules

**IMPORTANT**: The `{% block scripts %}` in `base.html` is already wrapped in a `<script>` tag, so child templates must NOT add their own `<script>` tags inside it.

❌ **Wrong** (creates nested script tags):
```html
{% block scripts %}
<script>
// Your code
</script>
{% endblock %}
```

✅ **Correct**:
```html
{% block scripts %}
// Your code directly here
{% endblock %}
```

### ClassEval Test Parsing

ClassEval problems have test classes defined in the `test` field. The tool:
1. Parses the test code with `ast.parse()` (safe, no execution)
2. Extracts classes matching the pattern `{entry_point}Test*`
3. Associates each test class with an input_idx
4. Displays the test class code alongside the problem code

### Ground Truth Data Format

Ground truth execution records contain:
- **coverage**: List of 0-indexed line numbers executed
- **variable_values**: Dict mapping (lineno, var) → list of values
- **next_lines**: Dict mapping lineno → list of next line numbers
- **status**: "ok" | "error" indicating execution success
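For illustration, a record in this shape might look as follows. The field names come from the list above; the concrete values are invented, and the coverage conversion repeats the formula from the indexing section.

```python
# Hypothetical ground-truth record (values invented for illustration).
record = {
    "coverage": [0, 1, 3],                      # 0-indexed executed lines
    "variable_values": {(2, "total"): [0, 5]},  # (lineno, var) -> values over time
    "next_lines": {2: [3]},                     # lineno -> possible next lines
    "status": "ok",
}

# Convert coverage to the 1-indexed numbering used by the viewer overlay.
offset = 0  # number of leading newlines stripped from the displayed code
coverage_1indexed = [ln + 1 - offset for ln in record["coverage"]]
assert coverage_1indexed == [1, 2, 4]
```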

## Troubleshooting

### "No such file or directory" error
- Ensure you're in the project root directory
- Check that `data/DREval_data.jsonl` exists
- Verify relative path assumptions in `app.py` (DATA_DIR calculation)

### "Ground truth unavailable"
- Check that `ground_truth_loader.py` exists in the parent directory
- Verify `data/ground_truth/` directory contains execution traces
- Ensure GT_LOADER is not None (check startup logs)

### HuggingFace datasets not loading
- Install optional dependencies: `uv sync --extra hf`
- Check network connectivity (datasets download from the HF hub)
- Review startup logs for specific error messages

### JavaScript not executing
- Check browser console for errors (F12)
- Verify `{% block scripts %}` is not creating nested `<script>` tags
- Hard refresh (Ctrl+F5 or Cmd+Shift+R) to clear cache
- Ensure API endpoints return valid JSON

### Line numbers misaligned
- Check offset calculation in `_code_offset()`
- Verify line number adjustments in adapter `get_problem_detail()`
- Ensure Pygments linenostart matches expected starting line

## File Paths and Navigation

All file references should use the fleet-file:// scheme with absolute paths:

```markdown
[filename](fleet-file://4idaunrhu7c1r7voo46l/Users/Egor.Bogomolov_1/work/ml4se-evals-visualization/app.py?type=file&root=%252F)
```

For line references (always 0-indexed):
```markdown
[filename:12-16](fleet-file://4idaunrhu7c1r7voo46l/Users/Egor.Bogomolov_1/work/ml4se-evals-visualization/app.py?type=file&linesData=%7B%22range%22%3A%7B%22first%22%3A527%2C%22second%22%3A732%7D%2C%22lines%22%3A%7B%22first%22%3A12%2C%22second%22%3A16%7D%7D&root=%252F)
```

## Git Information

- **Current branch**: main
- **Main branch**: main (use for PRs)
- **Recent commit**: b9b3992 "initial commit"
- **Untracked files**: Dockerfile, __pycache__/, app.py, dataset_adapters.py, pyproject.toml, requirements.txt, templates/, uv.lock
- **Modified files**: README.md

## Environment

- **Platform**: macOS (darwin 24.6.0)
- **Python**: 3.8+ required
- **Working directory**: /Users/Egor.Bogomolov_1/work/ml4se-evals-visualization
- **Is git repo**: Yes

## Testing Checklist

When making changes, verify:

- [ ] Flask server starts without errors
- [ ] All datasets appear in the dropdown (if HF dependencies are installed)
- [ ] Problem list loads and displays correctly
- [ ] Search/filter functionality works
- [ ] Problem detail page renders with syntax highlighting
- [ ] Test input selector updates line highlights
- [ ] Ground truth overlay displays (DREval only)
- [ ] Previous/Next navigation works
- [ ] API endpoints return valid JSON
- [ ] No console errors in browser (F12)
- [ ] Code passes ruff checks

## External Dependencies

### Python Packages
- **flask**: Web framework (>=3.0.0)
- **pygments**: Syntax highlighting (>=2.17.2)
- **datasets**: HuggingFace datasets (>=2.14.0, optional)
- **ruff**: Linting and formatting (>=0.8.0, dev)

### Data Sources
- **DREval**: Local JSONL files in the data/ directory
- **CRUXEval**: cruxeval-org/cruxeval (HuggingFace Hub)
- **HumanEval+**: evalplus/humanevalplus (HuggingFace Hub)
- **BigOBench**: facebook/BigOBench (HuggingFace Hub)

## Future Enhancements (Not Implemented)

Potential areas for improvement:
- User authentication and saved preferences
- Export functionality (PDF, CSV)
- Comparison view for multiple solutions
- Interactive debugging/stepping through execution
- Code editing and re-evaluation
- Dataset upload functionality
- Performance metrics visualization

## Related Documentation

- **README.md**: User-facing documentation, installation instructions
- **pyproject.toml**: Package metadata, dependencies, ruff configuration
- **Dockerfile**: Container deployment configuration (if present)
- **requirements.txt**: Pip-format dependency list

---

**Last Updated**: 2026-03-02
**Project Status**: Active Development
**Primary Maintainer**: Egor Bogomolov
README.md CHANGED
@@ -1,12 +1,52 @@
  ---
- title: Ml4se Evals Visualization
- emoji: 🐠
- colorFrom: red
- colorTo: indigo
+ title: ML4SE Benchmark Viewer
+ emoji: 📊
+ colorFrom: blue
+ colorTo: green
  sdk: docker
  pinned: false
- license: apache-2.0
- short_description: Space for inspecting popular ML4SE datasets
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # ML4SE Benchmark Viewer
+
+ A web-based interface for browsing and manually inspecting individual datapoints from popular ML4SE (Machine Learning for Software Engineering) benchmark datasets.
+
+ ## Supported Datasets
+
+ | Dataset | Description |
+ |---------|-------------|
+ | **CRUXEval** | Input/output prediction tasks for single-function execution reasoning |
+ | **HumanEval+** | Extended HumanEval with additional tests |
+ | **BigOBench** | Algorithm complexity analysis with time/space complexity labels |
+
+ **Coming soon:** DREval (Dynamic Reasoning Evaluation)
+
+ ## Installation & Usage
+
+ ```bash
+ # Install dependencies
+ uv sync
+
+ # Run the server (default port: 7860)
+ uv run python app.py
+
+ # Development mode with auto-reload
+ FLASK_DEBUG=true uv run python app.py
+ ```
+
+ Then open http://localhost:7860.
+
+ ## Development
+
+ ```bash
+ # Lint and format
+ uv run ruff check .
+ uv run ruff format .
+ ```
+
+ ### Adding a New Dataset
+
+ 1. Create an adapter class in `dataset_adapters.py` inheriting from `DatasetAdapter`
+ 2. Implement required methods: `problem_count()`, `get_problem_summary()`, `get_problem_detail()`
+ 3. Register the adapter in the `REGISTRY`
+ 4. Test: `/api/<slug>/problems` and `/api/<slug>/problem/<idx>`
app.py ADDED
@@ -0,0 +1,359 @@
"""
ML4SE Benchmark Viewer

A web-based interface for browsing and inspecting individual datapoints
from popular ML4SE benchmark datasets (DREval, CRUXEval, HumanEval+,
BigOBench, and others).
"""

import ast as _ast
import json
import os
import sys
from pathlib import Path

from flask import Flask, jsonify, render_template, request
from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import PythonLexer

app = Flask(__name__)

# Path to data files
DATA_DIR = Path(__file__).parent.parent / "data"
DATA_FILE = DATA_DIR / "DREval_data.jsonl"
TASKS_FILE = DATA_DIR / "DREval_tasks.jsonl"

# Add parent dir to path so we can import ground_truth_loader
sys.path.insert(0, str(Path(__file__).parent.parent))
try:
    from ground_truth_loader import GroundTruthLoader
    GT_LOADER = GroundTruthLoader(data_dir=DATA_DIR / "ground_truth")
except Exception as e:
    print(f"Warning: ground truth loader unavailable: {e}")
    GT_LOADER = None


# Load data at startup
def load_data():
    """Load the benchmark data and tasks."""
    if not DATA_FILE.exists() or not TASKS_FILE.exists():
        raise FileNotFoundError(
            f"Data files not found. Expected files:\n"
            f"  - {DATA_FILE}\n"
            f"  - {TASKS_FILE}\n"
            f"Please ensure you're running from the visualization directory "
            f"and the data files exist in ../data/"
        )

    data_records = []
    with open(DATA_FILE) as f:
        for line in f:
            data_records.append(json.loads(line))

    task_records = []
    with open(TASKS_FILE) as f:
        for line in f:
            task_records.append(json.loads(line))

    return data_records, task_records


try:
    DATA, TASKS = load_data()
except FileNotFoundError as e:
    print(f"Warning: DREval data files not found, DREval dataset will be unavailable: {e}")
    DATA, TASKS = [], []


def _extract_test_classes(test_code: str, cls_name: str) -> list:
    """
    Parse a ClassEval unittest module and return one dict per test class
    in definition order: {"name": ..., "code": ...}.

    Matches top-level classes whose names start with f"{cls_name}Test",
    which is the same pattern used by ClassFactory.create_test_classes().
    Uses ast.parse only — no code execution, safe to call from the web server.
    """
    try:
        tree = _ast.parse(test_code)
    except SyntaxError as e:
        print(f"Warning: SyntaxError parsing test code for {cls_name}: {e}")
        return []
    lines = test_code.splitlines(keepends=True)
    prefix = f"{cls_name}Test"
    result = []
    for node in tree.body:  # top-level definitions, preserves source order
        if isinstance(node, _ast.ClassDef) and node.name.startswith(prefix):
            start = node.lineno - 1  # ast lineno is 1-indexed
            end = node.end_lineno  # end_lineno is inclusive; slice is exclusive
            result.append({
                "name": node.name,
                "code": "".join(lines[start:end]),
            })
    return result


def _code_offset(code: str) -> int:
    """Number of leading newlines that Pygments will strip."""
    offset = 0
    for ch in code:
        if ch == '\n':
            offset += 1
        else:
            break
    return offset


def highlight_code(code, highlight_lines=None):
    """
    Syntax highlight Python code with optional line highlighting.

    Args:
        code: The Python code to highlight
        highlight_lines: List of line numbers (1-indexed) to highlight

    Returns:
        HTML string with syntax highlighted code
    """
    formatter = HtmlFormatter(
        linenos="table", cssclass="source", hl_lines=highlight_lines or [], linenostart=1
    )
    return highlight(code, PythonLexer(), formatter)


def get_css():
    """Get CSS for syntax highlighting."""
    return HtmlFormatter().get_style_defs(".source")


# ---------------------------------------------------------------------------
# Dataset adapter registration
# ---------------------------------------------------------------------------

from dataset_adapters import REGISTRY, _set_helpers, register_dreval, register_hf_datasets

# Inject helper functions into the adapters module (avoids circular imports)
_set_helpers(highlight_code, _code_offset, _extract_test_classes)

# Register DREval only if data is available
if DATA:
    register_dreval(DATA, TASKS, GT_LOADER)

# Optionally register HuggingFace datasets
register_hf_datasets()


def _get_adapter(dataset_slug: str):
    """Return the adapter for the given slug, or None."""
    return REGISTRY.get(dataset_slug)


# ---------------------------------------------------------------------------
# Routes
# ---------------------------------------------------------------------------

@app.route("/")
def index():
    """Main page showing list of all benchmark problems."""
    return render_template("index.html", total_problems=len(DATA))


@app.route("/api/datasets")
def get_datasets():
    """Return list of available datasets for the UI dataset selector."""
    return jsonify([
        {
            "slug": slug,
            "display_name": adapter.display_name,
            "problem_count": adapter.problem_count(),
            "has_ground_truth": adapter.has_ground_truth,
        }
        for slug, adapter in REGISTRY.items()
    ])


@app.route("/api/problems")
@app.route("/api/<dataset_slug>/problems")
def get_problems(dataset_slug="dreval"):
    """API endpoint to get list of all problems for a dataset."""
    adapter = _get_adapter(dataset_slug)
    if adapter is None:
        return jsonify({"error": f"Unknown dataset: {dataset_slug}"}), 404

    problems = [adapter.get_problem_summary(i) for i in range(adapter.problem_count())]
    return jsonify(problems)


@app.route("/api/problem/<int:idx>")
@app.route("/api/<dataset_slug>/problem/<int:idx>")
def get_problem(idx, dataset_slug="dreval"):
    """API endpoint to get detailed information about a specific problem."""
    adapter = _get_adapter(dataset_slug)
    if adapter is None:
        return jsonify({"error": f"Unknown dataset: {dataset_slug}"}), 404

    if not (0 <= idx < adapter.problem_count()):
        return jsonify({"error": "Invalid problem index"}), 404

    try:
        return jsonify(adapter.get_problem_detail(idx))
    except (KeyError, IndexError, ValueError) as exc:
        return jsonify({"error": f"Internal error: {exc}"}), 500


@app.route("/api/highlight_code")
def highlight_code_api():
    """API endpoint to highlight code with specific lines."""
    code = request.args.get("code", "")
    lines_str = request.args.get("lines", "")

    if lines_str:
        try:
            lines = [int(x) for x in lines_str.split(",") if x.strip()]
        except ValueError:
            return jsonify({"error": "Invalid line numbers"}), 400
    else:
        lines = None

    highlighted = highlight_code(code, lines)
    return jsonify({"highlighted_code": highlighted})


@app.route("/problem/<int:idx>")
@app.route("/<dataset_slug>/problem/<int:idx>")
def problem_detail(idx, dataset_slug="dreval"):
    """Page showing detailed view of a specific problem."""
    adapter = _get_adapter(dataset_slug)
    if adapter is None:
        return jsonify({"error": "Unknown dataset"}), 404

    if not (0 <= idx < adapter.problem_count()):
        return jsonify({"error": "Problem not found"}), 404

    return render_template(
        "problem.html",
        idx=idx,
        css=get_css(),
        total_problems=adapter.problem_count(),
        dataset_slug=dataset_slug,
        dataset_name=adapter.display_name,
        has_ground_truth=adapter.has_ground_truth,
        has_tasks=adapter.has_tasks,
    )


@app.route("/api/css")
def get_css_api():
    """API endpoint to get CSS for syntax highlighting."""
    return get_css(), 200, {"Content-Type": "text/css"}


@app.route("/api/problem/<int:idx>/ground_truth/<int:input_idx>")
@app.route("/api/<dataset_slug>/problem/<int:idx>/ground_truth/<int:input_idx>")
def get_ground_truth(idx, input_idx, dataset_slug="dreval"):
    """
    Return ground truth execution data for one (problem, input) pair.

    Ground truth is only available for DREval.

    Response fields:
    - coverage: sorted list of 1-indexed line numbers that were executed
    - variable_answers: [{lineno, var, values, answer_str}] matching task items
    - status: "ok" | "error" | "unavailable"
    """
    if dataset_slug != "dreval":
        return jsonify({"status": "unavailable", "message": "Ground truth only available for DREval"}), 503

    if GT_LOADER is None:
        return jsonify({"status": "unavailable"}), 503

    if not (0 <= idx < len(DATA)):
        return jsonify({"error": "Invalid problem index"}), 404

    problem = DATA[idx]
    task_id = problem["task_id"]

    try:
        exec_rec = GT_LOADER.get_execution(task_id, input_idx)
    except KeyError as e:
        return jsonify({"status": "error", "message": str(e)}), 404

    if exec_rec.get("status") == "error":
        return jsonify({"status": "error", "message": "Execution failed for this input"}), 200

    # Coverage: convert 0-indexed (relative to original code) → 1-indexed
    # (relative to stripped code shown by Pygments).
    # original_0indexed + 1 = original_1indexed;
    # stripped_1indexed = original_1indexed - offset
    code = problem["code"]
    offset = _code_offset(code)
    coverage_1indexed = [ln + 1 - offset for ln in exec_rec["coverage"]]

    # Compute total line count from stripped code
    total_lines = len(code[offset:].splitlines())

    # Determine which task items belong to this input_idx
    task = TASKS[idx]
    task_items = []
    for t in task["tasks"]:
        if t["input_idx"] == input_idx:
            task_items = t.get("task", [])
            break

    # Resolve variable answers for each task item
    variable_answers = []
    for item in task_items:
        lineno = item["lineno"]  # 1-indexed relative to original code
        var = item["var"]
        try:
            # Loader expects original 1-indexed lineno (before stripping)
            values = GT_LOADER.get_variable_values(task_id, input_idx, lineno, var)
        except (KeyError, ValueError):
            values = None

        from dynamics import Nil
        if values is Nil or values is None:
            answer_str = "(not available)"
        elif len(values) == 1:
            answer_str = repr(values[0])
        else:
            answer_str = "[" + ", ".join(repr(v) for v in values) + "]"

        variable_answers.append({
            "lineno": lineno - offset,  # adjusted for stripped code display
            "var": var,
            "answer_str": answer_str,
        })

    # Resolve next lines for each task item (for arrow visualization on hover)
    next_lines_answers = []
    processed_linenos = set()
    for item in task_items:
        lineno = item["lineno"]  # already 1-indexed
        if lineno in processed_linenos:
            continue
        processed_linenos.add(lineno)
        try:
            next_lines = GT_LOADER.get_next_lines(task_id, input_idx, lineno)
        except (KeyError, ValueError) as exc:
            print(f"Warning: get_next_lines({task_id}, {input_idx}, {lineno}) failed: {exc}")
            next_lines = [-1]
        next_lines_answers.append({
            "lineno": lineno,
            "next_lines": next_lines,
        })

    return jsonify({
        "status": "ok",
        "coverage": coverage_1indexed,
        "total_lines": total_lines,
        "variable_answers": variable_answers,
        "next_lines_answers": next_lines_answers,
    })


if __name__ == "__main__":
    debug_mode = os.getenv("FLASK_DEBUG", "false").lower() == "true"
    port = int(os.getenv("PORT", 7860))
    app.run(debug=debug_mode, host="0.0.0.0", port=port)
dataset_adapters.py ADDED
@@ -0,0 +1,413 @@
"""
Dataset adapters for the ML4SE Benchmark Viewer.

Each adapter normalises a different benchmark dataset into a common API shape
so the Flask routes and templates can handle them uniformly.

The REGISTRY dict maps slug strings (used in URLs) to adapter instances.
"""

from __future__ import annotations

from pathlib import Path
from typing import Any

# These are imported from app.py at registration time to avoid circular imports.
_highlight_code = None
_code_offset = None
_extract_test_classes = None


def _set_helpers(highlight_code_fn, code_offset_fn, extract_test_classes_fn):
    """Called once by app.py to inject helper functions."""
    global _highlight_code, _code_offset, _extract_test_classes
    _highlight_code = highlight_code_fn
    _code_offset = code_offset_fn
    _extract_test_classes = extract_test_classes_fn


# ---------------------------------------------------------------------------
# Registry
# ---------------------------------------------------------------------------

REGISTRY: dict[str, "DatasetAdapter"] = {}


# ---------------------------------------------------------------------------
# Base class
# ---------------------------------------------------------------------------

class DatasetAdapter:
    slug: str = ""
    display_name: str = ""
    has_ground_truth: bool = False
    has_tasks: bool = False

    def problem_count(self) -> int:
        raise NotImplementedError

    def get_problem_summary(self, idx: int) -> dict[str, Any]:
        raise NotImplementedError

    def get_problem_detail(self, idx: int) -> dict[str, Any]:
        raise NotImplementedError


# ---------------------------------------------------------------------------
# DREval adapter (wraps existing DATA / TASKS / GT_LOADER globals)
# ---------------------------------------------------------------------------

class DREvalAdapter(DatasetAdapter):
60
+ class DREvalAdapter(DatasetAdapter):
61
+ slug = "dreval"
62
+ display_name = "DREval"
63
+ has_ground_truth = True
64
+ has_tasks = True
65
+
66
+ def __init__(self, data: list, tasks: list, gt_loader):
67
+ self._data = data
68
+ self._tasks = tasks
69
+ self._gt_loader = gt_loader
70
+
71
+ def problem_count(self) -> int:
72
+ return len(self._data)
73
+
74
+ def get_problem_summary(self, idx: int) -> dict[str, Any]:
75
+ record = self._data[idx]
76
+ return {
77
+ "idx": idx,
78
+ "task_id": record["task_id"],
79
+ "entry_point": record["entry_point"],
80
+ "num_inputs": len(record["inputs"]),
81
+ "source": "ClassEval" if record.get("test") is not None else "HumanEval",
82
+ }
83
+
84
+ def get_problem_detail(self, idx: int) -> dict[str, Any]:
85
+ problem = self._data[idx]
86
+ task = self._tasks[idx]
87
+
88
+ code = problem["code"]
89
+ offset = _code_offset(code)
90
+ code = code[offset:]
91
+ highlighted_code = _highlight_code(code)
92
+
93
+ tasks_info = []
94
+ for task_item in task["tasks"]:
95
+ adjusted_items = []
96
+ for item in task_item.get("task", []):
97
+ adj = dict(item)
98
+ if "lineno" in adj:
99
+ adj["lineno"] -= offset
100
+ adjusted_items.append(adj)
101
+
102
+ input_idx = task_item["input_idx"]
103
+ inp = problem["inputs"][input_idx] if input_idx < len(problem["inputs"]) else ""
104
+ out = problem["outputs"][input_idx] if input_idx < len(problem["outputs"]) else ""
105
+
106
+ task_info = {
107
+ "input_idx": input_idx,
108
+ "input": inp,
109
+ "output": out,
110
+ "task_items": adjusted_items,
111
+ }
112
+
113
+ if "output_pred" in task_item:
114
+ task_info["output_pred"] = task_item["output_pred"]
115
+
116
+ task_lines = set()
117
+ for item in adjusted_items:
118
+ if "lineno" in item:
119
+ task_lines.add(item["lineno"])
120
+ task_info["task_lines"] = sorted(list(task_lines))
121
+
122
+ tasks_info.append(task_info)
123
+
124
+ if problem.get("test") is not None:
125
+ tc_list = _extract_test_classes(problem["test"], problem["entry_point"])
126
+ for task_info in tasks_info:
127
+ idx_in_tc = task_info["input_idx"]
128
+ if idx_in_tc < len(tc_list):
129
+ task_info["test_class_name"] = tc_list[idx_in_tc]["name"]
130
+ task_info["test_class_code"] = tc_list[idx_in_tc]["code"]
131
+
132
+ return {
133
+ "idx": idx,
134
+ "task_id": problem["task_id"],
135
+ "entry_point": problem["entry_point"],
136
+ "code": code,
137
+ "highlighted_code": highlighted_code,
138
+ "inputs": problem["inputs"],
139
+ "outputs": problem["outputs"],
140
+ "test": problem.get("test"),
141
+ "tasks": tasks_info,
142
+ "source": "ClassEval" if problem.get("test") is not None else "HumanEval",
143
+ "has_ground_truth": True,
144
+ "has_tasks": True,
145
+ }
146
+
147
+
148
+ # ---------------------------------------------------------------------------
149
+ # CRUXEval adapter (HuggingFace: cruxeval-org/cruxeval)
150
+ # ---------------------------------------------------------------------------
151
+
152
+ class CRUXEvalAdapter(DatasetAdapter):
153
+ slug = "cruxeval"
154
+ display_name = "CRUXEval"
155
+ has_ground_truth = False
156
+ has_tasks = True
157
+
158
+ def __init__(self, hf_dataset):
159
+ self._ds = hf_dataset
160
+
161
+ def problem_count(self) -> int:
162
+ return len(self._ds)
163
+
164
+ def get_problem_summary(self, idx: int) -> dict[str, Any]:
165
+ row = self._ds[idx]
166
+ return {
167
+ "idx": idx,
168
+ "task_id": row["id"],
169
+ "entry_point": "f",
170
+ "num_inputs": 1,
171
+ "source": "CRUXEval",
172
+ }
173
+
174
+ def get_problem_detail(self, idx: int) -> dict[str, Any]:
175
+ row = self._ds[idx]
176
+ code = row["code"]
177
+ return {
178
+ "idx": idx,
179
+ "task_id": row["id"],
180
+ "entry_point": "f",
181
+ "code": code,
182
+ "highlighted_code": _highlight_code(code),
183
+ "inputs": [row["input"]],
184
+ "outputs": [row["output"]],
185
+ "test": None,
186
+ "tasks": [
187
+ {
188
+ "name": "Output Prediction",
189
+ "description": "Given the code and input, predict the output.",
190
+ "given": "input",
191
+ "predict": "output",
192
+ "input": row["input"],
193
+ "output": row["output"],
194
+ },
195
+ {
196
+ "name": "Input Prediction",
197
+ "description": "Given the code and output, predict the input.",
198
+ "given": "output",
199
+ "predict": "input",
200
+ "input": row["input"],
201
+ "output": row["output"],
202
+ },
203
+ ],
204
+ "source": "CRUXEval",
205
+ "has_ground_truth": False,
206
+ "has_tasks": True,
207
+ }
208
+
209
+
210
+ # ---------------------------------------------------------------------------
211
+ # HumanEval+ adapter (HuggingFace: evalplus/humanevalplus)
212
+ # ---------------------------------------------------------------------------
213
+
214
+ class HumanEvalPlusAdapter(DatasetAdapter):
215
+ slug = "humanevalplus"
216
+ display_name = "HumanEval+"
217
+ has_ground_truth = False
218
+ has_tasks = False
219
+
220
+ def __init__(self, hf_dataset):
221
+ self._ds = hf_dataset
222
+
223
+ def problem_count(self) -> int:
224
+ return len(self._ds)
225
+
226
+ def get_problem_summary(self, idx: int) -> dict[str, Any]:
227
+ row = self._ds[idx]
228
+ return {
229
+ "idx": idx,
230
+ "task_id": row["task_id"],
231
+ "entry_point": row["entry_point"],
232
+ "num_inputs": 0,
233
+ "source": "HumanEval+",
234
+ }
235
+
236
+ def get_problem_detail(self, idx: int) -> dict[str, Any]:
237
+ row = self._ds[idx]
238
+ code = row["prompt"] + row["canonical_solution"]
239
+ return {
240
+ "idx": idx,
241
+ "task_id": row["task_id"],
242
+ "entry_point": row["entry_point"],
243
+ "code": code,
244
+ "highlighted_code": _highlight_code(code),
245
+ "inputs": [],
246
+ "outputs": [],
247
+ "test": row["test"],
248
+ "tasks": [],
249
+ "source": "HumanEval+",
250
+ "has_ground_truth": False,
251
+ "has_tasks": False,
252
+ }
253
+
254
+
255
+ # ---------------------------------------------------------------------------
256
+ # BigOBench adapter (HuggingFace: facebook/BigOBench)
257
+ # ---------------------------------------------------------------------------
258
+
259
+ class BigOBenchAdapter(DatasetAdapter):
260
+ slug = "bigobench"
261
+ display_name = "BigOBench"
262
+ has_ground_truth = False
263
+ has_tasks = False
264
+
265
+ def __init__(self, problems: list[dict[str, Any]]):
266
+ self._problems = problems
267
+
268
+ def problem_count(self) -> int:
269
+ return len(self._problems)
270
+
271
+ def get_problem_summary(self, idx: int) -> dict[str, Any]:
272
+ prob = self._problems[idx]
273
+ return {
274
+ "idx": idx,
275
+ "task_id": prob["problem_id"],
276
+ "entry_point": prob["problem_name"],
277
+ "num_inputs": len(prob["solutions"]),
278
+ "source": "BigOBench",
279
+ }
280
+
281
+ def get_problem_detail(self, idx: int) -> dict[str, Any]:
282
+ prob = self._problems[idx]
283
+ solutions = []
284
+ for sol in prob["solutions"]:
285
+ solutions.append({
286
+ "solution_id": sol["solution_id"],
287
+ "code": sol["solution_code"],
288
+ "highlighted_code": _highlight_code(sol["solution_code"]),
289
+ "time_complexity": sol.get("time_complexity"),
290
+ "space_complexity": sol.get("space_complexity"),
291
+ })
292
+ return {
293
+ "idx": idx,
294
+ "task_id": prob["problem_id"],
295
+ "entry_point": prob["problem_name"],
296
+ "code": solutions[0]["code"] if solutions else "",
297
+ "highlighted_code": solutions[0]["highlighted_code"] if solutions else "",
298
+ "inputs": [],
299
+ "outputs": [],
300
+ "test": None,
301
+ "tasks": [],
302
+ "source": "BigOBench",
303
+ "has_ground_truth": False,
304
+ "has_tasks": False,
305
+ "description": prob["description"],
306
+ "solutions": solutions,
307
+ }
308
+
309
+
310
+ def _merge_bigobench(ds_time, ds_space) -> list[dict[str, Any]]:
311
+ """Merge time and space complexity test sets by problem_id.
312
+
313
+ Groups all solutions under their parent problem. Solutions that appear
314
+ in both test sets get both complexity labels; otherwise the missing one
315
+ is None. Returns a list of problem dicts sorted by problem_id.
316
+ """
317
+ # First, collect solutions keyed by (problem_id, solution_id)
318
+ solutions: dict[tuple[str, str], dict[str, Any]] = {}
319
+ # Track problem-level metadata
320
+ problem_meta: dict[str, dict[str, str]] = {}
321
+
322
+ for row in ds_time:
323
+ pid, sid = row["problem_id"], row["solution_id"]
324
+ problem_meta[pid] = {
325
+ "problem_name": row["problem_name"],
326
+ "description": row["description"],
327
+ }
328
+ solutions[(pid, sid)] = {
329
+ "solution_id": sid,
330
+ "solution_code": row["solution_code"],
331
+ "time_complexity": row["time_complexity_inferred"],
332
+ "space_complexity": None,
333
+ }
334
+
335
+ for row in ds_space:
336
+ pid, sid = row["problem_id"], row["solution_id"]
337
+ problem_meta.setdefault(pid, {
338
+ "problem_name": row["problem_name"],
339
+ "description": row["description"],
340
+ })
341
+ key = (pid, sid)
342
+ if key in solutions:
343
+ solutions[key]["space_complexity"] = row["space_complexity_inferred"]
344
+ else:
345
+ solutions[key] = {
346
+ "solution_id": sid,
347
+ "solution_code": row["solution_code"],
348
+ "time_complexity": None,
349
+ "space_complexity": row["space_complexity_inferred"],
350
+ }
351
+
352
+ # Group solutions by problem_id
353
+ from collections import defaultdict
354
+ by_problem: dict[str, list[dict[str, Any]]] = defaultdict(list)
355
+ for (pid, _sid), sol in solutions.items():
356
+ by_problem[pid].append(sol)
357
+
358
+ problems = []
359
+ for pid in sorted(by_problem.keys()):
360
+ meta = problem_meta[pid]
361
+ problems.append({
362
+ "problem_id": pid,
363
+ "problem_name": meta["problem_name"],
364
+ "description": meta["description"],
365
+ "solutions": by_problem[pid],
366
+ })
367
+
368
+ return problems
369
+
370
+
371
+ # ---------------------------------------------------------------------------
372
+ # Registration helpers
373
+ # ---------------------------------------------------------------------------
374
+
375
+ def register_dreval(data: list, tasks: list, gt_loader) -> None:
376
+ """Register the DREval dataset (always available)."""
377
+ REGISTRY["dreval"] = DREvalAdapter(data, tasks, gt_loader)
378
+
379
+
380
+ def register_hf_datasets() -> None:
381
+ """Try to load HuggingFace datasets. Silently skips if `datasets` is not installed."""
382
+ try:
383
+ from datasets import load_dataset
384
+ except ImportError:
385
+ return
386
+
387
+ try:
388
+ crux = load_dataset("cruxeval-org/cruxeval", split="test")
389
+ REGISTRY["cruxeval"] = CRUXEvalAdapter(crux)
390
+ print(f"Loaded CRUXEval: {len(crux)} problems")
391
+ except Exception as e:
392
+ print(f"Warning: could not load CRUXEval: {e}")
393
+
394
+ try:
395
+ heplus = load_dataset("evalplus/humanevalplus", split="test")
396
+ REGISTRY["humanevalplus"] = HumanEvalPlusAdapter(heplus)
397
+ print(f"Loaded HumanEval+: {len(heplus)} problems")
398
+ except Exception as e:
399
+ print(f"Warning: could not load HumanEval+: {e}")
400
+
401
+ try:
402
+ ds_time = load_dataset(
403
+ "facebook/BigOBench", "time_complexity_test_set.jsonl", split="train"
404
+ )
405
+ ds_space = load_dataset(
406
+ "facebook/BigOBench", "space_complexity_test_set.jsonl", split="train"
407
+ )
408
+ merged = _merge_bigobench(ds_time, ds_space)
409
+ REGISTRY["bigobench"] = BigOBenchAdapter(merged)
410
+ print(f"Loaded BigOBench: {len(merged)} problems "
411
+ f"({len(ds_time)} time + {len(ds_space)} space)")
412
+ except Exception as e:
413
+ print(f"Warning: could not load BigOBench: {e}")
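The adapter/registry pattern above can be exercised without downloading any dataset. A minimal sketch (the `ToyAdapter` class and sample rows are illustrative, not part of the commit; the real registry consumer is the `/api/datasets` route in app.py):

```python
from typing import Any

# Same shape as dataset_adapters.REGISTRY: slug -> adapter instance.
REGISTRY: dict[str, "DatasetAdapter"] = {}


class DatasetAdapter:
    slug = ""
    display_name = ""

    def problem_count(self) -> int:
        raise NotImplementedError

    def get_problem_summary(self, idx: int) -> dict[str, Any]:
        raise NotImplementedError


class ToyAdapter(DatasetAdapter):
    # Hypothetical adapter over an in-memory list, standing in for an HF dataset.
    slug = "toy"
    display_name = "Toy"

    def __init__(self, rows):
        self._rows = rows

    def problem_count(self) -> int:
        return len(self._rows)

    def get_problem_summary(self, idx: int) -> dict[str, Any]:
        return {"idx": idx, "task_id": self._rows[idx]["id"], "source": "Toy"}


REGISTRY["toy"] = ToyAdapter([{"id": "Toy/0"}, {"id": "Toy/1"}])

# What a dataset-listing endpoint would serialise from the registry:
listing = [
    {"slug": s, "display_name": a.display_name, "problem_count": a.problem_count()}
    for s, a in REGISTRY.items()
]
print(listing)  # → [{'slug': 'toy', 'display_name': 'Toy', 'problem_count': 2}]
```

Because every adapter returns the same summary/detail shape, the Flask routes and Jinja templates never branch on the dataset type.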
pyproject.toml ADDED
@@ -0,0 +1,38 @@
+ [project]
+ name = "ml4se-bench-viewer"
+ version = "0.1.0"
+ description = "Web-based visualization tool for browsing and inspecting popular ML4SE benchmark datasets"
+ readme = "README.md"
+ requires-python = ">=3.8"
+ dependencies = [
+     "flask>=3.0.0",
+     "pygments>=2.17.2",
+     "datasets>=2.14.0",
+ ]
+
+ [project.optional-dependencies]
+ dev = [
+     "ruff>=0.8.0",
+ ]
+
+ [tool.ruff]
+ line-length = 100
+ target-version = "py38"
+
+ [tool.ruff.lint]
+ select = [
+     "E",   # pycodestyle errors
+     "W",   # pycodestyle warnings
+     "F",   # pyflakes
+     "I",   # isort
+     "B",   # flake8-bugbear
+     "C4",  # flake8-comprehensions
+     "UP",  # pyupgrade
+ ]
+ ignore = [
+     "E501",  # line too long (handled by formatter)
+ ]
+
+ [tool.ruff.format]
+ quote-style = "double"
+ indent-style = "space"
templates/base.html ADDED
@@ -0,0 +1,211 @@
+ <!DOCTYPE html>
+ <html lang="en">
+ <head>
+     <meta charset="UTF-8">
+     <meta name="viewport" content="width=device-width, initial-scale=1.0">
+     <title>{% block title %}ML4SE Benchmark Viewer{% endblock %}</title>
+     <style>
+         * {
+             margin: 0;
+             padding: 0;
+             box-sizing: border-box;
+         }
+
+         body {
+             font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, 'Helvetica Neue', Arial, sans-serif;
+             line-height: 1.6;
+             color: #333;
+             background: #f5f5f5;
+         }
+
+         .container {
+             max-width: 1400px;
+             margin: 0 auto;
+             padding: 20px;
+         }
+
+         header {
+             background: #2c3e50;
+             color: white;
+             padding: 20px 0;
+             margin-bottom: 30px;
+             box-shadow: 0 2px 4px rgba(0,0,0,0.1);
+         }
+
+         header h1 {
+             font-size: 2rem;
+             font-weight: 600;
+         }
+
+         header p {
+             margin-top: 10px;
+             opacity: 0.9;
+         }
+
+         .card {
+             background: white;
+             border-radius: 8px;
+             padding: 20px;
+             margin-bottom: 20px;
+             box-shadow: 0 2px 4px rgba(0,0,0,0.1);
+         }
+
+         .card h2 {
+             font-size: 1.5rem;
+             margin-bottom: 15px;
+             color: #2c3e50;
+             border-bottom: 2px solid #3498db;
+             padding-bottom: 10px;
+         }
+
+         .badge {
+             display: inline-block;
+             padding: 4px 12px;
+             border-radius: 4px;
+             font-size: 0.85rem;
+             font-weight: 600;
+             margin-right: 8px;
+         }
+
+         .badge-humaneval {
+             background: #3498db;
+             color: white;
+         }
+
+         .badge-classeval {
+             background: #9b59b6;
+             color: white;
+         }
+
+         .badge-cruxeval {
+             background: #e67e22;
+             color: white;
+         }
+
+         .badge-humanevalplus {
+             background: #27ae60;
+             color: white;
+         }
+
+         .badge-bigobench {
+             background: #8e44ad;
+             color: white;
+         }
+
+         .badge-info {
+             background: #ecf0f1;
+             color: #2c3e50;
+         }
+
+         /* Code styling */
+         .source {
+             font-family: 'Monaco', 'Menlo', 'Ubuntu Mono', monospace;
+             font-size: 0.9rem;
+             line-height: 1.5;
+             overflow-x: auto;
+         }
+
+         .source table {
+             border-collapse: collapse;
+             width: 100%;
+         }
+
+         .source td.linenos {
+             background: #f8f8f8;
+             color: #999;
+             padding: 0 10px;
+             text-align: right;
+             user-select: none;
+             border-right: 1px solid #ddd;
+         }
+
+         .source td.code {
+             padding-left: 15px;
+         }
+
+         .source .hll {
+             background-color: #ffffcc;
+         }
+
+         pre {
+             margin: 0;
+             white-space: pre;
+             word-wrap: normal;
+         }
+
+         /* Loading spinner */
+         .loading {
+             text-align: center;
+             padding: 40px;
+             color: #7f8c8d;
+         }
+
+         .spinner {
+             border: 4px solid #f3f3f3;
+             border-top: 4px solid #3498db;
+             border-radius: 50%;
+             width: 40px;
+             height: 40px;
+             animation: spin 1s linear infinite;
+             margin: 20px auto;
+         }
+
+         @keyframes spin {
+             0% { transform: rotate(0deg); }
+             100% { transform: rotate(360deg); }
+         }
+
+         /* Button styles */
+         .btn {
+             display: inline-block;
+             padding: 10px 20px;
+             background: #3498db;
+             color: white;
+             text-decoration: none;
+             border-radius: 4px;
+             border: none;
+             cursor: pointer;
+             font-size: 1rem;
+             transition: background 0.3s;
+         }
+
+         .btn:hover {
+             background: #2980b9;
+         }
+
+         .btn-secondary {
+             background: #95a5a6;
+         }
+
+         .btn-secondary:hover {
+             background: #7f8c8d;
+         }
+
+         /* Navigation */
+         .nav-links {
+             margin-top: 20px;
+         }
+
+         .nav-links a {
+             margin-right: 15px;
+         }
+
+         {% block extra_css %}{% endblock %}
+     </style>
+ </head>
+ <body>
+     <header>
+         <div class="container">
+             <h1>ML4SE Benchmark Viewer</h1>
+             <p>Browse and inspect popular ML4SE benchmark datasets</p>
+             {% block header_extra %}{% endblock %}
+         </div>
+     </header>
+
+     <div class="container">
+         {% block content %}{% endblock %}
+     </div>
+
+     {% block scripts %}{% endblock %}
+ </body>
+ </html>
templates/index.html ADDED
@@ -0,0 +1,330 @@
+ {% extends "base.html" %}
+
+ {% block title %}ML4SE Benchmark Viewer - Problem List{% endblock %}
+
+ {% block extra_css %}
+ <style>
+     .filters {
+         margin-bottom: 20px;
+         display: flex;
+         gap: 15px;
+         align-items: center;
+         flex-wrap: wrap;
+     }
+
+     .filter-group {
+         display: flex;
+         align-items: center;
+         gap: 8px;
+     }
+
+     .filter-group label {
+         font-weight: 600;
+         color: #2c3e50;
+     }
+
+     .filter-group select,
+     .filter-group input {
+         padding: 8px 12px;
+         border: 1px solid #ddd;
+         border-radius: 4px;
+         font-size: 0.95rem;
+     }
+
+     .problems-grid {
+         display: grid;
+         grid-template-columns: repeat(auto-fill, minmax(300px, 1fr));
+         gap: 20px;
+     }
+
+     .problem-card {
+         background: white;
+         border-radius: 8px;
+         padding: 20px;
+         box-shadow: 0 2px 4px rgba(0,0,0,0.1);
+         transition: transform 0.2s, box-shadow 0.2s;
+         cursor: pointer;
+         text-decoration: none;
+         color: inherit;
+         display: block;
+     }
+
+     .problem-card:hover {
+         transform: translateY(-2px);
+         box-shadow: 0 4px 8px rgba(0,0,0,0.15);
+     }
+
+     .problem-card-header {
+         margin-bottom: 12px;
+         padding-bottom: 12px;
+         border-bottom: 1px solid #ecf0f1;
+     }
+
+     .problem-card-title {
+         font-size: 1.1rem;
+         font-weight: 600;
+         color: #2c3e50;
+         margin-bottom: 5px;
+     }
+
+     .problem-card-id {
+         font-size: 0.85rem;
+         color: #7f8c8d;
+         font-family: monospace;
+     }
+
+     .problem-card-body {
+         margin: 12px 0;
+     }
+
+     .problem-card-info {
+         display: flex;
+         justify-content: space-between;
+         font-size: 0.9rem;
+         color: #7f8c8d;
+     }
+
+     .stats {
+         display: flex;
+         gap: 20px;
+         margin-bottom: 20px;
+         flex-wrap: wrap;
+     }
+
+     .stat-card {
+         background: white;
+         border-radius: 8px;
+         padding: 20px;
+         box-shadow: 0 2px 4px rgba(0,0,0,0.1);
+         flex: 1;
+         min-width: 200px;
+     }
+
+     .stat-number {
+         font-size: 2.5rem;
+         font-weight: 700;
+         color: #3498db;
+     }
+
+     .stat-label {
+         font-size: 0.9rem;
+         color: #7f8c8d;
+         margin-top: 5px;
+     }
+ </style>
+ {% endblock %}
+
+ {% block content %}
+ <div class="stats" id="stats">
+     <div class="stat-card">
+         <div class="stat-number" id="total-problems">-</div>
+         <div class="stat-label">Total Problems</div>
+     </div>
+     <div class="stat-card" id="stat-source-a">
+         <div class="stat-number" id="source-a-count">-</div>
+         <div class="stat-label" id="source-a-label">Source A</div>
+     </div>
+     <div class="stat-card" id="stat-source-b">
+         <div class="stat-number" id="source-b-count">-</div>
+         <div class="stat-label" id="source-b-label">Source B</div>
+     </div>
+     <div class="stat-card">
+         <div class="stat-number" id="filtered-count">-</div>
+         <div class="stat-label">Displayed</div>
+     </div>
+ </div>
+
+ <div class="card">
+     <h2>Filter Problems</h2>
+     <div class="filters">
+         <div class="filter-group">
+             <label for="dataset-filter">Dataset:</label>
+             <select id="dataset-filter">
+                 <option value="dreval">DREval</option>
+             </select>
+         </div>
+         <div class="filter-group">
+             <label for="source-filter">Source:</label>
+             <select id="source-filter">
+                 <option value="all">All</option>
+             </select>
+         </div>
+         <div class="filter-group">
+             <label for="search-filter">Search:</label>
+             <input type="text" id="search-filter" placeholder="Function name or ID...">
+         </div>
+     </div>
+ </div>
+
+ <div id="problems-container">
+     <div class="loading">
+         <div class="spinner"></div>
+         <p>Loading problems...</p>
+     </div>
+ </div>
+ {% endblock %}
+
+ {% block scripts %}
+ <script>
+     let allProblems = [];
+     // Read dataset from URL query param (e.g. /?dataset=cruxeval), default to dreval
+     let currentDataset = new URLSearchParams(window.location.search).get('dataset') || 'dreval';
+
+     async function loadDatasets() {
+         try {
+             const response = await fetch('/api/datasets');
+             const datasets = await response.json();
+             const select = document.getElementById('dataset-filter');
+             select.innerHTML = '';
+             datasets.forEach(ds => {
+                 const opt = document.createElement('option');
+                 opt.value = ds.slug;
+                 opt.textContent = `${ds.display_name} (${ds.problem_count})`;
+                 if (ds.slug === currentDataset) opt.selected = true;
+                 select.appendChild(opt);
+             });
+         } catch (error) {
+             console.error('Failed to load datasets:', error);
+         }
+     }
+
+     async function loadProblems() {
+         try {
+             document.getElementById('problems-container').innerHTML =
+                 '<div class="loading"><div class="spinner"></div><p>Loading problems...</p></div>';
+             const response = await fetch(`/api/${currentDataset}/problems`);
+             allProblems = await response.json();
+             updateSourceFilter();
+             updateStats();
+             renderProblems(allProblems);
+         } catch (error) {
+             document.getElementById('problems-container').innerHTML =
+                 '<div class="card"><p style="color: red;">Error loading problems: ' + error.message + '</p></div>';
+         }
+     }
+
+     function updateSourceFilter() {
+         const sources = [...new Set(allProblems.map(p => p.source))];
+         const select = document.getElementById('source-filter');
+         const current = select.value;
+         select.innerHTML = '<option value="all">All</option>';
+         sources.forEach(src => {
+             const opt = document.createElement('option');
+             opt.value = src;
+             opt.textContent = src;
+             select.appendChild(opt);
+         });
+         // Restore selection if still valid
+         if (sources.includes(current)) {
+             select.value = current;
+         }
+     }
+
+     function updateStats() {
+         const sources = {};
+         allProblems.forEach(p => {
+             sources[p.source] = (sources[p.source] || 0) + 1;
+         });
+
+         document.getElementById('total-problems').textContent = allProblems.length;
+
+         const sourceNames = Object.keys(sources);
+         const statA = document.getElementById('stat-source-a');
+         const statB = document.getElementById('stat-source-b');
+
+         if (sourceNames.length >= 1) {
+             statA.style.display = '';
+             document.getElementById('source-a-count').textContent = sources[sourceNames[0]];
+             document.getElementById('source-a-label').textContent = sourceNames[0];
+         } else {
+             statA.style.display = 'none';
+         }
+
+         if (sourceNames.length >= 2) {
+             statB.style.display = '';
+             document.getElementById('source-b-count').textContent = sources[sourceNames[1]];
+             document.getElementById('source-b-label').textContent = sourceNames[1];
+         } else {
+             statB.style.display = 'none';
+         }
+     }
+
+     function badgeClass(source) {
+         return 'badge-' + source.toLowerCase().replace(/[^a-z0-9]/g, '');
+     }
+
+     function renderProblems(problems) {
+         const container = document.getElementById('problems-container');
+
+         if (problems.length === 0) {
+             container.innerHTML = '<div class="card"><p>No problems match your filters.</p></div>';
+             document.getElementById('filtered-count').textContent = '0';
+             return;
+         }
+
+         document.getElementById('filtered-count').textContent = problems.length;
+
+         const grid = document.createElement('div');
+         grid.className = 'problems-grid';
+
+         const basePath = currentDataset === 'dreval' ? '' : `/${currentDataset}`;
+
+         problems.forEach(problem => {
+             const card = document.createElement('a');
+             card.className = 'problem-card';
+             card.href = `${basePath}/problem/${problem.idx}`;
+
+             card.innerHTML = `
+                 <div class="problem-card-header">
+                     <div class="problem-card-title">${problem.entry_point}</div>
+                     <div class="problem-card-id">${problem.task_id}</div>
+                 </div>
+                 <div class="problem-card-body">
+                     <span class="badge ${badgeClass(problem.source)}">${problem.source}</span>
+                     <span class="badge badge-info">${problem.num_inputs} inputs</span>
+                 </div>
+                 <div class="problem-card-info">
+                     <span>Index: ${problem.idx}</span>
+                 </div>
+             `;
+
+             grid.appendChild(card);
+         });
+
+         container.innerHTML = '';
+         container.appendChild(grid);
+     }
+
+     function filterProblems() {
+         const sourceFilter = document.getElementById('source-filter').value;
+         const searchFilter = document.getElementById('search-filter').value.toLowerCase();
+
+         let filtered = allProblems;
+
+         if (sourceFilter !== 'all') {
+             filtered = filtered.filter(p => p.source === sourceFilter);
+         }
+
+         if (searchFilter) {
+             filtered = filtered.filter(p =>
+                 p.entry_point.toLowerCase().includes(searchFilter) ||
+                 p.task_id.toLowerCase().includes(searchFilter)
+             );
+         }
+
+         renderProblems(filtered);
+     }
+
+     document.getElementById('dataset-filter').addEventListener('change', (e) => {
+         currentDataset = e.target.value;
+         document.getElementById('source-filter').value = 'all';
+         document.getElementById('search-filter').value = '';
+         loadProblems();
+     });
+     document.getElementById('source-filter').addEventListener('change', filterProblems);
+     document.getElementById('search-filter').addEventListener('input', filterProblems);
+
+     loadDatasets();
+     loadProblems();
+ </script>
+ {% endblock %}
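The index page filters entirely client-side: restrict by source badge, then case-insensitively substring-match the function name or task ID. A server-side Python analogue of that `filterProblems` logic (the function and sample rows here are illustrative, not part of the commit):

```python
def filter_problems(problems, source="all", query=""):
    # Mirror of the index-page JS filter: source === 'all' keeps everything,
    # otherwise match the source badge exactly; a non-empty query must appear
    # (case-insensitively) in entry_point or task_id.
    query = query.lower()
    out = [p for p in problems if source == "all" or p["source"] == source]
    if query:
        out = [
            p for p in out
            if query in p["entry_point"].lower() or query in p["task_id"].lower()
        ]
    return out


problems = [
    {"entry_point": "sort_list", "task_id": "HumanEval/1", "source": "HumanEval"},
    {"entry_point": "Stack.push", "task_id": "ClassEval/7", "source": "ClassEval"},
]
print(filter_problems(problems, query="stack"))  # → the ClassEval row only
```

Keeping the filter this simple means it can run on every keystroke of the search box without debouncing.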
templates/problem.html ADDED
@@ -0,0 +1,1015 @@
+ {% extends "base.html" %}
+
+ {% block title %}Problem {{ idx }} - {{ dataset_name }} - ML4SE Benchmark Viewer{% endblock %}
+
+ {% block header_extra %}
+ <div class="nav-links">
+ <a href="/?dataset={{ dataset_slug }}" class="btn btn-secondary">← Back to List</a>
+ {% set base_path = '/' + dataset_slug if dataset_slug != 'dreval' else '' %}
+ {% if idx > 0 %}
+ <a href="{{ base_path }}/problem/{{ idx - 1 }}" class="btn btn-secondary">← Previous</a>
+ {% else %}
+ <button class="btn btn-secondary" disabled style="opacity: 0.5; cursor: not-allowed;">← Previous</button>
+ {% endif %}
+ {% if idx < total_problems - 1 %}
+ <a href="{{ base_path }}/problem/{{ idx + 1 }}" class="btn btn-secondary">Next →</a>
+ {% else %}
+ <button class="btn btn-secondary" disabled style="opacity: 0.5; cursor: not-allowed;">Next →</button>
+ {% endif %}
+ </div>
+ {% endblock %}
+
+ {% block extra_css %}
+ <style>
+ {{ css|safe }}
+
+ .problem-header {
+ display: flex;
+ justify-content: space-between;
+ align-items: center;
+ margin-bottom: 15px;
+ }
+
+ .problem-meta {
+ margin-bottom: 20px;
+ }
+
+ .meta-item {
+ display: inline-block;
+ margin-right: 15px;
+ margin-bottom: 10px;
+ }
+
+ .meta-label {
+ font-weight: 600;
+ color: #7f8c8d;
+ margin-right: 5px;
+ }
+
+ .meta-value {
+ color: #2c3e50;
+ }
+
+ .task-selector {
+ margin: 20px 0;
+ display: flex;
+ gap: 10px;
+ flex-wrap: wrap;
+ }
+
+ .task-btn {
+ padding: 10px 20px;
+ background: #ecf0f1;
+ border: 2px solid transparent;
+ border-radius: 4px;
+ cursor: pointer;
+ transition: all 0.3s;
+ font-size: 0.95rem;
+ }
+
+ .task-btn:hover {
+ background: #bdc3c7;
+ }
+
+ .task-btn.active {
+ background: #3498db;
+ color: white;
+ border-color: #2980b9;
+ }
+
+ .task-details {
+ margin-top: 20px;
+ }
+
+ .task-section {
+ margin-bottom: 25px;
+ padding: 15px;
+ background: #f8f9fa;
+ border-left: 4px solid #3498db;
+ border-radius: 4px;
+ }
+
+ .task-section h3 {
+ margin-bottom: 10px;
+ color: #2c3e50;
+ font-size: 1.1rem;
+ }
+
+ .code-block {
+ background: #f8f9fa;
+ padding: 15px;
+ border-radius: 4px;
+ overflow-x: auto;
+ font-family: 'Monaco', 'Menlo', 'Ubuntu Mono', monospace;
+ font-size: 0.9rem;
+ border: 1px solid #e1e4e8;
+ }
+
+ .task-items-list {
+ list-style: none;
+ }
+
+ .task-items-list li {
+ padding: 10px;
+ margin-bottom: 8px;
+ background: white;
+ border-radius: 4px;
+ border: 1px solid #e1e4e8;
+ }
+
+ .line-ref {
+ display: inline-block;
+ padding: 2px 8px;
+ background: #3498db;
+ color: white;
+ border-radius: 3px;
+ font-family: monospace;
+ font-size: 0.85rem;
+ margin-right: 8px;
+ }
+
+ .var-name {
+ display: inline-block;
+ padding: 2px 8px;
+ background: #9b59b6;
+ color: white;
+ border-radius: 3px;
+ font-family: monospace;
+ font-size: 0.85rem;
+ }
+
+ .io-section {
+ display: grid;
+ grid-template-columns: 1fr 1fr;
+ gap: 15px;
+ }
+
+ @media (max-width: 768px) {
+ .io-section {
+ grid-template-columns: 1fr;
+ }
+ }
+
+ .navigation-hint {
+ margin-top: 20px;
+ padding: 15px;
+ background: #e8f4f8;
+ border-radius: 4px;
+ color: #2c3e50;
+ font-size: 0.9rem;
+ }
+
+ .test-code-section {
+ margin-top: 20px;
+ }
+
+ /* Inline task visualization */
+ .code-with-tasks {
+ position: relative;
+ }
+
+ .task-marker {
+ display: inline-block;
+ margin-left: 10px;
+ padding: 2px 8px;
+ background: #9b59b6;
+ color: white;
+ border-radius: 3px;
+ font-size: 0.75rem;
+ font-weight: 600;
+ cursor: crosshair;
+ }
+
+ /* Coverage coloring on lineno spans.
+ Pygments emits: td.linenos > div.linenodiv > pre > span.normal
+ We must match that chain; .source .linenos doesn't work because
+ the td has class "linenos", not an element named "linenos". */
+ td.linenos .normal.line-executed {
+ background-color: #d4edda !important;
+ color: #155724 !important;
+ }
+
+ td.linenos .normal.line-not-executed {
+ background-color: #f8d7da !important;
+ color: #721c24 !important;
+ }
+
+ /* Coverage legend */
+ .coverage-legend {
+ margin: 10px 0;
+ padding: 10px 15px;
+ background: #f8f9fa;
+ border-left: 4px solid #28a745;
+ border-radius: 4px;
+ font-size: 0.85rem;
+ display: none;
+ }
+
+ .coverage-legend-item {
+ display: inline-block;
+ margin-right: 18px;
+ }
+
+ .coverage-swatch {
+ display: inline-block;
+ width: 12px;
+ height: 12px;
+ border-radius: 2px;
+ margin-right: 4px;
+ vertical-align: middle;
+ }
+
+ /* Ground truth answer badge shown next to task items */
+ .gt-answer {
+ display: inline-block;
+ margin-left: 10px;
+ padding: 2px 8px;
+ background: #17a2b8;
+ color: white;
+ border-radius: 3px;
+ font-family: monospace;
+ font-size: 0.82rem;
+ font-weight: 600;
+ }
+
+ .gt-answer.loading {
+ background: #6c757d;
+ }
+
+ /* SVG arrow overlay positioned over the code container */
+ #arrow-overlay {
+ position: absolute;
+ top: 0;
+ left: 0;
+ width: 100%;
+ height: 100%;
+ pointer-events: none;
+ overflow: visible;
+ z-index: 10;
+ }
+
+ .exec-arrow {
+ fill: none;
+ stroke: #e67e22;
+ stroke-width: 2.5;
+ stroke-dasharray: none;
+ opacity: 0.9;
+ }
+
+ .exec-arrow-head {
+ fill: #e67e22;
+ opacity: 0.9;
+ }
+
+ /* CRUXEval answer highlight */
+ .crux-answer {
+ border-left: 4px solid #17a2b8 !important;
+ background: #e8f6f8 !important;
+ }
+
+ /* BigOBench complexity display */
+ .complexity-badges {
+ display: flex;
+ gap: 20px;
+ flex-wrap: wrap;
+ }
+
+ .complexity-item {
+ display: flex;
+ align-items: center;
+ gap: 10px;
+ }
+
+ .complexity-label {
+ font-weight: 600;
+ color: #7f8c8d;
+ font-size: 0.95rem;
+ }
+
+ .complexity-value {
+ display: inline-block;
+ padding: 6px 16px;
+ background: #2c3e50;
+ color: #f1c40f;
+ border-radius: 4px;
+ font-family: 'Monaco', 'Menlo', 'Ubuntu Mono', monospace;
+ font-size: 1.1rem;
+ font-weight: 600;
+ }
+ </style>
+ {% endblock %}
+
+ {% block content %}
+ <div id="problem-content">
+ <div class="loading">
+ <div class="spinner"></div>
+ <p>Loading problem details...</p>
+ </div>
+ </div>
+ {% endblock %}
+
+ {% block scripts %}
+ <script>
+ const problemIdx = {{ idx }};
+ const datasetSlug = {{ dataset_slug|tojson }};
+ const datasetName = {{ dataset_name|tojson }};
+ const hasGroundTruth = {{ has_ground_truth|tojson }};
+ const hasTasks = {{ has_tasks|tojson }};
+
+ function badgeClass(source) {
+ return 'badge-' + source.toLowerCase().replace(/[^a-z0-9]/g, '');
+ }
+
+ async function loadProblem() {
+ try {
+ const response = await fetch(`/api/${datasetSlug}/problem/${problemIdx}`);
+ const problem = await response.json();
+
+ if (problem.error) {
+ document.getElementById('problem-content').innerHTML =
+ '<div class="card"><p style="color: red;">Error: ' + problem.error + '</p></div>';
+ return;
+ }
+
+ renderProblem(problem);
+ } catch (error) {
+ document.getElementById('problem-content').innerHTML =
+ '<div class="card"><p style="color: red;">Error loading problem: ' + error.message + '</p></div>';
+ }
+ }
+
+ function renderProblem(problem) {
+ const container = document.getElementById('problem-content');
+
+ // Main problem info card (shared by all datasets)
+ let html = `
+ <div class="card">
+ <div class="problem-header">
+ <h2>${escapeHtml(problem.entry_point)}</h2>
+ <span class="badge ${badgeClass(problem.source)}">${escapeHtml(problem.source)}</span>
+ </div>
+ <div class="problem-meta">
+ <div class="meta-item">
+ <span class="meta-label">Task ID:</span>
+ <span class="meta-value">${escapeHtml(problem.task_id)}</span>
+ </div>
+ <div class="meta-item">
+ <span class="meta-label">Index:</span>
+ <span class="meta-value">${problem.idx}</span>
+ </div>
+ <div class="meta-item">
+ <span class="meta-label">Dataset:</span>
+ <span class="meta-value">${escapeHtml(datasetName)}</span>
+ </div>
+ ${problem.inputs.length > 0 ? `
+ <div class="meta-item">
+ <span class="meta-label">Test Inputs:</span>
+ <span class="meta-value">${problem.inputs.length}</span>
+ </div>` : ''}
+ </div>
+ </div>
+ `;
+
+ // --- BigOBench view (problem description + per-solution code & complexity) ---
+ if (problem.solutions && problem.solutions.length > 0) {
+ // Problem description
+ if (problem.description) {
+ html += `
+ <div class="card">
+ <h2>Problem Statement</h2>
+ <pre class="code-block" style="white-space: pre-wrap;">${escapeHtml(problem.description)}</pre>
+ </div>
+ `;
+ }
+
+ // Each solution: code + complexity
+ problem.solutions.forEach((sol, i) => {
+ html += `
+ <div class="card">
+ <h2>Solution ${i + 1} <span style="font-size:0.8rem;color:#7f8c8d;font-weight:400;">${escapeHtml(sol.solution_id)}</span></h2>
+ <div class="complexity-badges" style="margin-bottom: 15px;">
+ `;
+ if (sol.time_complexity) {
+ html += `
+ <div class="complexity-item">
+ <span class="complexity-label">Time</span>
+ <span class="complexity-value">${escapeHtml(sol.time_complexity)}</span>
+ </div>`;
+ }
+ if (sol.space_complexity) {
+ html += `
+ <div class="complexity-item">
+ <span class="complexity-label">Space</span>
+ <span class="complexity-value">${escapeHtml(sol.space_complexity)}</span>
+ </div>`;
+ }
+ html += `
+ </div>
+ <div class="code-with-tasks">
+ ${sol.highlighted_code}
+ </div>
+ </div>
+ `;
+ });
+
+ // Navigation hint
+ html += `
+ <div class="navigation-hint">
+ <strong>Tip:</strong> Use the Previous/Next buttons at the top to browse through problems,
+ or return to the list view to filter by dataset source or search by name.
+ </div>
+ `;
+
+ container.innerHTML = html;
+ window.currentProblem = problem;
+ return;
+ }
+
+ // Source Code card
+ html += `
+ <div class="card">
+ <h2>Source Code</h2>
+ <div class="code-with-tasks" id="code-container">
+ ${problem.highlighted_code}
+ </div>
+ </div>
+ `;
+
+ // --- Non-DREval (simple) view ---
+ if (!hasTasks) {
+ // Show inputs/outputs if available
+ if (problem.inputs && problem.inputs.length > 0) {
+ html += `<div class="card"><h2>Inputs &amp; Outputs</h2>`;
+ problem.inputs.forEach((inp, i) => {
+ const out = (problem.outputs && problem.outputs[i]) || '';
+ html += `
+ <div class="io-section" style="margin-bottom: 15px;">
+ <div class="task-section">
+ <h3>Input ${i + 1}</h3>
+ <pre class="code-block">${escapeHtml(inp)}</pre>
+ </div>
+ <div class="task-section">
+ <h3>Output</h3>
+ <pre class="code-block">${escapeHtml(out)}</pre>
+ </div>
+ </div>
+ `;
+ });
+ html += `</div>`;
+ }
+
+ // Show test suite if available
+ if (problem.test) {
+ html += `
+ <div class="card">
+ <h2>Test Suite</h2>
+ <pre class="code-block">${escapeHtml(problem.test)}</pre>
+ </div>
+ `;
+ }
+
+ // Navigation hint
+ html += `
+ <div class="navigation-hint">
+ <strong>Tip:</strong> Use the Previous/Next buttons at the top to browse through problems,
+ or return to the list view to filter by dataset source or search by name.
+ </div>
+ `;
+
+ container.innerHTML = html;
+ window.currentProblem = problem;
+ return;
+ }
+
+ // --- CRUXEval task view (tasks have given/predict fields, no task_items) ---
+ if (problem.tasks.length > 0 && problem.tasks[0].given !== undefined) {
+ // Task selector
+ html += `
+ <div class="card">
+ <h2>Tasks</h2>
+ <div class="task-selector" id="task-selector">
+ `;
+ problem.tasks.forEach((task, idx) => {
+ html += `
+ <button class="task-btn ${idx === 0 ? 'active' : ''}"
+ onclick="showCruxTask(${idx})">
+ ${escapeHtml(task.name)}
+ </button>
+ `;
+ });
+ html += `
+ </div>
+ <div id="task-content"></div>
+ </div>
+ `;
+
+ // Navigation hint
+ html += `
+ <div class="navigation-hint">
+ <strong>Tip:</strong> Use the Previous/Next buttons at the top to browse through problems,
+ or return to the list view to filter by dataset source or search by name.
+ </div>
+ `;
+
+ container.innerHTML = html;
+ window.currentProblem = problem;
+ showCruxTask(0);
+ return;
+ }
+
+ // --- DREval (full) view with tasks, coverage, arrows ---
+ // Rebuild html cleanly with coverage legend and SVG overlay
+ html = `
+ <div class="card">
+ <div class="problem-header">
+ <h2>${escapeHtml(problem.entry_point)}</h2>
+ <span class="badge ${badgeClass(problem.source)}">${escapeHtml(problem.source)}</span>
+ </div>
+ <div class="problem-meta">
+ <div class="meta-item">
+ <span class="meta-label">Task ID:</span>
+ <span class="meta-value">${escapeHtml(problem.task_id)}</span>
+ </div>
+ <div class="meta-item">
+ <span class="meta-label">Index:</span>
+ <span class="meta-value">${problem.idx}</span>
+ </div>
+ <div class="meta-item">
+ <span class="meta-label">Dataset:</span>
+ <span class="meta-value">${escapeHtml(datasetName)}</span>
+ </div>
+ <div class="meta-item">
+ <span class="meta-label">Test Inputs:</span>
+ <span class="meta-value">${problem.inputs.length}</span>
+ </div>
+ </div>
+ </div>
+
+ <div class="card">
+ <h2>Source Code</h2>
+ <div class="coverage-legend" id="coverage-legend">
+ <strong>Coverage:</strong>
+ <span class="coverage-legend-item">
+ <span class="coverage-swatch" style="background:#d4edda; border:1px solid #28a745;"></span>
+ Executed
+ </span>
+ <span class="coverage-legend-item">
+ <span class="coverage-swatch" style="background:#f8d7da; border:1px solid #dc3545;"></span>
+ Not executed
+ </span>
+ </div>
+ <div class="code-with-tasks" id="code-container">
+ ${problem.highlighted_code}
+ <svg id="arrow-overlay" xmlns="http://www.w3.org/2000/svg">
+ <defs>
+ <marker id="arrowhead" markerWidth="8" markerHeight="6"
+ refX="8" refY="3" orient="auto">
+ <polygon points="0 0, 8 3, 0 6" class="exec-arrow-head"/>
+ </marker>
+ </defs>
+ </svg>
+ </div>
+ </div>
+ `;
+
+ // Task selector
+ html += `
+ <div class="card">
+ <h2>Test Cases & Tasks</h2>
+ <p>Select a test input to view associated reasoning tasks:</p>
+ <div class="task-selector" id="task-selector">
+ `;
+
+ problem.tasks.forEach((task, idx) => {
+ html += `
+ <button class="task-btn ${idx === 0 ? 'active' : ''}"
+ onclick="showTask(${idx})">
+ Input ${task.input_idx + 1}
+ </button>
+ `;
+ });
+
+ html += `
+ </div>
+ <div id="task-content"></div>
+ </div>
+ `;
+
+ // Navigation hint
+ html += `
+ <div class="navigation-hint">
+ <strong>Tip:</strong> Use the Previous/Next buttons at the top to browse through problems,
+ or return to the list view to filter by dataset source or search by name.
+ </div>
+ `;
+
+ container.innerHTML = html;
+
+ // Store problem data globally
+ window.currentProblem = problem;
+
+ // Show first task by default
+ showTask(0);
+ }
+
+ function injectTaskMarkers(taskItems) {
+ const codePre = document.querySelector('.source .code pre');
+
+ // Save the pristine original innerHTML once, before any modification.
+ if (codePre && !window._codePreOriginalHtml) {
+ window._codePreOriginalHtml = codePre.innerHTML;
+ }
+
+ // Invalidate span cache (rebuilt lazily on next arrow draw)
+ window._linenoSpanCache = null;
+
+ // Store current task items so applyCoverage can re-add markers after wrapping.
+ window._currentTaskItems = taskItems || [];
+
+ // Reset code pre to original, then add markers from scratch.
+ if (codePre && window._codePreOriginalHtml) {
+ codePre.innerHTML = window._codePreOriginalHtml;
+ }
+
+ if (!taskItems || taskItems.length === 0) {
+ return;
+ }
+
+ // Group tasks by line number
+ const tasksByLine = {};
+ taskItems.forEach(item => {
+ if (!tasksByLine[item.lineno]) tasksByLine[item.lineno] = [];
+ tasksByLine[item.lineno].push(item.var);
+ });
+
+ // Inject task marker badges into the code pre
+ if (!codePre) return;
+ const codeLines = codePre.innerHTML.split('\n');
+ codePre.innerHTML = codeLines.map((line, idx) => {
+ const lineNum = idx + 1;
+ if (tasksByLine[lineNum] && line.trim() !== '') {
+ const vars = tasksByLine[lineNum];
+ return line + `<span class="task-marker" data-lineno="${lineNum}" data-vars="${escapeHtml(vars.join(', '))}">${escapeHtml(vars.join(', '))}</span>`;
+ }
+ return line;
+ }).join('\n');
+
+ }
+
+ function applyCoverage(coverageSet, totalLines) {
+ // Remove previous coverage classes from lineno spans.
+ // Pygments structure: td.linenos > div.linenodiv > pre > span.normal
+ // These are individual elements — adding/removing classes has no layout impact.
+ document.querySelectorAll('td.linenos .normal').forEach(el => {
+ el.classList.remove('line-executed', 'line-not-executed');
+ });
+
+ if (!coverageSet) {
+ const legend = document.getElementById('coverage-legend');
+ if (legend) legend.style.display = 'none';
+ return;
+ }
+
+ const legend = document.getElementById('coverage-legend');
+ if (legend) legend.style.display = 'block';
+
+ // Color lineno spans only. We never touch codePre.innerHTML here so:
+ // 1. The table layout is never disturbed (no alignment issue).
+ // 2. Task markers injected by injectTaskMarkers are left untouched.
+ document.querySelectorAll('td.linenos .normal').forEach(span => {
+ const lineNum = parseInt(span.textContent.trim());
+ if (!isNaN(lineNum) && lineNum <= totalLines) {
+ span.classList.add(coverageSet.has(lineNum) ? 'line-executed' : 'line-not-executed');
+ }
+ });
+ }
+
+ // Global map: lineno -> list of next line numbers (1-indexed; -1 = end of trace)
+ window._nextLinesMap = {};
+
+ async function loadAndApplyGroundTruth(problemIdx, inputIdx, taskItems) {
+ // Show "loading" placeholders on all task items
+ taskItems.forEach(item => {
+ const el = document.getElementById(`gt-${item.lineno}-${item.var}`);
+ if (el) { el.textContent = '…'; el.className = 'gt-answer loading'; }
+ });
+
+ // Clear next-lines data from previous input
+ window._nextLinesMap = {};
+
+ try {
+ const resp = await fetch(`/api/${datasetSlug}/problem/${problemIdx}/ground_truth/${inputIdx}`);
+ const gt = await resp.json();
+
+ if (gt.status !== 'ok') {
+ taskItems.forEach(item => {
+ const el = document.getElementById(`gt-${item.lineno}-${item.var}`);
+ if (el) { el.textContent = gt.status === 'error' ? '(exec error)' : '(unavailable)'; el.className = 'gt-answer'; }
+ });
+ applyCoverage(null, 0);
+ return;
+ }
+
+ // Apply coverage highlighting
+ const coverageSet = new Set(gt.coverage);
+ applyCoverage(coverageSet, gt.total_lines);
+
+ // Fill in variable answers
+ const answerMap = {};
+ gt.variable_answers.forEach(a => {
+ answerMap[`${a.lineno}-${a.var}`] = a.answer_str;
+ });
+ taskItems.forEach(item => {
+ const el = document.getElementById(`gt-${item.lineno}-${item.var}`);
+ if (el) {
+ const answer = answerMap[`${item.lineno}-${item.var}`] || '(not available)';
+ el.textContent = answer;
+ el.className = 'gt-answer';
+ }
+ });
+
+ // Store next-lines data for arrow visualization
+ if (gt.next_lines_answers) {
+ gt.next_lines_answers.forEach(a => {
+ window._nextLinesMap[a.lineno] = a.next_lines;
+ });
+ }
+
+ // Attach hover handlers to task-marker spans now that we have next-lines data
+ attachArrowHoverHandlers();
+
+ } catch (e) {
+ taskItems.forEach(item => {
+ const el = document.getElementById(`gt-${item.lineno}-${item.var}`);
+ if (el) { el.textContent = '(error)'; el.className = 'gt-answer'; }
+ });
+ }
+ }
+
+ // Cache of lineNum → DOM span, rebuilt whenever injectTaskMarkers runs.
+ window._linenoSpanCache = null;
+
+ function buildLinenoSpanCache(container) {
+ const cache = {};
+ container.querySelectorAll('td.linenos .normal').forEach(span => {
+ const n = parseInt(span.textContent.trim());
+ if (!isNaN(n)) cache[n] = span;
+ });
+ window._linenoSpanCache = cache;
+ }
+
+ /**
+ * Get the bounding rect of the lineno span for a given 1-indexed line number,
+ * relative to the code container element. Uses a cached span map.
+ */
+ function getLinenoSpanRect(lineNum, container) {
+ if (!window._linenoSpanCache) buildLinenoSpanCache(container);
+ const span = window._linenoSpanCache[lineNum];
+ if (!span) return null;
+ const spanRect = span.getBoundingClientRect();
+ const containerRect = container.getBoundingClientRect();
+ return {
+ top: spanRect.top - containerRect.top + container.scrollTop,
+ bottom: spanRect.bottom - containerRect.top + container.scrollTop,
+ left: spanRect.left - containerRect.left,
+ right: spanRect.right - containerRect.left,
+ width: spanRect.width,
+ height: spanRect.height,
+ midY: (spanRect.top + spanRect.bottom) / 2 - containerRect.top + container.scrollTop,
+ };
+ }
+
+ /**
+ * Draw arrows from sourceLine to each of the targetLines in the SVG overlay.
+ * Lines are 1-indexed. -1 means "end of execution" (no arrow drawn).
+ */
+ function drawArrows(sourceLineNum, targetLineNums) {
+ const container = document.getElementById('code-container');
+ const svg = document.getElementById('arrow-overlay');
+ if (!container || !svg) return;
+
+ // Remove previous arrows (but keep defs)
+ svg.querySelectorAll('.arrow-path').forEach(el => el.remove());
+
+ const srcRect = getLinenoSpanRect(sourceLineNum, container);
+ if (!srcRect) return;
+
+ // Update SVG height to match container
+ svg.setAttribute('height', container.scrollHeight);
+
+ targetLineNums.forEach(targetLineNum => {
+ if (targetLineNum === -1) return; // end of trace — no arrow
+
+ const dstRect = getLinenoSpanRect(targetLineNum, container);
+ if (!dstRect) return;
+
+ // Start point: right edge of source lineno span, vertically centered
+ const x1 = srcRect.right + 2;
+ const y1 = srcRect.midY;
+
+ // End point: right edge of target lineno span, vertically centered
+ const x2 = dstRect.right + 2;
+ const y2 = dstRect.midY;
+
+ // Horizontal offset for the bezier control points — curves to the right
+ const curveOffset = Math.max(30, Math.abs(y2 - y1) * 0.4);
+
+ // Cubic bezier: both control points extend to the right of the lineno column
+ const cx1 = x1 + curveOffset;
+ const cy1 = y1;
+ const cx2 = x2 + curveOffset;
+ const cy2 = y2;
+
+ const path = document.createElementNS('http://www.w3.org/2000/svg', 'path');
+ path.setAttribute('d', `M ${x1} ${y1} C ${cx1} ${cy1}, ${cx2} ${cy2}, ${x2} ${y2}`);
+ path.setAttribute('class', 'exec-arrow arrow-path');
+ path.setAttribute('marker-end', 'url(#arrowhead)');
+ svg.appendChild(path);
+ });
+ }
+
+ /**
+ * Clear all arrows from the SVG overlay.
+ */
+ function clearArrows() {
+ const svg = document.getElementById('arrow-overlay');
+ if (svg) {
+ svg.querySelectorAll('.arrow-path').forEach(el => el.remove());
+ }
+ }
+
+ // AbortController for the current set of marker hover listeners.
+ let _markerListenersAbort = null;
+
+ /**
+ * Attach mouseenter/mouseleave handlers to all .task-marker spans so that
+ * hovering shows execution-flow arrows to next lines.
+ */
+ function attachArrowHoverHandlers() {
+ // Cancel any previously attached listeners without touching the DOM.
+ if (_markerListenersAbort) _markerListenersAbort.abort();
+ _markerListenersAbort = new AbortController();
+ const { signal } = _markerListenersAbort;
+
+ document.querySelectorAll('.task-marker').forEach(marker => {
+ marker.addEventListener('mouseenter', () => {
+ const lineNum = parseInt(marker.dataset.lineno);
+ if (!lineNum) return;
+ const nextLines = window._nextLinesMap[lineNum];
+ if (nextLines && nextLines.length > 0) {
+ drawArrows(lineNum, nextLines);
+ }
+ }, { signal });
+
+ marker.addEventListener('mouseleave', () => {
+ clearArrows();
+ }, { signal });
+ });
+ }
+
+ function showCruxTask(taskIdx) {
+ const problem = window.currentProblem;
+ const task = problem.tasks[taskIdx];
+
+ // Update active button
+ document.querySelectorAll('.task-btn').forEach((btn, idx) => {
+ btn.classList.toggle('active', idx === taskIdx);
+ });
+
+ const givenLabel = task.given === 'input' ? 'Input (given)' : 'Output (given)';
+ const predictLabel = task.predict === 'output' ? 'Output (predict)' : 'Input (predict)';
+ const givenValue = task.given === 'input' ? task.input : task.output;
+ const predictValue = task.predict === 'output' ? task.output : task.input;
+
+ const html = `
+ <div class="task-details">
+ <div class="task-section">
+ <p style="margin-bottom: 12px; color: #7f8c8d;">${escapeHtml(task.description)}</p>
+ </div>
+ <div class="io-section">
+ <div class="task-section">
+ <h3>${escapeHtml(givenLabel)}</h3>
+ <pre class="code-block">${escapeHtml(givenValue)}</pre>
+ </div>
+ <div class="task-section">
+ <h3>${escapeHtml(predictLabel)}</h3>
+ <pre class="code-block crux-answer">${escapeHtml(predictValue)}</pre>
+ </div>
+ </div>
+ </div>
+ `;
+
+ document.getElementById('task-content').innerHTML = html;
+ }
+
+ function showTask(taskIdx) {
+ const problem = window.currentProblem;
+ const task = problem.tasks[taskIdx];
+
+ // Update active button
+ const buttons = document.querySelectorAll('.task-btn');
+ buttons.forEach((btn, idx) => {
+ if (idx === taskIdx) {
+ btn.classList.add('active');
+ } else {
+ btn.classList.remove('active');
+ }
+ });
+
+ // Inject task markers into the code
+ injectTaskMarkers(task.task_items);
+
+ // Clear previous coverage while new one loads
+ applyCoverage(null, 0);
+
+ // Render task content
+ // For HumanEval: Input + Expected Output side by side.
+ // For ClassEval: Input alone (side by side layout), then Test Class below full-width.
+ const ioSection = task.test_class_code
+ ? `<div class="io-section">
+ <div class="task-section">
+ <h3>Input</h3>
+ <pre class="code-block">${escapeHtml(task.input)}</pre>
+ </div>
+ </div>
+ <div class="task-section">
+ <h3>Test Class &mdash; <code>${escapeHtml(task.test_class_name)}</code></h3>
+ <pre class="code-block">${escapeHtml(task.test_class_code)}</pre>
+ </div>`
+ : `<div class="io-section">
+ <div class="task-section">
+ <h3>Input</h3>
+ <pre class="code-block">${escapeHtml(task.input)}</pre>
+ </div>
+ <div class="task-section">
+ <h3>Expected Output</h3>
+ <pre class="code-block">${escapeHtml(task.output)}</pre>
+ </div>
+ </div>`;
+
+ let html = `
+ <div class="task-details">
+ ${ioSection}
+ `;
+
+ // Show task items with ground truth answer placeholders
+ if (task.task_items && task.task_items.length > 0) {
+ html += `
+ <div class="task-section">
+ <h3>Reasoning Tasks</h3>
+ <p style="margin-bottom: 10px; color: #7f8c8d;">
+ Variable state at each execution point (correct answer shown in
+ <span style="background:#17a2b8;color:white;padding:1px 6px;border-radius:3px;font-size:0.82rem;">teal</span>):
+ </p>
+ <ul class="task-items-list">
+ `;
+
+ task.task_items.forEach(item => {
+ html += `
+ <li>
+ <span class="line-ref">Line ${item.lineno}</span>
+ <span class="var-name">${escapeHtml(item.var)}</span>
+ <span class="gt-answer loading" id="gt-${item.lineno}-${item.var}">…</span>
+ </li>
+ `;
+ });
+
+ html += `
+ </ul>
+ </div>
+ `;
+ }
+
+ // Show output prediction task if exists
+ if (task.output_pred) {
+ html += `
+ <div class="task-section">
+ <h3>Output Completion Task</h3>
+ <p style="margin-bottom: 10px; color: #7f8c8d;">
+ The model needs to complete this test assertion:
+ </p>
+ <pre class="code-block">${escapeHtml(task.output_pred)}</pre>
+ </div>
+ `;
+ }
+
+ html += `</div>`;
+
+ document.getElementById('task-content').innerHTML = html;
+
+ // Fetch and apply ground truth (coverage + variable answers)
+ if (hasGroundTruth && task.task_items) {
+ loadAndApplyGroundTruth(problem.idx, task.input_idx, task.task_items);
+ }
+ }
+
+ function escapeHtml(text) {
+ if (text === null || text === undefined) return '';
+ const div = document.createElement('div');
+ div.textContent = text;
+ return div.innerHTML;
+ }
+
+ loadProblem();
+ </script>
+ {% endblock %}
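The `attachArrowHoverHandlers` function in this file relies on `AbortController`-based listener cleanup: all hover handlers are registered against one shared signal, so a single `abort()` detaches every handler before a fresh set is attached. A minimal standalone sketch of that pattern, using Node's built-in `EventTarget` in place of a DOM element (the `'hover'` event name and counter are illustrative):

```javascript
// Every listener registered with the same signal is detached by
// one abort() call, so no per-listener bookkeeping is needed.
const target = new EventTarget();
let fired = 0;

const controller = new AbortController();
target.addEventListener('hover', () => { fired += 1; },
                        { signal: controller.signal });

target.dispatchEvent(new Event('hover')); // handler runs: fired is now 1
controller.abort();                        // detaches the handler
target.dispatchEvent(new Event('hover')); // handler no longer runs

console.log(fired); // → 1
```

A fresh `AbortController` per attach pass avoids duplicate handlers without keeping a reference to each callback, which is why the template swaps `_markerListenersAbort` rather than calling `removeEventListener`.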