---
title: ML4SE Benchmark Viewer
emoji: 📊
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
---
# ML4SE Benchmark Viewer
A web-based interface for browsing and manually inspecting individual datapoints from popular ML4SE (Machine Learning for Software Engineering) benchmark datasets.
## Supported Datasets (28)

### Code Generation
| Dataset | Source | View Type |
|---|---|---|
| HumanEval+ | evalplus/humanevalplus | Simple |
| MBPP+ | evalplus/mbppplus | Simple |
| MBPP | google-research-datasets/mbpp | Simple |
| ClassEval | FudanSELab/ClassEval | Simple |
| LiveCodeBench | livecodebench/code_generation_lite | Simple |
| APPS | codeparrot/apps | Multi-solution |
| CodeContests | deepmind/code_contests | Multi-solution |
| BigOBench | facebook/BigOBench | Complexity badges |
| BigCodeBench | bigcode/bigcodebench | Simple |
| EffiBench | DONG19/EffiBench | Simple |
### Code Reasoning & Evaluation
| Dataset | Source | View Type |
|---|---|---|
| REval | JetBrains-Research/REval | Interactive (coverage, arrows, ground truth) |
| CRUXEval | cruxeval-org/cruxeval | Given/Predict task selector |
| HumanEvalPack | bigcode/humanevalpack | Multi-language + buggy/canonical |
### Code Editing & Debugging
| Dataset | Source | View Type |
|---|---|---|
| SWE-bench Lite | princeton-nlp/SWE-bench_Lite | Unified diff |
| SWE-bench Verified | princeton-nlp/SWE-bench_Verified | Unified diff |
| SWE-bench | princeton-nlp/SWE-bench | Unified diff |
| DebugBench | Rtian/DebugBench | Before/After |
| CanItEdit | nuprl/CanItEdit | Before/After |
| CodeEditorBench | m-a-p/CodeEditorBench | Before/After |
| CodeXGLUE Refinement | google/code_x_glue_cc_code_refinement | Before/After |
| CommitBench | Maxscha/commitbench | Unified diff |
### Code Completion & Translation
| Dataset | Source | View Type |
|---|---|---|
| SAFIM | gonglinyuan/safim | Fill-in-the-Middle |
| HumanEval-X | THUDM/humaneval-x | Multi-language tabs |
| CodeSearchNet | code-search-net/code_search_net | Simple |
### Vulnerability Detection
| Dataset | Source | View Type |
|---|---|---|
| BigVul | bstee615/bigvul | Vulnerability (CWE badges) |
| DiverseVul | claudios/DiverseVul | Vulnerability |
| PrimeVul | starsofchance/PrimeVul | Vulnerability |
| Devign | google/code_x_glue_cc_defect_detection | Vulnerability |
## Installation & Usage

```bash
# Install dependencies
uv sync

# Run the server (default port: 7860)
uv run python app.py

# Development mode with auto-reload
FLASK_DEBUG=true uv run python app.py
```

Then open http://localhost:7860.
## Development

```bash
# Lint and format
uv run ruff check .
uv run ruff format .
```
## Adding a New Dataset

1. Create an adapter class in the appropriate `adapters/` submodule, inheriting from `DatasetAdapter`
2. Implement the required methods: `problem_count()`, `get_problem_summary()`, `get_problem_detail()`
3. Register the adapter in `adapters/registration.py`
4. Test via `/api/<slug>/problems` and `/api/<slug>/problem/<idx>`