---
title: ML4SE Benchmark Viewer
emoji: 📊
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
---

# ML4SE Benchmark Viewer

A web-based interface for browsing and manually inspecting individual datapoints from popular ML4SE (Machine Learning for Software Engineering) benchmark datasets.

## Supported Datasets (28)

### Code Generation

| Dataset | Source | View Type |
|---|---|---|
| HumanEval+ | `evalplus/humanevalplus` | Simple |
| MBPP+ | `evalplus/mbppplus` | Simple |
| MBPP | `google-research-datasets/mbpp` | Simple |
| ClassEval | `FudanSELab/ClassEval` | Simple |
| LiveCodeBench | `livecodebench/code_generation_lite` | Simple |
| APPS | `codeparrot/apps` | Multi-solution |
| CodeContests | `deepmind/code_contests` | Multi-solution |
| BigOBench | `facebook/BigOBench` | Complexity badges |
| BigCodeBench | `bigcode/bigcodebench` | Simple |
| EffiBench | `DONG19/EffiBench` | Simple |

### Code Reasoning & Evaluation

| Dataset | Source | View Type |
|---|---|---|
| REval | `JetBrains-Research/REval` | Interactive (coverage, arrows, ground truth) |
| CRUXEval | `cruxeval-org/cruxeval` | Given/Predict task selector |
| HumanEvalPack | `bigcode/humanevalpack` | Multi-language + buggy/canonical |

### Code Editing & Debugging

| Dataset | Source | View Type |
|---|---|---|
| SWE-bench Lite | `princeton-nlp/SWE-bench_Lite` | Unified diff |
| SWE-bench Verified | `princeton-nlp/SWE-bench_Verified` | Unified diff |
| SWE-bench | `princeton-nlp/SWE-bench` | Unified diff |
| DebugBench | `Rtian/DebugBench` | Before/After |
| CanItEdit | `nuprl/CanItEdit` | Before/After |
| CodeEditorBench | `m-a-p/CodeEditorBench` | Before/After |
| CodeXGLUE Refinement | `google/code_x_glue_cc_code_refinement` | Before/After |
| CommitBench | `Maxscha/commitbench` | Unified diff |

### Code Completion & Translation

| Dataset | Source | View Type |
|---|---|---|
| SAFIM | `gonglinyuan/safim` | Fill-in-the-Middle |
| HumanEval-X | `THUDM/humaneval-x` | Multi-language tabs |
| CodeSearchNet | `code-search-net/code_search_net` | Simple |

### Vulnerability Detection

| Dataset | Source | View Type |
|---|---|---|
| BigVul | `bstee615/bigvul` | Vulnerability (CWE badges) |
| DiverseVul | `claudios/DiverseVul` | Vulnerability |
| PrimeVul | `starsofchance/PrimeVul` | Vulnerability |
| Devign | `google/code_x_glue_cc_defect_detection` | Vulnerability |

## Installation & Usage

```bash
# Install dependencies
uv sync

# Run the server (default port: 7860)
uv run python app.py

# Development mode with auto-reload
FLASK_DEBUG=true uv run python app.py
```

Then open http://localhost:7860.

## Development

```bash
# Lint and format
uv run ruff check .
uv run ruff format .
```

## Adding a New Dataset

1. Create an adapter class in the appropriate `adapters/` submodule, inheriting from `DatasetAdapter`
2. Implement the required methods: `problem_count()`, `get_problem_summary()`, `get_problem_detail()`
3. Register the adapter in `adapters/registration.py`
4. Test the endpoints: `/api/<slug>/problems` and `/api/<slug>/problem/<idx>`
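The steps above can be sketched as follows. Only the three method names come from this README; the `DatasetAdapter` stand-in, the `slug` attribute, the return shapes, and the inline data are assumptions for illustration.

```python
class DatasetAdapter:
    """Stand-in for the real base class in adapters/ (assumed interface)."""
    slug: str

class MyBenchAdapter(DatasetAdapter):
    slug = "mybench"  # hypothetical slug used in /api/<slug>/... routes

    def __init__(self):
        # A real adapter would load its dataset here (e.g. from Hugging Face);
        # a single hard-coded datapoint keeps the sketch self-contained.
        self._problems = [
            {"id": 0, "title": "reverse a string", "prompt": "...", "solution": "..."},
        ]

    def problem_count(self) -> int:
        return len(self._problems)

    def get_problem_summary(self, idx: int) -> dict:
        p = self._problems[idx]
        return {"id": p["id"], "title": p["title"]}

    def get_problem_detail(self, idx: int) -> dict:
        return self._problems[idx]
```

After registering such a class in `adapters/registration.py`, the viewer's list and detail endpoints for the new slug should serve the summaries and full datapoints, respectively.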