---
title: ML4SE Benchmark Viewer
emoji: 📊
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
---

# ML4SE Benchmark Viewer

A web-based interface for browsing and manually inspecting individual datapoints from popular ML4SE (Machine Learning for Software Engineering) benchmark datasets.

## Supported Datasets (28)

### Code Generation

| Dataset | Source | View Type |
|---|---|---|
| HumanEval+ | `evalplus/humanevalplus` | Simple |
| MBPP+ | `evalplus/mbppplus` | Simple |
| MBPP | `google-research-datasets/mbpp` | Simple |
| ClassEval | `FudanSELab/ClassEval` | Simple |
| LiveCodeBench | `livecodebench/code_generation_lite` | Simple |
| APPS | `codeparrot/apps` | Multi-solution |
| CodeContests | `deepmind/code_contests` | Multi-solution |
| BigOBench | `facebook/BigOBench` | Complexity badges |
| BigCodeBench | `bigcode/bigcodebench` | Simple |
| EffiBench | `DONG19/EffiBench` | Simple |

### Code Reasoning & Evaluation

| Dataset | Source | View Type |
|---|---|---|
| REval | `JetBrains-Research/REval` | Interactive (coverage, arrows, ground truth) |
| CRUXEval | `cruxeval-org/cruxeval` | Given/Predict task selector |
| HumanEvalPack | `bigcode/humanevalpack` | Multi-language + buggy/canonical |

### Code Editing & Debugging

| Dataset | Source | View Type |
|---|---|---|
| SWE-bench Lite | `princeton-nlp/SWE-bench_Lite` | Unified diff |
| SWE-bench Verified | `princeton-nlp/SWE-bench_Verified` | Unified diff |
| SWE-bench | `princeton-nlp/SWE-bench` | Unified diff |
| DebugBench | `Rtian/DebugBench` | Before/After |
| CanItEdit | `nuprl/CanItEdit` | Before/After |
| CodeEditorBench | `m-a-p/CodeEditorBench` | Before/After |
| CodeXGLUE Refinement | `google/code_x_glue_cc_code_refinement` | Before/After |
| CommitBench | `Maxscha/commitbench` | Unified diff |

### Code Completion & Translation

| Dataset | Source | View Type |
|---|---|---|
| SAFIM | `gonglinyuan/safim` | Fill-in-the-Middle |
| HumanEval-X | `THUDM/humaneval-x` | Multi-language tabs |
| CodeSearchNet | `code-search-net/code_search_net` | Simple |

### Vulnerability Detection

| Dataset | Source | View Type |
|---|---|---|
| BigVul | `bstee615/bigvul` | Vulnerability (CWE badges) |
| DiverseVul | `claudios/DiverseVul` | Vulnerability |
| PrimeVul | `starsofchance/PrimeVul` | Vulnerability |
| Devign | `google/code_x_glue_cc_defect_detection` | Vulnerability |

## Installation & Usage

```bash
# Install dependencies
uv sync

# Run the server (default port: 7860)
uv run python app.py

# Development mode with auto-reload
FLASK_DEBUG=true uv run python app.py
```

Then open http://localhost:7860.

## Development

```bash
# Lint and format
uv run ruff check .
uv run ruff format .
```

## Adding a New Dataset

1. Create an adapter class in the appropriate `adapters/` submodule, inheriting from `DatasetAdapter`
2. Implement the required methods: `problem_count()`, `get_problem_summary()`, `get_problem_detail()`
3. Register the adapter in `adapters/registration.py`
4. Test the endpoints: `/api/<slug>/problems` and `/api/<slug>/problem/<idx>`
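The steps above can be sketched as follows. Only the three method names come from this README; the `DatasetAdapter` stand-in, the `slug` attribute, the return shapes, and the inline data are assumptions for illustration.

```python
class DatasetAdapter:
    """Stand-in for the real base class in adapters/ (assumed interface)."""
    slug: str

class MyBenchAdapter(DatasetAdapter):
    slug = "mybench"  # hypothetical slug used in /api/<slug>/... routes

    def __init__(self):
        # A real adapter would load its dataset here (e.g. from Hugging Face);
        # a single hard-coded datapoint keeps the sketch self-contained.
        self._problems = [
            {"id": 0, "title": "reverse a string", "prompt": "...", "solution": "..."},
        ]

    def problem_count(self) -> int:
        return len(self._problems)

    def get_problem_summary(self, idx: int) -> dict:
        p = self._problems[idx]
        return {"id": p["id"], "title": p["title"]}

    def get_problem_detail(self, idx: int) -> dict:
        return self._problems[idx]
```

After registering such a class in `adapters/registration.py`, the viewer's list and detail endpoints for the new slug should serve the summaries and full datapoints, respectively.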