---
title: ML4SE Benchmark Viewer
emoji: 📊
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
---
# ML4SE Benchmark Viewer
A web-based interface for browsing and manually inspecting individual datapoints from popular ML4SE (Machine Learning for Software Engineering) benchmark datasets.
## Supported Datasets (28)
### Code Generation
| Dataset | Source | View Type |
|---------|--------|-----------|
| **HumanEval+** | evalplus/humanevalplus | Simple |
| **MBPP+** | evalplus/mbppplus | Simple |
| **MBPP** | google-research-datasets/mbpp | Simple |
| **ClassEval** | FudanSELab/ClassEval | Simple |
| **LiveCodeBench** | livecodebench/code_generation_lite | Simple |
| **APPS** | codeparrot/apps | Multi-solution |
| **CodeContests** | deepmind/code_contests | Multi-solution |
| **BigOBench** | facebook/BigOBench | Complexity badges |
| **BigCodeBench** | bigcode/bigcodebench | Simple |
| **EffiBench** | DONG19/EffiBench | Simple |
### Code Reasoning & Evaluation
| Dataset | Source | View Type |
|---------|--------|-----------|
| **REval** | JetBrains-Research/REval | Interactive (coverage, arrows, ground truth) |
| **CRUXEval** | cruxeval-org/cruxeval | Given/Predict task selector |
| **HumanEvalPack** | bigcode/humanevalpack | Multi-language + buggy/canonical |
### Code Editing & Debugging
| Dataset | Source | View Type |
|---------|--------|-----------|
| **SWE-bench Lite** | princeton-nlp/SWE-bench_Lite | Unified diff |
| **SWE-bench Verified** | princeton-nlp/SWE-bench_Verified | Unified diff |
| **SWE-bench** | princeton-nlp/SWE-bench | Unified diff |
| **DebugBench** | Rtian/DebugBench | Before/After |
| **CanItEdit** | nuprl/CanItEdit | Before/After |
| **CodeEditorBench** | m-a-p/CodeEditorBench | Before/After |
| **CodeXGLUE Refinement** | google/code_x_glue_cc_code_refinement | Before/After |
| **CommitBench** | Maxscha/commitbench | Unified diff |
### Code Completion & Translation
| Dataset | Source | View Type |
|---------|--------|-----------|
| **SAFIM** | gonglinyuan/safim | Fill-in-the-Middle |
| **HumanEval-X** | THUDM/humaneval-x | Multi-language tabs |
| **CodeSearchNet** | code-search-net/code_search_net | Simple |
### Vulnerability Detection
| Dataset | Source | View Type |
|---------|--------|-----------|
| **BigVul** | bstee615/bigvul | Vulnerability (CWE badges) |
| **DiverseVul** | claudios/DiverseVul | Vulnerability |
| **PrimeVul** | starsofchance/PrimeVul | Vulnerability |
| **Devign** | google/code_x_glue_cc_defect_detection | Vulnerability |
## Installation & Usage
```bash
# Install dependencies
uv sync
# Run the server (default port: 7860)
uv run python app.py
# Development mode with auto-reload
FLASK_DEBUG=true uv run python app.py
```
Then open http://localhost:7860.
## Development
```bash
# Lint and format
uv run ruff check .
uv run ruff format .
```
### Adding a New Dataset
1. Create an adapter class in the appropriate `adapters/` submodule inheriting from `DatasetAdapter`
2. Implement required methods: `problem_count()`, `get_problem_summary()`, `get_problem_detail()`
3. Register the adapter in `adapters/registration.py`
4. Verify the new endpoints respond: `/api/<slug>/problems` and `/api/<slug>/problem/<idx>`
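The steps above can be sketched as a minimal adapter. This is an illustrative example only: the stand-in `DatasetAdapter` base class, the `MyBenchmarkAdapter` name, and the in-memory data are assumptions, not the real implementation; only the three method names come from the list above.

```python
class DatasetAdapter:
    """Stand-in for the real base class in `adapters/` (assumed interface)."""
    pass


class MyBenchmarkAdapter(DatasetAdapter):
    """Adapter for a hypothetical 'my-benchmark' dataset."""

    def __init__(self):
        # A real adapter would load the dataset here (e.g. from the
        # Hugging Face Hub); this sketch uses a tiny in-memory list.
        self._problems = [
            {"id": 0, "title": "Add two numbers", "prompt": "def add(a, b): ..."},
        ]

    def problem_count(self) -> int:
        # Backs /api/<slug>/problems pagination.
        return len(self._problems)

    def get_problem_summary(self, idx: int) -> dict:
        # Lightweight fields for the problem list view.
        p = self._problems[idx]
        return {"id": p["id"], "title": p["title"]}

    def get_problem_detail(self, idx: int) -> dict:
        # Full datapoint for /api/<slug>/problem/<idx>.
        return self._problems[idx]
```

After writing the class, register it in `adapters/registration.py` so the server can route its slug to the adapter.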