---
title: ML4SE Benchmark Viewer
emoji: 🔍
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
---

# ML4SE Benchmark Viewer

A web-based interface for browsing and manually inspecting individual datapoints from popular ML4SE (Machine Learning for Software Engineering) benchmark datasets.

## Supported Datasets (28)

### Code Generation

| Dataset | Source | View Type |
|---------|--------|-----------|
| **HumanEval+** | evalplus/humanevalplus | Simple |
| **MBPP+** | evalplus/mbppplus | Simple |
| **MBPP** | google-research-datasets/mbpp | Simple |
| **ClassEval** | FudanSELab/ClassEval | Simple |
| **LiveCodeBench** | livecodebench/code_generation_lite | Simple |
| **APPS** | codeparrot/apps | Multi-solution |
| **CodeContests** | deepmind/code_contests | Multi-solution |
| **BigOBench** | facebook/BigOBench | Complexity badges |
| **BigCodeBench** | bigcode/bigcodebench | Simple |
| **EffiBench** | DONG19/EffiBench | Simple |

### Code Reasoning & Evaluation

| Dataset | Source | View Type |
|---------|--------|-----------|
| **REval** | JetBrains-Research/REval | Interactive (coverage, arrows, ground truth) |
| **CRUXEval** | cruxeval-org/cruxeval | Given/Predict task selector |
| **HumanEvalPack** | bigcode/humanevalpack | Multi-language + buggy/canonical |

### Code Editing & Debugging

| Dataset | Source | View Type |
|---------|--------|-----------|
| **SWE-bench Lite** | princeton-nlp/SWE-bench_Lite | Unified diff |
| **SWE-bench Verified** | princeton-nlp/SWE-bench_Verified | Unified diff |
| **SWE-bench** | princeton-nlp/SWE-bench | Unified diff |
| **DebugBench** | Rtian/DebugBench | Before/After |
| **CanItEdit** | nuprl/CanItEdit | Before/After |
| **CodeEditorBench** | m-a-p/CodeEditorBench | Before/After |
| **CodeXGLUE Refinement** | google/code_x_glue_cc_code_refinement | Before/After |
| **CommitBench** | Maxscha/commitbench | Unified diff |

### Code Completion & Translation

| Dataset | Source | View Type |
|---------|--------|-----------|
| **SAFIM** | gonglinyuan/safim | Fill-in-the-Middle |
| **HumanEval-X** | THUDM/humaneval-x | Multi-language tabs |
| **CodeSearchNet** | code-search-net/code_search_net | Simple |

### Vulnerability Detection

| Dataset | Source | View Type |
|---------|--------|-----------|
| **BigVul** | bstee615/bigvul | Vulnerability (CWE badges) |
| **DiverseVul** | claudios/DiverseVul | Vulnerability |
| **PrimeVul** | starsofchance/PrimeVul | Vulnerability |
| **Devign** | google/code_x_glue_cc_defect_detection | Vulnerability |

## Installation & Usage

```bash
# Install dependencies
uv sync

# Run the server (default port: 7860)
uv run python app.py

# Development mode with auto-reload
FLASK_DEBUG=true uv run python app.py
```

Then open http://localhost:7860.

## Development

```bash
# Lint and format
uv run ruff check .
uv run ruff format .
```

### Adding a New Dataset

1. Create an adapter class in the appropriate `adapters/` submodule, inheriting from `DatasetAdapter`.
2. Implement the required methods: `problem_count()`, `get_problem_summary()`, and `get_problem_detail()`.
3. Register the adapter in `adapters/registration.py`.
4. Test the endpoints: `/api/<slug>/problems` and `/api/<slug>/problem/<idx>`.
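The steps above can be sketched as follows. The real `DatasetAdapter` base class lives in `adapters/` and is not shown in this README, so the base class below is a hypothetical stand-in built only from the method names in step 2, and `MyBenchmarkAdapter` is an illustrative toy adapter, not one of the shipped ones.

```python
from abc import ABC, abstractmethod


class DatasetAdapter(ABC):
    """Hypothetical stand-in for the real base class in adapters/.

    Only the three method names from step 2 are taken from the README;
    the actual signatures in the codebase may differ.
    """

    @abstractmethod
    def problem_count(self) -> int: ...

    @abstractmethod
    def get_problem_summary(self, idx: int) -> dict: ...

    @abstractmethod
    def get_problem_detail(self, idx: int) -> dict: ...


class MyBenchmarkAdapter(DatasetAdapter):
    """Toy adapter over an in-memory list of datapoints."""

    def __init__(self) -> None:
        # A real adapter would load its dataset here (e.g. from the
        # Hugging Face Hub); two hard-coded datapoints keep this runnable.
        self._problems = [
            {"id": "p0", "title": "Add two numbers",
             "code": "def add(a, b): return a + b"},
            {"id": "p1", "title": "Negate",
             "code": "def neg(a): return -a"},
        ]

    def problem_count(self) -> int:
        return len(self._problems)

    def get_problem_summary(self, idx: int) -> dict:
        # Summaries feed the problem list view, so omit the full code.
        p = self._problems[idx]
        return {"id": p["id"], "title": p["title"]}

    def get_problem_detail(self, idx: int) -> dict:
        # The detail view gets the complete datapoint.
        return dict(self._problems[idx])
```

Once registered in `adapters/registration.py` under a slug (say, `my-benchmark`), the adapter would back `/api/my-benchmark/problems` and `/api/my-benchmark/problem/<idx>`.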