---
title: ML4SE Benchmark Viewer
emoji: 📊
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
---

# ML4SE Benchmark Viewer

A web-based interface for browsing and manually inspecting individual datapoints from popular ML4SE (Machine Learning for Software Engineering) benchmark datasets.

## Supported Datasets (28)

### Code Generation

| Dataset | Source | View Type |
|---------|--------|-----------|
| **HumanEval+** | evalplus/humanevalplus | Simple |
| **MBPP+** | evalplus/mbppplus | Simple |
| **MBPP** | google-research-datasets/mbpp | Simple |
| **ClassEval** | FudanSELab/ClassEval | Simple |
| **LiveCodeBench** | livecodebench/code_generation_lite | Simple |
| **APPS** | codeparrot/apps | Multi-solution |
| **CodeContests** | deepmind/code_contests | Multi-solution |
| **BigOBench** | facebook/BigOBench | Complexity badges |
| **BigCodeBench** | bigcode/bigcodebench | Simple |
| **EffiBench** | DONG19/EffiBench | Simple |

### Code Reasoning & Evaluation

| Dataset | Source | View Type |
|---------|--------|-----------|
| **REval** | JetBrains-Research/REval | Interactive (coverage, arrows, ground truth) |
| **CRUXEval** | cruxeval-org/cruxeval | Given/Predict task selector |
| **HumanEvalPack** | bigcode/humanevalpack | Multi-language + buggy/canonical |

### Code Editing & Debugging

| Dataset | Source | View Type |
|---------|--------|-----------|
| **SWE-bench Lite** | princeton-nlp/SWE-bench_Lite | Unified diff |
| **SWE-bench Verified** | princeton-nlp/SWE-bench_Verified | Unified diff |
| **SWE-bench** | princeton-nlp/SWE-bench | Unified diff |
| **DebugBench** | Rtian/DebugBench | Before/After |
| **CanItEdit** | nuprl/CanItEdit | Before/After |
| **CodeEditorBench** | m-a-p/CodeEditorBench | Before/After |
| **CodeXGLUE Refinement** | google/code_x_glue_cc_code_refinement | Before/After |
| **CommitBench** | Maxscha/commitbench | Unified diff |

### Code Completion & Translation

| Dataset | Source | View Type |
|---------|--------|-----------|
| **SAFIM** | gonglinyuan/safim | Fill-in-the-Middle |
| **HumanEval-X** | THUDM/humaneval-x | Multi-language tabs |
| **CodeSearchNet** | code-search-net/code_search_net | Simple |

### Vulnerability Detection

| Dataset | Source | View Type |
|---------|--------|-----------|
| **BigVul** | bstee615/bigvul | Vulnerability (CWE badges) |
| **DiverseVul** | claudios/DiverseVul | Vulnerability |
| **PrimeVul** | starsofchance/PrimeVul | Vulnerability |
| **Devign** | google/code_x_glue_cc_defect_detection | Vulnerability |

## Installation & Usage

```bash
# Install dependencies
uv sync

# Run the server (default port: 7860)
uv run python app.py

# Development mode with auto-reload
FLASK_DEBUG=true uv run python app.py
```

Then open http://localhost:7860.

## Development

```bash
# Lint and format
uv run ruff check .
uv run ruff format .
```

### Adding a New Dataset

1. Create an adapter class in the appropriate `adapters/` submodule, inheriting from `DatasetAdapter`.
2. Implement the required methods: `problem_count()`, `get_problem_summary()`, and `get_problem_detail()`.
3. Register the adapter in `adapters/registration.py`.
4. Test the new endpoints: `/api/<dataset>/problems` and `/api/<dataset>/problem/<index>`.
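As a rough sketch, a new adapter might look like the following. Note that `ToyAdapter`, the in-memory example data, the stand-in base class, and the exact method signatures are all illustrative assumptions here; check `DatasetAdapter` in the codebase for the real interface before writing an adapter.

```python
# Hypothetical sketch only — the real DatasetAdapter lives under adapters/
# and its signatures may differ from this stand-in.

class DatasetAdapter:
    """Stand-in base class for illustration."""

    def problem_count(self):
        raise NotImplementedError

    def get_problem_summary(self, index):
        raise NotImplementedError

    def get_problem_detail(self, index):
        raise NotImplementedError


class ToyAdapter(DatasetAdapter):
    """Adapter over an in-memory list; a real adapter would load its dataset."""

    def __init__(self):
        self._rows = [
            {"id": "toy/0", "prompt": "def add(a, b):", "solution": "return a + b"},
        ]

    def problem_count(self):
        # Number of datapoints the viewer can page through.
        return len(self._rows)

    def get_problem_summary(self, index):
        # Lightweight fields for the problem list view.
        row = self._rows[index]
        return {"id": row["id"], "title": row["prompt"][:40]}

    def get_problem_detail(self, index):
        # Full record for the single-problem detail view.
        return dict(self._rows[index])
```

A real adapter would typically fetch rows from the dataset's Hugging Face source listed in the tables above instead of the hard-coded list used here.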