---
title: Benchmark Builder
emoji: 📊
colorFrom: yellow
colorTo: blue
sdk: gradio
app_file: app.py
pinned: false
license: mit
---
# Benchmark Builder
## Question
How do we create small evaluation datasets without filling them with weak distractors?
## System Boundary
This Space is an evaluation-data workbench for multiple-choice questions. It helps author questions, generate distractors, audit answer quality, and export the result.
## Method
The app accepts a question, correct answer, subject, difficulty, and rationale. It generates distractors with Hugging Face inference when `HF_TOKEN` is set and falls back to a deterministic generator otherwise. It then audits the item for duplicate choices, answer leakage, length balance, and question-stem quality.
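The token-gated behavior can be sketched roughly as follows. This is an illustration under assumptions, not the code in `app.py`: the function names, the template strings, and the `remote` callable (a stand-in for whatever inference call the app makes) are all hypothetical.

```python
import hashlib
import os
import random
from typing import Callable, List, Optional

# Hypothetical templates; the real fallback in app.py may look different.
TEMPLATES = [
    "A common misconception about {subject}",
    "A related but incorrect {subject} concept",
    "An outdated view in {subject}",
    "A partially correct statement about {subject}",
    "An overgeneralized claim about {subject}",
]


def fallback_distractors(question: str, subject: str, n: int = 3) -> List[str]:
    """Pick n distinct templates using an RNG seeded from the question text."""
    # hashlib gives a seed that is stable across runs, unlike built-in hash().
    seed = int(hashlib.sha256(question.encode("utf-8")).hexdigest(), 16)
    rng = random.Random(seed)
    return [t.format(subject=subject) for t in rng.sample(TEMPLATES, n)]


def generate_distractors(
    question: str,
    subject: str,
    remote: Optional[Callable[[str, str, int], List[str]]] = None,
    n: int = 3,
) -> List[str]:
    """Use the remote generator only when HF_TOKEN is set; otherwise fall back."""
    if os.environ.get("HF_TOKEN") and remote is not None:
        return remote(question, subject, n)
    return fallback_distractors(question, subject, n)
```

Seeding from the question text means the same question always yields the same fallback distractors, which keeps exports reproducible when no token is configured.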
## Technique
This is evaluation-set construction. The system treats each question as a data object with a correct answer, distractors, rationale, and quality checks.
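One way to picture that data object is a small dataclass. The class name and field names below are assumptions chosen to mirror the inputs listed in this README; the app's internal representation may differ.

```python
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class BenchmarkItem:
    """Hypothetical container for one multiple-choice question."""

    question: str
    correct_answer: str
    distractors: List[str]
    rationale: str
    subject: str = ""
    difficulty: str = ""
    # Maps a check name to whether the item passed it.
    quality_checks: Dict[str, bool] = field(default_factory=dict)

    def choices(self) -> List[str]:
        """All answer options, correct answer first (unshuffled)."""
        return [self.correct_answer, *self.distractors]


item = BenchmarkItem(
    question="Which gas do plants absorb during photosynthesis?",
    correct_answer="Carbon dioxide",
    distractors=["Oxygen", "Nitrogen", "Hydrogen"],
    rationale="Photosynthesis fixes carbon from CO2.",
    subject="biology",
    difficulty="easy",
)
record = asdict(item)  # plain dict, ready for serialization
```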
The distractor audit is important because weak distractors inflate model scores and make a benchmark look easier than it is.
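A minimal sketch of three such checks follows. The function name, the check names, and the thresholds are assumptions for illustration; the audit in `app.py` may apply different rules.

```python
import re
from typing import Dict, List


def _norm(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


def audit_item(question: str, answer: str, distractors: List[str]) -> Dict[str, bool]:
    """Return a pass/fail flag per check (True means the check passed)."""
    choices = [_norm(c) for c in [answer, *distractors]]
    stem = question.strip()
    return {
        # Two options that normalize to the same string count as duplicates.
        "no_duplicate_choices": len(set(choices)) == len(choices),
        # The answer appearing verbatim in the stem gives the answer away.
        "no_answer_leakage": _norm(answer) not in _norm(question),
        # Assumed heuristic: a usable stem ends with "?" and has 4+ words.
        "stem_is_question": stem.endswith("?") and len(stem.split()) >= 4,
    }


checks = audit_item(
    "What is the capital of France?",
    "Paris",
    ["Lyon", "Marseille", "Nice"],
)
```

String matching of this kind only catches surface problems. It does not judge whether a distractor is plausible, which is why human review remains necessary.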
## Output
The app returns a benchmark item preview and a quality-check table, and exports the item as JSON, JSONL, or a Hugging Face Dataset push script.
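The difference between the JSON and JSONL exports can be shown with the standard library alone. The helper names and the item fields here are illustrative assumptions, not the app's actual export schema.

```python
import json
from typing import Dict, List


def to_json(items: List[Dict]) -> str:
    """One JSON document: a pretty-printed array of items."""
    return json.dumps(items, indent=2, ensure_ascii=False)


def to_jsonl(items: List[Dict]) -> str:
    """JSON Lines: one compact JSON object per line, no enclosing array."""
    return "\n".join(json.dumps(item, ensure_ascii=False) for item in items)


items = [
    {"question": "What is 2 + 2?", "correct_answer": "4", "distractors": ["3", "5", "22"]},
    {"question": "What is 3 * 3?", "correct_answer": "9", "distractors": ["6", "12", "33"]},
]
json_text = to_json(items)
jsonl_text = to_jsonl(items)
```

JSONL is convenient for appending items over time and for streaming readers, since each line parses independently.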
## Why It Matters
Evaluation quality is a bottleneck in LLM work. Small, inspectable benchmarks are often more useful than large opaque ones.
## What To Notice
Good distractors should be plausible but wrong. If one option is obviously different in length, style, or vocabulary, the benchmark is leaking hints.
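The length cue is the easiest of these to check mechanically. The sketch below flags options whose character length is far from the median; the function name and the `max_ratio` threshold of 2.0 are assumptions, and style or vocabulary differences are not covered.

```python
from statistics import median
from typing import List


def length_outliers(choices: List[str], max_ratio: float = 2.0) -> List[str]:
    """Return choices more than max_ratio times longer or shorter than the median."""
    med = median(len(c) for c in choices)
    flagged = []
    for choice in choices:
        n = len(choice)
        if n > med * max_ratio or n * max_ratio < med:
            flagged.append(choice)
    return flagged


balanced = ["Paris", "Lyon", "Nice", "Lille"]
unbalanced = [
    "Lyon",
    "Nice",
    "Lille",
    "The city that hosts the national government and the Louvre",
]
flagged = length_outliers(unbalanced)
```

In the unbalanced example only the long, detailed option is flagged, which is the pattern a test-taker or model could exploit without knowing the subject.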
## Effect In Practice
This workflow can help teams build focused evals for retrieval, domain knowledge, safety, or product-specific behavior before running model comparisons.
## Hugging Face Extension
The generated examples can become a Hub Dataset with splits, a dataset card, baseline model scores, and a Space leaderboard.
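A hedged sketch of such a push, using the `datasets` library, is shown below. The helper names, the column layout, and the repository id are assumptions, and this is not necessarily the script the app generates. Pushing requires `pip install datasets` and an authenticated Hub session.

```python
import random
from typing import Dict, List


def to_records(items: List[Dict], seed: int = 0) -> List[Dict]:
    """Shuffle each item's options so the answer position carries no hint."""
    rng = random.Random(seed)
    records = []
    for item in items:
        choices = [item["correct_answer"], *item["distractors"]]
        rng.shuffle(choices)
        records.append(
            {
                "question": item["question"],
                "choices": choices,
                "answer_index": choices.index(item["correct_answer"]),
                "rationale": item.get("rationale", ""),
            }
        )
    return records


def push_items(items: List[Dict], repo_id: str, test_size: float = 0.2, seed: int = 0) -> None:
    """Split the records and upload them; repo_id is a placeholder you supply."""
    # Imported lazily so the pure helper above works without the library.
    from datasets import Dataset

    dataset = Dataset.from_list(to_records(items, seed=seed))
    splits = dataset.train_test_split(test_size=test_size, seed=seed)
    splits.push_to_hub(repo_id)
```

The dataset card, baseline scores, and leaderboard are separate steps that this sketch does not cover.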
## Limitations
Generated distractors still need human review. Real benchmarks should include calibration, held-out validation, model baselines, and documentation of dataset scope.
## Run Locally
```bash
pip install -r requirements.txt
python app.py
```