---
title: Benchmark Builder
emoji: ๐
colorFrom: yellow
colorTo: blue
sdk: gradio
app_file: app.py
pinned: false
license: mit
---
# Benchmark Builder

## Question

How do we create small evaluation datasets without filling them with weak distractors?

## System Boundary

This Space is an evaluation-data workbench for multiple-choice questions. It helps author questions, generate distractors, audit answer quality, and export the result.

## Method

The app accepts a question, correct answer, subject, difficulty, and rationale. It uses Hugging Face inference when `HF_TOKEN` is available and a deterministic fallback otherwise. It then audits duplicate choices, answer leakage, length balance, and question-stem quality.
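A minimal sketch of that token-gated path, assuming `huggingface_hub.InferenceClient`; the `generate_distractors` function, prompt, and fallback logic below are illustrative, not the app's actual code:

```python
import os

def generate_distractors(question: str, answer: str, n: int = 3) -> list[str]:
    token = os.environ.get("HF_TOKEN")
    if token:
        # Remote path: ask a hosted model for plausible-but-wrong options.
        from huggingface_hub import InferenceClient
        client = InferenceClient(token=token)
        prompt = (
            f"Question: {question}\nCorrect answer: {answer}\n"
            f"Write {n} plausible but incorrect answer options, one per line."
        )
        text = client.text_generation(prompt, max_new_tokens=128)
        lines = [ln.strip("- ").strip() for ln in text.splitlines() if ln.strip()]
        return lines[:n]
    # Deterministic fallback: simple labelled perturbations of the correct answer.
    return [f"{answer} (variant {i + 1})" for i in range(n)]
```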
## Technique

This is evaluation-set construction. The system treats each question as a data object with a correct answer, distractors, rationale, and quality checks.
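One way to picture that data object (a sketch; the field names are assumptions, not the app's schema):

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkItem:
    question: str
    correct_answer: str
    distractors: list[str]
    subject: str = ""
    difficulty: str = "medium"
    rationale: str = ""
    # Results of the automated audit, e.g. {"duplicate_choices": False, ...}
    quality_checks: dict[str, bool] = field(default_factory=dict)
```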
The distractor audit is important because weak distractors inflate model scores and make a benchmark look easier than it is.

## Output

The app returns a benchmark item preview, quality-check table, JSON, JSONL, or a Hugging Face Dataset push script.
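A JSONL export along those lines might look like this (a sketch that reuses the hypothetical `BenchmarkItem` above; the shuffling and field names are assumptions):

```python
import json
import random
from dataclasses import asdict

def to_jsonl(items: list[BenchmarkItem], path: str) -> None:
    # One JSON object per line; choices are shuffled so the correct
    # answer does not always sit in the same position.
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            choices = [item.correct_answer, *item.distractors]
            random.shuffle(choices)
            record = asdict(item) | {
                "choices": choices,
                "answer_index": choices.index(item.correct_answer),
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```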
## Why It Matters

Evaluation quality is a bottleneck in LLM work. Small, inspectable benchmarks are often more useful than large opaque ones.

## What To Notice

Good distractors should be plausible but wrong. If one option is obviously different in length, style, or vocabulary, the benchmark is leaking hints.
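A simple length-balance check in that spirit (a sketch; the 1.5x ratio threshold is an arbitrary assumption, not the app's rule):

```python
def length_balanced(correct: str, distractors: list[str], ratio: float = 1.5) -> bool:
    # Flag items where the correct answer is much longer or shorter than the
    # average distractor, since length alone can give the answer away.
    if not distractors or not correct:
        return False
    avg = sum(len(d) for d in distractors) / len(distractors)
    if avg == 0:
        return False
    return 1 / ratio <= len(correct) / avg <= ratio
```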
## Effect In Practice

This workflow can help teams build focused evals for retrieval, domain knowledge, safety, or product-specific behavior before running model comparisons.

## Hugging Face Extension

The generated examples can become a Hub Dataset with splits, dataset card, baseline model scores, and a Space leaderboard.
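Pushing the exported records to the Hub can be as short as the sketch below (assuming the `datasets` library; the repo id and record fields are placeholders):

```python
from datasets import Dataset

# Records exported from the app; the repo id below is a placeholder.
records = [
    {"question": "...", "choices": ["...", "...", "...", "..."], "answer_index": 0},
]
ds = Dataset.from_list(records)
ds.push_to_hub("your-username/benchmark-builder-eval")
```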
## Limitations

Generated distractors still need human review. Real benchmarks should include calibration, held-out validation, model baselines, and documentation of dataset scope.

## Run Locally

```bash
pip install -r requirements.txt
python app.py
```