---
title: Benchmark Builder
emoji: 📊
colorFrom: yellow
colorTo: blue
sdk: gradio
app_file: app.py
pinned: false
license: mit
---
# Benchmark Builder
## Question
How do we create small evaluation datasets without filling them with weak distractors?
## System Boundary
This Space is an evaluation-data workbench for multiple-choice questions. It helps author questions, generate distractors, audit answer quality, and export the result.
## Method
The app accepts a question, correct answer, subject, difficulty, and rationale. It generates distractors with Hugging Face inference when `HF_TOKEN` is available and falls back to a deterministic generator otherwise. It then checks each item for duplicate choices, answer leakage, length imbalance, and weak question stems.
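The inference-or-fallback pattern might look roughly like the sketch below; the function name, prompt, and model id are illustrative assumptions, not the actual contents of `app.py`:

```python
# Minimal sketch of the inference-or-fallback pattern (assumed names; the real
# app.py may be organized differently).
import os

from huggingface_hub import InferenceClient


def generate_distractors(question: str, correct: str, n: int = 3) -> list[str]:
    token = os.getenv("HF_TOKEN")
    if token:
        # Hosted inference path: ask a model for plausible-but-wrong options.
        client = InferenceClient(model="HuggingFaceH4/zephyr-7b-beta", token=token)
        prompt = (
            f"Question: {question}\nCorrect answer: {correct}\n"
            f"Write {n} plausible but incorrect answer options, one per line."
        )
        text = client.text_generation(prompt, max_new_tokens=200)
        options = [line.strip("-*0123456789. ").strip() for line in text.splitlines() if line.strip()]
        return options[:n]
    # Deterministic fallback: template-based options, no network or model call.
    return [f"Plausible alternative {i + 1} to '{correct}'" for i in range(n)]
```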
## Technique
The core technique is evaluation-set construction: the system treats each question as a data object with a correct answer, distractors, a rationale, and quality checks.
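One way to model such a data object is a small dataclass; the field names below are assumptions for illustration, not the app's exact schema:

```python
from dataclasses import dataclass, field


@dataclass
class BenchmarkItem:
    """One multiple-choice item plus the metadata the quality audits need."""
    question: str
    correct_answer: str
    distractors: list[str]
    rationale: str
    subject: str = ""
    difficulty: str = "medium"
    quality_flags: dict[str, bool] = field(default_factory=dict)  # e.g. {"answer_leakage": False}
```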
The distractor audit is important because weak distractors inflate model scores and make a benchmark look easier than it is.
## Output
The app returns a benchmark-item preview and a quality-check table, along with an export as JSON, JSONL, or a Hugging Face Dataset push script.
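A JSONL export typically appends one JSON object per line, for example (field names are illustrative, not the app's exact output format):

```python
import json

# A hypothetical exported item; the keys mirror the inputs described above.
item = {
    "question": "Which HTTP status code means 'Not Found'?",
    "correct_answer": "404",
    "distractors": ["400", "403", "500"],
    "rationale": "404 is returned when the server cannot locate the requested resource.",
    "subject": "web",
    "difficulty": "easy",
}

# Append one JSON object per line, the usual JSONL convention.
with open("benchmark.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(item, ensure_ascii=False) + "\n")
```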
## Why It Matters
Evaluation quality is a bottleneck in LLM work. Small, inspectable benchmarks are often more useful than large opaque ones.
## What To Notice
Good distractors should be plausible but wrong. If one option is obviously different in length, style, or vocabulary, the benchmark is leaking hints.
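A length-balance audit of the kind mentioned above can be as simple as the sketch below (the 1.5x threshold and function name are assumptions):

```python
def length_balanced(correct: str, distractors: list[str], ratio: float = 1.5) -> bool:
    """Return False when the correct answer is much longer or shorter than the distractors."""
    avg = sum(len(d) for d in distractors) / len(distractors)
    return (1 / ratio) <= len(correct) / avg <= ratio


# The long correct answer stands out against short distractors, so this item gets flagged.
print(length_balanced("the mitochondrion", ["ribosome", "nucleus", "lysosome"]))  # False
```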
## Effect In Practice
This workflow can help teams build focused evals for retrieval, domain knowledge, safety, or product-specific behavior before running model comparisons.
## Hugging Face Extension
The generated examples can become a Hub Dataset with splits, dataset card, baseline model scores, and a Space leaderboard.
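Pushing exported records to the Hub takes only a few lines with the `datasets` library; the repository id below is a placeholder:

```python
from datasets import Dataset

records = [
    {
        "question": "Which HTTP status code means 'Not Found'?",
        "correct_answer": "404",
        "distractors": ["400", "403", "500"],
        "rationale": "404 is returned when the server cannot locate the requested resource.",
    }
]

# Requires an authenticated session (e.g. `huggingface-cli login`); the repo id is a placeholder.
Dataset.from_list(records).push_to_hub("your-username/benchmark-builder-eval")
```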
## Limitations
Generated distractors still need human review. Real benchmarks should include calibration, held-out validation, model baselines, and documentation of dataset scope.
## Run Locally
```bash
pip install -r requirements.txt
python app.py
```