---
title: Benchmark Builder
emoji: 📊
colorFrom: yellow
colorTo: blue
sdk: gradio
app_file: app.py
pinned: false
license: mit
---

# Benchmark Builder

## Question

How do we create small evaluation datasets without filling them with weak distractors?

## System Boundary

This Space is an evaluation-data workbench for multiple-choice questions. It helps author questions, generate distractors, audit answer quality, and export the result.

## Method

The app accepts a question, correct answer, subject, difficulty, and rationale. It uses Hugging Face inference when `HF_TOKEN` is available and a deterministic fallback otherwise. It then audits duplicate choices, answer leakage, length balance, and question-stem quality (a sketch of these checks appears at the end of this README).

## Technique

This is evaluation-set construction. The system treats each question as a data object with a correct answer, distractors, rationale, and quality checks. The distractor audit matters because weak distractors inflate model scores and make a benchmark look easier than it is.

## Output

The app returns a benchmark item preview, a quality-check table, JSON, JSONL, or a Hugging Face Dataset push script (an example exported item appears at the end of this README).

## Why It Matters

Evaluation quality is a bottleneck in LLM work. Small, inspectable benchmarks are often more useful than large, opaque ones.

## What To Notice

Good distractors should be plausible but wrong. If one option is obviously different in length, style, or vocabulary, the benchmark is leaking hints.

## Effect In Practice

This workflow can help teams build focused evals for retrieval, domain knowledge, safety, or product-specific behavior before running model comparisons.

## Hugging Face Extension

The generated examples can become a Hub Dataset with splits, a dataset card, baseline model scores, and a Space leaderboard (a push sketch appears at the end of this README).

## Limitations

Generated distractors still need human review. Real benchmarks should also include calibration, held-out validation, model baselines, and documentation of dataset scope.

## Run Locally

```bash
pip install -r requirements.txt
python app.py
```
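
## Example: Distractor Audit (Sketch)

For concreteness, here is a minimal sketch of the four audits the Method section describes. The function name, thresholds, and stem heuristic are illustrative assumptions, not the Space's actual implementation.

```python
# Illustrative sketch only; app.py may implement these checks differently.

def audit_item(question: str, correct: str, distractors: list[str]) -> dict:
    """Return pass/fail flags for the four audits described above."""
    choices = [correct] + distractors

    # 1. Duplicate choices: identical options make the item ambiguous.
    normalized = [c.strip().lower() for c in choices]
    no_duplicates = len(set(normalized)) == len(normalized)

    # 2. Answer leakage: the stem should not contain the answer verbatim.
    no_leakage = correct.strip().lower() not in question.lower()

    # 3. Length balance: an outlier-length option hints at the answer.
    #    Assumed tolerance: every option within 0.5x-2x of the mean length.
    lengths = [len(c) for c in choices]
    mean_len = sum(lengths) / len(lengths)
    length_balanced = all(0.5 * mean_len <= n <= 2.0 * mean_len for n in lengths)

    # 4. Stem quality: a crude proxy; the real check would be richer.
    stem_ok = question.rstrip().endswith("?") and len(question.split()) >= 4

    return {
        "no_duplicates": no_duplicates,
        "no_answer_leakage": no_leakage,
        "length_balanced": length_balanced,
        "stem_quality": stem_ok,
    }


if __name__ == "__main__":
    print(audit_item(
        question="Which planet is known as the Red Planet?",
        correct="Mars",
        distractors=["Venus", "Jupiter", "Mercury"],
    ))
```

An item that fails the length-balance flag is usually a candidate for regenerating distractors rather than shipping as-is.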
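
## Example: Exported Item (Sketch)

The exact export schema is defined by `app.py`; the field names below are assumptions chosen to match the inputs the Method section lists.

```python
import json

# Assumed item schema; the real field names are set by app.py.
item = {
    "question": "Which planet is known as the Red Planet?",
    "correct_answer": "Mars",
    "distractors": ["Venus", "Jupiter", "Mercury"],
    "subject": "astronomy",
    "difficulty": "easy",
    "rationale": "Iron oxide on the surface gives Mars its reddish color.",
}

# JSONL export: one JSON object per line.
with open("benchmark.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(item, ensure_ascii=False) + "\n")
```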
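
## Example: Hub Push (Sketch)

A minimal version of the push step the Hugging Face Extension section mentions, assuming the `datasets` library, a `benchmark.jsonl` file produced as above, and a placeholder repo id.

```python
from datasets import load_dataset

# Requires an authenticated Hugging Face token (e.g. via `huggingface-cli login`).
# "your-username/your-benchmark" is a placeholder repo id.
ds = load_dataset("json", data_files="benchmark.jsonl")
ds.push_to_hub("your-username/your-benchmark")
```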