---
title: Benchmark Builder
emoji: 📊
colorFrom: yellow
colorTo: blue
sdk: gradio
app_file: app.py
pinned: false
license: mit
---
# Benchmark Builder
## Question
How do we create small evaluation datasets without filling them with weak distractors?
## System Boundary
This Space is an evaluation-data workbench for multiple-choice questions. It helps author questions, generate distractors, audit answer quality, and export the result.
## Method
The app accepts a question, correct answer, subject, difficulty, and rationale. It generates distractors with Hugging Face inference when `HF_TOKEN` is set and uses a deterministic fallback otherwise. It then audits the item for duplicate choices, answer leakage, length balance, and question-stem quality.
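Two of the audits above can be sketched in a few lines. This is a minimal illustration, not the code in `app.py`: the names `normalize` and `audit_item` are hypothetical, and the leakage check is simplified to "the full answer appears verbatim in the question stem".

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse punctuation/whitespace so near-identical strings compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def audit_item(question: str, answer: str, distractors: list[str]) -> dict[str, bool]:
    """Return pass/fail flags for two checks (hypothetical helper, not the app's API)."""
    choices = [answer, *distractors]
    normalized = [normalize(c) for c in choices]

    # Duplicate choices: two options that normalize to the same string.
    no_duplicates = len(set(normalized)) == len(normalized)

    # Answer leakage (simplified assumption): the whole answer occurs in the stem.
    norm_answer = normalize(answer)
    padded_stem = f" {normalize(question)} "
    no_leakage = bool(norm_answer) and f" {norm_answer} " not in padded_stem

    return {"no_duplicates": no_duplicates, "no_leakage": no_leakage}


report = audit_item(
    question="Which gas do plants absorb during photosynthesis?",
    answer="Carbon dioxide",
    distractors=["Oxygen", "Nitrogen", "carbon dioxide."],
)
print(report)  # {'no_duplicates': False, 'no_leakage': True}
```

The third distractor differs from the answer only in case and punctuation, so the duplicate check fails. A real audit would likely need fuzzier matching for paraphrased duplicates and partial leakage.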
## Technique
This is evaluation-set construction. The system treats each question as a data object with a correct answer, distractors, rationale, and quality checks.
The distractor audit matters because weak distractors let a model eliminate wrong options without knowing the answer. That inflates scores and makes models look more capable than they are.
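One way to picture the "question as a data object" idea is a small dataclass. The class name `BenchmarkItem` and its field names are assumptions chosen to mirror the inputs listed in this README; the app's actual schema may differ.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class BenchmarkItem:
    """Hypothetical container for one multiple-choice item."""

    question: str
    answer: str
    subject: str
    difficulty: str
    rationale: str
    distractors: list[str] = field(default_factory=list)
    checks: dict[str, bool] = field(default_factory=dict)

    def choices(self) -> list[str]:
        # Sort so the correct answer's position does not depend on input order.
        return sorted([self.answer, *self.distractors])

    def passed(self) -> bool:
        # An item with no recorded checks is treated as not yet audited.
        return bool(self.checks) and all(self.checks.values())


item = BenchmarkItem(
    question="Which planet is closest to the Sun?",
    answer="Mercury",
    subject="astronomy",
    difficulty="easy",
    rationale="Mercury has the smallest orbit of the eight planets.",
    distractors=["Venus", "Mars", "Earth"],
    checks={"no_duplicates": True, "no_leakage": True},
)
print(item.choices())  # ['Earth', 'Mars', 'Mercury', 'Venus']
print(item.passed())   # True
```

Keeping the quality checks on the item itself means an exported record carries its own audit trail.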
## Output
The app returns a benchmark item preview and a quality-check table. The item can be exported as JSON, JSONL, or a Hugging Face Dataset push script.
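For reference, JSONL is one JSON object per line, which makes it easy to append items and stream them back. The sketch below uses only the standard library; the helper names `to_jsonl` and `from_jsonl` and the record keys are illustrative assumptions, not the app's export format.

```python
import json


def to_jsonl(items: list[dict]) -> str:
    """Serialize each item as one JSON object per line."""
    return "".join(json.dumps(item, ensure_ascii=False) + "\n" for item in items)


def from_jsonl(text: str) -> list[dict]:
    """Parse JSONL back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


items = [
    {"question": "2 + 2 = ?", "answer": "4", "distractors": ["3", "5", "22"]},
    {"question": "Capital of France?", "answer": "Paris", "distractors": ["Rome", "Oslo", "Bern"]},
]

exported = to_jsonl(items)
print(exported, end="")
assert from_jsonl(exported) == items  # round trip preserves every record
```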
## Why It Matters
Evaluation quality is a bottleneck in LLM work. Small, inspectable benchmarks are often more useful than large opaque ones.
## What To Notice
Good distractors should be plausible but wrong. If one option is obviously different in length, style, or vocabulary, the benchmark is leaking hints.
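A length hint is the easiest of these to detect automatically. The sketch below flags any option whose character count deviates from the median by more than a chosen fraction. The function name `length_outliers` and the default `tolerance` of 0.5 are assumptions for illustration, not values taken from the app.

```python
from statistics import median


def length_outliers(choices: list[str], tolerance: float = 0.5) -> list[str]:
    """Return choices whose length differs from the median by more than `tolerance` (relative)."""
    lengths = [len(c) for c in choices]
    mid = median(lengths)
    if mid == 0:
        return []  # all-empty choices are a different problem; nothing to compare against
    return [c for c, n in zip(choices, lengths) if abs(n - mid) / mid > tolerance]


balanced = ["Paris", "Rome", "Oslo", "Bern"]
leaky = ["Paris", "Rome", "Oslo", "The capital city located on the Seine"]

print(length_outliers(balanced))  # []
print(length_outliers(leaky))     # ['The capital city located on the Seine']
```

Style and vocabulary mismatches are harder to score mechanically and are a good target for human review.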
## Effect In Practice
This workflow can help teams build focused evals for retrieval, domain knowledge, safety, or product-specific behavior before running model comparisons.
## Hugging Face Extension
The generated examples can become a Hub Dataset with splits, a dataset card, baseline model scores, and a Space leaderboard.
## Limitations
Generated distractors still need human review. Real benchmarks should include calibration, held-out validation, model baselines, and documentation of dataset scope.
## Run Locally
```bash
pip install -r requirements.txt
python app.py
```