---
title: Benchmark Builder
emoji: 📊
colorFrom: yellow
colorTo: blue
sdk: gradio
app_file: app.py
pinned: false
license: mit
---

# Benchmark Builder

## Question

How do we create small evaluation datasets without filling them with weak distractors?

## System Boundary

This Space is an evaluation-data workbench for multiple-choice questions. It helps author questions, generate distractors, audit answer quality, and export the result.

## Method

The app accepts a question, correct answer, subject, difficulty, and rationale. It generates distractors with Hugging Face inference when `HF_TOKEN` is set and with a deterministic fallback otherwise. It then audits the item for duplicate choices, answer leakage, length balance, and question-stem quality.
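The token check and the deterministic fallback could look roughly like the sketch below. This is not the code in `app.py`: the function names, the candidate `pool` argument, and the hash-based ranking are all assumptions chosen to illustrate "same inputs, same distractors, no model call".

```python
import hashlib
import os


def pick_backend() -> str:
    """Return "inference" when HF_TOKEN is set, otherwise "fallback"."""
    return "inference" if os.environ.get("HF_TOKEN") else "fallback"


def fallback_distractors(question: str, answer: str, pool: list[str], k: int = 3) -> list[str]:
    """Deterministically pick k distractors from a candidate pool (hypothetical helper).

    Candidates are ranked by a hash of (question, candidate), so repeated
    calls with the same inputs return the same choices.
    """
    # Never offer the correct answer as a distractor.
    candidates = [c for c in pool if c.strip().lower() != answer.strip().lower()]
    ranked = sorted(
        candidates,
        key=lambda c: hashlib.sha256(f"{question}|{c}".encode("utf-8")).hexdigest(),
    )
    return ranked[:k]
```

A hash-based ranking keeps the fallback reproducible across runs and machines, which makes exported items easy to diff.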

## Technique

This is evaluation-set construction. The system treats each question as a data object with a correct answer, distractors, rationale, and quality checks.
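One way to picture that data object is a small dataclass. The field names below mirror the inputs listed under Method, but the class itself is an illustrative assumption, not the app's actual schema.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class BenchmarkItem:
    """Hypothetical container for one multiple-choice question."""

    question: str
    answer: str
    distractors: list[str]
    rationale: str = ""
    subject: str = ""
    difficulty: str = ""
    # Maps a check name to whether the item passed it.
    checks: dict[str, bool] = field(default_factory=dict)

    def choices(self) -> list[str]:
        """All options, correct answer first. Shuffle before showing them to a model."""
        return [self.answer, *self.distractors]


item = BenchmarkItem(
    question="Which gas do plants absorb during photosynthesis?",
    answer="Carbon dioxide",
    distractors=["Oxygen", "Nitrogen", "Hydrogen"],
    rationale="Photosynthesis fixes carbon from CO2.",
    subject="biology",
    difficulty="easy",
)
record = asdict(item)  # plain dict, ready for JSON export
```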

The distractor audit matters because weak distractors let a model rule out wrong options without knowing the answer. That inflates scores and makes the benchmark easier than intended.
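As a minimal sketch, two of the audit checks (duplicate choices and answer leakage) might be written as follows. The definitions are assumptions: "duplicate" means equal after lowercasing and whitespace normalization, and "leakage" means the correct answer text appears in the question stem. The real app may define both differently.

```python
def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def audit_choices(question: str, answer: str, distractors: list[str]) -> dict[str, bool]:
    """Return pass/fail flags for two assumed checks (True means the check passed)."""
    choices = [_normalize(answer)] + [_normalize(d) for d in distractors]
    return {
        # Every option, including the correct one, must be distinct.
        "no_duplicate_choices": len(set(choices)) == len(choices),
        # The stem must not contain the correct answer verbatim.
        "no_answer_leakage": _normalize(answer) not in _normalize(question),
    }
```

Returning named booleans rather than raising errors fits a quality-check table: every item gets a row, and failing checks stay visible for human review.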

## Output

The app returns a benchmark item preview and a quality-check table. The item can be exported as JSON, JSONL, or a Hugging Face Dataset push script.
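For reference, JSONL is one JSON object per line. A sketch of that export, assuming each item is a plain dict (the key names here are illustrative, not the app's confirmed schema):

```python
import json


def to_jsonl(items: list[dict]) -> str:
    """Serialize items as JSON Lines: one compact JSON object per line."""
    return "\n".join(
        json.dumps(item, ensure_ascii=False, sort_keys=True) for item in items
    )


items = [
    {"question": "2 + 2 = ?", "answer": "4", "distractors": ["3", "5", "22"]},
    {"question": "Largest planet?", "answer": "Jupiter", "distractors": ["Saturn", "Earth", "Mars"]},
]
jsonl_text = to_jsonl(items)
```

`ensure_ascii=False` keeps non-ASCII characters readable in the exported file, and `sort_keys=True` makes the output stable for diffs.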

## Why It Matters

Evaluation quality is a bottleneck in LLM work. Small, inspectable benchmarks are often more useful than large opaque ones.

## What To Notice

Good distractors should be plausible but wrong. If one option is obviously different in length, style, or vocabulary, the benchmark is leaking hints.
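A length-balance check is the simplest of these hints to automate. The sketch below flags any option whose character length deviates from the median by more than a tolerance. The 50% default and the use of character counts are assumptions, not values taken from the app.

```python
from statistics import median


def length_outliers(choices: list[str], tolerance: float = 0.5) -> list[str]:
    """Return choices whose length differs from the median by more than `tolerance`.

    `tolerance` is a fraction of the median length (0.5 means 50%).
    """
    if not choices:
        return []
    mid = median(len(c) for c in choices)
    if mid == 0:
        # Median of zero: any non-empty option stands out.
        return [c for c in choices if c]
    return [c for c in choices if abs(len(c) - mid) / mid > tolerance]
```

Style and vocabulary mismatches are harder to detect mechanically, which is one reason generated distractors still need a human pass.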

## Effect In Practice

This workflow can help teams build focused evals for retrieval, domain knowledge, safety, or product-specific behavior before running model comparisons.

## Hugging Face Extension

The generated examples can become a Hub Dataset with splits, dataset card, baseline model scores, and a Space leaderboard.
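A hedged sketch of that extension, assuming the `datasets` library is installed and you are already authenticated with the Hub. The split rule (hashing the question text) and both function names are illustrative; the dataset card, baseline scores, and leaderboard are out of scope here.

```python
import hashlib


def assign_split(question: str, test_fraction: float = 0.2) -> str:
    """Deterministically assign an item to "train" or "test" from its question text."""
    digest = hashlib.sha256(question.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "test" if bucket < test_fraction else "train"


def push_items(items: list[dict], repo_id: str) -> None:
    """Group items into splits and push them to the Hub (requires network and auth)."""
    # Imported lazily so assign_split works without `datasets` installed.
    from datasets import Dataset, DatasetDict

    splits: dict[str, list[dict]] = {"train": [], "test": []}
    for item in items:
        splits[assign_split(item["question"])].append(item)
    # Skip empty splits rather than pushing a split with no rows.
    dataset = DatasetDict(
        {name: Dataset.from_list(rows) for name, rows in splits.items() if rows}
    )
    dataset.push_to_hub(repo_id)
```

Hashing the question keeps an item in the same split when the dataset is regenerated, so held-out items do not drift into the training split.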

## Limitations

Generated distractors still need human review. Real benchmarks should include calibration, held-out validation, model baselines, and documentation of dataset scope.

## Run Locally

```bash
pip install -r requirements.txt
python app.py
```