---
title: Benchmark Builder
emoji: 📊
colorFrom: yellow
colorTo: blue
sdk: gradio
app_file: app.py
pinned: false
license: mit
---
# Benchmark Builder
## Question
How do we create small evaluation datasets without filling them with weak distractors?
## System Boundary
This Space is an evaluation-data workbench for multiple-choice questions. It helps author questions, generate distractors, audit answer quality, and export the result.
## Method
The app accepts a question, correct answer, subject, difficulty, and rationale. It generates distractors with Hugging Face inference when `HF_TOKEN` is available and falls back to a deterministic generator otherwise. It then checks each item for duplicate choices, answer leakage, length imbalance, and weak question stems.
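The inference-or-fallback pattern might look roughly like the sketch below; the function name, prompt, and model id are illustrative assumptions, not the actual contents of `app.py`:

```python
# Minimal sketch of the inference-or-fallback pattern (assumed names; the real
# app.py may be organized differently).
import os

from huggingface_hub import InferenceClient


def generate_distractors(question: str, correct: str, n: int = 3) -> list[str]:
    token = os.getenv("HF_TOKEN")
    if token:
        # Hosted inference path: ask a model for plausible-but-wrong options.
        client = InferenceClient(model="HuggingFaceH4/zephyr-7b-beta", token=token)
        prompt = (
            f"Question: {question}\nCorrect answer: {correct}\n"
            f"Write {n} plausible but incorrect answer options, one per line."
        )
        text = client.text_generation(prompt, max_new_tokens=200)
        options = [line.strip("-*0123456789. ").strip() for line in text.splitlines() if line.strip()]
        return options[:n]
    # Deterministic fallback: template-based options, no network or model call.
    return [f"Plausible alternative {i + 1} to '{correct}'" for i in range(n)]
```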
## Technique
The core technique is evaluation-set construction: the system treats each question as a data object with a correct answer, distractors, a rationale, and quality checks.
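One way to model such a data object is a small dataclass; the field names below are assumptions for illustration, not the app's exact schema:

```python
from dataclasses import dataclass, field


@dataclass
class BenchmarkItem:
    """One multiple-choice item plus the metadata the quality audits need."""
    question: str
    correct_answer: str
    distractors: list[str]
    rationale: str
    subject: str = ""
    difficulty: str = "medium"
    quality_flags: dict[str, bool] = field(default_factory=dict)  # e.g. {"answer_leakage": False}
```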
The distractor audit is important because weak distractors inflate model scores and make a benchmark look easier than it is.
## Output
The app returns a benchmark-item preview and a quality-check table, along with an export as JSON, JSONL, or a Hugging Face Dataset push script.
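A JSONL export typically appends one JSON object per line, for example (field names are illustrative, not the app's exact output format):

```python
import json

# A hypothetical exported item; the keys mirror the inputs described above.
item = {
    "question": "Which HTTP status code means 'Not Found'?",
    "correct_answer": "404",
    "distractors": ["400", "403", "500"],
    "rationale": "404 is returned when the server cannot locate the requested resource.",
    "subject": "web",
    "difficulty": "easy",
}

# Append one JSON object per line, the usual JSONL convention.
with open("benchmark.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(item, ensure_ascii=False) + "\n")
```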
## Why It Matters
Evaluation quality is a bottleneck in LLM work. Small, inspectable benchmarks are often more useful than large opaque ones.
## What To Notice
Good distractors should be plausible but wrong. If one option is obviously different in length, style, or vocabulary, the benchmark is leaking hints.
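A length-balance audit of the kind mentioned above can be as simple as the sketch below (the 1.5x threshold and function name are assumptions):

```python
def length_balanced(correct: str, distractors: list[str], ratio: float = 1.5) -> bool:
    """Return False when the correct answer is much longer or shorter than the distractors."""
    avg = sum(len(d) for d in distractors) / len(distractors)
    return (1 / ratio) <= len(correct) / avg <= ratio


# The long correct answer stands out against short distractors, so this item gets flagged.
print(length_balanced("the mitochondrion", ["ribosome", "nucleus", "lysosome"]))  # False
```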
## Effect In Practice
This workflow can help teams build focused evals for retrieval, domain knowledge, safety, or product-specific behavior before running model comparisons.
## Hugging Face Extension
The generated examples can become a Hub Dataset with splits, dataset card, baseline model scores, and a Space leaderboard.
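Pushing exported records to the Hub takes only a few lines with the `datasets` library; the repository id below is a placeholder:

```python
from datasets import Dataset

records = [
    {
        "question": "Which HTTP status code means 'Not Found'?",
        "correct_answer": "404",
        "distractors": ["400", "403", "500"],
        "rationale": "404 is returned when the server cannot locate the requested resource.",
    }
]

# Requires an authenticated session (e.g. `huggingface-cli login`); the repo id is a placeholder.
Dataset.from_list(records).push_to_hub("your-username/benchmark-builder-eval")
```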
## Limitations
Generated distractors still need human review. Real benchmarks should include calibration, held-out validation, model baselines, and documentation of dataset scope.
## Run Locally
```bash
pip install -r requirements.txt
python app.py
```