---
title: Benchmark Builder
emoji: ๐
colorFrom: yellow
colorTo: blue
sdk: gradio
app_file: app.py
pinned: false
license: mit
---
# Benchmark Builder

## Question

How do we create small evaluation datasets without filling them with weak distractors?

## System Boundary

This Space is an evaluation-data workbench for multiple-choice questions. It helps author questions, generate distractors, audit answer quality, and export the result.

## Method

The app accepts a question, correct answer, subject, difficulty, and rationale. It uses Hugging Face inference when `HF_TOKEN` is available and a deterministic fallback otherwise. It then audits duplicate choices, answer leakage, length balance, and question-stem quality.
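A minimal sketch of that token-gated path, assuming `huggingface_hub.InferenceClient`; the `generate_distractors` function, prompt, and fallback logic below are illustrative, not the app's actual code:

```python
import os

def generate_distractors(question: str, answer: str, n: int = 3) -> list[str]:
    token = os.environ.get("HF_TOKEN")
    if token:
        # Remote path: ask a hosted model for plausible-but-wrong options.
        from huggingface_hub import InferenceClient
        client = InferenceClient(token=token)
        prompt = (
            f"Question: {question}\nCorrect answer: {answer}\n"
            f"Write {n} plausible but incorrect answer options, one per line."
        )
        text = client.text_generation(prompt, max_new_tokens=128)
        lines = [ln.strip("- ").strip() for ln in text.splitlines() if ln.strip()]
        return lines[:n]
    # Deterministic fallback: simple labelled perturbations of the correct answer.
    return [f"{answer} (variant {i + 1})" for i in range(n)]
```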
## Technique

This is evaluation-set construction. The system treats each question as a data object with a correct answer, distractors, rationale, and quality checks.
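One way to picture that data object (a sketch; the field names are assumptions, not the app's schema):

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkItem:
    question: str
    correct_answer: str
    distractors: list[str]
    subject: str = ""
    difficulty: str = "medium"
    rationale: str = ""
    # Results of the automated audit, e.g. {"duplicate_choices": False, ...}
    quality_checks: dict[str, bool] = field(default_factory=dict)
```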
The distractor audit is important because weak distractors inflate model scores and make a benchmark look easier than it is.

## Output

The app returns a benchmark item preview, quality-check table, JSON, JSONL, or a Hugging Face Dataset push script.
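A JSONL export along those lines might look like this (a sketch that reuses the hypothetical `BenchmarkItem` above; the shuffling and field names are assumptions):

```python
import json
import random
from dataclasses import asdict

def to_jsonl(items: list[BenchmarkItem], path: str) -> None:
    # One JSON object per line; choices are shuffled so the correct
    # answer does not always sit in the same position.
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            choices = [item.correct_answer, *item.distractors]
            random.shuffle(choices)
            record = asdict(item) | {
                "choices": choices,
                "answer_index": choices.index(item.correct_answer),
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```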
## Why It Matters

Evaluation quality is a bottleneck in LLM work. Small, inspectable benchmarks are often more useful than large opaque ones.

## What To Notice

Good distractors should be plausible but wrong. If one option is obviously different in length, style, or vocabulary, the benchmark is leaking hints.
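A simple length-balance check in that spirit (a sketch; the 1.5x ratio threshold is an arbitrary assumption, not the app's rule):

```python
def length_balanced(correct: str, distractors: list[str], ratio: float = 1.5) -> bool:
    # Flag items where the correct answer is much longer or shorter than the
    # average distractor, since length alone can give the answer away.
    if not distractors or not correct:
        return False
    avg = sum(len(d) for d in distractors) / len(distractors)
    if avg == 0:
        return False
    return 1 / ratio <= len(correct) / avg <= ratio
```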
## Effect In Practice

This workflow can help teams build focused evals for retrieval, domain knowledge, safety, or product-specific behavior before running model comparisons.

## Hugging Face Extension

The generated examples can become a Hub Dataset with splits, dataset card, baseline model scores, and a Space leaderboard.
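Pushing the exported records to the Hub can be as short as the sketch below (assuming the `datasets` library; the repo id and record fields are placeholders):

```python
from datasets import Dataset

# Records exported from the app; the repo id below is a placeholder.
records = [
    {"question": "...", "choices": ["...", "...", "...", "..."], "answer_index": 0},
]
ds = Dataset.from_list(records)
ds.push_to_hub("your-username/benchmark-builder-eval")
```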
## Limitations

Generated distractors still need human review. Real benchmarks should include calibration, held-out validation, model baselines, and documentation of dataset scope.

## Run Locally

```bash
pip install -r requirements.txt
python app.py
```