	---
	title: VEX
	emoji: 📊
	colorFrom: blue
	colorTo: indigo
	sdk: static
	pinned: false
	short_description: Virtual Exam Benchmark for Automated Short Answer Grading
	---

	# VEX — Virtual Exam Benchmark

	VEX is a longitudinal benchmark for Automated Short Answer Grading (ASAG).

	It is designed to evaluate grading systems under realistic educational conditions by moving beyond isolated item-level prediction and towards virtual exam-based assessment, student-level aggregation, and feedback quality evaluation.

	VEX is built from real student responses collected in a live university database systems course. The benchmark supports research on joint grading and feedback generation, with a focus on deployment-relevant evaluation: ranking consistency, pass/fail decisions, grade-boundary agreement, and pedagogical feedback usefulness.

	---

	## What VEX Provides

	- Item-level grading: evaluation of individual student answers using standard ASAG metrics.
	- Exam-level evaluation: construction of virtual exams from multiple held-out questions that were all answered by a shared set of students.
	- Feedback quality evaluation: assessment of generated feedback for diagnostic accuracy, groundedness, actionability, specificity, score alignment, and pedagogical tone.
	- Question-disjoint splits: strict question-level splitting to reduce leakage between training and evaluation.
	- Longitudinal student structure: repeated responses from the same student cohort enable aggregation across questions and exam-style analysis.
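
A question-disjoint split can be sketched as follows. This is a minimal illustration, not the benchmark's released split code: the record fields (`student_id`, `question_id`, `answer`), the function name, and the `eval_fraction` parameter are all assumptions made for the example.

```python
import random


def question_disjoint_split(records, eval_fraction=0.2, seed=0):
    """Split records so that no question appears on both sides.

    `records` is a list of dicts with a hypothetical "question_id" key.
    """
    question_ids = sorted({r["question_id"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(question_ids)

    # Reserve at least one question for evaluation.
    n_eval = max(1, round(len(question_ids) * eval_fraction))
    eval_questions = set(question_ids[:n_eval])

    train = [r for r in records if r["question_id"] not in eval_questions]
    evaluation = [r for r in records if r["question_id"] in eval_questions]
    return train, evaluation


# Toy data: 3 students who each answered 5 questions.
records = [
    {"student_id": s, "question_id": q, "answer": f"answer {s}-{q}"}
    for s in ("s1", "s2", "s3")
    for q in ("q1", "q2", "q3", "q4", "q5")
]
train, evaluation = question_disjoint_split(records, eval_fraction=0.4, seed=42)
train_questions = {r["question_id"] for r in train}
eval_questions = {r["question_id"] for r in evaluation}
```

Splitting on question IDs rather than on individual responses means the same students can appear on both sides, which is what later allows held-out questions to be aggregated per student.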

	---

	## Dataset Overview

	| Property | Value |
	|---|---:|
	| Total student responses | ~31k |
	| Unique questions | 239 |
	| Students | 173 |
	| Gold-labeled responses | 3,222 |
	| Language | German |
	| Domain | University database systems course |
	| Score scale | 0, 0.25, 0.5, 0.75, 1 |
	| Split strategy | Question-disjoint |
	| Evaluation setting | Item-level and virtual-exam level |

	The gold subset contains expert-annotated ordinal grades and is used as the benchmark evaluation standard. The remaining data can be used for optional training, representation learning, or silver-label experiments.

	---

	## Why Virtual Exams?

	Most ASAG benchmarks evaluate models as isolated item-level predictors: one answer in, one score out.

	However, real educational decisions are rarely based on a single response. They depend on cumulative performance across multiple questions, such as:

	- total exam scores,
	- student rankings,
	- pass/fail thresholds,
	- final grade categories,
	- grade-boundary decisions.

	VEX introduces virtual exams to model this setting. A virtual exam is created by sampling multiple held-out questions and aggregating the answers of students who responded to all selected questions. This makes it possible to evaluate whether a grading system preserves student-level outcomes, not just individual answer scores.
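
The construction described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the official VEX utility: the `(student_id, question_id, score)` tuple layout and the function name `build_virtual_exam` are hypothetical.

```python
import random
from collections import defaultdict


def build_virtual_exam(responses, held_out_questions, n_questions, seed=0):
    """Sample held-out questions and keep students who answered all of them.

    `responses` is an iterable of (student_id, question_id, score) tuples.
    Returns the sampled question ids and a {student_id: total score} dict.
    """
    rng = random.Random(seed)
    exam_questions = rng.sample(sorted(held_out_questions), n_questions)

    # Collect each student's scores on the sampled questions only.
    scores = defaultdict(dict)
    for student_id, question_id, score in responses:
        if question_id in exam_questions:
            scores[student_id][question_id] = score

    # Keep only students with a response to every sampled question.
    totals = {
        student: sum(answered.values())
        for student, answered in scores.items()
        if len(answered) == n_questions
    }
    return exam_questions, totals


# Toy data: s2 skipped q3 and is therefore excluded from the exam.
responses = [
    ("s1", "q1", 1.0), ("s1", "q2", 0.5), ("s1", "q3", 0.75),
    ("s2", "q1", 0.5), ("s2", "q2", 0.5),
    ("s3", "q1", 0.25), ("s3", "q2", 0.0), ("s3", "q3", 1.0),
]
exam_questions, totals = build_virtual_exam(
    responses, held_out_questions={"q1", "q2", "q3"}, n_questions=3
)
```

Running the same function on gold scores and on predicted scores yields two total-score vectors per exam, which is the input that student-level comparisons need.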

	---

	## Evaluation Dimensions

	### 1. Item-Level Grading

	Standard ASAG metrics are reported for comparability with prior work, including:

	- Mean Squared Error,
	- Mean Absolute Error,
	- Quadratic Weighted Kappa,
	- ordinal agreement metrics.

	These metrics measure local scoring quality on individual responses.
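
As a rough illustration of these metrics on the five-point score scale, the sketch below computes MSE, MAE, and Quadratic Weighted Kappa in plain Python. The function names and the mapping of scores to class indices 0 to 4 are assumptions for the example; the benchmark's own metric scripts may differ in detail.

```python
def to_class(score):
    """Map a score in {0, 0.25, 0.5, 0.75, 1} to a class index 0..4."""
    return round(score * 4)


def mean_squared_error(gold, pred):
    return sum((g - p) ** 2 for g, p in zip(gold, pred)) / len(gold)


def mean_absolute_error(gold, pred):
    return sum(abs(g - p) for g, p in zip(gold, pred)) / len(gold)


def quadratic_weighted_kappa(gold_classes, pred_classes, n_classes=5):
    """QWK = 1 - sum(w * observed) / sum(w * expected)."""
    observed = [[0] * n_classes for _ in range(n_classes)]
    for g, p in zip(gold_classes, pred_classes):
        observed[g][p] += 1

    n = len(gold_classes)
    gold_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(n_classes))
                 for j in range(n_classes)]

    numerator = denominator = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            numerator += weight * observed[i][j]
            # Expected count under independent marginals.
            denominator += weight * gold_hist[i] * pred_hist[j] / n

    # Degenerate case: both sides use a single identical class.
    if denominator == 0:
        return 1.0
    return 1.0 - numerator / denominator


gold = [0.0, 0.5, 1.0]
pred = [0.25, 0.5, 0.75]
mse = mean_squared_error(gold, pred)
mae = mean_absolute_error(gold, pred)
qwk = quadratic_weighted_kappa([to_class(s) for s in gold],
                               [to_class(s) for s in pred])
```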

	### 2. Exam-Level Assessment

	Virtual exams evaluate cumulative grading behaviour using metrics such as:

	- EL-τb: exam-level Kendall rank correlation,
	- EL-Acc: exact final-grade agreement,
	- EL-QWK: ordinal agreement on final grade categories,
	- pass/fail consistency under different grading schemes.

	VEX supports both absolute and distribution-based grading schemes.
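
The sketch below illustrates how exam-level rank correlation and final-grade agreement could be computed from gold and predicted exam totals. It covers only an absolute grading scheme, and the grade cutoffs (50%, 65%, 80%, 90% of the maximum score) are hypothetical values chosen for the example, not the thresholds used by VEX. The function names are likewise assumptions.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b, which corrects for ties in either ranking."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0:
                ties_x += 1
            if dy == 0:
                ties_y += 1
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    n_pairs = n * (n - 1) // 2
    denom = math.sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


def absolute_grade(total, max_total, cutoffs=(0.5, 0.65, 0.8, 0.9)):
    """Map an exam total to a category: 0 = fail, 1..4 = passing grades.

    The cutoffs are hypothetical and only illustrate an absolute scheme.
    """
    ratio = total / max_total
    return sum(ratio >= c for c in cutoffs)


def exam_level_accuracy(gold_totals, pred_totals, max_total):
    """Share of students whose final grade category matches exactly."""
    matches = sum(
        absolute_grade(g, max_total) == absolute_grade(p, max_total)
        for g, p in zip(gold_totals, pred_totals)
    )
    return matches / len(gold_totals)


# Toy exam with 4 questions (maximum total 4.0) and 4 students.
gold_totals = [2.25, 1.25, 3.0, 0.5]
pred_totals = [1.75, 1.5, 3.0, 0.75]
tau = kendall_tau_b(gold_totals, pred_totals)
acc = exam_level_accuracy(gold_totals, pred_totals, max_total=4.0)
```

In this toy case the predicted totals preserve the student ranking perfectly, yet the first student drops from a passing grade to a fail, so rank correlation and grade agreement disagree. This is the kind of boundary effect that exam-level metrics are meant to expose.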

	### 3. Feedback Utility

	Generated feedback is evaluated along pedagogical dimensions:

	- diagnostic accuracy,
	- groundedness,
	- score alignment,
	- actionability,
	- specificity,
	- pedagogical tone.

	This allows VEX to evaluate systems that produce both grades and natural-language feedback.

	---

	## Intended Use

	VEX is intended for research on:

	- automated short answer grading,
	- educational NLP,
	- LLM-based grading,
	- feedback generation,
	- exam-level evaluation,
	- benchmark design,
	- model calibration in educational assessment,
	- teacher-student and silver-label training pipelines.

	The benchmark is especially useful for studying cases where item-level metrics alone are insufficient to judge whether a system is safe and useful for educational deployment.

	---

	## Organization Contents

	This Hugging Face organization hosts the VEX benchmark resources, including dataset releases, evaluation code, documentation, and supporting artifacts.

	Typical resources include:

	- curated dataset files,
	- gold-labeled benchmark subsets,
	- optional silver-label training data,
	- virtual exam construction utilities,
	- metric computation scripts,
	- evaluation reports,
	- documentation for reproducing reported results.

	Please refer to the individual repository README files for exact file descriptions and usage instructions.

	---

	## Citation

	If you use VEX, please cite the corresponding paper once available.

	```bibtex
	@misc{vex2026,
	title = {VEX: A Virtual Exam Benchmark for Automated Short Answer Grading},
	author = {TBD},
	year = {2026},
	note = {Dataset and benchmark for longitudinal ASAG evaluation}
	}