---
title: VEX
emoji: 📊
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
short_description: Virtual Exam Benchmark for Automated Short Answer Grading
---

# VEX: Virtual Exam Benchmark

**VEX** is a longitudinal benchmark for **Automated Short Answer Grading (ASAG)**. It is designed to evaluate grading systems under realistic educational conditions by moving beyond isolated item-level prediction and towards **virtual exam-based assessment**, **student-level aggregation**, and **feedback quality evaluation**.

VEX is built from real student responses collected in a live university database systems course. The benchmark supports research on joint **grading** and **feedback generation**, with a focus on deployment-relevant evaluation: ranking consistency, pass/fail decisions, grade-boundary agreement, and pedagogical feedback usefulness.

---

## What VEX Provides

- **Item-level grading:** evaluation of individual student answers using standard ASAG metrics.
- **Exam-level evaluation:** construction of virtual exams from multiple held-out questions answered by overlapping students.
- **Feedback quality evaluation:** assessment of generated feedback for diagnostic accuracy, groundedness, actionability, specificity, score alignment, and pedagogical tone.
- **Question-disjoint splits:** strict question-level splitting to reduce leakage between training and evaluation.
- **Longitudinal student structure:** repeated responses from the same student cohort enable aggregation across questions and exam-style analysis.

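To make the question-disjoint idea in the list above concrete, here is a minimal sketch of a split in which no question contributes responses to both parts. The record fields `student_id`, `question_id`, and `answer`, the function name, and the `eval_fraction` parameter are illustrative assumptions; they are not the actual VEX schema or the official split procedure.

```python
import random


def question_disjoint_split(responses, eval_fraction=0.2, seed=0):
    """Split responses so that every question lands in exactly one part.

    `responses` is a list of dicts with a hypothetical "question_id" key.
    """
    # Shuffle the unique question IDs, not the individual responses.
    question_ids = sorted({r["question_id"] for r in responses})
    rng = random.Random(seed)
    rng.shuffle(question_ids)

    # Reserve a fraction of the questions (at least one) for evaluation.
    n_eval = max(1, round(len(question_ids) * eval_fraction))
    eval_questions = set(question_ids[:n_eval])

    train = [r for r in responses if r["question_id"] not in eval_questions]
    evaluation = [r for r in responses if r["question_id"] in eval_questions]
    return train, evaluation


# Toy data: 3 students, each answering the same 5 questions.
responses = [
    {"student_id": s, "question_id": q, "answer": f"answer {s}-{q}"}
    for s in ("s1", "s2", "s3")
    for q in ("q1", "q2", "q3", "q4", "q5")
]

train, evaluation = question_disjoint_split(responses, eval_fraction=0.4, seed=42)
train_questions = {r["question_id"] for r in train}
eval_questions = {r["question_id"] for r in evaluation}

# The same students appear on both sides, but no question does.
print(sorted(train_questions), sorted(eval_questions))
```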

---

## Dataset Overview

| Property | Value |
|---|---:|
| Total student responses | ~31k |
| Unique questions | 239 |
| Students | 173 |
| Gold-labeled responses | 3,222 |
| Language | German |
| Domain | University database systems course |
| Score scale | 0, 0.25, 0.5, 0.75, 1 |
| Split strategy | Question-disjoint |
| Evaluation setting | Item-level and virtual-exam level |

The gold subset contains expert-annotated ordinal grades and is used as the benchmark evaluation standard. The remaining data can be used for optional training, representation learning, or silver-label experiments.

---

## Why Virtual Exams?

Most ASAG benchmarks evaluate models as isolated item-level predictors: one answer in, one score out. However, real educational decisions are rarely based on a single response. They depend on cumulative performance across multiple questions, such as:

- total exam scores,
- student rankings,
- pass/fail thresholds,
- final grade categories,
- grade-boundary decisions.

VEX introduces **virtual exams** to model this setting. A virtual exam is created by sampling multiple held-out questions and aggregating the answers of students who responded to all selected questions. This makes it possible to evaluate whether a grading system preserves student-level outcomes, not just individual answer scores.

---

## Evaluation Dimensions

### 1. Item-Level Grading

Standard ASAG metrics are reported for comparability with prior work, including:

- Mean Squared Error,
- Mean Absolute Error,
- Quadratic Weighted Kappa,
- ordinal agreement metrics.

These metrics measure local scoring quality on individual responses.

### 2. Exam-Level Assessment

Virtual exams evaluate cumulative grading behaviour using metrics such as:

- **EL-τb:** exam-level Kendall rank correlation,
- **EL-Acc:** exact final-grade agreement,
- **EL-QWK:** ordinal agreement on final grade categories,
- pass/fail consistency under different grading schemes.

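The virtual exam construction and exam-level comparison described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the official VEX evaluation code: the score dictionaries keyed by `(student_id, question_id)`, the function names, and the 50% pass threshold are all hypothetical, and the exact definition of EL-τb in the benchmark may differ from the plain Kendall τ-b over exam totals computed here.

```python
import math
import random
from itertools import combinations


def build_virtual_exam(scores, num_questions, seed=0):
    """Sample questions and keep only students who answered all of them."""
    questions = sorted({q for _, q in scores})
    students = sorted({s for s, _ in scores})
    exam_questions = random.Random(seed).sample(questions, num_questions)
    eligible = [s for s in students
                if all((s, q) in scores for q in exam_questions)]
    return exam_questions, eligible


def exam_totals(scores, exam_questions, students):
    """Sum each student's item scores over the sampled questions."""
    return [sum(scores[(s, q)] for q in exam_questions) for s in students]


def kendall_tau_b(x, y):
    """Kendall tau-b: (C - D) / sqrt((C + D + Tx) * (C + D + Ty))."""
    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # pairs tied in both rankings are ignored
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = math.sqrt((concordant + discordant + ties_x)
                      * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


# Toy gold and predicted item scores on the 0..1 scale; s4 never answered q3.
gold = {("s1", "q1"): 1.0, ("s1", "q2"): 0.75, ("s1", "q3"): 1.0,
        ("s2", "q1"): 0.5, ("s2", "q2"): 0.5, ("s2", "q3"): 0.25,
        ("s3", "q1"): 0.0, ("s3", "q2"): 0.25, ("s3", "q3"): 0.5,
        ("s4", "q1"): 1.0, ("s4", "q2"): 1.0}
pred = {("s1", "q1"): 1.0, ("s1", "q2"): 1.0, ("s1", "q3"): 0.75,
        ("s2", "q1"): 0.5, ("s2", "q2"): 0.25, ("s2", "q3"): 0.25,
        ("s3", "q1"): 0.25, ("s3", "q2"): 0.5, ("s3", "q3"): 0.5,
        ("s4", "q1"): 0.75, ("s4", "q2"): 1.0}

exam_questions, eligible = build_virtual_exam(gold, num_questions=3, seed=0)
gold_totals = exam_totals(gold, exam_questions, eligible)
pred_totals = exam_totals(pred, exam_questions, eligible)

# Rank agreement between gold and predicted exam totals.
tau_b = kendall_tau_b(gold_totals, pred_totals)

# Pass/fail agreement under an assumed absolute threshold of 50% of the maximum.
threshold = 0.5 * len(exam_questions)
pass_agreement = sum((g >= threshold) == (p >= threshold)
                     for g, p in zip(gold_totals, pred_totals)) / len(eligible)

print(eligible, gold_totals, pred_totals, tau_b, pass_agreement)
```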
VEX supports both absolute and distribution-based grading schemes.

### 3. Feedback Utility

Generated feedback is evaluated along pedagogical dimensions:

- diagnostic accuracy,
- groundedness,
- score alignment,
- actionability,
- specificity,
- pedagogical tone.

This allows VEX to evaluate systems that produce both grades and natural-language feedback.

---

## Intended Use

VEX is intended for research on:

- automated short answer grading,
- educational NLP,
- LLM-based grading,
- feedback generation,
- exam-level evaluation,
- benchmark design,
- model calibration in educational assessment,
- teacher-student and silver-label training pipelines.

The benchmark is especially useful for studying cases where item-level metrics alone are insufficient to judge whether a system is safe and useful for educational deployment.

---

## Organization Contents

This Hugging Face organization hosts the VEX benchmark resources, including dataset releases, evaluation code, documentation, and supporting artifacts.

Typical resources include:

- curated dataset files,
- gold-labeled benchmark subsets,
- optional silver-label training data,
- virtual exam construction utilities,
- metric computation scripts,
- evaluation reports,
- documentation for reproducing reported results.

Please refer to the individual repository README files for exact file descriptions and usage instructions.

---

## Citation

If you use VEX, please cite the corresponding paper once available.

```bibtex
@misc{vex2026,
  title  = {VEX: A Virtual Exam Benchmark for Automated Short Answer Grading},
  author = {TBD},
  year   = {2026},
  note   = {Dataset and benchmark for longitudinal ASAG evaluation}
}
```