---
title: VEX
emoji: 📊
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
short_description: Virtual Exam Benchmark for Automated Short Answer Grading
---

# VEX: Virtual Exam Benchmark

**VEX** is a longitudinal benchmark for **Automated Short Answer Grading (ASAG)**.

It is designed to evaluate grading systems under realistic educational conditions by moving beyond isolated item-level prediction and towards **virtual exam-based assessment**, **student-level aggregation**, and **feedback quality evaluation**.

VEX is built from real student responses collected in a live university database systems course. The benchmark supports research on joint **grading** and **feedback generation**, with a focus on deployment-relevant evaluation: ranking consistency, pass/fail decisions, grade-boundary agreement, and pedagogical feedback usefulness.

---

## What VEX Provides

- **Item-level grading:** evaluation of individual student answers using standard ASAG metrics.
- **Exam-level evaluation:** construction of virtual exams from multiple held-out questions answered by overlapping students.
- **Feedback quality evaluation:** assessment of generated feedback for diagnostic accuracy, groundedness, actionability, specificity, score alignment, and pedagogical tone.
- **Question-disjoint splits:** strict question-level splitting to reduce leakage between training and evaluation.
- **Longitudinal student structure:** repeated responses from the same student cohort enable aggregation across questions and exam-style analysis.
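
The question-disjoint splitting listed above can be illustrated with a minimal sketch. It is not the VEX split procedure: the record fields (`question_id`, `student_id`, `answer`), the function name, and the 20% evaluation fraction are assumptions made for the example. The only point it demonstrates is that the split is drawn over questions rather than over individual responses, so no question appears on both sides.

```python
import random


def question_disjoint_split(records, eval_fraction=0.2, seed=0):
    """Split responses so that no question_id occurs in both partitions.

    `records` is a list of dicts with a "question_id" key (hypothetical schema).
    """
    # Shuffle the unique question ids, not the individual responses.
    question_ids = sorted({r["question_id"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(question_ids)

    # Reserve a fraction of the questions (at least one) for evaluation.
    n_eval = max(1, round(len(question_ids) * eval_fraction))
    eval_ids = set(question_ids[:n_eval])

    train = [r for r in records if r["question_id"] not in eval_ids]
    evaluation = [r for r in records if r["question_id"] in eval_ids]
    return train, evaluation


# Toy data: 5 questions, each answered by 3 students.
records = [
    {"question_id": q, "student_id": s, "answer": f"answer {q}-{s}"}
    for q in ["q1", "q2", "q3", "q4", "q5"]
    for s in ["s1", "s2", "s3"]
]
train, evaluation = question_disjoint_split(records)

train_questions = {r["question_id"] for r in train}
eval_questions = {r["question_id"] for r in evaluation}
```

Because the same students answer questions on both sides, this kind of split keeps the student cohort shared while keeping the questions disjoint.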

---

## Dataset Overview

| Property | Value |
|---|---:|
| Total student responses | ~31k |
| Unique questions | 239 |
| Students | 173 |
| Gold-labeled responses | 3,222 |
| Language | German |
| Domain | University database systems course |
| Score scale | 0, 0.25, 0.5, 0.75, 1 |
| Split strategy | Question-disjoint |
| Evaluation setting | Item-level and virtual-exam level |

The gold subset contains expert-annotated ordinal grades and is used as the benchmark evaluation standard. The remaining data can be used for optional training, representation learning, or silver-label experiments.

---

## Why Virtual Exams?

Most ASAG benchmarks evaluate models as isolated item-level predictors: one answer in, one score out.

However, real educational decisions are rarely based on a single response. They depend on cumulative performance across multiple questions, such as:

- total exam scores,
- student rankings,
- pass/fail thresholds,
- final grade categories,
- grade-boundary decisions.

VEX introduces **virtual exams** to model this setting. A virtual exam is created by sampling multiple held-out questions and aggregating the answers of students who responded to all selected questions. This makes it possible to evaluate whether a grading system preserves student-level outcomes, not just individual answer scores.
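
The construction described above can be sketched as follows. This is an illustration under stated assumptions, not the VEX construction utility: the function name `build_virtual_exam`, the `(student_id, question_id) -> score` dictionary layout, and the use of a plain sum as the exam total are all hypothetical choices made for the example.

```python
import random


def build_virtual_exam(scores, held_out_questions, n_questions, seed=0):
    """Sample questions and total the scores of students who answered all of them.

    `scores` maps (student_id, question_id) to a score in [0, 1]
    (hypothetical layout, chosen for this sketch).
    """
    rng = random.Random(seed)
    selected = rng.sample(sorted(held_out_questions), n_questions)

    # Keep only students with a response to every selected question.
    students = {student for (student, _question) in scores}
    complete = sorted(
        s for s in students if all((s, q) in scores for q in selected)
    )

    # Aggregate per-answer scores into one exam total per student.
    totals = {s: sum(scores[(s, q)] for q in selected) for s in complete}
    return selected, totals


# Toy data: s2 skipped q3 and is therefore excluded from this exam.
scores = {
    ("s1", "q1"): 1.0, ("s1", "q2"): 0.5, ("s1", "q3"): 0.75,
    ("s2", "q1"): 0.5, ("s2", "q2"): 0.5,
    ("s3", "q1"): 0.25, ("s3", "q2"): 0.0, ("s3", "q3"): 1.0,
}
selected, totals = build_virtual_exam(scores, {"q1", "q2", "q3"}, n_questions=3)
```

Running the same construction once with gold scores and once with predicted scores yields two totals per student, which is what the exam-level comparison operates on.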

---

## Evaluation Dimensions

### 1. Item-Level Grading

Standard ASAG metrics are reported for comparability with prior work, including:

- Mean Squared Error,
- Mean Absolute Error,
- Quadratic Weighted Kappa,
- ordinal agreement metrics.

These metrics measure local scoring quality on individual responses.
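
For reference, the three named metrics can be computed on the five-point score scale with the standard textbook definitions, as in the sketch below. It is a self-contained illustration, not the VEX evaluation code; the function names and the toy scores are assumptions.

```python
SCALE = [0.0, 0.25, 0.5, 0.75, 1.0]  # the five-point score scale


def mean_squared_error(gold, pred):
    return sum((g - p) ** 2 for g, p in zip(gold, pred)) / len(gold)


def mean_absolute_error(gold, pred):
    return sum(abs(g - p) for g, p in zip(gold, pred)) / len(gold)


def quadratic_weighted_kappa(gold, pred, scale=SCALE):
    """Cohen's kappa with quadratic weights over ordinal categories."""
    k = len(scale)
    index = {value: i for i, value in enumerate(scale)}

    # Observed confusion matrix: rows are gold, columns are predictions.
    observed = [[0] * k for _ in range(k)]
    for g, p in zip(gold, pred):
        observed[index[g]][index[p]] += 1

    n = len(gold)
    gold_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count under independence of the two marginals.
            weighted_expected += weight * gold_hist[i] * pred_hist[j] / n

    # Undefined when both raters use a single identical category.
    if weighted_expected == 0:
        return float("nan")
    return 1.0 - weighted_observed / weighted_expected


gold = [0.0, 0.25, 0.5, 0.75, 1.0]
pred = [0.0, 0.5, 0.5, 0.75, 0.75]

mse = mean_squared_error(gold, pred)
mae = mean_absolute_error(gold, pred)
qwk = quadratic_weighted_kappa(gold, pred)
```

The kappa implementation expects predictions that already lie on the score scale; continuous model outputs would need to be rounded to the nearest scale value first.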

### 2. Exam-Level Assessment

Virtual exams evaluate cumulative grading behaviour using metrics such as:

- **EL-τb:** exam-level Kendall rank correlation,
- **EL-Acc:** exact final-grade agreement,
- **EL-QWK:** ordinal agreement on final grade categories,
- pass/fail consistency under different grading schemes.

VEX supports both absolute and distribution-based grading schemes.
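
A rough sketch of how such exam-level quantities could be computed from gold and predicted exam totals is given below. It assumes an absolute grading scheme with fixed thresholds. The threshold values, the function names, and the toy totals are hypothetical and are not taken from VEX. A distribution-based scheme is not shown.

```python
import bisect
import math

# Hypothetical grade boundaries on the fraction of achievable points.
# Category 0 means "fail"; these values are illustrative, not VEX's.
GRADE_THRESHOLDS = [0.5, 0.65, 0.8, 0.9]


def kendall_tau_b(x, y):
    """Kendall rank correlation with the tau-b correction for ties."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both rankings: contributes to neither term
            if dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + only_x_tied) * (untied + only_y_tied))
    return (concordant - discordant) / denom if denom else float("nan")


def grade_category(total, max_points, thresholds=GRADE_THRESHOLDS):
    """Map an exam total to an ordinal grade category (absolute scheme)."""
    return bisect.bisect_right(thresholds, total / max_points)


def agreement(a, b):
    """Fraction of positions at which two label sequences match."""
    return sum(u == v for u, v in zip(a, b)) / len(a)


# Toy exam totals for four students on a three-question virtual exam.
gold_totals = [2.25, 1.25, 3.0, 0.5]
pred_totals = [2.0, 1.5, 3.0, 0.75]
max_points = 3.0

gold_grades = [grade_category(t, max_points) for t in gold_totals]
pred_grades = [grade_category(t, max_points) for t in pred_totals]

tau_b = kendall_tau_b(gold_totals, pred_totals)
el_acc = agreement(gold_grades, pred_grades)
pass_fail = agreement([g > 0 for g in gold_grades], [g > 0 for g in pred_grades])
```

In this toy case the predicted totals rank the students exactly as the gold totals do, yet one student lands on the other side of the pass threshold. Ranking agreement alone would not reveal that kind of boundary disagreement.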

### 3. Feedback Utility

Generated feedback is evaluated along pedagogical dimensions:

- diagnostic accuracy,
- groundedness,
- score alignment,
- actionability,
- specificity,
- pedagogical tone.

This allows VEX to evaluate systems that produce both grades and natural-language feedback.

---

## Intended Use

VEX is intended for research on:

- automated short answer grading,
- educational NLP,
- LLM-based grading,
- feedback generation,
- exam-level evaluation,
- benchmark design,
- model calibration in educational assessment,
- teacher-student and silver-label training pipelines.

The benchmark is especially useful for studying cases where item-level metrics alone are insufficient to judge whether a system is safe and useful for educational deployment.

---

## Organization Contents

This Hugging Face organization hosts the VEX benchmark resources, including dataset releases, evaluation code, documentation, and supporting artifacts.

Typical resources include:

- curated dataset files,
- gold-labeled benchmark subsets,
- optional silver-label training data,
- virtual exam construction utilities,
- metric computation scripts,
- evaluation reports,
- documentation for reproducing reported results.

Please refer to the individual repository README files for exact file descriptions and usage instructions.

---

## Citation

If you use VEX, please cite the corresponding paper once it is available.

```bibtex
@misc{vex2026,
  title  = {VEX: A Virtual Exam Benchmark for Automated Short Answer Grading},
  author = {TBD},
  year   = {2026},
  note   = {Dataset and benchmark for longitudinal ASAG evaluation}
}
```