---
title: VEX
emoji: 📊
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
short_description: Virtual Exam Benchmark for Automated Short Answer Grading
---
# VEX — Virtual Exam Benchmark
**VEX** is a longitudinal benchmark for **Automated Short Answer Grading (ASAG)**.
It is designed to evaluate grading systems under realistic educational conditions by moving beyond isolated item-level prediction and towards **virtual exam-based assessment**, **student-level aggregation**, and **feedback quality evaluation**.
VEX is built from real student responses collected in a live university database systems course. The benchmark supports research on joint **grading** and **feedback generation**, with a focus on deployment-relevant evaluation: ranking consistency, pass/fail decisions, grade-boundary agreement, and pedagogical feedback usefulness.
---
## What VEX Provides
- **Item-level grading:** evaluation of individual student answers using standard ASAG metrics.
- **Exam-level evaluation:** construction of virtual exams from multiple held-out questions answered by overlapping students.
- **Feedback quality evaluation:** assessment of generated feedback for diagnostic accuracy, groundedness, actionability, specificity, score alignment, and pedagogical tone.
- **Question-disjoint splits:** strict question-level splitting to reduce leakage between training and evaluation.
- **Longitudinal student structure:** repeated responses from the same student cohort enable aggregation across questions and exam-style analysis.
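
The question-disjoint splitting listed above can be illustrated with a minimal sketch. The record fields (`student_id`, `question_id`, `answer`), the function name `question_disjoint_split`, and the `eval_fraction` parameter are assumptions made for illustration; the actual VEX splitting code and file schema may differ.

```python
import random
from typing import Dict, List, Tuple


def question_disjoint_split(
    responses: List[Dict],
    eval_fraction: float = 0.2,
    seed: int = 0,
) -> Tuple[List[Dict], List[Dict]]:
    """Split responses so that no question appears in both partitions."""
    # Split on question IDs, not on individual responses.
    question_ids = sorted({r["question_id"] for r in responses})
    rng = random.Random(seed)
    rng.shuffle(question_ids)

    n_eval = max(1, round(len(question_ids) * eval_fraction))
    eval_questions = set(question_ids[:n_eval])

    train = [r for r in responses if r["question_id"] not in eval_questions]
    evaluation = [r for r in responses if r["question_id"] in eval_questions]
    return train, evaluation


# Toy data (hypothetical): 3 students, each answering 5 questions.
responses = [
    {"student_id": s, "question_id": q, "answer": f"answer {s}-{q}"}
    for s in ("s1", "s2", "s3")
    for q in ("q1", "q2", "q3", "q4", "q5")
]

train, evaluation = question_disjoint_split(responses, eval_fraction=0.2, seed=0)
train_q = {r["question_id"] for r in train}
eval_q = {r["question_id"] for r in evaluation}
print("shared questions:", train_q & eval_q)
```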
---
## Dataset Overview
| Property | Value |
|---|---:|
| Total student responses | ~31k |
| Unique questions | 239 |
| Students | 173 |
| Gold-labeled responses | 3,222 |
| Language | German |
| Domain | University database systems course |
| Score scale | 0, 0.25, 0.5, 0.75, 1 |
| Split strategy | Question-disjoint |
| Evaluation setting | Item-level and virtual-exam level |
The gold subset contains expert-annotated ordinal grades and is used as the benchmark evaluation standard. The remaining data can be used for optional training, representation learning, or silver-label experiments.
---
## Why Virtual Exams?
Most ASAG benchmarks evaluate models as isolated item-level predictors: one answer in, one score out.
However, real educational decisions are rarely based on a single response. They depend on cumulative performance across multiple questions, such as:
- total exam scores,
- student rankings,
- pass/fail thresholds,
- final grade categories,
- grade-boundary decisions.
VEX introduces **virtual exams** to model this setting. A virtual exam is created by sampling multiple held-out questions and aggregating the answers of students who responded to all selected questions. This makes it possible to evaluate whether a grading system preserves student-level outcomes, not just individual answer scores.
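
The construction described above can be sketched as follows. This is an illustrative sketch, not the VEX utility itself: the function name `build_virtual_exam`, the `(student, question) -> score` mapping, and uniform sampling of questions are all assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def build_virtual_exam(
    scores: Dict[Tuple[str, str], float],
    held_out_questions: List[str],
    n_questions: int,
    seed: int = 0,
) -> Tuple[List[str], Dict[str, float]]:
    """Sample held-out questions and total the scores of students
    who answered every sampled question."""
    rng = random.Random(seed)
    exam_questions = rng.sample(sorted(held_out_questions), n_questions)

    # Record which questions each student answered.
    answered = defaultdict(set)
    for student, question in scores:
        answered[student].add(question)

    # Keep only students with a response to all sampled questions.
    totals = {
        student: sum(scores[(student, q)] for q in exam_questions)
        for student, qs in answered.items()
        if qs.issuperset(exam_questions)
    }
    return exam_questions, totals


# Toy scores on the VEX scale (hypothetical data); s3 skipped q2.
scores = {
    ("s1", "q1"): 1.0, ("s1", "q2"): 0.5, ("s1", "q3"): 0.75,
    ("s2", "q1"): 0.25, ("s2", "q2"): 0.0, ("s2", "q3"): 0.5,
    ("s3", "q1"): 1.0, ("s3", "q3"): 1.0,
}

exam_questions, totals = build_virtual_exam(scores, ["q1", "q2", "q3"], n_questions=3)
print(totals)  # s3 is excluded because the q2 response is missing
```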
---
## Evaluation Dimensions
### 1. Item-Level Grading
Standard ASAG metrics are reported for comparability with prior work, including:
- Mean Squared Error,
- Mean Absolute Error,
- Quadratic Weighted Kappa,
- ordinal agreement metrics.
These metrics measure local scoring quality on individual responses.
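
As a rough illustration, the first three metrics can be computed on the five-point VEX score scale in plain Python. This is a sketch under assumptions: the function names are hypothetical, the kappa implementation treats the five scores as equally spaced ordinal categories, and the official evaluation scripts may compute these metrics differently.

```python
from typing import Sequence

SCALE = (0.0, 0.25, 0.5, 0.75, 1.0)  # score scale from the dataset overview


def mse(gold: Sequence[float], pred: Sequence[float]) -> float:
    return sum((g - p) ** 2 for g, p in zip(gold, pred)) / len(gold)


def mae(gold: Sequence[float], pred: Sequence[float]) -> float:
    return sum(abs(g - p) for g, p in zip(gold, pred)) / len(gold)


def quadratic_weighted_kappa(
    gold: Sequence[float],
    pred: Sequence[float],
    scale: Sequence[float] = SCALE,
) -> float:
    """Cohen's kappa with quadratic weights over ordinal categories."""
    index = {value: i for i, value in enumerate(scale)}
    k, n = len(scale), len(gold)

    # Confusion matrix of gold category (rows) vs. predicted category (columns).
    observed = [[0] * k for _ in range(k)]
    for g, p in zip(gold, pred):
        observed[index[g]][index[p]] += 1

    gold_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * gold_hist[i] * pred_hist[j] / n

    if weighted_expected == 0:
        # Undefined when gold and predictions sit in one shared category.
        return float("nan")
    return 1.0 - weighted_observed / weighted_expected


# Toy example (hypothetical data): one prediction is off by one step.
gold = [0.0, 0.25, 0.5, 0.75, 1.0]
pred = [0.25, 0.25, 0.5, 0.75, 1.0]

item_mse = mse(gold, pred)
item_mae = mae(gold, pred)
item_qwk = quadratic_weighted_kappa(gold, pred)
print(item_mse, item_mae, round(item_qwk, 3))
```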
### 2. Exam-Level Assessment
Virtual exams evaluate cumulative grading behaviour using metrics such as:
- **EL-τb:** exam-level Kendall rank correlation,
- **EL-Acc:** exact final-grade agreement,
- **EL-QWK:** ordinal agreement on final grade categories,
- pass/fail consistency under different grading schemes.
VEX supports both absolute and distribution-based grading schemes.
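
The sketch below shows how such exam-level metrics could be computed from per-student exam totals. It rests on several assumptions: the grade cutoffs in `absolute_grade` and the `pass_fraction` in `distribution_pass` are hypothetical placeholders rather than the schemes VEX uses, and all function names are invented for illustration. EL-QWK is omitted; it would apply a quadratic weighted kappa to the final grade categories.

```python
import math
from typing import List, Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall rank correlation with the tau-b tie correction."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0:
                ties_x += 1
            if dy == 0:
                ties_y += 1
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    n_pairs = n * (n - 1) // 2
    denom = math.sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


def absolute_grade(total: float, max_total: float,
                   cutoffs: Sequence[float] = (0.5, 0.65, 0.8, 0.9)) -> int:
    """Absolute scheme: grade category 0..4 from the fraction of points.
    The cutoffs are hypothetical; category 0 means fail."""
    fraction = total / max_total
    return sum(fraction >= c for c in cutoffs)


def distribution_pass(totals: Sequence[float], pass_fraction: float = 0.5) -> List[bool]:
    """Distribution-based scheme: the top share of the cohort passes.
    Students tied with the cutoff score all pass."""
    ranked = sorted(totals, reverse=True)
    n_pass = max(1, math.ceil(pass_fraction * len(ranked)))
    cutoff = ranked[n_pass - 1]
    return [t >= cutoff for t in totals]


def agreement(a: Sequence, b: Sequence) -> float:
    return sum(u == v for u, v in zip(a, b)) / len(a)


# Toy totals for five students on a 4-question exam (hypothetical data).
gold_totals = [3.75, 3.0, 2.5, 1.75, 1.0]
pred_totals = [3.5, 3.25, 1.75, 2.0, 1.0]

el_tau_b = kendall_tau_b(gold_totals, pred_totals)
gold_grades = [absolute_grade(t, 4.0) for t in gold_totals]
pred_grades = [absolute_grade(t, 4.0) for t in pred_totals]
el_acc = agreement(gold_grades, pred_grades)
pass_agreement_abs = agreement([g >= 1 for g in gold_grades],
                               [g >= 1 for g in pred_grades])
pass_agreement_dist = agreement(distribution_pass(gold_totals),
                                distribution_pass(pred_totals))
print(el_tau_b, el_acc, pass_agreement_abs, pass_agreement_dist)
```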
### 3. Feedback Utility
Generated feedback is evaluated along pedagogical dimensions:
- diagnostic accuracy,
- groundedness,
- score alignment,
- actionability,
- specificity,
- pedagogical tone.
This allows VEX to evaluate systems that produce both grades and natural-language feedback.
---
## Intended Use
VEX is intended for research on:
- automated short answer grading,
- educational NLP,
- LLM-based grading,
- feedback generation,
- exam-level evaluation,
- benchmark design,
- model calibration in educational assessment,
- teacher-student and silver-label training pipelines.
The benchmark is especially useful for studying cases where item-level metrics alone are insufficient to judge whether a system is safe and useful for educational deployment.
---
## Organization Contents
This Hugging Face organization hosts the VEX benchmark resources, including dataset releases, evaluation code, documentation, and supporting artifacts.
Typical resources include:
- curated dataset files,
- gold-labeled benchmark subsets,
- optional silver-label training data,
- virtual exam construction utilities,
- metric computation scripts,
- evaluation reports,
- documentation for reproducing reported results.
Please refer to the individual repository README files for exact file descriptions and usage instructions.
---
## Citation
If you use VEX, please cite the corresponding paper once available.
```bibtex
@misc{vex2026,
title = {VEX: A Virtual Exam Benchmark for Automated Short Answer Grading},
author = {TBD},
year = {2026},
note = {Dataset and benchmark for longitudinal ASAG evaluation}
}
```