---
title: VEX
emoji: 📊
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
short_description: Virtual Exam Benchmark for Automated Short Answer Grading
---

# VEX — Virtual Exam Benchmark

**VEX** is a longitudinal benchmark for **Automated Short Answer Grading (ASAG)**.

It is designed to evaluate grading systems under realistic educational conditions by moving beyond isolated item-level prediction and towards **virtual exam-based assessment**, **student-level aggregation**, and **feedback quality evaluation**.

VEX is built from real student responses collected in a live university database systems course. The benchmark supports research on joint **grading** and **feedback generation**, with a focus on deployment-relevant evaluation: ranking consistency, pass/fail decisions, grade-boundary agreement, and pedagogical feedback usefulness.

---

## What VEX Provides

- **Item-level grading:** evaluation of individual student answers using standard ASAG metrics.
- **Exam-level evaluation:** construction of virtual exams from multiple held-out questions that were all answered by the same students.
- **Feedback quality evaluation:** assessment of generated feedback for diagnostic accuracy, groundedness, actionability, specificity, score alignment, and pedagogical tone.
- **Question-disjoint splits:** strict question-level splitting to reduce leakage between training and evaluation.
- **Longitudinal student structure:** repeated responses from the same student cohort enable aggregation across questions and exam-style analysis.
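
A question-disjoint split can be sketched as follows. This is a minimal illustration, not the official VEX split procedure: the record fields `student_id`, `question_id`, and `answer`, the function name, and the `eval_fraction` parameter are all assumptions made for the example.

```python
import random


def question_disjoint_split(responses, eval_fraction=0.2, seed=0):
    """Split response records so that no question appears on both sides.

    `responses` is a list of dicts with a hypothetical "question_id" key.
    """
    # Split on the set of questions, not on individual responses.
    question_ids = sorted({r["question_id"] for r in responses})
    rng = random.Random(seed)
    rng.shuffle(question_ids)

    n_eval = max(1, round(len(question_ids) * eval_fraction))
    eval_questions = set(question_ids[:n_eval])

    train = [r for r in responses if r["question_id"] not in eval_questions]
    evaluation = [r for r in responses if r["question_id"] in eval_questions]
    return train, evaluation


# Toy data: 3 students, each answering 5 questions (illustrative only).
responses = [
    {"student_id": s, "question_id": q, "answer": f"answer {s}-{q}"}
    for s in ("s1", "s2", "s3")
    for q in ("q1", "q2", "q3", "q4", "q5")
]

train, evaluation = question_disjoint_split(responses, eval_fraction=0.4, seed=7)
train_questions = {r["question_id"] for r in train}
eval_questions = {r["question_id"] for r in evaluation}
```

Because the split is made over question IDs, every response to a given question lands on the same side, while the same students can still appear in both partitions.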

---

## Dataset Overview

| Property | Value |
|---|---:|
| Total student responses | ~31k |
| Unique questions | 239 |
| Students | 173 |
| Gold-labeled responses | 3,222 |
| Language | German |
| Domain | University database systems course |
| Score scale | 0, 0.25, 0.5, 0.75, 1 |
| Split strategy | Question-disjoint |
| Evaluation setting | Item-level and virtual-exam level |

The gold subset contains expert-annotated ordinal grades and is used as the benchmark evaluation standard. The remaining data can be used for optional training, representation learning, or silver-label experiments.

---

## Why Virtual Exams?

Most ASAG benchmarks evaluate models as isolated item-level predictors: one answer in, one score out.

However, real educational decisions are rarely based on a single response. They depend on cumulative performance across multiple questions, such as:

- total exam scores,
- student rankings,
- pass/fail thresholds,
- final grade categories,
- grade-boundary decisions.

VEX introduces **virtual exams** to model this setting. A virtual exam is created by sampling multiple held-out questions and aggregating the answers of students who responded to all selected questions. This makes it possible to evaluate whether a grading system preserves student-level outcomes, not just individual answer scores.
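
The construction described above can be sketched in a few lines. This is an illustrative sketch under assumptions, not the VEX utility itself: the `build_virtual_exam` function, the `(student_id, question_id) -> score` mapping, and the toy scores are all hypothetical.

```python
import random


def build_virtual_exam(responses, held_out_questions, n_questions, seed=0):
    """Sample questions and keep only students who answered all of them.

    `responses` maps (student_id, question_id) -> score on the 0..1 scale.
    Returns (sampled question list, {student_id: total score}).
    """
    rng = random.Random(seed)
    exam_questions = rng.sample(sorted(held_out_questions), n_questions)

    students = {student for (student, _question) in responses}
    totals = {}
    for student in sorted(students):
        # A student is included only with a complete set of answers.
        if all((student, q) in responses for q in exam_questions):
            totals[student] = sum(responses[(student, q)] for q in exam_questions)
    return exam_questions, totals


# Toy scores; s3 did not answer q2 and is therefore excluded.
responses = {
    ("s1", "q1"): 1.0, ("s1", "q2"): 0.5, ("s1", "q3"): 0.75,
    ("s2", "q1"): 0.25, ("s2", "q2"): 0.0, ("s2", "q3"): 0.5,
    ("s3", "q1"): 1.0, ("s3", "q3"): 0.5,
}

exam_questions, totals = build_virtual_exam(
    responses, {"q1", "q2", "q3"}, n_questions=3, seed=0
)
```

Running the same function once with gold scores and once with predicted scores yields two sets of exam totals for the same students, which is what the exam-level comparison needs.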

---

## Evaluation Dimensions

### 1. Item-Level Grading

Standard ASAG metrics are reported for comparability with prior work, including:

- Mean Squared Error,
- Mean Absolute Error,
- Quadratic Weighted Kappa,
- ordinal agreement metrics.

These metrics measure local scoring quality on individual responses.
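
For orientation, the three named metrics can be computed on the five-point score scale as in the sketch below. It uses only the standard library and is not the VEX evaluation script; the helper names and the toy scores are assumptions, and mapping scores to classes 0 to 4 by multiplying by 4 is one possible convention.

```python
def mse(gold, pred):
    """Mean squared error between two equally long score lists."""
    return sum((g - p) ** 2 for g, p in zip(gold, pred)) / len(gold)


def mae(gold, pred):
    """Mean absolute error between two equally long score lists."""
    return sum(abs(g - p) for g, p in zip(gold, pred)) / len(gold)


def quadratic_weighted_kappa(gold, pred, n_classes=5):
    """QWK for integer class labels in 0..n_classes-1."""
    n = len(gold)
    observed = [[0] * n_classes for _ in range(n_classes)]
    for g, p in zip(gold, pred):
        observed[g][p] += 1
    gold_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(n_classes))
                 for j in range(n_classes)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * gold_hist[i] * pred_hist[j] / n
    # The expected term is zero only if gold and pred use one identical class.
    if weighted_expected == 0:
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


def to_class(score):
    """Map a score in {0, 0.25, 0.5, 0.75, 1} to a class index 0..4."""
    return round(score * 4)


# Toy example (not VEX data).
gold = [0.0, 0.25, 0.5, 0.75, 1.0]
pred = [0.25, 0.25, 0.5, 0.5, 1.0]

item_mse = mse(gold, pred)
item_mae = mae(gold, pred)
item_qwk = quadratic_weighted_kappa(
    [to_class(s) for s in gold], [to_class(s) for s in pred]
)
```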

### 2. Exam-Level Assessment

Virtual exams evaluate cumulative grading behaviour using metrics such as:

- **EL-τb:** exam-level Kendall τ-b rank correlation,
- **EL-Acc:** exact final-grade agreement,
- **EL-QWK:** ordinal agreement on final grade categories,
- pass/fail consistency under different grading schemes.

VEX supports both absolute and distribution-based grading schemes.
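
The sketch below illustrates how such exam-level metrics could be computed from gold and predicted exam totals under an absolute grading scheme. The grade boundaries, the pass threshold (the lowest boundary), the function names, and the toy totals are assumptions for illustration; they are not the boundaries used in VEX, and the distribution-based scheme is not covered here.

```python
from bisect import bisect_right
from itertools import combinations
from math import sqrt

# Hypothetical absolute boundaries on the fraction of points achieved.
# Grade 0 means fail; grades 1..4 are passing categories.
GRADE_BOUNDARIES = [0.5, 0.65, 0.8, 0.9]


def kendall_tau_b(x, y):
    """Kendall tau-b rank correlation with tie correction."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both rankings: counted in neither term
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denominator = sqrt((concordant + discordant + ties_x_only)
                       * (concordant + discordant + ties_y_only))
    return (concordant - discordant) / denominator


def final_grade(total, max_points):
    """Map an exam total to a grade category via the absolute boundaries."""
    return bisect_right(GRADE_BOUNDARIES, total / max_points)


def agreement(a, b):
    """Fraction of positions where two label lists match exactly."""
    return sum(u == v for u, v in zip(a, b)) / len(a)


# Toy exam totals for four students on a four-question exam (not VEX data).
max_points = 4
gold_totals = [2.25, 0.75, 1.5, 3.0]
pred_totals = [2.0, 1.0, 2.25, 2.75]

el_tau_b = kendall_tau_b(gold_totals, pred_totals)

gold_grades = [final_grade(t, max_points) for t in gold_totals]
pred_grades = [final_grade(t, max_points) for t in pred_totals]
el_acc = agreement(gold_grades, pred_grades)

pass_fail_agreement = agreement(
    [g > 0 for g in gold_grades], [g > 0 for g in pred_grades]
)
```

In this toy case the predicted totals stay close to the gold totals, yet one student moves from fail to pass, which is the kind of boundary effect that item-level metrics do not expose.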

### 3. Feedback Utility

Generated feedback is evaluated along pedagogical dimensions:

- diagnostic accuracy,
- groundedness,
- score alignment,
- actionability,
- specificity,
- pedagogical tone.

This allows VEX to evaluate systems that produce both grades and natural-language feedback.

---

## Intended Use

VEX is intended for research on:

- automated short answer grading,
- educational NLP,
- LLM-based grading,
- feedback generation,
- exam-level evaluation,
- benchmark design,
- model calibration in educational assessment,
- teacher-student and silver-label training pipelines.

The benchmark is especially useful for studying cases where item-level metrics alone are insufficient to judge whether a system is safe and useful for educational deployment.

---

## Organization Contents

This Hugging Face organization hosts the VEX benchmark resources, including dataset releases, evaluation code, documentation, and supporting artifacts.

Typical resources include:

- curated dataset files,
- gold-labeled benchmark subsets,
- optional silver-label training data,
- virtual exam construction utilities,
- metric computation scripts,
- evaluation reports,
- documentation for reproducing reported results.

Please refer to the individual repository README files for exact file descriptions and usage instructions.

---

## Citation

If you use VEX, please cite the corresponding paper once it is available.

```bibtex
@misc{vex2026,
  title        = {VEX: A Virtual Exam Benchmark for Automated Short Answer Grading},
  author       = {TBD},
  year         = {2026},
  note         = {Dataset and benchmark for longitudinal ASAG evaluation}
}
```