Papers
arxiv:2601.16134

LLM Prompt Evaluation for Educational Applications

Published on Jan 22 · Submitted by Rajkumar rawal on Jan 23
Abstract

A systematic evaluation framework using tournament-style testing and the Glicko2 rating system demonstrates effective prompt engineering for educational LLM applications, prioritizing metacognitive learning strategies.

AI-generated summary

As large language models (LLMs) become increasingly common in educational applications, there is a growing need for evidence-based methods to design and evaluate LLM prompts that produce personalized and pedagogically aligned outputs. This study presents a generalizable, systematic approach for evaluating prompts, demonstrated through an analysis of LLM-generated follow-up questions in a structured dialogue activity. Six prompt templates were designed and tested. The templates incorporated established prompt engineering patterns, with each prompt emphasizing distinct pedagogical strategies. The prompt templates were compared through a tournament-style evaluation framework that can be adapted for other educational applications. The tournament employed the Glicko2 rating system with eight judges evaluating question pairs across three dimensions: format, dialogue support, and appropriateness for learners. Data was sourced from 120 authentic user interactions across three distinct educational deployments. Results showed that a single prompt related to strategic reading outperformed other templates with win probabilities ranging from 81% to 100% in pairwise comparisons. This prompt combined persona and context manager patterns and was designed to support metacognitive learning strategies such as self-directed learning. The methodology showcases how educational technology researchers can systematically evaluate and improve prompt designs, moving beyond ad-hoc prompt engineering toward evidence-based prompt development for educational applications.
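The core mechanism is simple to sketch: judges produce pairwise win/loss judgments between prompt templates, a rating system converts those judgments into scores, and rating gaps translate into win probabilities like the 81–100% figures above. The paper uses Glicko2 (which also tracks rating deviation and volatility); the sketch below substitutes a simpler Elo-style update to illustrate the idea, and the template names and judgments are hypothetical, not from the study.

```python
import math

def expected_score(r_a: float, r_b: float) -> float:
    """Expected win probability of A over B on the Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def apply_judgment(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Update ratings in place after one pairwise judgment (winner beat loser)."""
    e_win = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - e_win)       # big upset -> big rating shift
    ratings[winner] += delta
    ratings[loser] -= delta

# Hypothetical prompt templates, all starting at a neutral rating.
templates = ["strategic_reading", "constructivist", "socratic"]
ratings = {t: 1500.0 for t in templates}

# Hypothetical judge verdicts: (winner, loser) per question pair.
judgments = [
    ("strategic_reading", "constructivist"),
    ("strategic_reading", "socratic"),
    ("socratic", "constructivist"),
]
for w, l in judgments:
    apply_judgment(ratings, w, l)

# Rating gaps now imply pairwise win probabilities.
for a in templates:
    for b in templates:
        if a != b:
            p = expected_score(ratings[a], ratings[b])
            print(f"P({a} beats {b}) = {p:.2f}")
```

A full Glicko2 implementation would additionally widen each template's rating deviation between evaluation periods, so templates judged on fewer pairs carry more uncertainty in their final ranking.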

Community

Some key observations from the paper:

-- Prompt design matters as much as the model:
The study shows that different prompt templates using the same LLM produce significantly different educational outcomes, making prompt engineering a critical lever in AI-supported learning.

-- Persona + Context Manager is the strongest combination:
The Strategic Reading Coach prompt, combining the Persona and Context Manager patterns, outperformed all others with 81–100% win probability, making it the most effective at generating follow-up educational questions.

-- Systematic prompt evaluation beats ad-hoc refinement:
Tournament-style evaluation using comparative judgment and Glicko2 ranking provides a reproducible, evidence-based alternative to informal trial-and-error prompt tuning.

-- Theory-grounded prompts perform better:
Prompts explicitly grounded in adult learning theory, self-directed learning, and metacognition consistently generated higher-quality educational dialogue than theory-light designs.

-- Theoretical alignment alone is not enough:
Some prompts rooted in strong learning theories (e.g., constructivism) still underperformed, highlighting that empirical evaluation is essential: good theory must be paired with effective prompt patterns.

