EVT Item Classifier: Expectancy-Value Theory Construct Tagger
A fine-tuned text classifier that assigns psychological test items (e.g., survey questions, scale items) to one of six categories derived from Expectancy-Value Theory (Eccles et al., 1983; Eccles & Wigfield, 2002):
| Label | EVT Construct | Example item |
|---|---|---|
| ATTAINMENT_VALUE | Personal importance of doing well | "Doing well in math is important to who I am." |
| COST | Perceived negative consequences of engagement | "I have to give up too much to succeed in this class." |
| EXPECTANCY | Beliefs about future success | "I am confident I can master the skills taught in this course." |
| INTRINSIC_VALUE | Enjoyment and interest | "I find the content of this course very interesting." |
| UTILITY_VALUE | Usefulness for future goals | "What I learn in this class will be useful for my career." |
| OTHER | Not classifiable as an EVT construct | "I usually sit in the front row." |
Intended Use
This model is intended for academic research in educational psychology, motivation science, and psychometrics. Typical use cases include:
- Automated content analysis of existing item pools and questionnaire banks for EVT construct coverage
- Scale development assistance: screening candidate items during the item-writing phase
- Systematic reviews: coding large corpora of test items from published instruments
- Construct validation: checking whether items align with their intended EVT construct
Out-of-Scope Uses
- Clinical or diagnostic decision-making: This model classifies test items, not respondents. It should not be used to assess individuals.
- Replacement for human coding: The model achieves substantial but not perfect agreement with human coders (κ = .71). It is intended as a tool to assist researchers, not to replace expert judgment.
- Non-English items: The model was trained and evaluated on English-language items only.
- Non-EVT constructs: The model will force any input into one of the six categories. Items from unrelated theoretical frameworks (e.g., Big Five personality, cognitive load) will be classified as OTHER at best, or spuriously assigned to an EVT category.
Model Description
Architecture
The model uses a custom classification head on top of sentence-transformers/all-mpnet-base-v2 (110M parameters):
```
MPNet encoder
  ↓
ConcatPooling (mean + max over token embeddings → 2 × hidden_size)
  ↓
Dropout(0.2) → Dense(2h → h) → GELU → Dropout(0.2)
  ↓
NormLinear head (cosine similarity classifier, τ = 20)
  ↓
5 independent sigmoid outputs (one per EVT construct)
```
The OTHER class is not explicitly learned. Instead, an item is classified as OTHER when no construct's sigmoid probability exceeds the decision threshold (default: 0.50). This One-vs-Rest (OvR) formulation avoids forcing the model to learn a coherent representation for the heterogeneous OTHER category.
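The decision rule described above can be sketched in a few lines (an illustration only; the probability vectors below are made up):

```python
import numpy as np

# Alphabetical construct order, matching the model card
CORE_LABELS = ["ATTAINMENT_VALUE", "COST", "EXPECTANCY",
               "INTRINSIC_VALUE", "UTILITY_VALUE"]

def ovr_decision(sigmoid_probs, threshold=0.50):
    """Map five independent sigmoid probabilities to a single label.

    OTHER is the fallback when no core construct clears the threshold.
    """
    probs = np.asarray(sigmoid_probs)
    if probs.max() >= threshold:
        return CORE_LABELS[int(probs.argmax())]
    return "OTHER"

print(ovr_decision([0.05, 0.10, 0.91, 0.20, 0.08]))  # EXPECTANCY
print(ovr_decision([0.12, 0.08, 0.31, 0.22, 0.18]))  # OTHER (nothing >= 0.50)
```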
Key Design Choices
| Component | Rationale |
|---|---|
| ConcatPooling | Concatenating mean and max pooling captures both distributional and salient features from short text (Howard & Ruder, 2018) |
| NormLinear (cosine) head | L2-normalised features and weights make classification direction-based, improving robustness to domain shift (Wang et al., 2018) |
| Asymmetric Loss (γ_neg = 4, γ_pos = 1) | Down-weights easy negatives in the multi-label sigmoid setup, preventing the dominant negative class from overwhelming the loss (Ridnik et al., 2021) |
| FGM adversarial training | Embedding-space perturbations regularise the model, improving generalisation from synthetic to real items (Miyato et al., 2017) |
| LLRD (decay = 0.9) | Lower layers receive smaller learning rates, preserving pre-trained linguistic knowledge (Sun et al., 2019) |
| Gradual unfreezing | Head-only training for epoch 1, then full end-to-end fine-tuning, to prevent catastrophic forgetting (Howard & Ruder, 2018) |
| SWA | Averaging top-3 checkpoints produces a flatter minimum and marginally improves κ (Izmailov et al., 2018) |
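The FGM step in the table can be sketched as a generic embedding-space adversarial perturbation; this is an illustration of the technique, not the card's actual training code, and the `epsilon` value and the `"embed"` parameter-name filter are assumptions:

```python
import torch
import torch.nn as nn

class FGM:
    """Fast Gradient Method: nudge embedding weights along grad / ||grad||,
    run a second backward pass on the perturbed model, then restore."""
    def __init__(self, model, epsilon=1.0, emb_name="embed"):
        self.model, self.epsilon, self.emb_name = model, epsilon, emb_name
        self.backup = {}

    def attack(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad and self.emb_name in name and param.grad is not None:
                self.backup[name] = param.data.clone()
                norm = torch.norm(param.grad)
                if norm != 0:
                    param.data.add_(self.epsilon * param.grad / norm)

    def restore(self):
        for name, param in self.model.named_parameters():
            if name in self.backup:
                param.data = self.backup[name]
        self.backup = {}

# Usage inside one training step:
#   loss.backward()            # normal gradients
#   fgm.attack()               # perturb embedding weights
#   model(...).loss.backward() # accumulate adversarial gradients
#   fgm.restore()              # undo perturbation
#   optimizer.step(); optimizer.zero_grad()
```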
Training
Training Data
The model was fine-tuned on the expectancy_value_pool_v2 dataset: approximately 15,000 synthetic psychological test items generated with construct-level attribution, filtered to remove participial-phrase fragments. The data was split 85/7.5/7.5 into train/validation/test sets with stratified sampling.
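An 85/7.5/7.5 stratified split can be reproduced along these lines (a sketch with toy data; the label counts and random seed are assumptions, not the card's actual script):

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for the item pool: 1,000 items with construct labels
items = [f"item {i}" for i in range(1000)]
labels = (["EXPECTANCY"] * 400 + ["INTRINSIC_VALUE"] * 300
          + ["UTILITY_VALUE"] * 300)

# First carve off 15% for validation + test, stratified by label...
X_train, X_tmp, y_train, y_tmp = train_test_split(
    items, labels, test_size=0.15, stratify=labels, random_state=42)
# ...then split that 15% in half: 7.5% validation, 7.5% test.
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 850 75 75
```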
Training Procedure
- Epochs: 12 (with early stopping, patience = 4, monitored on validation κ)
- Batch size: 24 per device × 2 gradient accumulation steps = 48 effective
- Optimizer: AdamW (lr = 3e-5, weight decay = 0.01)
- Scheduler: Linear with 10% warmup
- Precision: bf16 mixed-precision
- Hardware: Single NVIDIA GPU
- Post-training: Stochastic Weight Averaging of top-3 checkpoints
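The post-training step averages checkpoint weights rather than running an SWA schedule; a minimal sketch of the idea (not the exact script, and `best_paths` is a hypothetical list of the top-3 checkpoints by validation κ):

```python
import torch

def average_checkpoints(state_dicts):
    """Average the parameters of several checkpoints (SWA-style)."""
    avg = {}
    for key in state_dicts[0]:
        stacked = torch.stack([sd[key].float() for sd in state_dicts])
        avg[key] = stacked.mean(dim=0)
    return avg

# e.g. average the three best checkpoints and load the result:
# top3 = [torch.load(p, map_location="cpu") for p in best_paths]
# model.load_state_dict(average_checkpoints(top3))
```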
Training Results (Held-Out Synthetic Test Set)
| Metric | Value |
|---|---|
| Accuracy | 0.990 |
| Macro F1 | 0.990 |
| Cohen's κ | 0.990 |
Note: These near-perfect numbers reflect performance on the synthetic data distribution. Real-world performance on human-authored items is reported below.
Evaluation
Test Set
The model was evaluated on N = 1,295 human-coded test items drawn from published and unpublished psychological instruments. Items were independently coded by a human expert and by the model. Items marked as formatting artefacts or third-person phrasing were excluded prior to evaluation. All confidence intervals are bias-corrected and accelerated (BCa) bootstrap CIs with B = 10,000.
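Intervals of this kind can be reproduced with `scipy.stats.bootstrap` and `method="BCa"`; the sketch below uses synthetic agreement data and a simple accuracy statistic as a stand-in for κ:

```python
import numpy as np
from scipy.stats import bootstrap

rng = np.random.default_rng(0)
# Synthetic agreement indicators: 1 = model and human agree, 0 = disagree
agree = rng.binomial(1, 0.765, size=1295)

# BCa bootstrap CI for mean agreement, B = 10,000 resamples
res = bootstrap((agree,), np.mean, n_resamples=10_000,
                confidence_level=0.95, method="BCa", random_state=0)
print(res.confidence_interval.low, res.confidence_interval.high)
```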
Agreement with Human Coder (Full 6-Class)
| Metric | Value | 95% BCa CI |
|---|---|---|
| Cohen's κ (unweighted) | 0.712 | [0.684, 0.740] |
| Cohen's κ (linear weighted) | 0.702 | [0.666, 0.734] |
| Krippendorff's α | 0.712 | [0.683, 0.740] |
| PABAK | 0.717 | – |
| Overall accuracy | 0.765 | [0.741, 0.786] |
| Macro F1 | 0.758 | [0.735, 0.782] |
| Weighted F1 | 0.764 | [0.740, 0.786] |
Cohen's κ = .71 indicates substantial agreement according to Landis & Koch (1977).
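Both statistics are available in scikit-learn; a small illustration with toy codes (not the actual evaluation data):

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Hypothetical human and model codes for six items
human = ["EXPECTANCY", "COST", "OTHER", "UTILITY_VALUE", "EXPECTANCY", "COST"]
model = ["EXPECTANCY", "COST", "EXPECTANCY", "UTILITY_VALUE", "EXPECTANCY", "OTHER"]

print(accuracy_score(human, model))    # observed agreement: 4/6 ≈ 0.667
print(cohen_kappa_score(human, model)) # chance-corrected agreement ≈ 0.538
```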
Per-Class Performance (6-Class)
| Class | Precision | Recall | F1 | κ (OvR) | n |
|---|---|---|---|---|---|
| ATTAINMENT_VALUE | .727 [.648, .804] | .756 [.676, .829] | .741 [.676, .798] | .713 | 123 |
| COST | .708 [.637, .775] | .778 [.712, .843] | .741 [.686, .793] | .705 | 153 |
| EXPECTANCY | .808 [.766, .849] | .855 [.817, .893] | .831 [.799, .861] | .772 | 325 |
| INTRINSIC_VALUE | .838 [.789, .884] | .852 [.805, .895] | .845 [.808, .879] | .810 | 236 |
| OTHER | .621 [.561, .681] | .606 [.546, .667] | .614 [.561, .662] | .523 | 249 |
| UTILITY_VALUE | .860 [.806, .911] | .708 [.645, .769] | .777 [.729, .821] | .739 | 209 |
Core Construct Discrimination (5-Class, Excluding OTHER)
The OTHER category is not a psychological construct; it is a heterogeneous residual class that the model was never explicitly trained to recognise. To separately assess the model's ability to discriminate among the five core EVT constructs, we report a restricted analysis on items where both the human coder and the model assigned a core construct (n = 954, 73.7% of all items).
| Metric | Value | 95% BCa CI |
|---|---|---|
| Cohen's κ (5-class) | 0.845 | [0.816, 0.869] |
| Krippendorff's α | 0.845 | [0.816, 0.869] |
| Overall accuracy | 0.880 | [0.856, 0.898] |
| Macro F1 | 0.867 | [0.843, 0.889] |
Cohen's κ = .84 indicates almost perfect agreement on construct discrimination.
| Class | Precision | Recall | F1 | n |
|---|---|---|---|---|
| ATTAINMENT_VALUE | .823 [.750, .891] | .816 [.742, .884] | .819 [.762, .870] | 114 |
| COST | .793 [.728, .856] | .856 [.794, .912] | .824 [.773, .869] | 139 |
| EXPECTANCY | .911 [.878, .942] | .917 [.885, .946] | .914 [.890, .936] | 303 |
| INTRINSIC_VALUE | .889 [.848, .928] | .918 [.880, .953] | .903 [.874, .931] | 219 |
| UTILITY_VALUE | .925 [.882, .964] | .827 [.769, .881] | .873 [.833, .910] | 179 |
Additionally, when evaluating all items the human coded as a core construct (n = 1,046), allowing the model to predict OTHER, the model achieved κ = .75 [.721, .780] and accuracy = .80 [.775, .824]. Of these 1,046 items, 92 (8.8%) were incorrectly pushed to OTHER by the model.
Summary Across Evaluation Scenarios
| Scenario | κ | Macro F1 | N |
|---|---|---|---|
| Full 6-class (with OTHER) | .712 | .758 | 1,295 |
| 5-class: both raters assigned core construct | .845 | .867 | 954 |
| Human = core, model unrestricted | .752 | – | 1,046 |
The jump from κ = .71 to κ = .84 confirms that the model's primary weakness lies at the EVT/non-EVT detection boundary (the OTHER category), not in discriminating among the five core constructs.
Confusion Matrix (6-Class)
| | Pred: AV | Pred: CO | Pred: EX | Pred: IV | Pred: OT | Pred: UV |
|---|---|---|---|---|---|---|
| True: AV | 93 | 3 | 7 | 7 | 9 | 4 |
| True: CO | 3 | 119 | 3 | 10 | 14 | 4 |
| True: EX | 4 | 17 | 278 | 4 | 22 | 0 |
| True: IV | 4 | 6 | 4 | 201 | 17 | 4 |
| True: OT | 15 | 18 | 39 | 14 | 151 | 12 |
| True: UV | 9 | 5 | 13 | 4 | 30 | 148 |
Cramér's V = .72 (6-class), .84 (5-class core only).
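Cramér's V can be recomputed from the confusion matrix above, using the standard formula V = sqrt(χ² / (n · (k − 1))) (a sketch with `scipy.stats.chi2_contingency`):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Confusion matrix from this card (rows = human, columns = model),
# order: AV, CO, EX, IV, OT, UV
cm = np.array([
    [ 93,   3,   7,   7,   9,   4],
    [  3, 119,   3,  10,  14,   4],
    [  4,  17, 278,   4,  22,   0],
    [  4,   6,   4, 201,  17,   4],
    [ 15,  18,  39,  14, 151,  12],
    [  9,   5,  13,   4,  30, 148],
])

chi2, p, dof, expected = chi2_contingency(cm, correction=False)
n, k = cm.sum(), min(cm.shape)
cramers_v = float(np.sqrt(chi2 / (n * (k - 1))))
print(round(cramers_v, 2))
```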
Known Limitations and Biases
Systematic over-assignment of EVT constructs. The Stuart-Maxwell test for marginal homogeneity is significant (χ²(5) = 20.70, p < .001). The model over-predicts EXPECTANCY (+19 items) and COST (+15) while under-predicting UTILITY_VALUE (−37) relative to the human coder. This is a direct consequence of the OvR + asymmetric loss design, which prioritises recall on core constructs at the expense of the OTHER category.
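These marginal shifts can be read directly off the confusion matrix reported above: column sums (model predictions) minus row sums (human codes).

```python
import numpy as np

# Confusion matrix from this card (rows = human, columns = model),
# order: AV, CO, EX, IV, OT, UV
cm = np.array([
    [ 93,   3,   7,   7,   9,   4],
    [  3, 119,   3,  10,  14,   4],
    [  4,  17, 278,   4,  22,   0],
    [  4,   6,   4, 201,  17,   4],
    [ 15,  18,  39,  14, 151,  12],
    [  9,   5,  13,   4,  30, 148],
])
labels = ["AV", "CO", "EX", "IV", "OT", "UV"]

# Model marginal minus human marginal per class
delta = (cm.sum(axis=0) - cm.sum(axis=1)).tolist()
print(dict(zip(labels, delta)))
# EX: +19, CO: +15, UV: -37, matching the Stuart-Maxwell analysis
```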
OTHER is the weakest category (κ = .52, F1 = .61). Because OTHER is defined as "no construct activated above threshold" rather than a learned class, the model tends to find some EVT signal in ambiguous items that a human would code as non-EVT. The most frequent error type is OTHER → EXPECTANCY (39 cases, 12.8% of all errors). When OTHER is excluded, the model reaches κ = .84 (almost perfect agreement) on the five core constructs.
UTILITY_VALUE under-prediction. The model is conservative on utility value (Δ = −37 items vs. human coder), with high precision (.86) but lower recall (.71). Items referencing indirect or long-term benefits may be harder for the model to detect.
Trained on synthetic data. The training set consists of synthetically generated items, not real published instruments. While the model generalises well (κ = .71 overall, κ = .84 on core constructs), there may be stylistic patterns in human-authored items from specific subfields or cultures that the model has not encountered.
English only. The model has not been evaluated on items in other languages. Cross-lingual transfer should not be assumed.
How to Use
Quick Start
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
from transformers.modeling_outputs import SequenceClassifierOutput

# ── Architecture (must match training) ──────────────────────────
class ConcatPooling(nn.Module):
    def forward(self, hidden_states, attention_mask):
        mask = attention_mask.unsqueeze(-1).expand(hidden_states.size()).float()
        sum_emb = torch.sum(hidden_states * mask, 1)
        sum_mask = torch.clamp(mask.sum(1), min=1e-9)
        mean_pool = sum_emb / sum_mask
        masked = hidden_states.masked_fill(mask == 0, -1e9)
        max_pool = torch.max(masked, dim=1)[0]
        return torch.cat([mean_pool, max_pool], dim=-1)

class NormLinear(nn.Module):
    def __init__(self, in_features, out_features, temperature=20.0):
        super().__init__()
        self.weight = nn.Parameter(torch.FloatTensor(out_features, in_features))
        nn.init.xavier_uniform_(self.weight)
        self.temperature = temperature

    def forward(self, x):
        return F.linear(F.normalize(x, p=2, dim=1),
                        F.normalize(self.weight, p=2, dim=1)) * self.temperature

class AsymmetricLoss(nn.Module):
    def __init__(self, gamma_neg=4.0, gamma_pos=1.0, clip=0.05, eps=1e-8):
        super().__init__()
        self.gamma_neg, self.gamma_pos, self.clip, self.eps = gamma_neg, gamma_pos, clip, eps

    def forward(self, x, y, pos_weights=None):
        probs = torch.sigmoid(x)
        xs_pos, xs_neg = probs, 1 - probs
        if self.clip > 0:
            xs_neg = (xs_neg + self.clip).clamp(max=1)
        los_pos = y * torch.log(xs_pos.clamp(min=self.eps))
        los_neg = (1 - y) * torch.log(xs_neg.clamp(min=self.eps))
        pt = xs_pos * y + xs_neg * (1 - y)
        gamma = self.gamma_pos * y + self.gamma_neg * (1 - y)
        loss = (los_pos + los_neg) * torch.pow(1 - pt, gamma)
        if pos_weights is not None:
            loss = loss * (y * pos_weights + (1 - y))
        return -loss.mean()

class EVTClassifier(nn.Module):
    def __init__(self, model_path, num_core_labels=5, pos_weights=None):
        super().__init__()
        self.base_model = AutoModel.from_pretrained(model_path, trust_remote_code=True)
        self.config = self.base_model.config
        h = self.config.hidden_size
        self.pooler = ConcatPooling()
        self.dropout = nn.Dropout(0.2)
        self.dense = nn.Linear(h * 2, h)
        self.gelu = nn.GELU()
        self.norm_head = NormLinear(h, num_core_labels, temperature=20.0)
        self.loss_fct = AsymmetricLoss()
        if pos_weights is not None:
            self.register_buffer("pos_weights", torch.tensor(pos_weights, dtype=torch.float32))
        else:
            self.register_buffer("pos_weights", torch.ones(num_core_labels))

    def forward(self, input_ids, attention_mask, labels=None, **kwargs):
        outputs = self.base_model(input_ids=input_ids, attention_mask=attention_mask)
        pooled = self.pooler(outputs.last_hidden_state, attention_mask)
        x = self.dropout(pooled)
        x = self.gelu(self.dense(x))
        x = self.dropout(x)
        logits = self.norm_head(x)
        loss = None
        if labels is not None:
            loss = self.loss_fct(logits, labels.float(), pos_weights=self.pos_weights)
        return SequenceClassifierOutput(loss=loss, logits=logits)

# ── Inference ───────────────────────────────────────────────────
from huggingface_hub import hf_hub_download

REPO_ID = "christiqn/mpnet-EVT-classifier"
THRESHOLD = 0.50

# 1. Download fine-tuned weights + tokenizer from the Hugging Face Hub
weights_path = hf_hub_download(repo_id=REPO_ID, filename="pytorch_model.bin")
tokenizer = AutoTokenizer.from_pretrained(REPO_ID)

# 2. Initialise the architecture and load the fine-tuned weights
model = EVTClassifier("sentence-transformers/all-mpnet-base-v2", num_core_labels=5)
state = torch.load(weights_path, map_location="cpu")
model.load_state_dict(state, strict=False)
model.eval()

# 3. Classify
# The 5 sigmoids map to: [AV, CO, EX, IV, UV]. OTHER = none above threshold.
SIGMOID_TO_LABEL = ["ATTAINMENT_VALUE", "COST", "EXPECTANCY",
                    "INTRINSIC_VALUE", "UTILITY_VALUE"]

def classify(texts, threshold=THRESHOLD):
    inputs = tokenizer(texts, return_tensors="pt", truncation=True,
                       max_length=192, padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.sigmoid(logits).numpy()
    results = []
    for p in probs:
        if p.max() >= threshold:
            results.append(SIGMOID_TO_LABEL[p.argmax()])
        else:
            results.append("OTHER")
    return results

# Example
items = [
    "I expect to do well in this course.",
    "I enjoy learning about biology.",
    "I usually sit near the window.",
]
print(classify(items))
# → ['EXPECTANCY', 'INTRINSIC_VALUE', 'OTHER']
```
Adjusting the Decision Threshold
The default threshold of 0.50 balances precision and recall. You can adjust it:
- Lower threshold (e.g., 0.35): more inclusive; fewer items classified as OTHER, higher recall on EVT constructs, but more false positives.
- Higher threshold (e.g., 0.65): more conservative; more items fall to OTHER, higher precision on EVT constructs, but more false negatives.
Choose based on your application: use a lower threshold for exploratory screening, a higher one when precision matters.
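A self-contained illustration of the trade-off (the probability vector is hypothetical, and `decide` restates the model's thresholding rule):

```python
import numpy as np

CORE_LABELS = ["ATTAINMENT_VALUE", "COST", "EXPECTANCY",
               "INTRINSIC_VALUE", "UTILITY_VALUE"]

def decide(probs, threshold):
    """Apply the OvR thresholding rule to a single probability vector."""
    probs = np.asarray(probs)
    return CORE_LABELS[int(probs.argmax())] if probs.max() >= threshold else "OTHER"

# A borderline item with a weak UTILITY_VALUE signal
p = [0.08, 0.12, 0.22, 0.18, 0.42]

print(decide(p, threshold=0.35))  # UTILITY_VALUE (inclusive screening)
print(decide(p, threshold=0.50))  # OTHER (default)
print(decide(p, threshold=0.65))  # OTHER (conservative)
```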
References
- Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.
- Eccles, J. S., et al. (1983). Expectancies, values, and academic behaviors. In J. T. Spence (Ed.), Achievement and achievement motives (pp. 75–146). W. H. Freeman.
- Eccles, J. S., & Wigfield, A. (2002). Motivational beliefs, values, and goals. Annual Review of Psychology, 53(1), 109–132.
- Howard, J., & Ruder, S. (2018). Universal language model fine-tuning for text classification. Proceedings of ACL, 328–339.
- Izmailov, P., et al. (2018). Averaging weights leads to wider optima and better generalization. Proceedings of UAI, 876–885.
- Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174.
- Miyato, T., et al. (2017). Adversarial training methods for semi-supervised text classification. Proceedings of ICLR.
- Ridnik, T., et al. (2021). Asymmetric loss for multi-label classification. Proceedings of ICCV, 82–91.
- Sun, C., et al. (2019). How to fine-tune BERT for text classification. Proceedings of CCL, 194–206.
- Wang, H., et al. (2018). CosFace: Large margin cosine loss for deep face recognition. Proceedings of CVPR, 5265–5274.