EVT Item Classifier: Expectancy-Value Theory Construct Tagger

A fine-tuned text classifier that assigns psychological test items (e.g., survey questions, scale items) to one of six categories derived from Expectancy-Value Theory (Eccles et al., 1983; Eccles & Wigfield, 2002):

| Label | EVT construct | Example item |
|---|---|---|
| ATTAINMENT_VALUE | Personal importance of doing well | "Doing well in math is important to who I am." |
| COST | Perceived negative consequences of engagement | "I have to give up too much to succeed in this class." |
| EXPECTANCY | Beliefs about future success | "I am confident I can master the skills taught in this course." |
| INTRINSIC_VALUE | Enjoyment and interest | "I find the content of this course very interesting." |
| UTILITY_VALUE | Usefulness for future goals | "What I learn in this class will be useful for my career." |
| OTHER | Not classifiable as an EVT construct | "I usually sit in the front row." |

Intended Use

This model is intended for academic research in educational psychology, motivation science, and psychometrics. Typical use cases include:

  • Automated content analysis of existing item pools and questionnaire banks for EVT construct coverage
  • Scale development assistance: screening candidate items during the item-writing phase
  • Systematic reviews: coding large corpora of test items from published instruments
  • Construct validation: checking whether items align with their intended EVT construct

Out-of-Scope Uses

  • Clinical or diagnostic decision-making: This model classifies test items, not respondents. It should not be used to assess individuals.
  • Replacement for human coding: The model achieves substantial but not perfect agreement with a human coder (κ = .71). It is intended as a tool to assist researchers, not to replace expert judgment.
  • Non-English items: The model was trained and evaluated on English-language items only.
  • Non-EVT constructs: The model will force any input into one of the six categories. Items from unrelated theoretical frameworks (e.g., Big Five personality, cognitive load) will be classified as OTHER at best, or spuriously assigned to an EVT category.

Model Description

Architecture

The model uses a custom classification head on top of sentence-transformers/all-mpnet-base-v2 (110M parameters):

MPNet encoder
    ↓
ConcatPooling (mean + max over token embeddings → 2 × hidden_size)
    ↓
Dropout(0.2) → Dense(2h → h) → GELU → Dropout(0.2)
    ↓
NormLinear head (cosine similarity classifier, τ = 20)
    ↓
5 independent sigmoid outputs (one per EVT construct)
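
Read off the NormLinear code in the Quick Start section, the head appears to compute a temperature-scaled cosine similarity per construct, followed by a sigmoid. The notation below is an assumption introduced for illustration: x is the feature vector after the dense layer, w_c is the weight row for construct c.

```latex
z_c = \tau \, \frac{\mathbf{w}_c^{\top} \mathbf{x}}
                   {\lVert \mathbf{w}_c \rVert_2 \, \lVert \mathbf{x} \rVert_2},
\qquad
p_c = \sigma(z_c) = \frac{1}{1 + e^{-z_c}},
\qquad
c \in \{1, \dots, 5\}, \quad \tau = 20 .
```

Because the cosine lies in [-1, 1], each logit is bounded to [-20, 20], and a probability of 0.50 corresponds to a cosine similarity of exactly zero.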

The OTHER class is not explicitly learned. Instead, an item is classified as OTHER when no construct's sigmoid probability exceeds the decision threshold (default: 0.50). This One-vs-Rest (OvR) formulation avoids forcing the model to learn a coherent representation for the heterogeneous OTHER category.
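
The decision rule described above can be sketched in a few lines. This is an illustration rather than the author's code: the function name `decide` and the probability values are hypothetical, the label order is assumed to follow the Quick Start mapping, and the comparison uses `>=` as the Quick Start code does.

```python
# Label order assumed to match the Quick Start mapping [AV, CO, EX, IV, UV].
CORE_LABELS = ["ATTAINMENT_VALUE", "COST", "EXPECTANCY",
               "INTRINSIC_VALUE", "UTILITY_VALUE"]

def decide(probs, threshold=0.50):
    """Return the top core label if its probability reaches the threshold, else OTHER."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return CORE_LABELS[best] if probs[best] >= threshold else "OTHER"

# Hypothetical sigmoid outputs for two items.
print(decide([0.03, 0.10, 0.91, 0.05, 0.02]))  # EXPECTANCY
print(decide([0.12, 0.31, 0.22, 0.08, 0.40]))  # OTHER
```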

Key Design Choices

| Component | Rationale |
|---|---|
| ConcatPooling | Concatenating mean and max pooling captures both distributional and salient features from short text (Howard & Ruder, 2018) |
| NormLinear (cosine) head | L2-normalised features and weights make classification direction-based, improving robustness to domain shift (Wang et al., 2018) |
| Asymmetric Loss (γ_neg = 4, γ_pos = 1) | Down-weights easy negatives in the multi-label sigmoid setup, preventing the dominant negative class from overwhelming the loss (Ridnik et al., 2021) |
| FGM adversarial training | Embedding-space perturbations regularise the model, improving generalisation from synthetic to real items (Miyato et al., 2017) |
| LLRD (decay = 0.9) | Lower layers receive smaller learning rates, preserving pre-trained linguistic knowledge (Sun et al., 2019) |
| Gradual unfreezing | Head-only training for epoch 1, then full end-to-end fine-tuning, to prevent catastrophic forgetting (Howard & Ruder, 2018) |
| SWA | Averaging the top-3 checkpoints produces a flatter minimum and marginally improves κ (Izmailov et al., 2018) |
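
To make the layer-wise learning rate decay (LLRD) row concrete, the sketch below computes one learning rate per encoder layer from the stated base rate (3e-5) and decay (0.9). The helper name `llrd_learning_rates` is hypothetical, and two details are assumptions not stated in this card: that the top encoder layer receives the full base rate, and that the encoder has 12 layers (as in MPNet base).

```python
def llrd_learning_rates(base_lr=3e-5, decay=0.9, num_layers=12):
    """Map each encoder layer index (0 = lowest) to its learning rate."""
    # Assumption: the top layer keeps base_lr; each layer below it is
    # scaled by one additional factor of `decay`.
    return {layer: base_lr * decay ** (num_layers - 1 - layer)
            for layer in range(num_layers)}

rates = llrd_learning_rates()
for layer in (0, 6, 11):
    print(f"layer {layer:2d}: lr = {rates[layer]:.2e}")
```

In a real training script these values would typically be passed to the optimizer as per-layer parameter groups.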

Training

Training Data

The model was fine-tuned on the expectancy_value_pool_v2 dataset: approximately 15,000 synthetic psychological test items generated with construct-level attribution, filtered to remove participial-phrase fragments. The data was split 85/7.5/7.5 into train/validation/test sets with stratified sampling.
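
A stratified 85/7.5/7.5 split can be sketched with the standard library alone. This is not the author's splitting code: the function name `stratified_split`, the seed, and the toy label pool are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.85, 0.075, 0.075), seed=0):
    """Return (train, val, test) index lists with per-label proportions preserved."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    rng = random.Random(seed)
    train, val, test = [], [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# Hypothetical pool: 200 items for each of three labels.
labels = ["COST"] * 200 + ["EXPECTANCY"] * 200 + ["OTHER"] * 200
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```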

Training Procedure

  • Epochs: 12 (with early stopping, patience = 4, monitored on validation κ)
  • Batch size: 24 per device × 2 gradient accumulation steps = 48 effective
  • Optimizer: AdamW (lr = 3e-5, weight decay = 0.01)
  • Scheduler: Linear with 10% warmup
  • Precision: bf16 mixed-precision
  • Hardware: Single NVIDIA GPU
  • Post-training: Stochastic Weight Averaging of top-3 checkpoints
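
The batch-size arithmetic and the warmup schedule listed above can be illustrated as follows. The function `linear_warmup_lr` and the step count are hypothetical, and the linear decay to zero after warmup is an assumption (the card only states "Linear with 10% warmup").

```python
def linear_warmup_lr(step, total_steps, base_lr=3e-5, warmup_frac=0.10):
    """Learning rate at an optimizer step: linear warmup, then linear decay to 0."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))

effective_batch = 24 * 2  # per-device batch size x gradient accumulation steps
total_steps = 1000        # hypothetical number of optimizer steps

print("effective batch size:", effective_batch)
for step in (0, 50, 100, 550, 1000):
    print(step, f"{linear_warmup_lr(step, total_steps):.2e}")
```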

Training Results (Held-Out Synthetic Test Set)

| Metric | Value |
|---|---|
| Accuracy | 0.990 |
| Macro F1 | 0.990 |
| Cohen's κ | 0.990 |

Note: These near-perfect numbers reflect performance on the synthetic data distribution. Real-world performance on human-authored items is reported below.

Evaluation

Test Set

The model was evaluated on N = 1,295 human-coded test items drawn from published and unpublished psychological instruments. Items were independently coded by a human expert and by the model. Items marked as formatting artefacts or third-person phrasing were excluded prior to evaluation. All confidence intervals are bias-corrected and accelerated (BCa) bootstrap CIs with B = 10,000.
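
For readers unfamiliar with bootstrap intervals, the sketch below shows the simpler percentile bootstrap for accuracy. It is deliberately not the BCa procedure used for the reported intervals (BCa adds bias and acceleration corrections), and the function name, the number of resamples, and the toy codings are all hypothetical.

```python
import random

def percentile_bootstrap_ci(human, model, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy; items are resampled with replacement."""
    rng = random.Random(seed)
    n = len(human)
    stats = []
    for _ in range(n_boot):
        sample = [rng.randrange(n) for _ in range(n)]
        stats.append(sum(human[i] == model[i] for i in sample) / n)
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Hypothetical codings: 100 items, 80 agreements.
human = ["EXPECTANCY"] * 100
model = ["EXPECTANCY"] * 80 + ["OTHER"] * 20
low, high = percentile_bootstrap_ci(human, model)
print(f"accuracy = 0.80, 95% percentile CI = [{low:.2f}, {high:.2f}]")
```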

Agreement with Human Coder (Full 6-Class)

| Metric | Value | 95% BCa CI |
|---|---|---|
| Cohen's κ (unweighted) | 0.712 | [0.684, 0.740] |
| Cohen's κ (linear weighted) | 0.702 | [0.666, 0.734] |
| Krippendorff's α | 0.712 | [0.683, 0.740] |
| PABAK | 0.717 | n/a |
| Overall accuracy | 0.765 | [0.741, 0.786] |
| Macro F1 | 0.758 | [0.735, 0.782] |
| Weighted F1 | 0.764 | [0.740, 0.786] |

Cohen's κ = .71 indicates substantial agreement according to Landis & Koch (1977).

Per-Class Performance (6-Class)

| Class | Precision [95% CI] | Recall [95% CI] | F1 [95% CI] | κ (OvR) | n |
|---|---|---|---|---|---|
| ATTAINMENT_VALUE | .727 [.648, .804] | .756 [.676, .829] | .741 [.676, .798] | .713 | 123 |
| COST | .708 [.637, .775] | .778 [.712, .843] | .741 [.686, .793] | .705 | 153 |
| EXPECTANCY | .808 [.766, .849] | .855 [.817, .893] | .831 [.799, .861] | .772 | 325 |
| INTRINSIC_VALUE | .838 [.789, .884] | .852 [.805, .895] | .845 [.808, .879] | .810 | 236 |
| OTHER | .621 [.561, .681] | .606 [.546, .667] | .614 [.561, .662] | .523 | 249 |
| UTILITY_VALUE | .860 [.806, .911] | .708 [.645, .769] | .777 [.729, .821] | .739 | 209 |

Core Construct Discrimination (5-Class, Excluding OTHER)

The OTHER category is not a psychological construct; it is a heterogeneous residual class that the model was never explicitly trained to recognise. To separately assess the model's ability to discriminate among the five core EVT constructs, we report a restricted analysis on items where both the human coder and the model assigned a core construct (n = 954, 73.7% of all items).

| Metric | Value | 95% BCa CI |
|---|---|---|
| Cohen's κ (5-class) | 0.845 | [0.816, 0.869] |
| Krippendorff's α | 0.845 | [0.816, 0.869] |
| Overall accuracy | 0.880 | [0.856, 0.898] |
| Macro F1 | 0.867 | [0.843, 0.889] |

Cohen's κ = .84 indicates almost perfect agreement on construct discrimination.

| Class | Precision [95% CI] | Recall [95% CI] | F1 [95% CI] | n |
|---|---|---|---|---|
| ATTAINMENT_VALUE | .823 [.750, .891] | .816 [.742, .884] | .819 [.762, .870] | 114 |
| COST | .793 [.728, .856] | .856 [.794, .912] | .824 [.773, .869] | 139 |
| EXPECTANCY | .911 [.878, .942] | .917 [.885, .946] | .914 [.890, .936] | 303 |
| INTRINSIC_VALUE | .889 [.848, .928] | .918 [.880, .953] | .903 [.874, .931] | 219 |
| UTILITY_VALUE | .925 [.882, .964] | .827 [.769, .881] | .873 [.833, .910] | 179 |

Additionally, when evaluating all items the human coded as a core construct (n = 1,046), allowing the model to predict OTHER, the model achieved κ = .75 [.721, .780] and accuracy = .80 [.775, .824]. Of these 1,046 items, 92 (8.8%) were incorrectly pushed to OTHER by the model.
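
The two restricted subsets used in this section differ only in which rater's OTHER codes are filtered out. The sketch below shows both filters on item-level codings; the helper names and the five toy item pairs are hypothetical.

```python
CORE = {"ATTAINMENT_VALUE", "COST", "EXPECTANCY", "INTRINSIC_VALUE", "UTILITY_VALUE"}

def both_core(pairs):
    """Keep (human, model) pairs where both raters assigned a core construct."""
    return [(h, m) for h, m in pairs if h in CORE and m in CORE]

def human_core(pairs):
    """Keep pairs the human coded as a core construct; the model may say OTHER."""
    return [(h, m) for h, m in pairs if h in CORE]

def accuracy(pairs):
    """Share of pairs on which the human and the model agree."""
    return sum(h == m for h, m in pairs) / len(pairs)

# Hypothetical (human, model) codings for five items.
pairs = [("COST", "COST"), ("EXPECTANCY", "OTHER"), ("OTHER", "EXPECTANCY"),
         ("UTILITY_VALUE", "UTILITY_VALUE"), ("INTRINSIC_VALUE", "COST")]
print(len(both_core(pairs)), round(accuracy(both_core(pairs)), 3))    # 3 0.667
print(len(human_core(pairs)), round(accuracy(human_core(pairs)), 3))  # 4 0.5
```

Conditioning on the model's own output, as the first filter does, removes the items the model sent to OTHER, so it will generally look more favourable than the second.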

Summary Across Evaluation Scenarios

| Scenario | κ | Macro F1 | N |
|---|---|---|---|
| Full 6-class (with OTHER) | .712 | .758 | 1,295 |
| 5-class: both raters assigned core construct | .845 | .867 | 954 |
| Human = core, model unrestricted | .752 | n/a | 1,046 |

The jump from κ = .71 to κ = .84 suggests that the model's primary weakness lies at the EVT/non-EVT detection boundary (the OTHER category), not in discriminating among the five core constructs.

Confusion Matrix (6-Class)

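| | Pred: AV | Pred: CO | Pred: EX | Pred: IV | Pred: OT | Pred: UV |
|---|---|---|---|---|---|---|
| True: AV | 93 | 3 | 7 | 7 | 9 | 4 |
| True: CO | 3 | 119 | 3 | 10 | 14 | 4 |
| True: EX | 4 | 17 | 278 | 4 | 22 | 0 |
| True: IV | 4 | 6 | 4 | 201 | 17 | 4 |
| True: OT | 15 | 18 | 39 | 14 | 151 | 12 |
| True: UV | 9 | 5 | 13 | 4 | 30 | 148 |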

Cramér's V = .72 (6-class), .84 (5-class core only).
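
Several of the reported figures can be recomputed directly from the confusion matrix. The sketch below derives accuracy, Cohen's κ, PABAK, and the per-class difference between predicted and human label counts; the function names are hypothetical, and the counts are copied from the matrix above.

```python
LABELS = ["AV", "CO", "EX", "IV", "OT", "UV"]
# Rows = human label, columns = model prediction, in LABELS order.
MATRIX = [
    [93, 3, 7, 7, 9, 4],
    [3, 119, 3, 10, 14, 4],
    [4, 17, 278, 4, 22, 0],
    [4, 6, 4, 201, 17, 4],
    [15, 18, 39, 14, 151, 12],
    [9, 5, 13, 4, 30, 148],
]

def agreement_stats(matrix):
    """Return (accuracy, Cohen's kappa, PABAK) for a square confusion matrix."""
    k = len(matrix)
    n = sum(map(sum, matrix))
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(row[j] for row in matrix) for j in range(k)]
    p_obs = sum(matrix[i][i] for i in range(k)) / n
    p_exp = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    kappa = (p_obs - p_exp) / (1 - p_exp)
    pabak = (k * p_obs - 1) / (k - 1)
    return p_obs, kappa, pabak

def marginal_shift(matrix, labels):
    """Predicted count minus human count for each label."""
    return {label: sum(row[j] for row in matrix) - sum(matrix[j])
            for j, label in enumerate(labels)}

acc, kappa, pabak = agreement_stats(MATRIX)
print(f"accuracy={acc:.4f}  kappa={kappa:.3f}  PABAK={pabak:.3f}")
# accuracy=0.7645  kappa=0.712  PABAK=0.717
print(marginal_shift(MATRIX, LABELS))
# {'AV': 5, 'CO': 15, 'EX': 19, 'IV': 4, 'OT': -6, 'UV': -37}
```

The κ and PABAK values match the agreement table to three decimals; accuracy comes out at 0.7645, which that table reports as 0.765.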

Known Limitations and Biases

Systematic over-assignment of EVT constructs. The Stuart-Maxwell test for marginal homogeneity is significant (χ²(5) = 20.70, p < .001). The model over-predicts EXPECTANCY (+19 items) and COST (+15) while under-predicting UTILITY_VALUE (−37) relative to the human coder. This is likely a consequence of the OvR + asymmetric loss design, which prioritises recall on core constructs at the expense of the OTHER category.

OTHER is the weakest category (κ = .52, F1 = .61). Because OTHER is defined as "no construct activated above threshold" rather than a learned class, the model tends to find some EVT signal in ambiguous items that a human would code as non-EVT. The most frequent error type is OTHER → EXPECTANCY (39 cases, 12.8% of all errors). When OTHER is excluded, the model reaches κ = .84 (almost perfect agreement) on the five core constructs.

UTILITY_VALUE under-prediction. The model is conservative on utility value (Δ = −37 items vs. human coder), with high precision (.86) but lower recall (.71). Items referencing indirect or long-term benefits may be harder for the model to detect.

Trained on synthetic data. The training set consists of synthetically generated items, not real published instruments. While the model generalises well (κ = .71 overall, κ = .84 on core constructs), there may be stylistic patterns in human-authored items from specific subfields or cultures that the model has not encountered.

English only. The model has not been evaluated on items in other languages. Cross-lingual transfer should not be assumed.

How to Use

Quick Start

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
from transformers.modeling_outputs import SequenceClassifierOutput

# ── Architecture (must match training) ──────────────────────────
class ConcatPooling(nn.Module):
    def forward(self, hidden_states, attention_mask):
        mask = attention_mask.unsqueeze(-1).expand(hidden_states.size()).float()
        sum_emb = torch.sum(hidden_states * mask, 1)
        sum_mask = torch.clamp(mask.sum(1), min=1e-9)
        mean_pool = sum_emb / sum_mask
        masked = hidden_states.masked_fill(mask == 0, -1e9)
        max_pool = torch.max(masked, dim=1)[0]
        return torch.cat([mean_pool, max_pool], dim=-1)

class NormLinear(nn.Module):
    def __init__(self, in_features, out_features, temperature=20.0):
        super().__init__()
        # Allocate uninitialised storage; xavier_uniform_ below fills it in.
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.xavier_uniform_(self.weight)
        self.temperature = temperature
    def forward(self, x):
        return F.linear(F.normalize(x, p=2, dim=1),
                       F.normalize(self.weight, p=2, dim=1)) * self.temperature

class AsymmetricLoss(nn.Module):
    def __init__(self, gamma_neg=4.0, gamma_pos=1.0, clip=0.05, eps=1e-8):
        super().__init__()
        self.gamma_neg, self.gamma_pos, self.clip, self.eps = gamma_neg, gamma_pos, clip, eps
    def forward(self, x, y, pos_weights=None):
        probs = torch.sigmoid(x)
        xs_pos, xs_neg = probs, 1 - probs
        if self.clip > 0: xs_neg = (xs_neg + self.clip).clamp(max=1)
        los_pos = y * torch.log(xs_pos.clamp(min=self.eps))
        los_neg = (1 - y) * torch.log(xs_neg.clamp(min=self.eps))
        pt = xs_pos * y + xs_neg * (1 - y)
        gamma = self.gamma_pos * y + self.gamma_neg * (1 - y)
        loss = (los_pos + los_neg) * torch.pow(1 - pt, gamma)
        if pos_weights is not None: loss = loss * (y * pos_weights + (1 - y))
        return -loss.mean()

class EVTClassifier(nn.Module):
    def __init__(self, model_path, num_core_labels=5, pos_weights=None):
        super().__init__()
        self.base_model = AutoModel.from_pretrained(model_path, trust_remote_code=True)
        self.config = self.base_model.config
        h = self.config.hidden_size
        self.pooler = ConcatPooling()
        self.dropout = nn.Dropout(0.2)
        self.dense = nn.Linear(h * 2, h)
        self.gelu = nn.GELU()
        self.norm_head = NormLinear(h, num_core_labels, temperature=20.0)
        self.loss_fct = AsymmetricLoss()
        if pos_weights is not None:
            self.register_buffer("pos_weights", torch.tensor(pos_weights, dtype=torch.float32))
        else:
            self.register_buffer("pos_weights", torch.ones(num_core_labels))
    def forward(self, input_ids, attention_mask, labels=None, **kwargs):
        outputs = self.base_model(input_ids=input_ids, attention_mask=attention_mask)
        pooled = self.pooler(outputs.last_hidden_state, attention_mask)
        x = self.dropout(pooled)
        x = self.gelu(self.dense(x))
        x = self.dropout(x)
        logits = self.norm_head(x)
        loss = None
        if labels is not None:
            loss = self.loss_fct(logits, labels.float(), pos_weights=self.pos_weights)
        return SequenceClassifierOutput(loss=loss, logits=logits)

# ── Inference ───────────────────────────────────────────────────
from huggingface_hub import hf_hub_download

REPO_ID = "christiqn/mpnet-EVT-classifier"
THRESHOLD = 0.50

# 1. Download fine-tuned weights + tokenizer from HuggingFace Hub
weights_path = hf_hub_download(repo_id=REPO_ID, filename="pytorch_model.bin")
tokenizer = AutoTokenizer.from_pretrained(REPO_ID)

# 2. Initialise architecture and load fine-tuned weights
model = EVTClassifier("sentence-transformers/all-mpnet-base-v2", num_core_labels=5)
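state = torch.load(weights_path, map_location="cpu")
# strict=False tolerates key mismatches, so report them rather than failing silently.
load_info = model.load_state_dict(state, strict=False)
print("missing keys:", load_info.missing_keys)
print("unexpected keys:", load_info.unexpected_keys)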
model.eval()

# Classify
# The 5 sigmoids map to: [AV, CO, EX, IV, UV]. OTHER = none above threshold.
SIGMOID_TO_LABEL = ["ATTAINMENT_VALUE", "COST", "EXPECTANCY",
                    "INTRINSIC_VALUE", "UTILITY_VALUE"]

def classify(texts, threshold=THRESHOLD):
    inputs = tokenizer(texts, return_tensors="pt", truncation=True,
                       max_length=192, padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits
        probs = torch.sigmoid(logits).numpy()
    results = []
    for p in probs:
        if p.max() >= threshold:
            results.append(SIGMOID_TO_LABEL[p.argmax()])
        else:
            results.append("OTHER")
    return results

# Example
items = [
    "I expect to do well in this course.",
    "I enjoy learning about biology.",
    "I usually sit near the window.",
]
print(classify(items))
# → ['EXPECTANCY', 'INTRINSIC_VALUE', 'OTHER']

Adjusting the Decision Threshold

The default threshold of 0.50 balances precision and recall. You can adjust it:

  • Lower threshold (e.g., 0.35): More inclusive. Fewer items are classified as OTHER, giving higher recall on EVT constructs but more false positives.
  • Higher threshold (e.g., 0.65): More conservative. More items fall to OTHER, giving higher precision on EVT constructs but more false negatives.

Choose based on your application: use a lower threshold for exploratory screening, a higher one when precision matters.
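
One way to choose a threshold is to check how the share of items falling to OTHER changes across candidate values. The sketch below does this on made-up probabilities; `other_rate` and the four probability rows are hypothetical, and in practice the rows would be the sigmoid outputs computed inside `classify`.

```python
def other_rate(prob_rows, threshold):
    """Fraction of items whose highest sigmoid probability falls below the threshold."""
    below = sum(1 for row in prob_rows if max(row) < threshold)
    return below / len(prob_rows)

# Hypothetical sigmoid outputs (order: AV, CO, EX, IV, UV) for four items.
probs = [
    [0.02, 0.05, 0.93, 0.04, 0.01],
    [0.10, 0.58, 0.20, 0.07, 0.12],
    [0.08, 0.12, 0.41, 0.06, 0.30],
    [0.03, 0.09, 0.11, 0.05, 0.07],
]
for t in (0.35, 0.50, 0.65):
    print(f"threshold={t:.2f}  OTHER rate={other_rate(probs, t):.2f}")
```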

References

  • Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.
  • Eccles, J. S., et al. (1983). Expectancies, values, and academic behaviors. In J. T. Spence (Ed.), Achievement and achievement motives (pp. 75–146). W. H. Freeman.
  • Eccles, J. S., & Wigfield, A. (2002). Motivational beliefs, values, and goals. Annual Review of Psychology, 53(1), 109–132.
  • Howard, J., & Ruder, S. (2018). Universal language model fine-tuning for text classification. Proceedings of ACL, 328–339.
  • Izmailov, P., et al. (2018). Averaging weights leads to wider optima and better generalization. Proceedings of UAI, 876–885.
  • Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174.
  • Miyato, T., et al. (2017). Adversarial training methods for semi-supervised text classification. Proceedings of ICLR.
  • Ridnik, T., et al. (2021). Asymmetric loss for multi-label classification. Proceedings of ICCV, 82–91.
  • Sun, C., et al. (2019). How to fine-tune BERT for text classification. Proceedings of CCL, 194–206.
  • Wang, H., et al. (2018). CosFace: Large margin cosine loss for deep face recognition. Proceedings of CVPR, 5265–5274.