---
language:
- en
license: cc-by-4.0
library_name: transformers
tags:
- text-classification
- pytorch
- safetensors
- evt_classifier
- feature-extraction
- psychology
- expectancy-value-theory
- psychometrics
- test-item-classification
- educational-psychology
- motivation
- survey-instruments
- content-analysis
- custom_code
pipeline_tag: text-classification
datasets:
- christiqn/expectancy_value_pool
- christiqn/EVT-items
metrics:
- f1
- accuracy
model-index:
- name: mpnet-EVT-classifier
  results:
  - task:
      type: text-classification
    dataset:
      name: EVT-items (Human-coded psychological test items)
      type: christiqn/EVT-items
    metrics:
    - type: cohen_kappa
      value: 0.7674
      name: Cohen's Kappa (6-class, N=1284)
    - type: f1
      value: 0.8121
      name: Macro F1 (6-class)
    - type: accuracy
      value: 0.8100
      name: Accuracy (6-class)
    - type: krippendorff_alpha
      value: 0.7674
      name: Krippendorff's Alpha
---
# EVT Item Classifier — Expectancy-Value Theory Construct Tagger
A fine-tuned text classifier that assigns psychological test items (e.g., survey questions, scale items) to one of six categories derived from **Expectancy-Value Theory** (Eccles et al., 1983; Eccles & Wigfield, 2002):
| Label | EVT Construct | Example item |
|---|---|---|
| `ATTAINMENT_VALUE` | Personal importance of doing well | *"Doing well in math is important to who I am."* |
| `COST` | Perceived negative consequences of engagement | *"I have to give up too much to succeed in this class."* |
| `EXPECTANCY` | Beliefs about future success | *"I am confident I can master the skills taught in this course."* |
| `INTRINSIC_VALUE` | Enjoyment and interest | *"I find the content of this course very interesting."* |
| `UTILITY_VALUE` | Usefulness for future goals | *"What I learn in this class will be useful for my career."* |
| `OTHER` | Not classifiable as an EVT construct | *"I usually sit in the front row."* |
## Intended Use
This model is intended for **academic research** in educational psychology, motivation science, and psychometrics. Typical use cases include:
- **Automated content analysis** of existing item pools and questionnaire banks for EVT construct coverage
- **Scale development assistance** — screening candidate items during the item-writing phase
- **Systematic reviews** — coding large corpora of test items from published instruments
- **Construct validation** — checking whether items align with their intended EVT construct
### Out-of-Scope Uses
- **Clinical or diagnostic decision-making** — This model classifies test *items*, not respondents. It should not be used to assess individuals.
- **Replacement for human coding** — The model achieves substantial but not perfect agreement with human coders (κ = .77). It is intended as a tool to assist researchers, not to replace expert judgment.
- **Non-English items** — The model was trained and evaluated on English-language items only.
- **Non-EVT constructs** — The model will force any input into one of the six categories. Items from unrelated theoretical frameworks (e.g., Big Five personality, cognitive load) will be classified as `OTHER` at best, or spuriously assigned to an EVT category.
## How to Use
### Quick Start
```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load the tokenizer and the custom model (trust_remote_code runs the repository's model code)
tokenizer = AutoTokenizer.from_pretrained("christiqn/mpnet-EVT-classifier")
model = AutoModel.from_pretrained("christiqn/mpnet-EVT-classifier", trust_remote_code=True)
model.eval()

# Inference on a single item
text = "I expect to do well in this course."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# Convert logits to sigmoid probabilities and report the most probable label
probs = torch.sigmoid(outputs.logits)
label = model.config.id2label[probs[0].argmax().item()]
print(f"Predicted: {label}")
```
### Adjusting the Decision Threshold
The default threshold of 0.50 balances precision and recall. You can adjust it:
- **Lower threshold (e.g., 0.35):** More inclusive — fewer items classified as OTHER, higher recall on EVT constructs, but more false positives.
- **Higher threshold (e.g., 0.65):** More conservative — more items fall to OTHER, higher precision on EVT constructs, but more false negatives.
Choose based on your application: use a lower threshold for exploratory screening, a higher one when precision matters.
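
The thresholding rule described above can be sketched in plain Python. This is a minimal illustration, not the model's own code: the function name `decide`, the label order in `CONSTRUCTS`, and the example probabilities are assumptions, so check `model.config.id2label` for the real order.

```python
# Label order is an assumption for illustration; verify against model.config.id2label.
CONSTRUCTS = ["ATTAINMENT_VALUE", "COST", "EXPECTANCY", "INTRINSIC_VALUE", "UTILITY_VALUE"]

def decide(probs, threshold=0.50, labels=CONSTRUCTS):
    """Return the top construct if its probability exceeds the threshold, else OTHER."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best] if probs[best] > threshold else "OTHER"

# Hypothetical sigmoid outputs for one borderline item.
probs = [0.05, 0.10, 0.42, 0.08, 0.03]
print(decide(probs))                  # default threshold: falls to OTHER
print(decide(probs, threshold=0.35))  # lower threshold: EXPECTANCY
```

With real model output, `probs` would be `torch.sigmoid(outputs.logits)[0].tolist()`, assuming the logits hold one value per construct.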
## Model Description
### Architecture
The model uses a custom classification head on top of `sentence-transformers/all-mpnet-base-v2` (110M parameters):
```
MPNet encoder
↓
ConcatPooling (mean + max over token embeddings → 2 × hidden_size)
↓
Dropout(0.2) → Dense(2h → h) → GELU → Dropout(0.2)
↓
NormLinear head (cosine similarity classifier, τ = 20)
↓
5 independent sigmoid outputs (one per EVT construct)
```
The `OTHER` class is not explicitly learned. Instead, an item is classified as `OTHER` when no construct's sigmoid probability exceeds the decision threshold (default: 0.50). This **One-vs-Rest (OvR)** formulation avoids forcing the model to learn a coherent representation for the heterogeneous `OTHER` category.
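
To make the pooling and cosine-head steps concrete, here is a dependency-free sketch of the forward pass for a single item. It is an illustration under assumptions, not the repository's implementation: the dropout and Dense → GELU layers are omitted, τ is treated as a multiplicative scale on the cosine similarity, and the token embeddings and class weights are made-up toy values.

```python
import math

def concat_pool(token_embeddings, attention_mask):
    """Mean- and max-pool over non-padding tokens, then concatenate (2 x hidden_size)."""
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m]
    dims = range(len(kept[0]))
    mean = [sum(v[d] for v in kept) / len(kept) for d in dims]
    peak = [max(v[d] for v in kept) for d in dims]
    return mean + peak

def l2_normalise(vec):
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine_head(features, class_weights, tau=20.0):
    """Scaled cosine similarity between the features and each class weight vector."""
    f = l2_normalise(features)
    return [tau * sum(a * b for a, b in zip(f, l2_normalise(w))) for w in class_weights]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy input: three tokens with hidden_size = 2; the last token is padding.
tokens = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
mask = [1, 1, 0]
pooled = concat_pool(tokens, mask)

# Two hypothetical class weight vectors: one aligned with the features, one opposed.
weights = [[0.5, 0.5, 1.0, 1.0], [-0.5, -0.5, -1.0, -1.0]]
logits = cosine_head(pooled, weights)
probs = [sigmoid(z) for z in logits]
print(pooled, logits, probs)
```

Because both features and weights are normalised, each logit is bounded by ±τ, so only the direction of the feature vector affects the prediction.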
### Key Design Choices
| Component | Rationale |
|---|---|
| **ConcatPooling** | Concatenating mean and max pooling captures both distributional and salient features from short text (Howard & Ruder, 2018) |
| **NormLinear (cosine) head** | L2-normalised features and weights make classification direction-based, improving robustness to domain shift (Wang et al., 2018) |
| **Asymmetric Loss** (γ_neg=4, γ_pos=1) | Down-weights easy negatives in the multi-label sigmoid setup, preventing the dominant negative class from overwhelming the loss (Ridnik et al., 2021) |
| **FGM adversarial training** | Embedding-space perturbations regularise the model, improving generalisation from synthetic to real items (Miyato et al., 2017) |
| **LLRD** (decay = 0.9) | Lower layers receive smaller learning rates, preserving pre-trained linguistic knowledge (Sun et al., 2019) |
| **Gradual unfreezing** | Head-only training for epoch 1, then full end-to-end fine-tuning, to prevent catastrophic forgetting (Howard & Ruder, 2018) |
| **SWA** | Averaging top-3 checkpoints produces a flatter minimum and marginally improves κ (Izmailov et al., 2018) |
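
As an illustration of the asymmetric loss row, the per-label loss from Ridnik et al. (2021) can be written as below. This is a sketch rather than the training code: the `margin` (probability-shifting) parameter is part of the published loss but is not specified in this card, so it defaults to 0 here, and `bce` is included only for comparison.

```python
import math

def asymmetric_loss(p, y, gamma_pos=1.0, gamma_neg=4.0, margin=0.0):
    """Per-label asymmetric loss for a sigmoid probability p and a binary target y."""
    eps = 1e-8
    if y == 1:
        return -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
    p_m = max(p - margin, 0.0)  # probability shifting; margin=0 disables it (assumption)
    return -(p_m ** gamma_neg) * math.log(max(1.0 - p_m, eps))

def bce(p, y):
    """Plain binary cross-entropy for the same single label."""
    return -math.log(p if y == 1 else 1.0 - p)

# An easy negative (p = 0.1, y = 0) is down-weighted by p**4 relative to BCE.
print(asymmetric_loss(0.1, 0), bce(0.1, 0))
# A hard negative (p = 0.9, y = 0) keeps most of its loss.
print(asymmetric_loss(0.9, 0), bce(0.9, 0))
```

With γ_neg = 4, the easy negative contributes about 10⁻⁴ of its cross-entropy value, which is how the many confident negatives in a five-output sigmoid setup stop dominating the gradient.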
## Training
### Training Data
The model was fine-tuned on the [expectancy_value_pool_v2](https://huggingface.co/datasets/christiqn/expectancy_value_pool_v2) dataset — approximately 15,000 synthetic psychological test items generated with construct-level attribution, filtered to remove participial-phrase fragments. The data was split 85/7.5/7.5 into train/validation/test sets with stratified sampling.
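
A stratified 85/7.5/7.5 split can be sketched as follows. The function name, the seed, and the toy label list are assumptions for illustration; the actual split procedure used for this model is not documented beyond the proportions above.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.85, 0.075, 0.075), seed=0):
    """Return (train, val, test) index lists that preserve per-class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_train = round(fractions[0] * len(idxs))
        n_val = round(fractions[1] * len(idxs))
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]  # remainder goes to the test split
    return train, val, test

# Toy data: five classes with 200 items each.
labels = [c for c in "ABCDE" for _ in range(200)]
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))  # 850 75 75
```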
### Training Procedure
- **Epochs:** 12 (with early stopping, patience = 4, monitored on validation κ)
- **Batch size:** 24 per device × 2 gradient accumulation steps = 48 effective
- **Optimizer:** AdamW (lr = 3e-5, weight decay = 0.01)
- **Scheduler:** Linear with 10% warmup
- **Precision:** bf16 mixed-precision
- **Hardware:** Single NVIDIA H100 GPU
- **Post-training:** Stochastic Weight Averaging of top-3 checkpoints
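
The learning-rate settings above, combined with the LLRD decay of 0.9 from the design table, can be illustrated with two small helpers. This is a sketch under assumptions: the layer-indexing convention (top encoder layer receives the base rate) and the helper names are illustrative, and the 12-layer count refers to the MPNet base encoder.

```python
def layer_lr(layer, num_layers=12, base_lr=3e-5, decay=0.9):
    """LLRD: the top encoder layer gets base_lr; each layer below is scaled by `decay`."""
    return base_lr * decay ** (num_layers - layer)

def linear_warmup_factor(step, total_steps, warmup_frac=0.10):
    """Learning-rate multiplier: linear ramp-up during warmup, then linear decay to zero."""
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

effective_batch = 24 * 2  # per-device batch size x gradient accumulation steps
print(effective_batch)
print([f"{layer_lr(l):.2e}" for l in (12, 11, 1)])
print([linear_warmup_factor(s, 1000) for s in (0, 50, 100, 550, 1000)])
```

Under this convention the lowest encoder layer trains at roughly a third of the base rate (0.9¹¹ ≈ 0.31), which is the sense in which lower layers are preserved.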
### Training Results (Held-Out Synthetic Test Set)
| Metric | Value |
|---|---|
| Accuracy | 0.990 |
| Macro F1 | 0.990 |
| Cohen's κ | 0.990 |
> **Note:** These near-perfect numbers reflect performance on the synthetic data distribution. Real-world performance on human-authored items is reported below.
## Evaluation
### Test Set
The model was evaluated on **N = 1,284 human-coded test items** from [christiqn/EVT-items](https://huggingface.co/datasets/christiqn/EVT-items), drawn from published and unpublished psychological instruments spanning 95 distinct scales. Items were independently coded by a human expert and by the model. Items marked as formatting artefacts, third-person phrasing, or multiple anchors were excluded prior to evaluation. All confidence intervals are bias-corrected and accelerated (BCa) bootstrap CIs with B = 10,000.
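
For readers unfamiliar with BCa intervals, the following is a minimal standard-library sketch applied to overall accuracy. It is not the evaluation script: the function name is hypothetical, the correctness vector is rebuilt from the reported counts (1,040 of 1,284 items agree), and `n_boot` defaults to 2,000 rather than 10,000 to keep the example fast, so the bounds will differ slightly from the published ones.

```python
import random
from statistics import NormalDist

def bca_interval(data, stat, n_boot=2000, alpha=0.05, seed=0):
    """Bias-corrected and accelerated bootstrap CI for stat(data)."""
    rng = random.Random(seed)
    n = len(data)
    theta = stat(data)
    boot = sorted(stat(rng.choices(data, k=n)) for _ in range(n_boot))
    nd = NormalDist()

    # Bias correction: where the point estimate sits in the bootstrap distribution.
    below = sum(b < theta for b in boot)
    z0 = nd.inv_cdf(min(max(below / n_boot, 1 / n_boot), 1 - 1 / n_boot))

    # Acceleration from leave-one-out (jackknife) estimates.
    jack = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    mean_jack = sum(jack) / n
    num = sum((mean_jack - j) ** 3 for j in jack)
    den = 6.0 * sum((mean_jack - j) ** 2 for j in jack) ** 1.5
    a = num / den if den else 0.0

    def adjusted(q):
        z = nd.inv_cdf(q)
        return nd.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))

    lo = boot[int(adjusted(alpha / 2) * (n_boot - 1))]
    hi = boot[int(adjusted(1 - alpha / 2) * (n_boot - 1))]
    return theta, lo, hi

def accuracy(correct):
    return sum(correct) / len(correct)

# 1 = model agrees with the human coder, 0 = disagreement.
correct = [1] * 1040 + [0] * 244
point, lo, hi = bca_interval(correct, accuracy)
print(f"accuracy = {point:.3f}, 95% BCa CI = [{lo:.3f}, {hi:.3f}]")
```

For κ or F1 the same function applies if `data` is a list of (human, model) label pairs and `stat` computes the metric from those pairs.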
### Agreement with Human Coder (Full 6-Class)
| Metric | Value | 95% BCa CI |
|---|---|---|
| **Cohen's κ (unweighted)** | **0.767** | [0.741, 0.793] |
| Cohen's κ (linear weighted) | 0.768 | [0.735, 0.797] |
| Krippendorff's α | 0.767 | [0.740, 0.793] |
| PABAK | 0.772 | — |
| Overall accuracy | 0.810 | [0.787, 0.830] |
| Macro F1 | 0.812 | [0.790, 0.834] |
| Weighted F1 | 0.807 | [0.785, 0.828] |
Cohen's κ = .77 indicates **substantial agreement** according to Landis & Koch (1977).
### Per-Class Performance (6-Class)
| Class | Precision | Recall | F1 | κ (OvR) | n |
|---|---|---|---|---|---|
| ATTAINMENT_VALUE | .800 [.726, .868] | .865 [.798, .926] | .831 [.775, .880] | .815 | 111 |
| COST | .795 [.732, .854] | .859 [.800, .912] | .826 [.778, .869] | .802 | 149 |
| EXPECTANCY | .843 [.801, .881] | .886 [.850, .921] | .864 [.834, .892] | .820 | 308 |
| INTRINSIC_VALUE | .839 [.793, .883] | .893 [.852, .931] | .865 [.831, .896] | .834 | 234 |
| OTHER | .726 [.668, .780] | .636 [.580, .691] | .678 [.631, .722] | .595 | 283 |
| UTILITY_VALUE | .846 [.793, .897] | .774 [.716, .831] | .808 [.764, .849] | .775 | 199 |
### Confusion Matrix (6-Class)
| | Pred: AV | Pred: CO | Pred: EX | Pred: IV | Pred: OT | Pred: UV |
|---|---|---|---|---|---|---|
| **True: AV** | **96** | 1 | 1 | 5 | 4 | 4 |
| **True: CO** | 1 | **128** | 1 | 10 | 7 | 2 |
| **True: EX** | 0 | 11 | **273** | 5 | 18 | 1 |
| **True: IV** | 4 | 4 | 0 | **209** | 15 | 2 |
| **True: OT** | 13 | 14 | 40 | 17 | **180** | 19 |
| **True: UV** | 6 | 3 | 9 | 3 | 24 | **154** |
Cramér's V = 0.782 (6-class).
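
The headline agreement figures can be recomputed directly from the confusion matrix above. The sketch below uses only the standard library; the function name is illustrative, and PABAK is computed with the multi-category form (k·p_o − 1)/(k − 1), which reproduces the reported value.

```python
LABELS = ["AV", "CO", "EX", "IV", "OT", "UV"]
CM = [  # rows = human label, columns = model prediction
    [96, 1, 1, 5, 4, 4],
    [1, 128, 1, 10, 7, 2],
    [0, 11, 273, 5, 18, 1],
    [4, 4, 0, 209, 15, 2],
    [13, 14, 40, 17, 180, 19],
    [6, 3, 9, 3, 24, 154],
]

def agreement(cm):
    """Return accuracy, Cohen's kappa, PABAK, and macro F1 for a square table."""
    n = sum(map(sum, cm))
    k = len(cm)
    rows = [sum(r) for r in cm]                       # human totals per class
    cols = [sum(r[j] for r in cm) for j in range(k)]  # model totals per class
    p_o = sum(cm[i][i] for i in range(k)) / n         # observed agreement
    p_e = sum(rows[i] * cols[i] for i in range(k)) / n ** 2  # chance agreement
    kappa = (p_o - p_e) / (1 - p_e)
    pabak = (k * p_o - 1) / (k - 1)
    macro_f1 = sum(2 * cm[i][i] / (rows[i] + cols[i]) for i in range(k)) / k
    return p_o, kappa, pabak, macro_f1

acc, kappa, pabak, macro_f1 = agreement(CM)
print(f"accuracy={acc:.3f} kappa={kappa:.3f} PABAK={pabak:.3f} macroF1={macro_f1:.3f}")
# accuracy=0.810 kappa=0.767 PABAK=0.772 macroF1=0.812
```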
### Marginal Homogeneity (Stuart-Maxwell Test)
The Stuart-Maxwell test for marginal homogeneity is significant (χ²(5) = 18.62, p = .002), indicating a systematic difference between the model's and human coder's label distributions. The model tends to over-predict EXPECTANCY (+16), INTRINSIC_VALUE (+15), COST (+12), and ATTAINMENT_VALUE (+9) relative to the human coder, while under-predicting OTHER (−35) and UTILITY_VALUE (−17). The shift from OTHER towards four of the core constructs is consistent with the OvR + asymmetric loss design, which prioritises recall on core constructs; UTILITY_VALUE is the exception.
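
The signed differences quoted above are the model's column totals minus the human coder's row totals in the 6-class confusion matrix. A short sketch of that arithmetic follows; it recomputes only the marginal differences, not the Stuart-Maxwell statistic itself, and the variable names are illustrative.

```python
LABELS = ["ATTAINMENT_VALUE", "COST", "EXPECTANCY", "INTRINSIC_VALUE", "OTHER", "UTILITY_VALUE"]
CM = [  # rows = human label, columns = model prediction
    [96, 1, 1, 5, 4, 4],
    [1, 128, 1, 10, 7, 2],
    [0, 11, 273, 5, 18, 1],
    [4, 4, 0, 209, 15, 2],
    [13, 14, 40, 17, 180, 19],
    [6, 3, 9, 3, 24, 154],
]

human_totals = [sum(row) for row in CM]
model_totals = [sum(row[j] for row in CM) for j in range(len(CM))]

# Positive = the model assigns the label more often than the human coder.
marginal_diff = {
    label: model_totals[i] - human_totals[i] for i, label in enumerate(LABELS)
}
for label, diff in marginal_diff.items():
    print(f"{label:17s} {diff:+d}")
```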
### Core Construct Discrimination (5-Class, Excluding OTHER)
The OTHER category is not a psychological construct — it is a heterogeneous residual class that the model was never explicitly trained to recognise. To separately assess the model's ability to *discriminate among the five core EVT constructs*, a restricted analysis was conducted on items where both the human coder and the model assigned a core construct (n = 933, 72.7% of all items).
| Metric | Value | 95% BCa CI |
|---|---|---|
| **Cohen's κ (5-class)** | **0.899** | [0.876, 0.920] |
| Krippendorff's α | 0.899 | [0.876, 0.921] |
| Overall accuracy | 0.922 | [0.901, 0.937] |
| Macro F1 | 0.915 | [0.895, 0.933] |
| PABAK | 0.902 | — |
Cohen's κ = .90 indicates **almost perfect agreement** on construct discrimination.
| Class | Precision | Recall | F1 | n |
|---|---|---|---|---|
| ATTAINMENT_VALUE | .897 [.838, .950] | .897 [.835, .951] | .897 [.851, .937] | 107 |
| COST | .871 [.812, .922] | .901 [.849, .948] | .886 [.843, .922] | 142 |
| EXPECTANCY | .961 [.937, .982] | .941 [.912, .967] | .951 [.932, .968] | 290 |
| INTRINSIC_VALUE | .901 [.861, .938] | .954 [.924, .980] | .927 [.901, .950] | 219 |
| UTILITY_VALUE | .945 [.907, .977] | .880 [.830, .926] | .911 [.877, .941] | 175 |
Additionally, when evaluating all items the human coded as a core construct (n = 1,001) — allowing the model to predict OTHER — the model achieved κ = 0.822 [0.795, 0.848] and accuracy = 0.859 [0.835, 0.878]. Of these 1,001 items, 68 (6.8%) were incorrectly pushed to OTHER by the model.
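
Both restricted analyses can be derived from the 6-class confusion matrix reported earlier in this card. The sketch below shows one way to do so; the helper names and the `OT` index constant are illustrative, and confidence intervals are not recomputed.

```python
CM = [  # rows = human label, columns = model; order: AV, CO, EX, IV, OT, UV
    [96, 1, 1, 5, 4, 4],
    [1, 128, 1, 10, 7, 2],
    [0, 11, 273, 5, 18, 1],
    [4, 4, 0, 209, 15, 2],
    [13, 14, 40, 17, 180, 19],
    [6, 3, 9, 3, 24, 154],
]
OT = 4  # index of the OTHER class
core = [i for i in range(6) if i != OT]

def kappa(table):
    """Cohen's kappa for a square table (rows = human, columns = model)."""
    n = sum(map(sum, table))
    k = len(table)
    rows = [sum(r) for r in table]
    cols = [sum(r[j] for r in table) for j in range(k)]
    p_o = sum(table[i][i] for i in range(k)) / n
    p_e = sum(rows[i] * cols[i] for i in range(k)) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Scenario 1: both the human coder and the model assigned a core construct.
both_core = [[CM[i][j] for j in core] for i in core]
n_both = sum(map(sum, both_core))
acc_both = sum(both_core[i][i] for i in range(5)) / n_both

# Scenario 2: human assigned a core construct; the model may still predict OTHER.
human_core = [[0] * 6 if i == OT else CM[i] for i in range(6)]
n_human = sum(map(sum, human_core))
pushed_to_other = sum(CM[i][OT] for i in core)

print(n_both, round(kappa(both_core), 3), round(acc_both, 3))   # 933 0.899 0.922
print(n_human, round(kappa(human_core), 3), pushed_to_other)    # 1001 0.822 68
```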
### Summary Across Evaluation Scenarios
| Scenario | κ | Macro F1 | N |
|---|---|---|---|
| Full 6-class (with OTHER) | 0.767 | 0.812 | 1,284 |
| 5-class: both raters assigned core construct | **0.899** | **0.915** | 933 |
| Human = core, model unrestricted | 0.822 | — | 1,001 |
The jump from κ = .77 to κ = .90 indicates that the model's primary weakness lies at the **EVT/non-EVT detection boundary** (the OTHER category), not in discriminating among the five core constructs.
### Base Model Comparison
To quantify the effect of domain-specific fine-tuning, the classifier was compared against the un-fine-tuned base model (`all-mpnet-base-v2`). The base model achieved κ = −0.054 [−0.073, −0.033] vs. κ = 0.767 [0.741, 0.793] for the fine-tuned model (Δκ = +0.821 [0.787, 0.854]). A McNemar test with continuity correction indicated a highly significant difference in error rates, χ²(1) = 867.91, p < .001 (fine-tuned correct only: 911 items; base correct only: 14 items). The fine-tuned model outperforms the base model on all six classes.
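
The McNemar statistic depends only on the two discordant counts given above. As a quick check, the sketch below computes it with and without the continuity correction; the reported 867.91 matches the corrected form. The function names are illustrative, and the p-value uses the closed form for a χ² distribution with one degree of freedom.

```python
import math

def mcnemar_chi2(b, c, continuity=True):
    """McNemar statistic from the two discordant-pair counts b and c."""
    diff = abs(b - c) - (1 if continuity else 0)
    return diff ** 2 / (b + c)

def chi2_df1_pvalue(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

fine_tuned_only, base_only = 911, 14
stat = mcnemar_chi2(fine_tuned_only, base_only)
print(round(stat, 2))                                                       # 867.91
print(round(mcnemar_chi2(fine_tuned_only, base_only, continuity=False), 2)) # 869.85
print(chi2_df1_pvalue(stat) < 0.001)                                        # True
```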
| Class | Base F1 | Fine-Tuned F1 | Δ |
|---|---|---|---|
| ATTAINMENT_VALUE | 0.051 | 0.831 | +0.780 |
| COST | 0.090 | 0.826 | +0.736 |
| EXPECTANCY | 0.169 | 0.864 | +0.695 |
| INTRINSIC_VALUE | 0.200 | 0.865 | +0.666 |
| OTHER | 0.007 | 0.678 | +0.671 |
| UTILITY_VALUE | 0.010 | 0.808 | +0.799 |
### Known Limitations and Biases
**OTHER is the weakest category** (κ = .60, F1 = .68). Because OTHER is defined as "no construct activated above threshold" rather than a learned class, the model tends to find *some* EVT signal in ambiguous items that a human would code as non-EVT. The most frequent error type is predicting EXPECTANCY when the true label is OTHER (40 cases, 16.4% of all errors). When OTHER is excluded, the model reaches κ = .90 — almost perfect agreement — on the five core constructs.
**Systematic label distribution differences.** The Stuart-Maxwell test is significant (χ²(5) = 18.62, p = .002). The model over-predicts EXPECTANCY (+16), INTRINSIC_VALUE (+15), COST (+12), and ATTAINMENT_VALUE (+9), and under-predicts OTHER (−35) and UTILITY_VALUE (−17).
**UTILITY_VALUE under-recall.** The model is conservative on utility value (recall = .774 vs. precision = .846), suggesting that items referencing indirect or long-term benefits may be harder for the model to detect.
**Trained on synthetic data.** The training set consists of synthetically generated items, not real published instruments. While the model generalises well (κ = .77 overall, κ = .90 on core constructs), there may be stylistic patterns in human-authored items from specific subfields or cultures that the model has not encountered.
**English only.** The model has not been evaluated on items in other languages. Cross-lingual transfer should not be assumed.
## References
- Cohen, J. (1960). A coefficient of agreement for nominal scales. *Educational and Psychological Measurement, 20*(1), 37–46.
- Eccles, J. S., et al. (1983). Expectancies, values, and academic behaviors. In J. T. Spence (Ed.), *Achievement and achievement motives* (pp. 75–146). W. H. Freeman.
- Eccles, J. S., & Wigfield, A. (2002). Motivational beliefs, values, and goals. *Annual Review of Psychology, 53*(1), 109–132.
- Howard, J., & Ruder, S. (2018). Universal language model fine-tuning for text classification. *Proceedings of ACL*, 328–339.
- Izmailov, P., et al. (2018). Averaging weights leads to wider optima and better generalization. *Proceedings of UAI*, 876–885.
- Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. *Biometrics, 33*(1), 159–174.
- Miyato, T., et al. (2017). Adversarial training methods for semi-supervised text classification. *Proceedings of ICLR*.
- Ridnik, T., et al. (2021). Asymmetric loss for multi-label classification. *Proceedings of ICCV*, 82–91.
- Sun, C., et al. (2019). How to fine-tune BERT for text classification. *Proceedings of CCL*, 194–206.
- Wang, H., et al. (2018). CosFace: Large margin cosine loss for deep face recognition. *Proceedings of CVPR*, 5265–5274.