---
language:
- en
license: cc-by-4.0
library_name: transformers
tags:
- text-classification
- pytorch
- safetensors
- evt_classifier
- feature-extraction
- psychology
- expectancy-value-theory
- psychometrics
- test-item-classification
- educational-psychology
- motivation
- survey-instruments
- content-analysis
- custom_code
pipeline_tag: text-classification
datasets:
- christiqn/expectancy_value_pool
- christiqn/EVT-items
metrics:
- f1
- accuracy
model-index:
- name: mpnet-EVT-classifier
  results:
  - task:
      type: text-classification
    dataset:
      name: EVT-items (Human-coded psychological test items)
      type: christiqn/EVT-items
    metrics:
    - type: cohen_kappa
      value: 0.7674
      name: Cohen's Kappa (6-class, N=1284)
    - type: f1
      value: 0.8121
      name: Macro F1 (6-class)
    - type: accuracy
      value: 0.8100
      name: Accuracy (6-class)
    - type: krippendorff_alpha
      value: 0.7674
      name: Krippendorff's Alpha
---
# EVT Item Classifier — Expectancy-Value Theory Construct Tagger
A fine-tuned text classifier that assigns psychological test items (e.g., survey questions, scale items) to one of six categories derived from **Expectancy-Value Theory** (Eccles et al., 1983; Eccles & Wigfield, 2002):
| Label | EVT Construct | Example item |
|---|---|---|
| `ATTAINMENT_VALUE` | Personal importance of doing well | *"Doing well in math is important to who I am."* |
| `COST` | Perceived negative consequences of engagement | *"I have to give up too much to succeed in this class."* |
| `EXPECTANCY` | Beliefs about future success | *"I am confident I can master the skills taught in this course."* |
| `INTRINSIC_VALUE` | Enjoyment and interest | *"I find the content of this course very interesting."* |
| `UTILITY_VALUE` | Usefulness for future goals | *"What I learn in this class will be useful for my career."* |
| `OTHER` | Not classifiable as an EVT construct | *"I usually sit in the front row."* |
## Intended Use
This model is intended for **academic research** in educational psychology, motivation science, and psychometrics. Typical use cases include:
- **Automated content analysis** of existing item pools and questionnaire banks for EVT construct coverage
- **Scale development assistance** — screening candidate items during the item-writing phase
- **Systematic reviews** — coding large corpora of test items from published instruments
- **Construct validation** — checking whether items align with their intended EVT construct
### Out-of-Scope Uses
- **Clinical or diagnostic decision-making** — This model classifies test *items*, not respondents. It should not be used to assess individuals.
- **Replacement for human coding** — The model achieves substantial but not perfect agreement with human coders (κ = .77). It is intended as a tool to assist researchers, not to replace expert judgment.
- **Non-English items** — The model was trained and evaluated on English-language items only.
- **Non-EVT constructs** — The model will force any input into one of the six categories. Items from unrelated theoretical frameworks (e.g., Big Five personality, cognitive load) will be classified as `OTHER` at best, or spuriously assigned to an EVT category.
## How to Use
### Quick Start
```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load the tokenizer and the model (trust_remote_code is required for the custom head)
tokenizer = AutoTokenizer.from_pretrained("christiqn/mpnet-EVT-classifier")
model = AutoModel.from_pretrained("christiqn/mpnet-EVT-classifier", trust_remote_code=True)
model.eval()

# Inference on a single item
text = "I expect to do well in this course."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# Convert logits to sigmoid probabilities and report the highest-scoring label
probs = torch.sigmoid(outputs.logits)
label = model.config.id2label[probs[0].argmax().item()]
print(f"Predicted: {label}")
```
### Adjusting the Decision Threshold
The default threshold of 0.50 balances precision and recall. You can adjust it:
- **Lower threshold (e.g., 0.35):** More inclusive — fewer items classified as OTHER, higher recall on EVT constructs, but more false positives.
- **Higher threshold (e.g., 0.65):** More conservative — more items fall to OTHER, higher precision on EVT constructs, but more false negatives.
Choose based on your application: use a lower threshold for exploratory screening, a higher one when precision matters.
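The thresholding rule can be sketched in plain Python. This is a minimal illustration, not the model's own code: the helper name `assign_label`, the label order in `CONSTRUCTS`, and the probability values are assumptions. In practice the probabilities would come from `torch.sigmoid(outputs.logits)` and the label order from `model.config.id2label`.

```python
# Hypothetical label order; read the real one from model.config.id2label.
CONSTRUCTS = [
    "ATTAINMENT_VALUE", "COST", "EXPECTANCY", "INTRINSIC_VALUE", "UTILITY_VALUE",
]

def assign_label(probs, threshold=0.50, labels=CONSTRUCTS):
    """Return the top construct if its probability exceeds the threshold, else OTHER."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best] if probs[best] > threshold else "OTHER"

# Made-up sigmoid probabilities for one ambiguous item.
example_probs = [0.05, 0.10, 0.42, 0.08, 0.03]
for t in (0.35, 0.50, 0.65):
    print(f"threshold={t:.2f} -> {assign_label(example_probs, threshold=t)}")
```

With these made-up values the item is labelled `EXPECTANCY` at a threshold of 0.35 and falls back to `OTHER` at 0.50 and 0.65, which is the trade-off described above.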
## Model Description
### Architecture
The model uses a custom classification head on top of `sentence-transformers/all-mpnet-base-v2` (110M parameters):
```
MPNet encoder
ConcatPooling (mean + max over token embeddings → 2 × hidden_size)
Dropout(0.2) → Dense(2h → h) → GELU → Dropout(0.2)
NormLinear head (cosine similarity classifier, τ = 20)
5 independent sigmoid outputs (one per EVT construct)
```
The `OTHER` class is not explicitly learned. Instead, an item is classified as `OTHER` when no construct's sigmoid probability exceeds the decision threshold (default: 0.50). This **One-vs-Rest (OvR)** formulation avoids forcing the model to learn a coherent representation for the heterogeneous `OTHER` category.
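The pooling and cosine-head steps can be illustrated with a framework-free sketch. It is not the released implementation: the function names are hypothetical, the Dropout, Dense, and GELU layers are omitted, and treating τ as a multiplicative scale on the cosine similarity is an assumption.

```python
import math

def concat_pool(token_embeddings, attention_mask):
    """Masked mean and max over tokens, concatenated: output size is 2 * hidden."""
    kept = [tok for tok, m in zip(token_embeddings, attention_mask) if m]
    hidden = len(kept[0])
    mean = [sum(tok[d] for tok in kept) / len(kept) for d in range(hidden)]
    peak = [max(tok[d] for tok in kept) for d in range(hidden)]
    return mean + peak

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # guard against zero vectors
    return [x / norm for x in vec]

def norm_linear(features, class_weights, tau=20.0):
    """Cosine similarity to each class vector, scaled by tau (assumed to be a scale)."""
    f = l2_normalize(features)
    return [tau * sum(a * b for a, b in zip(f, l2_normalize(w))) for w in class_weights]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy input: three tokens with hidden size 2; the last token is padding.
tokens = [[1.0, 0.0], [0.0, 2.0], [9.0, 9.0]]
mask = [1, 1, 0]
pooled = concat_pool(tokens, mask)        # mean [0.5, 1.0] + max [1.0, 2.0]
weights = [pooled, [-x for x in pooled]]  # two hypothetical class vectors
logits = norm_linear(pooled, weights)
probs = [sigmoid(z) for z in logits]
print(pooled, logits, probs)
```

Because both features and class vectors are normalised, each logit is bounded by ±τ, so only the direction of the pooled representation affects the per-construct probability.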
### Key Design Choices
| Component | Rationale |
|---|---|
| **ConcatPooling** | Concatenating mean and max pooling captures both distributional and salient features from short text (Howard & Ruder, 2018) |
| **NormLinear (cosine) head** | L2-normalised features and weights make classification direction-based, improving robustness to domain shift (Wang et al., 2018) |
| **Asymmetric Loss** (γ_neg=4, γ_pos=1) | Down-weights easy negatives in the multi-label sigmoid setup, preventing the dominant negative class from overwhelming the loss (Ridnik et al., 2021) |
| **FGM adversarial training** | Embedding-space perturbations regularise the model, improving generalisation from synthetic to real items (Miyato et al., 2017) |
| **LLRD** (decay = 0.9) | Lower layers receive smaller learning rates, preserving pre-trained linguistic knowledge (Sun et al., 2019) |
| **Gradual unfreezing** | Head-only training for epoch 1, then full end-to-end fine-tuning, to prevent catastrophic forgetting (Howard & Ruder, 2018) |
| **SWA** | Averaging top-3 checkpoints produces a flatter minimum and marginally improves κ (Izmailov et al., 2018) |
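To make the effect of the asymmetric focusing parameters concrete, here is a per-output sketch of the loss from Ridnik et al. (2021) with γ_neg = 4 and γ_pos = 1. The function name is hypothetical, and the probability-margin (clipping) term from the paper is left out because the card does not say whether it was used.

```python
import math

def asymmetric_loss(p, y, gamma_neg=4.0, gamma_pos=1.0, eps=1e-8):
    """Asymmetric focal loss for one sigmoid output p and binary target y."""
    if y == 1:
        # Positive term: down-weighted by (1 - p) ** gamma_pos
        return -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
    # Negative term: down-weighted by p ** gamma_neg
    return -(p ** gamma_neg) * math.log(max(1.0 - p, eps))

easy_negative = asymmetric_loss(0.1, 0)   # confident, correct negative
hard_negative = asymmetric_loss(0.9, 0)   # confident, wrong negative
plain_bce_easy = -math.log(1.0 - 0.1)     # binary cross-entropy for comparison
print(easy_negative, hard_negative, plain_bce_easy)
```

An easy negative contributes several orders of magnitude less than it would under plain binary cross-entropy, while a confidently wrong negative keeps most of its weight.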
## Training
### Training Data
The model was fine-tuned on the [expectancy_value_pool_v2](https://huggingface.co/datasets/christiqn/expectancy_value_pool_v2) dataset — approximately 15,000 synthetic psychological test items generated with construct-level attribution, filtered to remove participial-phrase fragments. The data was split 85/7.5/7.5 into train/validation/test sets with stratified sampling.
### Training Procedure
- **Epochs:** 12 (with early stopping, patience = 4, monitored on validation κ)
- **Batch size:** 24 per device × 2 gradient accumulation steps = 48 effective
- **Optimizer:** AdamW (lr = 3e-5, weight decay = 0.01)
- **Scheduler:** Linear with 10% warmup
- **Precision:** bf16 mixed-precision
- **Hardware:** Single NVIDIA H100 GPU
- **Post-training:** Stochastic Weight Averaging of top-3 checkpoints
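As an illustration of how layer-wise learning-rate decay interacts with the settings above, the sketch below computes one learning rate per encoder layer. It assumes a 12-layer encoder, that the top layer receives the base rate of 3e-5, and that each layer below is scaled by 0.9; how the embeddings and the classification head were treated is not stated in this card, so they are left out. The function name is hypothetical.

```python
def llrd_learning_rates(base_lr=3e-5, decay=0.9, num_layers=12):
    """Map layer index (0 = bottom) to its learning rate under geometric decay."""
    return {
        layer: base_lr * decay ** (num_layers - 1 - layer)
        for layer in range(num_layers)
    }

rates = llrd_learning_rates()
for layer in (0, 5, 11):
    print(f"layer {layer:2d}: lr = {rates[layer]:.2e}")

# Effective batch size from the settings listed above.
effective_batch = 24 * 2
print("effective batch size:", effective_batch)
```

Under these assumptions the bottom layer trains at roughly a third of the top layer's rate, which is the mechanism by which lower-level linguistic features are preserved.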
### Training Results (Held-Out Synthetic Test Set)
| Metric | Value |
|---|---|
| Accuracy | 0.990 |
| Macro F1 | 0.990 |
| Cohen's κ | 0.990 |
> **Note:** These near-perfect numbers reflect performance on the synthetic data distribution. Real-world performance on human-authored items is reported below.
## Evaluation
### Test Set
The model was evaluated on **N = 1,284 human-coded test items** from [christiqn/EVT-items](https://huggingface.co/datasets/christiqn/EVT-items), drawn from published and unpublished psychological instruments spanning 95 distinct scales. Items were independently coded by a human expert and by the model. Items marked as formatting artefacts, third-person phrasing, or multiple anchors were excluded prior to evaluation. All confidence intervals are bias-corrected and accelerated (BCa) bootstrap CIs with B = 10,000.
### Agreement with Human Coder (Full 6-Class)
| Metric | Value | 95% BCa CI |
|---|---|---|
| **Cohen's κ (unweighted)** | **0.767** | [0.741, 0.793] |
| Cohen's κ (linear weighted) | 0.768 | [0.735, 0.797] |
| Krippendorff's α | 0.767 | [0.740, 0.793] |
| PABAK | 0.772 | — |
| Overall accuracy | 0.810 | [0.787, 0.830] |
| Macro F1 | 0.812 | [0.790, 0.834] |
| Weighted F1 | 0.807 | [0.785, 0.828] |
Cohen's κ = .77 indicates **substantial agreement** according to Landis & Koch (1977).
### Per-Class Performance (6-Class)
| Class | Precision | Recall | F1 | κ (OvR) | n |
|---|---|---|---|---|---|
| ATTAINMENT_VALUE | .800 [.726, .868] | .865 [.798, .926] | .831 [.775, .880] | .815 | 111 |
| COST | .795 [.732, .854] | .859 [.800, .912] | .826 [.778, .869] | .802 | 149 |
| EXPECTANCY | .843 [.801, .881] | .886 [.850, .921] | .864 [.834, .892] | .820 | 308 |
| INTRINSIC_VALUE | .839 [.793, .883] | .893 [.852, .931] | .865 [.831, .896] | .834 | 234 |
| OTHER | .726 [.668, .780] | .636 [.580, .691] | .678 [.631, .722] | .595 | 283 |
| UTILITY_VALUE | .846 [.793, .897] | .774 [.716, .831] | .808 [.764, .849] | .775 | 199 |
### Confusion Matrix (6-Class)
| | Pred: AV | Pred: CO | Pred: EX | Pred: IV | Pred: OT | Pred: UV |
|---|---|---|---|---|---|---|
| **True: AV** | **96** | 1 | 1 | 5 | 4 | 4 |
| **True: CO** | 1 | **128** | 1 | 10 | 7 | 2 |
| **True: EX** | 0 | 11 | **273** | 5 | 18 | 1 |
| **True: IV** | 4 | 4 | 0 | **209** | 15 | 2 |
| **True: OT** | 13 | 14 | 40 | 17 | **180** | 19 |
| **True: UV** | 6 | 3 | 9 | 3 | 24 | **154** |
Cramér's V = 0.782 (6-class).
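The headline agreement statistics can be recomputed directly from the confusion matrix above. The sketch below uses only the standard library; the helper names are illustrative, and PABAK is computed with the multi-class form (k·p_o − 1)/(k − 1), which is an assumption that happens to reproduce the reported value.

```python
# Rows = human label, columns = model prediction; order: AV, CO, EX, IV, OT, UV.
CM = [
    [96, 1, 1, 5, 4, 4],
    [1, 128, 1, 10, 7, 2],
    [0, 11, 273, 5, 18, 1],
    [4, 4, 0, 209, 15, 2],
    [13, 14, 40, 17, 180, 19],
    [6, 3, 9, 3, 24, 154],
]

def agreement(cm):
    """Return (accuracy, Cohen's kappa, PABAK) for a square confusion matrix."""
    k = len(cm)
    n = sum(sum(row) for row in cm)
    row_tot = [sum(row) for row in cm]
    col_tot = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    p_o = sum(cm[i][i] for i in range(k)) / n                  # observed agreement
    p_e = sum(r * c for r, c in zip(row_tot, col_tot)) / n**2  # chance agreement
    kappa = (p_o - p_e) / (1 - p_e)
    pabak = (k * p_o - 1) / (k - 1)
    return p_o, kappa, pabak

def precision_recall(cm, i):
    """Precision and recall for class index i."""
    predicted = sum(row[i] for row in cm)
    actual = sum(cm[i])
    return cm[i][i] / predicted, cm[i][i] / actual

acc, kappa, pabak = agreement(CM)
av_precision, av_recall = precision_recall(CM, 0)
print(f"accuracy={acc:.3f} kappa={kappa:.3f} PABAK={pabak:.3f}")
print(f"ATTAINMENT_VALUE precision={av_precision:.3f} recall={av_recall:.3f}")
```

Run on the matrix above, this gives accuracy 0.810, κ 0.767, and PABAK 0.772, in line with the agreement table, and precision .800 and recall .865 for ATTAINMENT_VALUE.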
### Marginal Homogeneity (Stuart-Maxwell Test)
The Stuart-Maxwell test for marginal homogeneity is significant (χ²(5) = 18.62, p = .002), indicating a systematic difference between the model's and human coder's label distributions. The model tends to over-predict EXPECTANCY (+16), INTRINSIC_VALUE (+15), COST (+12), and ATTAINMENT_VALUE (+9) relative to the human coder, while under-predicting OTHER (−35) and UTILITY_VALUE (−17). This pattern is consistent with the OvR + asymmetric loss design, which prioritises recall on core constructs.
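The over- and under-prediction counts are the differences between the column totals (model) and row totals (human) of the 6-class confusion matrix. The short sketch below recomputes only those marginal differences, not the Stuart-Maxwell statistic itself; the helper name is illustrative.

```python
LABELS = [
    "ATTAINMENT_VALUE", "COST", "EXPECTANCY",
    "INTRINSIC_VALUE", "OTHER", "UTILITY_VALUE",
]
# Rows = human label, columns = model prediction, in the order of LABELS.
CM = [
    [96, 1, 1, 5, 4, 4],
    [1, 128, 1, 10, 7, 2],
    [0, 11, 273, 5, 18, 1],
    [4, 4, 0, 209, 15, 2],
    [13, 14, 40, 17, 180, 19],
    [6, 3, 9, 3, 24, 154],
]

def marginal_differences(cm, labels):
    """Model total minus human total for each class."""
    return {
        lab: sum(row[j] for row in cm) - sum(cm[j])
        for j, lab in enumerate(labels)
    }

diffs = marginal_differences(CM, LABELS)
for lab, d in diffs.items():
    print(f"{lab:17s} {d:+d}")
```

The differences sum to zero by construction, so every item the model moves out of OTHER or UTILITY_VALUE shows up as a surplus in another class.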
### Core Construct Discrimination (5-Class, Excluding OTHER)
The OTHER category is not a psychological construct — it is a heterogeneous residual class that the model was never explicitly trained to recognise. To separately assess the model's ability to *discriminate among the five core EVT constructs*, a restricted analysis was conducted on items where both the human coder and the model assigned a core construct (n = 933, 72.7% of all items).
| Metric | Value | 95% BCa CI |
|---|---|---|
| **Cohen's κ (5-class)** | **0.899** | [0.876, 0.920] |
| Krippendorff's α | 0.899 | [0.876, 0.921] |
| Overall accuracy | 0.922 | [0.901, 0.937] |
| Macro F1 | 0.915 | [0.895, 0.933] |
| PABAK | 0.902 | — |
Cohen's κ = .90 indicates **almost perfect agreement** on construct discrimination.
| Class | Precision | Recall | F1 | n |
|---|---|---|---|---|
| ATTAINMENT_VALUE | .897 [.838, .950] | .897 [.835, .951] | .897 [.851, .937] | 107 |
| COST | .871 [.812, .922] | .901 [.849, .948] | .886 [.843, .922] | 142 |
| EXPECTANCY | .961 [.937, .982] | .941 [.912, .967] | .951 [.932, .968] | 290 |
| INTRINSIC_VALUE | .901 [.861, .938] | .954 [.924, .980] | .927 [.901, .950] | 219 |
| UTILITY_VALUE | .945 [.907, .977] | .880 [.830, .926] | .911 [.877, .941] | 175 |
Additionally, when evaluating all items the human coded as a core construct (n = 1,001) — allowing the model to predict OTHER — the model achieved κ = 0.822 [0.795, 0.848] and accuracy = 0.859 [0.835, 0.878]. Of these 1,001 items, 68 (6.8%) were incorrectly pushed to OTHER by the model.
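The subset sizes and the 5-class agreement can be derived from the 6-class confusion matrix by dropping the OTHER row and column. The following sketch shows that derivation; variable and function names are illustrative, and it computes point estimates only, without the bootstrap intervals.

```python
# Rows = human label, columns = model prediction; order: AV, CO, EX, IV, OT, UV.
CM = [
    [96, 1, 1, 5, 4, 4],
    [1, 128, 1, 10, 7, 2],
    [0, 11, 273, 5, 18, 1],
    [4, 4, 0, 209, 15, 2],
    [13, 14, 40, 17, 180, 19],
    [6, 3, 9, 3, 24, 154],
]
OT = 4
core = [i for i in range(len(CM)) if i != OT]

# Items where both the human coder and the model chose a core construct.
both_core = [[CM[i][j] for j in core] for i in core]
n_both = sum(sum(row) for row in both_core)

# Items the human coded as core, and how many of those the model sent to OTHER.
n_human_core = sum(sum(CM[i]) for i in core)
pushed_to_other = sum(CM[i][OT] for i in core)

def accuracy_and_kappa(cm):
    k = len(cm)
    n = sum(sum(row) for row in cm)
    p_o = sum(cm[i][i] for i in range(k)) / n
    p_e = sum(sum(cm[i]) * sum(row[i] for row in cm) for i in range(k)) / n**2
    return p_o, (p_o - p_e) / (1 - p_e)

acc5, kappa5 = accuracy_and_kappa(both_core)
acc_human_core = sum(CM[i][i] for i in core) / n_human_core
print(n_both, n_human_core, pushed_to_other)
print(f"5-class accuracy={acc5:.3f} kappa={kappa5:.3f}")
print(f"human-core accuracy={acc_human_core:.3f}")
```

This reproduces n = 933, n = 1,001, the 68 items pushed to OTHER, and the point estimates reported in this section.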
### Summary Across Evaluation Scenarios
| Scenario | κ | Macro F1 | N |
|---|---|---|---|
| Full 6-class (with OTHER) | 0.767 | 0.812 | 1,284 |
| 5-class: both raters assigned core construct | **0.899** | **0.915** | 933 |
| Human = core, model unrestricted | 0.822 | — | 1,001 |
The jump from κ = .77 to κ = .90 confirms that the model's primary weakness lies at the **EVT/non-EVT detection boundary** (the OTHER category), not in discriminating among the five core constructs.
### Base Model Comparison
To quantify the effect of domain-specific fine-tuning, the classifier was compared against the un-fine-tuned base model (`all-mpnet-base-v2`). The base model achieved κ = −0.054 [−0.073, −0.033] vs. κ = 0.767 [0.741, 0.793] for the fine-tuned model (Δκ = +0.821 [0.787, 0.854]). A McNemar test indicated a highly significant difference in error rates, χ²(1) = 867.91, p < .001 (fine-tuned correct only: 911 items; base correct only: 14 items). The fine-tuned model outperforms the base model on all six classes.
| Class | Base F1 | Fine-Tuned F1 | Δ |
|---|---|---|---|
| ATTAINMENT_VALUE | 0.051 | 0.831 | +0.780 |
| COST | 0.090 | 0.826 | +0.736 |
| EXPECTANCY | 0.169 | 0.864 | +0.695 |
| INTRINSIC_VALUE | 0.200 | 0.865 | +0.666 |
| OTHER | 0.007 | 0.678 | +0.671 |
| UTILITY_VALUE | 0.010 | 0.808 | +0.799 |
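The McNemar statistic reported above depends only on the two discordant counts (911 and 14). A minimal sketch follows; using the continuity-corrected form is an assumption, chosen because it reproduces the reported χ² value, and the function name is illustrative.

```python
def mcnemar_chi2(b, c, continuity=True):
    """McNemar chi-square from the two discordant counts b and c."""
    diff = abs(b - c) - (1 if continuity else 0)
    return diff**2 / (b + c)

fine_tuned_only = 911  # items only the fine-tuned model classified correctly
base_only = 14         # items only the base model classified correctly

chi2_corrected = mcnemar_chi2(fine_tuned_only, base_only)
chi2_uncorrected = mcnemar_chi2(fine_tuned_only, base_only, continuity=False)
print(f"corrected={chi2_corrected:.2f} uncorrected={chi2_uncorrected:.2f}")
```

With one degree of freedom, either value is far beyond conventional significance thresholds, so the choice of correction does not change the conclusion.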
### Known Limitations and Biases
**OTHER is the weakest category** (κ = .60, F1 = .68). Because OTHER is defined as "no construct activated above threshold" rather than a learned class, the model tends to find *some* EVT signal in ambiguous items that a human would code as non-EVT. The most frequent error type is predicting EXPECTANCY when the true label is OTHER (40 cases, 16.4% of all errors). When OTHER is excluded, the model reaches κ = .90 — almost perfect agreement — on the five core constructs.
**Systematic label distribution differences.** The Stuart-Maxwell test is significant (χ²(5) = 18.62, p = .002). The model over-predicts EXPECTANCY (+16), INTRINSIC_VALUE (+15), COST (+12), and ATTAINMENT_VALUE (+9), and under-predicts OTHER (−35) and UTILITY_VALUE (−17).
**UTILITY_VALUE under-recall.** The model is conservative on utility value (recall = .774 vs. precision = .846), suggesting that items referencing indirect or long-term benefits may be harder for the model to detect.
**Trained on synthetic data.** The training set consists of synthetically generated items, not real published instruments. While the model generalises well (κ = .77 overall, κ = .90 on core constructs), there may be stylistic patterns in human-authored items from specific subfields or cultures that the model has not encountered.
**English only.** The model has not been evaluated on items in other languages. Cross-lingual transfer should not be assumed.
## References
- Cohen, J. (1960). A coefficient of agreement for nominal scales. *Educational and Psychological Measurement, 20*(1), 37–46.
- Eccles, J. S., et al. (1983). Expectancies, values, and academic behaviors. In J. T. Spence (Ed.), *Achievement and achievement motives* (pp. 75–146). W. H. Freeman.
- Eccles, J. S., & Wigfield, A. (2002). Motivational beliefs, values, and goals. *Annual Review of Psychology, 53*(1), 109–132.
- Howard, J., & Ruder, S. (2018). Universal language model fine-tuning for text classification. *Proceedings of ACL*, 328–339.
- Izmailov, P., et al. (2018). Averaging weights leads to wider optima and better generalization. *Proceedings of UAI*, 876–885.
- Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. *Biometrics, 33*(1), 159–174.
- Miyato, T., et al. (2017). Adversarial training methods for semi-supervised text classification. *Proceedings of ICLR*.
- Ridnik, T., et al. (2021). Asymmetric loss for multi-label classification. *Proceedings of ICCV*, 82–91.
- Sun, C., et al. (2019). How to fine-tune BERT for text classification. *Proceedings of CCL*, 194–206.
- Wang, H., et al. (2018). CosFace: Large margin cosine loss for deep face recognition. *Proceedings of CVPR*, 5265–5274.