	---
	language:
	- en
	license: cc-by-4.0
	library_name: transformers
	tags:
	- text-classification
	- pytorch
	- safetensors
	- evt_classifier
	- feature-extraction
	- psychology
	- expectancy-value-theory
	- psychometrics
	- test-item-classification
	- educational-psychology
	- motivation
	- survey-instruments
	- content-analysis
	- custom_code
	pipeline_tag: text-classification
	datasets:
	- christiqn/expectancy_value_pool
	- christiqn/EVT-items
	metrics:
	- f1
	- accuracy
	model-index:
	- name: mpnet-EVT-classifier
	  results:
	  - task:
	      type: text-classification
	    dataset:
	      name: EVT-items (Human-coded psychological test items)
	      type: christiqn/EVT-items
	    metrics:
	    - type: cohen_kappa
	      value: 0.7674
	      name: Cohen's Kappa (6-class, N=1284)
	    - type: f1
	      value: 0.8121
	      name: Macro F1 (6-class)
	    - type: accuracy
	      value: 0.8100
	      name: Accuracy (6-class)
	    - type: krippendorff_alpha
	      value: 0.7674
	      name: Krippendorff's Alpha
	---

	# EVT Item Classifier — Expectancy-Value Theory Construct Tagger

	A fine-tuned text classifier that assigns psychological test items (e.g., survey questions, scale items) to one of six categories derived from Expectancy-Value Theory (Eccles et al., 1983; Eccles & Wigfield, 2002):

	\| Label \| EVT Construct \| Example item \|
	\|---\|---\|---\|
	\| `ATTAINMENT_VALUE` \| Personal importance of doing well \| "Doing well in math is important to who I am." \|
	\| `COST` \| Perceived negative consequences of engagement \| "I have to give up too much to succeed in this class." \|
	\| `EXPECTANCY` \| Beliefs about future success \| "I am confident I can master the skills taught in this course." \|
	\| `INTRINSIC_VALUE` \| Enjoyment and interest \| "I find the content of this course very interesting." \|
	\| `UTILITY_VALUE` \| Usefulness for future goals \| "What I learn in this class will be useful for my career." \|
	\| `OTHER` \| Not classifiable as an EVT construct \| "I usually sit in the front row." \|


	## Intended Use

	This model is intended for academic research in educational psychology, motivation science, and psychometrics. Typical use cases include:

	- Automated content analysis of existing item pools and questionnaire banks for EVT construct coverage
	- Scale development assistance — screening candidate items during the item-writing phase
	- Systematic reviews — coding large corpora of test items from published instruments
	- Construct validation — checking whether items align with their intended EVT construct

	### Out-of-Scope Uses

	- Clinical or diagnostic decision-making — This model classifies test items, not respondents. It should not be used to assess individuals.
	- Replacement for human coding — The model achieves substantial but not perfect agreement with human coders (κ = .77). It is intended as a tool to assist researchers, not to replace expert judgment.
	- Non-English items — The model was trained and evaluated on English-language items only.
	- Non-EVT constructs — The model will force any input into one of the six categories. Items from unrelated theoretical frameworks (e.g., Big Five personality, cognitive load) will be classified as `OTHER` at best, or spuriously assigned to an EVT category.


	## How to Use

	### Quick Start

	```python
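	import torch
	from transformers import AutoModel, AutoTokenizer

	# Load the tokenizer and the custom model (trust_remote_code runs code from the repo)
	tokenizer = AutoTokenizer.from_pretrained("christiqn/mpnet-EVT-classifier")
	model = AutoModel.from_pretrained("christiqn/mpnet-EVT-classifier", trust_remote_code=True)
	model.eval()

	# Inference
	text = "I expect to do well in this course."
	inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

	with torch.no_grad():
	    outputs = model(**inputs)
	    # Assumes logits holds one score per EVT construct (see Architecture)
	    probs = torch.sigmoid(outputs.logits)[0]

	# Assign OTHER when no construct clears the decision threshold (default 0.50)
	threshold = 0.50
	top_prob, top_idx = probs.max(dim=-1)
	label = model.config.id2label[top_idx.item()] if top_prob.item() > threshold else "OTHER"
	print(f"Predicted: {label}")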
	```

	### Adjusting the Decision Threshold

	The default threshold of 0.50 balances precision and recall. You can adjust it:

	- Lower threshold (e.g., 0.35): More inclusive — fewer items classified as OTHER, higher recall on EVT constructs, but more false positives.
	- Higher threshold (e.g., 0.65): More conservative — more items fall to OTHER, higher precision on EVT constructs, but more false negatives.

	Choose based on your application: use a lower threshold for exploratory screening, a higher one when precision matters.
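The effect of moving the threshold can be sketched in plain Python. This is an illustrative sketch only: the `decide` helper, the label order in `CONSTRUCTS`, and the probabilities are assumptions made up for the example, so check `model.config.id2label` for the real index order.

```python
from typing import Sequence

# Hypothetical label order for illustration; the real order lives in model.config.id2label.
CONSTRUCTS = ["ATTAINMENT_VALUE", "COST", "EXPECTANCY", "INTRINSIC_VALUE", "UTILITY_VALUE"]


def decide(probs: Sequence[float], threshold: float = 0.50) -> str:
    """Return the highest-scoring construct, or OTHER if it does not exceed the threshold."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return CONSTRUCTS[best] if probs[best] > threshold else "OTHER"


# Made-up sigmoid outputs for one ambiguous item
probs = [0.05, 0.10, 0.42, 0.08, 0.03]
for t in (0.35, 0.50, 0.65):
    print(t, decide(probs, t))
# 0.35 EXPECTANCY
# 0.5 OTHER
# 0.65 OTHER
```

The same item moves from `EXPECTANCY` to `OTHER` as the threshold rises, which is the precision/recall trade-off described above.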


	## Model Description

	### Architecture

	The model uses a custom classification head on top of `sentence-transformers/all-mpnet-base-v2` (110M parameters):

	```
	MPNet encoder
	↓
	ConcatPooling (mean + max over token embeddings → 2 × hidden_size)
	↓
	Dropout(0.2) → Dense(2h → h) → GELU → Dropout(0.2)
	↓
	NormLinear head (cosine similarity classifier, τ = 20)
	↓
	5 independent sigmoid outputs (one per EVT construct)
	```

	The `OTHER` class is not explicitly learned. Instead, an item is classified as `OTHER` when no construct's sigmoid probability exceeds the decision threshold (default: 0.50). This One-vs-Rest (OvR) formulation avoids forcing the model to learn a coherent representation for the heterogeneous `OTHER` category.
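To make the data flow concrete, the pooling and head steps can be sketched in dependency-free Python. This is not the repository's implementation: it skips the Dropout/Dense/GELU block, uses toy 2-dimensional embeddings and made-up class vectors, and assumes τ acts as a multiplicative scale on the cosine similarity.

```python
import math


def concat_pool(token_embs, mask):
    """Mean and max pooling over unmasked tokens, concatenated (2 x hidden_size)."""
    kept = [e for e, m in zip(token_embs, mask) if m]
    dim = len(kept[0])
    mean = [sum(e[d] for e in kept) / len(kept) for d in range(dim)]
    mx = [max(e[d] for e in kept) for d in range(dim)]
    return mean + mx


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def norm_linear(features, class_weights, tau=20.0):
    """Cosine similarity between the features and each class vector, scaled by tau (assumed)."""
    f = l2_normalize(features)
    return [tau * sum(a * b for a, b in zip(f, l2_normalize(w))) for w in class_weights]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Toy input: three tokens with hidden_size 2; the last token is padding
token_embs = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
mask = [1, 1, 0]
pooled = concat_pool(token_embs, mask)  # [0.5, 0.5, 1.0, 1.0]

# Two made-up class vectors (the real head has five, one per construct)
class_weights = [[1.0, 1.0, 1.0, 1.0], [-1.0, -1.0, -1.0, -1.0]]
logits = norm_linear(pooled, class_weights)
probs = [sigmoid(z) for z in logits]
print(pooled, [round(p, 3) for p in probs])
```

Because both features and class vectors are L2-normalised, each logit is bounded by ±τ, so only the direction of the feature vector affects the prediction.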

	### Key Design Choices

	\| Component \| Rationale \|
	\|---\|---\|
	\| ConcatPooling \| Concatenating mean and max pooling captures both distributional and salient features from short text (Howard & Ruder, 2018) \|
	\| NormLinear (cosine) head \| L2-normalised features and weights make classification direction-based, improving robustness to domain shift (Wang et al., 2018) \|
	\| Asymmetric Loss (γ_neg=4, γ_pos=1) \| Down-weights easy negatives in the multi-label sigmoid setup, preventing the dominant negative class from overwhelming the loss (Ridnik et al., 2021) \|
	\| FGM adversarial training \| Embedding-space perturbations regularise the model, improving generalisation from synthetic to real items (Miyato et al., 2017) \|
	\| LLRD (decay = 0.9) \| Lower layers receive smaller learning rates, preserving pre-trained linguistic knowledge (Sun et al., 2019) \|
	\| Gradual unfreezing \| Head-only training for epoch 1, then full end-to-end fine-tuning, to prevent catastrophic forgetting (Howard & Ruder, 2018) \|
	\| SWA \| Averaging top-3 checkpoints produces a flatter minimum and marginally improves κ (Izmailov et al., 2018) \|
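The asymmetric-loss row can be illustrated with a minimal sketch of the focusing terms from Ridnik et al. (2021), using the γ values in the table. The full formulation also includes a probability-shifting margin for negatives; this card does not say whether it was used, so it is omitted here, and the function name is hypothetical.

```python
import math


def asymmetric_loss(probs, targets, gamma_pos=1.0, gamma_neg=4.0, eps=1e-8):
    """Sum of per-label asymmetric focal terms (no probability margin)."""
    total = 0.0
    for p, y in zip(probs, targets):
        if y == 1:
            # Positive label: down-weighted by (1 - p)^gamma_pos
            total += -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
        else:
            # Negative label: down-weighted by p^gamma_neg, so easy negatives vanish
            total += -(p ** gamma_neg) * math.log(max(1.0 - p, eps))
    return total


easy_negative = asymmetric_loss([0.1], [0])  # confident, correct "no"
hard_negative = asymmetric_loss([0.9], [0])  # confident, wrong "yes"
positive = asymmetric_loss([0.5], [1])
print(f"{easy_negative:.2e} {hard_negative:.3f} {positive:.3f}")
```

With γ_neg = 4, an easy negative contributes several orders of magnitude less than a hard one. Since each item has at most one positive among five sigmoid outputs, this keeps the many easy negatives from dominating the gradient.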


	## Training

	### Training Data

	The model was fine-tuned on the [expectancy_value_pool_v2](https://huggingface.co/datasets/christiqn/expectancy_value_pool_v2) dataset — approximately 15,000 synthetic psychological test items generated with construct-level attribution, filtered to remove participial-phrase fragments. The data was split 85/7.5/7.5 into train/validation/test sets with stratified sampling.
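A stratified 85/7.5/7.5 split can be sketched as follows. The helper, the seed, and the toy data are assumptions for illustration; the actual split procedure and random seed are not documented in this card.

```python
import random
from collections import defaultdict


def stratified_split(items, labels, fractions=(0.85, 0.075, 0.075), seed=0):
    """Split items into train/val/test while preserving each label's proportion."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item, label in zip(items, labels):
        by_label[label].append(item)

    train, val, test = [], [], []
    for label, group in by_label.items():
        rng.shuffle(group)
        n_train = round(fractions[0] * len(group))
        n_val = round(fractions[1] * len(group))
        train += [(x, label) for x in group[:n_train]]
        val += [(x, label) for x in group[n_train:n_train + n_val]]
        test += [(x, label) for x in group[n_train + n_val:]]  # remainder
    return train, val, test


# Toy data: 200 items for each of two hypothetical labels
items = list(range(400))
labels = ["A"] * 200 + ["B"] * 200
train, val, test = stratified_split(items, labels)
print(len(train), len(val), len(test))  # 340 30 30
```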

	### Training Procedure

	- Epochs: 12 (with early stopping, patience = 4, monitored on validation κ)
	- Batch size: 24 per device × 2 gradient accumulation steps = 48 effective
	- Optimizer: AdamW (lr = 3e-5, weight decay = 0.01)
	- Scheduler: Linear with 10% warmup
	- Precision: bf16 mixed-precision
	- Hardware: Single NVIDIA H100 GPU
	- Post-training: Stochastic Weight Averaging of top-3 checkpoints
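Combining the base learning rate above with the LLRD decay of 0.9 listed under Key Design Choices, the per-layer rates can be sketched as below. The sketch assumes a 12-layer encoder and that the top layer receives the full base rate; how the embeddings and the head are treated is not specified here.

```python
def llrd_learning_rates(base_lr=3e-5, decay=0.9, num_layers=12):
    """Top encoder layer gets base_lr; each layer below it is multiplied by decay."""
    return {
        layer: base_lr * decay ** (num_layers - 1 - layer)
        for layer in range(num_layers)
    }


rates = llrd_learning_rates()
for layer in (0, 5, 11):
    print(f"layer {layer:2d}: lr = {rates[layer]:.2e}")
```

Under these assumptions the bottom layer trains at roughly a third of the top layer's rate (0.9^11 ≈ 0.31).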

	### Training Results (Held-Out Synthetic Test Set)

	\| Metric \| Value \|
	\|---\|---\|
	\| Accuracy \| 0.990 \|
	\| Macro F1 \| 0.990 \|
	\| Cohen's κ \| 0.990 \|

	> Note: These near-perfect numbers reflect performance on the synthetic data distribution. Real-world performance on human-authored items is reported below.


	## Evaluation

	### Test Set

	The model was evaluated on N = 1,284 human-coded test items from [christiqn/EVT-items](https://huggingface.co/datasets/christiqn/EVT-items), drawn from published and unpublished psychological instruments spanning 95 distinct scales. Items were independently coded by a human expert and by the model. Items marked as formatting artefacts, third-person phrasing, or multiple anchors were excluded prior to evaluation. All confidence intervals are bias-corrected and accelerated (BCa) bootstrap CIs with B = 10,000.
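For readers unfamiliar with bootstrap intervals, the sketch below uses the simpler percentile bootstrap, not the BCa correction behind the reported intervals, and fewer resamples (2,000) to keep it fast. The correctness vector is reconstructed from the overall accuracy (1,040 of 1,284 items correct); the function name and seed are illustrative.

```python
import random


def percentile_bootstrap_ci(values, statistic, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample with replacement, take empirical quantiles."""
    rng = random.Random(seed)
    n = len(values)
    stats = sorted(statistic(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def mean(xs):
    return sum(xs) / len(xs)


# 1 = model agrees with the human coder, 0 = disagreement
correct = [1] * 1040 + [0] * 244
lo, hi = percentile_bootstrap_ci(correct, mean)
print(f"accuracy = {mean(correct):.3f}, 95% percentile CI = [{lo:.3f}, {hi:.3f}]")
```

BCa additionally corrects the percentile endpoints for bias and skewness, so its limits can differ slightly from the percentile ones.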

	### Agreement with Human Coder (Full 6-Class)

	\| Metric \| Value \| 95% BCa CI \|
	\|---\|---\|---\|
	\| Cohen's κ (unweighted) \| 0.767 \| [0.741, 0.793] \|
	\| Cohen's κ (linear weighted) \| 0.768 \| [0.735, 0.797] \|
	\| Krippendorff's α \| 0.767 \| [0.740, 0.793] \|
	\| PABAK \| 0.772 \| — \|
	\| Overall accuracy \| 0.810 \| [0.787, 0.830] \|
	\| Macro F1 \| 0.812 \| [0.790, 0.834] \|
	\| Weighted F1 \| 0.807 \| [0.785, 0.828] \|

	Cohen's κ = .77 falls in the 0.61–0.80 band that Landis & Koch (1977) label substantial agreement.

	### Per-Class Performance (6-Class)

	\| Class \| Precision \| Recall \| F1 \| κ (OvR) \| n \|
	\|---\|---\|---\|---\|---\|---\|
	\| ATTAINMENT_VALUE \| .800 [.726, .868] \| .865 [.798, .926] \| .831 [.775, .880] \| .815 \| 111 \|
	\| COST \| .795 [.732, .854] \| .859 [.800, .912] \| .826 [.778, .869] \| .802 \| 149 \|
	\| EXPECTANCY \| .843 [.801, .881] \| .886 [.850, .921] \| .864 [.834, .892] \| .820 \| 308 \|
	\| INTRINSIC_VALUE \| .839 [.793, .883] \| .893 [.852, .931] \| .865 [.831, .896] \| .834 \| 234 \|
	\| OTHER \| .726 [.668, .780] \| .636 [.580, .691] \| .678 [.631, .722] \| .595 \| 283 \|
	\| UTILITY_VALUE \| .846 [.793, .897] \| .774 [.716, .831] \| .808 [.764, .849] \| .775 \| 199 \|

	### Confusion Matrix (6-Class)

	\| \| Pred: AV \| Pred: CO \| Pred: EX \| Pred: IV \| Pred: OT \| Pred: UV \|
	\|---\|---\|---\|---\|---\|---\|---\|
	\| True: AV \| 96 \| 1 \| 1 \| 5 \| 4 \| 4 \|
	\| True: CO \| 1 \| 128 \| 1 \| 10 \| 7 \| 2 \|
	\| True: EX \| 0 \| 11 \| 273 \| 5 \| 18 \| 1 \|
	\| True: IV \| 4 \| 4 \| 0 \| 209 \| 15 \| 2 \|
	\| True: OT \| 13 \| 14 \| 40 \| 17 \| 180 \| 19 \|
	\| True: UV \| 6 \| 3 \| 9 \| 3 \| 24 \| 154 \|

	Cramér's V = 0.782 (6-class).
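The headline agreement figures can be recomputed directly from the confusion matrix above. The sketch below derives accuracy, unweighted Cohen's κ, PABAK, and per-class precision and recall from the counts; the variable names are illustrative.

```python
LABELS = ["AV", "CO", "EX", "IV", "OT", "UV"]
# Rows = human code, columns = model prediction (counts from the table above)
CM = [
    [96, 1, 1, 5, 4, 4],
    [1, 128, 1, 10, 7, 2],
    [0, 11, 273, 5, 18, 1],
    [4, 4, 0, 209, 15, 2],
    [13, 14, 40, 17, 180, 19],
    [6, 3, 9, 3, 24, 154],
]

n = sum(map(sum, CM))
k = len(CM)
row = [sum(r) for r in CM]        # human-coded totals per class
col = [sum(c) for c in zip(*CM)]  # model-predicted totals per class

p_o = sum(CM[i][i] for i in range(k)) / n              # observed agreement
p_e = sum(row[i] * col[i] for i in range(k)) / n ** 2  # chance agreement
kappa = (p_o - p_e) / (1 - p_e)
pabak = (k * p_o - 1) / (k - 1)

print(f"accuracy={p_o:.3f} kappa={kappa:.3f} PABAK={pabak:.3f}")
# accuracy=0.810 kappa=0.767 PABAK=0.772
for i, name in enumerate(LABELS):
    print(f"{name}: precision={CM[i][i] / col[i]:.3f} recall={CM[i][i] / row[i]:.3f}")
```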

	### Marginal Homogeneity (Stuart-Maxwell Test)

	The Stuart-Maxwell test for marginal homogeneity is significant (χ²(5) = 18.62, p = .002), indicating a systematic difference between the model's and human coder's label distributions. The model tends to over-predict EXPECTANCY (+16), INTRINSIC_VALUE (+15), COST (+12), and ATTAINMENT_VALUE (+9) relative to the human coder, while under-predicting OTHER (−35) and UTILITY_VALUE (−17). The shift away from OTHER is consistent with the OvR + asymmetric loss design, which prioritises recall on core constructs; UTILITY_VALUE is the one core construct that runs against this pattern.
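The over- and under-prediction counts quoted above are the differences between the column totals (model) and row totals (human) of the confusion matrix, as this short check shows:

```python
LABELS = ["AV", "CO", "EX", "IV", "OT", "UV"]
# Rows = human code, columns = model prediction
CM = [
    [96, 1, 1, 5, 4, 4],
    [1, 128, 1, 10, 7, 2],
    [0, 11, 273, 5, 18, 1],
    [4, 4, 0, 209, 15, 2],
    [13, 14, 40, 17, 180, 19],
    [6, 3, 9, 3, 24, 154],
]

human_totals = [sum(r) for r in CM]
model_totals = [sum(c) for c in zip(*CM)]
shift = {name: m - h for name, h, m in zip(LABELS, human_totals, model_totals)}
print(shift)
# {'AV': 9, 'CO': 12, 'EX': 16, 'IV': 15, 'OT': -35, 'UV': -17}
```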

	### Core Construct Discrimination (5-Class, Excluding OTHER)

	The OTHER category is not a psychological construct — it is a heterogeneous residual class that the model was never explicitly trained to recognise. To separately assess the model's ability to discriminate among the five core EVT constructs, a restricted analysis was conducted on items where both the human coder and the model assigned a core construct (n = 933, 72.7% of all items).

	\| Metric \| Value \| 95% BCa CI \|
	\|---\|---\|---\|
	\| Cohen's κ (5-class) \| 0.899 \| [0.876, 0.920] \|
	\| Krippendorff's α \| 0.899 \| [0.876, 0.921] \|
	\| Overall accuracy \| 0.922 \| [0.901, 0.937] \|
	\| Macro F1 \| 0.915 \| [0.895, 0.933] \|
	\| PABAK \| 0.902 \| — \|

	Cohen's κ = .90 indicates almost perfect agreement on construct discrimination by the Landis & Koch (1977) benchmarks.

	\| Class \| Precision \| Recall \| F1 \| n \|
	\|---\|---\|---\|---\|---\|
	\| ATTAINMENT_VALUE \| .897 [.838, .950] \| .897 [.835, .951] \| .897 [.851, .937] \| 107 \|
	\| COST \| .871 [.812, .922] \| .901 [.849, .948] \| .886 [.843, .922] \| 142 \|
	\| EXPECTANCY \| .961 [.937, .982] \| .941 [.912, .967] \| .951 [.932, .968] \| 290 \|
	\| INTRINSIC_VALUE \| .901 [.861, .938] \| .954 [.924, .980] \| .927 [.901, .950] \| 219 \|
	\| UTILITY_VALUE \| .945 [.907, .977] \| .880 [.830, .926] \| .911 [.877, .941] \| 175 \|

	Additionally, when evaluating all items the human coded as a core construct (n = 1,001) — allowing the model to predict OTHER — the model achieved κ = 0.822 [0.795, 0.848] and accuracy = 0.859 [0.835, 0.878]. Of these 1,001 items, 68 (6.8%) were incorrectly pushed to OTHER by the model.
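The subset sizes and agreement figures in this section follow from the 6-class confusion matrix. The sketch below drops the OTHER row and column to obtain the 5-class analysis and counts the core items the model sent to OTHER; the helper names are illustrative.

```python
LABELS = ["AV", "CO", "EX", "IV", "OT", "UV"]
# Rows = human code, columns = model prediction
CM = [
    [96, 1, 1, 5, 4, 4],
    [1, 128, 1, 10, 7, 2],
    [0, 11, 273, 5, 18, 1],
    [4, 4, 0, 209, 15, 2],
    [13, 14, 40, 17, 180, 19],
    [6, 3, 9, 3, 24, 154],
]


def cohen_kappa(m):
    """Unweighted Cohen's kappa for a square confusion matrix."""
    n = sum(map(sum, m))
    size = len(m)
    p_o = sum(m[i][i] for i in range(size)) / n
    p_e = sum(sum(m[i]) * sum(r[i] for r in m) for i in range(size)) / n ** 2
    return (p_o - p_e) / (1 - p_e)


ot = LABELS.index("OT")
core = [i for i in range(len(LABELS)) if i != ot]
sub = [[CM[i][j] for j in core] for i in core]  # both raters chose a core construct

n_both = sum(map(sum, sub))                   # 933
n_human_core = sum(sum(CM[i]) for i in core)  # 1001
n_pushed = sum(CM[i][ot] for i in core)       # 68 core items predicted as OTHER
n_correct = sum(CM[i][i] for i in core)       # 860

print(f"5-class: n={n_both} kappa={cohen_kappa(sub):.3f} acc={n_correct / n_both:.3f}")
print(f"human=core: n={n_human_core} acc={n_correct / n_human_core:.3f} to OTHER={n_pushed}")
```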

	### Summary Across Evaluation Scenarios

	\| Scenario \| κ \| Macro F1 \| N \|
	\|---\|---\|---\|---\|
	\| Full 6-class (with OTHER) \| 0.767 \| 0.812 \| 1,284 \|
	\| 5-class: both raters assigned core construct \| 0.899 \| 0.915 \| 933 \|
	\| Human = core, model unrestricted \| 0.822 \| — \| 1,001 \|

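	The jump from κ = .77 to κ = .90 indicates that the model's primary weakness lies at the EVT/non-EVT detection boundary (the OTHER category) rather than in discriminating among the five core constructs. Because the 5-class figure excludes the 68 core items the model assigned to OTHER, κ = .82 is the more conservative estimate of core-construct agreement.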

	### Base Model Comparison

	To quantify the effect of domain-specific fine-tuning, the classifier was compared against the un-fine-tuned base model (`all-mpnet-base-v2`). The base model achieved κ = −0.054 [−0.073, −0.033] vs. κ = 0.767 [0.741, 0.793] for the fine-tuned model (Δκ = +0.821 [0.787, 0.854]). A continuity-corrected McNemar test indicated a highly significant difference in error rates, χ²(1) = 867.91, p < .001 (fine-tuned correct only: 911 items; base correct only: 14 items). The fine-tuned model outperforms the base model on all six classes.

	\| Class \| Base F1 \| Fine-Tuned F1 \| Δ \|
	\|---\|---\|---\|---\|
	\| ATTAINMENT_VALUE \| 0.051 \| 0.831 \| +0.780 \|
	\| COST \| 0.090 \| 0.826 \| +0.736 \|
	\| EXPECTANCY \| 0.169 \| 0.864 \| +0.695 \|
	\| INTRINSIC_VALUE \| 0.200 \| 0.865 \| +0.666 \|
	\| OTHER \| 0.007 \| 0.678 \| +0.671 \|
	\| UTILITY_VALUE \| 0.010 \| 0.808 \| +0.799 \|
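The McNemar statistic reported above can be reproduced from the two discordant counts (911 and 14). The sketch applies the standard continuity correction, which matches the reported value, and obtains the p-value from the chi-square distribution with one degree of freedom; the function name is illustrative.

```python
import math


def mcnemar_chi2(b, c, continuity=True):
    """McNemar chi-square from discordant pair counts b and c."""
    diff = abs(b - c) - (1 if continuity else 0)
    return diff ** 2 / (b + c)


chi2 = mcnemar_chi2(911, 14)
# Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(chi2 / 2))
print(f"chi2(1) = {chi2:.2f}, p < .001: {p_value < 0.001}")
# chi2(1) = 867.91, p < .001: True
```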

	### Known Limitations and Biases

	OTHER is the weakest category (κ = .60, F1 = .68). Because OTHER is defined as "no construct activated above threshold" rather than a learned class, the model tends to find some EVT signal in ambiguous items that a human would code as non-EVT. The most frequent error type is predicting EXPECTANCY when the true label is OTHER (40 cases, 16.4% of all errors). When OTHER is excluded, the model reaches κ = .90 — almost perfect agreement — on the five core constructs.

	Systematic label distribution differences. The Stuart-Maxwell test is significant (χ²(5) = 18.62, p = .002). The model over-predicts EXPECTANCY (+16), INTRINSIC_VALUE (+15), COST (+12), and ATTAINMENT_VALUE (+9), and under-predicts OTHER (−35) and UTILITY_VALUE (−17).

	UTILITY_VALUE under-recall. The model is conservative on utility value (recall = .774 vs. precision = .846), suggesting that items referencing indirect or long-term benefits may be harder for the model to detect.

	Trained on synthetic data. The training set consists of synthetically generated items, not real published instruments. While the model generalises well (κ = .77 overall, κ = .90 on core constructs), there may be stylistic patterns in human-authored items from specific subfields or cultures that the model has not encountered.

	English only. The model has not been evaluated on items in other languages. Cross-lingual transfer should not be assumed.

	## References

	- Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.
	- Eccles, J. S., et al. (1983). Expectancies, values, and academic behaviors. In J. T. Spence (Ed.), Achievement and achievement motives (pp. 75–146). W. H. Freeman.
	- Eccles, J. S., & Wigfield, A. (2002). Motivational beliefs, values, and goals. Annual Review of Psychology, 53(1), 109–132.
	- Howard, J., & Ruder, S. (2018). Universal language model fine-tuning for text classification. Proceedings of ACL, 328–339.
	- Izmailov, P., et al. (2018). Averaging weights leads to wider optima and better generalization. Proceedings of UAI, 876–885.
	- Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174.
	- Miyato, T., et al. (2017). Adversarial training methods for semi-supervised text classification. Proceedings of ICLR.
	- Ridnik, T., et al. (2021). Asymmetric loss for multi-label classification. Proceedings of ICCV, 82–91.
	- Sun, C., et al. (2019). How to fine-tune BERT for text classification. Proceedings of CCL, 194–206.
	- Wang, H., et al. (2018). CosFace: Large margin cosine loss for deep face recognition. Proceedings of CVPR, 5265–5274.