---
language:
- en
license: cc-by-4.0
library_name: transformers
tags:
- text-classification
- pytorch
- safetensors
- evt_classifier
- feature-extraction
- psychology
- expectancy-value-theory
- psychometrics
- test-item-classification
- educational-psychology
- motivation
- survey-instruments
- content-analysis
- custom_code
pipeline_tag: text-classification
datasets:
- christiqn/expectancy_value_pool
- christiqn/EVT-items
metrics:
- f1
- accuracy
model-index:
- name: mpnet-EVT-classifier
  results:
  - task:
      type: text-classification
    dataset:
      name: EVT-items (Human-coded psychological test items)
      type: christiqn/EVT-items
    metrics:
    - type: cohen_kappa
      value: 0.7674
      name: Cohen's Kappa (6-class, N=1284)
    - type: f1
      value: 0.8121
      name: Macro F1 (6-class)
    - type: accuracy
      value: 0.8100
      name: Accuracy (6-class)
    - type: krippendorff_alpha
      value: 0.7674
      name: Krippendorff's Alpha
---

# EVT Item Classifier: Expectancy-Value Theory Construct Tagger

A fine-tuned text classifier that assigns psychological test items (e.g., survey questions, scale items) to one of six categories derived from **Expectancy-Value Theory** (Eccles et al., 1983; Eccles & Wigfield, 2002):

| Label | EVT Construct | Example item |
|---|---|---|
| `ATTAINMENT_VALUE` | Personal importance of doing well | *"Doing well in math is important to who I am."* |
| `COST` | Perceived negative consequences of engagement | *"I have to give up too much to succeed in this class."* |
| `EXPECTANCY` | Beliefs about future success | *"I am confident I can master the skills taught in this course."* |
| `INTRINSIC_VALUE` | Enjoyment and interest | *"I find the content of this course very interesting."* |
| `UTILITY_VALUE` | Usefulness for future goals | *"What I learn in this class will be useful for my career."* |
| `OTHER` | Not classifiable as an EVT construct | *"I usually sit in the front row."* |

## Intended Use

This model is intended for **academic research** in educational psychology, motivation science, and
psychometrics. Typical use cases include:

- **Automated content analysis** of existing item pools and questionnaire banks for EVT construct coverage
- **Scale development assistance**: screening candidate items during the item-writing phase
- **Systematic reviews**: coding large corpora of test items from published instruments
- **Construct validation**: checking whether items align with their intended EVT construct

### Out-of-Scope Uses

- **Clinical or diagnostic decision-making**: This model classifies test *items*, not respondents. It should not be used to assess individuals.
- **Replacement for human coding**: The model achieves substantial but not perfect agreement with human coders (κ = .77). It is intended as a tool to assist researchers, not to replace expert judgment.
- **Non-English items**: The model was trained and evaluated on English-language items only.
- **Non-EVT constructs**: The model will force any input into one of the six categories. Items from unrelated theoretical frameworks (e.g., Big Five personality, cognitive load) will be classified as `OTHER` at best, or spuriously assigned to an EVT category.

## How to Use

### Quick Start

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load model
tokenizer = AutoTokenizer.from_pretrained("christiqn/mpnet-EVT-classifier")
model = AutoModel.from_pretrained("christiqn/mpnet-EVT-classifier", trust_remote_code=True)
model.eval()

# Inference
text = "I expect to do well in this course."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.sigmoid(outputs.logits)

label = model.config.id2label[probs[0].argmax().item()]
print(f"Predicted: {label}")
```

### Adjusting the Decision Threshold

The default threshold of 0.50 balances precision and recall.
You can adjust it:

- **Lower threshold (e.g., 0.35):** More inclusive. Fewer items are classified as OTHER, giving higher recall on EVT constructs but more false positives.
- **Higher threshold (e.g., 0.65):** More conservative. More items fall to OTHER, giving higher precision on EVT constructs but more false negatives.

Choose based on your application: use a lower threshold for exploratory screening, a higher one when precision matters.

## Model Description

### Architecture

The model uses a custom classification head on top of `sentence-transformers/all-mpnet-base-v2` (110M parameters):

```
MPNet encoder
    ↓
ConcatPooling (mean + max over token embeddings → 2 × hidden_size)
    ↓
Dropout(0.2) → Dense(2h → h) → GELU → Dropout(0.2)
    ↓
NormLinear head (cosine similarity classifier, τ = 20)
    ↓
5 independent sigmoid outputs (one per EVT construct)
```

The `OTHER` class is not explicitly learned. Instead, an item is classified as `OTHER` when no construct's sigmoid probability exceeds the decision threshold (default: 0.50). This **One-vs-Rest (OvR)** formulation avoids forcing the model to learn a coherent representation for the heterogeneous `OTHER` category.
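
The thresholded One-vs-Rest decision described above can be sketched in plain Python. This is a minimal illustration, not the model's bundled inference code: the function name `decide` and the example probabilities are hypothetical, and resolving several constructs above the threshold by taking the highest probability is an assumption (it mirrors the `argmax` in the Quick Start).

```python
from typing import Dict


def decide(probs: Dict[str, float], threshold: float = 0.50) -> str:
    """Return the top-scoring construct if it clears the threshold, else OTHER.

    `probs` maps each of the five EVT constructs to its sigmoid probability.
    Keeping only the single highest-scoring construct is an assumption: the
    outputs are independent, so several can exceed the threshold at once.
    """
    best_label = max(probs, key=probs.get)
    return best_label if probs[best_label] > threshold else "OTHER"


# Hypothetical probabilities for one ambiguous item (illustrative values only).
example = {
    "ATTAINMENT_VALUE": 0.08,
    "COST": 0.03,
    "EXPECTANCY": 0.42,
    "INTRINSIC_VALUE": 0.11,
    "UTILITY_VALUE": 0.05,
}

print(decide(example))                  # default threshold of 0.50
print(decide(example, threshold=0.35))  # more inclusive threshold
```

With these values no construct clears 0.50, so the item falls to `OTHER`; at a threshold of 0.35 the same item is labelled `EXPECTANCY`.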

### Key Design Choices

| Component | Rationale |
|---|---|
| **ConcatPooling** | Concatenating mean and max pooling captures both distributional and salient features from short text (Howard & Ruder, 2018) |
| **NormLinear (cosine) head** | L2-normalised features and weights make classification direction-based, improving robustness to domain shift (Wang et al., 2018) |
| **Asymmetric Loss** (γ_neg=4, γ_pos=1) | Down-weights easy negatives in the multi-label sigmoid setup, preventing the dominant negative class from overwhelming the loss (Ridnik et al., 2021) |
| **FGM adversarial training** | Embedding-space perturbations regularise the model, improving generalisation from synthetic to real items (Miyato et al., 2017) |
| **LLRD** (decay = 0.9) | Lower layers receive smaller learning rates, preserving pre-trained linguistic knowledge (Sun et al., 2019) |
| **Gradual unfreezing** | Head-only training for epoch 1, then full end-to-end fine-tuning, to prevent catastrophic forgetting (Howard & Ruder, 2018) |
| **SWA** | Averaging top-3 checkpoints produces a flatter minimum and marginally improves κ (Izmailov et al., 2018) |

## Training

### Training Data

The model was fine-tuned on the [expectancy_value_pool_v2](https://huggingface.co/datasets/christiqn/expectancy_value_pool_v2) dataset, which consists of approximately 15,000 synthetic psychological test items generated with construct-level attribution and filtered to remove participial-phrase fragments. The data was split 85/7.5/7.5 into train/validation/test sets with stratified sampling.
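
To illustrate the asymmetric loss named among the design choices, the sketch below evaluates a per-label loss in the spirit of Ridnik et al. (2021) in plain Python, using γ_neg = 4 and γ_pos = 1. It is an illustration under assumptions, not the training code: the function names are hypothetical, and the probability-margin (clipping) term of the published formulation is omitted because this card does not state a value for it.

```python
import math


def asymmetric_loss(p: float, y: int, gamma_pos: float = 1.0,
                    gamma_neg: float = 4.0, eps: float = 1e-8) -> float:
    """Per-label asymmetric focal loss for sigmoid probability p and target y.

    Positive target: (1 - p)^gamma_pos * -log(p)
    Negative target: p^gamma_neg * -log(1 - p)
    The margin/clipping term from the paper is left out (assumption).
    """
    if y == 1:
        return -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
    return -(p ** gamma_neg) * math.log(max(1.0 - p, eps))


def bce(p: float, y: int, eps: float = 1e-8) -> float:
    """Plain binary cross-entropy, shown for comparison."""
    return -math.log(max(p if y == 1 else 1.0 - p, eps))


# An easy negative (low p, y = 0) versus a hard negative (high p, y = 0).
for p in (0.05, 0.90):
    print(f"p={p:.2f}  BCE={bce(p, 0):.6f}  ASL={asymmetric_loss(p, 0):.6f}")
```

The easy negative contributes almost nothing under the asymmetric loss, while the hard negative retains a large share of its cross-entropy weight. Since each item is assigned a single category, most of the five sigmoid targets are negatives, which is the imbalance this weighting is meant to counter.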

### Training Procedure

- **Epochs:** 12 (with early stopping, patience = 4, monitored on validation κ)
- **Batch size:** 24 per device × 2 gradient accumulation steps = 48 effective
- **Optimizer:** AdamW (lr = 3e-5, weight decay = 0.01)
- **Scheduler:** Linear with 10% warmup
- **Precision:** bf16 mixed-precision
- **Hardware:** Single NVIDIA H100 GPU
- **Post-training:** Stochastic Weight Averaging of top-3 checkpoints

### Training Results (Held-Out Synthetic Test Set)

| Metric | Value |
|---|---|
| Accuracy | 0.990 |
| Macro F1 | 0.990 |
| Cohen's κ | 0.990 |

> **Note:** These near-perfect numbers reflect performance on the synthetic data distribution. Real-world performance on human-authored items is reported below.

## Evaluation

### Test Set

The model was evaluated on **N = 1,284 human-coded test items** from [christiqn/EVT-items](https://huggingface.co/datasets/christiqn/EVT-items), drawn from published and unpublished psychological instruments spanning 95 distinct scales. Items were independently coded by a human expert and by the model. Items marked as formatting artefacts, third-person phrasing, or multiple anchors were excluded prior to evaluation.

All confidence intervals are bias-corrected and accelerated (BCa) bootstrap CIs with B = 10,000.

### Agreement with Human Coder (Full 6-Class)

| Metric | Value | 95% BCa CI |
|---|---|---|
| **Cohen's κ (unweighted)** | **0.767** | [0.741, 0.793] |
| Cohen's κ (linear weighted) | 0.768 | [0.735, 0.797] |
| Krippendorff's α | 0.767 | [0.740, 0.793] |
| PABAK | 0.772 | not reported |
| Overall accuracy | 0.810 | [0.787, 0.830] |
| Macro F1 | 0.812 | [0.790, 0.834] |
| Weighted F1 | 0.807 | [0.785, 0.828] |

Cohen's κ = .77 indicates **substantial agreement** according to Landis & Koch (1977).
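
The headline agreement figures in the table above can be re-derived from the 6-class confusion matrix reported in the next subsection. The sketch below does so in plain Python. The variable and function names are illustrative, and PABAK is computed with the multi-category form (k · p_o − 1) / (k − 1), which is an assumption about how the reported value was obtained.

```python
# Rows: human label, columns: model prediction (order AV, CO, EX, IV, OT, UV).
CONFUSION = [
    [96, 1, 1, 5, 4, 4],
    [1, 128, 1, 10, 7, 2],
    [0, 11, 273, 5, 18, 1],
    [4, 4, 0, 209, 15, 2],
    [13, 14, 40, 17, 180, 19],
    [6, 3, 9, 3, 24, 154],
]


def agreement_stats(matrix):
    """Return (accuracy, Cohen's kappa, PABAK) for a square confusion matrix."""
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    # Observed agreement: share of items on the diagonal.
    p_o = sum(matrix[i][i] for i in range(k)) / n
    # Chance agreement: product of the two raters' marginal proportions.
    p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / n**2
    kappa = (p_o - p_e) / (1 - p_e)
    pabak = (k * p_o - 1) / (k - 1)
    return p_o, kappa, pabak


accuracy, kappa, pabak = agreement_stats(CONFUSION)
print(f"accuracy={accuracy:.3f}  kappa={kappa:.3f}  PABAK={pabak:.3f}")
```

Rounded to three decimals, this reproduces the reported accuracy (0.810), unweighted κ (0.767), and PABAK (0.772).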

### Per-Class Performance (6-Class)

| Class | Precision | Recall | F1 | κ (OvR) | n |
|---|---|---|---|---|---|
| ATTAINMENT_VALUE | .800 [.726, .868] | .865 [.798, .926] | .831 [.775, .880] | .815 | 111 |
| COST | .795 [.732, .854] | .859 [.800, .912] | .826 [.778, .869] | .802 | 149 |
| EXPECTANCY | .843 [.801, .881] | .886 [.850, .921] | .864 [.834, .892] | .820 | 308 |
| INTRINSIC_VALUE | .839 [.793, .883] | .893 [.852, .931] | .865 [.831, .896] | .834 | 234 |
| OTHER | .726 [.668, .780] | .636 [.580, .691] | .678 [.631, .722] | .595 | 283 |
| UTILITY_VALUE | .846 [.793, .897] | .774 [.716, .831] | .808 [.764, .849] | .775 | 199 |

### Confusion Matrix (6-Class)

| | Pred: AV | Pred: CO | Pred: EX | Pred: IV | Pred: OT | Pred: UV |
|---|---|---|---|---|---|---|
| **True: AV** | **96** | 1 | 1 | 5 | 4 | 4 |
| **True: CO** | 1 | **128** | 1 | 10 | 7 | 2 |
| **True: EX** | 0 | 11 | **273** | 5 | 18 | 1 |
| **True: IV** | 4 | 4 | 0 | **209** | 15 | 2 |
| **True: OT** | 13 | 14 | 40 | 17 | **180** | 19 |
| **True: UV** | 6 | 3 | 9 | 3 | 24 | **154** |

Cramér's V = 0.782 (6-class).

### Marginal Homogeneity (Stuart-Maxwell Test)

The Stuart-Maxwell test for marginal homogeneity is significant (χ²(5) = 18.62, p = .002), indicating a systematic difference between the model's and human coder's label distributions. The model tends to over-predict EXPECTANCY (+16), INTRINSIC_VALUE (+15), COST (+12), and ATTAINMENT_VALUE (+9) relative to the human coder, while under-predicting OTHER (−35) and UTILITY_VALUE (−17). This is a direct consequence of the OvR + asymmetric loss design, which prioritises recall on core constructs.

### Core Construct Discrimination (5-Class, Excluding OTHER)

The OTHER category is not a psychological construct; it is a heterogeneous residual class that the model was never explicitly trained to recognise.
To separately assess the model's ability to *discriminate among the five core EVT constructs*, a restricted analysis was conducted on items where both the human coder and the model assigned a core construct (n = 933, 72.7% of all items).

| Metric | Value | 95% BCa CI |
|---|---|---|
| **Cohen's κ (5-class)** | **0.899** | [0.876, 0.920] |
| Krippendorff's α | 0.899 | [0.876, 0.921] |
| Overall accuracy | 0.922 | [0.901, 0.937] |
| Macro F1 | 0.915 | [0.895, 0.933] |
| PABAK | 0.902 | not reported |

Cohen's κ = .90 indicates **almost perfect agreement** on construct discrimination.

| Class | Precision | Recall | F1 | n |
|---|---|---|---|---|
| ATTAINMENT_VALUE | .897 [.838, .950] | .897 [.835, .951] | .897 [.851, .937] | 107 |
| COST | .871 [.812, .922] | .901 [.849, .948] | .886 [.843, .922] | 142 |
| EXPECTANCY | .961 [.937, .982] | .941 [.912, .967] | .951 [.932, .968] | 290 |
| INTRINSIC_VALUE | .901 [.861, .938] | .954 [.924, .980] | .927 [.901, .950] | 219 |
| UTILITY_VALUE | .945 [.907, .977] | .880 [.830, .926] | .911 [.877, .941] | 175 |

Additionally, when evaluating all items the human coded as a core construct (n = 1,001), with the model still allowed to predict OTHER, the model achieved κ = 0.822 [0.795, 0.848] and accuracy = 0.859 [0.835, 0.878]. Of these 1,001 items, 68 (6.8%) were incorrectly pushed to OTHER by the model.

### Summary Across Evaluation Scenarios

| Scenario | κ | Macro F1 | N |
|---|---|---|---|
| Full 6-class (with OTHER) | 0.767 | 0.812 | 1,284 |
| 5-class: both raters assigned core construct | **0.899** | **0.915** | 933 |
| Human = core, model unrestricted | 0.822 | not reported | 1,001 |

The jump from κ = .77 to κ = .90 confirms that the model's primary weakness lies at the **EVT/non-EVT detection boundary** (the OTHER category), not in discriminating among the five core constructs.

### Base Model Comparison

To quantify the effect of domain-specific fine-tuning, the classifier was compared against the un-fine-tuned base model (`all-mpnet-base-v2`).
The base model achieved κ = −0.054 [−0.073, −0.033] vs. κ = 0.767 [0.741, 0.793] for the fine-tuned model (Δκ = +0.821 [0.787, 0.854]). A McNemar test indicated a highly significant difference in error rates, χ²(1) = 867.91, p < .001 (fine-tuned correct only: 911 items; base correct only: 14 items). The fine-tuned model outperforms the base model on all six classes.

| Class | Base F1 | Fine-Tuned F1 | Δ |
|---|---|---|---|
| ATTAINMENT_VALUE | 0.051 | 0.831 | +0.780 |
| COST | 0.090 | 0.826 | +0.736 |
| EXPECTANCY | 0.169 | 0.864 | +0.695 |
| INTRINSIC_VALUE | 0.200 | 0.865 | +0.666 |
| OTHER | 0.007 | 0.678 | +0.671 |
| UTILITY_VALUE | 0.010 | 0.808 | +0.799 |

### Known Limitations and Biases

**OTHER is the weakest category** (κ = .60, F1 = .68). Because OTHER is defined as "no construct activated above threshold" rather than a learned class, the model tends to find *some* EVT signal in ambiguous items that a human would code as non-EVT. The most frequent error type is predicting EXPECTANCY when the true label is OTHER (40 cases, 16.4% of all errors). When OTHER is excluded, the model reaches κ = .90 (almost perfect agreement) on the five core constructs.

**Systematic label distribution differences.** The Stuart-Maxwell test is significant (χ²(5) = 18.62, p = .002). The model over-predicts EXPECTANCY (+16), INTRINSIC_VALUE (+15), COST (+12), and ATTAINMENT_VALUE (+9), and under-predicts OTHER (−35) and UTILITY_VALUE (−17).

**UTILITY_VALUE under-recall.** The model is conservative on utility value (recall = .774 vs. precision = .846), suggesting that items referencing indirect or long-term benefits may be harder for the model to detect.

**Trained on synthetic data.** The training set consists of synthetically generated items, not real published instruments. While the model generalises well (κ = .77 overall, κ = .90 on core constructs), there may be stylistic patterns in human-authored items from specific subfields or cultures that the model has not encountered.

**English only.** The model has not been evaluated on items in other languages. Cross-lingual transfer should not be assumed.

## References

- Cohen, J. (1960). A coefficient of agreement for nominal scales. *Educational and Psychological Measurement, 20*(1), 37–46.
- Eccles, J. S., et al. (1983). Expectancies, values, and academic behaviors. In J. T. Spence (Ed.), *Achievement and achievement motives* (pp. 75–146). W. H. Freeman.
- Eccles, J. S., & Wigfield, A. (2002). Motivational beliefs, values, and goals. *Annual Review of Psychology, 53*(1), 109–132.
- Howard, J., & Ruder, S. (2018). Universal language model fine-tuning for text classification. *Proceedings of ACL*, 328–339.
- Izmailov, P., et al. (2018). Averaging weights leads to wider optima and better generalization. *Proceedings of UAI*, 876–885.
- Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. *Biometrics, 33*(1), 159–174.
- Miyato, T., et al. (2017). Adversarial training methods for semi-supervised text classification. *Proceedings of ICLR*.
- Ridnik, T., et al. (2021). Asymmetric loss for multi-label classification. *Proceedings of ICCV*, 82–91.
- Sun, C., et al. (2019). How to fine-tune BERT for text classification. *Proceedings of CCL*, 194–206.
- Wang, H., et al. (2018). CosFace: Large margin cosine loss for deep face recognition. *Proceedings of CVPR*, 5265–5274.