Update README.md

Updated statistics after rerunning the analysis on the final dataset (fixed issues with duplicate labels and with items that had a label but should have been excluded, or vice versa).
---
license: cc-by-4.0
library_name: transformers
tags:
- text-classification
- pytorch
- safetensors
- evt_classifier
- feature-extraction
- psychology
- expectancy-value-theory
- psychometrics
- motivation
- survey-instruments
- content-analysis
- custom_code
pipeline_tag: text-classification
datasets:
- christiqn/expectancy_value_pool
- christiqn/EVT-items
metrics:
- f1
- accuracy
model-index:
- name: mpnet-EVT-classifier
  results:
  - task:
      type: text-classification
    dataset:
      name: EVT-items (Human-coded psychological test items)
      type: christiqn/EVT-items
    metrics:
    - type: cohen_kappa
      value: 0.7674
      name: Cohen's Kappa (6-class, N=1284)
    - type: f1
      value: 0.8121
      name: Macro F1 (6-class)
    - type: accuracy
      value: 0.8100
      name: Accuracy (6-class)
    - type: krippendorff_alpha
      value: 0.7674
      name: Krippendorff's Alpha
---

# EVT Item Classifier — Expectancy-Value Theory Construct Tagger

A fine-tuned text classifier that assigns psychological test items (e.g., survey questions, scale items) to one of six categories derived from **Expectancy-Value Theory** (Eccles et al., 1983; Eccles & Wigfield, 2002):

| Label | EVT Construct | Example item |
|---|---|---|
| `ATTAINMENT_VALUE` | Personal importance of doing well | *"Doing well in math is important to who I am."* |
| `COST` | Perceived negative consequences of engagement | *"I have to give up too much to succeed in this class."* |
| `EXPECTANCY` | Beliefs about future success | *"I am confident I can master the skills taught in this course."* |
| `INTRINSIC_VALUE` | Enjoyment and interest | *"I find the content of this course very interesting."* |
| `UTILITY_VALUE` | Usefulness for future goals | *"What I learn in this class will be useful for my career."* |
| `OTHER` | Not classifiable as an EVT construct | *"I usually sit in the front row."* |

## Intended Use

This model is intended for **academic research** in educational psychology, motivation science, and psychometrics. Typical use cases include:

- **Automated content analysis** of existing item pools and questionnaire banks for EVT construct coverage
- **Scale development assistance** — screening candidate items during the item-writing phase
- **Systematic reviews** — coding large corpora of test items from published instruments
- **Construct validation** — checking whether items align with their intended EVT construct

### Out-of-Scope Uses

- **Clinical or diagnostic decision-making** — This model classifies test *items*, not respondents. It should not be used to assess individuals.
- **Replacement for human coding** — The model achieves substantial but not perfect agreement with human coders (κ = .77). It is intended as a tool to assist researchers, not to replace expert judgment.
- **Non-English items** — The model was trained and evaluated on English-language items only.
- **Non-EVT constructs** — The model will force any input into one of the six categories. Items from unrelated theoretical frameworks (e.g., Big Five personality, cognitive load) will be classified as `OTHER` at best, or spuriously assigned to an EVT category.

## How to Use

### Quick Start

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load model
tokenizer = AutoTokenizer.from_pretrained("christiqn/mpnet-EVT-classifier")
model = AutoModel.from_pretrained("christiqn/mpnet-EVT-classifier", trust_remote_code=True)
model.eval()

# Inference
text = "I expect to do well in this course."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.sigmoid(outputs.logits)

label = model.config.id2label[probs[0].argmax().item()]
print(f"Predicted: {label}")
```

### Adjusting the Decision Threshold

The default threshold of 0.50 balances precision and recall. You can adjust it:

- **Lower threshold (e.g., 0.35):** More inclusive — fewer items classified as OTHER, higher recall on EVT constructs, but more false positives.
- **Higher threshold (e.g., 0.65):** More conservative — more items fall to OTHER, higher precision on EVT constructs, but more false negatives.

Choose based on your application: use a lower threshold for exploratory screening, a higher one when precision matters.

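In code, the thresholded decision rule can be sketched as follows. This is a minimal illustration, not part of the shipped model code: the helper `classify_with_threshold` and the example probabilities are hypothetical.

```python
import torch

def classify_with_threshold(probs: torch.Tensor, id2label: dict, threshold: float = 0.50) -> str:
    """One-vs-Rest decision rule: return the highest-probability construct
    if it clears the threshold, otherwise fall back to OTHER."""
    top_prob, top_idx = probs.max(dim=-1)
    if top_prob.item() < threshold:
        return "OTHER"
    return id2label[top_idx.item()]

# Hypothetical sigmoid outputs for the five core constructs
id2label = {0: "ATTAINMENT_VALUE", 1: "COST", 2: "EXPECTANCY",
            3: "INTRINSIC_VALUE", 4: "UTILITY_VALUE"}
probs = torch.tensor([0.10, 0.05, 0.42, 0.08, 0.12])

print(classify_with_threshold(probs, id2label, threshold=0.50))  # 0.42 < 0.50, so OTHER
print(classify_with_threshold(probs, id2label, threshold=0.35))  # 0.42 clears 0.35: EXPECTANCY
```

The same item flips from `OTHER` to `EXPECTANCY` as the threshold drops, which is exactly the recall/precision trade-off described above.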
## Model Description

### Architecture

The model uses a custom classification head on top of `sentence-transformers/all-mpnet-base-v2` (110M parameters):

```
MPNet encoder
      ↓
ConcatPooling (mean + max)
      ↓
NormLinear head (cosine similarity classifier, τ = 20)
      ↓
5 independent sigmoid outputs (one per EVT construct)
```

The `OTHER` class is not explicitly learned. Instead, an item is classified as `OTHER` when no construct's sigmoid probability exceeds the decision threshold (default: 0.50). This **One-vs-Rest (OvR)** formulation avoids forcing the model to learn a coherent representation for the heterogeneous `OTHER` category.

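For illustration, the two custom head components (ConcatPooling and the cosine NormLinear head, as named in the Key Design Choices table) could look roughly like this. This is a sketch under assumptions; the actual implementation ships with the checkpoint via `trust_remote_code`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConcatPooling(nn.Module):
    """Concatenate mask-aware mean- and max-pooled token embeddings."""
    def forward(self, hidden, mask):
        m = mask.unsqueeze(-1).float()                          # (B, T, 1)
        mean = (hidden * m).sum(1) / m.sum(1).clamp(min=1e-9)   # (B, H)
        maxp = hidden.masked_fill(m == 0, -1e9).max(1).values   # (B, H)
        return torch.cat([mean, maxp], dim=-1)                  # (B, 2H)

class NormLinear(nn.Module):
    """Cosine-similarity classifier: L2-normalised features and weights,
    scaled by temperature tau so the sigmoids can saturate."""
    def __init__(self, in_dim, n_classes, tau=20.0):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(n_classes, in_dim))
        nn.init.xavier_uniform_(self.weight)
        self.tau = tau
    def forward(self, x):
        return self.tau * F.linear(F.normalize(x, dim=-1),
                                   F.normalize(self.weight, dim=-1))

# 5 logits, one per core EVT construct; OTHER is the thresholded fallback
pool, head = ConcatPooling(), NormLinear(2 * 768, 5)
hidden = torch.randn(2, 12, 768)              # stand-in for encoder output
mask = torch.ones(2, 12, dtype=torch.long)    # attention mask
logits = head(pool(hidden, mask))
print(logits.shape)  # torch.Size([2, 5])
```

Because the logits are bounded cosine similarities scaled by τ = 20, the sigmoid outputs can reach confident values without unbounded weight growth.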
### Key Design Choices

| Component | Rationale |
|---|---|
| **ConcatPooling** | Concatenating mean and max pooling captures both distributional and salient features from short text (Howard & Ruder, 2018) |
| **NormLinear (cosine) head** | L2-normalised features and weights make classification direction-based, improving robustness to domain shift (Wang et al., 2018) |
| **Asymmetric Loss** (γ_neg = 4, γ_pos = 1) | Down-weights easy negatives in the multi-label sigmoid setup, preventing the dominant negative class from overwhelming the loss (Ridnik et al., 2021) |
| **FGM adversarial training** | Embedding-space perturbations regularise the model, improving generalisation from synthetic to real items (Miyato et al., 2017) |
| **LLRD** (decay = 0.9) | Lower layers receive smaller learning rates, preserving pre-trained linguistic knowledge (Sun et al., 2019) |
| **Gradual unfreezing** | Head-only training for epoch 1, then full end-to-end fine-tuning, to prevent catastrophic forgetting (Howard & Ruder, 2018) |
| **SWA** | Averaging top-3 checkpoints produces a flatter minimum and marginally improves κ (Izmailov et al., 2018) |

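The asymmetric loss can be sketched as below. This is a simplified illustration of the Ridnik et al. (2021) formulation; the published version also shifts and clips the negative probabilities, which is omitted here for brevity.

```python
import torch

def asymmetric_loss(logits, targets, gamma_neg=4.0, gamma_pos=1.0):
    """Focal-style multi-label loss: negatives are down-weighted much more
    aggressively (gamma_neg > gamma_pos), so easy negatives contribute little.
    The probability-shifting clip from the original paper is omitted."""
    p = torch.sigmoid(logits)
    pos = targets * (1 - p) ** gamma_pos * torch.log(p.clamp(min=1e-8))
    neg = (1 - targets) * p ** gamma_neg * torch.log((1 - p).clamp(min=1e-8))
    return -(pos + neg).mean()

targets = torch.tensor([[1.0, 0.0, 0.0, 0.0, 0.0]])
confident = asymmetric_loss(torch.tensor([[8.0, -8.0, -8.0, -8.0, -8.0]]), targets)
uncertain = asymmetric_loss(torch.zeros(1, 5), targets)
print(confident.item() < uncertain.item())  # confident correct predictions cost less
```

With γ_neg = 4, a negative at p = 0.5 is weighted by 0.5⁴ ≈ 0.06, so the four inactive constructs per item cannot dominate the gradient.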
## Training

### Training Data

The model was fine-tuned on the [expectancy_value_pool_v2](https://huggingface.co/datasets/christiqn/expectancy_value_pool_v2) dataset — approximately 15,000 synthetic psychological test items generated with construct-level attribution, filtered to remove participial-phrase fragments. The data was split 85/7.5/7.5 into train/validation/test sets with stratified sampling.

### Training Procedure

- **Epochs:** 12 (with early stopping, patience = 4, monitored on validation κ)
- **Batch size:** 24 per device × 2 gradient accumulation steps = 48 effective
- **Optimizer:** AdamW (lr = 3e-5, weight decay = 0.01)
- **Scheduler:** Linear with 10% warmup
- **Precision:** bf16 mixed-precision
- **Hardware:** Single NVIDIA H100 GPU
- **Post-training:** Stochastic Weight Averaging of top-3 checkpoints

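The LLRD setting (decay = 0.9 on the base lr of 3e-5) amounts to building per-layer optimizer parameter groups. A toy sketch, where the 12-layer list of linear modules stands in for the MPNet encoder stack and is not the actual module layout:

```python
import torch
import torch.nn as nn

def llrd_groups(layers, head, base_lr=3e-5, decay=0.9):
    """Layer-wise learning-rate decay: the head trains at base_lr, and each
    encoder layer below it gets its lr multiplied by `decay` per step down."""
    groups = [{"params": head.parameters(), "lr": base_lr}]
    for depth, layer in enumerate(reversed(layers), start=1):
        groups.append({"params": layer.parameters(), "lr": base_lr * decay ** depth})
    return groups

# Toy 12-layer stand-in encoder + classification head
encoder = [nn.Linear(8, 8) for _ in range(12)]
head = nn.Linear(8, 5)
groups = llrd_groups(encoder, head)

# The groups plug straight into AdamW; per-group lr overrides the default
opt = torch.optim.AdamW(groups, weight_decay=0.01)
print(len(groups))  # 13 groups: 1 head + 12 layers
```

The bottom layer ends up at 3e-5 × 0.9¹² ≈ 8.5e-6, roughly a 3.5× smaller step than the head, which is what preserves the pre-trained lower layers.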
### Training Results (Held-Out Synthetic Test Set)

| Metric | Value |
|---|---|
| Accuracy | 0.990 |
| Macro F1 | 0.990 |
| Cohen's κ | 0.990 |

> **Note:** These near-perfect numbers reflect performance on the synthetic data distribution. Real-world performance on human-authored items is reported below.

## Evaluation

### Test Set

The model was evaluated on **N = 1,284 human-coded test items** from [christiqn/EVT-items](https://huggingface.co/datasets/christiqn/EVT-items), drawn from published and unpublished psychological instruments spanning 95 distinct scales. Items were independently coded by a human expert and by the model. Items marked as formatting artefacts, third-person phrasing, or multiple anchors were excluded prior to evaluation. All confidence intervals are bias-corrected and accelerated (BCa) bootstrap CIs with B = 10,000.

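BCa intervals of this kind can be reproduced with `scipy.stats.bootstrap`. A sketch on simulated paired codings (not the actual test set), with the resample count reduced from 10,000 for speed; item indices are resampled so the human/model labels stay paired:

```python
import numpy as np
from scipy.stats import bootstrap
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
# Simulated paired codings: ~80% raw agreement on 6 labels
human = rng.integers(0, 6, size=300)
model = np.where(rng.random(300) < 0.8, human, rng.integers(0, 6, size=300))

# Resampling indices keeps the pairing and makes the statistic one-sample
idx = np.arange(len(human))
kappa = lambda i: cohen_kappa_score(human[i], model[i])

res = bootstrap((idx,), kappa, vectorized=False, n_resamples=2000,
                confidence_level=0.95, method="BCa", random_state=0)
print(res.confidence_interval)
```

The point estimate and interval here are artefacts of the simulation; the model card's intervals come from the real N = 1,284 pairs with B = 10,000.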
### Agreement with Human Coder (Full 6-Class)

| Metric | Value | 95% BCa CI |
|---|---|---|
| **Cohen's κ (unweighted)** | **0.767** | [0.741, 0.793] |
| Cohen's κ (linear weighted) | 0.768 | [0.735, 0.797] |
| Krippendorff's α | 0.767 | [0.740, 0.793] |
| PABAK | 0.772 | — |
| Overall accuracy | 0.810 | [0.787, 0.830] |
| Macro F1 | 0.812 | [0.790, 0.834] |
| Weighted F1 | 0.807 | [0.785, 0.828] |

Cohen's κ = .77 indicates **substantial agreement** according to Landis & Koch (1977).

### Per-Class Performance (6-Class)

| Class | Precision | Recall | F1 | κ (OvR) | n |
|---|---|---|---|---|---|
| ATTAINMENT_VALUE | .800 [.726, .868] | .865 [.798, .926] | .831 [.775, .880] | .815 | 111 |
| COST | .795 [.732, .854] | .859 [.800, .912] | .826 [.778, .869] | .802 | 149 |
| EXPECTANCY | .843 [.801, .881] | .886 [.850, .921] | .864 [.834, .892] | .820 | 308 |
| INTRINSIC_VALUE | .839 [.793, .883] | .893 [.852, .931] | .865 [.831, .896] | .834 | 234 |
| OTHER | .726 [.668, .780] | .636 [.580, .691] | .678 [.631, .722] | .595 | 283 |
| UTILITY_VALUE | .846 [.793, .897] | .774 [.716, .831] | .808 [.764, .849] | .775 | 199 |

### Confusion Matrix (6-Class)

| | Pred: AV | Pred: CO | Pred: EX | Pred: IV | Pred: OT | Pred: UV |
|---|---|---|---|---|---|---|
| **True: AV** | **96** | 1 | 1 | 5 | 4 | 4 |
| **True: CO** | 1 | **128** | 1 | 10 | 7 | 2 |
| **True: EX** | 0 | 11 | **273** | 5 | 18 | 1 |
| **True: IV** | 4 | 4 | 0 | **209** | 15 | 2 |
| **True: OT** | 13 | 14 | 40 | 17 | **180** | 19 |
| **True: UV** | 6 | 3 | 9 | 3 | 24 | **154** |

Cramér's V = 0.782 (6-class).

### Marginal Homogeneity (Stuart-Maxwell Test)

The Stuart-Maxwell test for marginal homogeneity is significant (χ²(5) = 18.62, p = .002), indicating a systematic difference between the model's and the human coder's label distributions. The model tends to over-predict EXPECTANCY (+16), INTRINSIC_VALUE (+15), COST (+12), and ATTAINMENT_VALUE (+9) relative to the human coder, while under-predicting OTHER (−35) and UTILITY_VALUE (−17). This is a direct consequence of the OvR + asymmetric loss design, which prioritises recall on core constructs.

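For reproducibility, the test can be run on the confusion matrix reported above via statsmodels; assuming `SquareTable.homogeneity` matches the implementation used for the card, the statistic should agree with the reported value.

```python
import numpy as np
from statsmodels.stats.contingency_tables import SquareTable

# Confusion matrix from the table above (rows = human, cols = model)
cm = np.array([
    [ 96,   1,   1,   5,   4,   4],
    [  1, 128,   1,  10,   7,   2],
    [  0,  11, 273,   5,  18,   1],
    [  4,   4,   0, 209,  15,   2],
    [ 13,  14,  40,  17, 180,  19],
    [  6,   3,   9,   3,  24, 154],
])

# Stuart-Maxwell test of marginal homogeneity on the 6x6 square table
res = SquareTable(cm, shift_zeros=False).homogeneity(method="stuart_maxwell")
print(f"chi2({res.df}) = {res.statistic:.2f}, p = {res.pvalue:.3f}")
```

The row and column marginals of this matrix also reproduce the over/under-prediction counts quoted in the paragraph above (e.g., column minus row sum for OTHER is −35).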
### Core Construct Discrimination (5-Class, Excluding OTHER)

The OTHER category is not a psychological construct — it is a heterogeneous residual class that the model was never explicitly trained to recognise. To separately assess the model's ability to *discriminate among the five core EVT constructs*, a restricted analysis was conducted on items where both the human coder and the model assigned a core construct (n = 933, 72.7% of all items).

| Metric | Value | 95% BCa CI |
|---|---|---|
| **Cohen's κ (5-class)** | **0.899** | [0.876, 0.920] |
| Krippendorff's α | 0.899 | [0.876, 0.921] |
| Overall accuracy | 0.922 | [0.901, 0.937] |
| Macro F1 | 0.915 | [0.895, 0.933] |
| PABAK | 0.902 | — |

Cohen's κ = .90 indicates **almost perfect agreement** on construct discrimination.

| Class | Precision | Recall | F1 | n |
|---|---|---|---|---|
| ATTAINMENT_VALUE | .897 [.838, .950] | .897 [.835, .951] | .897 [.851, .937] | 107 |
| COST | .871 [.812, .922] | .901 [.849, .948] | .886 [.843, .922] | 142 |
| EXPECTANCY | .961 [.937, .982] | .941 [.912, .967] | .951 [.932, .968] | 290 |
| INTRINSIC_VALUE | .901 [.861, .938] | .954 [.924, .980] | .927 [.901, .950] | 219 |
| UTILITY_VALUE | .945 [.907, .977] | .880 [.830, .926] | .911 [.877, .941] | 175 |

Additionally, when evaluating all items the human coded as a core construct (n = 1,001) — allowing the model to predict OTHER — the model achieved κ = 0.822 [0.795, 0.848] and accuracy = 0.859 [0.835, 0.878]. Of these 1,001 items, 68 (6.8%) were incorrectly pushed to OTHER by the model.

### Summary Across Evaluation Scenarios

| Scenario | κ | Macro F1 | N |
|---|---|---|---|
| Full 6-class (with OTHER) | 0.767 | 0.812 | 1,284 |
| 5-class: both raters assigned core construct | **0.899** | **0.915** | 933 |
| Human = core, model unrestricted | 0.822 | — | 1,001 |

The jump from κ = .77 to κ = .90 confirms that the model's primary weakness lies at the **EVT/non-EVT detection boundary** (the OTHER category), not in discriminating among the five core constructs.

### Base Model Comparison

To quantify the effect of domain-specific fine-tuning, the classifier was compared against the un-fine-tuned base model (`all-mpnet-base-v2`). The base model achieved κ = −0.054 [−0.073, −0.033] vs. κ = 0.767 [0.741, 0.793] for the fine-tuned model (Δκ = +0.821 [0.787, 0.854]). A McNemar test indicated a highly significant difference in error rates, χ²(1) = 867.91, p < .001 (fine-tuned correct only: 911 items; base correct only: 14 items). The fine-tuned model outperforms the base model on all six classes.

| Class | Base F1 | Fine-Tuned F1 | Δ |
|---|---|---|---|
| ATTAINMENT_VALUE | 0.051 | 0.831 | +0.780 |
| COST | 0.090 | 0.826 | +0.736 |
| EXPECTANCY | 0.169 | 0.864 | +0.695 |
| INTRINSIC_VALUE | 0.200 | 0.865 | +0.666 |
| OTHER | 0.007 | 0.678 | +0.671 |
| UTILITY_VALUE | 0.010 | 0.808 | +0.799 |

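The McNemar statistic follows from the discordant counts alone and can be checked with statsmodels. In the 2×2 table below, the diagonal cells (129 both-correct, 230 both-wrong) are not given in the card and are inferred here from the reported 6-class accuracy (0.810 × 1,284 ≈ 1,040 correct); only the off-diagonal cells (911, 14) drive the test.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Rows: fine-tuned (correct, wrong); cols: base model (correct, wrong).
# Diagonal cells are inferred placeholders consistent with N = 1,284.
table = np.array([[129, 911],
                  [ 14, 230]])

# Continuity-corrected chi-square McNemar test on the discordant pairs
res = mcnemar(table, exact=False, correction=True)
print(f"chi2(1) = {res.statistic:.2f}, p = {res.pvalue:.3g}")
```

With correction, the statistic is (|911 − 14| − 1)² / (911 + 14) ≈ 867.91, matching the value reported above regardless of the inferred diagonal cells.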
### Known Limitations and Biases

**OTHER is the weakest category** (κ = .60, F1 = .68). Because OTHER is defined as "no construct activated above threshold" rather than a learned class, the model tends to find *some* EVT signal in ambiguous items that a human would code as non-EVT. The most frequent error type is predicting EXPECTANCY when the true label is OTHER (40 cases, 16.4% of all errors). When OTHER is excluded, the model reaches κ = .90 — almost perfect agreement — on the five core constructs.

**Systematic label distribution differences.** The Stuart-Maxwell test is significant (χ²(5) = 18.62, p = .002). The model over-predicts EXPECTANCY (+16), INTRINSIC_VALUE (+15), COST (+12), and ATTAINMENT_VALUE (+9), and under-predicts OTHER (−35) and UTILITY_VALUE (−17).

**UTILITY_VALUE under-recall.** The model is conservative on utility value (recall = .774 vs. precision = .846), suggesting that items referencing indirect or long-term benefits may be harder for the model to detect.

**Trained on synthetic data.** The training set consists of synthetically generated items, not real published instruments. While the model generalises well (κ = .77 overall, κ = .90 on core constructs), there may be stylistic patterns in human-authored items from specific subfields or cultures that the model has not encountered.

**English only.** The model has not been evaluated on items in other languages. Cross-lingual transfer should not be assumed.