christiqn committed
Commit ae0bc6d · verified · 1 parent: 5d3bc8d

Update README.md

Updated statistics after rerunning the analysis on the final dataset (fixed issues with duplicate labels, and with items that had a label but should have been excluded, or vice versa).

Files changed (1)
  1. README.md +129 -110
README.md CHANGED
@@ -5,6 +5,10 @@ license: cc-by-4.0
library_name: transformers
tags:
- text-classification
- psychology
- expectancy-value-theory
- psychometrics
@@ -13,109 +17,109 @@ tags:
- motivation
- survey-instruments
- content-analysis
datasets:
- christiqn/expectancy_value_pool
- base_model: sentence-transformers/all-mpnet-base-v2
- pipeline_tag: text-classification
metrics:
- - accuracy
- f1
- - cohen_kappa
model-index:
- - name: evt-classifier-v11-mpnet
results:
- task:
type: text-classification
- name: EVT Item Classification
dataset:
- name: Human-coded EVT test items
- type: custom
- config: default
- split: test
metrics:
- type: cohen_kappa
- value: 0.7267
- name: Cohen's Kappa
- type: f1
- value: 0.7744
- name: Macro F1
- type: accuracy
- value: 0.7768
- name: Accuracy
- type: krippendorff_alpha
- value: 0.7267
name: Krippendorff's Alpha
---

# EVT Item Classifier — Expectancy-Value Theory Construct Tagger
-
A fine-tuned text classifier that assigns psychological test items (e.g., survey questions, scale items) to one of six categories derived from **Expectancy-Value Theory** (Eccles et al., 1983; Eccles & Wigfield, 2002):
-
| Label | EVT Construct | Example item |
- |-------|--------------|--------------|
| `ATTAINMENT_VALUE` | Personal importance of doing well | *"Doing well in math is important to who I am."* |
| `COST` | Perceived negative consequences of engagement | *"I have to give up too much to succeed in this class."* |
| `EXPECTANCY` | Beliefs about future success | *"I am confident I can master the skills taught in this course."* |
| `INTRINSIC_VALUE` | Enjoyment and interest | *"I find the content of this course very interesting."* |
| `UTILITY_VALUE` | Usefulness for future goals | *"What I learn in this class will be useful for my career."* |
| `OTHER` | Not classifiable as an EVT construct | *"I usually sit in the front row."* |
-
## Intended Use
-
This model is intended for **academic research** in educational psychology, motivation science, and psychometrics. Typical use cases include:
-
- **Automated content analysis** of existing item pools and questionnaire banks for EVT construct coverage
- **Scale development assistance** — screening candidate items during the item-writing phase
- **Systematic reviews** — coding large corpora of test items from published instruments
- **Construct validation** — checking whether items align with their intended EVT construct
-
### Out-of-Scope Uses
-
- **Clinical or diagnostic decision-making** — This model classifies test *items*, not respondents. It should not be used to assess individuals.
- - **Replacement for human coding** — The model achieves substantial but not perfect agreement with human coders (κ = .73). It is intended as a tool to assist researchers, not to replace expert judgment.
- **Non-English items** — The model was trained and evaluated on English-language items only.
- **Non-EVT constructs** — The model will force any input into one of the six categories. Items from unrelated theoretical frameworks (e.g., Big Five personality, cognitive load) will be classified as `OTHER` at best, or spuriously assigned to an EVT category.
-
## How to Use
-
### Quick Start
-
```python
from transformers import AutoTokenizer, AutoModel
import torch
-
- # Load model - no custom imports needed!
tokenizer = AutoTokenizer.from_pretrained("christiqn/mpnet-EVT-classifier")
model = AutoModel.from_pretrained("christiqn/mpnet-EVT-classifier", trust_remote_code=True)
model.eval()
-
# Inference
text = "I expect to do well in this course."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
-
with torch.no_grad():
    outputs = model(**inputs)
probs = torch.sigmoid(outputs.logits)
-
label = model.config.id2label[probs[0].argmax().item()]
print(f"Predicted: {label}")
```
-
### Adjusting the Decision Threshold
-
The default threshold of 0.50 balances precision and recall. You can adjust it:
-
- **Lower threshold (e.g., 0.35):** More inclusive — fewer items classified as OTHER, higher recall on EVT constructs, but more false positives.
- **Higher threshold (e.g., 0.65):** More conservative — more items fall to OTHER, higher precision on EVT constructs, but more false negatives.
-
Choose based on your application: use a lower threshold for exploratory screening, a higher one when precision matters.
-
## Model Description
-
### Architecture
-
The model uses a custom classification head on top of `sentence-transformers/all-mpnet-base-v2` (110M parameters):
-
```
MPNet encoder

@@ -127,138 +131,153 @@ NormLinear head (cosine similarity classifier, τ = 20)

5 independent sigmoid outputs (one per EVT construct)
```
-
The `OTHER` class is not explicitly learned. Instead, an item is classified as `OTHER` when no construct's sigmoid probability exceeds the decision threshold (default: 0.50). This **One-vs-Rest (OvR)** formulation avoids forcing the model to learn a coherent representation for the heterogeneous `OTHER` category.
-
### Key Design Choices
-
| Component | Rationale |
- |-----------|-----------|
| **ConcatPooling** | Concatenating mean and max pooling captures both distributional and salient features from short text (Howard & Ruder, 2018) |
| **NormLinear (cosine) head** | L2-normalised features and weights make classification direction-based, improving robustness to domain shift (Wang et al., 2018) |
- | **Asymmetric Loss** (γ\_neg=4, γ\_pos=1) | Down-weights easy negatives in the multi-label sigmoid setup, preventing the dominant negative class from overwhelming the loss (Ridnik et al., 2021) |
| **FGM adversarial training** | Embedding-space perturbations regularise the model, improving generalisation from synthetic to real items (Miyato et al., 2017) |
| **LLRD** (decay = 0.9) | Lower layers receive smaller learning rates, preserving pre-trained linguistic knowledge (Sun et al., 2019) |
| **Gradual unfreezing** | Head-only training for epoch 1, then full end-to-end fine-tuning, to prevent catastrophic forgetting (Howard & Ruder, 2018) |
| **SWA** | Averaging top-3 checkpoints produces a flatter minimum and marginally improves κ (Izmailov et al., 2018) |
-
## Training
-
### Training Data
-
The model was fine-tuned on the [expectancy_value_pool_v2](https://huggingface.co/datasets/christiqn/expectancy_value_pool_v2) dataset — approximately 15,000 synthetic psychological test items generated with construct-level attribution, filtered to remove participial-phrase fragments. The data was split 85/7.5/7.5 into train/validation/test sets with stratified sampling.
-
### Training Procedure
-
- **Epochs:** 12 (with early stopping, patience = 4, monitored on validation κ)
- **Batch size:** 24 per device × 2 gradient accumulation steps = 48 effective
- **Optimizer:** AdamW (lr = 3e-5, weight decay = 0.01)
- **Scheduler:** Linear with 10% warmup
- **Precision:** bf16 mixed-precision
- - **Hardware:** Single NVIDIA GPU
- **Post-training:** Stochastic Weight Averaging of top-3 checkpoints
-
### Training Results (Held-Out Synthetic Test Set)
-
| Metric | Value |
- |--------|-------|
| Accuracy | 0.990 |
| Macro F1 | 0.990 |
| Cohen's κ | 0.990 |
-
> **Note:** These near-perfect numbers reflect performance on the synthetic data distribution. Real-world performance on human-authored items is reported below.

## Evaluation

### Test Set

- The model was evaluated on **N = 1,286 human-coded test items** drawn from published and unpublished psychological instruments. Items were independently coded by a human expert and by the model. Items marked as formatting artefacts or third-person phrasing were excluded prior to evaluation. All confidence intervals are bias-corrected and accelerated (BCa) bootstrap CIs with B = 10,000.

### Agreement with Human Coder (Full 6-Class)

| Metric | Value | 95% BCa CI |
- |--------|-------|------------|
- | **Cohen's κ (unweighted)** | 0.727 | [0.698, 0.754] |
- | Cohen's κ (linear weighted) | 0.721 | [0.686, 0.754] |
- | Krippendorff's α | 0.727 | [0.698, 0.754] |
- | PABAK | 0.732 | — |
- | Overall accuracy | 0.777 | [0.752, 0.799] |
- | Macro F1 | 0.774 | [0.750, 0.797] |
- | Weighted F1 | 0.775 | [0.752, 0.797] |

- Cohen's κ = .73 indicates **substantial agreement** according to Landis & Koch (1977).

### Per-Class Performance (6-Class)

| Class | Precision | Recall | F1 | κ (OvR) | n |
- |-------|-----------|--------|----|---------|---|
- | ATTAINMENT_VALUE | .730 [.647, .808] | .809 [.731, .881] | .767 [.703, .824] | .744 | 110 |
- | COST | .739 [.669, .807] | .804 [.739, .865] | .770 [.716, .821] | .739 | 148 |
- | EXPECTANCY | .831 [.790, .871] | .844 [.803, .883] | .837 [.805, .866] | .783 | 320 |
- | INTRINSIC_VALUE | .817 [.768, .863] | .891 [.850, .930] | .852 [.817, .884] | .819 | 230 |
- | OTHER | .663 [.602, .720] | .601 [.542, .659] | .631 [.579, .677] | .538 | 271 |
- | UTILITY_VALUE | .845 [.792, .896] | .739 [.679, .798] | .789 [.742, .831] | .751 | 207 |
-
### Confusion Matrix (6-Class)

| | Pred: AV | Pred: CO | Pred: EX | Pred: IV | Pred: OT | Pred: UV |
- |---|---:|---:|---:|---:|---:|---:|
- | **True: AV** | **89** | 2 | 1 | 5 | 7 | 6 |
- | **True: CO** | 3 | **119** | 2 | 10 | 11 | 3 |
- | **True: EX** | 3 | 17 | **270** | 6 | 23 | 1 |
- | **True: IV** | 4 | 4 | 0 | **205** | 15 | 2 |
- | **True: OT** | 15 | 15 | 42 | 20 | **163** | 16 |
- | **True: UV** | 8 | 4 | 10 | 5 | 27 | **153** |

- Cramér's V = .74 (6-class), .86 (5-class core only).

### Core Construct Discrimination (5-Class, Excluding OTHER)

- The OTHER category is not a psychological construct — it is a heterogeneous residual class that the model was never explicitly trained to recognise. To separately assess the model's ability to *discriminate among the five core EVT constructs*, we report a restricted analysis on items where both the human coder and the model assigned a core construct (n = 932, 72.5% of all items).

| Metric | Value | 95% BCa CI |
- |--------|-------|------------|
- | **Cohen's κ (5-class)** | 0.867 | [0.842, 0.891] |
- | Krippendorff's α | 0.867 | [0.841, 0.891] |
- | Overall accuracy | 0.897 | [0.876, 0.914] |
- | Macro F1 | 0.885 | [0.862, 0.906] |

- Cohen's κ = .87 indicates **almost perfect agreement** on construct discrimination.

| Class | Precision | Recall | F1 | n |
- |-------|-----------|--------|----|---|
- | ATTAINMENT_VALUE | .832 [.758, .899] | .864 [.794, .925] | .848 [.790, .896] | 103 |
- | COST | .815 [.748, .876] | .869 [.809, .922] | .841 [.791, .884] | 137 |
- | EXPECTANCY | .954 [.928, .977] | .909 [.875, .941] | .931 [.909, .951] | 297 |
- | INTRINSIC_VALUE | .887 [.845, .927] | .953 [.923, .980] | .919 [.891, .944] | 215 |
- | UTILITY_VALUE | .927 [.886, .964] | .850 [.796, .901] | .887 [.850, .920] | 180 |

- Additionally, when evaluating all items the human coded as a core construct (n = 1,015) — allowing the model to predict OTHER — the model achieved κ = .778 [.748, .805] and accuracy = .824 [.799, .846]. Of these 1,015 items, 83 (8.2%) were incorrectly pushed to OTHER by the model.

### Summary Across Evaluation Scenarios

| Scenario | κ | Macro F1 | N |
- |----------|---|----------|---|
- | Full 6-class (with OTHER) | .727 | .774 | 1,286 |
- | 5-class: both raters assigned core construct | **.867** | **.885** | 932 |
- | Human = core, model unrestricted | .778 | — | 1,015 |

- The jump from κ = .73 to κ = .87 confirms that the model's primary weakness lies at the **EVT/non-EVT detection boundary** (the OTHER category), not in discriminating among the five core constructs.
-
### Base Model Comparison
-
- To quantify the effect of domain-specific fine-tuning, the classifier was compared against the un-fine-tuned base model (`all-mpnet-base-v2`). The base model achieved a Cohen's κ = -0.04 vs. κ = 0.73 for the fine-tuned model (Δκ = +0.77). A McNemar test indicated a highly significant difference in error rates, χ²(1) = 791.37, p < .001.
-
### Known Limitations and Biases

- **Systematic label distribution differences.** The Stuart-Maxwell test for marginal homogeneity is significant (χ²(5) = 20.57, p = .001). The model tends to over-predict INTRINSIC_VALUE (+21 items), COST (+13), ATTAINMENT_VALUE (+12), and EXPECTANCY (+5) relative to the human coder. This is a direct consequence of the OvR + asymmetric loss design, which prioritises recall on core constructs at the expense of the OTHER category (which is under-predicted by 25 items).

- **OTHER is the weakest category** (κ = .54, F1 = .63). Because OTHER is defined as "no construct activated above threshold" rather than a learned class, the model tends to find *some* EVT signal in ambiguous items that a human would code as non-EVT. The most frequent error type overall is predicting EXPECTANCY when the true label is OTHER (42 cases, 14.6% of all errors). When OTHER is excluded, the model reaches κ = .87 — almost perfect agreement — on the five core constructs.

- **UTILITY_VALUE under-prediction.** The model is conservative on utility value (Δ = −26 items vs. human coder), with high precision (.845) but lower recall (.739). Items referencing indirect or long-term benefits may be harder for the model to detect.

- **Trained on synthetic data.** The training set consists of synthetically generated items, not real published instruments. While the model generalises well (κ = .73 overall, κ = .87 on core constructs), there may be stylistic patterns in human-authored items from specific subfields or cultures that the model has not encountered.

**English only.** The model has not been evaluated on items in other languages. Cross-lingual transfer should not be assumed.

library_name: transformers
tags:
- text-classification
+ - pytorch
+ - safetensors
+ - evt_classifier
+ - feature-extraction
- psychology
- expectancy-value-theory
- psychometrics
- motivation
- survey-instruments
- content-analysis
+ - custom_code
+ pipeline_tag: text-classification
datasets:
- christiqn/expectancy_value_pool
+ - christiqn/EVT-items
metrics:
- f1
+ - accuracy
model-index:
+ - name: mpnet-EVT-classifier
results:
- task:
type: text-classification
dataset:
+ name: EVT-items (Human-coded psychological test items)
+ type: christiqn/EVT-items
metrics:
- type: cohen_kappa
+ value: 0.7674
+ name: Cohen's Kappa (6-class, N=1284)
- type: f1
+ value: 0.8121
+ name: Macro F1 (6-class)
- type: accuracy
+ value: 0.8100
+ name: Accuracy (6-class)
- type: krippendorff_alpha
+ value: 0.7674
name: Krippendorff's Alpha
---

# EVT Item Classifier — Expectancy-Value Theory Construct Tagger
+
A fine-tuned text classifier that assigns psychological test items (e.g., survey questions, scale items) to one of six categories derived from **Expectancy-Value Theory** (Eccles et al., 1983; Eccles & Wigfield, 2002):
+
| Label | EVT Construct | Example item |
+ |---|---|---|
| `ATTAINMENT_VALUE` | Personal importance of doing well | *"Doing well in math is important to who I am."* |
| `COST` | Perceived negative consequences of engagement | *"I have to give up too much to succeed in this class."* |
| `EXPECTANCY` | Beliefs about future success | *"I am confident I can master the skills taught in this course."* |
| `INTRINSIC_VALUE` | Enjoyment and interest | *"I find the content of this course very interesting."* |
| `UTILITY_VALUE` | Usefulness for future goals | *"What I learn in this class will be useful for my career."* |
| `OTHER` | Not classifiable as an EVT construct | *"I usually sit in the front row."* |
+
+
## Intended Use
+
This model is intended for **academic research** in educational psychology, motivation science, and psychometrics. Typical use cases include:
+
- **Automated content analysis** of existing item pools and questionnaire banks for EVT construct coverage
- **Scale development assistance** — screening candidate items during the item-writing phase
- **Systematic reviews** — coding large corpora of test items from published instruments
- **Construct validation** — checking whether items align with their intended EVT construct
+
### Out-of-Scope Uses
+
- **Clinical or diagnostic decision-making** — This model classifies test *items*, not respondents. It should not be used to assess individuals.
+ - **Replacement for human coding** — The model achieves substantial but not perfect agreement with human coders (κ = .77). It is intended as a tool to assist researchers, not to replace expert judgment.
- **Non-English items** — The model was trained and evaluated on English-language items only.
- **Non-EVT constructs** — The model will force any input into one of the six categories. Items from unrelated theoretical frameworks (e.g., Big Five personality, cognitive load) will be classified as `OTHER` at best, or spuriously assigned to an EVT category.
+
+
## How to Use
+
### Quick Start
+
```python
from transformers import AutoTokenizer, AutoModel
import torch
+
+ # Load model
tokenizer = AutoTokenizer.from_pretrained("christiqn/mpnet-EVT-classifier")
model = AutoModel.from_pretrained("christiqn/mpnet-EVT-classifier", trust_remote_code=True)
model.eval()
+
# Inference
text = "I expect to do well in this course."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
+
with torch.no_grad():
    outputs = model(**inputs)
probs = torch.sigmoid(outputs.logits)
+
label = model.config.id2label[probs[0].argmax().item()]
print(f"Predicted: {label}")
```
+
### Adjusting the Decision Threshold
+
The default threshold of 0.50 balances precision and recall. You can adjust it:
+
- **Lower threshold (e.g., 0.35):** More inclusive — fewer items classified as OTHER, higher recall on EVT constructs, but more false positives.
- **Higher threshold (e.g., 0.65):** More conservative — more items fall to OTHER, higher precision on EVT constructs, but more false negatives.
+
Choose based on your application: use a lower threshold for exploratory screening, a higher one when precision matters.
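The threshold logic above can be sketched as a small post-processing helper. A minimal sketch: `predict_with_threshold`, the `id2label` mapping shown here, and the example probabilities are illustrative, not part of the model's published API.

```python
def predict_with_threshold(probs, id2label, threshold=0.50):
    """Map per-construct sigmoid probabilities to a final label.

    probs: sequence of 5 floats, one per core EVT construct.
    Falls back to OTHER when no probability clears the threshold.
    """
    best = max(range(len(probs)), key=lambda i: probs[i])
    if probs[best] < threshold:
        return "OTHER"  # no construct activated above threshold
    return id2label[best]

# Hypothetical sigmoid outputs for one item
id2label = {0: "ATTAINMENT_VALUE", 1: "COST", 2: "EXPECTANCY",
            3: "INTRINSIC_VALUE", 4: "UTILITY_VALUE"}
probs = [0.12, 0.08, 0.42, 0.30, 0.05]

print(predict_with_threshold(probs, id2label))                  # OTHER (0.42 < 0.50)
print(predict_with_threshold(probs, id2label, threshold=0.35))  # EXPECTANCY
```

Note that a plain `argmax` over the five construct outputs can never yield `OTHER`; the threshold check is what implements the OvR fallback.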
+
+
## Model Description
+
### Architecture
+
The model uses a custom classification head on top of `sentence-transformers/all-mpnet-base-v2` (110M parameters):
+
```
MPNet encoder

5 independent sigmoid outputs (one per EVT construct)
```
+
The `OTHER` class is not explicitly learned. Instead, an item is classified as `OTHER` when no construct's sigmoid probability exceeds the decision threshold (default: 0.50). This **One-vs-Rest (OvR)** formulation avoids forcing the model to learn a coherent representation for the heterogeneous `OTHER` category.
+
### Key Design Choices
+
| Component | Rationale |
+ |---|---|
| **ConcatPooling** | Concatenating mean and max pooling captures both distributional and salient features from short text (Howard & Ruder, 2018) |
| **NormLinear (cosine) head** | L2-normalised features and weights make classification direction-based, improving robustness to domain shift (Wang et al., 2018) |
+ | **Asymmetric Loss** (γ_neg=4, γ_pos=1) | Down-weights easy negatives in the multi-label sigmoid setup, preventing the dominant negative class from overwhelming the loss (Ridnik et al., 2021) |
| **FGM adversarial training** | Embedding-space perturbations regularise the model, improving generalisation from synthetic to real items (Miyato et al., 2017) |
| **LLRD** (decay = 0.9) | Lower layers receive smaller learning rates, preserving pre-trained linguistic knowledge (Sun et al., 2019) |
| **Gradual unfreezing** | Head-only training for epoch 1, then full end-to-end fine-tuning, to prevent catastrophic forgetting (Howard & Ruder, 2018) |
| **SWA** | Averaging top-3 checkpoints produces a flatter minimum and marginally improves κ (Izmailov et al., 2018) |
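As an illustration of the LLRD entry above: with decay = 0.9, each layer receives 0.9× the learning rate of the layer above it, so lower (more general) layers change least. A minimal sketch, assuming a 12-layer encoder (as in mpnet-base) with the head on top; the grouping is illustrative, not the actual training script:

```python
# Layer-wise learning-rate decay (LLRD): lr shrinks geometrically with depth.
base_lr, decay, num_layers = 3e-5, 0.9, 12

# layer 12 = head / top encoder layer, layer 0 = embeddings
layer_lrs = {layer: base_lr * decay ** (num_layers - layer)
             for layer in range(num_layers + 1)}

print(f"head lr      = {layer_lrs[12]:.2e}")  # 3.00e-05
print(f"embedding lr = {layer_lrs[0]:.2e}")   # 8.47e-06
```

In practice each dictionary entry would become one optimizer parameter group for the corresponding layer's parameters.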
+
+
## Training
+
### Training Data
+
The model was fine-tuned on the [expectancy_value_pool_v2](https://huggingface.co/datasets/christiqn/expectancy_value_pool_v2) dataset — approximately 15,000 synthetic psychological test items generated with construct-level attribution, filtered to remove participial-phrase fragments. The data was split 85/7.5/7.5 into train/validation/test sets with stratified sampling.
+
### Training Procedure
+
- **Epochs:** 12 (with early stopping, patience = 4, monitored on validation κ)
- **Batch size:** 24 per device × 2 gradient accumulation steps = 48 effective
- **Optimizer:** AdamW (lr = 3e-5, weight decay = 0.01)
- **Scheduler:** Linear with 10% warmup
- **Precision:** bf16 mixed-precision
+ - **Hardware:** Single NVIDIA H100 GPU
- **Post-training:** Stochastic Weight Averaging of top-3 checkpoints
+
### Training Results (Held-Out Synthetic Test Set)
+
| Metric | Value |
+ |---|---|
| Accuracy | 0.990 |
| Macro F1 | 0.990 |
| Cohen's κ | 0.990 |
+
> **Note:** These near-perfect numbers reflect performance on the synthetic data distribution. Real-world performance on human-authored items is reported below.

+
## Evaluation

### Test Set

+ The model was evaluated on **N = 1,284 human-coded test items** from [christiqn/EVT-items](https://huggingface.co/datasets/christiqn/EVT-items), drawn from published and unpublished psychological instruments spanning 95 distinct scales. Items were independently coded by a human expert and by the model. Items marked as formatting artefacts, third-person phrasing, or multiple anchors were excluded prior to evaluation. All confidence intervals are bias-corrected and accelerated (BCa) bootstrap CIs with B = 10,000.

### Agreement with Human Coder (Full 6-Class)

| Metric | Value | 95% BCa CI |
+ |---|---|---|
+ | **Cohen's κ (unweighted)** | **0.767** | [0.741, 0.793] |
+ | Cohen's κ (linear weighted) | 0.768 | [0.735, 0.797] |
+ | Krippendorff's α | 0.767 | [0.740, 0.793] |
+ | PABAK | 0.772 | — |
+ | Overall accuracy | 0.810 | [0.787, 0.830] |
+ | Macro F1 | 0.812 | [0.790, 0.834] |
+ | Weighted F1 | 0.807 | [0.785, 0.828] |

+ Cohen's κ = .77 indicates **substantial agreement** according to Landis & Koch (1977).

### Per-Class Performance (6-Class)

| Class | Precision | Recall | F1 | κ (OvR) | n |
+ |---|---|---|---|---|---|
+ | ATTAINMENT_VALUE | .800 [.726, .868] | .865 [.798, .926] | .831 [.775, .880] | .815 | 111 |
+ | COST | .795 [.732, .854] | .859 [.800, .912] | .826 [.778, .869] | .802 | 149 |
+ | EXPECTANCY | .843 [.801, .881] | .886 [.850, .921] | .864 [.834, .892] | .820 | 308 |
+ | INTRINSIC_VALUE | .839 [.793, .883] | .893 [.852, .931] | .865 [.831, .896] | .834 | 234 |
+ | OTHER | .726 [.668, .780] | .636 [.580, .691] | .678 [.631, .722] | .595 | 283 |
+ | UTILITY_VALUE | .846 [.793, .897] | .774 [.716, .831] | .808 [.764, .849] | .775 | 199 |
+
### Confusion Matrix (6-Class)

| | Pred: AV | Pred: CO | Pred: EX | Pred: IV | Pred: OT | Pred: UV |
+ |---|---|---|---|---|---|---|
+ | **True: AV** | **96** | 1 | 1 | 5 | 4 | 4 |
+ | **True: CO** | 1 | **128** | 1 | 10 | 7 | 2 |
+ | **True: EX** | 0 | 11 | **273** | 5 | 18 | 1 |
+ | **True: IV** | 4 | 4 | 0 | **209** | 15 | 2 |
+ | **True: OT** | 13 | 14 | 40 | 17 | **180** | 19 |
+ | **True: UV** | 6 | 3 | 9 | 3 | 24 | **154** |
+
+ Cramér's V = 0.782 (6-class).
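The headline 6-class agreement figures can be reproduced directly from the confusion matrix above (rows = human labels, columns = model predictions); a minimal sketch:

```python
# Confusion matrix from the table above, order AV, CO, EX, IV, OT, UV.
cm = [
    [96,   1,   1,   5,   4,   4],
    [ 1, 128,   1,  10,   7,   2],
    [ 0,  11, 273,   5,  18,   1],
    [ 4,   4,   0, 209,  15,   2],
    [13,  14,  40,  17, 180,  19],
    [ 6,   3,   9,   3,  24, 154],
]
n = sum(map(sum, cm))                                  # 1,284 items
p_o = sum(cm[i][i] for i in range(6)) / n              # observed agreement
human = [sum(row) for row in cm]                       # row marginals
model = [sum(cm[i][j] for i in range(6)) for j in range(6)]  # column marginals
p_e = sum(h * m for h, m in zip(human, model)) / n**2  # chance agreement
kappa = (p_o - p_e) / (1 - p_e)

print(f"accuracy = {p_o:.3f}, Cohen's kappa = {kappa:.4f}")
# accuracy = 0.810, Cohen's kappa = 0.7674
```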

+ ### Marginal Homogeneity (Stuart-Maxwell Test)
+
+ The Stuart-Maxwell test for marginal homogeneity is significant (χ²(5) = 18.62, p = .002), indicating a systematic difference between the model's and human coder's label distributions. The model tends to over-predict EXPECTANCY (+16), INTRINSIC_VALUE (+15), COST (+12), and ATTAINMENT_VALUE (+9) relative to the human coder, while under-predicting OTHER (−35) and UTILITY_VALUE (−17). This is a direct consequence of the OvR + asymmetric loss design, which prioritises recall on core constructs.
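The over- and under-prediction counts quoted above are simply the column-minus-row marginal differences of the 6-class confusion matrix; a quick check:

```python
# Same confusion matrix as in the evaluation section
# (rows = human, columns = model; order AV, CO, EX, IV, OT, UV).
cm = [
    [96,   1,   1,   5,   4,   4],
    [ 1, 128,   1,  10,   7,   2],
    [ 0,  11, 273,   5,  18,   1],
    [ 4,   4,   0, 209,  15,   2],
    [13,  14,  40,  17, 180,  19],
    [ 6,   3,   9,   3,  24, 154],
]
labels = ["ATTAINMENT_VALUE", "COST", "EXPECTANCY",
          "INTRINSIC_VALUE", "OTHER", "UTILITY_VALUE"]
# model predictions (column total) minus human codes (row total) per class
deltas = {lab: sum(cm[i][j] for i in range(6)) - sum(cm[j])
          for j, lab in enumerate(labels)}
print(deltas)
# {'ATTAINMENT_VALUE': 9, 'COST': 12, 'EXPECTANCY': 16,
#  'INTRINSIC_VALUE': 15, 'OTHER': -35, 'UTILITY_VALUE': -17}
```

The deltas necessarily sum to zero, since both marginals total the same 1,284 items.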

### Core Construct Discrimination (5-Class, Excluding OTHER)

+ The OTHER category is not a psychological construct — it is a heterogeneous residual class that the model was never explicitly trained to recognise. To separately assess the model's ability to *discriminate among the five core EVT constructs*, a restricted analysis was conducted on items where both the human coder and the model assigned a core construct (n = 933, 72.7% of all items).

| Metric | Value | 95% BCa CI |
+ |---|---|---|
+ | **Cohen's κ (5-class)** | **0.899** | [0.876, 0.920] |
+ | Krippendorff's α | 0.899 | [0.876, 0.921] |
+ | Overall accuracy | 0.922 | [0.901, 0.937] |
+ | Macro F1 | 0.915 | [0.895, 0.933] |
+ | PABAK | 0.902 | — |

+ Cohen's κ = .90 indicates **almost perfect agreement** on construct discrimination.

| Class | Precision | Recall | F1 | n |
+ |---|---|---|---|---|
+ | ATTAINMENT_VALUE | .897 [.838, .950] | .897 [.835, .951] | .897 [.851, .937] | 107 |
+ | COST | .871 [.812, .922] | .901 [.849, .948] | .886 [.843, .922] | 142 |
+ | EXPECTANCY | .961 [.937, .982] | .941 [.912, .967] | .951 [.932, .968] | 290 |
+ | INTRINSIC_VALUE | .901 [.861, .938] | .954 [.924, .980] | .927 [.901, .950] | 219 |
+ | UTILITY_VALUE | .945 [.907, .977] | .880 [.830, .926] | .911 [.877, .941] | 175 |

+ Additionally, when evaluating all items the human coded as a core construct (n = 1,001) — allowing the model to predict OTHER — the model achieved κ = 0.822 [0.795, 0.848] and accuracy = 0.859 [0.835, 0.878]. Of these 1,001 items, 68 (6.8%) were incorrectly pushed to OTHER by the model.

### Summary Across Evaluation Scenarios

| Scenario | κ | Macro F1 | N |
+ |---|---|---|---|
+ | Full 6-class (with OTHER) | 0.767 | 0.812 | 1,284 |
+ | 5-class: both raters assigned core construct | **0.899** | **0.915** | 933 |
+ | Human = core, model unrestricted | 0.822 | — | 1,001 |
+
+ The jump from κ = .77 to κ = .90 confirms that the model's primary weakness lies at the **EVT/non-EVT detection boundary** (the OTHER category), not in discriminating among the five core constructs.

### Base Model Comparison
+
+ To quantify the effect of domain-specific fine-tuning, the classifier was compared against the un-fine-tuned base model (`all-mpnet-base-v2`). The base model achieved κ = −0.054 [−0.073, −0.033] vs. κ = 0.767 [0.741, 0.793] for the fine-tuned model (Δκ = +0.821 [0.787, 0.854]). A McNemar test indicated a highly significant difference in error rates, χ²(1) = 867.91, p < .001 (fine-tuned correct only: 911 items; base correct only: 14 items). The fine-tuned model outperforms the base model on all six classes.
+
+ | Class | Base F1 | Fine-Tuned F1 | Δ |
+ |---|---|---|---|
+ | ATTAINMENT_VALUE | 0.051 | 0.831 | +0.780 |
+ | COST | 0.090 | 0.826 | +0.736 |
+ | EXPECTANCY | 0.169 | 0.864 | +0.695 |
+ | INTRINSIC_VALUE | 0.200 | 0.865 | +0.666 |
+ | OTHER | 0.007 | 0.678 | +0.671 |
+ | UTILITY_VALUE | 0.010 | 0.808 | +0.799 |
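The McNemar statistic reported in this section follows from the discordant counts alone; a quick check, assuming the standard continuity-corrected form of the test:

```python
# Discordant pairs from the base-model comparison:
# b = items only the fine-tuned model classified correctly,
# c = items only the base model classified correctly.
b, c = 911, 14
chi2 = (abs(b - c) - 1) ** 2 / (b + c)  # continuity-corrected McNemar statistic, df = 1
print(f"chi2(1) = {chi2:.2f}")  # chi2(1) = 867.91
```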

### Known Limitations and Biases

+ **OTHER is the weakest category** (κ = .60, F1 = .68). Because OTHER is defined as "no construct activated above threshold" rather than a learned class, the model tends to find *some* EVT signal in ambiguous items that a human would code as non-EVT. The most frequent error type is predicting EXPECTANCY when the true label is OTHER (40 cases, 16.4% of all errors). When OTHER is excluded, the model reaches κ = .90 (almost perfect agreement) on the five core constructs.

+ **Systematic label distribution differences.** The Stuart-Maxwell test is significant (χ²(5) = 18.62, p = .002). The model over-predicts EXPECTANCY (+16), INTRINSIC_VALUE (+15), COST (+12), and ATTAINMENT_VALUE (+9), and under-predicts OTHER (−35) and UTILITY_VALUE (−17).

+ **UTILITY_VALUE under-recall.** The model is conservative on utility value (recall = .774 vs. precision = .846), suggesting that items referencing indirect or long-term benefits may be harder for the model to detect.

+ **Trained on synthetic data.** The training set consists of synthetically generated items, not real published instruments. While the model generalises well (κ = .77 overall, κ = .90 on core constructs), there may be stylistic patterns in human-authored items from specific subfields or cultures that the model has not encountered.

**English only.** The model has not been evaluated on items in other languages. Cross-lingual transfer should not be assumed.
283