christiqn committed
Commit ae0bc6d · verified · 1 parent: 5d3bc8d

Update README.md

Updated statistics after rerunning the analysis on the final dataset (fixed issues with duplicate labels, and with items that had a label but should have been excluded, or vice versa).

Files changed (1)
  1. README.md +129 -110
README.md CHANGED
@@ -5,6 +5,10 @@ license: cc-by-4.0
library_name: transformers
tags:
- text-classification
- psychology
- expectancy-value-theory
- psychometrics
@@ -13,109 +17,109 @@ tags:
- motivation
- survey-instruments
- content-analysis
datasets:
- christiqn/expectancy_value_pool
- base_model: sentence-transformers/all-mpnet-base-v2
- pipeline_tag: text-classification
metrics:
- - accuracy
- f1
- - cohen_kappa
model-index:
- - name: evt-classifier-v11-mpnet
results:
- task:
type: text-classification
- name: EVT Item Classification
dataset:
- name: Human-coded EVT test items
- type: custom
- config: default
- split: test
metrics:
- type: cohen_kappa
- value: 0.7267
- name: Cohen's Kappa
- type: f1
- value: 0.7744
- name: Macro F1
- type: accuracy
- value: 0.7768
- name: Accuracy
- type: krippendorff_alpha
- value: 0.7267
name: Krippendorff's Alpha
---

# EVT Item Classifier — Expectancy-Value Theory Construct Tagger
-
A fine-tuned text classifier that assigns psychological test items (e.g., survey questions, scale items) to one of six categories derived from **Expectancy-Value Theory** (Eccles et al., 1983; Eccles & Wigfield, 2002):
-
| Label | EVT Construct | Example item |
- |-------|--------------|--------------|
| `ATTAINMENT_VALUE` | Personal importance of doing well | *"Doing well in math is important to who I am."* |
| `COST` | Perceived negative consequences of engagement | *"I have to give up too much to succeed in this class."* |
| `EXPECTANCY` | Beliefs about future success | *"I am confident I can master the skills taught in this course."* |
| `INTRINSIC_VALUE` | Enjoyment and interest | *"I find the content of this course very interesting."* |
| `UTILITY_VALUE` | Usefulness for future goals | *"What I learn in this class will be useful for my career."* |
| `OTHER` | Not classifiable as an EVT construct | *"I usually sit in the front row."* |
-
## Intended Use
-
This model is intended for **academic research** in educational psychology, motivation science, and psychometrics. Typical use cases include:
-
- **Automated content analysis** of existing item pools and questionnaire banks for EVT construct coverage
- **Scale development assistance** — screening candidate items during the item-writing phase
- **Systematic reviews** — coding large corpora of test items from published instruments
- **Construct validation** — checking whether items align with their intended EVT construct
-
### Out-of-Scope Uses
-
- **Clinical or diagnostic decision-making** — This model classifies test *items*, not respondents. It should not be used to assess individuals.
- - **Replacement for human coding** — The model achieves substantial but not perfect agreement with human coders (κ = .73). It is intended as a tool to assist researchers, not to replace expert judgment.
- **Non-English items** — The model was trained and evaluated on English-language items only.
- **Non-EVT constructs** — The model will force any input into one of the six categories. Items from unrelated theoretical frameworks (e.g., Big Five personality, cognitive load) will be classified as `OTHER` at best, or spuriously assigned to an EVT category.
-
## How to Use
-
### Quick Start
-
```python
from transformers import AutoTokenizer, AutoModel
import torch
-
- # Load model - no custom imports needed!
tokenizer = AutoTokenizer.from_pretrained("christiqn/mpnet-EVT-classifier")
model = AutoModel.from_pretrained("christiqn/mpnet-EVT-classifier", trust_remote_code=True)
model.eval()
-
# Inference
text = "I expect to do well in this course."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
-
with torch.no_grad():
    outputs = model(**inputs)
probs = torch.sigmoid(outputs.logits)
-
label = model.config.id2label[probs[0].argmax().item()]
print(f"Predicted: {label}")
```
-
### Adjusting the Decision Threshold
-
The default threshold of 0.50 balances precision and recall. You can adjust it:
-
- **Lower threshold (e.g., 0.35):** More inclusive — fewer items classified as OTHER, higher recall on EVT constructs, but more false positives.
- **Higher threshold (e.g., 0.65):** More conservative — more items fall to OTHER, higher precision on EVT constructs, but more false negatives.
-
Choose based on your application: use a lower threshold for exploratory screening, a higher one when precision matters.
-
## Model Description
-
### Architecture
-
The model uses a custom classification head on top of `sentence-transformers/all-mpnet-base-v2` (110M parameters):
-
```
MPNet encoder

@@ -127,138 +131,153 @@ NormLinear head (cosine similarity classifier, τ = 20)

5 independent sigmoid outputs (one per EVT construct)
```
-
The `OTHER` class is not explicitly learned. Instead, an item is classified as `OTHER` when no construct's sigmoid probability exceeds the decision threshold (default: 0.50). This **One-vs-Rest (OvR)** formulation avoids forcing the model to learn a coherent representation for the heterogeneous `OTHER` category.
-
### Key Design Choices
-
| Component | Rationale |
- |-----------|-----------|
| **ConcatPooling** | Concatenating mean and max pooling captures both distributional and salient features from short text (Howard & Ruder, 2018) |
| **NormLinear (cosine) head** | L2-normalised features and weights make classification direction-based, improving robustness to domain shift (Wang et al., 2018) |
- | **Asymmetric Loss** (γ\_neg=4, γ\_pos=1) | Down-weights easy negatives in the multi-label sigmoid setup, preventing the dominant negative class from overwhelming the loss (Ridnik et al., 2021) |
| **FGM adversarial training** | Embedding-space perturbations regularise the model, improving generalisation from synthetic to real items (Miyato et al., 2017) |
| **LLRD** (decay = 0.9) | Lower layers receive smaller learning rates, preserving pre-trained linguistic knowledge (Sun et al., 2019) |
| **Gradual unfreezing** | Head-only training for epoch 1, then full end-to-end fine-tuning, to prevent catastrophic forgetting (Howard & Ruder, 2018) |
| **SWA** | Averaging top-3 checkpoints produces a flatter minimum and marginally improves κ (Izmailov et al., 2018) |
-
## Training
-
### Training Data
-
The model was fine-tuned on the [expectancy_value_pool_v2](https://huggingface.co/datasets/christiqn/expectancy_value_pool_v2) dataset — approximately 15,000 synthetic psychological test items generated with construct-level attribution, filtered to remove participial-phrase fragments. The data was split 85/7.5/7.5 into train/validation/test sets with stratified sampling.
-
### Training Procedure
-
- **Epochs:** 12 (with early stopping, patience = 4, monitored on validation κ)
- **Batch size:** 24 per device × 2 gradient accumulation steps = 48 effective
- **Optimizer:** AdamW (lr = 3e-5, weight decay = 0.01)
- **Scheduler:** Linear with 10% warmup
- **Precision:** bf16 mixed-precision
- - **Hardware:** Single NVIDIA GPU
- **Post-training:** Stochastic Weight Averaging of top-3 checkpoints
-
### Training Results (Held-Out Synthetic Test Set)
-
| Metric | Value |
- |--------|-------|
| Accuracy | 0.990 |
| Macro F1 | 0.990 |
| Cohen's κ | 0.990 |
-
> **Note:** These near-perfect numbers reflect performance on the synthetic data distribution. Real-world performance on human-authored items is reported below.

## Evaluation

### Test Set

- The model was evaluated on **N = 1,286 human-coded test items** drawn from published and unpublished psychological instruments. Items were independently coded by a human expert and by the model. Items marked as formatting artefacts or third-person phrasing were excluded prior to evaluation. All confidence intervals are bias-corrected and accelerated (BCa) bootstrap CIs with B = 10,000.

### Agreement with Human Coder (Full 6-Class)

| Metric | Value | 95% BCa CI |
- |--------|-------|------------|
- | **Cohen's κ (unweighted)** | 0.727 | [0.698, 0.754] |
- | Cohen's κ (linear weighted) | 0.721 | [0.686, 0.754] |
- | Krippendorff's α | 0.727 | [0.698, 0.754] |
- | PABAK | 0.732 | — |
- | Overall accuracy | 0.777 | [0.752, 0.799] |
- | Macro F1 | 0.774 | [0.750, 0.797] |
- | Weighted F1 | 0.775 | [0.752, 0.797] |

- Cohen's κ = .73 indicates **substantial agreement** according to Landis & Koch (1977).

### Per-Class Performance (6-Class)

| Class | Precision | Recall | F1 | κ (OvR) | n |
- |-------|-----------|--------|----|---------|---|
- | ATTAINMENT_VALUE | .730 [.647, .808] | .809 [.731, .881] | .767 [.703, .824] | .744 | 110 |
- | COST | .739 [.669, .807] | .804 [.739, .865] | .770 [.716, .821] | .739 | 148 |
- | EXPECTANCY | .831 [.790, .871] | .844 [.803, .883] | .837 [.805, .866] | .783 | 320 |
- | INTRINSIC_VALUE | .817 [.768, .863] | .891 [.850, .930] | .852 [.817, .884] | .819 | 230 |
- | OTHER | .663 [.602, .720] | .601 [.542, .659] | .631 [.579, .677] | .538 | 271 |
- | UTILITY_VALUE | .845 [.792, .896] | .739 [.679, .798] | .789 [.742, .831] | .751 | 207 |
-
### Confusion Matrix (6-Class)

| | Pred: AV | Pred: CO | Pred: EX | Pred: IV | Pred: OT | Pred: UV |
- |---|---:|---:|---:|---:|---:|---:|
- | **True: AV** | **89** | 2 | 1 | 5 | 7 | 6 |
- | **True: CO** | 3 | **119** | 2 | 10 | 11 | 3 |
- | **True: EX** | 3 | 17 | **270** | 6 | 23 | 1 |
- | **True: IV** | 4 | 4 | 0 | **205** | 15 | 2 |
- | **True: OT** | 15 | 15 | 42 | 20 | **163** | 16 |
- | **True: UV** | 8 | 4 | 10 | 5 | 27 | **153** |

- Cramér's V = .74 (6-class), .86 (5-class core only).

### Core Construct Discrimination (5-Class, Excluding OTHER)

- The OTHER category is not a psychological construct — it is a heterogeneous residual class that the model was never explicitly trained to recognise. To separately assess the model's ability to *discriminate among the five core EVT constructs*, we report a restricted analysis on items where both the human coder and the model assigned a core construct (n = 932, 72.5% of all items).

| Metric | Value | 95% BCa CI |
- |--------|-------|------------|
- | **Cohen's κ (5-class)** | 0.867 | [0.842, 0.891] |
- | Krippendorff's α | 0.867 | [0.841, 0.891] |
- | Overall accuracy | 0.897 | [0.876, 0.914] |
- | Macro F1 | 0.885 | [0.862, 0.906] |

- Cohen's κ = .87 indicates **almost perfect agreement** on construct discrimination.

| Class | Precision | Recall | F1 | n |
- |-------|-----------|--------|----|---|
- | ATTAINMENT_VALUE | .832 [.758, .899] | .864 [.794, .925] | .848 [.790, .896] | 103 |
- | COST | .815 [.748, .876] | .869 [.809, .922] | .841 [.791, .884] | 137 |
- | EXPECTANCY | .954 [.928, .977] | .909 [.875, .941] | .931 [.909, .951] | 297 |
- | INTRINSIC_VALUE | .887 [.845, .927] | .953 [.923, .980] | .919 [.891, .944] | 215 |
- | UTILITY_VALUE | .927 [.886, .964] | .850 [.796, .901] | .887 [.850, .920] | 180 |

- Additionally, when evaluating all items the human coded as a core construct (n = 1,015) — allowing the model to predict OTHER — the model achieved κ = .778 [.748, .805] and accuracy = .824 [.799, .846]. Of these 1,015 items, 83 (8.2%) were incorrectly pushed to OTHER by the model.

### Summary Across Evaluation Scenarios

| Scenario | κ | Macro F1 | N |
- |----------|---|----------|---|
- | Full 6-class (with OTHER) | .727 | .774 | 1,286 |
- | 5-class: both raters assigned core construct | **.867** | **.885** | 932 |
- | Human = core, model unrestricted | .778 | — | 1,015 |

- The jump from κ = .73 to κ = .87 confirms that the model's primary weakness lies at the **EVT/non-EVT detection boundary** (the OTHER category), not in discriminating among the five core constructs.
-
### Base Model Comparison
-
- To quantify the effect of domain-specific fine-tuning, the classifier was compared against the un-fine-tuned base model (`all-mpnet-base-v2`). The base model achieved a Cohen's κ = -0.04 vs. κ = 0.73 for the fine-tuned model (Δκ = +0.77). A McNemar test indicated a highly significant difference in error rates, χ²(1) = 791.37, p < .001.
-
### Known Limitations and Biases

- **Systematic label distribution differences.** The Stuart-Maxwell test for marginal homogeneity is significant (χ²(5) = 20.57, p = .001). The model tends to over-predict INTRINSIC_VALUE (+21 items), COST (+13), ATTAINMENT_VALUE (+12), and EXPECTANCY (+5) relative to the human coder. This is a direct consequence of the OvR + asymmetric loss design, which prioritises recall on core constructs at the expense of the OTHER category (which is under-predicted by 25 items).

- **OTHER is the weakest category** (κ = .54, F1 = .63). Because OTHER is defined as "no construct activated above threshold" rather than a learned class, the model tends to find *some* EVT signal in ambiguous items that a human would code as non-EVT. The most frequent error type overall is predicting EXPECTANCY when the true label is OTHER (42 cases, 14.6% of all errors). When OTHER is excluded, the model reaches κ = .87 — almost perfect agreement — on the five core constructs.

- **UTILITY_VALUE under-prediction.** The model is conservative on utility value (Δ = −26 items vs. human coder), with high precision (.845) but lower recall (.739). Items referencing indirect or long-term benefits may be harder for the model to detect.

- **Trained on synthetic data.** The training set consists of synthetically generated items, not real published instruments. While the model generalises well (κ = .73 overall, κ = .87 on core constructs), there may be stylistic patterns in human-authored items from specific subfields or cultures that the model has not encountered.

**English only.** The model has not been evaluated on items in other languages. Cross-lingual transfer should not be assumed.

library_name: transformers
tags:
- text-classification
+ - pytorch
+ - safetensors
+ - evt_classifier
+ - feature-extraction
- psychology
- expectancy-value-theory
- psychometrics
- motivation
- survey-instruments
- content-analysis
+ - custom_code
+ pipeline_tag: text-classification
datasets:
- christiqn/expectancy_value_pool
+ - christiqn/EVT-items
metrics:
- f1
+ - accuracy
model-index:
+ - name: mpnet-EVT-classifier
results:
- task:
type: text-classification
dataset:
+ name: EVT-items (Human-coded psychological test items)
+ type: christiqn/EVT-items
metrics:
- type: cohen_kappa
+ value: 0.7674
+ name: Cohen's Kappa (6-class, N=1284)
- type: f1
+ value: 0.8121
+ name: Macro F1 (6-class)
- type: accuracy
+ value: 0.8100
+ name: Accuracy (6-class)
- type: krippendorff_alpha
+ value: 0.7674
name: Krippendorff's Alpha
---

# EVT Item Classifier — Expectancy-Value Theory Construct Tagger
+
A fine-tuned text classifier that assigns psychological test items (e.g., survey questions, scale items) to one of six categories derived from **Expectancy-Value Theory** (Eccles et al., 1983; Eccles & Wigfield, 2002):
+
| Label | EVT Construct | Example item |
+ |---|---|---|
| `ATTAINMENT_VALUE` | Personal importance of doing well | *"Doing well in math is important to who I am."* |
| `COST` | Perceived negative consequences of engagement | *"I have to give up too much to succeed in this class."* |
| `EXPECTANCY` | Beliefs about future success | *"I am confident I can master the skills taught in this course."* |
| `INTRINSIC_VALUE` | Enjoyment and interest | *"I find the content of this course very interesting."* |
| `UTILITY_VALUE` | Usefulness for future goals | *"What I learn in this class will be useful for my career."* |
| `OTHER` | Not classifiable as an EVT construct | *"I usually sit in the front row."* |
+
+
## Intended Use
+
This model is intended for **academic research** in educational psychology, motivation science, and psychometrics. Typical use cases include:
+
- **Automated content analysis** of existing item pools and questionnaire banks for EVT construct coverage
- **Scale development assistance** — screening candidate items during the item-writing phase
- **Systematic reviews** — coding large corpora of test items from published instruments
- **Construct validation** — checking whether items align with their intended EVT construct
+
### Out-of-Scope Uses
+
- **Clinical or diagnostic decision-making** — This model classifies test *items*, not respondents. It should not be used to assess individuals.
+ - **Replacement for human coding** — The model achieves substantial but not perfect agreement with human coders (κ = .77). It is intended as a tool to assist researchers, not to replace expert judgment.
- **Non-English items** — The model was trained and evaluated on English-language items only.
- **Non-EVT constructs** — The model will force any input into one of the six categories. Items from unrelated theoretical frameworks (e.g., Big Five personality, cognitive load) will be classified as `OTHER` at best, or spuriously assigned to an EVT category.
+
+
## How to Use
+
### Quick Start
+
```python
from transformers import AutoTokenizer, AutoModel
import torch
+
+ # Load model
tokenizer = AutoTokenizer.from_pretrained("christiqn/mpnet-EVT-classifier")
model = AutoModel.from_pretrained("christiqn/mpnet-EVT-classifier", trust_remote_code=True)
model.eval()
+
# Inference
text = "I expect to do well in this course."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
+
with torch.no_grad():
    outputs = model(**inputs)
probs = torch.sigmoid(outputs.logits)
+
label = model.config.id2label[probs[0].argmax().item()]
print(f"Predicted: {label}")
```
+
### Adjusting the Decision Threshold
+
The default threshold of 0.50 balances precision and recall. You can adjust it:
+
- **Lower threshold (e.g., 0.35):** More inclusive — fewer items classified as OTHER, higher recall on EVT constructs, but more false positives.
- **Higher threshold (e.g., 0.65):** More conservative — more items fall to OTHER, higher precision on EVT constructs, but more false negatives.
+
Choose based on your application: use a lower threshold for exploratory screening, a higher one when precision matters.
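The threshold logic above can be sketched as a small post-processing helper. A minimal sketch: `predict_with_threshold`, the `id2label` mapping shown here, and the example probabilities are illustrative, not part of the model's published API.

```python
def predict_with_threshold(probs, id2label, threshold=0.50):
    """Map per-construct sigmoid probabilities to a final label.

    probs: sequence of 5 floats, one per core EVT construct.
    Falls back to OTHER when no probability clears the threshold.
    """
    best = max(range(len(probs)), key=lambda i: probs[i])
    if probs[best] < threshold:
        return "OTHER"  # no construct activated above threshold
    return id2label[best]

# Hypothetical sigmoid outputs for one item
id2label = {0: "ATTAINMENT_VALUE", 1: "COST", 2: "EXPECTANCY",
            3: "INTRINSIC_VALUE", 4: "UTILITY_VALUE"}
probs = [0.12, 0.08, 0.42, 0.30, 0.05]

print(predict_with_threshold(probs, id2label))                  # OTHER (0.42 < 0.50)
print(predict_with_threshold(probs, id2label, threshold=0.35))  # EXPECTANCY
```

Note that a plain `argmax` over the five construct outputs can never yield `OTHER`; the threshold check is what implements the OvR fallback.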
+
+
## Model Description
+
### Architecture
+
The model uses a custom classification head on top of `sentence-transformers/all-mpnet-base-v2` (110M parameters):
+
```
MPNet encoder

5 independent sigmoid outputs (one per EVT construct)
```
+
The `OTHER` class is not explicitly learned. Instead, an item is classified as `OTHER` when no construct's sigmoid probability exceeds the decision threshold (default: 0.50). This **One-vs-Rest (OvR)** formulation avoids forcing the model to learn a coherent representation for the heterogeneous `OTHER` category.
+
### Key Design Choices
+
| Component | Rationale |
+ |---|---|
| **ConcatPooling** | Concatenating mean and max pooling captures both distributional and salient features from short text (Howard & Ruder, 2018) |
| **NormLinear (cosine) head** | L2-normalised features and weights make classification direction-based, improving robustness to domain shift (Wang et al., 2018) |
+ | **Asymmetric Loss** (γ_neg=4, γ_pos=1) | Down-weights easy negatives in the multi-label sigmoid setup, preventing the dominant negative class from overwhelming the loss (Ridnik et al., 2021) |
| **FGM adversarial training** | Embedding-space perturbations regularise the model, improving generalisation from synthetic to real items (Miyato et al., 2017) |
| **LLRD** (decay = 0.9) | Lower layers receive smaller learning rates, preserving pre-trained linguistic knowledge (Sun et al., 2019) |
| **Gradual unfreezing** | Head-only training for epoch 1, then full end-to-end fine-tuning, to prevent catastrophic forgetting (Howard & Ruder, 2018) |
| **SWA** | Averaging top-3 checkpoints produces a flatter minimum and marginally improves κ (Izmailov et al., 2018) |
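As an illustration of the LLRD entry above: with decay = 0.9, each layer receives 0.9× the learning rate of the layer above it, so lower (more general) layers change least. A minimal sketch, assuming a 12-layer encoder (as in mpnet-base) with the head on top; the grouping is illustrative, not the actual training script:

```python
# Layer-wise learning-rate decay (LLRD): lr shrinks geometrically with depth.
base_lr, decay, num_layers = 3e-5, 0.9, 12

# layer 12 = head / top encoder layer, layer 0 = embeddings
layer_lrs = {layer: base_lr * decay ** (num_layers - layer)
             for layer in range(num_layers + 1)}

print(f"head lr      = {layer_lrs[12]:.2e}")  # 3.00e-05
print(f"embedding lr = {layer_lrs[0]:.2e}")   # 8.47e-06
```

In practice each dictionary entry would become one optimizer parameter group for the corresponding layer's parameters.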
+
+
## Training
+
### Training Data
+
The model was fine-tuned on the [expectancy_value_pool_v2](https://huggingface.co/datasets/christiqn/expectancy_value_pool_v2) dataset — approximately 15,000 synthetic psychological test items generated with construct-level attribution, filtered to remove participial-phrase fragments. The data was split 85/7.5/7.5 into train/validation/test sets with stratified sampling.
+
### Training Procedure
+
- **Epochs:** 12 (with early stopping, patience = 4, monitored on validation κ)
- **Batch size:** 24 per device × 2 gradient accumulation steps = 48 effective
- **Optimizer:** AdamW (lr = 3e-5, weight decay = 0.01)
- **Scheduler:** Linear with 10% warmup
- **Precision:** bf16 mixed-precision
+ - **Hardware:** Single NVIDIA H100 GPU
- **Post-training:** Stochastic Weight Averaging of top-3 checkpoints
+
### Training Results (Held-Out Synthetic Test Set)
+
| Metric | Value |
+ |---|---|
| Accuracy | 0.990 |
| Macro F1 | 0.990 |
| Cohen's κ | 0.990 |
+
> **Note:** These near-perfect numbers reflect performance on the synthetic data distribution. Real-world performance on human-authored items is reported below.

+
## Evaluation

### Test Set

+ The model was evaluated on **N = 1,284 human-coded test items** from [christiqn/EVT-items](https://huggingface.co/datasets/christiqn/EVT-items), drawn from published and unpublished psychological instruments spanning 95 distinct scales. Items were independently coded by a human expert and by the model. Items marked as formatting artefacts, third-person phrasing, or multiple anchors were excluded prior to evaluation. All confidence intervals are bias-corrected and accelerated (BCa) bootstrap CIs with B = 10,000.

### Agreement with Human Coder (Full 6-Class)

| Metric | Value | 95% BCa CI |
+ |---|---|---|
+ | **Cohen's κ (unweighted)** | **0.767** | [0.741, 0.793] |
+ | Cohen's κ (linear weighted) | 0.768 | [0.735, 0.797] |
+ | Krippendorff's α | 0.767 | [0.740, 0.793] |
+ | PABAK | 0.772 | — |
+ | Overall accuracy | 0.810 | [0.787, 0.830] |
+ | Macro F1 | 0.812 | [0.790, 0.834] |
+ | Weighted F1 | 0.807 | [0.785, 0.828] |

+ Cohen's κ = .77 indicates **substantial agreement** according to Landis & Koch (1977).

### Per-Class Performance (6-Class)

| Class | Precision | Recall | F1 | κ (OvR) | n |
+ |---|---|---|---|---|---|
+ | ATTAINMENT_VALUE | .800 [.726, .868] | .865 [.798, .926] | .831 [.775, .880] | .815 | 111 |
+ | COST | .795 [.732, .854] | .859 [.800, .912] | .826 [.778, .869] | .802 | 149 |
+ | EXPECTANCY | .843 [.801, .881] | .886 [.850, .921] | .864 [.834, .892] | .820 | 308 |
+ | INTRINSIC_VALUE | .839 [.793, .883] | .893 [.852, .931] | .865 [.831, .896] | .834 | 234 |
+ | OTHER | .726 [.668, .780] | .636 [.580, .691] | .678 [.631, .722] | .595 | 283 |
+ | UTILITY_VALUE | .846 [.793, .897] | .774 [.716, .831] | .808 [.764, .849] | .775 | 199 |
+
### Confusion Matrix (6-Class)

| | Pred: AV | Pred: CO | Pred: EX | Pred: IV | Pred: OT | Pred: UV |
+ |---|---|---|---|---|---|---|
+ | **True: AV** | **96** | 1 | 1 | 5 | 4 | 4 |
+ | **True: CO** | 1 | **128** | 1 | 10 | 7 | 2 |
+ | **True: EX** | 0 | 11 | **273** | 5 | 18 | 1 |
+ | **True: IV** | 4 | 4 | 0 | **209** | 15 | 2 |
+ | **True: OT** | 13 | 14 | 40 | 17 | **180** | 19 |
+ | **True: UV** | 6 | 3 | 9 | 3 | 24 | **154** |
+
+ Cramér's V = 0.782 (6-class).
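The headline 6-class agreement figures can be reproduced directly from the confusion matrix above (rows = human labels, columns = model predictions); a minimal sketch:

```python
# Confusion matrix from the table above, order AV, CO, EX, IV, OT, UV.
cm = [
    [96,   1,   1,   5,   4,   4],
    [ 1, 128,   1,  10,   7,   2],
    [ 0,  11, 273,   5,  18,   1],
    [ 4,   4,   0, 209,  15,   2],
    [13,  14,  40,  17, 180,  19],
    [ 6,   3,   9,   3,  24, 154],
]
n = sum(map(sum, cm))                                  # 1,284 items
p_o = sum(cm[i][i] for i in range(6)) / n              # observed agreement
human = [sum(row) for row in cm]                       # row marginals
model = [sum(cm[i][j] for i in range(6)) for j in range(6)]  # column marginals
p_e = sum(h * m for h, m in zip(human, model)) / n**2  # chance agreement
kappa = (p_o - p_e) / (1 - p_e)

print(f"accuracy = {p_o:.3f}, Cohen's kappa = {kappa:.4f}")
# accuracy = 0.810, Cohen's kappa = 0.7674
```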

+ ### Marginal Homogeneity (Stuart-Maxwell Test)
+
+ The Stuart-Maxwell test for marginal homogeneity is significant (χ²(5) = 18.62, p = .002), indicating a systematic difference between the model's and human coder's label distributions. The model tends to over-predict EXPECTANCY (+16), INTRINSIC_VALUE (+15), COST (+12), and ATTAINMENT_VALUE (+9) relative to the human coder, while under-predicting OTHER (−35) and UTILITY_VALUE (−17). This is a direct consequence of the OvR + asymmetric loss design, which prioritises recall on core constructs.
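The over- and under-prediction counts quoted above are simply the column-minus-row marginal differences of the 6-class confusion matrix; a quick check:

```python
# Same confusion matrix as in the evaluation section
# (rows = human, columns = model; order AV, CO, EX, IV, OT, UV).
cm = [
    [96,   1,   1,   5,   4,   4],
    [ 1, 128,   1,  10,   7,   2],
    [ 0,  11, 273,   5,  18,   1],
    [ 4,   4,   0, 209,  15,   2],
    [13,  14,  40,  17, 180,  19],
    [ 6,   3,   9,   3,  24, 154],
]
labels = ["ATTAINMENT_VALUE", "COST", "EXPECTANCY",
          "INTRINSIC_VALUE", "OTHER", "UTILITY_VALUE"]
# model predictions (column total) minus human codes (row total) per class
deltas = {lab: sum(cm[i][j] for i in range(6)) - sum(cm[j])
          for j, lab in enumerate(labels)}
print(deltas)
# {'ATTAINMENT_VALUE': 9, 'COST': 12, 'EXPECTANCY': 16,
#  'INTRINSIC_VALUE': 15, 'OTHER': -35, 'UTILITY_VALUE': -17}
```

The deltas necessarily sum to zero, since both marginals total the same 1,284 items.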

### Core Construct Discrimination (5-Class, Excluding OTHER)

+ The OTHER category is not a psychological construct — it is a heterogeneous residual class that the model was never explicitly trained to recognise. To separately assess the model's ability to *discriminate among the five core EVT constructs*, a restricted analysis was conducted on items where both the human coder and the model assigned a core construct (n = 933, 72.7% of all items).

| Metric | Value | 95% BCa CI |
+ |---|---|---|
+ | **Cohen's κ (5-class)** | **0.899** | [0.876, 0.920] |
+ | Krippendorff's α | 0.899 | [0.876, 0.921] |
+ | Overall accuracy | 0.922 | [0.901, 0.937] |
+ | Macro F1 | 0.915 | [0.895, 0.933] |
+ | PABAK | 0.902 | — |

+ Cohen's κ = .90 indicates **almost perfect agreement** on construct discrimination.

| Class | Precision | Recall | F1 | n |
+ |---|---|---|---|---|
+ | ATTAINMENT_VALUE | .897 [.838, .950] | .897 [.835, .951] | .897 [.851, .937] | 107 |
+ | COST | .871 [.812, .922] | .901 [.849, .948] | .886 [.843, .922] | 142 |
+ | EXPECTANCY | .961 [.937, .982] | .941 [.912, .967] | .951 [.932, .968] | 290 |
+ | INTRINSIC_VALUE | .901 [.861, .938] | .954 [.924, .980] | .927 [.901, .950] | 219 |
+ | UTILITY_VALUE | .945 [.907, .977] | .880 [.830, .926] | .911 [.877, .941] | 175 |

+ Additionally, when evaluating all items the human coded as a core construct (n = 1,001) — allowing the model to predict OTHER — the model achieved κ = 0.822 [0.795, 0.848] and accuracy = 0.859 [0.835, 0.878]. Of these 1,001 items, 68 (6.8%) were incorrectly pushed to OTHER by the model.

### Summary Across Evaluation Scenarios

| Scenario | κ | Macro F1 | N |
+ |---|---|---|---|
+ | Full 6-class (with OTHER) | 0.767 | 0.812 | 1,284 |
+ | 5-class: both raters assigned core construct | **0.899** | **0.915** | 933 |
+ | Human = core, model unrestricted | 0.822 | — | 1,001 |
+
+ The jump from κ = .77 to κ = .90 confirms that the model's primary weakness lies at the **EVT/non-EVT detection boundary** (the OTHER category), not in discriminating among the five core constructs.

### Base Model Comparison
+
+ To quantify the effect of domain-specific fine-tuning, the classifier was compared against the un-fine-tuned base model (`all-mpnet-base-v2`). The base model achieved κ = −0.054 [−0.073, −0.033] vs. κ = 0.767 [0.741, 0.793] for the fine-tuned model (Δκ = +0.821 [0.787, 0.854]). A McNemar test indicated a highly significant difference in error rates, χ²(1) = 867.91, p < .001 (fine-tuned correct only: 911 items; base correct only: 14 items). The fine-tuned model outperforms the base model on all six classes.
+
+ | Class | Base F1 | Fine-Tuned F1 | Δ |
+ |---|---|---|---|
+ | ATTAINMENT_VALUE | 0.051 | 0.831 | +0.780 |
+ | COST | 0.090 | 0.826 | +0.736 |
+ | EXPECTANCY | 0.169 | 0.864 | +0.695 |
+ | INTRINSIC_VALUE | 0.200 | 0.865 | +0.666 |
+ | OTHER | 0.007 | 0.678 | +0.671 |
+ | UTILITY_VALUE | 0.010 | 0.808 | +0.799 |
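The McNemar statistic reported in this section follows from the discordant counts alone; a quick check, assuming the standard continuity-corrected form of the test:

```python
# Discordant pairs from the base-model comparison:
# b = items only the fine-tuned model classified correctly,
# c = items only the base model classified correctly.
b, c = 911, 14
chi2 = (abs(b - c) - 1) ** 2 / (b + c)  # continuity-corrected McNemar statistic, df = 1
print(f"chi2(1) = {chi2:.2f}")  # chi2(1) = 867.91
```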

### Known Limitations and Biases

+ **OTHER is the weakest category** (κ = .60, F1 = .68). Because OTHER is defined as "no construct activated above threshold" rather than a learned class, the model tends to find *some* EVT signal in ambiguous items that a human would code as non-EVT. The most frequent error type is predicting EXPECTANCY when the true label is OTHER (40 cases, 16.4% of all errors). When OTHER is excluded, the model reaches κ = .90 (almost perfect agreement) on the five core constructs.

+ **Systematic label distribution differences.** The Stuart-Maxwell test is significant (χ²(5) = 18.62, p = .002). The model over-predicts EXPECTANCY (+16), INTRINSIC_VALUE (+15), COST (+12), and ATTAINMENT_VALUE (+9), and under-predicts OTHER (−35) and UTILITY_VALUE (−17).

+ **UTILITY_VALUE under-recall.** The model is conservative on utility value (recall = .774 vs. precision = .846), suggesting that items referencing indirect or long-term benefits may be harder for the model to detect.

+ **Trained on synthetic data.** The training set consists of synthetically generated items, not real published instruments. While the model generalises well (κ = .77 overall, κ = .90 on core constructs), there may be stylistic patterns in human-authored items from specific subfields or cultures that the model has not encountered.

**English only.** The model has not been evaluated on items in other languages. Cross-lingual transfer should not be assumed.
283