Update README.md
README.md
CHANGED
|
@@ -133,15 +133,15 @@ The model was fine-tuned on the [expectancy_value_pool_v2](https://huggingface.c
|
|
| 133 |
| Cohen's κ | 0.990 |
|
| 134 |
|
| 135 |
> **Note:** These near-perfect numbers reflect performance on the synthetic data distribution. Real-world performance on human-authored items is reported below.
|
| 136 |
-
|
| 137 |
## Evaluation
|
| 138 |
-
|
| 139 |
### Test Set
|
| 140 |
-
|
| 141 |
-
The model was evaluated on **N = 1,295 human-coded test items** drawn from published and unpublished psychological instruments. Items were independently coded by a human expert and by the model. Items marked as formatting artefacts or third-person phrasing were excluded prior to evaluation.
|
| 142 |
-
|
| 143 |
-
### Agreement with Human Coder
|
| 144 |
-
|
| 145 |
| Metric | Value | 95% BCa CI |
|
| 146 |
|--------|-------|------------|
|
| 147 |
| **Cohen's κ (unweighted)** | 0.712 | [0.684, 0.740] |
|
|
@@ -151,11 +151,11 @@ The model was evaluated on **N = 1,295 human-coded test items** drawn from publi
|
|
| 151 |
| Overall accuracy | 0.765 | [0.741, 0.786] |
|
| 152 |
| Macro F1 | 0.758 | [0.735, 0.782] |
|
| 153 |
| Weighted F1 | 0.764 | [0.740, 0.786] |
|
| 154 |
-
|
| 155 |
-
Cohen's κ = .71 indicates **substantial agreement** according to Landis & Koch (1977).
|
| 156 |
-
|
| 157 |
-
### Per-Class Performance
|
| 158 |
-
|
| 159 |
| Class | Precision | Recall | F1 | κ (OvR) | n |
|
| 160 |
|-------|-----------|--------|----|---------|---|
|
| 161 |
| ATTAINMENT_VALUE | .727 [.648, .804] | .756 [.676, .829] | .741 [.676, .798] | .713 | 123 |
|
|
@@ -164,9 +164,42 @@ Cohen's κ = .71 indicates **substantial agreement** according to Landis & Koch
|
|
| 164 |
| INTRINSIC_VALUE | .838 [.789, .884] | .852 [.805, .895] | .845 [.808, .879] | .810 | 236 |
|
| 165 |
| OTHER | .621 [.561, .681] | .606 [.546, .667] | .614 [.561, .662] | .523 | 249 |
|
| 166 |
| UTILITY_VALUE | .860 [.806, .911] | .708 [.645, .769] | .777 [.729, .821] | .739 | 209 |
|
| 167 |
-
|
| 168 |
-
###
|
| 169 |
-
|
| 170 |
| | Pred: AV | Pred: CO | Pred: EX | Pred: IV | Pred: OT | Pred: UV |
|
| 171 |
|---|---:|---:|---:|---:|---:|---:|
|
| 172 |
| **True: AV** | **93** | 3 | 7 | 7 | 9 | 4 |
|
|
@@ -175,15 +208,19 @@ Cohen's κ = .71 indicates **substantial agreement** according to Landis & Koch
|
|
| 175 |
| **True: IV** | 4 | 6 | 4 | **201** | 17 | 4 |
|
| 176 |
| **True: OT** | 15 | 18 | 39 | 14 | **151** | 12 |
|
| 177 |
| **True: UV** | 9 | 5 | 13 | 4 | 30 | **148** |
|
| 178 |
-
|
| 179 |
### Known Limitations and Biases
|
| 180 |
-
|
| 181 |
-
**Systematic over-assignment of EVT constructs.** The Stuart-Maxwell test for marginal homogeneity is significant (χ²(5) = 20.70, p < .001). The model over-predicts EXPECTANCY (+19 items) and COST (+15) while under-predicting UTILITY_VALUE (−37)
|
| 182 |
-
|
| 183 |
-
**OTHER is the weakest category** (κ = .52, F1 = .61). Because OTHER is defined as "no construct activated above threshold" rather than a learned class, the model tends to find *some* EVT signal in ambiguous items that a human would code as non-EVT. The most frequent error type is OTHER → EXPECTANCY (39 cases, 12.8% of all errors).
|
| 184 |
-
|
| 185 |
-
**
|
| 186 |
-
|
| 187 |
**English only.** The model has not been evaluated on items in other languages. Cross-lingual transfer should not be assumed.
|
| 188 |
|
| 189 |
## How to Use
|
|
|
|
| 133 |
| Cohen's κ | 0.990 |
|
| 134 |
|
| 135 |
> **Note:** These near-perfect numbers reflect performance on the synthetic data distribution. Real-world performance on human-authored items is reported below.
|
| 136 |
+
|
| 137 |
## Evaluation
|
| 138 |
+
|
| 139 |
### Test Set
|
| 140 |
+
|
| 141 |
+
The model was evaluated on **N = 1,295 human-coded test items** drawn from published and unpublished psychological instruments. Each item was coded independently by a human expert and by the model. Items marked as formatting artefacts or third-person phrasing were excluded prior to evaluation. All confidence intervals are bias-corrected and accelerated (BCa) bootstrap intervals based on B = 10,000 resamples.
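The README does not show how the BCa intervals were computed. As a rough illustration only, the sketch below implements one textbook BCa construction with the Python standard library. The names `accuracy` and `bca_interval`, the toy label pairs, and the reduced resample count are assumptions made for this example, not the authors' code.

```python
import random
from statistics import NormalDist


def accuracy(pairs):
    """Share of (human, model) pairs whose labels match."""
    return sum(h == m for h, m in pairs) / len(pairs)


def bca_interval(pairs, statistic, n_resamples=1000, alpha=0.05, seed=0):
    """Return a (lower, upper) BCa bootstrap interval for statistic(pairs)."""
    rng = random.Random(seed)
    norm = NormalDist()
    n = len(pairs)
    theta_hat = statistic(pairs)

    # Resample whole (human, model) pairs so the pairing is preserved.
    boots = sorted(
        statistic([pairs[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )

    # Bias correction z0: where the point estimate sits in the bootstrap distribution.
    below = sum(b < theta_hat for b in boots) / n_resamples
    # Clamp away from 0 and 1 so the inverse normal CDF is defined.
    below = min(max(below, 1 / (n_resamples + 1)), n_resamples / (n_resamples + 1))
    z0 = norm.inv_cdf(below)

    # Acceleration a: skewness of the leave-one-out (jackknife) estimates.
    jack = [statistic(pairs[:i] + pairs[i + 1:]) for i in range(n)]
    jack_mean = sum(jack) / n
    num = sum((jack_mean - j) ** 3 for j in jack)
    den = 6.0 * sum((jack_mean - j) ** 2 for j in jack) ** 1.5
    a = num / den if den else 0.0

    def adjusted_quantile(q):
        # Shift the nominal quantile q using z0 and a, then read it off the sorted resamples.
        z = norm.inv_cdf(q)
        level = norm.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))
        index = min(max(int(level * n_resamples), 0), n_resamples - 1)
        return boots[index]

    return adjusted_quantile(alpha / 2), adjusted_quantile(1 - alpha / 2)


# Toy data (hypothetical): 200 items, 160 of which the two coders agree on.
pairs = [("A", "A")] * 160 + [("A", "B")] * 40
lo, hi = bca_interval(pairs, accuracy)
```

The same function accepts any statistic defined on the label pairs, so κ or macro F1 could be passed in place of `accuracy`. For real use, a maintained implementation such as `scipy.stats.bootstrap` with `method="BCa"` and `paired=True` is likely the safer choice.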
|
| 142 |
+
|
| 143 |
+
### Agreement with Human Coder (Full 6-Class)
|
| 144 |
+
|
| 145 |
| Metric | Value | 95% BCa CI |
|
| 146 |
|--------|-------|------------|
|
| 147 |
| **Cohen's κ (unweighted)** | 0.712 | [0.684, 0.740] |
|
|
|
|
| 151 |
| Overall accuracy | 0.765 | [0.741, 0.786] |
|
| 152 |
| Macro F1 | 0.758 | [0.735, 0.782] |
|
| 153 |
| Weighted F1 | 0.764 | [0.740, 0.786] |
|
| 154 |
+
|
| 155 |
+
Cohen's κ = .71 indicates **substantial agreement** according to Landis & Koch (1977).
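For readers who want to reproduce this kind of figure, here is a minimal sketch of unweighted Cohen's κ computed from a confusion matrix, together with the Landis & Koch (1977) verbal benchmarks. The function names and the 2 × 2 toy matrix are illustrative assumptions; the toy matrix is not the model's data.

```python
def cohens_kappa(matrix):
    """Unweighted Cohen's kappa from a square confusion matrix.

    Rows are the human coder's labels, columns are the model's labels.
    """
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Chance agreement: product of the two raters' marginal proportions, summed over classes.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)


def landis_koch_label(kappa):
    """Map a kappa value to the Landis & Koch (1977) verbal benchmark."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# Hypothetical two-class example: observed agreement 0.85, chance agreement 0.50.
toy_matrix = [[40, 10],
              [5, 45]]
toy_kappa = cohens_kappa(toy_matrix)
toy_label = landis_koch_label(toy_kappa)
```

Applied to the reported values, `landis_koch_label(0.712)` falls in the "substantial" band, matching the interpretation given above.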
|
| 156 |
+
|
| 157 |
+
### Per-Class Performance (6-Class)
|
| 158 |
+
|
| 159 |
| Class | Precision | Recall | F1 | κ (OvR) | n |
|
| 160 |
|-------|-----------|--------|----|---------|---|
|
| 161 |
| ATTAINMENT_VALUE | .727 [.648, .804] | .756 [.676, .829] | .741 [.676, .798] | .713 | 123 |
|
|
|
|
| 164 |
| INTRINSIC_VALUE | .838 [.789, .884] | .852 [.805, .895] | .845 [.808, .879] | .810 | 236 |
|
| 165 |
| OTHER | .621 [.561, .681] | .606 [.546, .667] | .614 [.561, .662] | .523 | 249 |
|
| 166 |
| UTILITY_VALUE | .860 [.806, .911] | .708 [.645, .769] | .777 [.729, .821] | .739 | 209 |
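The per-class columns follow from three counts per class: correct predictions, items the model assigned to the class, and items the human coder assigned to the class. The sketch below works this through for UTILITY_VALUE. The 148 correct predictions and 209 human-coded items come from this README; the 172 model-assigned items are inferred from the reported shortfall of 37 items relative to the human coder, so treat that figure as a derived assumption. The function name is illustrative.

```python
def precision_recall_f1(tp, n_predicted, n_true):
    """Precision, recall and F1 for one class from raw counts."""
    precision = tp / n_predicted   # share of the model's assignments that are correct
    recall = tp / n_true           # share of human-coded items the model recovered
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# UTILITY_VALUE: 148 correct, 209 human-coded, 209 - 37 = 172 model-assigned (inferred).
p, r, f1 = precision_recall_f1(tp=148, n_predicted=172, n_true=209)
```

Rounded to three decimals, these reproduce the UTILITY_VALUE row (.860, .708, .777).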
|
| 167 |
+
|
| 168 |
+
### Core Construct Discrimination (5-Class, Excluding OTHER)
|
| 169 |
+
|
| 170 |
+
The OTHER category is not a psychological construct — it is a heterogeneous residual class that the model was never explicitly trained to recognise. To separately assess the model's ability to *discriminate among the five core EVT constructs*, we report a restricted analysis on items where both the human coder and the model assigned a core construct (n = 954, 73.7% of all items).
|
| 171 |
+
|
| 172 |
+
| Metric | Value | 95% BCa CI |
|
| 173 |
+
|--------|-------|------------|
|
| 174 |
+
| **Cohen's κ (5-class)** | 0.845 | [0.816, 0.869] |
|
| 175 |
+
| Krippendorff's α | 0.845 | [0.816, 0.869] |
|
| 176 |
+
| Overall accuracy | 0.880 | [0.856, 0.898] |
|
| 177 |
+
| Macro F1 | 0.867 | [0.843, 0.889] |
|
| 178 |
+
|
| 179 |
+
Cohen's κ = .84 indicates **almost perfect agreement** on construct discrimination under the Landis & Koch (1977) benchmarks.
|
| 180 |
+
|
| 181 |
+
| Class | Precision | Recall | F1 | n |
|
| 182 |
+
|-------|-----------|--------|----|---|
|
| 183 |
+
| ATTAINMENT_VALUE | .823 [.750, .891] | .816 [.742, .884] | .819 [.762, .870] | 114 |
|
| 184 |
+
| COST | .793 [.728, .856] | .856 [.794, .912] | .824 [.773, .869] | 139 |
|
| 185 |
+
| EXPECTANCY | .911 [.878, .942] | .917 [.885, .946] | .914 [.890, .936] | 303 |
|
| 186 |
+
| INTRINSIC_VALUE | .889 [.848, .928] | .918 [.880, .953] | .903 [.874, .931] | 219 |
|
| 187 |
+
| UTILITY_VALUE | .925 [.882, .964] | .827 [.769, .881] | .873 [.833, .910] | 179 |
|
| 188 |
+
|
| 189 |
+
Additionally, when evaluating all items the human coded as a core construct (n = 1,046) — allowing the model to predict OTHER — the model achieved κ = .75 [.721, .780] and accuracy = .80 [.775, .824]. Of these 1,046 items, 92 (8.8%) were incorrectly pushed to OTHER by the model.
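The three evaluation scenarios differ only in which (human, model) label pairs are kept. A minimal sketch of that filtering, assuming the labels are plain strings matching the class names used in this README; `split_scenarios` and the toy pairs are hypothetical:

```python
CORE = {"ATTAINMENT_VALUE", "COST", "EXPECTANCY", "INTRINSIC_VALUE", "UTILITY_VALUE"}


def split_scenarios(pairs):
    """Split (human, model) label pairs into the three evaluation scenarios."""
    full = list(pairs)
    # Human coded a core construct; the model may still answer OTHER.
    human_core = [(h, m) for h, m in full if h in CORE]
    # Both raters assigned a core construct.
    both_core = [(h, m) for h, m in human_core if m in CORE]
    return {"full": full, "human_core": human_core, "both_core": both_core}


toy_pairs = [
    ("COST", "COST"),
    ("EXPECTANCY", "OTHER"),    # human = core, model pushed the item to OTHER
    ("OTHER", "EXPECTANCY"),    # human = OTHER, so dropped from both restricted sets
    ("UTILITY_VALUE", "COST"),
    ("OTHER", "OTHER"),
]
subsets = split_scenarios(toy_pairs)
pushed_to_other = len(subsets["human_core"]) - len(subsets["both_core"])

# The same subtraction with the counts reported above.
reported_pushed = 1046 - 954
reported_share = 100 * reported_pushed / 1046
```

With the reported counts, the difference between the two restricted sets is 92 items, or 8.8% of the 1,046 human-coded core items, matching the text.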
|
| 190 |
+
|
| 191 |
+
### Summary Across Evaluation Scenarios
|
| 192 |
+
|
| 193 |
+
| Scenario | κ | Macro F1 | N |
|
| 194 |
+
|----------|---|----------|---|
|
| 195 |
+
| Full 6-class (with OTHER) | .712 | .758 | 1,295 |
|
| 196 |
+
| 5-class: both raters assigned core construct | **.845** | **.867** | 954 |
|
| 197 |
+
| Human = core, model unrestricted | .752 | — | 1,046 |
|
| 198 |
+
|
| 199 |
+
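The rise from κ = .71 to κ = .84 indicates that the model's primary weakness lies at the **EVT/non-EVT detection boundary** (the OTHER category) rather than in discriminating among the five core constructs. Because the 5-class figure is conditional on both raters assigning a core construct, it describes discrimination quality, not end-to-end performance.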
|
| 200 |
+
|
| 201 |
+
### Confusion Matrix (6-Class)
|
| 202 |
+
|
| 203 |
| | Pred: AV | Pred: CO | Pred: EX | Pred: IV | Pred: OT | Pred: UV |
|
| 204 |
|---|---:|---:|---:|---:|---:|---:|
|
| 205 |
| **True: AV** | **93** | 3 | 7 | 7 | 9 | 4 |
|
|
|
|
| 208 |
| **True: IV** | 4 | 6 | 4 | **201** | 17 | 4 |
|
| 209 |
| **True: OT** | 15 | 18 | 39 | 14 | **151** | 12 |
|
| 210 |
| **True: UV** | 9 | 5 | 13 | 4 | 30 | **148** |
|
| 211 |
+
|
| 212 |
+
Cramér's V = .72 (6-class), .84 (5-class core only).
|
| 213 |
+
|
| 214 |
### Known Limitations and Biases
|
| 215 |
+
|
| 216 |
+
**Systematic over-assignment of EVT constructs.** The Stuart-Maxwell test for marginal homogeneity is significant (χ²(5) = 20.70, p < .001). Relative to the human coder, the model over-predicts EXPECTANCY (+19 items) and COST (+15) while under-predicting UTILITY_VALUE (−37). The over-prediction is a direct consequence of the one-vs-rest (OvR) + asymmetric loss design, which prioritises recall on core constructs at the expense of the OTHER category.
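The signed item counts quoted here are marginal shifts: for each class, the number of items the model assigned minus the number the human coder assigned. The Stuart-Maxwell test asks whether these shifts are jointly zero. The sketch below computes only the shifts, not the test statistic, and uses a hypothetical 3-class matrix rather than the model's confusion matrix.

```python
def marginal_shifts(matrix, labels):
    """Model-minus-human item count per class.

    Rows are human labels and columns are model labels, so the shift for a
    class is its column total minus its row total.
    """
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    return {label: c - r for label, r, c in zip(labels, row_totals, col_totals)}


# Hypothetical confusion matrix for three classes A, B and C.
toy_labels = ["A", "B", "C"]
toy_matrix = [[8, 1, 1],
              [2, 6, 2],
              [0, 1, 9]]
shifts = marginal_shifts(toy_matrix, toy_labels)
```

A positive shift means the model assigns the class more often than the human coder does. The shifts always sum to zero because both raters label the same items.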
|
| 217 |
+
|
| 218 |
+
**OTHER is the weakest category** (κ = .52, F1 = .61). Because OTHER is defined as "no construct activated above threshold" rather than a learned class, the model tends to find *some* EVT signal in ambiguous items that a human would code as non-EVT. The most frequent error type is OTHER → EXPECTANCY (39 cases, 12.8% of all errors). On the items where neither the human coder nor the model assigned OTHER (n = 954), the model reaches κ = .84, almost perfect agreement, on the five core constructs.
|
| 219 |
+
|
| 220 |
+
**UTILITY_VALUE under-prediction.** The model is conservative on utility value (Δ = −37 items vs. human coder), with high precision (.86) but lower recall (.71). Items referencing indirect or long-term benefits may be harder for the model to detect.
|
| 221 |
+
|
| 222 |
+
**Trained on synthetic data.** The training set consists of synthetically generated items, not real published instruments. While the model generalises well (κ = .71 overall, κ = .84 on core constructs), there may be stylistic patterns in human-authored items from specific subfields or cultures that the model has not encountered.
|
| 223 |
+
|
| 224 |
**English only.** The model has not been evaluated on items in other languages. Cross-lingual transfer should not be assumed.
|
| 225 |
|
| 226 |
## How to Use
|