Update README.md

README.md +60 -23

@@ -133,15 +133,15 @@ The model was fine-tuned on the [expectancy_value_pool_v2](https://huggingface.c
 | Cohen's κ | 0.990 |
 > **Note:** These near-perfect numbers reflect performance on the synthetic data distribution. Real-world performance on human-authored items is reported below.
 ## Evaluation
 ### Test Set
-The model was evaluated on **N = 1,295 human-coded test items** drawn from published and unpublished psychological instruments. Items were independently coded by a human expert and by the model. Items marked as formatting artefacts or third-person phrasing were excluded prior to evaluation.
-### Agreement with Human Coder
 | Metric | Value | 95% BCa CI |
 |--------|-------|------------|
 | **Cohen's κ (unweighted)** | 0.712 | [0.684, 0.740] |
@@ -151,11 +151,11 @@ The model was evaluated on **N = 1,295 human-coded test items** drawn from publi
 | Overall accuracy | 0.765 | [0.741, 0.786] |
 | Macro F1 | 0.758 | [0.735, 0.782] |
 | Weighted F1 | 0.764 | [0.740, 0.786] |
-Cohen's κ = .71 indicates **substantial agreement** according to Landis & Koch (1977). All confidence intervals are bias-corrected and accelerated bootstrap CIs with B = 10,000.
-### Per-Class Performance
 | Class | Precision | Recall | F1 | κ (OvR) | n |
 |-------|-----------|--------|----|---------|---|
 | ATTAINMENT_VALUE | .727 [.648, .804] | .756 [.676, .829] | .741 [.676, .798] | .713 | 123 |
@@ -164,9 +164,42 @@ Cohen's κ = .71 indicates **substantial agreement** according to Landis & Koch
 | INTRINSIC_VALUE | .838 [.789, .884] | .852 [.805, .895] | .845 [.808, .879] | .810 | 236 |
 | OTHER | .621 [.561, .681] | .606 [.546, .667] | .614 [.561, .662] | .523 | 249 |
 | UTILITY_VALUE | .860 [.806, .911] | .708 [.645, .769] | .777 [.729, .821] | .739 | 209 |
-### Confusion Matrix
 | | Pred: AV | Pred: CO | Pred: EX | Pred: IV | Pred: OT | Pred: UV |
 |---|---:|---:|---:|---:|---:|---:|
 | **True: AV** | **93** | 3 | 7 | 7 | 9 | 4 |
@@ -175,15 +208,19 @@ Cohen's κ = .71 indicates **substantial agreement** according to Landis & Koch
 | **True: IV** | 4 | 6 | 4 | **201** | 17 | 4 |
 | **True: OT** | 15 | 18 | 39 | 14 | **151** | 12 |
 | **True: UV** | 9 | 5 | 13 | 4 | 30 | **148** |
 ### Known Limitations and Biases
-**Systematic over-assignment of EVT constructs.** The Stuart-Maxwell test for marginal homogeneity is significant (χ²(5) = 20.70, p < .001). The model over-predicts EXPECTANCY (+19 items) and COST (+15) while under-predicting UTILITY_VALUE (−37) and OTHER (−6) relative to the human coder. This is a direct consequence of the OvR + asymmetric loss design, which prioritises recall on core constructs at the expense of the OTHER category.
-**OTHER is the weakest category** (κ = .52, F1 = .61). Because OTHER is defined as "no construct activated above threshold" rather than a learned class, the model tends to find *some* EVT signal in ambiguous items that a human would code as non-EVT. The most frequent error type is OTHER → EXPECTANCY (39 cases, 12.8% of all errors).
-**Trained on synthetic data.** The training set consists of synthetically generated items, not real published instruments. While the model generalises reasonably well (as shown by the κ = .71 on real items), there may be stylistic patterns in human-authored items from specific subfields or cultures that the model has not encountered.
 **English only.** The model has not been evaluated on items in other languages. Cross-lingual transfer should not be assumed.
 ## How to Use

 | Cohen's κ | 0.990 |
 > **Note:** These near-perfect numbers reflect performance on the synthetic data distribution. Real-world performance on human-authored items is reported below.
 ## Evaluation
 ### Test Set
+The model was evaluated on **N = 1,295 human-coded test items** drawn from published and unpublished psychological instruments. Items were independently coded by a human expert and by the model. Items marked as formatting artefacts or third-person phrasing were excluded prior to evaluation. All confidence intervals are bias-corrected and accelerated (BCa) bootstrap CIs with B = 10,000.
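As a rough illustration of what a BCa interval involves, the following pure-Python sketch computes one for agreement accuracy on paired human/model labels. The function name `bca_interval`, the toy labels, and the reduced resample count (2,000 rather than B = 10,000) are illustrative assumptions; this is not the evaluation code behind the reported intervals.

```python
import random
from statistics import NormalDist


def accuracy(pairs):
    """Share of (human, model) pairs whose labels match."""
    return sum(h == m for h, m in pairs) / len(pairs)


def bca_interval(pairs, statistic, n_resamples=2000, alpha=0.05, seed=0):
    """Bias-corrected and accelerated (BCa) bootstrap interval for `statistic`."""
    rng = random.Random(seed)
    n = len(pairs)
    nd = NormalDist()
    theta_hat = statistic(pairs)

    # Resample items (label pairs) with replacement and recompute the statistic.
    boots = sorted(
        statistic([pairs[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )

    # Bias correction z0: based on the share of replicates below the estimate.
    below = sum(b < theta_hat for b in boots)
    prop = min(max(below / n_resamples, 1 / (n_resamples + 1)),
               n_resamples / (n_resamples + 1))
    z0 = nd.inv_cdf(prop)

    # Acceleration a: skewness of the leave-one-out (jackknife) estimates.
    jack = [statistic(pairs[:i] + pairs[i + 1:]) for i in range(n)]
    jack_mean = sum(jack) / n
    num = sum((jack_mean - j) ** 3 for j in jack)
    den = 6 * sum((jack_mean - j) ** 2 for j in jack) ** 1.5
    a = num / den if den else 0.0

    def adjusted_quantile(q):
        # Shift the nominal quantile using z0 and a.
        z = nd.inv_cdf(q)
        return nd.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))

    def pick(q):
        # Read the adjusted quantile off the sorted bootstrap replicates.
        idx = min(n_resamples - 1, max(0, int(q * n_resamples)))
        return boots[idx]

    return pick(adjusted_quantile(alpha / 2)), pick(adjusted_quantile(1 - alpha / 2))


# Toy data: 200 hypothetical items, 150 of which the two raters agree on.
toy_pairs = [("EXPECTANCY", "EXPECTANCY")] * 150 + [("OTHER", "EXPECTANCY")] * 50
low, high = bca_interval(toy_pairs, accuracy)
print(f"accuracy = {accuracy(toy_pairs):.3f}, 95% BCa CI = [{low:.3f}, {high:.3f}]")
```

Resampling whole (human, model) pairs keeps each item's two labels together, which is what a paired agreement statistic requires. Any other statistic that takes the list of pairs, such as κ or macro F1, could be passed in place of `accuracy`.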
+### Agreement with Human Coder (Full 6-Class)
 | Metric | Value | 95% BCa CI |
 |--------|-------|------------|
 | **Cohen's κ (unweighted)** | 0.712 | [0.684, 0.740] |
 | Overall accuracy | 0.765 | [0.741, 0.786] |
 | Macro F1 | 0.758 | [0.735, 0.782] |
 | Weighted F1 | 0.764 | [0.740, 0.786] |
+Cohen's κ = .71 indicates **substantial agreement** according to Landis & Koch (1977).
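For reference, unweighted Cohen's κ and its Landis & Koch (1977) label can be computed from a confusion matrix as in the sketch below. The helper names and the 2×2 toy matrix are illustrative assumptions, not part of the model's evaluation code.

```python
def cohens_kappa(matrix):
    """Unweighted Cohen's kappa from a square confusion matrix.

    Rows are one rater's labels, columns the other's.
    """
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of items on the diagonal.
    observed = sum(matrix[i][i] for i in range(k)) / total
    # Chance agreement: product of the two raters' marginal proportions.
    row_marg = [sum(row) / total for row in matrix]
    col_marg = [sum(matrix[i][j] for i in range(k)) / total for j in range(k)]
    expected = sum(r * c for r, c in zip(row_marg, col_marg))
    return (observed - expected) / (1 - expected)


def landis_koch_label(kappa):
    """Verbal benchmark for kappa following Landis & Koch (1977)."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical 2-class matrix, used only to exercise the functions.
toy_matrix = [[40, 10],
              [5, 45]]
toy_kappa = cohens_kappa(toy_matrix)
print(round(toy_kappa, 3), landis_koch_label(toy_kappa))
print(landis_koch_label(0.712))  # the 6-class kappa reported above
```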
+### Per-Class Performance (6-Class)
 | Class | Precision | Recall | F1 | κ (OvR) | n |
 |-------|-----------|--------|----|---------|---|
 | ATTAINMENT_VALUE | .727 [.648, .804] | .756 [.676, .829] | .741 [.676, .798] | .713 | 123 |
 | INTRINSIC_VALUE | .838 [.789, .884] | .852 [.805, .895] | .845 [.808, .879] | .810 | 236 |
 | OTHER | .621 [.561, .681] | .606 [.546, .667] | .614 [.561, .662] | .523 | 249 |
 | UTILITY_VALUE | .860 [.806, .911] | .708 [.645, .769] | .777 [.729, .821] | .739 | 209 |
+### Core Construct Discrimination (5-Class, Excluding OTHER)
+The OTHER category is not a psychological construct — it is a heterogeneous residual class that the model was never explicitly trained to recognise. To separately assess the model's ability to *discriminate among the five core EVT constructs*, we report a restricted analysis on items where both the human coder and the model assigned a core construct (n = 954, 73.7% of all items).
+| Metric | Value | 95% BCa CI |
+|--------|-------|------------|
+| **Cohen's κ (5-class)** | 0.845 | [0.816, 0.869] |
+| Krippendorff's α | 0.845 | [0.816, 0.869] |
+| Overall accuracy | 0.880 | [0.856, 0.898] |
+| Macro F1 | 0.867 | [0.843, 0.889] |
+Cohen's κ = .84 falls in the **almost perfect agreement** band (κ > .80) of Landis & Koch (1977) for discrimination among the five core constructs.
+| Class | Precision | Recall | F1 | n |
+|-------|-----------|--------|----|---|
+| ATTAINMENT_VALUE | .823 [.750, .891] | .816 [.742, .884] | .819 [.762, .870] | 114 |
+| COST | .793 [.728, .856] | .856 [.794, .912] | .824 [.773, .869] | 139 |
+| EXPECTANCY | .911 [.878, .942] | .917 [.885, .946] | .914 [.890, .936] | 303 |
+| INTRINSIC_VALUE | .889 [.848, .928] | .918 [.880, .953] | .903 [.874, .931] | 219 |
+| UTILITY_VALUE | .925 [.882, .964] | .827 [.769, .881] | .873 [.833, .910] | 179 |
+Additionally, when evaluating all items the human coded as a core construct (n = 1,046) while still allowing the model to predict OTHER, the model achieved κ = .75 [.721, .780] and accuracy = .80 [.775, .824]. Of these 1,046 items, the model assigned 92 (8.8%) to OTHER, which counts as an error in this scenario.
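The evaluation subsets described in this section differ only in which items are kept. A minimal sketch of that filtering follows, assuming the labels are available as (human, model) string pairs. The `split_scenarios` helper and the toy pairs are hypothetical; the final lines only re-derive the subset sizes quoted in the text.

```python
CORE = {
    "ATTAINMENT_VALUE",
    "COST",
    "EXPECTANCY",
    "INTRINSIC_VALUE",
    "UTILITY_VALUE",
}


def split_scenarios(pairs):
    """Return the three evaluation subsets as lists of (human, model) pairs."""
    return {
        # Full 6-class: every item, OTHER included.
        "full": list(pairs),
        # Human = core, model unrestricted: the model may still say OTHER.
        "human_core": [(h, m) for h, m in pairs if h in CORE],
        # 5-class: both raters assigned one of the five core constructs.
        "both_core": [(h, m) for h, m in pairs if h in CORE and m in CORE],
    }


def accuracy(pairs):
    """Share of pairs on which the two raters agree."""
    return sum(h == m for h, m in pairs) / len(pairs)


# Toy label pairs, purely illustrative.
toy_pairs = [
    ("EXPECTANCY", "EXPECTANCY"),
    ("COST", "COST"),
    ("UTILITY_VALUE", "OTHER"),
    ("OTHER", "EXPECTANCY"),
    ("INTRINSIC_VALUE", "ATTAINMENT_VALUE"),
    ("OTHER", "OTHER"),
]
scenarios = split_scenarios(toy_pairs)
for name, subset in scenarios.items():
    print(name, len(subset), round(accuracy(subset), 3))

# Consistency check on the counts reported in this section.
n_total, n_human_other, n_pushed_to_other = 1295, 249, 92
n_human_core = n_total - n_human_other
n_both_core = n_human_core - n_pushed_to_other
print(n_human_core, n_both_core, round(100 * n_both_core / n_total, 1))
```

Because the `both_core` subset drops every item on which either rater chose OTHER, its agreement figures describe discrimination among core constructs only, not detection of non-EVT items.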
+### Summary Across Evaluation Scenarios
+| Scenario | κ | Macro F1 | N |
+|----------|---|----------|---|
+| Full 6-class (with OTHER) | .712 | .758 | 1,295 |
+| 5-class: both raters assigned core construct | **.845** | **.867** | 954 |
+| Human = core, model unrestricted | .752 | — | 1,046 |
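+The rise from κ = .71 (6-class) to κ = .84 (5-class) indicates that the model's primary weakness lies at the **EVT/non-EVT detection boundary** (the OTHER category) rather than in discriminating among the five core constructs. Note that the 5-class figure is computed only on items that both the human coder and the model assigned to a core construct, so it excludes every disagreement involving OTHER.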
+### Confusion Matrix (6-Class)
 | | Pred: AV | Pred: CO | Pred: EX | Pred: IV | Pred: OT | Pred: UV |
 |---|---:|---:|---:|---:|---:|---:|
 | **True: AV** | **93** | 3 | 7 | 7 | 9 | 4 |
 | **True: IV** | 4 | 6 | 4 | **201** | 17 | 4 |
 | **True: OT** | 15 | 18 | 39 | 14 | **151** | 12 |
 | **True: UV** | 9 | 5 | 13 | 4 | 30 | **148** |
+Cramér's V = .72 (6-class), .84 (5-class core only).
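The per-class over- and under-prediction counts discussed under Known Limitations are column totals minus row totals of a confusion matrix laid out like the one above. The sketch below shows that calculation on a hypothetical 3-class matrix. The labels, counts, and the `marginal_shift` helper are assumptions for illustration, and the Stuart-Maxwell statistic itself is not computed here.

```python
def marginal_shift(matrix, labels):
    """Predicted count minus true count per class.

    `matrix[i][j]` is the number of items with true class i and predicted
    class j. Positive values mean the class is over-predicted.
    """
    k = len(labels)
    true_totals = [sum(matrix[i]) for i in range(k)]
    pred_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    return {labels[c]: pred_totals[c] - true_totals[c] for c in range(k)}


# Hypothetical 3-class confusion matrix (rows: true, columns: predicted).
toy_labels = ["A", "B", "C"]
toy_matrix = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 9],
]
shift = marginal_shift(toy_matrix, toy_labels)
print(shift)

# Every item is counted once per rater, so the shifts always sum to zero.
print(sum(shift.values()))
```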
 ### Known Limitations and Biases
+**Systematic over-assignment of EVT constructs.** The Stuart-Maxwell test for marginal homogeneity is significant (χ²(5) = 20.70, p < .001). The model over-predicts EXPECTANCY (+19 items) and COST (+15) while under-predicting UTILITY_VALUE (−37) relative to the human coder. This is a direct consequence of the OvR + asymmetric loss design, which prioritises recall on core constructs at the expense of the OTHER category.
+**OTHER is the weakest category** (κ = .52, F1 = .61). Because OTHER is defined as "no construct activated above threshold" rather than a learned class, the model tends to find *some* EVT signal in ambiguous items that a human would code as non-EVT. The most frequent error type is OTHER → EXPECTANCY (39 cases, 12.8% of all errors). When the analysis is restricted to items that both the human coder and the model assigned to a core construct, agreement rises to κ = .84 (almost perfect agreement).
+**UTILITY_VALUE under-prediction.** The model is conservative on utility value (Δ = −37 items vs. human coder), with high precision (.86) but lower recall (.71). Items referencing indirect or long-term benefits may be harder for the model to detect.
+**Trained on synthetic data.** The training set consists of synthetically generated items, not real published instruments. While the model generalises well (κ = .71 overall, κ = .84 on core constructs), there may be stylistic patterns in human-authored items from specific subfields or cultures that the model has not encountered.
 **English only.** The model has not been evaluated on items in other languages. Cross-lingual transfer should not be assumed.
 ## How to Use