christiqn committed on
Commit ac4c3a6 · verified · 1 Parent(s): 0ad6cee

Update README.md

Files changed (1)
  1. README.md +60 -23
README.md CHANGED
@@ -133,15 +133,15 @@ The model was fine-tuned on the [expectancy_value_pool_v2](https://huggingface.c
  | Cohen's κ | 0.990 |

  > **Note:** These near-perfect numbers reflect performance on the synthetic data distribution. Real-world performance on human-authored items is reported below.
-
  ## Evaluation
-
  ### Test Set
-
- The model was evaluated on **N = 1,295 human-coded test items** drawn from published and unpublished psychological instruments. Items were independently coded by a human expert and by the model. Items marked as formatting artefacts or third-person phrasing were excluded prior to evaluation.
-
- ### Agreement with Human Coder
-
  | Metric | Value | 95% BCa CI |
  |--------|-------|------------|
  | **Cohen's κ (unweighted)** | 0.712 | [0.684, 0.740] |
@@ -151,11 +151,11 @@ The model was evaluated on **N = 1,295 human-coded test items** drawn from publi
  | Overall accuracy | 0.765 | [0.741, 0.786] |
  | Macro F1 | 0.758 | [0.735, 0.782] |
  | Weighted F1 | 0.764 | [0.740, 0.786] |
-
- Cohen's κ = .71 indicates **substantial agreement** according to Landis & Koch (1977). All confidence intervals are bias-corrected and accelerated bootstrap CIs with B = 10,000.
-
- ### Per-Class Performance
-
  | Class | Precision | Recall | F1 | κ (OvR) | n |
  |-------|-----------|--------|----|---------|---|
  | ATTAINMENT_VALUE | .727 [.648, .804] | .756 [.676, .829] | .741 [.676, .798] | .713 | 123 |
@@ -164,9 +164,42 @@ Cohen's κ = .71 indicates **substantial agreement** according to Landis & Koch
  | INTRINSIC_VALUE | .838 [.789, .884] | .852 [.805, .895] | .845 [.808, .879] | .810 | 236 |
  | OTHER | .621 [.561, .681] | .606 [.546, .667] | .614 [.561, .662] | .523 | 249 |
  | UTILITY_VALUE | .860 [.806, .911] | .708 [.645, .769] | .777 [.729, .821] | .739 | 209 |
-
- ### Confusion Matrix
-
  | | Pred: AV | Pred: CO | Pred: EX | Pred: IV | Pred: OT | Pred: UV |
  |---|---:|---:|---:|---:|---:|---:|
  | **True: AV** | **93** | 3 | 7 | 7 | 9 | 4 |
@@ -175,15 +208,19 @@ Cohen's κ = .71 indicates **substantial agreement** according to Landis & Koch
  | **True: IV** | 4 | 6 | 4 | **201** | 17 | 4 |
  | **True: OT** | 15 | 18 | 39 | 14 | **151** | 12 |
  | **True: UV** | 9 | 5 | 13 | 4 | 30 | **148** |
-
  ### Known Limitations and Biases
-
- **Systematic over-assignment of EVT constructs.** The Stuart-Maxwell test for marginal homogeneity is significant (χ²(5) = 20.70, p < .001). The model over-predicts EXPECTANCY (+19 items) and COST (+15) while under-predicting UTILITY_VALUE (−37) and OTHER (−6) relative to the human coder. This is a direct consequence of the OvR + asymmetric loss design, which prioritises recall on core constructs at the expense of the OTHER category.
-
- **OTHER is the weakest category** (κ = .52, F1 = .61). Because OTHER is defined as "no construct activated above threshold" rather than a learned class, the model tends to find *some* EVT signal in ambiguous items that a human would code as non-EVT. The most frequent error type is OTHER → EXPECTANCY (39 cases, 12.8% of all errors).
-
- **Trained on synthetic data.** The training set consists of synthetically generated items, not real published instruments. While the model generalises reasonably well (as shown by the κ = .71 on real items), there may be stylistic patterns in human-authored items from specific subfields or cultures that the model has not encountered.
-
  **English only.** The model has not been evaluated on items in other languages. Cross-lingual transfer should not be assumed.

  ## How to Use
 
  | Cohen's κ | 0.990 |

  > **Note:** These near-perfect numbers reflect performance on the synthetic data distribution. Real-world performance on human-authored items is reported below.
+
  ## Evaluation
+
  ### Test Set
+
+ The model was evaluated on **N = 1,295 human-coded test items** drawn from published and unpublished psychological instruments. Items were independently coded by a human expert and by the model. Items marked as formatting artefacts or third-person phrasing were excluded prior to evaluation. All confidence intervals are bias-corrected and accelerated (BCa) bootstrap CIs with B = 10,000.
+
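For readers unfamiliar with the interval type named in the Test Set paragraph, the following is a minimal pure-Python sketch of a BCa bootstrap for one statistic (accuracy over a 0/1 correctness vector). It is not the evaluation code behind the reported numbers: the function name `bca_interval`, the toy data, and the reduced resample count (2,000 rather than B = 10,000) are assumptions made for illustration.

```python
import random
from statistics import NormalDist, mean

def bca_interval(data, stat=mean, n_resamples=2000, alpha=0.05, seed=0):
    """Bias-corrected and accelerated (BCa) bootstrap interval for stat(data)."""
    rng = random.Random(seed)
    norm = NormalDist()
    n = len(data)
    theta_hat = stat(data)

    # Bootstrap distribution of the statistic (resampling with replacement).
    boot = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))

    # Bias correction: share of bootstrap estimates below the point estimate,
    # clamped away from 0 and 1 so the inverse normal CDF is defined.
    below = sum(b < theta_hat for b in boot) / n_resamples
    below = min(max(below, 1 / (n_resamples + 1)), n_resamples / (n_resamples + 1))
    z0 = norm.inv_cdf(below)

    # Acceleration from leave-one-out (jackknife) estimates.
    jack = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    jack_mean = mean(jack)
    num = sum((jack_mean - j) ** 3 for j in jack)
    den = 6.0 * sum((jack_mean - j) ** 2 for j in jack) ** 1.5
    accel = num / den if den > 0 else 0.0

    def adjusted_quantile(q):
        # Shift the nominal quantile level by z0 and accel, then read it
        # off the sorted bootstrap distribution.
        z = norm.inv_cdf(q)
        level = norm.cdf(z0 + (z0 + z) / (1 - accel * (z0 + z)))
        index = min(max(int(level * n_resamples), 0), n_resamples - 1)
        return boot[index]

    return adjusted_quantile(alpha / 2), adjusted_quantile(1 - alpha / 2)

# Hypothetical toy data: 153 correct and 47 incorrect predictions (accuracy 0.765).
correct = [1] * 153 + [0] * 47
low, high = bca_interval(correct)
print(f"accuracy = {mean(correct):.3f}, 95% BCa CI = [{low:.3f}, {high:.3f}]")
```

In practice a library routine is the more likely choice; `scipy.stats.bootstrap` offers `method='BCa'`, though whether it was used here is not stated.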
+ ### Agreement with Human Coder (Full 6-Class)
+
  | Metric | Value | 95% BCa CI |
  |--------|-------|------------|
  | **Cohen's κ (unweighted)** | 0.712 | [0.684, 0.740] |

  | Overall accuracy | 0.765 | [0.741, 0.786] |
  | Macro F1 | 0.758 | [0.735, 0.782] |
  | Weighted F1 | 0.764 | [0.740, 0.786] |
+
+ Cohen's κ = .71 indicates **substantial agreement** according to Landis & Koch (1977).
+
+ ### Per-Class Performance (6-Class)
+
  | Class | Precision | Recall | F1 | κ (OvR) | n |
  |-------|-----------|--------|----|---------|---|
  | ATTAINMENT_VALUE | .727 [.648, .804] | .756 [.676, .829] | .741 [.676, .798] | .713 | 123 |

  | INTRINSIC_VALUE | .838 [.789, .884] | .852 [.805, .895] | .845 [.808, .879] | .810 | 236 |
  | OTHER | .621 [.561, .681] | .606 [.546, .667] | .614 [.561, .662] | .523 | 249 |
  | UTILITY_VALUE | .860 [.806, .911] | .708 [.645, .769] | .777 [.729, .821] | .739 | 209 |
+
+ ### Core Construct Discrimination (5-Class, Excluding OTHER)
+
+ The OTHER category is not a psychological construct — it is a heterogeneous residual class that the model was never explicitly trained to recognise. To separately assess the model's ability to *discriminate among the five core EVT constructs*, we report a restricted analysis on items where both the human coder and the model assigned a core construct (n = 954, 73.7% of all items).
+
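The restricted analysis described above amounts to filtering the paired labels before computing agreement. The sketch below shows that filter together with an unweighted Cohen's κ in plain Python. The helper names `cohen_kappa` and `restrict_to_core` and the eight toy items are hypothetical and are not drawn from the test set.

```python
from collections import Counter

CORE = {"ATTAINMENT_VALUE", "COST", "EXPECTANCY", "INTRINSIC_VALUE", "UTILITY_VALUE"}

def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of the two raters' marginal proportions per label.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

def restrict_to_core(human, model):
    """Keep only items where BOTH raters assigned one of the five core constructs."""
    pairs = [(h, m) for h, m in zip(human, model) if h in CORE and m in CORE]
    return [h for h, _ in pairs], [m for _, m in pairs]

# Hypothetical toy labels, for illustration only.
human = ["EXPECTANCY", "EXPECTANCY", "INTRINSIC_VALUE", "OTHER",
         "UTILITY_VALUE", "COST", "OTHER", "ATTAINMENT_VALUE"]
model = ["EXPECTANCY", "EXPECTANCY", "INTRINSIC_VALUE", "EXPECTANCY",
         "UTILITY_VALUE", "COST", "OTHER", "OTHER"]

kappa_full = cohen_kappa(human, model)
kappa_core = cohen_kappa(*restrict_to_core(human, model))
print(f"6-class kappa = {kappa_full:.3f}, 5-class kappa = {kappa_core:.3f}")
```

In this toy set both disagreements involve OTHER, so dropping those items leaves perfect agreement. The real data shows the same direction of effect at a smaller magnitude.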
+ | Metric | Value | 95% BCa CI |
+ |--------|-------|------------|
+ | **Cohen's κ (5-class)** | 0.845 | [0.816, 0.869] |
+ | Krippendorff's α | 0.845 | [0.816, 0.869] |
+ | Overall accuracy | 0.880 | [0.856, 0.898] |
+ | Macro F1 | 0.867 | [0.843, 0.889] |
+
+ Cohen's κ = .84 indicates **almost perfect agreement** on construct discrimination.
+
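The verbal labels attached to κ in this README follow the Landis & Koch (1977) bands. A small lookup makes the mapping explicit; the function name `landis_koch_label` is a hypothetical helper, not part of the model's code.

```python
def landis_koch_label(kappa):
    """Map a kappa value to the Landis & Koch (1977) descriptive band."""
    if kappa < 0.0:
        return "poor"
    bands = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
             (0.80, "substantial"), (1.00, "almost perfect")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")

# Values reported in the 6-class and 5-class tables.
for name, value in [("6-class", 0.712), ("5-class core", 0.845)]:
    print(name, value, landis_koch_label(value))
```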
181
+ | Class | Precision | Recall | F1 | n |
182
+ |-------|-----------|--------|----|---|
183
+ | ATTAINMENT_VALUE | .823 [.750, .891] | .816 [.742, .884] | .819 [.762, .870] | 114 |
184
+ | COST | .793 [.728, .856] | .856 [.794, .912] | .824 [.773, .869] | 139 |
185
+ | EXPECTANCY | .911 [.878, .942] | .917 [.885, .946] | .914 [.890, .936] | 303 |
186
+ | INTRINSIC_VALUE | .889 [.848, .928] | .918 [.880, .953] | .903 [.874, .931] | 219 |
187
+ | UTILITY_VALUE | .925 [.882, .964] | .827 [.769, .881] | .873 [.833, .910] | 179 |
188
+
189
+ Additionally, when evaluating all items the human coded as a core construct (n = 1,046) — allowing the model to predict OTHER — the model achieved κ = .75 [.721, .780] and accuracy = .80 [.775, .824]. Of these 1,046 items, 92 (8.8%) were incorrectly pushed to OTHER by the model.
190
+
191
+ ### Summary Across Evaluation Scenarios
192
+
193
+ | Scenario | κ | Macro F1 | N |
194
+ |----------|---|----------|---|
195
+ | Full 6-class (with OTHER) | .712 | .758 | 1,295 |
196
+ | 5-class: both raters assigned core construct | **.845** | **.867** | 954 |
197
+ | Human = core, model unrestricted | .752 | — | 1,046 |
198
+
199
+ The jump from κ = .71 to κ = .84 confirms that the model's primary weakness lies at the **EVT/non-EVT detection boundary** (the OTHER category), not in discriminating among the five core constructs.
200
+
201
+ ### Confusion Matrix (6-Class)
202
+
203
  | | Pred: AV | Pred: CO | Pred: EX | Pred: IV | Pred: OT | Pred: UV |
204
  |---|---:|---:|---:|---:|---:|---:|
205
  | **True: AV** | **93** | 3 | 7 | 7 | 9 | 4 |
 
208
  | **True: IV** | 4 | 6 | 4 | **201** | 17 | 4 |
209
  | **True: OT** | 15 | 18 | 39 | 14 | **151** | 12 |
210
  | **True: UV** | 9 | 5 | 13 | 4 | 30 | **148** |
211
+
212
+ Cramér's V = .72 (6-class), .84 (5-class core only).
213
+
214
  ### Known Limitations and Biases
215
+
216
+ **Systematic over-assignment of EVT constructs.** The Stuart-Maxwell test for marginal homogeneity is significant (χ²(5) = 20.70, p < .001). The model over-predicts EXPECTANCY (+19 items) and COST (+15) while under-predicting UTILITY_VALUE (−37) relative to the human coder. This is a direct consequence of the OvR + asymmetric loss design, which prioritises recall on core constructs at the expense of the OTHER category.
217
+
218
+ **OTHER is the weakest category** (κ = .52, F1 = .61). Because OTHER is defined as "no construct activated above threshold" rather than a learned class, the model tends to find *some* EVT signal in ambiguous items that a human would code as non-EVT. The most frequent error type is OTHER → EXPECTANCY (39 cases, 12.8% of all errors). When OTHER is excluded, the model reaches κ = .84 — almost perfect agreement — on the five core constructs.
219
+
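The OTHER figures can be checked directly against the True: OT row of the 6-class confusion matrix. The short sketch below recomputes the recall and shows where human-coded OTHER items ended up. The variable names are illustrative, and the 39.8% share is derived here from the row counts rather than quoted from the README.

```python
# True-OTHER row of the 6-class confusion matrix (predicted label -> count).
true_other = {"AV": 15, "CO": 18, "EX": 39, "IV": 14, "OT": 151, "UV": 12}

support = sum(true_other.values())             # items the human coded OTHER
recall = true_other["OT"] / support            # model also said OTHER
pulled_into_evt = support - true_other["OT"]   # model assigned a core construct instead
share_to_expectancy = true_other["EX"] / pulled_into_evt

print(f"support = {support}, recall = {recall:.3f}")
print(f"{pulled_into_evt} OTHER items received a core construct; "
      f"{share_to_expectancy:.1%} of those went to EXPECTANCY")
```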
+ **UTILITY_VALUE under-prediction.** The model is conservative on utility value (−37 items vs. human coder), with high precision (.86) but lower recall (.71). Items referencing indirect or long-term benefits may be harder for the model to detect.
+
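The precision and recall quoted for UTILITY_VALUE can be reproduced from the True: UV row of the confusion matrix plus the −37 marginal difference. This sketch assumes that −37 means "model-predicted UV count minus human-coded UV count"; the variable names are illustrative.

```python
# True-UTILITY_VALUE row of the 6-class confusion matrix (predicted label -> count).
true_uv = {"AV": 9, "CO": 5, "EX": 13, "IV": 4, "OT": 30, "UV": 148}

support = sum(true_uv.values())   # items the human coded UTILITY_VALUE
predicted = support - 37          # model's UV total, assuming -37 is predicted minus human
hits = true_uv["UV"]              # items both raters labelled UTILITY_VALUE

recall = hits / support
precision = hits / predicted
f1 = 2 * precision * recall / (precision + recall)
print(f"precision = {precision:.3f}, recall = {recall:.3f}, F1 = {f1:.3f}")
```

Under that reading the values agree with the UTILITY_VALUE row of the 6-class per-class table.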
+ **Trained on synthetic data.** The training set consists of synthetically generated items, not real published instruments. While the model generalises well (κ = .71 overall, κ = .84 on core constructs), there may be stylistic patterns in human-authored items from specific subfields or cultures that the model has not encountered.
+
  **English only.** The model has not been evaluated on items in other languages. Cross-lingual transfer should not be assumed.

  ## How to Use