permutans committed
Commit 84f7e31 · verified · 1 Parent(s): 02643c1

Upload folder using huggingface_hub

Files changed (2)
  1. README.md +60 -61
  2. model.safetensors +1 -1
README.md CHANGED
@@ -31,7 +31,7 @@ This model performs multi-label span-level detection of 53 rhetorical marker types
  | Base model | `bert-base-uncased` |
  | Task | Multi-label token classification (independent B/I/O per type) |
  | Marker types | 53 (22 oral, 31 literate) |
- | Test macro F1 | **0.400** (per-type detection, binary positive = B or I) |
+ | Test macro F1 | **0.386** (per-type detection, binary positive = B or I) |
  | Training | 20 epochs, batch 24, lr 3e-5, fp16 |
  | Regularization | Mixout (p=0.1) — stochastic L2 anchor to pretrained weights |
  | Loss | Per-type weighted cross-entropy with inverse-frequency type weights |
@@ -118,61 +118,61 @@ Per-type detection F1 on test set (binary: B or I = positive, O = negative):
  ```
  Type                             Prec   Rec    F1     Sup
  ========================================================================
- literate_abstract_noun           0.211  0.319  0.254    464
- literate_additive_formal         0.263  0.506  0.346     83
- literate_agent_demoted           0.364  0.629  0.461    291
- literate_agentless_passive       0.545  0.701  0.613   1274
- literate_aside                   0.396  0.565  0.466    467
- literate_categorical_statement   0.246  0.245  0.245    388
- literate_causal_explicit         0.325  0.305  0.315    370
- literate_citation                0.500  0.551  0.524    243
- literate_conceptual_metaphor     0.168  0.297  0.215    219
- literate_concessive              0.542  0.491  0.515    731
- literate_concessive_connector    0.113  0.378  0.174     37
- literate_concrete_setting        0.174  0.279  0.214    301
- literate_conditional             0.586  0.710  0.642   1610
- literate_contrastive             0.374  0.343  0.358    382
- literate_cross_reference         0.351  0.317  0.333     82
- literate_definitional_move       0.217  0.371  0.274     62
- literate_enumeration             0.456  0.570  0.507    899
- literate_epistemic_hedge         0.415  0.511  0.458    534
- literate_evidential              0.364  0.503  0.422    175
- literate_institutional_subject   0.296  0.520  0.378    246
- literate_list_structure          0.709  0.559  0.625    653
- literate_metadiscourse           0.291  0.451  0.354    355
- literate_nested_clauses          0.105  0.266  0.151   1250
- literate_nominalization          0.475  0.554  0.511   1144
- literate_objectifying_stance     0.518  0.448  0.481    194
- literate_probability             0.612  0.548  0.578    115
- literate_qualified_assertion     0.185  0.160  0.172    125
- literate_relative_chain          0.320  0.537  0.401   1713
- literate_technical_abbreviation  0.545  0.783  0.643    161
- literate_technical_term          0.331  0.458  0.384    909
- literate_temporal_embedding      0.222  0.249  0.235    570
- oral_anaphora                    0.207  0.248  0.226    137
- oral_antithesis                  0.245  0.289  0.265    453
- oral_discourse_formula           0.353  0.384  0.368    563
- oral_embodied_action             0.263  0.374  0.309    470
- oral_everyday_example            0.160  0.164  0.162    366
- oral_imperative                  0.519  0.670  0.585    200
- oral_inclusive_we                0.587  0.672  0.626    752
- oral_intensifier_doubling        0.310  0.165  0.215     79
- oral_lexical_repetition          0.293  0.488  0.366    217
- oral_named_individual            0.428  0.676  0.524    791
- oral_parallelism                 0.654  0.048  0.089    710
- oral_phatic_check                0.465  0.882  0.609     76
- oral_phatic_filler               0.375  0.582  0.456    182
- oral_rhetorical_question         0.589  0.894  0.710   1264
- oral_second_person               0.614  0.545  0.577    833
- oral_self_correction             0.597  0.295  0.395    156
- oral_sensory_detail              0.275  0.312  0.293    352
- oral_simple_conjunction          0.096  0.211  0.132     71
- oral_specific_place              0.472  0.716  0.569    387
- oral_temporal_anchor             0.397  0.748  0.518    551
- oral_tricolon                    0.274  0.285  0.280    557
- oral_vocative                    0.634  0.761  0.692    155
+ literate_abstract_noun           0.209  0.329  0.255    420
+ literate_additive_formal         0.243  0.479  0.322     71
+ literate_agent_demoted           0.468  0.664  0.549    414
+ literate_agentless_passive       0.555  0.648  0.598   1168
+ literate_aside                   0.481  0.469  0.475    469
+ literate_categorical_statement   0.084  0.263  0.128    118
+ literate_causal_explicit         0.314  0.386  0.347    272
+ literate_citation                0.468  0.431  0.449    255
+ literate_conceptual_metaphor     0.370  0.397  0.383    517
+ literate_concessive              0.456  0.503  0.478    533
+ literate_concessive_connector    0.250  0.603  0.353     63
+ literate_concrete_setting        0.186  0.322  0.236    298
+ literate_conditional             0.519  0.548  0.533   1514
+ literate_contrastive             0.391  0.462  0.424    424
+ literate_cross_reference         0.825  0.316  0.457    253
+ literate_definitional_move       0.443  0.432  0.438    236
+ literate_enumeration             0.147  0.306  0.198    297
+ literate_epistemic_hedge         0.236  0.431  0.305    255
+ literate_evidential              0.269  0.472  0.342    106
+ literate_institutional_subject   0.157  0.450  0.233    111
+ literate_list_structure          0.528  0.614  0.567    295
+ literate_metadiscourse           0.355  0.407  0.379    447
+ literate_nested_clauses          0.143  0.093  0.113   2044
+ literate_nominalization          0.433  0.538  0.480   1013
+ literate_objectifying_stance     0.451  0.575  0.506    113
+ literate_probability             0.439  0.720  0.545     50
+ literate_qualified_assertion     0.186  0.077  0.109    142
+ literate_relative_chain          0.344  0.606  0.439   1456
+ literate_technical_abbreviation  0.500  0.705  0.585    139
+ literate_technical_term          0.278  0.423  0.336    825
+ literate_temporal_embedding      0.174  0.253  0.206    400
+ oral_anaphora                    0.500  0.303  0.377    297
+ oral_antithesis                  0.298  0.339  0.317    561
+ oral_discourse_formula           0.373  0.461  0.413    492
+ oral_embodied_action             0.295  0.368  0.327    454
+ oral_everyday_example            0.279  0.307  0.293    420
+ oral_imperative                  0.359  0.600  0.449    110
+ oral_inclusive_we                0.579  0.668  0.620    681
+ oral_intensifier_doubling        0.429  0.220  0.290     82
+ oral_lexical_repetition          0.328  0.382  0.353    275
+ oral_named_individual            0.359  0.712  0.478    573
+ oral_parallelism                 0.111  0.114  0.112    202
+ oral_phatic_check                0.288  0.436  0.347     39
+ oral_phatic_filler               0.389  0.527  0.448    146
+ oral_rhetorical_question         0.581  0.892  0.703   1006
+ oral_second_person               0.555  0.528  0.541    718
+ oral_self_correction             0.293  0.357  0.322    115
+ oral_sensory_detail              0.194  0.402  0.262    246
+ oral_simple_conjunction          0.174  0.229  0.198    131
+ oral_specific_place              0.453  0.751  0.565    406
+ oral_temporal_anchor             0.223  0.704  0.339    257
+ oral_tricolon                    0.470  0.293  0.361    907
+ oral_vocative                    0.386  0.942  0.547     52
  ========================================================================
- Macro avg (types w/ support)                    0.400
+ Macro avg (types w/ support)                    0.386
  ```

  </details>
@@ -180,10 +180,9 @@ Macro avg (types w/ support) 0.400
  **Missing labels (test set):** 0/53 — all types detected at least once.

  Notable patterns:
- - **Strong performers** (F1 > 0.5): rhetorical_question (0.710), vocative (0.692), conditional (0.642), technical_abbreviation (0.643), inclusive_we (0.626), list_structure (0.625), agentless_passive (0.613), phatic_check (0.609), imperative (0.585), probability (0.578), second_person (0.577), specific_place (0.569), citation (0.524), named_individual (0.524), temporal_anchor (0.518), concessive (0.515), nominalization (0.511), enumeration (0.507)
- - **Weak performers** (F1 < 0.2): parallelism (0.089), simple_conjunction (0.132), nested_clauses (0.151), everyday_example (0.162), qualified_assertion (0.172), concessive_connector (0.174)
- - **Precision-recall tradeoff**: Most types now show higher recall than precision, indicating the model over-predicts markers — reversed from the previous release. Notable exceptions include `parallelism` (0.654 precision / 0.048 recall), `self_correction`, and `intensifier_doubling`, which remain high-precision but low-recall.
- - **Recovered type**: `oral_parallelism` crossed the 150-span threshold and was re-included, though its near-zero recall (0.048) means it is effectively non-functional despite high precision when it does fire.
+ - **Strong performers** (F1 > 0.5): rhetorical_question (0.703), inclusive_we (0.620), agentless_passive (0.598), technical_abbreviation (0.585), list_structure (0.567), specific_place (0.565), agent_demoted (0.549), vocative (0.547), probability (0.545), second_person (0.541), conditional (0.533), objectifying_stance (0.506)
+ - **Weak performers** (F1 < 0.2): qualified_assertion (0.109), parallelism (0.112), nested_clauses (0.113), categorical_statement (0.128), enumeration (0.198), simple_conjunction (0.198)
+ - **Precision-recall tradeoff**: Most types show higher recall than precision, indicating the model over-predicts markers. Notable exceptions include `cross_reference` (0.825 precision / 0.316 recall), `anaphora` (0.500 / 0.303), and `tricolon` (0.470 / 0.293), which remain high-precision but low-recall.

  ## Architecture

@@ -216,8 +215,8 @@ classifier.bias → randomly initialized
  ## Limitations

  - **Recall-dominated errors**: Most types over-predict (recall > precision), producing false positives; downstream applications may need confidence thresholding
- - **Near-zero recall types**: `parallelism` (0.048 recall), `intensifier_doubling` (0.165), and `simple_conjunction` (0.211) are rarely detected despite being present in training data
- - **Low-precision types**: `simple_conjunction` (0.096), `nested_clauses` (0.105), and `concessive_connector` (0.113) have precision below 0.15, meaning most predictions for those types are false positives
+ - **Near-zero recall types**: `qualified_assertion` (0.077 recall), `nested_clauses` (0.093), and `parallelism` (0.114) are rarely detected despite being present in training data
+ - **Low-precision types**: `categorical_statement` (0.084), `parallelism` (0.111), and `nested_clauses` (0.143) have precision below 0.15, meaning most predictions for those types are false positives
  - **Context window**: 128 tokens max; longer spans may be truncated
  - **Domain**: Trained primarily on historical/literary texts; may underperform on modern social media
  - **Subjectivity**: Some marker boundaries are inherently ambiguous
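The "binary positive = B or I" detection metric used throughout the README diff above can be sketched in a few lines: for each marker type independently, any B or I tag counts as positive and O as negative, and precision/recall/F1 are computed token-wise. This is a minimal illustration, not the repo's evaluation code; the tag sequences below are made up.

```python
def detection_f1(gold_tags, pred_tags):
    """Token-level detection F1 for one marker type (B/I = positive, O = negative)."""
    tp = sum(1 for g, p in zip(gold_tags, pred_tags) if g != "O" and p != "O")
    fp = sum(1 for g, p in zip(gold_tags, pred_tags) if g == "O" and p != "O")
    fn = sum(1 for g, p in zip(gold_tags, pred_tags) if g != "O" and p == "O")
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# Illustrative sequences for a single type; B/I distinctions collapse to "positive".
gold = ["O", "B", "I", "O", "O", "B"]
pred = ["O", "B", "O", "O", "B", "B"]
print(round(detection_f1(gold, pred), 3))  # → 0.667
```

The reported "Macro avg (types w/ support)" would then be the unweighted mean of this score over the 53 types with nonzero test support.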
model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:41995685c78ead06fdda874b90a8bdf7b283997fa076207a33c0bd7136179ef3
3
  size 436082548
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:37d9c74b122fa304421948d1f1bc5ad1d686fb33eab36ae82079c1e8f4a03282
3
  size 436082548
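The README's "Loss" row mentions inverse-frequency type weights. One common way to realize that idea, shown here as an illustrative sketch rather than the repo's actual training code, is to weight each type proportionally to 1/count and normalize so the mean weight is 1; the counts below are test-set supports from the table, reused purely as an example.

```python
def inverse_frequency_weights(counts):
    """Map each type to a weight proportional to 1/count, normalized to mean 1.0."""
    inv = {t: 1.0 / c for t, c in counts.items()}
    mean = sum(inv.values()) / len(inv)
    return {t: v / mean for t, v in inv.items()}

# Example counts (test supports from the table above, used only for illustration).
counts = {
    "oral_vocative": 52,
    "oral_rhetorical_question": 1006,
    "literate_conditional": 1514,
}
weights = inverse_frequency_weights(counts)
# The rarest type (oral_vocative) receives the largest weight.
```

Under this scheme the rare types that dominate the "weak performers" list still contribute strongly to the loss, which is consistent with the recall-heavy, over-predicting behavior the README notes.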