permutans committed
Commit 84f7e31 · verified · 1 Parent(s): 02643c1

Upload folder using huggingface_hub

Files changed (2)
  1. README.md +60 -61
  2. model.safetensors +1 -1
README.md CHANGED
@@ -31,7 +31,7 @@ This model performs multi-label span-level detection of 53 rhetorical marker types
  | Base model | `bert-base-uncased` |
  | Task | Multi-label token classification (independent B/I/O per type) |
  | Marker types | 53 (22 oral, 31 literate) |
- | Test macro F1 | **0.400** (per-type detection, binary positive = B or I) |
+ | Test macro F1 | **0.386** (per-type detection, binary positive = B or I) |
  | Training | 20 epochs, batch 24, lr 3e-5, fp16 |
  | Regularization | Mixout (p=0.1) — stochastic L2 anchor to pretrained weights |
  | Loss | Per-type weighted cross-entropy with inverse-frequency type weights |
@@ -118,61 +118,61 @@ Per-type detection F1 on test set (binary: B or I = positive, O = negative):
  ```
  Type                             Prec   Rec    F1     Sup
  ========================================================================
- literate_abstract_noun           0.211  0.319  0.254    464
- literate_additive_formal         0.263  0.506  0.346     83
- literate_agent_demoted           0.364  0.629  0.461    291
- literate_agentless_passive       0.545  0.701  0.613   1274
- literate_aside                   0.396  0.565  0.466    467
- literate_categorical_statement   0.246  0.245  0.245    388
- literate_causal_explicit         0.325  0.305  0.315    370
- literate_citation                0.500  0.551  0.524    243
- literate_conceptual_metaphor     0.168  0.297  0.215    219
- literate_concessive              0.542  0.491  0.515    731
- literate_concessive_connector    0.113  0.378  0.174     37
- literate_concrete_setting        0.174  0.279  0.214    301
- literate_conditional             0.586  0.710  0.642   1610
- literate_contrastive             0.374  0.343  0.358    382
- literate_cross_reference         0.351  0.317  0.333     82
- literate_definitional_move       0.217  0.371  0.274     62
- literate_enumeration             0.456  0.570  0.507    899
- literate_epistemic_hedge         0.415  0.511  0.458    534
- literate_evidential              0.364  0.503  0.422    175
- literate_institutional_subject   0.296  0.520  0.378    246
- literate_list_structure          0.709  0.559  0.625    653
- literate_metadiscourse           0.291  0.451  0.354    355
- literate_nested_clauses          0.105  0.266  0.151   1250
- literate_nominalization          0.475  0.554  0.511   1144
- literate_objectifying_stance     0.518  0.448  0.481    194
- literate_probability             0.612  0.548  0.578    115
- literate_qualified_assertion     0.185  0.160  0.172    125
- literate_relative_chain          0.320  0.537  0.401   1713
- literate_technical_abbreviation  0.545  0.783  0.643    161
- literate_technical_term          0.331  0.458  0.384    909
- literate_temporal_embedding      0.222  0.249  0.235    570
- oral_anaphora                    0.207  0.248  0.226    137
- oral_antithesis                  0.245  0.289  0.265    453
- oral_discourse_formula           0.353  0.384  0.368    563
- oral_embodied_action             0.263  0.374  0.309    470
- oral_everyday_example            0.160  0.164  0.162    366
- oral_imperative                  0.519  0.670  0.585    200
- oral_inclusive_we                0.587  0.672  0.626    752
- oral_intensifier_doubling        0.310  0.165  0.215     79
- oral_lexical_repetition          0.293  0.488  0.366    217
- oral_named_individual            0.428  0.676  0.524    791
- oral_parallelism                 0.654  0.048  0.089    710
- oral_phatic_check                0.465  0.882  0.609     76
- oral_phatic_filler               0.375  0.582  0.456    182
- oral_rhetorical_question         0.589  0.894  0.710   1264
- oral_second_person               0.614  0.545  0.577    833
- oral_self_correction             0.597  0.295  0.395    156
- oral_sensory_detail              0.275  0.312  0.293    352
- oral_simple_conjunction          0.096  0.211  0.132     71
- oral_specific_place              0.472  0.716  0.569    387
- oral_temporal_anchor             0.397  0.748  0.518    551
- oral_tricolon                    0.274  0.285  0.280    557
- oral_vocative                    0.634  0.761  0.692    155
+ literate_abstract_noun           0.209  0.329  0.255    420
+ literate_additive_formal         0.243  0.479  0.322     71
+ literate_agent_demoted           0.468  0.664  0.549    414
+ literate_agentless_passive       0.555  0.648  0.598   1168
+ literate_aside                   0.481  0.469  0.475    469
+ literate_categorical_statement   0.084  0.263  0.128    118
+ literate_causal_explicit         0.314  0.386  0.347    272
+ literate_citation                0.468  0.431  0.449    255
+ literate_conceptual_metaphor     0.370  0.397  0.383    517
+ literate_concessive              0.456  0.503  0.478    533
+ literate_concessive_connector    0.250  0.603  0.353     63
+ literate_concrete_setting        0.186  0.322  0.236    298
+ literate_conditional             0.519  0.548  0.533   1514
+ literate_contrastive             0.391  0.462  0.424    424
+ literate_cross_reference         0.825  0.316  0.457    253
+ literate_definitional_move       0.443  0.432  0.438    236
+ literate_enumeration             0.147  0.306  0.198    297
+ literate_epistemic_hedge         0.236  0.431  0.305    255
+ literate_evidential              0.269  0.472  0.342    106
+ literate_institutional_subject   0.157  0.450  0.233    111
+ literate_list_structure          0.528  0.614  0.567    295
+ literate_metadiscourse           0.355  0.407  0.379    447
+ literate_nested_clauses          0.143  0.093  0.113   2044
+ literate_nominalization          0.433  0.538  0.480   1013
+ literate_objectifying_stance     0.451  0.575  0.506    113
+ literate_probability             0.439  0.720  0.545     50
+ literate_qualified_assertion     0.186  0.077  0.109    142
+ literate_relative_chain          0.344  0.606  0.439   1456
+ literate_technical_abbreviation  0.500  0.705  0.585    139
+ literate_technical_term          0.278  0.423  0.336    825
+ literate_temporal_embedding      0.174  0.253  0.206    400
+ oral_anaphora                    0.500  0.303  0.377    297
+ oral_antithesis                  0.298  0.339  0.317    561
+ oral_discourse_formula           0.373  0.461  0.413    492
+ oral_embodied_action             0.295  0.368  0.327    454
+ oral_everyday_example            0.279  0.307  0.293    420
+ oral_imperative                  0.359  0.600  0.449    110
+ oral_inclusive_we                0.579  0.668  0.620    681
+ oral_intensifier_doubling        0.429  0.220  0.290     82
+ oral_lexical_repetition          0.328  0.382  0.353    275
+ oral_named_individual            0.359  0.712  0.478    573
+ oral_parallelism                 0.111  0.114  0.112    202
+ oral_phatic_check                0.288  0.436  0.347     39
+ oral_phatic_filler               0.389  0.527  0.448    146
+ oral_rhetorical_question         0.581  0.892  0.703   1006
+ oral_second_person               0.555  0.528  0.541    718
+ oral_self_correction             0.293  0.357  0.322    115
+ oral_sensory_detail              0.194  0.402  0.262    246
+ oral_simple_conjunction          0.174  0.229  0.198    131
+ oral_specific_place              0.453  0.751  0.565    406
+ oral_temporal_anchor             0.223  0.704  0.339    257
+ oral_tricolon                    0.470  0.293  0.361    907
+ oral_vocative                    0.386  0.942  0.547     52
  ========================================================================
- Macro avg (types w/ support)                    0.400
+ Macro avg (types w/ support)                    0.386
  ```

  </details>
@@ -180,10 +180,9 @@ Macro avg (types w/ support) 0.400
  **Missing labels (test set):** 0/53 — all types detected at least once.

  Notable patterns:
- - **Strong performers** (F1 > 0.5): rhetorical_question (0.710), vocative (0.692), conditional (0.642), technical_abbreviation (0.643), inclusive_we (0.626), list_structure (0.625), agentless_passive (0.613), phatic_check (0.609), imperative (0.585), probability (0.578), second_person (0.577), specific_place (0.569), citation (0.524), named_individual (0.524), temporal_anchor (0.518), concessive (0.515), nominalization (0.511), enumeration (0.507)
- - **Weak performers** (F1 < 0.2): parallelism (0.089), simple_conjunction (0.132), nested_clauses (0.151), everyday_example (0.162), qualified_assertion (0.172), concessive_connector (0.174)
- - **Precision-recall tradeoff**: Most types now show higher recall than precision, indicating the model over-predicts markers — reversed from the previous release. Notable exceptions include `parallelism` (0.654 precision / 0.048 recall), `self_correction`, and `intensifier_doubling`, which remain high-precision but low-recall.
- - **Recovered type**: `oral_parallelism` crossed the 150-span threshold and was re-included, though its near-zero recall (0.048) means it is effectively non-functional despite high precision when it does fire.
+ - **Strong performers** (F1 > 0.5): rhetorical_question (0.703), inclusive_we (0.620), agentless_passive (0.598), technical_abbreviation (0.585), list_structure (0.567), specific_place (0.565), agent_demoted (0.549), vocative (0.547), probability (0.545), second_person (0.541), conditional (0.533), objectifying_stance (0.506)
+ - **Weak performers** (F1 < 0.2): qualified_assertion (0.109), parallelism (0.112), nested_clauses (0.113), categorical_statement (0.128), enumeration (0.198), simple_conjunction (0.198)
+ - **Precision-recall tradeoff**: Most types show higher recall than precision, indicating the model over-predicts markers. Notable exceptions include `cross_reference` (0.825 precision / 0.316 recall), `anaphora` (0.500 / 0.303), and `tricolon` (0.470 / 0.293), which remain high-precision but low-recall.

  ## Architecture

@@ -216,8 +215,8 @@ classifier.bias → randomly initialized
  ## Limitations

  - **Recall-dominated errors**: Most types over-predict (recall > precision), producing false positives; downstream applications may need confidence thresholding
- - **Near-zero recall types**: `parallelism` (0.048 recall), `intensifier_doubling` (0.165), and `simple_conjunction` (0.211) are rarely detected despite being present in training data
- - **Low-precision types**: `simple_conjunction` (0.096), `nested_clauses` (0.105), and `concessive_connector` (0.113) have precision below 0.15, meaning most predictions for those types are false positives
+ - **Near-zero recall types**: `qualified_assertion` (0.077 recall), `nested_clauses` (0.093), and `parallelism` (0.114) are rarely detected despite being present in training data
+ - **Low-precision types**: `categorical_statement` (0.084), `parallelism` (0.111), and `nested_clauses` (0.143) have precision below 0.15, meaning most predictions for those types are false positives
  - **Context window**: 128 tokens max; longer spans may be truncated
  - **Domain**: Trained primarily on historical/literary texts; may underperform on modern social media
  - **Subjectivity**: Some marker boundaries are inherently ambiguous
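The "binary positive = B or I" detection metric used throughout the README diff above can be sketched in a few lines: for each marker type independently, any B or I tag counts as positive and O as negative, and precision/recall/F1 are computed token-wise. This is a minimal illustration, not the repo's evaluation code; the tag sequences below are made up.

```python
def detection_f1(gold_tags, pred_tags):
    """Token-level detection F1 for one marker type (B/I = positive, O = negative)."""
    tp = sum(1 for g, p in zip(gold_tags, pred_tags) if g != "O" and p != "O")
    fp = sum(1 for g, p in zip(gold_tags, pred_tags) if g == "O" and p != "O")
    fn = sum(1 for g, p in zip(gold_tags, pred_tags) if g != "O" and p == "O")
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# Illustrative sequences for a single type; B/I distinctions collapse to "positive".
gold = ["O", "B", "I", "O", "O", "B"]
pred = ["O", "B", "O", "O", "B", "B"]
print(round(detection_f1(gold, pred), 3))  # → 0.667
```

The reported "Macro avg (types w/ support)" would then be the unweighted mean of this score over the 53 types with nonzero test support.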
model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:41995685c78ead06fdda874b90a8bdf7b283997fa076207a33c0bd7136179ef3
3
  size 436082548
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:37d9c74b122fa304421948d1f1bc5ad1d686fb33eab36ae82079c1e8f4a03282
3
  size 436082548
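The README's "Loss" row mentions inverse-frequency type weights. One common way to realize that idea, shown here as an illustrative sketch rather than the repo's actual training code, is to weight each type proportionally to 1/count and normalize so the mean weight is 1; the counts below are test-set supports from the table, reused purely as an example.

```python
def inverse_frequency_weights(counts):
    """Map each type to a weight proportional to 1/count, normalized to mean 1.0."""
    inv = {t: 1.0 / c for t, c in counts.items()}
    mean = sum(inv.values()) / len(inv)
    return {t: v / mean for t, v in inv.items()}

# Example counts (test supports from the table above, used only for illustration).
counts = {
    "oral_vocative": 52,
    "oral_rhetorical_question": 1006,
    "literate_conditional": 1514,
}
weights = inverse_frequency_weights(counts)
# The rarest type (oral_vocative) receives the largest weight.
```

Under this scheme the rare types that dominate the "weak performers" list still contribute strongly to the loss, which is consistent with the recall-heavy, over-predicting behavior the README notes.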