HavelockAI/bert-marker-subtype

@@ -25,10 +25,10 @@ model-index:
       name: Marker Subtype Classification
     metrics:
     - type: f1
-      value: 0.5320
       name: F1 (macro)
     - type: accuracy
-      value: 0.517
       name: Accuracy
 ---
@@ -46,8 +46,9 @@ This is the finest level of the Havelock span classification hierarchy. Given a
 | Architecture | `BertForSequenceClassification` |
 | Task | Multi-class classification (71 classes) |
 | Max sequence length | 128 tokens |
-| Best F1 (macro) | **0.5320** |
-| Best Accuracy | **0.517** |
 | Parameters | ~109M |
 ## Usage
@@ -78,12 +79,12 @@ print(f"Marker subtype: {model.config.id2label[pred]}")
 | **Repetition & Pattern** | `anaphora`, `epistrophe`, `parallelism`, `tricolon`, `lexical_repetition`, `refrain` |
 | **Sound & Rhythm** | `alliteration`, `assonance`, `rhyme`, `rhythm` |
 | **Address & Interaction** | `vocative`, `imperative`, `second_person`, `inclusive_we`, `rhetorical_question`, `audience_response`, `phatic_check`, `phatic_filler` |
-| **Conjunction** | `polysyndeton`, `asyndeton`, `simple_conjunction`, `binomial_expression` |
 | **Formulas** | `discourse_formula`, `proverb`, `religious_formula`, `epithet` |
 | **Narrative** | `named_individual`, `specific_place`, `temporal_anchor`, `sensory_detail`, `embodied_action`, `everyday_example` |
 | **Performance** | `dramatic_pause`, `self_correction`, `conflict_frame`, `us_them`, `intensifier_doubling`, `antithesis` |
-### Literate Subtypes (36)
 | Category | Subtypes |
 |----------|----------|
@@ -99,129 +100,111 @@ print(f"Marker subtype: {model.config.id2label[pred]}")
 ### Data
-Span-level annotations from the Havelock corpus. Each span carries a `marker_subtype` field. Only subtypes with ≥15 examples in the full dataset are included. The corpus draws from Project Gutenberg, textfiles.com, Reddit, and Wikipedia talk pages.
-A stratified 80/20 train/test split was used (random seed 42). The test set contains 4,608 spans.
 ### Hyperparameters
 | Parameter | Value |
 |-----------|-------|
-| Epochs | 10 |
-| Batch size | 256 |
-| Learning rate | 1.5e-4 |
-| Optimizer | AdamW |
-| LR schedule | Linear warmup (10% of total steps) |
 | Gradient clipping | 1.0 |
-| Loss | Cross-entropy with class weights (range 0.23–4.33) |
-| Min examples per class | 15 |
-### Training Metrics
-| Epoch | Loss | Accuracy | F1 (macro) |
-|-------|------|----------|------------|
-| 1 | 3.7795 | 0.3249 | 0.1618 |
-| 2 | 2.3703 | 0.4918 | 0.4254 |
-| 3 | 1.5864 | 0.5139 | 0.4964 |
-| 4 | 1.0582 | 0.5195 | 0.5238 |
-| 5 | 0.6955 | 0.5189 | 0.5196 |
-| 6 | 0.4761 | 0.5148 | 0.5227 |
-| 7 | 0.3279 | 0.5178 | **0.5320** |
-| 8 | 0.2419 | 0.5119 | 0.5213 |
-| 9 | 0.1885 | 0.5206 | 0.5283 |
-| 10 | 0.1454 | 0.5169 | 0.5250 |
-Best checkpoint selected by F1 at epoch 7. Accuracy plateaus from epoch 3 onward while F1 continues improving through rare-class gains.
 ### Test Set Classification Report
 <details><summary>Click to expand per-class precision/recall/F1/support</summary>
 ```
                         precision    recall  f1-score   support
-         abstract_noun      0.315     0.312     0.314       144
-       additive_formal      0.478     0.423     0.449        26
-         agent_demoted      0.909     0.645     0.755        31
-     agentless_passive      0.533     0.543     0.538       105
-          alliteration      0.632     0.400     0.490        30
-              anaphora      0.526     0.466     0.494        88
-            antithesis      0.641     0.806     0.714        31
-                 aside      0.261     0.218     0.238        55
-             assonance      0.917     1.000     0.957        33
-             asyndeton      0.677     0.700     0.689        30
-     audience_response      0.808     0.700     0.750        30
- categorical_statement      0.329     0.245     0.281        98
-          causal_chain      0.442     0.425     0.433        80
-       causal_explicit      0.406     0.406     0.406        69
-              citation      0.646     0.627     0.636        67
-   conceptual_metaphor      0.298     0.233     0.262        73
-            concessive      0.690     0.659     0.674        88
-  concessive_connector      0.920     0.742     0.821        31
-           conditional      0.620     0.684     0.650       155
-        conflict_frame      0.833     0.806     0.820        31
-           contrastive      0.463     0.543     0.500       116
-       cross_reference      0.538     0.412     0.467        34
-     definitional_move      0.300     0.308     0.304        39
-     discourse_formula      0.559     0.565     0.562       276
-        dramatic_pause      0.781     0.806     0.794        31
-       embodied_action      0.333     0.362     0.347        69
-           enumeration      0.607     0.600     0.604        85
-       epistemic_hedge      0.491     0.554     0.521       101
-            epistrophe      0.867     0.812     0.839        32
-               epithet      0.424     0.519     0.467        27
-      everyday_example      0.361     0.317     0.338        41
-            evidential      0.526     0.556     0.541        54
-    footnote_reference      0.615     0.533     0.571        15
-            imperative      0.659     0.753     0.703       146
-          inclusive_we      0.613     0.608     0.611       120
- institutional_subject      0.600     0.581     0.590        31
-  intensifier_doubling      0.833     0.667     0.741        30
-    lexical_repetition      0.486     0.564     0.522        94
-        list_structure      0.286     0.278     0.282        36
-         metadiscourse      0.320     0.276     0.296        87
-methodological_framing      0.269     0.219     0.241        32
-      named_individual      0.364     0.436     0.397        55
-        nested_clauses      0.370     0.310     0.338        87
-        nominalization      0.377     0.433     0.403       120
-   objectifying_stance      0.125     0.233     0.163        43
-           parallelism      0.218     0.293     0.250        58
-          phatic_check      0.636     0.667     0.651        21
-         phatic_filler      0.333     0.400     0.364        30
-          polysyndeton      0.964     0.844     0.900        32
-           probability      0.574     0.551     0.562        49
-               proverb      0.304     0.226     0.259        31
-   qualified_assertion      0.219     0.233     0.226        60
-               refrain      0.818     0.600     0.692        30
-        relative_chain      0.558     0.504     0.530       115
-     religious_formula      0.840     0.656     0.737        32
-   rhetorical_question      0.686     0.745     0.714       161
-                 rhyme      0.480     0.375     0.421        32
-                rhythm      0.778     0.875     0.824        32
-         second_person      0.543     0.596     0.568       235
-       self_correction      0.826     0.633     0.717        30
-        sensory_detail      0.387     0.324     0.353        37
-    simple_conjunction      0.222     0.195     0.208        41
-        specific_place      0.526     0.385     0.444        26
-technical_abbreviation      0.278     0.263     0.270        19
-        technical_term      0.615     0.466     0.530       161
-       temporal_anchor      0.404     0.429     0.416        49
-    temporal_embedding      0.438     0.519     0.475        81
-third_person_reference      0.788     0.839     0.812        31
-              tricolon      0.607     0.567     0.586        30
-               us_them      0.606     0.645     0.625        31
-              vocative      0.643     0.621     0.632        58
-              accuracy                          0.517      4608
-             macro avg      0.540     0.517     0.525      4608
-          weighted avg      0.522     0.517     0.517      4608
 ```
 </details>
-**Top performing subtypes (F1 > 0.75):** `assonance` (0.957), `polysyndeton` (0.900), `epistrophe` (0.839), `rhythm` (0.824), `concessive_connector` (0.821), `conflict_frame` (0.820), `third_person_reference` (0.812), `dramatic_pause` (0.794), `agent_demoted` (0.755), `audience_response` (0.750).
-**Weakest subtypes (F1 < 0.25):** `objectifying_stance` (0.163), `simple_conjunction` (0.208), `qualified_assertion` (0.226), `aside` (0.238), `methodological_framing` (0.241), `parallelism` (0.250). These tend to be semantically diffuse classes that overlap heavily with neighbouring subtypes.
 ## Class Distribution
@@ -229,17 +212,16 @@ The test set exhibits significant imbalance across 71 classes:
 | Support Range | Classes | % of Total |
 |---------------|---------|------------|
-| >200 | 2 (`discourse_formula`, `second_person`) | 3% |
-| 100–200 | 11 | 15% |
-| 50–100 | 18 | 25% |
-| 25–50 | 41 | 57% |
 ## Limitations
-- **Accuracy plateau with F1 headroom**: Accuracy saturated around 0.52 from epoch 3 while F1 continued climbing through epoch 7, suggesting the model is still finding better decision boundaries for rare classes. Further training with LR decay or curriculum strategies may help.
-- **71-way classification on ~23k spans**: The data budget per class is thin, particularly for classes near the 15-example minimum. More data or class consolidation would help.
 - **Semantic overlap**: Some subtypes are difficult to distinguish from surface text alone (e.g., `parallelism` vs `anaphora` vs `tricolon`; `epistemic_hedge` vs `qualified_assertion` vs `probability`). The model may benefit from hierarchical classification that conditions on type-level predictions.
-- **Recall-precision tradeoff**: Many rare classes show high precision but lower recall (e.g., `polysyndeton`: P=0.964, R=0.844; `agent_demoted`: P=0.909, R=0.645), suggesting the model learns narrow prototypes but misses variation.
 - **Span-level only**: Requires pre-extracted spans. Does not detect boundaries.
 - **128-token context window**: Longer spans are truncated.
@@ -252,8 +234,8 @@ The 71 subtypes represent the full granularity of the Havelock taxonomy, operati
 | Model | Task | Classes | F1 |
 |-------|------|---------|-----|
 | [`HavelockAI/bert-marker-category`](https://huggingface.co/HavelockAI/bert-marker-category) | Binary (oral/literate) | 2 | 0.875 |
-| [`HavelockAI/bert-marker-type`](https://huggingface.co/HavelockAI/bert-marker-type) | Functional type | 25 | 0.449 |
-| **This model** | Fine-grained subtype | 71 | 0.532 |
 | [`HavelockAI/bert-orality-regressor`](https://huggingface.co/HavelockAI/bert-orality-regressor) | Document-level score | Regression | MAE 0.079 |
 | [`HavelockAI/bert-token-classifier`](https://huggingface.co/HavelockAI/bert-token-classifier) | Span detection (BIO) | 145 | 0.500 |
@@ -273,4 +255,4 @@ The 71 subtypes represent the full granularity of the Havelock taxonomy, operati
 ---
-*Model version: da931b4a · Trained: February 2026*

       name: Marker Subtype Classification
     metrics:
     - type: f1
+      value: 0.500
       name: F1 (macro)
     - type: accuracy
+      value: 0.498
       name: Accuracy
 ---
 | Architecture | `BertForSequenceClassification` |
 | Task | Multi-class classification (71 classes) |
 | Max sequence length | 128 tokens |
+| Test F1 (macro) | **0.500** |
+| Test Accuracy | **0.498** |
+| Missing labels (test) | 1/71 (`rhyme`) |
 | Parameters | ~109M |
 ## Usage
 | **Repetition & Pattern** | `anaphora`, `epistrophe`, `parallelism`, `tricolon`, `lexical_repetition`, `refrain` |
 | **Sound & Rhythm** | `alliteration`, `assonance`, `rhyme`, `rhythm` |
 | **Address & Interaction** | `vocative`, `imperative`, `second_person`, `inclusive_we`, `rhetorical_question`, `audience_response`, `phatic_check`, `phatic_filler` |
+| **Conjunction** | `polysyndeton`, `asyndeton`, `simple_conjunction` |
 | **Formulas** | `discourse_formula`, `proverb`, `religious_formula`, `epithet` |
 | **Narrative** | `named_individual`, `specific_place`, `temporal_anchor`, `sensory_detail`, `embodied_action`, `everyday_example` |
 | **Performance** | `dramatic_pause`, `self_correction`, `conflict_frame`, `us_them`, `intensifier_doubling`, `antithesis` |
+### Literate Subtypes (35)
 | Category | Subtypes |
 |----------|----------|
 ### Data
+Span-level annotations from the Havelock corpus with marker types normalized against a canonical taxonomy at build time. Each span carries a `marker_subtype` field. Only subtypes with ≥50 examples are included. A stratified 80/10/10 train/val/test split was used with swap-based optimization to balance label distributions across splits. The test set contains 2,357 spans.
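
The per-label 80/10/10 split described above can be sketched in plain Python. This is an illustration only: the function name `stratified_split`, the seed, and the rounding rule are assumptions, and the swap-based balancing step is not reproduced because the card does not describe it in detail.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=42):
    """Split example indices into train/val/test, label by label.

    Hypothetical helper: the seed and rounding are illustrative choices,
    not the settings used to build the Havelock splits.
    """
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)
        n = len(idxs)
        # Give every label at least one val and one test example.
        n_val = max(1, round(n * fractions[1]))
        n_test = max(1, round(n * fractions[2]))
        test.extend(idxs[:n_test])
        val.extend(idxs[n_test:n_test + n_val])
        train.extend(idxs[n_test + n_val:])
    return train, val, test

# Toy labels: two subtypes with 50 and 100 spans.
labels = ["rhyme"] * 50 + ["imperative"] * 100
train_idx, val_idx, test_idx = stratified_split(labels)
```

Splitting within each label keeps rare subtypes represented in every split, which a plain random split over all spans would not guarantee.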
 ### Hyperparameters
 | Parameter | Value |
 |-----------|-------|
+| Epochs | 20 |
+| Batch size | 16 |
+| Learning rate | 3e-5 |
+| Optimizer | AdamW (weight decay 0.01) |
+| LR schedule | Cosine with 10% warmup |
 | Gradient clipping | 1.0 |
+| Loss | Focal loss (γ=2.0) + class weights |
+| Mixout | 0.1 |
+| Mixed precision | FP16 |
+| Min examples per class | 50 |
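
Two rows of the table, the cosine schedule with 10% warmup and the focal loss with γ=2.0 and class weights, can be sketched in plain Python. This is a minimal sketch under assumptions: the function names are hypothetical, the schedule is assumed to decay to zero, and the class weight is assumed to multiply the focal term directly. The actual training code may differ on all three points.

```python
import math

def cosine_warmup_lr(step, total_steps, base_lr=3e-5, warmup_frac=0.10):
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

def focal_loss(logits, target, class_weights, gamma=2.0):
    """Focal loss for a single example, computed from raw logits.

    Assumed form: -w[target] * (1 - p_t)^gamma * log(p_t).
    """
    # Numerically stable log-softmax for the target class.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_p_t = logits[target] - log_z
    p_t = math.exp(log_p_t)
    return -class_weights[target] * (1.0 - p_t) ** gamma * log_p_t

# Toy three-class example with uniform class weights.
logits = [2.0, 0.5, -1.0]
weights = [1.0, 1.0, 1.0]
ce = focal_loss(logits, 0, weights, gamma=0.0)  # gamma=0 reduces to cross-entropy
fl = focal_loss(logits, 0, weights, gamma=2.0)  # confident examples are down-weighted
```

With γ=0 the focal term disappears and the loss is ordinary weighted cross-entropy; raising γ shrinks the contribution of examples the model already classifies confidently.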
 ### Test Set Classification Report
 <details><summary>Click to expand per-class precision/recall/F1/support</summary>
 ```
                         precision    recall  f1-score   support
+         abstract_noun      0.376     0.364     0.370        88
+       additive_formal      0.455     0.417     0.435        12
+         agent_demoted      0.533     0.800     0.640        10
+     agentless_passive      0.542     0.456     0.495        57
+          alliteration      0.714     0.500     0.588        10
+              anaphora      0.490     0.585     0.533        41
+            antithesis      0.947     0.818     0.878        22
+                 aside      0.225     0.243     0.234        37
+             assonance      0.926     1.000     0.962        25
+             asyndeton      0.583     0.500     0.538        14
+     audience_response      0.778     0.700     0.737        10
+ categorical_statement      0.209     0.450     0.286        20
+          causal_chain      0.425     0.405     0.415        42
+       causal_explicit      0.537     0.468     0.500        47
+              citation      0.794     0.587     0.675        46
+   conceptual_metaphor      0.176     0.077     0.107        39
+            concessive      0.617     0.644     0.630        45
+  concessive_connector      0.833     0.833     0.833        18
+           conditional      0.582     0.655     0.616        87
+        conflict_frame      0.588     0.667     0.625        15
+           contrastive      0.442     0.557     0.493        61
+       cross_reference      0.733     0.458     0.564        24
+     definitional_move      0.333     0.200     0.250        10
+     discourse_formula      0.485     0.424     0.452       118
+        dramatic_pause      0.875     0.700     0.778        10
+       embodied_action      0.271     0.310     0.289        42
+           enumeration      0.556     0.581     0.568        43
+       epistemic_hedge      0.206     0.500     0.292        14
+            epistrophe      0.778     0.875     0.824        16
+               epithet      0.385     0.417     0.400        12
+      everyday_example      0.278     0.179     0.217        28
+            evidential      0.606     0.541     0.571        37
+    footnote_reference      0.444     0.400     0.421        10
+            imperative      0.628     0.590     0.608       100
+          inclusive_we      0.561     0.627     0.592        59
+ institutional_subject      0.947     0.857     0.900        21
+  intensifier_doubling      0.905     0.864     0.884        22
+    lexical_repetition      0.447     0.467     0.457        45
+        list_structure      0.190     0.174     0.182        23
+         metadiscourse      0.073     0.182     0.104        22
+methodological_framing      0.500     0.238     0.323        21
+      named_individual      0.455     0.333     0.385        30
+        nested_clauses      0.294     0.326     0.309        46
+        nominalization      0.353     0.429     0.387        56
+   objectifying_stance      0.167     0.300     0.214        10
+           parallelism      0.188     0.222     0.203        27
+          phatic_check      0.444     0.364     0.400        11
+         phatic_filler      0.300     0.600     0.400        10
+          polysyndeton      1.000     0.833     0.909        24
+           probability      0.500     0.682     0.577        22
+               proverb      0.059     0.100     0.074        10
+   qualified_assertion      0.280     0.241     0.259        29
+               refrain      0.850     0.708     0.773        24
+        relative_chain      0.431     0.455     0.442        55
+     religious_formula      1.000     0.688     0.815        16
+   rhetorical_question      0.646     0.738     0.689        84
+                 rhyme      0.000     0.000     0.000        10
+                rhythm      1.000     0.625     0.769        16
+         second_person      0.573     0.474     0.519       116
+       self_correction      0.952     0.500     0.656        40
+        sensory_detail      0.538     0.350     0.424        20
+    simple_conjunction      0.133     0.200     0.160        10
+        specific_place      0.625     0.278     0.385        18
+technical_abbreviation      0.818     0.321     0.462        28
+        technical_term      0.438     0.432     0.435        74
+       temporal_anchor      0.472     0.500     0.486        34
+    temporal_embedding      0.475     0.604     0.532        48
+third_person_reference      0.692     0.900     0.783        10
+              tricolon      0.667     0.667     0.667        18
+               us_them      0.750     0.500     0.600        18
+              vocative      0.414     0.600     0.490        20
+              accuracy                          0.498      2357
+             macro avg      0.528     0.497     0.500      2357
+          weighted avg      0.525     0.498     0.502      2357
 ```
 </details>
+**Top performing subtypes (F1 ≥ 0.75):** `assonance` (0.962), `polysyndeton` (0.909), `institutional_subject` (0.900), `intensifier_doubling` (0.884), `antithesis` (0.878), `concessive_connector` (0.833), `epistrophe` (0.824), `religious_formula` (0.815), `third_person_reference` (0.783), `dramatic_pause` (0.778), `refrain` (0.773), `rhythm` (0.769).
+**Weakest subtypes (F1 < 0.20):** `rhyme` (0.000), `proverb` (0.074), `metadiscourse` (0.104), `conceptual_metaphor` (0.107), `simple_conjunction` (0.160), `list_structure` (0.182). These tend to be semantically diffuse classes that overlap heavily with neighbouring subtypes or have very low test support.
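
The `macro avg` and `weighted avg` rows of the report differ only in how per-class scores are averaged. A minimal sketch, using four F1/support pairs copied from the report (the helper name `macro_and_weighted_f1` is hypothetical):

```python
def macro_and_weighted_f1(rows):
    """rows: list of (f1, support) pairs. Returns (macro, weighted) F1."""
    macro = sum(f1 for f1, _ in rows) / len(rows)
    total_support = sum(support for _, support in rows)
    weighted = sum(f1 * support for f1, support in rows) / total_support
    return macro, weighted

# (f1, support) for assonance, rhyme, discourse_formula, second_person.
rows = [(0.962, 25), (0.000, 10), (0.452, 118), (0.519, 116)]
macro, weighted = macro_and_weighted_f1(rows)
```

Macro F1 gives a 10-span class such as `rhyme` the same weight as a 118-span class, so a single class with zero F1 pulls the macro score down far more than the weighted one.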
 ## Class Distribution
 | Support Range | Classes | % of Total |
 |---------------|---------|------------|
+| ≥100 | 3 (`discourse_formula`, `second_person`, `imperative`) | 4% |
+| 50–99 | 9 | 13% |
+| 25–49 | 21 | 30% |
+| 10–24 | 38 | 54% |
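
A table like this can be recomputed from the support column of the classification report. The sketch below assumes non-overlapping ranges (10–24, 25–49, 50–99, ≥100) and uses only six supports copied from the report. The function name `support_bin` is hypothetical.

```python
from collections import Counter

def support_bin(n):
    """Map a test-set support count to a non-overlapping range label."""
    if n >= 100:
        return ">=100"
    if n >= 50:
        return "50-99"
    if n >= 25:
        return "25-49"
    return "10-24"

# Supports for a handful of classes, taken from the classification report.
supports = {
    "discourse_formula": 118,
    "imperative": 100,
    "conditional": 87,
    "anaphora": 41,
    "assonance": 25,
    "rhyme": 10,
}
bin_counts = Counter(support_bin(n) for n in supports.values())
```

Running the same function over all 71 supports gives the class count per range. Dividing each count by 71 gives the percentage column.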
 ## Limitations
+- **71-way classification on ~22k spans**: The data budget per class is thin, particularly for classes near the 50-example minimum. More data or class consolidation would help.
 - **Semantic overlap**: Some subtypes are difficult to distinguish from surface text alone (e.g., `parallelism` vs `anaphora` vs `tricolon`; `epistemic_hedge` vs `qualified_assertion` vs `probability`). The model may benefit from hierarchical classification that conditions on type-level predictions.
+- **Recall-precision tradeoff on rare classes**: Many rare classes show high precision but lower recall (e.g., `self_correction`: P=0.952, R=0.500; `religious_formula`: P=1.000, R=0.688), suggesting the model learns narrow prototypes but misses variation.
 - **Span-level only**: Requires pre-extracted spans. Does not detect boundaries.
 - **128-token context window**: Longer spans are truncated.
 | Model | Task | Classes | F1 |
 |-------|------|---------|-----|
 | [`HavelockAI/bert-marker-category`](https://huggingface.co/HavelockAI/bert-marker-category) | Binary (oral/literate) | 2 | 0.875 |
+| [`HavelockAI/bert-marker-type`](https://huggingface.co/HavelockAI/bert-marker-type) | Functional type | 18 | 0.583 |
+| **This model** | Fine-grained subtype | 71 | 0.500 |
 | [`HavelockAI/bert-orality-regressor`](https://huggingface.co/HavelockAI/bert-orality-regressor) | Document-level score | Regression | MAE 0.079 |
 | [`HavelockAI/bert-token-classifier`](https://huggingface.co/HavelockAI/bert-token-classifier) | Span detection (BIO) | 145 | 0.500 |
 ---
+*Model version: b31f147d · Trained: February 2026*

config.json CHANGED

@@ -168,7 +168,6 @@
   "num_hidden_layers": 12,
   "pad_token_id": 0,
   "position_embedding_type": "absolute",
-  "problem_type": "single_label_classification",
   "tie_word_embeddings": true,
   "transformers_version": "5.0.0",
   "type_vocab_size": 2,

   "num_hidden_layers": 12,
   "pad_token_id": 0,
   "position_embedding_type": "absolute",
   "tie_word_embeddings": true,
   "transformers_version": "5.0.0",
   "type_vocab_size": 2,

model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:908751e3d1db4b122a3c05ea81d50dfbef7cacda40e601e6b09905b8aa7fb99f
-size 438170868

 version https://git-lfs.github.com/spec/v1
+oid sha256:1ff78a23e1f73a3c2b1b41f7b253d652d236d03395d41483f87deba0000c9124
+size 780277732