permutans committed · Commit 479bf8f · verified · 1 Parent(s): 1ec5223

Upload folder using huggingface_hub

Files changed (2):
  1. README.md +269 -0
  2. model.safetensors +1 -1
README.md ADDED
@@ -0,0 +1,269 @@
---
license: mit
tags:
- text-classification
- bert
- orality
- linguistics
- rhetorical-analysis
language:
- en
metrics:
- f1
- accuracy
base_model:
- google-bert/bert-base-uncased
pipeline_tag: text-classification
library_name: transformers
datasets:
- custom
model-index:
- name: bert-marker-subtype
  results:
  - task:
      type: text-classification
      name: Marker Subtype Classification
    metrics:
    - type: f1
      value: 0.4704
      name: F1 (macro)
    - type: accuracy
      value: 0.515
      name: Accuracy
---

# Havelock Marker Subtype Classifier

BERT-based classifier for **71 fine-grained rhetorical marker subtypes** on the oral–literate spectrum, grounded in Walter Ong's *Orality and Literacy* (1982).

This is the finest level of the Havelock span classification hierarchy. Given a text span identified as a rhetorical marker, the model classifies it into one of 71 specific rhetorical devices (e.g., `anaphora`, `epistemic_hedge`, `vocative`, `nested_clauses`).

## Model Details

| Property | Value |
|----------|-------|
| Base model | `bert-base-uncased` |
| Architecture | `BertForSequenceClassification` |
| Task | Multi-class classification (71 classes) |
| Max sequence length | 128 tokens |
| Best F1 (macro) | **0.4704** |
| Best Accuracy | **0.515** |
| Parameters | ~109M |

## Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "HavelockAI/bert-marker-subtype"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

span = "it seems likely that this would, in principle, be feasible"
inputs = tokenizer(span, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    logits = model(**inputs).logits
pred = torch.argmax(logits, dim=1).item()

print(f"Marker subtype: {model.config.id2label[pred]}")
```
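Because many subtypes overlap semantically, the runner-up predictions are often as informative as the top-1 label. A pure-Python sketch of turning raw logits into a ranked probability list (the logit values below are hypothetical, for illustration only):

```python
import math

def ranked_probs(logits, id2label):
    """Softmax over raw logits, returned as (label, prob) pairs, best first."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sorted(zip(id2label, probs), key=lambda lp: lp[1], reverse=True)

# Hypothetical logits for three competing hedge-family subtypes
labels = ["epistemic_hedge", "qualified_assertion", "probability"]
for label, prob in ranked_probs([2.1, 1.3, 0.2], labels):
    print(f"{label}: {prob:.3f}")
```

With a real model, pass `model(**inputs).logits[0].tolist()` and `[model.config.id2label[i] for i in range(model.config.num_labels)]` instead of the toy values.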

## Label Taxonomy (71 subtypes)

### Oral Subtypes (38)

| Category | Subtypes |
|----------|----------|
| **Repetition & Pattern** | `anaphora`, `epistrophe`, `parallelism`, `tricolon`, `lexical_repetition`, `refrain` |
| **Sound & Rhythm** | `alliteration`, `assonance`, `rhyme`, `rhythm` |
| **Address & Interaction** | `vocative`, `imperative`, `second_person`, `inclusive_we`, `rhetorical_question`, `audience_response`, `phatic_check`, `phatic_filler` |
| **Conjunction** | `polysyndeton`, `asyndeton`, `simple_conjunction`, `binomial_expression` |
| **Formulas** | `discourse_formula`, `proverb`, `religious_formula`, `epithet` |
| **Narrative** | `named_individual`, `specific_place`, `temporal_anchor`, `sensory_detail`, `embodied_action`, `everyday_example` |
| **Performance** | `dramatic_pause`, `self_correction`, `conflict_frame`, `us_them`, `intensifier_doubling`, `antithesis` |

### Literate Subtypes (34)

| Category | Subtypes |
|----------|----------|
| **Abstraction** | `nominalization`, `abstract_noun`, `conceptual_metaphor`, `categorical_statement` |
| **Syntax** | `nested_clauses`, `relative_chain`, `conditional`, `concessive`, `temporal_embedding`, `causal_chain` |
| **Hedging** | `epistemic_hedge`, `probability`, `evidential`, `qualified_assertion`, `concessive_connector` |
| **Impersonality** | `agentless_passive`, `agent_demoted`, `institutional_subject`, `objectifying_stance`, `third_person_reference` |
| **Scholarly Apparatus** | `citation`, `footnote_reference`, `cross_reference`, `metadiscourse`, `methodological_framing` |
| **Technical** | `technical_term`, `technical_abbreviation`, `enumeration`, `list_structure`, `definitional_move` |
| **Connectives** | `contrastive`, `causal_explicit`, `additive_formal`, `aside` |

Note: the taxonomy lists 72 devices; `binomial_expression` does not appear in the trained 71-class label set (it is absent from the classification report below).

## Training

### Data

Span-level annotations from the Havelock corpus. Each span carries a `marker_subtype` field. Only subtypes with ≥15 examples in the full dataset are included. The corpus draws from Project Gutenberg, textfiles.com, Reddit, and Wikipedia talk pages.

A stratified 80/20 train/test split was used (random seed 42). The test set contains 4,608 spans.

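A stratified split like the one described can be sketched in pure Python: group spans by label, shuffle each group with the fixed seed, and hold out 20% of every class (a generic sketch, not the project's actual split code):

```python
import random
from collections import defaultdict

def stratified_split(examples, test_frac=0.2, seed=42):
    """80/20 split preserving per-class proportions."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    rng = random.Random(seed)
    train, test = [], []
    for label, items in by_label.items():
        rng.shuffle(items)
        n_test = max(1, round(len(items) * test_frac))  # every class reaches the test set
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test

# Toy data: 10 spans of one class, 5 of another
data = [("s%d" % i, "anaphora") for i in range(10)] + \
       [("t%d" % i, "vocative") for i in range(5)]
train, test = stratified_split(data)
print(len(train), len(test))  # 12 3
```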
### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Epochs | 3 |
| Batch size | 8 |
| Learning rate | 2e-5 |
| Optimizer | AdamW |
| LR schedule | Linear warmup (10% of total steps) |
| Gradient clipping | 1.0 |
| Loss | Cross-entropy |
| Min examples per class | 15 |

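The LR schedule above (linear warmup over the first 10% of steps, then linear decay to zero, as in the standard `transformers` linear schedule) can be sketched in a few lines; this is a generic sketch, not the exact training script:

```python
def linear_warmup_lr(step, total_steps, base_lr=2e-5, warmup_frac=0.1):
    """Linear warmup for the first warmup_frac of steps, then linear decay to zero."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return base_lr * step / warmup_steps  # ramp up from 0 to base_lr
    # decay linearly from base_lr at the end of warmup to 0 at total_steps
    remaining = total_steps - step
    return base_lr * remaining / (total_steps - warmup_steps)

total = 1000
print(linear_warmup_lr(0, total))     # 0.0 (start of warmup)
print(linear_warmup_lr(100, total))   # 2e-05 (peak, end of warmup)
print(linear_warmup_lr(1000, total))  # 0.0 (end of training)
```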
### Training Metrics

| Epoch | Loss | Accuracy | F1 (macro) |
|-------|------|----------|------------|
| 1 | 3.2554 | 0.4210 | 0.3060 |
| 2 | 2.0844 | 0.5033 | 0.4345 |
| 3 | 1.5922 | 0.5154 | 0.4704 |

The best checkpoint was selected by macro F1, at epoch 3; training loss was still declining steeply when training stopped.

### Test Set Classification Report

<details><summary>Click to expand per-class precision/recall/F1/support</summary>

```
                        precision   recall f1-score  support

           abstract_noun    0.262    0.333    0.294      144
         additive_formal    0.250    0.038    0.067       26
           agent_demoted    0.944    0.548    0.694       31
       agentless_passive    0.458    0.619    0.526      105
            alliteration    0.400    0.133    0.200       30
                anaphora    0.468    0.659    0.547       88
              antithesis    0.575    0.742    0.648       31
                   aside    0.467    0.127    0.200       55
               assonance    0.744    0.970    0.842       33
               asyndeton    0.867    0.433    0.578       30
       audience_response    0.800    0.533    0.640       30
   categorical_statement    0.362    0.388    0.374       98
            causal_chain    0.472    0.625    0.538       80
         causal_explicit    0.400    0.406    0.403       69
                citation    0.494    0.612    0.547       67
     conceptual_metaphor    0.235    0.055    0.089       73
              concessive    0.677    0.739    0.707       88
    concessive_connector    0.920    0.742    0.821       31
             conditional    0.627    0.671    0.648      155
          conflict_frame    0.800    0.774    0.787       31
             contrastive    0.390    0.595    0.471      116
         cross_reference    0.429    0.353    0.387       34
       definitional_move    0.429    0.077    0.130       39
       discourse_formula    0.499    0.703    0.583      276
          dramatic_pause    0.833    0.806    0.820       31
         embodied_action    0.286    0.377    0.325       69
             enumeration    0.504    0.694    0.584       85
         epistemic_hedge    0.429    0.624    0.508      101
              epistrophe    0.763    0.906    0.829       32
                 epithet    0.429    0.444    0.436       27
        everyday_example    0.432    0.390    0.410       41
              evidential    0.608    0.574    0.590       54
      footnote_reference    1.000    0.133    0.235       15
              imperative    0.617    0.760    0.681      146
            inclusive_we    0.579    0.700    0.634      120
   institutional_subject    0.586    0.548    0.567       31
    intensifier_doubling    0.792    0.633    0.704       30
      lexical_repetition    0.535    0.649    0.587       94
          list_structure    0.300    0.167    0.214       36
           metadiscourse    0.310    0.310    0.310       87
  methodological_framing    0.000    0.000    0.000       32
        named_individual    0.446    0.527    0.483       55
          nested_clauses    0.375    0.172    0.236       87
          nominalization    0.336    0.333    0.335      120
     objectifying_stance    0.250    0.023    0.043       43
             parallelism    0.250    0.052    0.086       58
            phatic_check    1.000    0.286    0.444       21
           phatic_filler    0.529    0.300    0.383       30
            polysyndeton    0.675    0.844    0.750       32
             probability    0.571    0.327    0.416       49
                 proverb    0.222    0.065    0.100       31
     qualified_assertion    0.286    0.100    0.148       60
                 refrain    0.895    0.567    0.694       30
          relative_chain    0.504    0.600    0.548      115
       religious_formula    0.917    0.688    0.786       32
     rhetorical_question    0.614    0.820    0.702      161
                   rhyme    0.545    0.562    0.554       32
                  rhythm    0.839    0.812    0.825       32
           second_person    0.557    0.600    0.578      235
         self_correction    0.895    0.567    0.694       30
          sensory_detail    0.000    0.000    0.000       37
      simple_conjunction    0.667    0.049    0.091       41
          specific_place    1.000    0.038    0.074       26
  technical_abbreviation    1.000    0.053    0.100       19
          technical_term    0.489    0.571    0.527      161
         temporal_anchor    0.471    0.490    0.480       49
      temporal_embedding    0.448    0.481    0.464       81
  third_person_reference    0.917    0.710    0.800       31
                tricolon    0.656    0.700    0.677       30
                 us_them    0.882    0.484    0.625       31
                vocative    0.593    0.603    0.598       58

                accuracy                      0.515     4608
               macro avg    0.561    0.465    0.470     4608
            weighted avg    0.512    0.515    0.490     4608
```

</details>

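The gap between the macro average (0.470) and the weighted average (0.490) in the report above comes from how per-class F1 scores are averaged: macro weights every class equally, while weighted scales by support. A toy sketch with illustrative numbers:

```python
def macro_f1(f1s):
    """Unweighted mean: every class counts equally, however rare."""
    return sum(f1s) / len(f1s)

def weighted_f1(f1s, supports):
    """Support-weighted mean: frequent classes dominate."""
    total = sum(supports)
    return sum(f * s for f, s in zip(f1s, supports)) / total

# Toy illustration: three classes of very different frequency
f1s = [0.80, 0.50, 0.10]
supports = [300, 100, 20]
print(round(macro_f1(f1s), 3))               # 0.467 — dragged down by the rare class
print(round(weighted_f1(f1s, supports), 3))  # 0.695 — dominated by the frequent class
```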
**Top performing subtypes (F1 > 0.75):** `assonance` (0.842), `epistrophe` (0.829), `rhythm` (0.825), `concessive_connector` (0.821), `dramatic_pause` (0.820), `third_person_reference` (0.800), `conflict_frame` (0.787), `religious_formula` (0.786), `polysyndeton` (0.750).

**Near-zero F1 subtypes (F1 ≤ 0.1):** `methodological_framing` (0.000), `sensory_detail` (0.000), `objectifying_stance` (0.043), `additive_formal` (0.067), `specific_place` (0.074), `parallelism` (0.086), `conceptual_metaphor` (0.089), `simple_conjunction` (0.091), `proverb` (0.100), `technical_abbreviation` (0.100). These tend to be either semantically diffuse classes or classes with very low support.

## Class Distribution

The test set exhibits significant imbalance across the 71 classes:

| Support Range | Classes | % of Total |
|---------------|---------|------------|
| >200 | 2 (`discourse_formula`, `second_person`) | 3% |
| 100–200 | 11 | 15% |
| 50–100 | 19 | 27% |
| 15–50 | 39 | 55% |

## Limitations

- **Severely undertrained**: 3 epochs with loss at 1.59 and still falling steeply. This model has the most headroom for improvement of the three span classifiers.
- **71-way classification on ~23k spans**: The data budget per class is thin, particularly for classes near the 15-example minimum. More data or class consolidation would help.
- **Semantic overlap**: Some subtypes are difficult to distinguish from surface text alone (e.g., `parallelism` vs. `anaphora` vs. `tricolon`; `epistemic_hedge` vs. `qualified_assertion` vs. `probability`). The model may benefit from hierarchical classification that conditions on type-level predictions.
- **Precision–recall tradeoff**: Many rare classes show high precision but very low recall (e.g., `footnote_reference`: P=1.000, R=0.133), suggesting the model learns narrow prototypes but misses variation.
- **Span-level only**: Requires pre-extracted spans; does not detect span boundaries.
- **128-token context window**: Longer spans are truncated.

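One way to realize the hierarchical-classification idea mentioned above is logit masking: use a type-level prediction to restrict which subtype logits compete in the argmax. A minimal sketch (the label grouping and logit values are illustrative, not the project's actual type→subtype mapping):

```python
def masked_argmax(logits, labels, allowed):
    """Argmax over subtype logits, restricted to subtypes of the predicted type."""
    candidates = [(score, label) for label, score in zip(labels, logits)
                  if label in allowed]
    return max(candidates)[1]

labels = ["anaphora", "epistemic_hedge", "probability", "vocative"]
logits = [1.9, 1.2, 0.4, 2.3]
# Unrestricted argmax would pick "vocative"; a type-level "hedging" prediction
# (illustrative grouping) masks the choice down to the hedge-family subtypes.
hedging = {"epistemic_hedge", "probability"}
print(masked_argmax(logits, labels, hedging))  # epistemic_hedge
```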
## Theoretical Background

The 71 subtypes represent the full granularity of the Havelock taxonomy, operationalizing Ong's oral–literate framework into specific, annotatable rhetorical devices. Oral subtypes capture the textural signatures of spoken and performative discourse: repetitive structures (`anaphora`, `epistrophe`, `tricolon`), sound patterning (`alliteration`, `assonance`, `rhythm`), direct audience engagement (`vocative`, `imperative`, `rhetorical_question`), and formulas (`proverb`, `epithet`, `discourse_formula`). Literate subtypes capture the apparatus of analytic prose: complex syntax (`nested_clauses`, `relative_chain`, `conditional`), epistemic positioning (`epistemic_hedge`, `evidential`, `probability`), impersonal voice (`agentless_passive`, `institutional_subject`), and scholarly machinery (`citation`, `footnote_reference`, `metadiscourse`).

## Related Models

| Model | Task | Classes | F1 |
|-------|------|---------|-----|
| [`HavelockAI/bert-marker-category`](https://huggingface.co/HavelockAI/bert-marker-category) | Binary (oral/literate) | 2 | 0.875 |
| [`HavelockAI/bert-marker-type`](https://huggingface.co/HavelockAI/bert-marker-type) | Functional type | 25 | 0.449 |
| **This model** | Fine-grained subtype | 71 | 0.470 |
| [`HavelockAI/bert-orality-regressor`](https://huggingface.co/HavelockAI/bert-orality-regressor) | Document-level score | Regression | MAE 0.079 |
| [`HavelockAI/bert-token-classifier`](https://huggingface.co/HavelockAI/bert-token-classifier) | Span detection (BIO) | 145 | 0.461 |

## Citation
```bibtex
@misc{havelock2026subtype,
  title={Havelock Marker Subtype Classifier},
  author={Havelock AI},
  year={2026},
  url={https://huggingface.co/HavelockAI/bert-marker-subtype}
}
```

## References

- Ong, Walter J. *Orality and Literacy: The Technologizing of the Word*. Routledge, 1982.

---

*Model version: da931b4a · Trained: February 2026*
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:57612c2d570b6ad4b50fa5e3983044ff32f89670b3a07fa1c01c9d802ed18fb6
+oid sha256:a5a1f8420254999b58763469bd26ef2ba803a70ee980986aca9881f290dd9bb4
 size 438170868