---
license: mit
tags:
- token-classification
- modernbert
- orality
- linguistics
- multi-label
language:
- en
metrics:
- f1
base_model:
- answerdotai/ModernBERT-base
pipeline_tag: token-classification
library_name: transformers
datasets:
- custom
---
# Havelock Orality Token Classifier
ModernBERT-based token classifier for detecting **oral and literate markers** in text, based on Walter Ong's "Orality and Literacy" (1982).
This model performs multi-label, span-level detection of 53 rhetorical marker types. Each token independently carries B/I/O labels per type, which allows overlapping spans (e.g. a token that is simultaneously part of a concessive and a nested clause).
## Model Details
| Property | Value |
|----------|-------|
| Base model | `answerdotai/ModernBERT-base` |
| Task | Multi-label token classification (independent B/I/O per type) |
| Marker types | 53 (22 oral, 31 literate) |
| Test macro F1 | **0.378** (per-type detection, binary positive = B or I) |
| Training | 20 epochs, fp16 |
| Regularization | Mixout (p=0.1): stochastic L2 anchor to pretrained weights |
| Loss | Per-type focal loss (γ=2.0) with inverse-frequency OBI and type weights |
| Min examples | 150 (types below this threshold excluded) |
## Usage
```python
import json

import torch
from huggingface_hub import hf_hub_download
from transformers import AutoModel, AutoTokenizer

model_name = "HavelockAI/bert-token-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# The custom architecture is loaded from the repository, hence trust_remote_code=True
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
model.eval()

# Load the marker-type map and invert it (index -> type name)
type_map_path = hf_hub_download(model_name, "type_to_idx.json")
with open(type_map_path, encoding="utf-8") as f:
    type_to_idx = json.load(f)
idx_to_type = {v: k for k, v in type_to_idx.items()}

text = "Tell me, O Muse, of that ingenious hero who travelled far and wide"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    logits = model(**inputs)  # (1, seq_len, num_types, 3)
preds = logits.argmax(dim=-1)  # (1, seq_len, num_types)

# Print every token that carries at least one B or I label (class order: O, B, I)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for i, token in enumerate(tokens):
    active = [
        f"{idx_to_type[t]}={'OBI'[v]}"
        for t, v in enumerate(preds[0, i].tolist())
        if v > 0
    ]
    if active:
        print(f"{token:15} {', '.join(active)}")
```
> **Note:** This model uses a custom architecture (`HavelockTokenClassifier`) with independent B/I/O heads per marker type, enabling overlapping span detection. Loading requires `trust_remote_code=True`.
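
The per-token labels printed above can also be grouped into spans. The following is a minimal sketch, not part of the released code: the helper name `decode_spans` is hypothetical, the label order O=0, B=1, I=2 is assumed from the `'OBI'[v]` indexing in the usage example, and treating a stray `I` as the start of a span is a lenient choice made for illustration.

```python
from typing import List, Tuple

O, B, I = 0, 1, 2  # label ids assumed from the 'OBI'[v] indexing in the usage example


def decode_spans(labels: List[int]) -> List[Tuple[int, int]]:
    """Group one marker type's per-token O/B/I labels into (start, end) spans, end exclusive."""
    spans: List[Tuple[int, int]] = []
    start = None
    for i, label in enumerate(labels):
        if label == B:
            # A new B closes any open span and opens a new one
            if start is not None:
                spans.append((start, i))
            start = i
        elif label == I:
            # Lenient choice: an I with no preceding B opens a span
            if start is None:
                start = i
        else:
            # O closes any open span
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(labels)))
    return spans
```

With the tensors from the usage example, `decode_spans(preds[0, :, t].tolist())` would give the token spans for type index `t`. The indices refer to tokenizer positions, so they include any special tokens the tokenizer adds.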
## Training Data
- Sources: Project Gutenberg, textfiles.com, Reddit, Wikipedia talk pages
- Types with fewer than 150 annotated spans are excluded from training
- Multi-label BIO annotation: tokens can carry labels for multiple overlapping marker types simultaneously
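
To make the multi-label BIO scheme concrete, here is a small sketch of how overlapping span annotations could be turned into per-token, per-type labels. It illustrates the labelling scheme only and is not the actual preprocessing code: the function name `spans_to_bio`, the `(type, start, end)` span format, and the label ids O=0, B=1, I=2 are assumptions.

```python
from typing import Dict, List, Tuple

O, B, I = 0, 1, 2  # assumed label ids


def spans_to_bio(
    num_tokens: int,
    spans: List[Tuple[str, int, int]],
    type_to_idx: Dict[str, int],
) -> List[List[int]]:
    """Build a (num_tokens x num_types) label grid from (type, start, end) spans, end exclusive."""
    labels = [[O] * len(type_to_idx) for _ in range(num_tokens)]
    for marker_type, start, end in spans:
        t = type_to_idx[marker_type]
        labels[start][t] = B  # first token of the span
        for pos in range(start + 1, end):
            labels[pos][t] = I  # remaining tokens of the span
    return labels


# Two overlapping spans: tokens 1 and 2 belong to both types at once
type_to_idx = {"literate_concessive": 0, "literate_nested_clauses": 1}
spans = [("literate_concessive", 0, 3), ("literate_nested_clauses", 1, 4)]
labels = spans_to_bio(5, spans, type_to_idx)
```

Because each type has its own column, the overlap between the two spans needs no special handling.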
## Marker Types (53)
### Oral Markers (22 types)
Characteristics of oral tradition and spoken discourse:
| Category | Markers |
|----------|---------|
| **Address & Interaction** | vocative, imperative, second_person, inclusive_we, rhetorical_question, phatic_check, phatic_filler |
| **Repetition & Pattern** | anaphora, parallelism, tricolon, lexical_repetition, antithesis |
| **Conjunction** | simple_conjunction |
| **Formulas** | discourse_formula, intensifier_doubling |
| **Narrative** | named_individual, specific_place, temporal_anchor, sensory_detail, embodied_action, everyday_example |
| **Performance** | self_correction |
### Literate Markers (31 types)
Characteristics of written, analytical discourse:
| Category | Markers |
|----------|---------|
| **Abstraction** | nominalization, abstract_noun, conceptual_metaphor, categorical_statement |
| **Syntax** | nested_clauses, relative_chain, conditional, concessive, temporal_embedding, causal_explicit |
| **Hedging** | epistemic_hedge, probability, evidential, qualified_assertion, concessive_connector |
| **Impersonality** | agentless_passive, agent_demoted, institutional_subject, objectifying_stance |
| **Scholarly apparatus** | citation, cross_reference, metadiscourse, definitional_move |
| **Technical** | technical_term, technical_abbreviation, enumeration, list_structure |
| **Connectives** | contrastive, additive_formal |
| **Setting** | concrete_setting, aside |
## Evaluation
Per-type detection F1 on test set (binary: B or I = positive, O = negative):
<details><summary>Click to show per-marker precision/recall/F1/support</summary>
```
Type Prec Rec F1 Sup
========================================================================
literate_abstract_noun 0.190 0.325 0.240 381
literate_additive_formal 0.246 0.556 0.341 27
literate_agent_demoted 0.404 0.368 0.386 304
literate_agentless_passive 0.575 0.607 0.591 1133
literate_aside 0.379 0.429 0.403 436
literate_categorical_statement 0.267 0.146 0.189 514
literate_causal_explicit 0.227 0.279 0.251 190
literate_citation 0.639 0.556 0.595 372
literate_conceptual_metaphor 0.310 0.364 0.335 415
literate_concessive 0.499 0.470 0.484 502
literate_concessive_connector 0.455 0.408 0.430 49
literate_concrete_setting 0.241 0.125 0.165 407
literate_conditional 0.369 0.630 0.466 760
literate_contrastive 0.310 0.428 0.360 341
literate_cross_reference 0.386 0.524 0.444 42
literate_definitional_move 0.395 0.185 0.252 81
literate_enumeration 0.495 0.483 0.489 775
literate_epistemic_hedge 0.421 0.481 0.449 445
literate_evidential 0.625 0.360 0.457 472
literate_institutional_subject 0.332 0.326 0.329 282
literate_list_structure 0.338 0.523 0.411 86
literate_metadiscourse 0.140 0.393 0.206 135
literate_nested_clauses 0.091 0.246 0.133 1169
literate_nominalization 0.499 0.612 0.549 991
literate_objectifying_stance 0.635 0.365 0.464 167
literate_probability 0.432 0.593 0.500 27
literate_qualified_assertion 0.143 0.100 0.118 40
literate_relative_chain 0.382 0.507 0.436 1424
literate_technical_abbreviation 0.667 0.711 0.688 225
literate_technical_term 0.280 0.375 0.321 715
literate_temporal_embedding 0.228 0.259 0.242 526
oral_anaphora 0.800 0.028 0.054 287
oral_antithesis 0.249 0.238 0.243 412
oral_discourse_formula 0.340 0.408 0.371 557
oral_embodied_action 0.280 0.391 0.326 425
oral_everyday_example 0.333 0.156 0.212 404
oral_imperative 0.591 0.662 0.625 293
oral_inclusive_we 0.516 0.632 0.568 622
oral_intensifier_doubling 0.680 0.200 0.309 85
oral_lexical_repetition 0.404 0.254 0.312 173
oral_named_individual 0.441 0.749 0.556 770
oral_parallelism 0.741 0.110 0.191 182
oral_phatic_check 0.611 0.733 0.667 30
oral_phatic_filler 0.174 0.409 0.244 93
oral_rhetorical_question 0.509 0.692 0.586 905
oral_second_person 0.576 0.552 0.564 811
oral_self_correction 0.158 0.235 0.189 51
oral_sensory_detail 0.285 0.169 0.212 461
oral_simple_conjunction 0.179 0.102 0.130 98
oral_specific_place 0.556 0.705 0.622 424
oral_temporal_anchor 0.410 0.559 0.473 546
oral_tricolon 0.299 0.119 0.171 553
oral_vocative 0.652 0.747 0.696 158
========================================================================
Macro avg (types w/ support) 0.378
```
</details>
**Missing labels (test set):** 0/53; all types are detected at least once.
Notable patterns:
- **Strong performers** (F1 ≥ 0.5): vocative (0.696), technical_abbreviation (0.688), phatic_check (0.667), imperative (0.625), specific_place (0.622), citation (0.595), agentless_passive (0.591), rhetorical_question (0.586), inclusive_we (0.568), second_person (0.564), named_individual (0.556), nominalization (0.549), probability (0.500)
- **Weak performers** (F1 < 0.2): anaphora (0.054), qualified_assertion (0.118), simple_conjunction (0.130), nested_clauses (0.133), concrete_setting (0.165), tricolon (0.171), categorical_statement (0.189), self_correction (0.189), parallelism (0.191)
- **Precision-recall tradeoff**: Most types show roughly balanced precision and recall. Notable exceptions are `anaphora` (0.800 precision / 0.028 recall), `parallelism` (0.741 / 0.110), and `intensifier_doubling` (0.680 / 0.200), which are high-precision but very low-recall.
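
The detection metric reported in this section (B or I = positive, O = negative, macro-averaged over types with support) can be sketched as follows. This is an illustrative reimplementation under assumptions, not the evaluation script: counting is assumed to be per token, labels are assumed to be 0/1/2 for O/B/I, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def detection_prf(gold: List[int], pred: List[int]) -> Tuple[float, float, float]:
    """Precision, recall and F1 for one type, counting any label > 0 (B or I) as positive."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p > 0 and g > 0:
            tp += 1
        elif p > 0:
            fp += 1
        elif g > 0:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_f1(gold_by_type: Dict[str, List[int]], pred_by_type: Dict[str, List[int]]) -> float:
    """Unweighted mean F1 over types that have at least one positive gold token."""
    scores = [
        detection_prf(gold, pred_by_type[name])[2]
        for name, gold in gold_by_type.items()
        if any(label > 0 for label in gold)
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

Macro averaging weights every type equally, so a rare type with poor F1 lowers the headline number as much as a frequent one.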
## Architecture
Custom `MultiLabelTokenClassifier` with independent B/I/O heads per marker type:
```
ModernBERT (answerdotai/ModernBERT-base)
    └── Dropout (p=0.1)
        └── Linear (hidden_size → num_types × 3)
            └── Reshape to (batch, seq, num_types, 3)
```
Each marker type gets an independent 3-way O/B/I classification, so a token can simultaneously carry labels for multiple overlapping marker types. Types share the full backbone representation but make independent predictions.
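
A head with this shape can be sketched in PyTorch as below. It is a stand-alone illustration rather than the model's actual code: the class name `MultiLabelBIOHead` is hypothetical, the backbone is replaced by a random tensor, and `hidden_size=768` is assumed for ModernBERT-base.

```python
import torch
from torch import nn


class MultiLabelBIOHead(nn.Module):
    """Hypothetical head: dropout, one linear layer, reshape to per-type O/B/I logits."""

    def __init__(self, hidden_size: int, num_types: int, dropout: float = 0.1):
        super().__init__()
        self.num_types = num_types
        self.dropout = nn.Dropout(dropout)
        # One shared projection produces 3 logits (O/B/I) for every marker type
        self.classifier = nn.Linear(hidden_size, num_types * 3)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        batch, seq_len, _ = hidden_states.shape
        logits = self.classifier(self.dropout(hidden_states))
        return logits.view(batch, seq_len, self.num_types, 3)


head = MultiLabelBIOHead(hidden_size=768, num_types=53).eval()
dummy_hidden = torch.randn(2, 16, 768)  # stand-in for the backbone output
with torch.no_grad():
    logits = head(dummy_hidden)
preds = logits.argmax(dim=-1)  # independent 3-way decision per token and type
```

Taking the argmax over the last axis only means that one type's decision never competes with another's, which is what permits overlapping spans.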
### Regularization
- **Mixout** (p=0.1): During training, each backbone weight element has a 10% chance of being replaced by its pretrained value on each forward pass, acting as a stochastic L2 anchor that prevents representation drift (Lee et al., 2020)
- **Per-type focal loss** (γ=2.0): Focuses learning on hard examples, reducing the contribution of easy negatives
- **Inverse-frequency type weights**: Rare marker types receive higher loss weighting
- **Inverse-frequency OBI weights**: B and I classes upweighted relative to dominant O class
- **Weighted random sampling**: Examples containing rarer markers sampled more frequently
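
The loss terms listed above can be illustrated for a single token and a single marker type. This is a sketch under assumptions, not the training code: the way the OBI class weight and the type weight are combined (multiplied here) and the normalisation of the inverse-frequency weights to mean 1 are both guesses, and all function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(
    logits: Sequence[float],
    target: int,
    gamma: float = 2.0,
    class_weights: Sequence[float] = (1.0, 1.0, 1.0),
    type_weight: float = 1.0,
) -> float:
    """Focal loss for one 3-way O/B/I decision: -w * (1 - p_t)^gamma * log(p_t)."""
    p_t = softmax(logits)[target]
    weight = type_weight * class_weights[target]  # assumed multiplicative combination
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)


def inverse_frequency_weights(counts: Sequence[int]) -> List[float]:
    """Weights proportional to 1/count, rescaled to mean 1 (the rescaling is an assumption)."""
    inv = [1.0 / c for c in counts]
    mean = sum(inv) / len(inv)
    return [w / mean for w in inv]
```

With `gamma=0.0` and unit weights, `focal_loss` reduces to ordinary cross-entropy. Raising `gamma` shrinks the loss on confidently correct predictions, which for this task are mostly `O` tokens.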
### Initialization
Fine-tuned from `answerdotai/ModernBERT-base`. Backbone linear layers wrapped with Mixout during training (frozen pretrained copy used as anchor). The classification head is randomly initialized:
```
backbone.* layers  → loaded from pretrained, anchored via Mixout
classifier.weight  → randomly initialized
classifier.bias    → randomly initialized
```
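
The Mixout anchoring applied to the backbone can be sketched on a flat list of weights. This follows the formulation in Lee et al., including the rescaling that keeps the expected value equal to the current weight. Whether this model's implementation matches it exactly is an assumption, and the function name and list-based representation are for illustration only.

```python
import random
from typing import List, Optional


def mixout(
    current: List[float],
    pretrained: List[float],
    p: float = 0.1,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Swap each weight for its pretrained value with probability p, then rescale."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    rng = rng or random.Random()
    mixed = []
    for w, u in zip(current, pretrained):
        chosen = u if rng.random() < p else w
        # Subtracting p*u and dividing by (1 - p) makes the expectation equal to w
        mixed.append((chosen - p * u) / (1.0 - p))
    return mixed
```

A swapped element comes out exactly at its pretrained value `u`, while a kept element is pushed slightly away from `u`, so on average the layer still computes with the current weights.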
## Limitations
- **Very low-recall types**: `anaphora` (0.028 recall), `qualified_assertion` (0.100), `simple_conjunction` (0.102), `parallelism` (0.110), and `tricolon` (0.119) are rarely detected despite being present in training data
- **Low-precision types**: `nested_clauses` (0.091), `metadiscourse` (0.140), and `qualified_assertion` (0.143) have precision below 0.15, meaning most predictions for those types are false positives
- **Context window**: 128 tokens max; longer spans may be truncated
- **Domain**: Trained primarily on historical/literary texts; may underperform on modern social media
- **Subjectivity**: Some marker boundaries are inherently ambiguous
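
For texts longer than the context window, one common workaround is to run the model over overlapping windows. The helper below is a hypothetical sketch of the splitting step only. The stride of 64 is an arbitrary choice, the model was not necessarily evaluated this way, and the windows would still need special tokens added and their predictions merged.

```python
from typing import List


def sliding_windows(token_ids: List[int], max_length: int = 128, stride: int = 64) -> List[List[int]]:
    """Split a token id sequence into overlapping windows of at most max_length tokens."""
    if max_length <= 0 or not 0 < stride <= max_length:
        raise ValueError("require max_length > 0 and 0 < stride <= max_length")
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # the last window reaches the end of the sequence
        start += stride
    return windows
```

If the tokenizer adds special tokens to each window, `max_length` should be reduced accordingly so the final input stays within 128 tokens.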
## Citation
```bibtex
@misc{havelock2026token,
title={Havelock Orality Token Classifier},
author={Havelock AI},
year={2026},
url={https://huggingface.co/HavelockAI/bert-token-classifier}
}
```
## References
- Ong, Walter J. *Orality and Literacy: The Technologizing of the Word*. Routledge, 1982.
- Lee, C. et al. "Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models." ICLR 2020.
- Warner, B. et al. "Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference." 2024.
---
*Trained: February 2026*