license: mit
tags:
- token-classification
- bert
- orality
- linguistics
- multi-label
language:
- en
metrics:
- f1
base_model:
- google-bert/bert-base-uncased
pipeline_tag: token-classification
library_name: transformers
datasets:
- custom
Havelock Orality Token Classifier
BERT-based token classifier for detecting oral and literate markers in text, based on Walter Ong's "Orality and Literacy" (1982).
This model performs multi-label span-level detection of 53 rhetorical marker types. Each token independently carries a B/I/O label for every type, which allows overlapping spans (e.g. a token that is simultaneously part of a concessive and a nested clause).
Model Details
| Property | Value |
|---|---|
| Base model | bert-base-uncased |
| Task | Multi-label token classification (independent B/I/O per type) |
| Marker types | 53 (22 oral, 31 literate) |
| Test macro F1 | 0.388 (per-type detection, binary positive = B or I) |
| Training | 20 epochs, batch 24, lr 3e-5, fp16 |
| Regularization | Mixout (p=0.1): stochastic L2 anchor to pretrained weights |
| Loss | Per-type weighted cross-entropy with inverse-frequency type weights |
| Min examples | 150 (types below this threshold excluded) |
Usage
import json
from pathlib import Path
import torch
from transformers import AutoTokenizer
from estimators.tokens.model import MultiLabelTokenClassifier
# Local checkpoint directory; a Path so the label map can be read with "/".
model_path = Path("models/bert_token_classifier")
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = MultiLabelTokenClassifier.load(str(model_path), device="cpu")
model.eval()
# Map class indices back to marker-type names.
type_to_idx = json.loads((model_path / "type_to_idx.json").read_text())
idx_to_type = {v: k for k, v in type_to_idx.items()}
text = "Tell me, O Muse, of that ingenious hero who travelled far and wide"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    logits = model(inputs["input_ids"], inputs["attention_mask"])
preds = logits.argmax(dim=-1)  # (1, seq, num_types); 0 = O, 1 = B, 2 = I
# Print every token that carries at least one non-O label.
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for i, token in enumerate(tokens):
    active = [
        f"{idx_to_type[t]}={'OBI'[v]}"
        for t, v in enumerate(preds[0, i].tolist())
        if v > 0
    ]
    if active:
        print(f"{token:15} {', '.join(active)}")
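The per-token output can also be grouped into spans. The following is a minimal sketch, assuming the predictions have been converted to a nested list `labels[token][type]` with 0 = O, 1 = B, 2 = I (the order implied by `'OBI'[v]` in the snippet). `decode_spans` is a hypothetical helper, not part of the released code, and treating an I with no preceding B as a span start is one lenient convention among several.

```python
def decode_spans(labels, idx_to_type):
    """Group per-type B/I/O labels into (type, start, end) spans, end exclusive.

    labels[i][t] is 0 (O), 1 (B) or 2 (I) for token i and marker type t.
    """
    spans = []
    num_types = len(labels[0]) if labels else 0
    for t in range(num_types):
        start = None  # start index of the currently open span for type t
        for i, row in enumerate(labels):
            tag = row[t]
            if tag == 1:  # B always opens a new span, closing any open one
                if start is not None:
                    spans.append((idx_to_type[t], start, i))
                start = i
            elif tag == 2:  # I continues a span; a stray I opens one
                if start is None:
                    start = i
            else:  # O closes the open span, if any
                if start is not None:
                    spans.append((idx_to_type[t], start, i))
                    start = None
        if start is not None:  # span running to the end of the sequence
            spans.append((idx_to_type[t], start, len(labels)))
    return spans


# Toy example: two types whose spans overlap on tokens 1-2.
toy_types = {0: "oral_vocative", 1: "oral_imperative"}
toy_labels = [
    [0, 1],  # token 0: imperative=B
    [1, 2],  # token 1: vocative=B, imperative=I
    [2, 2],  # token 2: vocative=I, imperative=I
    [0, 0],  # token 3: O for both types
]
toy_spans = decode_spans(toy_labels, toy_types)
print(toy_spans)
```

With the usage snippet, `preds[0].tolist()` would supply `labels`. Span indices then refer to wordpiece positions, including the `[CLS]` and `[SEP]` tokens.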
Training Data
- Sources: Project Gutenberg, textfiles.com, Reddit, Wikipedia talk pages
- Types with fewer than 150 annotated spans are excluded from training
- Multi-label BIO annotation: tokens can carry labels for multiple overlapping marker types simultaneously
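To make the multi-label BIO scheme concrete, here is a minimal sketch of how overlapping annotated spans could be turned into per-type labels. The function `encode_bio`, the span format `(type_name, start, end)`, and the 0 = O, 1 = B, 2 = I encoding are assumptions for illustration; the actual preprocessing code is not shown in this card.

```python
def encode_bio(num_tokens, spans, type_to_idx):
    """Return labels[token][type] with 0 = O, 1 = B, 2 = I.

    spans is a list of (type_name, start, end) token ranges, end exclusive.
    Each type has its own label column, so spans of different types may overlap.
    """
    labels = [[0] * len(type_to_idx) for _ in range(num_tokens)]
    for type_name, start, end in spans:
        t = type_to_idx[type_name]
        labels[start][t] = 1  # first token of the span is B
        for i in range(start + 1, end):
            labels[i][t] = 2  # remaining tokens are I
    return labels


# Hypothetical example: a concessive span that contains a nested-clause span.
example_types = {"literate_concessive": 0, "literate_nested_clauses": 1}
example_spans = [
    ("literate_concessive", 0, 4),
    ("literate_nested_clauses", 2, 5),
]
example_labels = encode_bio(6, example_spans, example_types)
for position, row in enumerate(example_labels):
    print(position, row)
```

Tokens 2 and 3 carry a non-O label in both columns, which a single-label BIO scheme could not express.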
Marker Types (53)
Oral Markers (22 types)
Characteristics of oral tradition and spoken discourse:
| Category | Markers |
|---|---|
| Address & Interaction | vocative, imperative, second_person, inclusive_we, rhetorical_question, phatic_check, phatic_filler |
| Repetition & Pattern | anaphora, parallelism, tricolon, lexical_repetition, antithesis |
| Conjunction | simple_conjunction |
| Formulas | discourse_formula, intensifier_doubling |
| Narrative | named_individual, specific_place, temporal_anchor, sensory_detail, embodied_action, everyday_example |
| Performance | self_correction |
Literate Markers (31 types)
Characteristics of written, analytical discourse:
| Category | Markers |
|---|---|
| Abstraction | nominalization, abstract_noun, conceptual_metaphor, categorical_statement |
| Syntax | nested_clauses, relative_chain, conditional, concessive, temporal_embedding, causal_explicit |
| Hedging | epistemic_hedge, probability, evidential, qualified_assertion, concessive_connector |
| Impersonality | agentless_passive, agent_demoted, institutional_subject, objectifying_stance |
| Scholarly apparatus | citation, cross_reference, metadiscourse, definitional_move |
| Technical | technical_term, technical_abbreviation, enumeration, list_structure |
| Connectives | contrastive, additive_formal |
| Setting | concrete_setting, aside |
Evaluation
Per-type detection F1 on test set (binary: B or I = positive, O = negative):
Per-marker precision/recall/F1/support:
```
Type                               Prec     Rec      F1     Sup
========================================================================
literate_abstract_noun            0.119   0.114   0.116     466
literate_additive_formal          0.225   0.576   0.323      85
literate_agent_demoted            0.345   0.670   0.455     288
literate_agentless_passive        0.399   0.750   0.521    1286
literate_aside                    0.399   0.599   0.479     461
literate_categorical_statement    0.191   0.277   0.226     393
literate_causal_explicit          0.285   0.370   0.322     376
literate_citation                 0.515   0.671   0.582     237
literate_conceptual_metaphor      0.172   0.387   0.238     222
literate_concessive               0.475   0.596   0.529     740
literate_concessive_connector     0.107   0.514   0.178      37
literate_concrete_setting         0.189   0.462   0.269     292
literate_conditional              0.511   0.823   0.631    1609
literate_contrastive              0.310   0.460   0.370     383
literate_cross_reference          0.390   0.366   0.377      82
literate_definitional_move        0.288   0.515   0.370      66
literate_enumeration              0.285   0.743   0.412     855
literate_epistemic_hedge          0.339   0.564   0.424     541
literate_evidential               0.323   0.630   0.427     162
literate_institutional_subject    0.237   0.532   0.328     250
literate_list_structure           0.795   0.529   0.635     652
literate_metadiscourse            0.243   0.446   0.314     361
literate_nested_clauses           0.148   0.398   0.216    1271
literate_nominalization           0.241   0.490   0.323    1140
literate_objectifying_stance      0.474   0.469   0.471     192
literate_probability              0.572   0.728   0.641     114
literate_qualified_assertion      0.132   0.163   0.146     123
literate_relative_chain           0.282   0.572   0.378    1753
literate_technical_abbreviation   0.381   0.773   0.510     132
literate_technical_term           0.264   0.481   0.341     908
literate_temporal_embedding       0.187   0.318   0.235     550
oral_anaphora                     0.120   0.348   0.179     141
oral_antithesis                   0.213   0.249   0.230     453
oral_discourse_formula            0.287   0.432   0.345     570
oral_embodied_action              0.247   0.430   0.314     465
oral_everyday_example             0.263   0.411   0.320     358
oral_imperative                   0.402   0.787   0.532     211
oral_inclusive_we                 0.485   0.819   0.609     747
oral_intensifier_doubling         0.291   0.316   0.303      79
oral_lexical_repetition           0.331   0.550   0.414     218
oral_named_individual             0.386   0.708   0.500     818
oral_parallelism                  0.674   0.041   0.077     710
oral_phatic_check                 0.432   0.829   0.568      76
oral_phatic_filler                0.340   0.630   0.442     184
oral_rhetorical_question          0.587   0.899   0.710    1276
oral_second_person                0.421   0.610   0.498     839
oral_self_correction              0.479   0.372   0.419     156
oral_sensory_detail               0.249   0.452   0.321     367
oral_simple_conjunction           0.096   0.343   0.150      70
oral_specific_place               0.396   0.717   0.510     367
oral_temporal_anchor              0.347   0.831   0.490     555
oral_tricolon                     0.217   0.220   0.218     560
oral_vocative                     0.505   0.759   0.607     133
========================================================================
Macro avg (types w/ support)                      0.388
```
Missing labels (test set): 0/53; all types detected at least once.
Notable patterns:
- Strong performers (F1 > 0.5): rhetorical_question (0.710), probability (0.641), list_structure (0.635), conditional (0.631), inclusive_we (0.609), vocative (0.607), citation (0.582), phatic_check (0.568), imperative (0.532), concessive (0.529), agentless_passive (0.521), technical_abbreviation (0.510), specific_place (0.510)
- Weak performers (F1 < 0.2): parallelism (0.077), abstract_noun (0.116), qualified_assertion (0.146), simple_conjunction (0.150), concessive_connector (0.178), anaphora (0.179)
- Precision-recall tradeoff: Most types show higher recall than precision, indicating the model over-predicts rather than under-predicts markers
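The detection metric described above can be sketched as follows. This is a minimal illustration, assuming the scores are computed token by token with B and I both counted as positive and that the macro average skips types with no gold positives. The function names are hypothetical and the evaluation script itself is not part of this card.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def detection_prf(gold, pred):
    """Binary token-level precision/recall/F1 for one marker type.

    gold and pred hold 0 (O), 1 (B) or 2 (I); any non-zero label is positive.
    """
    tp = sum(1 for g, p in zip(gold, pred) if g > 0 and p > 0)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p > 0)
    fn = sum(1 for g, p in zip(gold, pred) if g > 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


def macro_f1(gold_by_type, pred_by_type):
    """Unweighted mean F1 over types that have at least one gold positive."""
    scores = [
        detection_prf(gold, pred_by_type[name])[2]
        for name, gold in gold_by_type.items()
        if any(g > 0 for g in gold)
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Toy label sequences for a single type: 2 true positives, 1 false positive,
# 1 false negative.
toy_gold = [0, 1, 2, 0, 0, 1]
toy_pred = [0, 1, 0, 2, 0, 2]
toy_p, toy_r, toy_f1 = detection_prf(toy_gold, toy_pred)
print(round(toy_p, 3), round(toy_r, 3), round(toy_f1, 3))
```

As a consistency check against the table, `f1_score(0.587, 0.899)` rounds to 0.710, the value reported for oral_rhetorical_question.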
Architecture
Custom MultiLabelTokenClassifier with independent B/I/O heads per marker type:
BertModel (bert-base-uncased)
└── Dropout (p=0.1)
    └── Linear (768 → num_types × 3)
        └── Reshape to (batch, seq, num_types, 3)
Each marker type gets an independent 3-way O/B/I classification, so a token can simultaneously carry labels for multiple overlapping marker types. Types share the full backbone representation but make independent predictions.
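The reshape step can be illustrated without the backbone. The sketch below uses plain Python and assumes that the linear layer's output for one token is a flat vector of num_types × 3 logits, which a row-major reshape to (num_types, 3) splits type by type, and that the class order is O, B, I as in the usage snippet. `split_heads` and `argmax` are hypothetical helpers for illustration only.

```python
def split_heads(flat_logits, num_types, num_classes=3):
    """Split a flat logit vector into one row of class logits per marker type."""
    if len(flat_logits) != num_types * num_classes:
        raise ValueError("expected num_types * num_classes logits")
    return [
        flat_logits[t * num_classes:(t + 1) * num_classes]
        for t in range(num_types)
    ]


def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


# Made-up logits for one token and three marker types.
flat = [
    2.0, 0.1, -1.0,   # type 0: O has the highest logit
    -0.5, 1.5, 0.3,   # type 1: B has the highest logit
    0.0, 0.2, 0.9,    # type 2: I has the highest logit
]
per_type = split_heads(flat, num_types=3)
decisions = ["OBI"[argmax(row)] for row in per_type]
print(decisions)  # ['O', 'B', 'I']
```

Because the argmax is taken within each row, the decision for one type never competes with the decision for another, which is what permits overlapping spans.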
Regularization
- Mixout (p=0.1): During training, each backbone weight element has a 10% chance of being replaced by its pretrained value on each forward pass, acting as a stochastic L2 anchor that limits representation drift (Lee et al., 2020)
- Inverse-frequency type weights: Rare marker types receive higher loss weighting
- Inverse-frequency OBI weights: B and I classes upweighted relative to dominant O class
- Weighted random sampling: Examples containing rarer markers sampled more frequently
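As an illustration of the inverse-frequency weighting in the list above, here is a minimal sketch. The normalization to a mean weight of 1 and the counts are assumptions made for the example; the card does not state the exact formula or any smoothing used in training.

```python
def inverse_frequency_weights(counts):
    """Weights proportional to 1 / count, rescaled so their mean is 1.

    Entries with a zero count are skipped to avoid division by zero.
    """
    inverse = {name: 1.0 / n for name, n in counts.items() if n > 0}
    mean_inverse = sum(inverse.values()) / len(inverse)
    return {name: value / mean_inverse for name, value in inverse.items()}


# Hypothetical label counts for one marker type: O dominates, B is rarest.
toy_counts = {"O": 900, "B": 40, "I": 60}
toy_weights = inverse_frequency_weights(toy_counts)
for name, weight in toy_weights.items():
    print(f"{name}: {weight:.3f}")
```

The same function could be applied to per-type span counts to obtain type weights, or to per-example rarity scores to obtain sampling weights, under the same assumptions.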
Initialization
Fine-tuned from bert-base-uncased. Backbone linear layers wrapped with Mixout during training (frozen pretrained copy used as anchor). The classification head is randomly initialized:
backbone.* layers → loaded from pretrained, anchored via Mixout
classifier.weight → randomly initialized
classifier.bias → randomly initialized
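The Mixout anchoring can be sketched on a flat list of weights. This is an illustration in plain Python, not the training code: the helper name `mixout` is hypothetical, and the rescaling step (which keeps the expected mixed weight equal to the current weight) follows the formulation in Lee et al. and is assumed rather than confirmed for this model.

```python
import random


def mixout(current, pretrained, p=0.1, rng=None):
    """Return weights for one forward pass under Mixout.

    Each element is replaced by its pretrained value with probability p,
    then rescaled so that the expected result equals the current weight.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    rng = rng or random.Random()
    mixed = []
    for w, w_pre in zip(current, pretrained):
        picked = w_pre if rng.random() < p else w
        mixed.append((picked - p * w_pre) / (1.0 - p))
    return mixed


# Toy weights: the fine-tuned values have drifted away from the anchor.
toy_current = [1.0, 2.0, -0.5, 0.3]
toy_pretrained = [0.0, 0.0, 0.0, 0.0]
toy_mixed = mixout(toy_current, toy_pretrained, p=0.5, rng=random.Random(0))
print(toy_mixed)
```

At evaluation time no mixing is applied, so the model uses its fine-tuned weights unchanged.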
Limitations
- Low-precision types: Several types show precision below 0.2, meaning most predictions for those types are false positives
- Parallelism collapse: oral_parallelism has high precision (0.674) but near-zero recall (0.041), suggesting the model learned a very narrow pattern
- Context window: 128 tokens max; longer inputs are truncated, so spans that extend past the limit may be cut off
- Domain: Trained primarily on historical/literary texts; may underperform on modern social media
- Subjectivity: Some marker boundaries are inherently ambiguous
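One common way to work around the 128-token limit is to run the model over overlapping windows. The sketch below only computes the window boundaries. The helper name, the stride of 64, and the idea of merging per-window predictions afterwards are assumptions, not part of the released code, and the window size would need to leave room for the `[CLS]` and `[SEP]` tokens.

```python
def sliding_windows(num_tokens, window=126, stride=64):
    """Return (start, end) token ranges, end exclusive, covering the sequence.

    Consecutive windows overlap by window - stride tokens.
    """
    if window <= 0 or not 0 < stride <= window:
        raise ValueError("need window > 0 and 0 < stride <= window")
    windows = []
    start = 0
    while True:
        end = min(start + window, num_tokens)
        windows.append((start, end))
        if end >= num_tokens:
            break
        start += stride
    return windows


# A hypothetical 300-token document split into overlapping windows.
toy_windows = sliding_windows(300)
print(toy_windows)
```

How predictions from overlapping windows are merged (for example, keeping each token's labels from the window in which it is farthest from an edge) is a design choice that would need validating against the test set.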
Citation
@misc{havelock2026token,
title={Havelock Orality Token Classifier},
author={Havelock AI},
year={2026},
url={https://huggingface.co/HavelockAI/bert-token-classifier}
}
References
- Ong, Walter J. Orality and Literacy: The Technologizing of the Word. Routledge, 1982.
- Lee, C. et al. "Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models." ICLR 2020.
Trained: February 2026