---
license: mit
tags:
  - token-classification
  - bert
  - orality
  - linguistics
  - multi-label
language:
  - en
metrics:
  - f1
base_model:
  - google-bert/bert-base-uncased
pipeline_tag: token-classification
library_name: transformers
datasets:
  - custom
---

# Havelock Orality Token Classifier

A BERT-based token classifier for detecting oral and literate markers in text, grounded in Walter Ong's *Orality and Literacy* (1982).

This model performs multi-label, span-level detection of 53 rhetorical marker types. Each token independently carries B/I/O labels per type, allowing overlapping spans (e.g. a token that is simultaneously part of a concessive and a nested clause).

## Model Details

| Property | Value |
|---|---|
| Base model | `bert-base-uncased` |
| Task | Multi-label token classification (independent B/I/O per type) |
| Marker types | 53 (22 oral, 31 literate) |
| Test macro F1 | 0.388 (per-type detection, binary positive = B or I) |
| Training | 20 epochs, batch size 24, lr 3e-5, fp16 |
| Regularization | Mixout (p=0.1), a stochastic L2 anchor to pretrained weights |
| Loss | Per-type weighted cross-entropy with inverse-frequency type weights |
| Min examples | 150 (types below this threshold excluded) |

## Usage

```python
import json
from pathlib import Path

import torch
from transformers import AutoTokenizer
from estimators.tokens.model import MultiLabelTokenClassifier

model_path = Path("models/bert_token_classifier")
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = MultiLabelTokenClassifier.load(model_path, device="cpu")
model.eval()

# Map marker-type indices back to their names
type_to_idx = json.loads((model_path / "type_to_idx.json").read_text())
idx_to_type = {v: k for k, v in type_to_idx.items()}

text = "Tell me, O Muse, of that ingenious hero who travelled far and wide"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    logits = model(inputs["input_ids"], inputs["attention_mask"])
    preds = logits.argmax(dim=-1)  # (1, seq, num_types), values 0=O, 1=B, 2=I

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for i, token in enumerate(tokens):
    active = [
        f"{idx_to_type[t]}={'OBI'[v]}"
        for t, v in enumerate(preds[0, i].tolist())
        if v > 0
    ]
    if active:
        print(f"{token:15} {', '.join(active)}")
```
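To go from per-token labels to marker spans, the O/B/I predictions for each type can be collapsed into (start, end) token ranges. A minimal sketch of one common decoding strategy; the `decode_spans` helper is hypothetical and not part of the released package:

```python
def decode_spans(preds, idx_to_type):
    """Collapse per-token O/B/I ids (0=O, 1=B, 2=I) into token spans.

    preds: (seq_len, num_types) nested sequence of label ids.
    Returns a list of (type_name, start, end) with end exclusive.
    Hypothetical helper, not part of the released package.
    """
    spans = []
    seq_len, num_types = len(preds), len(preds[0])
    for t in range(num_types):
        start = None
        for i in range(seq_len):
            v = preds[i][t]
            if v == 1:  # B: close any open span, then start a new one
                if start is not None:
                    spans.append((idx_to_type[t], start, i))
                start = i
            elif v == 2:  # I: continue a span (leniently start if none open)
                if start is None:
                    start = i
            else:  # O: close any open span
                if start is not None:
                    spans.append((idx_to_type[t], start, i))
                    start = None
        if start is not None:
            spans.append((idx_to_type[t], start, seq_len))
    return spans
```

Because each type is decoded independently, overlapping spans of different types fall out naturally.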

## Training Data

- Sources: Project Gutenberg, textfiles.com, Reddit, Wikipedia talk pages
- Types with fewer than 150 annotated spans are excluded from training
- Multi-label BIO annotation: tokens can carry labels for multiple overlapping marker types simultaneously

## Marker Types (53)

### Oral Markers (22 types)

Characteristics of oral tradition and spoken discourse:

| Category | Markers |
|---|---|
| Address & Interaction | vocative, imperative, second_person, inclusive_we, rhetorical_question, phatic_check, phatic_filler |
| Repetition & Pattern | anaphora, parallelism, tricolon, lexical_repetition, antithesis |
| Conjunction | simple_conjunction |
| Formulas | discourse_formula, intensifier_doubling |
| Narrative | named_individual, specific_place, temporal_anchor, sensory_detail, embodied_action, everyday_example |
| Performance | self_correction |

### Literate Markers (31 types)

Characteristics of written, analytical discourse:

| Category | Markers |
|---|---|
| Abstraction | nominalization, abstract_noun, conceptual_metaphor, categorical_statement |
| Syntax | nested_clauses, relative_chain, conditional, concessive, temporal_embedding, causal_explicit |
| Hedging | epistemic_hedge, probability, evidential, qualified_assertion, concessive_connector |
| Impersonality | agentless_passive, agent_demoted, institutional_subject, objectifying_stance |
| Scholarly apparatus | citation, cross_reference, metadiscourse, definitional_move |
| Technical | technical_term, technical_abbreviation, enumeration, list_structure |
| Connectives | contrastive, additive_formal |
| Setting | concrete_setting, aside |

## Evaluation

Per-type detection F1 on test set (binary: B or I = positive, O = negative):

<details>
<summary>Click to show per-marker precision/recall/F1/support</summary>

```
Type                              Prec    Rec     F1    Sup
===========================================================
literate_abstract_noun           0.119  0.114  0.116    466
literate_additive_formal         0.225  0.576  0.323     85
literate_agent_demoted           0.345  0.670  0.455    288
literate_agentless_passive       0.399  0.750  0.521   1286
literate_aside                   0.399  0.599  0.479    461
literate_categorical_statement   0.191  0.277  0.226    393
literate_causal_explicit         0.285  0.370  0.322    376
literate_citation                0.515  0.671  0.582    237
literate_conceptual_metaphor     0.172  0.387  0.238    222
literate_concessive              0.475  0.596  0.529    740
literate_concessive_connector    0.107  0.514  0.178     37
literate_concrete_setting        0.189  0.462  0.269    292
literate_conditional             0.511  0.823  0.631   1609
literate_contrastive             0.310  0.460  0.370    383
literate_cross_reference         0.390  0.366  0.377     82
literate_definitional_move       0.288  0.515  0.370     66
literate_enumeration             0.285  0.743  0.412    855
literate_epistemic_hedge         0.339  0.564  0.424    541
literate_evidential              0.323  0.630  0.427    162
literate_institutional_subject   0.237  0.532  0.328    250
literate_list_structure          0.795  0.529  0.635    652
literate_metadiscourse           0.243  0.446  0.314    361
literate_nested_clauses          0.148  0.398  0.216   1271
literate_nominalization          0.241  0.490  0.323   1140
literate_objectifying_stance     0.474  0.469  0.471    192
literate_probability             0.572  0.728  0.641    114
literate_qualified_assertion     0.132  0.163  0.146    123
literate_relative_chain          0.282  0.572  0.378   1753
literate_technical_abbreviation  0.381  0.773  0.510    132
literate_technical_term          0.264  0.481  0.341    908
literate_temporal_embedding      0.187  0.318  0.235    550
oral_anaphora                    0.120  0.348  0.179    141
oral_antithesis                  0.213  0.249  0.230    453
oral_discourse_formula           0.287  0.432  0.345    570
oral_embodied_action             0.247  0.430  0.314    465
oral_everyday_example            0.263  0.411  0.320    358
oral_imperative                  0.402  0.787  0.532    211
oral_inclusive_we                0.485  0.819  0.609    747
oral_intensifier_doubling        0.291  0.316  0.303     79
oral_lexical_repetition          0.331  0.550  0.414    218
oral_named_individual            0.386  0.708  0.500    818
oral_parallelism                 0.674  0.041  0.077    710
oral_phatic_check                0.432  0.829  0.568     76
oral_phatic_filler               0.340  0.630  0.442    184
oral_rhetorical_question         0.587  0.899  0.710   1276
oral_second_person               0.421  0.610  0.498    839
oral_self_correction             0.479  0.372  0.419    156
oral_sensory_detail              0.249  0.452  0.321    367
oral_simple_conjunction          0.096  0.343  0.150     70
oral_specific_place              0.396  0.717  0.510    367
oral_temporal_anchor             0.347  0.831  0.490    555
oral_tricolon                    0.217  0.220  0.218    560
oral_vocative                    0.505  0.759  0.607    133
===========================================================
Macro avg (types w/ support)     0.388
```

</details>

Missing labels (test set): 0/53; all types detected at least once.
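The per-type detection F1 above can be computed from flat O/B/I label sequences. A minimal sketch of the assumed protocol (B or I counts as positive, O as negative):

```python
def binary_f1(preds, golds):
    """Detection F1 for one marker type, treating B/I (ids 1 or 2) as the
    positive class and O (id 0) as negative, per the evaluation protocol
    described above. preds/golds: flat sequences of O/B/I label ids.
    Minimal sketch; the actual evaluation code is not shown in this card."""
    tp = sum(1 for p, g in zip(preds, golds) if p > 0 and g > 0)
    fp = sum(1 for p, g in zip(preds, golds) if p > 0 and g == 0)
    fn = sum(1 for p, g in zip(preds, golds) if p == 0 and g > 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

The macro F1 of 0.388 is then the unweighted mean of this score over all types with nonzero support.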

Notable patterns:

- **Strong performers (F1 > 0.5):** rhetorical_question (0.710), probability (0.641), list_structure (0.635), conditional (0.631), inclusive_we (0.609), vocative (0.607), citation (0.582), phatic_check (0.568)
- **Weak performers (F1 < 0.2):** parallelism (0.077), abstract_noun (0.116), qualified_assertion (0.146), simple_conjunction (0.150), concessive_connector (0.178), anaphora (0.179)
- **Precision-recall tradeoff:** most types show higher recall than precision, indicating the model over-predicts rather than under-predicts markers

## Architecture

Custom `MultiLabelTokenClassifier` with independent B/I/O heads per marker type:

```
BertModel (bert-base-uncased)
    └── Dropout (p=0.1)
        └── Linear (768 → num_types × 3)
            └── Reshape to (batch, seq, num_types, 3)
```

Each marker type gets an independent 3-way O/B/I classification, so a token can simultaneously carry labels for multiple overlapping marker types. Types share the full backbone representation but make independent predictions.
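A minimal PyTorch sketch of this classification head. The released `MultiLabelTokenClassifier` in `estimators.tokens.model` combines a head like this with the BERT backbone; the class and parameter names below are illustrative, not the package's actual API:

```python
import torch
import torch.nn as nn


class MultiLabelTokenHead(nn.Module):
    """Sketch of the head described above: one shared linear layer that
    produces independent O/B/I logits for every marker type at each token."""

    def __init__(self, hidden_size: int = 768, num_types: int = 53,
                 dropout: float = 0.1):
        super().__init__()
        self.num_types = num_types
        self.dropout = nn.Dropout(p=dropout)
        # Single projection to num_types * 3 logits per token
        self.linear = nn.Linear(hidden_size, num_types * 3)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq, hidden) from the BERT backbone
        logits = self.linear(self.dropout(hidden_states))
        batch, seq, _ = logits.shape
        # Reshape so each type gets its own 3-way O/B/I distribution
        return logits.view(batch, seq, self.num_types, 3)
```

Taking `argmax` over the final dimension yields one O/B/I decision per type per token, which is what allows overlapping spans.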

### Regularization

- **Mixout (p=0.1):** during training, each backbone weight element has a 10% chance of being replaced by its pretrained value per forward pass, acting as a stochastic L2 anchor that prevents representation drift (Lee et al., 2020)
- **Inverse-frequency type weights:** rare marker types receive higher loss weighting
- **Inverse-frequency OBI weights:** the B and I classes are upweighted relative to the dominant O class
- **Weighted random sampling:** examples containing rarer markers are sampled more frequently
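The inverse-frequency weighting and per-type weighted cross-entropy might look like the following sketch. The exact normalization and reduction used in training are assumptions here, not taken from the training code:

```python
import torch
import torch.nn.functional as F


def inverse_frequency_weights(counts):
    """Weights proportional to 1/count, renormalized to mean 1.0
    (one common choice; the normalization used in training is assumed)."""
    counts = torch.as_tensor(counts, dtype=torch.float)
    w = 1.0 / counts
    return w * len(w) / w.sum()


def multilabel_bio_loss(logits, labels, type_weights, obi_weights):
    """Sum of independent 3-way cross-entropies, one per marker type.

    logits: (batch, seq, num_types, 3); labels: (batch, seq, num_types)
    with O=0, B=1, I=2. type_weights scales each type's loss; obi_weights
    upweights B/I relative to the dominant O class."""
    num_types = logits.shape[2]
    total = 0.0
    for t in range(num_types):
        total = total + type_weights[t] * F.cross_entropy(
            logits[..., t, :].reshape(-1, 3),
            labels[..., t].reshape(-1),
            weight=obi_weights,
        )
    return total / num_types
```

Rarer types and the rarer B/I classes thus contribute proportionally more gradient, which is what keeps the 53-way multi-label head from collapsing to all-O predictions.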

### Initialization

Fine-tuned from `bert-base-uncased`. Backbone linear layers are wrapped with Mixout during training (a frozen pretrained copy serves as the anchor). The classification head is randomly initialized:

```
backbone.* layers  → loaded from pretrained, anchored via Mixout
classifier.weight  → randomly initialized
classifier.bias    → randomly initialized
```
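The Mixout anchoring can be sketched in functional form, following Lee et al. (2020). This is illustrative only; the training code wraps `nn.Linear` modules rather than calling a function like this, and may differ in detail:

```python
import torch


def mixout(weight, pretrained_weight, p=0.1, training=True):
    """Mixout: with probability p, replace each weight element with its
    frozen pretrained value, then rescale so the expected mixed weight
    equals the current weight (the paper's unbiasing step)."""
    if not training or p == 0.0:
        return weight
    # 1 = take the pretrained value, 0 = keep the current value
    mask = torch.bernoulli(torch.full_like(weight, p))
    mixed = mask * pretrained_weight + (1.0 - mask) * weight
    # E[mixed] = p * W0 + (1 - p) * W, so this rescaling restores E[.] = W
    return (mixed - p * pretrained_weight) / (1.0 - p)
```

At evaluation time the wrapper is a no-op, so inference uses the fine-tuned weights directly.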

## Limitations

- **Low-precision types:** several types show precision below 0.2, meaning most predictions for those types are false positives
- **Parallelism collapse:** oral_parallelism has high precision (0.674) but near-zero recall (0.041), suggesting the model learned a very narrow pattern
- **Context window:** 128 tokens max; longer spans may be truncated
- **Domain:** trained primarily on historical/literary texts; may underperform on modern social media
- **Subjectivity:** some marker boundaries are inherently ambiguous

## Citation

```bibtex
@misc{havelock2026token,
  title={Havelock Orality Token Classifier},
  author={Havelock AI},
  year={2026},
  url={https://huggingface.co/HavelockAI/bert-token-classifier}
}
```

## References

- Ong, Walter J. *Orality and Literacy: The Technologizing of the Word*. Routledge, 1982.
- Lee, Cheolhyoung, et al. "Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models." ICLR 2020.

Trained: February 2026