---
license: mit
tags:
- token-classification
- modernbert
- orality
- linguistics
- multi-label
language:
- en
metrics:
- f1
base_model:
- answerdotai/ModernBERT-base
pipeline_tag: token-classification
library_name: transformers
datasets:
- custom
---

# Havelock Orality Token Classifier

ModernBERT-based token classifier for detecting **oral and literate markers** in text, based on Walter Ong's "Orality and Literacy" (1982).

This model performs multi-label, span-level detection of 53 rhetorical marker types. Each token carries an independent B/I/O label for every type, so spans of different types can overlap (e.g. a token can be part of a concessive and a nested clause at the same time).

## Model Details

| Property | Value |
|----------|-------|
| Base model | `answerdotai/ModernBERT-base` |
| Task | Multi-label token classification (independent B/I/O per type) |
| Marker types | 53 (22 oral, 31 literate) |
| Test macro F1 | **0.378** (per-type detection, binary positive = B or I) |
| Training | 20 epochs, fp16 |
| Regularization | Mixout (p=0.1) — stochastic L2 anchor to pretrained weights |
| Loss | Per-type focal loss (γ=2.0) with inverse-frequency OBI and type weights |
| Min examples | 150 (types below this threshold excluded) |

## Usage
```python
import json
import torch
from transformers import AutoModel, AutoTokenizer
from huggingface_hub import hf_hub_download

model_name = "HavelockAI/bert-token-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
model.eval()

# Load marker type map
type_map_path = hf_hub_download(model_name, "type_to_idx.json")
with open(type_map_path, encoding="utf-8") as f:  # close the file handle after reading
    type_to_idx = json.load(f)
idx_to_type = {v: k for k, v in type_to_idx.items()}

text = "Tell me, O Muse, of that ingenious hero who travelled far and wide"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    logits = model(**inputs)  # (1, seq_len, num_types, 3)
    preds = logits.argmax(dim=-1)  # (1, seq_len, num_types)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for i, token in enumerate(tokens):
    active = [
        f"{idx_to_type[t]}={'OBI'[v]}"
        for t, v in enumerate(preds[0, i].tolist())
        if v > 0
    ]
    if active:
        print(f"{token:15} {', '.join(active)}")
```
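The per-token output above can be grouped into token spans per marker type. The following is a minimal sketch, assuming label ids 0 = O, 1 = B, 2 = I (consistent with the `'OBI'[v]` indexing in the snippet above). `decode_spans` and `decode_all` are hypothetical helpers and are not part of the model repository; the type names in the toy example are only used for illustration.

```python
from typing import Dict, List, Tuple

O, B, I = 0, 1, 2  # assumed label ids, matching 'OBI'[v] in the usage snippet


def decode_spans(tags: List[int]) -> List[Tuple[int, int]]:
    """Turn one type's O/B/I tag sequence into (start, end) token spans (end exclusive)."""
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag == B:
            if start is not None:
                spans.append((start, i))  # a new B closes the previous span
            start = i
        elif tag == I:
            if start is None:
                start = i  # lenient choice: an I without a preceding B opens a span
        else:
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans


def decode_all(
    pred_rows: List[List[int]], idx_to_type: Dict[int, str]
) -> Dict[str, List[Tuple[int, int]]]:
    """pred_rows[i][t] is the label id of token i for type t, e.g. preds[0].tolist()."""
    out = {}
    for t, name in idx_to_type.items():
        spans = decode_spans([row[t] for row in pred_rows])
        if spans:
            out[name] = spans
    return out


# Toy predictions: five tokens, two types, with overlapping spans at token 1
toy = [[1, 0], [2, 1], [0, 2], [0, 2], [1, 0]]
result = decode_all(toy, {0: "oral_vocative", 1: "literate_conditional"})
print(result)
# {'oral_vocative': [(0, 2), (4, 5)], 'literate_conditional': [(1, 4)]}
```

Span indices refer to positions in `tokens`, so they include any special tokens the tokenizer adds.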

> **Note:** This model uses a custom architecture (`HavelockTokenClassifier`) with independent B/I/O heads per marker type, enabling overlapping span detection. Loading requires `trust_remote_code=True`.

## Training Data

- Sources: Project Gutenberg, textfiles.com, Reddit, Wikipedia talk pages
- Types with fewer than 150 annotated spans are excluded from training
- Multi-label BIO annotation: tokens can carry labels for multiple overlapping marker types simultaneously
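To illustrate the multi-label BIO scheme, the sketch below builds a per-token, per-type label grid from annotated spans. It assumes label ids 0 = O, 1 = B, 2 = I and spans given as token offsets; `encode_bio` and the example annotation are hypothetical and do not reflect the actual preprocessing code.

```python
from typing import Dict, List, Tuple

O, B, I = 0, 1, 2  # assumed label ids


def encode_bio(
    num_tokens: int,
    spans: List[Tuple[str, int, int]],
    type_to_idx: Dict[str, int],
) -> List[List[int]]:
    """Build a (num_tokens x num_types) label grid from (type, start, end) token spans.

    `end` is exclusive. Each type has its own column, so spans of
    different types can overlap without conflict.
    """
    labels = [[O] * len(type_to_idx) for _ in range(num_tokens)]
    for type_name, start, end in spans:
        col = type_to_idx[type_name]
        labels[start][col] = B
        for i in range(start + 1, end):
            labels[i][col] = I
    return labels


# Hypothetical annotation: a concessive span that contains a nested-clause span
type_to_idx = {"literate_concessive": 0, "literate_nested_clauses": 1}
spans = [("literate_concessive", 0, 5), ("literate_nested_clauses", 2, 4)]
grid = encode_bio(6, spans, type_to_idx)
for row in grid:
    print(row)
# [1, 0]
# [2, 0]
# [2, 1]
# [2, 2]
# [2, 0]
# [0, 0]
```

Tokens 2 and 3 carry labels for both types at once. Overlapping spans of the same type would overwrite each other in this sketch.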

## Marker Types (53)

### Oral Markers (22 types)

Characteristics of oral tradition and spoken discourse:

| Category | Markers |
|----------|---------|
| **Address & Interaction** | vocative, imperative, second_person, inclusive_we, rhetorical_question, phatic_check, phatic_filler |
| **Repetition & Pattern** | anaphora, parallelism, tricolon, lexical_repetition, antithesis |
| **Conjunction** | simple_conjunction |
| **Formulas** | discourse_formula, intensifier_doubling |
| **Narrative** | named_individual, specific_place, temporal_anchor, sensory_detail, embodied_action, everyday_example |
| **Performance** | self_correction |

### Literate Markers (31 types)

Characteristics of written, analytical discourse:

| Category | Markers |
|----------|---------|
| **Abstraction** | nominalization, abstract_noun, conceptual_metaphor, categorical_statement |
| **Syntax** | nested_clauses, relative_chain, conditional, concessive, temporal_embedding, causal_explicit |
| **Hedging** | epistemic_hedge, probability, evidential, qualified_assertion, concessive_connector |
| **Impersonality** | agentless_passive, agent_demoted, institutional_subject, objectifying_stance |
| **Scholarly apparatus** | citation, cross_reference, metadiscourse, definitional_move |
| **Technical** | technical_term, technical_abbreviation, enumeration, list_structure |
| **Connectives** | contrastive, additive_formal |
| **Setting** | concrete_setting, aside |

## Evaluation

Per-type detection F1 on test set (binary: B or I = positive, O = negative):

<details><summary>Click to show per-marker precision/recall/F1/support</summary>

```
Type                                            Prec    Rec     F1    Sup
========================================================================
literate_abstract_noun                         0.190  0.325  0.240    381
literate_additive_formal                       0.246  0.556  0.341     27
literate_agent_demoted                         0.404  0.368  0.386    304
literate_agentless_passive                     0.575  0.607  0.591   1133
literate_aside                                 0.379  0.429  0.403    436
literate_categorical_statement                 0.267  0.146  0.189    514
literate_causal_explicit                       0.227  0.279  0.251    190
literate_citation                              0.639  0.556  0.595    372
literate_conceptual_metaphor                   0.310  0.364  0.335    415
literate_concessive                            0.499  0.470  0.484    502
literate_concessive_connector                  0.455  0.408  0.430     49
literate_concrete_setting                      0.241  0.125  0.165    407
literate_conditional                           0.369  0.630  0.466    760
literate_contrastive                           0.310  0.428  0.360    341
literate_cross_reference                       0.386  0.524  0.444     42
literate_definitional_move                     0.395  0.185  0.252     81
literate_enumeration                           0.495  0.483  0.489    775
literate_epistemic_hedge                       0.421  0.481  0.449    445
literate_evidential                            0.625  0.360  0.457    472
literate_institutional_subject                 0.332  0.326  0.329    282
literate_list_structure                        0.338  0.523  0.411     86
literate_metadiscourse                         0.140  0.393  0.206    135
literate_nested_clauses                        0.091  0.246  0.133   1169
literate_nominalization                        0.499  0.612  0.549    991
literate_objectifying_stance                   0.635  0.365  0.464    167
literate_probability                           0.432  0.593  0.500     27
literate_qualified_assertion                   0.143  0.100  0.118     40
literate_relative_chain                        0.382  0.507  0.436   1424
literate_technical_abbreviation                0.667  0.711  0.688    225
literate_technical_term                        0.280  0.375  0.321    715
literate_temporal_embedding                    0.228  0.259  0.242    526
oral_anaphora                                  0.800  0.028  0.054    287
oral_antithesis                                0.249  0.238  0.243    412
oral_discourse_formula                         0.340  0.408  0.371    557
oral_embodied_action                           0.280  0.391  0.326    425
oral_everyday_example                          0.333  0.156  0.212    404
oral_imperative                                0.591  0.662  0.625    293
oral_inclusive_we                              0.516  0.632  0.568    622
oral_intensifier_doubling                      0.680  0.200  0.309     85
oral_lexical_repetition                        0.404  0.254  0.312    173
oral_named_individual                          0.441  0.749  0.556    770
oral_parallelism                               0.741  0.110  0.191    182
oral_phatic_check                              0.611  0.733  0.667     30
oral_phatic_filler                             0.174  0.409  0.244     93
oral_rhetorical_question                       0.509  0.692  0.586    905
oral_second_person                             0.576  0.552  0.564    811
oral_self_correction                           0.158  0.235  0.189     51
oral_sensory_detail                            0.285  0.169  0.212    461
oral_simple_conjunction                        0.179  0.102  0.130     98
oral_specific_place                            0.556  0.705  0.622    424
oral_temporal_anchor                           0.410  0.559  0.473    546
oral_tricolon                                  0.299  0.119  0.171    553
oral_vocative                                  0.652  0.747  0.696    158
========================================================================
Macro avg (types w/ support)                                 0.378
```

</details>

**Missing labels (test set):** 0/53 — all types detected at least once.

Notable patterns:
- **Strong performers** (F1 ≥ 0.5): vocative (0.696), technical_abbreviation (0.688), phatic_check (0.667), imperative (0.625), specific_place (0.622), citation (0.595), agentless_passive (0.591), rhetorical_question (0.586), inclusive_we (0.568), second_person (0.564), named_individual (0.556), nominalization (0.549), probability (0.500)
- **Weak performers** (F1 < 0.2): anaphora (0.054), qualified_assertion (0.118), simple_conjunction (0.130), nested_clauses (0.133), concrete_setting (0.165), tricolon (0.171), categorical_statement (0.189), self_correction (0.189), parallelism (0.191)
- **Precision-recall tradeoff**: Most types show roughly balanced precision and recall. Notable exceptions include `anaphora` (0.800 precision / 0.028 recall), `parallelism` (0.741 / 0.110), and `intensifier_doubling` (0.680 / 0.200): the model rarely predicts these types, but is usually right when it does.
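The per-type numbers can be reproduced in spirit with a few lines of Python. The sketch below treats any B or I tag as positive and O as negative, as described above. It counts tokens, which is an assumption: the card does not state whether support is counted per token or per span. `binary_prf` and `f1_from_pr` are hypothetical helper names.

```python
from typing import List, Tuple


def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def binary_prf(gold: List[int], pred: List[int]) -> Tuple[float, float, float]:
    """Precision/recall/F1 for one type; any non-zero tag (B or I) counts as positive."""
    tp = sum(1 for g, p in zip(gold, pred) if g > 0 and p > 0)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p > 0)
    fn = sum(1 for g, p in zip(gold, pred) if g > 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from_pr(precision, recall)


# Toy tags for a single type over five tokens (0 = O, 1 = B, 2 = I)
print(binary_prf([0, 1, 2, 0, 0], [0, 1, 0, 2, 0]))  # (0.5, 0.5, 0.5)

# High precision cannot compensate for very low recall, as with `anaphora`
print(round(f1_from_pr(0.800, 0.028), 3))  # 0.054
```

The macro average is then the unweighted mean of the per-type F1 values, so rare and frequent types count equally.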

## Architecture

Custom `MultiLabelTokenClassifier` with independent B/I/O heads per marker type:
```
ModernBERT (answerdotai/ModernBERT-base)
    └── Dropout (p=0.1)
        └── Linear (hidden_size → num_types × 3)
            └── Reshape to (batch, seq, num_types, 3)
```

Each marker type gets an independent 3-way O/B/I classification, so a token can simultaneously carry labels for multiple overlapping marker types. Types share the full backbone representation but make independent predictions.
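The head described above can be sketched in a few lines of PyTorch. This is an illustrative stand-in, not the model's actual source: the class name `MultiLabelHead` is hypothetical, a random tensor replaces the backbone output, and the hidden size of 768 is the ModernBERT-base value.

```python
import torch
from torch import nn


class MultiLabelHead(nn.Module):
    """Hypothetical head: dropout, one linear layer, reshaped to 3 logits per type."""

    def __init__(self, hidden_size: int, num_types: int, dropout: float = 0.1):
        super().__init__()
        self.num_types = num_types
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_types * 3)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq, hidden_size), as produced by the backbone
        batch, seq, _ = hidden_states.shape
        logits = self.classifier(self.dropout(hidden_states))
        # One independent O/B/I triple per marker type
        return logits.view(batch, seq, self.num_types, 3)


head = MultiLabelHead(hidden_size=768, num_types=53).eval()
dummy = torch.randn(2, 16, 768)  # random stand-in for backbone output
with torch.no_grad():
    logits = head(dummy)
print(logits.shape)              # torch.Size([2, 16, 53, 3])
print(logits.argmax(-1).shape)   # torch.Size([2, 16, 53])
```

Taking the argmax over the last dimension yields one O/B/I decision per token and type, which is what allows overlapping spans.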

### Regularization

- **Mixout** (p=0.1): During training, each backbone weight element has a 10% chance of being replaced by its pretrained value on each forward pass, acting as a stochastic L2 anchor that limits representation drift (Lee et al., 2020)
- **Per-type focal loss** (γ=2.0): Focuses learning on hard examples, reducing the contribution of easy negatives
- **Inverse-frequency type weights**: Rare marker types receive higher loss weighting
- **Inverse-frequency OBI weights**: B and I classes upweighted relative to dominant O class
- **Weighted random sampling**: Examples containing rarer markers sampled more frequently
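The interaction of focal loss and inverse-frequency weights can be sketched for a single token and a single marker type. The code uses the standard focal loss form, `-(1 - p_t)^γ · log(p_t)`, in plain Python. How the OBI and type weights are normalized and combined in the actual training code is not stated here, so the multiplication and the mean-1 normalization below are assumptions, and all function names are hypothetical.

```python
import math
from typing import Dict, List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(
    logits: List[float],
    target: int,
    gamma: float = 2.0,
    class_weight: float = 1.0,
    type_weight: float = 1.0,
) -> float:
    """Focal loss for one 3-way O/B/I decision; gamma=0 reduces to weighted cross-entropy."""
    p_t = softmax(logits)[target]
    return -class_weight * type_weight * (1.0 - p_t) ** gamma * math.log(p_t)


def inverse_frequency_weights(counts: Dict[str, int]) -> Dict[str, float]:
    """Weights proportional to 1/count, normalized to mean 1 (one common convention)."""
    inv = {k: 1.0 / c for k, c in counts.items()}
    mean = sum(inv.values()) / len(inv)
    return {k: v / mean for k, v in inv.items()}


easy = focal_loss([6.0, 0.0, 0.0], target=0)  # confident, correct O prediction
hard = focal_loss([0.0, 0.0, 0.0], target=1)  # uncertain prediction on a B token
print(f"easy={easy:.6f} hard={hard:.4f}")

# Made-up label counts: the dominant O class gets the smallest weight
print(inverse_frequency_weights({"O": 900, "B": 40, "I": 60}))
```

With γ = 2.0 the confidently classified O token contributes almost nothing to the loss, while the uncertain B token keeps a substantial share.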

### Initialization

Fine-tuned from `answerdotai/ModernBERT-base`. During training, the backbone's linear layers are wrapped with Mixout, using a frozen copy of the pretrained weights as the anchor. The classification head is randomly initialized:
```
backbone.* layers  → loaded from pretrained, anchored via Mixout
classifier.weight  → randomly initialized
classifier.bias    → randomly initialized
```
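The Mixout anchoring can be illustrated on a flat list of weights. This is a minimal sketch following the formulation in Lee et al., in which replaced elements take the pretrained value and the result is rescaled so that its expectation equals the current weight. It is not the training code used for this model, and `mixout` is a hypothetical helper.

```python
import random
from typing import List


def mixout(
    weights: List[float], pretrained: List[float], p: float, rng: random.Random
) -> List[float]:
    """Replace each weight by its pretrained value with probability p, then rescale.

    A replaced element ends up exactly at the pretrained value; a kept element
    becomes (w - p * u) / (1 - p), so the expected output equals w.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    out = []
    for w, u in zip(weights, pretrained):
        mixed = u if rng.random() < p else w
        out.append((mixed - p * u) / (1.0 - p))
    return out


rng = random.Random(0)
pretrained = [0.5, -1.0, 2.0, 0.0]  # made-up pretrained values
current = [0.7, -0.8, 1.5, 0.3]     # made-up fine-tuned values
print(mixout(current, pretrained, p=0.1, rng=rng))
```

With `p=0.0` the function returns the current weights unchanged, which corresponds to ordinary fine-tuning without the anchor.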

## Limitations

- **Very low-recall types**: `anaphora` (0.028 recall), `qualified_assertion` (0.100), `simple_conjunction` (0.102), `parallelism` (0.110), and `tricolon` (0.119) are rarely detected despite being present in the training data
- **Low-precision types**: `nested_clauses` (0.091), `metadiscourse` (0.140), and `qualified_assertion` (0.143) have precision below 0.15, meaning most predictions for those types are false positives
- **Context window**: 128 tokens max; longer inputs are truncated, so spans that extend past the limit are cut off or missed
- **Domain**: Trained primarily on historical/literary texts; may underperform on modern social media
- **Subjectivity**: Some marker boundaries are inherently ambiguous
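One common workaround for the 128-token limit is to run the model over overlapping windows and merge the predictions afterwards. The sketch below only computes the window boundaries. It is an assumption about how one might apply the model to longer texts, not a documented feature, and `sliding_windows` and its `stride` default are hypothetical.

```python
from typing import List, Tuple


def sliding_windows(
    num_tokens: int, max_len: int = 128, stride: int = 64
) -> List[Tuple[int, int]]:
    """Return (start, end) token ranges of at most max_len tokens that cover the input.

    Consecutive windows start `stride` tokens apart, so they overlap
    by max_len - stride tokens.
    """
    if not 0 < stride <= max_len:
        raise ValueError("stride must be in (0, max_len]")
    windows = []
    start = 0
    while True:
        end = min(start + max_len, num_tokens)
        windows.append((start, end))
        if end >= num_tokens:
            break
        start += stride
    return windows


print(sliding_windows(300))
# [(0, 128), (64, 192), (128, 256), (192, 300)]
```

In practice `max_len` would need to leave room for special tokens. Hugging Face fast tokenizers can also produce overlapping chunks directly via `return_overflowing_tokens=True` together with `stride`, where `stride` denotes the overlap rather than the step.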

## Citation
```bibtex
@misc{havelock2026token,
  title={Havelock Orality Token Classifier},
  author={Havelock AI},
  year={2026},
  url={https://huggingface.co/HavelockAI/bert-token-classifier}
}
```

## References

- Ong, Walter J. *Orality and Literacy: The Technologizing of the Word*. Routledge, 1982.
- Lee, C. et al. "Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models." ICLR 2020.
- Warner, B. et al. "Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference." 2024.

---

*Trained: February 2026*