---
license: mit
tags:
- token-classification
- modernbert
- orality
- linguistics
- multi-label
language:
- en
metrics:
- f1
base_model:
- answerdotai/ModernBERT-base
pipeline_tag: token-classification
library_name: transformers
datasets:
- custom
---

# Havelock Orality Token Classifier

ModernBERT-based token classifier for detecting **oral and literate markers** in text, based on Walter Ong's "Orality and Literacy" (1982).

This model performs multi-label span-level detection of 53 rhetorical marker types, where each token independently carries B/I/O labels per type β€” allowing overlapping spans (e.g. a token that is simultaneously part of a concessive and a nested clause).

## Model Details

| Property | Value |
|----------|-------|
| Base model | `answerdotai/ModernBERT-base` |
| Task | Multi-label token classification (independent B/I/O per type) |
| Marker types | 53 (22 oral, 31 literate) |
| Test macro F1 | **0.378** (per-type detection, binary positive = B or I) |
| Training | 20 epochs, fp16 |
| Regularization | Mixout (p=0.1) β€” stochastic L2 anchor to pretrained weights |
| Loss | Per-type focal loss (Ξ³=2.0) with inverse-frequency OBI and type weights |
| Min examples | 150 (types below this threshold excluded) |

## Usage
```python
import json
import torch
from transformers import AutoModel, AutoTokenizer
from huggingface_hub import hf_hub_download

model_name = "HavelockAI/bert-token-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
model.eval()

# Load marker type map
type_map_path = hf_hub_download(model_name, "type_to_idx.json")
with open(type_map_path) as f:
    type_to_idx = json.load(f)
idx_to_type = {v: k for k, v in type_to_idx.items()}

text = "Tell me, O Muse, of that ingenious hero who travelled far and wide"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    logits = model(**inputs)  # (1, seq_len, num_types, 3)
    preds = logits.argmax(dim=-1)  # (1, seq_len, num_types)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for i, token in enumerate(tokens):
    active = [
        f"{idx_to_type[t]}={'OBI'[v]}"
        for t, v in enumerate(preds[0, i].tolist())
        if v > 0
    ]
    if active:
        print(f"{token:15} {', '.join(active)}")
```

> **Note:** This model uses a custom architecture (`HavelockTokenClassifier`) with independent B/I/O heads per marker type, enabling overlapping span detection. Loading requires `trust_remote_code=True`.
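To turn the per-token O/B/I predictions from the snippet above into contiguous spans, each type's tag sequence can be decoded independently. A minimal decoding sketch (plain Python; `decode_spans` is an illustrative helper, not part of the released code):

```python
def decode_spans(tags):
    """Collapse one type's per-token O/B/I tag sequence (0=O, 1=B, 2=I)
    into (start, end) token-index spans, end exclusive."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == 1:  # B: close any open span, start a new one
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == 2:  # I: continue the open span (lenient: also opens one)
            if start is None:
                start = i
        else:  # O: close any open span
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans
```

Because each marker type is decoded from its own tag sequence, spans from different types can freely overlap on the same tokens.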

## Training Data

- Sources: Project Gutenberg, textfiles.com, Reddit, Wikipedia talk pages
- Types with fewer than 150 annotated spans are excluded from training
- Multi-label BIO annotation: tokens can carry labels for multiple overlapping marker types simultaneously
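The multi-label BIO scheme above amounts to one O/B/I tag per token *per type*. A minimal encoding sketch (plain Python; `encode_bio` and the example type names are illustrative, not the released pipeline):

```python
def encode_bio(num_tokens, spans_by_type, type_to_idx):
    """Build a (num_tokens x num_types) grid of O/B/I tags (0/1/2) from
    possibly-overlapping spans; spans are (start, end) token ranges,
    end exclusive."""
    num_types = len(type_to_idx)
    labels = [[0] * num_types for _ in range(num_tokens)]
    for type_name, spans in spans_by_type.items():
        t = type_to_idx[type_name]
        for start, end in spans:
            labels[start][t] = 1              # B opens the span
            for i in range(start + 1, end):
                labels[i][t] = 2              # I continues it
    return labels
```

A token inside two overlapping spans simply carries a non-O tag in two type columns at once.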

## Marker Types (53)

### Oral Markers (22 types)

Characteristics of oral tradition and spoken discourse:

| Category | Markers |
|----------|---------|
| **Address & Interaction** | vocative, imperative, second_person, inclusive_we, rhetorical_question, phatic_check, phatic_filler |
| **Repetition & Pattern** | anaphora, parallelism, tricolon, lexical_repetition, antithesis |
| **Conjunction** | simple_conjunction |
| **Formulas** | discourse_formula, intensifier_doubling |
| **Narrative** | named_individual, specific_place, temporal_anchor, sensory_detail, embodied_action, everyday_example |
| **Performance** | self_correction |

### Literate Markers (31 types)

Characteristics of written, analytical discourse:

| Category | Markers |
|----------|---------|
| **Abstraction** | nominalization, abstract_noun, conceptual_metaphor, categorical_statement |
| **Syntax** | nested_clauses, relative_chain, conditional, concessive, temporal_embedding, causal_explicit |
| **Hedging** | epistemic_hedge, probability, evidential, qualified_assertion, concessive_connector |
| **Impersonality** | agentless_passive, agent_demoted, institutional_subject, objectifying_stance |
| **Scholarly apparatus** | citation, cross_reference, metadiscourse, definitional_move |
| **Technical** | technical_term, technical_abbreviation, enumeration, list_structure |
| **Connectives** | contrastive, additive_formal |
| **Setting** | concrete_setting, aside |

## Evaluation

Per-type detection F1 on test set (binary: B or I = positive, O = negative):

<details><summary>Click to show per-marker precision/recall/F1/support</summary>

```
Type                                            Prec    Rec     F1    Sup
========================================================================
literate_abstract_noun                         0.190  0.325  0.240    381
literate_additive_formal                       0.246  0.556  0.341     27
literate_agent_demoted                         0.404  0.368  0.386    304
literate_agentless_passive                     0.575  0.607  0.591   1133
literate_aside                                 0.379  0.429  0.403    436
literate_categorical_statement                 0.267  0.146  0.189    514
literate_causal_explicit                       0.227  0.279  0.251    190
literate_citation                              0.639  0.556  0.595    372
literate_conceptual_metaphor                   0.310  0.364  0.335    415
literate_concessive                            0.499  0.470  0.484    502
literate_concessive_connector                  0.455  0.408  0.430     49
literate_concrete_setting                      0.241  0.125  0.165    407
literate_conditional                           0.369  0.630  0.466    760
literate_contrastive                           0.310  0.428  0.360    341
literate_cross_reference                       0.386  0.524  0.444     42
literate_definitional_move                     0.395  0.185  0.252     81
literate_enumeration                           0.495  0.483  0.489    775
literate_epistemic_hedge                       0.421  0.481  0.449    445
literate_evidential                            0.625  0.360  0.457    472
literate_institutional_subject                 0.332  0.326  0.329    282
literate_list_structure                        0.338  0.523  0.411     86
literate_metadiscourse                         0.140  0.393  0.206    135
literate_nested_clauses                        0.091  0.246  0.133   1169
literate_nominalization                        0.499  0.612  0.549    991
literate_objectifying_stance                   0.635  0.365  0.464    167
literate_probability                           0.432  0.593  0.500     27
literate_qualified_assertion                   0.143  0.100  0.118     40
literate_relative_chain                        0.382  0.507  0.436   1424
literate_technical_abbreviation                0.667  0.711  0.688    225
literate_technical_term                        0.280  0.375  0.321    715
literate_temporal_embedding                    0.228  0.259  0.242    526
oral_anaphora                                  0.800  0.028  0.054    287
oral_antithesis                                0.249  0.238  0.243    412
oral_discourse_formula                         0.340  0.408  0.371    557
oral_embodied_action                           0.280  0.391  0.326    425
oral_everyday_example                          0.333  0.156  0.212    404
oral_imperative                                0.591  0.662  0.625    293
oral_inclusive_we                              0.516  0.632  0.568    622
oral_intensifier_doubling                      0.680  0.200  0.309     85
oral_lexical_repetition                        0.404  0.254  0.312    173
oral_named_individual                          0.441  0.749  0.556    770
oral_parallelism                               0.741  0.110  0.191    182
oral_phatic_check                              0.611  0.733  0.667     30
oral_phatic_filler                             0.174  0.409  0.244     93
oral_rhetorical_question                       0.509  0.692  0.586    905
oral_second_person                             0.576  0.552  0.564    811
oral_self_correction                           0.158  0.235  0.189     51
oral_sensory_detail                            0.285  0.169  0.212    461
oral_simple_conjunction                        0.179  0.102  0.130     98
oral_specific_place                            0.556  0.705  0.622    424
oral_temporal_anchor                           0.410  0.559  0.473    546
oral_tricolon                                  0.299  0.119  0.171    553
oral_vocative                                  0.652  0.747  0.696    158
========================================================================
Macro avg (types w/ support)                                 0.378
```

</details>
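The detection metric used in the table, with B or I collapsed to a single positive class, can be sketched per type as follows (an illustrative helper, not the actual evaluation script):

```python
def binary_prf(gold_tags, pred_tags):
    """Per-type detection precision/recall/F1 over aligned per-token
    O/B/I tag sequences, counting B or I (1 or 2) as positive, O (0)
    as negative."""
    tp = sum(1 for g, p in zip(gold_tags, pred_tags) if g > 0 and p > 0)
    fp = sum(1 for g, p in zip(gold_tags, pred_tags) if g == 0 and p > 0)
    fn = sum(1 for g, p in zip(gold_tags, pred_tags) if g > 0 and p == 0)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```

Note that this token-level detection metric does not penalize B/I confusions within a span, so it is more forgiving than exact span matching.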

**Missing labels (test set):** 0/53 β€” all types detected at least once.

Notable patterns:
- **Strong performers** (F1 β‰₯ 0.5): vocative (0.696), technical_abbreviation (0.688), phatic_check (0.667), imperative (0.625), specific_place (0.622), citation (0.595), agentless_passive (0.591), rhetorical_question (0.586), inclusive_we (0.568), second_person (0.564), named_individual (0.556), nominalization (0.549), probability (0.500)
- **Weak performers** (F1 < 0.2): anaphora (0.054), qualified_assertion (0.118), simple_conjunction (0.130), nested_clauses (0.133), concrete_setting (0.165), tricolon (0.171), categorical_statement (0.189), self_correction (0.189), parallelism (0.191)
- **Precision-recall tradeoff**: Most types show balanced precision/recall. Notable exceptions include `anaphora` (0.800 precision / 0.028 recall), `parallelism` (0.741 / 0.110), and `intensifier_doubling` (0.680 / 0.200), which remain high-precision but very low-recall.

## Architecture

Custom `HavelockTokenClassifier` with independent B/I/O heads per marker type:
```
ModernBERT (answerdotai/ModernBERT-base)
    └── Dropout (p=0.1)
        └── Linear (hidden_size β†’ num_types Γ— 3)
            └── Reshape to (batch, seq, num_types, 3)
```

Each marker type gets an independent 3-way O/B/I classification, so a token can simultaneously carry labels for multiple overlapping marker types. Types share the full backbone representation but make independent predictions.
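The head described above can be sketched in PyTorch as follows (a minimal sketch, assuming a standard linear projection plus reshape; `MultiLabelBIOHead` is an illustrative name, not the released module):

```python
import torch
import torch.nn as nn

class MultiLabelBIOHead(nn.Module):
    """Projects backbone hidden states to independent O/B/I logits per
    marker type, reshaped to (batch, seq, num_types, 3)."""
    def __init__(self, hidden_size: int, num_types: int, dropout: float = 0.1):
        super().__init__()
        self.num_types = num_types
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_types * 3)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        x = self.dropout(hidden_states)         # (batch, seq, hidden)
        logits = self.classifier(x)             # (batch, seq, num_types * 3)
        return logits.view(*logits.shape[:-1], self.num_types, 3)
```

A single shared linear layer keeps the head cheap: the types interact only through the backbone representation, never through each other's logits.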

### Regularization

- **Mixout** (p=0.1): During training, each backbone weight element has a 10% chance of being replaced by its pretrained value per forward pass, acting as a stochastic L2 anchor that prevents representation drift (Lee et al., 2020)
- **Per-type focal loss** (Ξ³=2.0): Focuses learning on hard examples, reducing the contribution of easy negatives
- **Inverse-frequency type weights**: Rare marker types receive higher loss weighting
- **Inverse-frequency OBI weights**: B and I classes upweighted relative to dominant O class
- **Weighted random sampling**: Examples containing rarer markers sampled more frequently
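The focal-loss component above can be sketched as follows (a generic focal-loss formulation; how the O/B/I and type weights are combined here is an assumption, not the released training code):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, labels, gamma=2.0, class_weights=None):
    """Focal loss over O/B/I logits of shape (..., 3) and integer labels.
    Down-weights well-classified tokens by the factor (1 - p_t)^gamma."""
    log_probs = F.log_softmax(logits, dim=-1)                    # (..., 3)
    log_pt = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    pt = log_pt.exp()                                            # p of true class
    loss = -((1 - pt) ** gamma) * log_pt                         # focal term
    if class_weights is not None:                                # e.g. O/B/I weights
        loss = loss * class_weights[labels]
    return loss.mean()
```

With Ξ³=2.0, a token the model already classifies with p_t β‰ˆ 0.9 contributes roughly 1% of the loss it would under plain cross-entropy, which is what keeps the dominant O class from swamping training.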

### Initialization

Fine-tuned from `answerdotai/ModernBERT-base`. Backbone linear layers wrapped with Mixout during training (frozen pretrained copy used as anchor). The classification head is randomly initialized:
```
backbone.* layers  β†’ loaded from pretrained, anchored via Mixout
classifier.weight  β†’ randomly initialized
classifier.bias    β†’ randomly initialized
```

## Limitations

- **Near-zero recall types**: `anaphora` (0.028 recall), `simple_conjunction` (0.102), `parallelism` (0.110), and `tricolon` (0.119) are rarely detected despite being present in training data
- **Low-precision types**: `nested_clauses` (0.091), `metadiscourse` (0.140), and `qualified_assertion` (0.143) have precision below 0.15, meaning most predictions for those types are false positives
- **Context window**: 128 tokens max; longer spans may be truncated
- **Domain**: Trained primarily on historical/literary texts; may underperform on modern social media
- **Subjectivity**: Some marker boundaries are inherently ambiguous

## Citation
```bibtex
@misc{havelock2026token,
  title={Havelock Orality Token Classifier},
  author={Havelock AI},
  year={2026},
  url={https://huggingface.co/HavelockAI/bert-token-classifier}
}
```

## References

- Ong, Walter J. *Orality and Literacy: The Technologizing of the Word*. Routledge, 1982.
- Lee, C. et al. "Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models." ICLR 2020.
- Warner, A. et al. "Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference." 2024.

---

*Trained: February 2026*