---
license: mit
tags:
- token-classification
- bert
- orality
- linguistics
- ner
language:
- en
metrics:
- f1
base_model:
- google-bert/bert-base-uncased
pipeline_tag: token-classification
library_name: transformers
datasets:
- custom
---

# Havelock Orality Token Classifier

A BERT-based token classifier for detecting **oral and literate markers** in text, based on Walter Ong's *Orality and Literacy* (1982).

The model performs span-level detection of 72 rhetorical marker types using BIO tagging (145 labels in total: a B- and an I- tag for each marker type, plus O).

## Model Details

| Property | Value |
|----------|-------|
| Base model | `bert-base-uncased` |
| Task | Token classification (BIO tagging) |
| Labels | 145 (72 marker types × B/I + O) |
| Best F1 | **0.459** (macro, markers only) |
| Training | 15 epochs, batch size 8, learning rate 2e-5 |
| Loss | Focal loss (γ=1.0) for class imbalance |

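The 145-label space follows the standard BIO scheme. As a sketch (the `build_label_space` helper and the three marker names are illustrative; the model's actual label order is defined by its config's `id2label`):

```python
def build_label_space(marker_types):
    """BIO label space: "O" plus one B- and one I- tag per marker type."""
    labels = ["O"]
    for t in marker_types:
        labels += [f"B-{t}", f"I-{t}"]
    return {i: label for i, label in enumerate(labels)}

# Three sample types -> 7 labels; the model's full 72 types -> 72 * 2 + 1 = 145.
id2label = build_label_space(["oral_vocative", "oral_imperative", "literate_conditional"])
print(len(id2label))  # 7
```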
## Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "HavelockAI/bert-token-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "Tell me, O Muse, of that ingenious hero who travelled far and wide"
inputs = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
offset_mapping = inputs.pop("offset_mapping")

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)

# Decode predictions
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions[0]]

for token, label in zip(tokens, labels):
    if label != "O":
        print(f"{token:15} {label}")
```

**Output:**
```
tell            B-oral_imperative
me              I-oral_imperative
,               I-oral_imperative
o               B-oral_vocative
muse            I-oral_vocative
```

|
| 74 |
+
|
| 75 |
+
- **3,119 examples** with BIO-tagged spans
|
| 76 |
+
- **4,474 marker annotations** across 72 types
|
| 77 |
+
- Sources: Project Gutenberg, textfiles.com, Reddit, Wikipedia talk pages
|
| 78 |
+
- Synthetic examples for rare marker types (30 examples minimum per type)
|
| 79 |
+
|
| 80 |
+
### Class Distribution
|
| 81 |
+
|
| 82 |
+
The dataset exhibits extreme class imbalance (72 marker types, long-tail distribution). We use focal loss to down-weight easy examples and focus learning on rare markers.
|
| 83 |
+
|
| 84 |
+
| Frequency | Marker types |
|
| 85 |
+
|-----------|--------------|
|
| 86 |
+
| >100 examples | 15 types (21%) |
|
| 87 |
+
| 30-100 examples | 37 types (51%) |
|
| 88 |
+
| <30 examples | 20 types (28%) |
|
| 89 |
+
|
| 90 |
+
## Marker Types (72)
|
| 91 |
+
|
| 92 |
+
### Oral Markers (36 types)
|
| 93 |
+
|
| 94 |
+
Characteristics of oral tradition and spoken discourse:
|
| 95 |
+
|
| 96 |
+
| Category | Markers |
|
| 97 |
+
|----------|---------|
|
| 98 |
+
| **Repetition & Pattern** | anaphora, epistrophe, parallelism, tricolon, lexical_repetition, refrain |
|
| 99 |
+
| **Sound & Rhythm** | alliteration, rhythm, assonance, rhyme |
|
| 100 |
+
| **Address & Interaction** | vocative, imperative, second_person, inclusive_we, rhetorical_question, audience_response, phatic_check, phatic_filler |
|
| 101 |
+
| **Conjunction** | polysyndeton, asyndeton, simple_conjunction, binomial_expression |
|
| 102 |
+
| **Formulas** | discourse_formula, proverb, religious_formula, epithet |
|
| 103 |
+
| **Narrative** | named_individual, specific_place, temporal_anchor, sensory_detail, embodied_action, everyday_example |
|
| 104 |
+
| **Performance** | dramatic_pause, self_correction, conflict_frame, us_them, first_person, paradox |
|
| 105 |
+
|
| 106 |
+
### Literate Markers (36 types)
|
| 107 |
+
|
| 108 |
+
Characteristics of written, analytical discourse:
|
| 109 |
+
|
| 110 |
+
| Category | Markers |
|
| 111 |
+
|----------|---------|
|
| 112 |
+
| **Abstraction** | nominalization, abstract_noun, conceptual_metaphor, categorical_statement |
|
| 113 |
+
| **Syntax** | nested_clauses, relative_chain, conditional, concessive, temporal_embedding, causal_chain |
|
| 114 |
+
| **Hedging** | epistemic_hedge, probability, evidential, qualified_assertion, concessive_connector |
|
| 115 |
+
| **Impersonality** | agentless_passive, agent_demoted, institutional_subject, objectifying_stance, third_person_reference |
|
| 116 |
+
| **Scholarly apparatus** | citation, footnote_reference, cross_reference, metadiscourse, methodological_framing |
|
| 117 |
+
| **Technical** | technical_term, technical_abbreviation, enumeration, list_structure, definitional_move |
|
| 118 |
+
| **Connectives** | contrastive, causal_explicit, additive_formal, paradox |
|
| 119 |
+
|
| 120 |
+
## Evaluation

Per-class F1 on the test set (selected markers):

| Marker | Precision | Recall | F1 | Support |
|--------|-----------|--------|-----|---------|
| oral_vocative | 0.889 | 0.593 | 0.711 | 27 |
| oral_inclusive_we | 0.500 | 0.586 | 0.540 | 29 |
| oral_second_person | 0.556 | 0.600 | 0.577 | 25 |
| literate_conditional | 0.769 | 0.714 | 0.741 | 14 |
| oral_self_correction | 1.000 | 1.000 | 1.000 | 3 |
| oral_audience_response | 1.000 | 1.000 | 1.000 | 4 |
| literate_citation | 0.000 | 0.000 | 0.000 | 10 |

- **Macro F1 (all 145 labels):** 0.487
- **Weighted F1:** 0.645
- **Accuracy:** 66.5%

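The gap between macro and weighted F1 reflects the class imbalance: macro averaging counts every label equally, while weighted averaging lets frequent labels dominate. A sketch of the difference using two classes from the table above (the helpers and the two-class reduction are illustrative, not the model's actual aggregate computation):

```python
def macro_f1(f1_by_class):
    """Unweighted mean of per-class F1: every class counts equally."""
    return sum(f1_by_class.values()) / len(f1_by_class)

def weighted_f1(f1_by_class, support):
    """Support-weighted mean of per-class F1: frequent classes dominate."""
    total = sum(support.values())
    return sum(f1 * support[c] for c, f1 in f1_by_class.items()) / total

# A mid-frequency class with decent F1 vs. a class the model never gets right
f1 = {"oral_vocative": 0.711, "literate_citation": 0.0}
support = {"oral_vocative": 27, "literate_citation": 10}
print(macro_f1(f1))              # 0.3555 -- the failing class drags the mean down
print(weighted_f1(f1, support))  # ~0.519 -- the frequent class dominates
```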
## Architecture

Custom `BertTokenClassifier` with focal loss:

```
BertModel (bert-base-uncased)
├── Dropout (p=0.1)
├── Linear (768 → 145)
└── FocalLoss (α=1.0, γ=1.0)
```

Focal loss addresses class imbalance by down-weighting well-classified tokens (mostly "O") and focusing training on hard examples (rare markers).

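The down-weighting effect is easiest to see on a single token. A scalar sketch of the focal loss formula from Lin et al. (the model applies the same idea per token across its 145 classes; the `focal_loss` helper here is illustrative):

```python
import math

def focal_loss(p, alpha=1.0, gamma=1.0):
    """Focal loss for one token, given probability p assigned to its true class.

    FL(p) = -alpha * (1 - p)**gamma * log(p); gamma=0 recovers cross-entropy.
    """
    return -alpha * (1.0 - p) ** gamma * math.log(p)

# A confidently classified "O" token (p=0.99) is scaled by (1 - 0.99)**1 = 0.01,
# while a hard rare-marker token (p=0.30) keeps 70% of its cross-entropy weight.
easy_ratio = focal_loss(0.99) / focal_loss(0.99, gamma=0.0)
hard_ratio = focal_loss(0.30) / focal_loss(0.30, gamma=0.0)
print(round(easy_ratio, 2), round(hard_ratio, 2))  # 0.01 0.7
```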
### Initialization

Fine-tuned from `bert-base-uncased`. The classification head (`classifier.weight`, `classifier.bias`) is randomly initialized before fine-tuning:

```
bert.* layers     → loaded from checkpoint
classifier.weight → randomly initialized
classifier.bias   → randomly initialized
```

## Limitations

- **Rare markers**: Types with <10 training examples (e.g., `oral_paradox`, `oral_dramatic_pause`) have poor recall
- **Context window**: 128 tokens max; longer spans may be truncated
- **Domain**: Trained primarily on historical/literary texts; may underperform on modern social media
- **Subjectivity**: Some marker boundaries are inherently ambiguous

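One common workaround for the 128-token limit is to classify overlapping windows and merge the results. A minimal pure-Python sketch of the windowing step (the `sliding_windows` helper, window size, and stride are illustrative, not something the model ships with):

```python
def sliding_windows(tokens, size=128, stride=64):
    """Split a token list into overlapping windows of `size`, advancing by `stride`.

    Overlap means a span truncated at one window boundary appears whole
    in the next window; downstream merging must deduplicate shared spans.
    """
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
        start += stride
    return windows

tokens = list(range(300))
print([len(w) for w in sliding_windows(tokens)])  # [128, 128, 128, 108]
```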
## Citation

```bibtex
@misc{havelock2026token,
  title={Havelock Orality Token Classifier},
  author={Havelock AI},
  year={2026},
  url={https://huggingface.co/HavelockAI/bert-token-classifier}
}
```

## References

- Ong, Walter J. *Orality and Literacy: The Technologizing of the Word*. Routledge, 1982.
- Lin, T.-Y. et al. "Focal Loss for Dense Object Detection." ICCV 2017.

---

*Model version: 668564aa • Trained: February 2026*