---
license: mit
tags:
- token-classification
- bert
- orality
- linguistics
- ner
language:
- en
metrics:
- f1
base_model:
- google-bert/bert-base-uncased
pipeline_tag: token-classification
library_name: transformers
datasets:
- custom
---

# Havelock Orality Token Classifier

A BERT-based token classifier for detecting **oral and literate markers** in text, based on Walter Ong's *Orality and Literacy* (1982).

The model performs span-level detection of 72 rhetorical marker types using BIO tagging (145 labels in total: a B- and an I- tag for each marker type, plus O).

## Model Details

| Property | Value |
|----------|-------|
| Base model | `bert-base-uncased` |
| Task | Token classification (BIO tagging) |
| Labels | 145 (72 marker types × B/I + O) |
| Best F1 | **0.459** (macro, markers only) |
| Training | 15 epochs, batch size 8, learning rate 2e-5 |
| Loss | Focal loss (γ=1.0) for class imbalance |

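The 145-label space follows the standard BIO scheme. As a sketch (the `build_label_space` helper and the three marker names are illustrative; the model's actual label order is defined by its config's `id2label`):

```python
def build_label_space(marker_types):
    """BIO label space: "O" plus one B- and one I- tag per marker type."""
    labels = ["O"]
    for t in marker_types:
        labels += [f"B-{t}", f"I-{t}"]
    return {i: label for i, label in enumerate(labels)}

# Three sample types -> 7 labels; the model's full 72 types -> 72 * 2 + 1 = 145.
id2label = build_label_space(["oral_vocative", "oral_imperative", "literate_conditional"])
print(len(id2label))  # 7
```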
## Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "HavelockAI/bert-token-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "Tell me, O Muse, of that ingenious hero who travelled far and wide"
inputs = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
offset_mapping = inputs.pop("offset_mapping")

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)

# Decode predictions
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions[0]]

for token, label in zip(tokens, labels):
    if label != "O":
        print(f"{token:15} {label}")
```

**Output:**
```
tell            B-oral_imperative
me              I-oral_imperative
,               I-oral_imperative
o               B-oral_vocative
muse            I-oral_vocative
```

|
| 74 |
+
|
| 75 |
+
- **3,119 examples** with BIO-tagged spans
|
| 76 |
+
- **4,474 marker annotations** across 72 types
|
| 77 |
+
- Sources: Project Gutenberg, textfiles.com, Reddit, Wikipedia talk pages
|
| 78 |
+
- Synthetic examples for rare marker types (30 examples minimum per type)
|
| 79 |
+
|
| 80 |
+
### Class Distribution
|
| 81 |
+
|
| 82 |
+
The dataset exhibits extreme class imbalance (72 marker types, long-tail distribution). We use focal loss to down-weight easy examples and focus learning on rare markers.
|
| 83 |
+
|
| 84 |
+
| Frequency | Marker types |
|
| 85 |
+
|-----------|--------------|
|
| 86 |
+
| >100 examples | 15 types (21%) |
|
| 87 |
+
| 30-100 examples | 37 types (51%) |
|
| 88 |
+
| <30 examples | 20 types (28%) |
|
| 89 |
+
|
| 90 |
+
## Marker Types (72)
|
| 91 |
+
|
| 92 |
+
### Oral Markers (36 types)
|
| 93 |
+
|
| 94 |
+
Characteristics of oral tradition and spoken discourse:
|
| 95 |
+
|
| 96 |
+
| Category | Markers |
|
| 97 |
+
|----------|---------|
|
| 98 |
+
| **Repetition & Pattern** | anaphora, epistrophe, parallelism, tricolon, lexical_repetition, refrain |
|
| 99 |
+
| **Sound & Rhythm** | alliteration, rhythm, assonance, rhyme |
|
| 100 |
+
| **Address & Interaction** | vocative, imperative, second_person, inclusive_we, rhetorical_question, audience_response, phatic_check, phatic_filler |
|
| 101 |
+
| **Conjunction** | polysyndeton, asyndeton, simple_conjunction, binomial_expression |
|
| 102 |
+
| **Formulas** | discourse_formula, proverb, religious_formula, epithet |
|
| 103 |
+
| **Narrative** | named_individual, specific_place, temporal_anchor, sensory_detail, embodied_action, everyday_example |
|
| 104 |
+
| **Performance** | dramatic_pause, self_correction, conflict_frame, us_them, first_person, paradox |
|
| 105 |
+
|
| 106 |
+
### Literate Markers (36 types)
|
| 107 |
+
|
| 108 |
+
Characteristics of written, analytical discourse:
|
| 109 |
+
|
| 110 |
+
| Category | Markers |
|
| 111 |
+
|----------|---------|
|
| 112 |
+
| **Abstraction** | nominalization, abstract_noun, conceptual_metaphor, categorical_statement |
|
| 113 |
+
| **Syntax** | nested_clauses, relative_chain, conditional, concessive, temporal_embedding, causal_chain |
|
| 114 |
+
| **Hedging** | epistemic_hedge, probability, evidential, qualified_assertion, concessive_connector |
|
| 115 |
+
| **Impersonality** | agentless_passive, agent_demoted, institutional_subject, objectifying_stance, third_person_reference |
|
| 116 |
+
| **Scholarly apparatus** | citation, footnote_reference, cross_reference, metadiscourse, methodological_framing |
|
| 117 |
+
| **Technical** | technical_term, technical_abbreviation, enumeration, list_structure, definitional_move |
|
| 118 |
+
| **Connectives** | contrastive, causal_explicit, additive_formal, paradox |
|
| 119 |
+
|
| 120 |
+
## Evaluation

Per-class F1 on the test set (selected markers):

| Marker | Precision | Recall | F1 | Support |
|--------|-----------|--------|-----|---------|
| oral_vocative | 0.889 | 0.593 | 0.711 | 27 |
| oral_inclusive_we | 0.500 | 0.586 | 0.540 | 29 |
| oral_second_person | 0.556 | 0.600 | 0.577 | 25 |
| literate_conditional | 0.769 | 0.714 | 0.741 | 14 |
| oral_self_correction | 1.000 | 1.000 | 1.000 | 3 |
| oral_audience_response | 1.000 | 1.000 | 1.000 | 4 |
| literate_citation | 0.000 | 0.000 | 0.000 | 10 |

- **Macro F1 (all 145 labels):** 0.487
- **Weighted F1:** 0.645
- **Accuracy:** 66.5%

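The gap between macro and weighted F1 reflects the class imbalance: macro averaging counts every label equally, while weighted averaging lets frequent labels dominate. A sketch of the difference using two classes from the table above (the helpers and the two-class reduction are illustrative, not the model's actual aggregate computation):

```python
def macro_f1(f1_by_class):
    """Unweighted mean of per-class F1: every class counts equally."""
    return sum(f1_by_class.values()) / len(f1_by_class)

def weighted_f1(f1_by_class, support):
    """Support-weighted mean of per-class F1: frequent classes dominate."""
    total = sum(support.values())
    return sum(f1 * support[c] for c, f1 in f1_by_class.items()) / total

# A mid-frequency class with decent F1 vs. a class the model never gets right
f1 = {"oral_vocative": 0.711, "literate_citation": 0.0}
support = {"oral_vocative": 27, "literate_citation": 10}
print(macro_f1(f1))              # 0.3555 -- the failing class drags the mean down
print(weighted_f1(f1, support))  # ~0.519 -- the frequent class dominates
```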
## Architecture

Custom `BertTokenClassifier` with focal loss:

```
BertModel (bert-base-uncased)
├── Dropout (p=0.1)
├── Linear (768 → 145)
└── FocalLoss (α=1.0, γ=1.0)
```

Focal loss addresses class imbalance by down-weighting well-classified tokens (mostly "O") and focusing training on hard examples (rare markers).

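The down-weighting effect is easiest to see on a single token. A scalar sketch of the focal loss formula from Lin et al. (the model applies the same idea per token across its 145 classes; the `focal_loss` helper here is illustrative):

```python
import math

def focal_loss(p, alpha=1.0, gamma=1.0):
    """Focal loss for one token, given probability p assigned to its true class.

    FL(p) = -alpha * (1 - p)**gamma * log(p); gamma=0 recovers cross-entropy.
    """
    return -alpha * (1.0 - p) ** gamma * math.log(p)

# A confidently classified "O" token (p=0.99) is scaled by (1 - 0.99)**1 = 0.01,
# while a hard rare-marker token (p=0.30) keeps 70% of its cross-entropy weight.
easy_ratio = focal_loss(0.99) / focal_loss(0.99, gamma=0.0)
hard_ratio = focal_loss(0.30) / focal_loss(0.30, gamma=0.0)
print(round(easy_ratio, 2), round(hard_ratio, 2))  # 0.01 0.7
```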
### Initialization

Fine-tuned from `bert-base-uncased`. The classification head (`classifier.weight`, `classifier.bias`) is randomly initialized before fine-tuning:

```
bert.* layers     → loaded from checkpoint
classifier.weight → randomly initialized
classifier.bias   → randomly initialized
```

## Limitations

- **Rare markers**: Types with <10 training examples (e.g., `oral_paradox`, `oral_dramatic_pause`) have poor recall
- **Context window**: 128 tokens max; longer spans may be truncated
- **Domain**: Trained primarily on historical/literary texts; may underperform on modern social media
- **Subjectivity**: Some marker boundaries are inherently ambiguous

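One common workaround for the 128-token limit is to classify overlapping windows and merge the results. A minimal pure-Python sketch of the windowing step (the `sliding_windows` helper, window size, and stride are illustrative, not something the model ships with):

```python
def sliding_windows(tokens, size=128, stride=64):
    """Split a token list into overlapping windows of `size`, advancing by `stride`.

    Overlap means a span truncated at one window boundary appears whole
    in the next window; downstream merging must deduplicate shared spans.
    """
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
        start += stride
    return windows

tokens = list(range(300))
print([len(w) for w in sliding_windows(tokens)])  # [128, 128, 128, 108]
```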
## Citation

```bibtex
@misc{havelock2026token,
  title={Havelock Orality Token Classifier},
  author={Havelock AI},
  year={2026},
  url={https://huggingface.co/HavelockAI/bert-token-classifier}
}
```

## References

- Ong, Walter J. *Orality and Literacy: The Technologizing of the Word*. Routledge, 1982.
- Lin, T.-Y. et al. "Focal Loss for Dense Object Detection." ICCV 2017.

---

*Model version: 668564aa • Trained: February 2026*