permutans committed on
Commit
86275ec
·
verified ·
1 Parent(s): 1374d89

Upload folder using huggingface_hub

Files changed (3)
  1. README.md +183 -0
  2. config.json +1 -1
  3. model.safetensors +1 -1
README.md ADDED
@@ -0,0 +1,183 @@
+ ---
+ license: mit
+ tags:
+ - token-classification
+ - bert
+ - orality
+ - linguistics
+ - ner
+ language:
+ - en
+ metrics:
+ - f1
+ base_model:
+ - google-bert/bert-base-uncased
+ pipeline_tag: token-classification
+ library_name: transformers
+ datasets:
+ - custom
+ ---
+
+ # Havelock Orality Token Classifier
+
+ BERT-based token classifier for detecting **oral and literate markers** in text, based on Walter Ong's *Orality and Literacy* (1982).
+
+ This model performs span-level detection of 72 rhetorical marker types using BIO tagging (145 labels total).
+
+ ## Model Details
+
+ | Property | Value |
+ |----------|-------|
+ | Base model | `bert-base-uncased` |
+ | Task | Token classification (BIO tagging) |
+ | Labels | 145 (72 marker types × B/I + O) |
+ | Best F1 | **0.459** (macro, markers only) |
+ | Training | 15 epochs, batch size 8, learning rate 2e-5 |
+ | Loss | Focal loss (γ=1.0) for class imbalance |
+
+ ## Usage
+ ```python
+ from transformers import AutoTokenizer, AutoModelForTokenClassification
+ import torch
+
+ model_name = "HavelockAI/bert-token-classifier"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForTokenClassification.from_pretrained(model_name)
+
+ text = "Tell me, O Muse, of that ingenious hero who travelled far and wide"
+ inputs = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
+ offset_mapping = inputs.pop("offset_mapping")
+
+ with torch.no_grad():
+     outputs = model(**inputs)
+ predictions = torch.argmax(outputs.logits, dim=-1)
+
+ # Decode predictions
+ tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
+ labels = [model.config.id2label[p.item()] for p in predictions[0]]
+
+ for token, label in zip(tokens, labels):
+     if label != "O":
+         print(f"{token:15} {label}")
+ ```
+
+ **Output:**
+ ```
+ tell            B-oral_imperative
+ me              I-oral_imperative
+ ,               I-oral_imperative
+ o               B-oral_vocative
+ muse            I-oral_vocative
+ ```
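The snippet above pops `offset_mapping` without using it; in practice it is what maps BIO token labels back to character-level spans in the original text. A minimal sketch of that merging step (the `bio_to_spans` helper is hypothetical, not part of this repo, and the labels/offsets below are hand-written to mirror the example output):

```python
def bio_to_spans(labels, offsets, text):
    """Merge per-token BIO labels into (marker, character-span) pairs."""
    spans = []
    marker, start, end = None, None, None
    for label, (s, e) in zip(labels, offsets):
        if s == e:  # special tokens like [CLS]/[SEP] get empty offsets
            continue
        if label.startswith("B-"):
            if marker is not None:
                spans.append((marker, text[start:end]))
            marker, start, end = label[2:], s, e
        elif label.startswith("I-") and marker == label[2:]:
            end = e  # extend the current span
        else:
            if marker is not None:
                spans.append((marker, text[start:end]))
            marker = None
    if marker is not None:
        spans.append((marker, text[start:end]))
    return spans

# Hand-written inputs mirroring the predictions shown above
text = "Tell me, O Muse, of that ingenious hero"
labels = ["O", "B-oral_imperative", "I-oral_imperative", "I-oral_imperative",
          "B-oral_vocative", "I-oral_vocative", "O", "O", "O", "O", "O"]
offsets = [(0, 0), (0, 4), (5, 7), (7, 8), (9, 10), (11, 15), (15, 16),
           (17, 19), (20, 24), (25, 34), (35, 39)]
print(bio_to_spans(labels, offsets, text))
# → [('oral_imperative', 'Tell me,'), ('oral_vocative', 'O Muse')]
```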
+
+ ## Training Data
+
+ - **3,119 examples** with BIO-tagged spans
+ - **4,474 marker annotations** across 72 types
+ - Sources: Project Gutenberg, textfiles.com, Reddit, Wikipedia talk pages
+ - Synthetic examples for rare marker types (minimum 30 examples per type)
+
+ ### Class Distribution
+
+ The dataset exhibits extreme class imbalance (72 marker types, long-tail distribution). We use focal loss to down-weight easy examples and focus learning on rare markers.
+
+ | Frequency | Marker types |
+ |-----------|--------------|
+ | >100 examples | 15 types (21%) |
+ | 30-100 examples | 37 types (51%) |
+ | <30 examples | 20 types (28%) |
+
+ ## Marker Types (72)
+
+ ### Oral Markers (36 types)
+
+ Characteristics of oral tradition and spoken discourse:
+
+ | Category | Markers |
+ |----------|---------|
+ | **Repetition & Pattern** | anaphora, epistrophe, parallelism, tricolon, lexical_repetition, refrain |
+ | **Sound & Rhythm** | alliteration, rhythm, assonance, rhyme |
+ | **Address & Interaction** | vocative, imperative, second_person, inclusive_we, rhetorical_question, audience_response, phatic_check, phatic_filler |
+ | **Conjunction** | polysyndeton, asyndeton, simple_conjunction, binomial_expression |
+ | **Formulas** | discourse_formula, proverb, religious_formula, epithet |
+ | **Narrative** | named_individual, specific_place, temporal_anchor, sensory_detail, embodied_action, everyday_example |
+ | **Performance** | dramatic_pause, self_correction, conflict_frame, us_them, first_person, paradox |
+
+ ### Literate Markers (36 types)
+
+ Characteristics of written, analytical discourse:
+
+ | Category | Markers |
+ |----------|---------|
+ | **Abstraction** | nominalization, abstract_noun, conceptual_metaphor, categorical_statement |
+ | **Syntax** | nested_clauses, relative_chain, conditional, concessive, temporal_embedding, causal_chain |
+ | **Hedging** | epistemic_hedge, probability, evidential, qualified_assertion, concessive_connector |
+ | **Impersonality** | agentless_passive, agent_demoted, institutional_subject, objectifying_stance, third_person_reference |
+ | **Scholarly apparatus** | citation, footnote_reference, cross_reference, metadiscourse, methodological_framing |
+ | **Technical** | technical_term, technical_abbreviation, enumeration, list_structure, definitional_move |
+ | **Connectives** | contrastive, causal_explicit, additive_formal, paradox |
+
+ ## Evaluation
+
+ Per-class F1 on test set (selected markers):
+
+ | Marker | Precision | Recall | F1 | Support |
+ |--------|-----------|--------|-----|---------|
+ | oral_vocative | 0.889 | 0.593 | 0.711 | 27 |
+ | oral_inclusive_we | 0.500 | 0.586 | 0.540 | 29 |
+ | oral_second_person | 0.556 | 0.600 | 0.577 | 25 |
+ | literate_conditional | 0.769 | 0.714 | 0.741 | 14 |
+ | oral_self_correction | 1.000 | 1.000 | 1.000 | 3 |
+ | oral_audience_response | 1.000 | 1.000 | 1.000 | 4 |
+ | literate_citation | 0.000 | 0.000 | 0.000 | 10 |
+
+ - **Macro F1 (all 145 labels):** 0.487
+ - **Weighted F1:** 0.645
+ - **Accuracy:** 66.5%
+
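For reference, each F1 value above is the harmonic mean of its row's precision and recall; a quick check against the `oral_vocative` row:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.889, 0.593), 3))  # oral_vocative row → 0.711
```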
+ ## Architecture
+
+ Custom `BertTokenClassifier` with focal loss:
+ ```
+ BertModel (bert-base-uncased)
+ └── Dropout (p=0.1)
+ └── Linear (768 → 145)
+ └── FocalLoss (α=1.0, γ=1.0)
+ ```
+
+ Focal loss addresses class imbalance by down-weighting well-classified tokens (mostly "O") and focusing on hard examples (rare markers).
+
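A minimal sketch of that weighting for a single token, in plain Python (the repo's actual `FocalLoss` module is not shown here; α and γ follow the config above):

```python
import math

def focal_loss(probs, target, alpha=1.0, gamma=1.0):
    """Focal loss for one token: -alpha * (1 - p_t)^gamma * log(p_t),
    where p_t is the predicted probability of the true class."""
    p_t = probs[target]
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

# A confident, correct "O" prediction contributes almost nothing...
easy = focal_loss({"O": 0.99}, "O")
# ...while an uncertain rare-marker prediction dominates the loss.
hard = focal_loss({"B-oral_vocative": 0.30}, "B-oral_vocative")
print(easy, hard)  # easy ≈ 1.0e-4, hard ≈ 0.84
```

With γ=0 this reduces to ordinary cross-entropy; raising γ pushes the loss mass further toward hard, rare-marker tokens.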
+ ### Initialization
+
+ Fine-tuned from `bert-base-uncased`. The classification head (`classifier.weight`, `classifier.bias`) is randomly initialized:
+ ```
+ bert.* layers → loaded from checkpoint
+ classifier.weight → randomly initialized
+ classifier.bias → randomly initialized
+ ```
+
+ ## Limitations
+
+ - **Rare markers**: Types with <10 training examples (e.g., `oral_paradox`, `oral_dramatic_pause`) have poor recall
+ - **Context window**: 128 tokens max; longer spans may be truncated
+ - **Domain**: Trained primarily on historical/literary texts; may underperform on modern social media
+ - **Subjectivity**: Some marker boundaries are inherently ambiguous
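For the context-window limitation above, longer texts can be split into overlapping windows before inference; a minimal sketch (window and stride values are illustrative, not from this repo):

```python
def sliding_windows(tokens, max_len=128, stride=64):
    """Yield (start, window) pairs of overlapping token windows."""
    if len(tokens) <= max_len:
        yield 0, tokens
        return
    for start in range(0, len(tokens) - max_len + stride, stride):
        yield start, tokens[start:start + max_len]

tokens = [f"tok{i}" for i in range(300)]
windows = list(sliding_windows(tokens))
print([(start, len(w)) for start, w in windows])
# → [(0, 128), (64, 128), (128, 128), (192, 108)]
```

Predictions on overlapping regions then need reconciling, e.g. preferring the window where a token sits farther from the boundary.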
+
+ ## Citation
+ ```bibtex
+ @misc{havelock2026token,
+   title={Havelock Orality Token Classifier},
+   author={Havelock AI},
+   year={2026},
+   url={https://huggingface.co/HavelockAI/bert-token-classifier}
+ }
+ ```
+
+ ## References
+
+ - Ong, Walter J. *Orality and Literacy: The Technologizing of the Word*. Routledge, 1982.
+ - Lin, T.-Y. et al. "Focal Loss for Dense Object Detection." ICCV 2017.
+
+ ---
+
+ *Model version: 668564aa • Trained: February 2026*
config.json CHANGED
@@ -10,7 +10,7 @@
  "dtype": "float32",
  "eos_token_id": null,
  "focal_alpha": 1.0,
- "focal_gamma": 2.0,
+ "focal_gamma": 1.0,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:8300660cc92a031d0037034b99c54dc6f5b66534b8b88c8d87a1ca82eff280ed
+ oid sha256:d310f9767c901ae616ffd9d2fa59addc5e10a450b3b25d44c12bdedaeab3fbeb
  size 436035932