permutans committed · Commit 479bf8f · verified · 1 Parent(s): 1ec5223

Upload folder using huggingface_hub

Files changed (2):
  1. README.md +269 -0
  2. model.safetensors +1 -1
README.md ADDED
@@ -0,0 +1,269 @@
---
license: mit
tags:
- text-classification
- bert
- orality
- linguistics
- rhetorical-analysis
language:
- en
metrics:
- f1
- accuracy
base_model:
- google-bert/bert-base-uncased
pipeline_tag: text-classification
library_name: transformers
datasets:
- custom
model-index:
- name: bert-marker-subtype
  results:
  - task:
      type: text-classification
      name: Marker Subtype Classification
    metrics:
    - type: f1
      value: 0.4704
      name: F1 (macro)
    - type: accuracy
      value: 0.515
      name: Accuracy
---

# Havelock Marker Subtype Classifier

BERT-based classifier for **71 fine-grained rhetorical marker subtypes** on the oral–literate spectrum, grounded in Walter Ong's *Orality and Literacy* (1982).

This is the finest level of the Havelock span classification hierarchy. Given a text span identified as a rhetorical marker, the model classifies it into one of 71 specific rhetorical devices (e.g., `anaphora`, `epistemic_hedge`, `vocative`, `nested_clauses`).

## Model Details

| Property | Value |
|----------|-------|
| Base model | `bert-base-uncased` |
| Architecture | `BertForSequenceClassification` |
| Task | Multi-class classification (71 classes) |
| Max sequence length | 128 tokens |
| Best F1 (macro) | **0.4704** |
| Best Accuracy | **0.515** |
| Parameters | ~109M |

## Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "HavelockAI/bert-marker-subtype"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

span = "it seems likely that this would, in principle, be feasible"
inputs = tokenizer(span, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    logits = model(**inputs).logits
pred = torch.argmax(logits, dim=1).item()

print(f"Marker subtype: {model.config.id2label[pred]}")
```
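Because many subtypes overlap semantically, the runner-up predictions are often as informative as the top-1 label. A pure-Python sketch of turning raw logits into a ranked probability list (the logit values below are hypothetical, for illustration only):

```python
import math

def ranked_probs(logits, id2label):
    """Softmax over raw logits, returned as (label, prob) pairs, best first."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sorted(zip(id2label, probs), key=lambda lp: lp[1], reverse=True)

# Hypothetical logits for three competing hedge-family subtypes
labels = ["epistemic_hedge", "qualified_assertion", "probability"]
for label, prob in ranked_probs([2.1, 1.3, 0.2], labels):
    print(f"{label}: {prob:.3f}")
```

With a real model, pass `model(**inputs).logits[0].tolist()` and `[model.config.id2label[i] for i in range(model.config.num_labels)]` instead of the toy values.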

## Label Taxonomy (71 subtypes)

### Oral Subtypes (38)

| Category | Subtypes |
|----------|----------|
| **Repetition & Pattern** | `anaphora`, `epistrophe`, `parallelism`, `tricolon`, `lexical_repetition`, `refrain` |
| **Sound & Rhythm** | `alliteration`, `assonance`, `rhyme`, `rhythm` |
| **Address & Interaction** | `vocative`, `imperative`, `second_person`, `inclusive_we`, `rhetorical_question`, `audience_response`, `phatic_check`, `phatic_filler` |
| **Conjunction** | `polysyndeton`, `asyndeton`, `simple_conjunction`, `binomial_expression` |
| **Formulas** | `discourse_formula`, `proverb`, `religious_formula`, `epithet` |
| **Narrative** | `named_individual`, `specific_place`, `temporal_anchor`, `sensory_detail`, `embodied_action`, `everyday_example` |
| **Performance** | `dramatic_pause`, `self_correction`, `conflict_frame`, `us_them`, `intensifier_doubling`, `antithesis` |

### Literate Subtypes (34)

| Category | Subtypes |
|----------|----------|
| **Abstraction** | `nominalization`, `abstract_noun`, `conceptual_metaphor`, `categorical_statement` |
| **Syntax** | `nested_clauses`, `relative_chain`, `conditional`, `concessive`, `temporal_embedding`, `causal_chain` |
| **Hedging** | `epistemic_hedge`, `probability`, `evidential`, `qualified_assertion`, `concessive_connector` |
| **Impersonality** | `agentless_passive`, `agent_demoted`, `institutional_subject`, `objectifying_stance`, `third_person_reference` |
| **Scholarly Apparatus** | `citation`, `footnote_reference`, `cross_reference`, `metadiscourse`, `methodological_framing` |
| **Technical** | `technical_term`, `technical_abbreviation`, `enumeration`, `list_structure`, `definitional_move` |
| **Connectives** | `contrastive`, `causal_explicit`, `additive_formal`, `aside` |

Note: the taxonomy lists 72 devices; `binomial_expression` does not appear in the trained 71-class label set (it is absent from the classification report below).

## Training

### Data

Span-level annotations from the Havelock corpus. Each span carries a `marker_subtype` field. Only subtypes with ≥15 examples in the full dataset are included. The corpus draws from Project Gutenberg, textfiles.com, Reddit, and Wikipedia talk pages.

A stratified 80/20 train/test split was used (random seed 42). The test set contains 4,608 spans.

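A stratified split like the one described can be sketched in pure Python: group spans by label, shuffle each group with the fixed seed, and hold out 20% of every class (a generic sketch, not the project's actual split code):

```python
import random
from collections import defaultdict

def stratified_split(examples, test_frac=0.2, seed=42):
    """80/20 split preserving per-class proportions."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    rng = random.Random(seed)
    train, test = [], []
    for label, items in by_label.items():
        rng.shuffle(items)
        n_test = max(1, round(len(items) * test_frac))  # every class reaches the test set
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test

# Toy data: 10 spans of one class, 5 of another
data = [("s%d" % i, "anaphora") for i in range(10)] + \
       [("t%d" % i, "vocative") for i in range(5)]
train, test = stratified_split(data)
print(len(train), len(test))  # 12 3
```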
### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Epochs | 3 |
| Batch size | 8 |
| Learning rate | 2e-5 |
| Optimizer | AdamW |
| LR schedule | Linear warmup (10% of total steps) |
| Gradient clipping | 1.0 |
| Loss | Cross-entropy |
| Min examples per class | 15 |

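The LR schedule above (linear warmup over the first 10% of steps, then linear decay to zero, as in the standard `transformers` linear schedule) can be sketched in a few lines; this is a generic sketch, not the exact training script:

```python
def linear_warmup_lr(step, total_steps, base_lr=2e-5, warmup_frac=0.1):
    """Linear warmup for the first warmup_frac of steps, then linear decay to zero."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return base_lr * step / warmup_steps  # ramp up from 0 to base_lr
    # decay linearly from base_lr at the end of warmup to 0 at total_steps
    remaining = total_steps - step
    return base_lr * remaining / (total_steps - warmup_steps)

total = 1000
print(linear_warmup_lr(0, total))     # 0.0 (start of warmup)
print(linear_warmup_lr(100, total))   # 2e-05 (peak, end of warmup)
print(linear_warmup_lr(1000, total))  # 0.0 (end of training)
```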
### Training Metrics

| Epoch | Loss | Accuracy | F1 (macro) |
|-------|------|----------|------------|
| 1 | 3.2554 | 0.4210 | 0.3060 |
| 2 | 2.0844 | 0.5033 | 0.4345 |
| 3 | 1.5922 | 0.5154 | 0.4704 |

The best checkpoint was selected by macro F1, at epoch 3; training loss was still declining steeply when training stopped.

### Test Set Classification Report

<details><summary>Click to expand per-class precision/recall/F1/support</summary>

```
                        precision   recall f1-score  support

           abstract_noun    0.262    0.333    0.294      144
         additive_formal    0.250    0.038    0.067       26
           agent_demoted    0.944    0.548    0.694       31
       agentless_passive    0.458    0.619    0.526      105
            alliteration    0.400    0.133    0.200       30
                anaphora    0.468    0.659    0.547       88
              antithesis    0.575    0.742    0.648       31
                   aside    0.467    0.127    0.200       55
               assonance    0.744    0.970    0.842       33
               asyndeton    0.867    0.433    0.578       30
       audience_response    0.800    0.533    0.640       30
   categorical_statement    0.362    0.388    0.374       98
            causal_chain    0.472    0.625    0.538       80
         causal_explicit    0.400    0.406    0.403       69
                citation    0.494    0.612    0.547       67
     conceptual_metaphor    0.235    0.055    0.089       73
              concessive    0.677    0.739    0.707       88
    concessive_connector    0.920    0.742    0.821       31
             conditional    0.627    0.671    0.648      155
          conflict_frame    0.800    0.774    0.787       31
             contrastive    0.390    0.595    0.471      116
         cross_reference    0.429    0.353    0.387       34
       definitional_move    0.429    0.077    0.130       39
       discourse_formula    0.499    0.703    0.583      276
          dramatic_pause    0.833    0.806    0.820       31
         embodied_action    0.286    0.377    0.325       69
             enumeration    0.504    0.694    0.584       85
         epistemic_hedge    0.429    0.624    0.508      101
              epistrophe    0.763    0.906    0.829       32
                 epithet    0.429    0.444    0.436       27
        everyday_example    0.432    0.390    0.410       41
              evidential    0.608    0.574    0.590       54
      footnote_reference    1.000    0.133    0.235       15
              imperative    0.617    0.760    0.681      146
            inclusive_we    0.579    0.700    0.634      120
   institutional_subject    0.586    0.548    0.567       31
    intensifier_doubling    0.792    0.633    0.704       30
      lexical_repetition    0.535    0.649    0.587       94
          list_structure    0.300    0.167    0.214       36
           metadiscourse    0.310    0.310    0.310       87
  methodological_framing    0.000    0.000    0.000       32
        named_individual    0.446    0.527    0.483       55
          nested_clauses    0.375    0.172    0.236       87
          nominalization    0.336    0.333    0.335      120
     objectifying_stance    0.250    0.023    0.043       43
             parallelism    0.250    0.052    0.086       58
            phatic_check    1.000    0.286    0.444       21
           phatic_filler    0.529    0.300    0.383       30
            polysyndeton    0.675    0.844    0.750       32
             probability    0.571    0.327    0.416       49
                 proverb    0.222    0.065    0.100       31
     qualified_assertion    0.286    0.100    0.148       60
                 refrain    0.895    0.567    0.694       30
          relative_chain    0.504    0.600    0.548      115
       religious_formula    0.917    0.688    0.786       32
     rhetorical_question    0.614    0.820    0.702      161
                   rhyme    0.545    0.562    0.554       32
                  rhythm    0.839    0.812    0.825       32
           second_person    0.557    0.600    0.578      235
         self_correction    0.895    0.567    0.694       30
          sensory_detail    0.000    0.000    0.000       37
      simple_conjunction    0.667    0.049    0.091       41
          specific_place    1.000    0.038    0.074       26
  technical_abbreviation    1.000    0.053    0.100       19
          technical_term    0.489    0.571    0.527      161
         temporal_anchor    0.471    0.490    0.480       49
      temporal_embedding    0.448    0.481    0.464       81
  third_person_reference    0.917    0.710    0.800       31
                tricolon    0.656    0.700    0.677       30
                 us_them    0.882    0.484    0.625       31
                vocative    0.593    0.603    0.598       58

                accuracy                      0.515     4608
               macro avg    0.561    0.465    0.470     4608
            weighted avg    0.512    0.515    0.490     4608
```

</details>

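The gap between the macro average (0.470) and the weighted average (0.490) in the report above comes from how per-class F1 scores are averaged: macro weights every class equally, while weighted scales by support. A toy sketch with illustrative numbers:

```python
def macro_f1(f1s):
    """Unweighted mean: every class counts equally, however rare."""
    return sum(f1s) / len(f1s)

def weighted_f1(f1s, supports):
    """Support-weighted mean: frequent classes dominate."""
    total = sum(supports)
    return sum(f * s for f, s in zip(f1s, supports)) / total

# Toy illustration: three classes of very different frequency
f1s = [0.80, 0.50, 0.10]
supports = [300, 100, 20]
print(round(macro_f1(f1s), 3))               # 0.467 — dragged down by the rare class
print(round(weighted_f1(f1s, supports), 3))  # 0.695 — dominated by the frequent class
```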
**Top performing subtypes (F1 > 0.75):** `assonance` (0.842), `epistrophe` (0.829), `rhythm` (0.825), `concessive_connector` (0.821), `dramatic_pause` (0.820), `third_person_reference` (0.800), `conflict_frame` (0.787), `religious_formula` (0.786), `polysyndeton` (0.750).

**Near-zero F1 subtypes (F1 ≤ 0.1):** `methodological_framing` (0.000), `sensory_detail` (0.000), `objectifying_stance` (0.043), `additive_formal` (0.067), `specific_place` (0.074), `parallelism` (0.086), `conceptual_metaphor` (0.089), `simple_conjunction` (0.091), `proverb` (0.100), `technical_abbreviation` (0.100). These tend to be either semantically diffuse classes or classes with very low support.

## Class Distribution

The test set exhibits significant imbalance across the 71 classes:

| Support Range | Classes | % of Total |
|---------------|---------|------------|
| >200 | 2 (`discourse_formula`, `second_person`) | 3% |
| 100–200 | 11 | 15% |
| 50–100 | 19 | 27% |
| 15–50 | 39 | 55% |

## Limitations

- **Severely undertrained**: 3 epochs with loss at 1.59 and still falling steeply. This model has the most headroom for improvement of the three span classifiers.
- **71-way classification on ~23k spans**: The data budget per class is thin, particularly for classes near the 15-example minimum. More data or class consolidation would help.
- **Semantic overlap**: Some subtypes are difficult to distinguish from surface text alone (e.g., `parallelism` vs. `anaphora` vs. `tricolon`; `epistemic_hedge` vs. `qualified_assertion` vs. `probability`). The model may benefit from hierarchical classification that conditions on type-level predictions.
- **Precision–recall tradeoff**: Many rare classes show high precision but very low recall (e.g., `footnote_reference`: P=1.000, R=0.133), suggesting the model learns narrow prototypes but misses variation.
- **Span-level only**: Requires pre-extracted spans; does not detect span boundaries.
- **128-token context window**: Longer spans are truncated.

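One way to realize the hierarchical-classification idea mentioned above is logit masking: use a type-level prediction to restrict which subtype logits compete in the argmax. A minimal sketch (the label grouping and logit values are illustrative, not the project's actual type→subtype mapping):

```python
def masked_argmax(logits, labels, allowed):
    """Argmax over subtype logits, restricted to subtypes of the predicted type."""
    candidates = [(score, label) for label, score in zip(labels, logits)
                  if label in allowed]
    return max(candidates)[1]

labels = ["anaphora", "epistemic_hedge", "probability", "vocative"]
logits = [1.9, 1.2, 0.4, 2.3]
# Unrestricted argmax would pick "vocative"; a type-level "hedging" prediction
# (illustrative grouping) masks the choice down to the hedge-family subtypes.
hedging = {"epistemic_hedge", "probability"}
print(masked_argmax(logits, labels, hedging))  # epistemic_hedge
```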
## Theoretical Background

The 71 subtypes represent the full granularity of the Havelock taxonomy, operationalizing Ong's oral–literate framework into specific, annotatable rhetorical devices. Oral subtypes capture the textural signatures of spoken and performative discourse: repetitive structures (`anaphora`, `epistrophe`, `tricolon`), sound patterning (`alliteration`, `assonance`, `rhythm`), direct audience engagement (`vocative`, `imperative`, `rhetorical_question`), and formulas (`proverb`, `epithet`, `discourse_formula`). Literate subtypes capture the apparatus of analytic prose: complex syntax (`nested_clauses`, `relative_chain`, `conditional`), epistemic positioning (`epistemic_hedge`, `evidential`, `probability`), impersonal voice (`agentless_passive`, `institutional_subject`), and scholarly machinery (`citation`, `footnote_reference`, `metadiscourse`).

## Related Models

| Model | Task | Classes | F1 |
|-------|------|---------|-----|
| [`HavelockAI/bert-marker-category`](https://huggingface.co/HavelockAI/bert-marker-category) | Binary (oral/literate) | 2 | 0.875 |
| [`HavelockAI/bert-marker-type`](https://huggingface.co/HavelockAI/bert-marker-type) | Functional type | 25 | 0.449 |
| **This model** | Fine-grained subtype | 71 | 0.470 |
| [`HavelockAI/bert-orality-regressor`](https://huggingface.co/HavelockAI/bert-orality-regressor) | Document-level score | Regression | MAE 0.079 |
| [`HavelockAI/bert-token-classifier`](https://huggingface.co/HavelockAI/bert-token-classifier) | Span detection (BIO) | 145 | 0.461 |

## Citation
```bibtex
@misc{havelock2026subtype,
  title={Havelock Marker Subtype Classifier},
  author={Havelock AI},
  year={2026},
  url={https://huggingface.co/HavelockAI/bert-marker-subtype}
}
```

## References

- Ong, Walter J. *Orality and Literacy: The Technologizing of the Word*. Routledge, 1982.

---

*Model version: da931b4a · Trained: February 2026*
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:57612c2d570b6ad4b50fa5e3983044ff32f89670b3a07fa1c01c9d802ed18fb6
+oid sha256:a5a1f8420254999b58763469bd26ef2ba803a70ee980986aca9881f290dd9bb4
 size 438170868