Commit b5503c0 · committed by permutans · verified · 1 parent: 2b6e4da

Upload folder using huggingface_hub

Files changed (2):
1. README.md (+194 −0)
2. model.safetensors (+1 −1)

README.md ADDED
@@ -0,0 +1,194 @@
---
license: mit
tags:
- text-classification
- bert
- orality
- linguistics
- rhetorical-analysis
language:
- en
metrics:
- f1
- accuracy
base_model:
- google-bert/bert-base-uncased
pipeline_tag: text-classification
library_name: transformers
datasets:
- custom
model-index:
- name: bert-marker-type
  results:
  - task:
      type: text-classification
      name: Marker Type Classification
    metrics:
    - type: f1
      value: 0.4486
      name: F1 (macro)
    - type: accuracy
      value: 0.630
      name: Accuracy
---

# Havelock Marker Type Classifier

BERT-based classifier for **25 rhetorical marker types** on the oral–literate spectrum, grounded in Walter Ong's *Orality and Literacy* (1982).

This is the mid-level of the Havelock span classification hierarchy. Given a text span identified as a rhetorical marker, the model classifies it into one of 25 functional types (e.g., `repetition`, `subordination`, `direct_address`, `hedging_qualification`).

## Model Details

| Property | Value |
|----------|-------|
| Base model | `bert-base-uncased` |
| Architecture | `BertForSequenceClassification` |
| Task | Multi-class classification (25 classes) |
| Max sequence length | 128 tokens |
| Best F1 (macro) | **0.4486** |
| Best Accuracy | **0.630** |
| Parameters | ~109M |

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "HavelockAI/bert-marker-type"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

span = "whether or not the underlying assumptions hold true"
inputs = tokenizer(span, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    logits = model(**inputs).logits
    pred = torch.argmax(logits, dim=1).item()

print(f"Marker type: {model.config.id2label[pred]}")
```
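
The snippet above keeps only the argmax. If you want a confidence score or a ranked list of the 25 types, apply a softmax to the logits (with `torch`, simply `logits.softmax(dim=1)`); the same post-processing sketched in plain Python, using hypothetical stand-in logits:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Stand-in logits; the real model emits one score per marker type.
logits = [2.1, 0.3, -1.2, 0.8]
probs = softmax(logits)
ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
# ranked[0] is the argmax; probs[ranked[0]] is its confidence.
```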

## Label Taxonomy (25 types)

The 25 types group the 72 fine-grained subtypes into functional families:

| Oral Types | Literate Types |
|------------|----------------|
| `direct_address` | `subordination` |
| `repetition` | `abstraction` |
| `formulaic_phrases` | `hedging_qualification` |
| `parallelism` | `analytical_distance` |
| `parataxis` | `logical_connectives` |
| `sound_patterns` | `textual_apparatus` |
| `performance_markers` | `literate_feature` |
| `concrete_situational` | `passive_agentless` |
| `agonistic_framing` | |
| `oral_feature` | |

Legacy/low-support types also present in the label space: `agonistic`, `concrete`, `formulaic`, `hedging`, `logical_connective`, `passive`, `passive_constructions`. Apart from `hedging` (49 test examples), each has 10 or fewer test examples, and the model does not reliably predict any of them.

## Training

### Data

Span-level annotations from the same corpus as the category classifier. Each span carries a `marker_type` field. Only types with ≥15 examples in the full dataset are included; the rest are filtered out during label-map construction.

A stratified 80/20 train/test split was used (random seed 42). The test set contains 4,602 spans.
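
The splitting code itself is not published with this card; as a sketch, a per-label 80/20 split of the kind described can be done in a few lines of stdlib Python (the `marker_type` field name comes from the annotations above; the toy corpus is hypothetical):

```python
import random
from collections import defaultdict

def stratified_split(examples, test_frac=0.2, seed=42):
    """Split per label so each class keeps the same train/test ratio."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex["marker_type"]].append(ex)
    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_test = round(len(group) * test_frac)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Hypothetical toy corpus: 60 + 40 spans of two marker types.
data = ([{"marker_type": "repetition"}] * 60
        + [{"marker_type": "subordination"}] * 40)
train, test = stratified_split(data)
# Each class contributes exactly 20% of its spans to the test set.
```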

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Epochs | 3 |
| Batch size | 8 |
| Learning rate | 2e-5 |
| Optimizer | AdamW |
| LR schedule | Linear warmup (10% of total steps) |
| Gradient clipping | 1.0 |
| Loss | Cross-entropy |
| Min examples per class | 15 |
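
The LR schedule row corresponds to the standard linear warmup-then-linear-decay shape (as produced by `transformers`' `get_linear_schedule_with_warmup`); a minimal sketch of that rule with the 2e-5 peak and 10% warmup fraction from the table:

```python
def lr_at_step(step, total_steps, base_lr=2e-5, warmup_frac=0.1):
    """Linear warmup to base_lr over the first 10% of steps,
    then linear decay to zero by the final step."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * (total_steps - step) / (total_steps - warmup_steps)
```

The peak learning rate is reached exactly at the end of warmup (step 100 of 1000 here) and decays to zero at the last step.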

### Training Metrics

| Epoch | Loss | Accuracy | F1 (macro) |
|-------|------|----------|------------|
| 1 | 1.9282 | 0.6043 | 0.4142 |
| 2 | 1.1097 | 0.6215 | 0.4414 |
| 3 | 0.7712 | 0.6297 | 0.4486 |

The best checkpoint was selected by macro F1, at epoch 3. Training loss was still declining steeply at that point, suggesting further epochs would help.

### Test Set Classification Report

<details><summary>Click to expand per-class precision/recall/F1/support</summary>

```
                        precision    recall  f1-score   support

          abstraction       0.637     0.712     0.673       570
            agonistic       0.000     0.000     0.000         7
    agonistic_framing       0.902     0.698     0.787        53
  analytical_distance       0.516     0.465     0.489       245
             concrete       0.000     0.000     0.000         7
 concrete_situational       0.471     0.467     0.469       225
       direct_address       0.691     0.752     0.720       722
            formulaic       0.000     0.000     0.000        10
    formulaic_phrases       0.598     0.600     0.599       380
              hedging       0.000     0.000     0.000        49
hedging_qualification       0.477     0.588     0.527       194
     literate_feature       0.690     0.703     0.696       111
   logical_connective       0.000     0.000     0.000         5
  logical_connectives       0.537     0.600     0.567       220
         oral_feature       0.521     0.388     0.444        98
          parallelism       0.786     0.805     0.795        41
            parataxis       0.642     0.538     0.586       130
              passive       0.000     0.000     0.000         4
    passive_agentless       0.651     0.597     0.623       119
passive_constructions       0.000     0.000     0.000         9
  performance_markers       0.607     0.496     0.546       137
           repetition       0.681     0.726     0.703       318
       sound_patterns       0.693     0.591     0.638       149
        subordination       0.663     0.689     0.676       586
    textual_apparatus       0.708     0.648     0.676       213

             accuracy                           0.630      4602
            macro avg       0.459     0.443     0.449      4602
         weighted avg       0.618     0.630     0.622      4602
```

</details>

**Top-performing types (F1 > 0.65):** `parallelism` (0.795), `agonistic_framing` (0.787), `direct_address` (0.720), `repetition` (0.703), `literate_feature` (0.696), `subordination` (0.676), `textual_apparatus` (0.676), `abstraction` (0.673).

**Zero-F1 types:** `agonistic`, `concrete`, `formulaic`, `hedging`, `logical_connective`, `passive`, `passive_constructions`. All but `hedging` (49 test examples) have 10 or fewer, and they appear to be legacy label variants superseded by more specific types.
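
The macro and weighted averages in the report can be recomputed directly from the per-class rows, which makes the imbalance effect concrete (F1 and support values copied from the table above):

```python
# Per-class (f1, support) pairs from the test-set report.
report = {
    "abstraction": (0.673, 570), "agonistic": (0.000, 7),
    "agonistic_framing": (0.787, 53), "analytical_distance": (0.489, 245),
    "concrete": (0.000, 7), "concrete_situational": (0.469, 225),
    "direct_address": (0.720, 722), "formulaic": (0.000, 10),
    "formulaic_phrases": (0.599, 380), "hedging": (0.000, 49),
    "hedging_qualification": (0.527, 194), "literate_feature": (0.696, 111),
    "logical_connective": (0.000, 5), "logical_connectives": (0.567, 220),
    "oral_feature": (0.444, 98), "parallelism": (0.795, 41),
    "parataxis": (0.586, 130), "passive": (0.000, 4),
    "passive_agentless": (0.623, 119), "passive_constructions": (0.000, 9),
    "performance_markers": (0.546, 137), "repetition": (0.703, 318),
    "sound_patterns": (0.638, 149), "subordination": (0.676, 586),
    "textual_apparatus": (0.676, 213),
}
total = sum(s for _, s in report.values())  # 4602 test spans
macro = sum(f for f, _ in report.values()) / len(report)
weighted = sum(f * s for f, s in report.values()) / total
# macro ≈ 0.449 vs weighted ≈ 0.622: the rare zero-F1 classes drag the
# macro average down while barely moving the support-weighted one.
```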

## Limitations

- **Severely undertrained**: 3 epochs, with loss at 0.77 and still falling sharply. This model would benefit substantially from more training.
- **Label noise from legacy types**: 7 of the 25 classes appear to be legacy/coarse variants that coexist with their refined replacements (e.g., `hedging` alongside `hedging_qualification`). This inflates the label space and depresses macro F1.
- **Class imbalance**: `direct_address` has 722 test examples while `passive` has 4. Weighted F1 (0.622) is substantially higher than macro F1 (0.449), indicating the model performs better on common types.
- **Span-level only**: Requires pre-extracted spans. Does not detect marker boundaries.
- **128-token context window**: Longer spans are truncated.

## Theoretical Background

The type level captures functional groupings within the oral–literate framework. Oral types reflect Ong's characterization of oral discourse as additive (`parataxis`), aggregative (`formulaic_phrases`), redundant (`repetition`), agonistically toned (`agonistic_framing`), empathetic and participatory (`direct_address`), and close to the human lifeworld (`concrete_situational`). Literate types capture the analytic (`abstraction`, `subordination`), distanced (`analytical_distance`, `passive_agentless`), and self-referential (`textual_apparatus`) qualities of written discourse.

## Citation

```bibtex
@misc{havelock2026type,
  title={Havelock Marker Type Classifier},
  author={Havelock AI},
  year={2026},
  url={https://huggingface.co/HavelockAI/bert-marker-type}
}
```

## References

- Ong, Walter J. *Orality and Literacy: The Technologizing of the Word*. Routledge, 1982.

---

*Model version: da931b4a · Trained: February 2026*
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:d9f0f0f5c783d07236ac2b0fc8982b1daa31efb9abfd9db72d011921d6b6c1f8
+ oid sha256:18737f307d25ae24953445bd387589b30067adc7557d0506b3311953b9a8bd6f
  size 438029372