Upload folder using huggingface_hub

- README.md +269 -0
- model.safetensors +1 -1

README.md ADDED
@@ -0,0 +1,269 @@
---
license: mit
tags:
- text-classification
- bert
- orality
- linguistics
- rhetorical-analysis
language:
- en
metrics:
- f1
- accuracy
base_model:
- google-bert/bert-base-uncased
pipeline_tag: text-classification
library_name: transformers
datasets:
- custom
model-index:
- name: bert-marker-subtype
  results:
  - task:
      type: text-classification
      name: Marker Subtype Classification
    metrics:
    - type: f1
      value: 0.4704
      name: F1 (macro)
    - type: accuracy
      value: 0.515
      name: Accuracy
---

# Havelock Marker Subtype Classifier

BERT-based classifier for **71 fine-grained rhetorical marker subtypes** on the oral–literate spectrum, grounded in Walter Ong's *Orality and Literacy* (1982).

This is the finest level of the Havelock span classification hierarchy. Given a text span identified as a rhetorical marker, the model classifies it into one of 71 specific rhetorical devices (e.g., `anaphora`, `epistemic_hedge`, `vocative`, `nested_clauses`).

## Model Details

| Property | Value |
|----------|-------|
| Base model | `bert-base-uncased` |
| Architecture | `BertForSequenceClassification` |
| Task | Multi-class classification (71 classes) |
| Max sequence length | 128 tokens |
| Best F1 (macro) | **0.4704** |
| Best Accuracy | **0.515** |
| Parameters | ~109M |

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "HavelockAI/bert-marker-subtype"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

span = "it seems likely that this would, in principle, be feasible"
inputs = tokenizer(span, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    logits = model(**inputs).logits
    pred = torch.argmax(logits, dim=1).item()

print(f"Marker subtype: {model.config.id2label[pred]}")
```
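With 71 classes and considerable semantic overlap between subtypes (see Limitations), a single argmax can hide close runners-up, so inspecting the top-k candidates is often more informative. A minimal sketch — the dummy `logits` tensor and four-label `id2label` mapping below are stand-ins for the real model outputs so the example runs offline:

```python
import torch

# Stand-ins: in practice use `model(**inputs).logits` and
# `model.config.id2label` from the snippet above.
logits = torch.tensor([[2.1, 0.3, -1.0, 1.7]])
id2label = {0: "epistemic_hedge", 1: "probability",
            2: "vocative", 3: "qualified_assertion"}

probs = torch.softmax(logits, dim=-1)      # normalize logits to probabilities
top = torch.topk(probs, k=3, dim=-1)       # three most likely subtypes
for p, i in zip(top.values[0].tolist(), top.indices[0].tolist()):
    print(f"{id2label[i]}: {p:.3f}")
```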

## Label Taxonomy (71 subtypes)

### Oral Subtypes (38)

| Category | Subtypes |
|----------|----------|
| **Repetition & Pattern** | `anaphora`, `epistrophe`, `parallelism`, `tricolon`, `lexical_repetition`, `refrain` |
| **Sound & Rhythm** | `alliteration`, `assonance`, `rhyme`, `rhythm` |
| **Address & Interaction** | `vocative`, `imperative`, `second_person`, `inclusive_we`, `rhetorical_question`, `audience_response`, `phatic_check`, `phatic_filler` |
| **Conjunction** | `polysyndeton`, `asyndeton`, `simple_conjunction`, `binomial_expression` |
| **Formulas** | `discourse_formula`, `proverb`, `religious_formula`, `epithet` |
| **Narrative** | `named_individual`, `specific_place`, `temporal_anchor`, `sensory_detail`, `embodied_action`, `everyday_example` |
| **Performance** | `dramatic_pause`, `self_correction`, `conflict_frame`, `us_them`, `intensifier_doubling`, `antithesis` |

### Literate Subtypes (34)

| Category | Subtypes |
|----------|----------|
| **Abstraction** | `nominalization`, `abstract_noun`, `conceptual_metaphor`, `categorical_statement` |
| **Syntax** | `nested_clauses`, `relative_chain`, `conditional`, `concessive`, `temporal_embedding`, `causal_chain` |
| **Hedging** | `epistemic_hedge`, `probability`, `evidential`, `qualified_assertion`, `concessive_connector` |
| **Impersonality** | `agentless_passive`, `agent_demoted`, `institutional_subject`, `objectifying_stance`, `third_person_reference` |
| **Scholarly Apparatus** | `citation`, `footnote_reference`, `cross_reference`, `metadiscourse`, `methodological_framing` |
| **Technical** | `technical_term`, `technical_abbreviation`, `enumeration`, `list_structure`, `definitional_move` |
| **Connectives** | `contrastive`, `causal_explicit`, `additive_formal`, `aside` |

## Training

### Data

Span-level annotations from the Havelock corpus. Each span carries a `marker_subtype` field. Only subtypes with ≥15 examples in the full dataset are included. The corpus draws from Project Gutenberg, textfiles.com, Reddit, and Wikipedia talk pages.

A stratified 80/20 train/test split was used (random seed 42). The test set contains 4,608 spans.
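The split above can be reproduced in outline with scikit-learn's `train_test_split`; the `spans` and `labels` toy data below are hypothetical stand-ins for the corpus fields:

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the corpus span texts and subtype labels.
spans = [f"span {i}" for i in range(100)]
labels = ["oral"] * 60 + ["literate"] * 40

train_x, test_x, train_y, test_y = train_test_split(
    spans, labels,
    test_size=0.2,      # 80/20 split
    stratify=labels,    # preserve per-class proportions in both halves
    random_state=42,    # seed 42, as in the card
)
print(len(train_x), len(test_x))
```

Stratification matters here because many subtypes sit near the 15-example minimum; a plain random split could leave a rare class entirely out of the test set.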
+
### Hyperparameters
|
| 107 |
+
|
| 108 |
+
| Parameter | Value |
|
| 109 |
+
|-----------|-------|
|
| 110 |
+
| Epochs | 3 |
|
| 111 |
+
| Batch size | 8 |
|
| 112 |
+
| Learning rate | 2e-5 |
|
| 113 |
+
| Optimizer | AdamW |
|
| 114 |
+
| LR schedule | Linear warmup (10% of total steps) |
|
| 115 |
+
| Gradient clipping | 1.0 |
|
| 116 |
+
| Loss | Cross-entropy |
|
| 117 |
+
| Min examples per class | 15 |
|
| 118 |
+
|
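The LR schedule row can be sketched as a pure function — a minimal illustration assuming the standard setup of linear warmup over the first 10% of steps followed by linear decay to zero (the card does not state the decay target explicitly):

```python
def lr_at_step(step: int, total_steps: int, peak_lr: float = 2e-5,
               warmup_frac: float = 0.1) -> float:
    """Linear warmup to peak_lr over the first warmup_frac of steps,
    then linear decay to 0 at total_steps."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total_steps - step) / max(1, total_steps - warmup_steps)

# From the card's numbers: test set 4,608 spans = 20%, so ~18,432 training
# spans; at batch size 8 that is ~2,304 steps/epoch, ~6,912 steps over 3 epochs.
```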

### Training Metrics

| Epoch | Loss | Accuracy | F1 (macro) |
|-------|------|----------|------------|
| 1 | 3.2554 | 0.4210 | 0.3060 |
| 2 | 2.0844 | 0.5033 | 0.4345 |
| 3 | 1.5922 | 0.5154 | 0.4704 |

The best checkpoint was selected by macro F1 at epoch 3. Training loss was still declining steeply at that point (see Limitations).
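Macro F1, the selection metric above, averages per-class F1 with equal weight, so rare subtypes count as much as frequent ones like `second_person`; that is why it sits well below accuracy here. A toy illustration with scikit-learn (the labels are arbitrary examples from the taxonomy):

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = ["anaphora", "anaphora", "vocative", "tricolon"]
y_pred = ["anaphora", "vocative", "vocative", "tricolon"]

acc = accuracy_score(y_true, y_pred)                # 3 of 4 correct
macro = f1_score(y_true, y_pred, average="macro")   # mean of per-class F1
print(f"accuracy={acc:.3f} macro_f1={macro:.3f}")
```

Here accuracy is 0.75, but macro F1 is pulled up and down by each class equally: `tricolon` scores a perfect 1.0 while `anaphora` and `vocative` each score 2/3, giving a macro F1 of about 0.778.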

### Test Set Classification Report

<details><summary>Click to expand per-class precision/recall/F1/support</summary>

```
                         precision  recall  f1-score  support

abstract_noun                0.262   0.333     0.294      144
additive_formal              0.250   0.038     0.067       26
agent_demoted                0.944   0.548     0.694       31
agentless_passive            0.458   0.619     0.526      105
alliteration                 0.400   0.133     0.200       30
anaphora                     0.468   0.659     0.547       88
antithesis                   0.575   0.742     0.648       31
aside                        0.467   0.127     0.200       55
assonance                    0.744   0.970     0.842       33
asyndeton                    0.867   0.433     0.578       30
audience_response            0.800   0.533     0.640       30
categorical_statement        0.362   0.388     0.374       98
causal_chain                 0.472   0.625     0.538       80
causal_explicit              0.400   0.406     0.403       69
citation                     0.494   0.612     0.547       67
conceptual_metaphor          0.235   0.055     0.089       73
concessive                   0.677   0.739     0.707       88
concessive_connector         0.920   0.742     0.821       31
conditional                  0.627   0.671     0.648      155
conflict_frame               0.800   0.774     0.787       31
contrastive                  0.390   0.595     0.471      116
cross_reference              0.429   0.353     0.387       34
definitional_move            0.429   0.077     0.130       39
discourse_formula            0.499   0.703     0.583      276
dramatic_pause               0.833   0.806     0.820       31
embodied_action              0.286   0.377     0.325       69
enumeration                  0.504   0.694     0.584       85
epistemic_hedge              0.429   0.624     0.508      101
epistrophe                   0.763   0.906     0.829       32
epithet                      0.429   0.444     0.436       27
everyday_example             0.432   0.390     0.410       41
evidential                   0.608   0.574     0.590       54
footnote_reference           1.000   0.133     0.235       15
imperative                   0.617   0.760     0.681      146
inclusive_we                 0.579   0.700     0.634      120
institutional_subject        0.586   0.548     0.567       31
intensifier_doubling         0.792   0.633     0.704       30
lexical_repetition           0.535   0.649     0.587       94
list_structure               0.300   0.167     0.214       36
metadiscourse                0.310   0.310     0.310       87
methodological_framing       0.000   0.000     0.000       32
named_individual             0.446   0.527     0.483       55
nested_clauses               0.375   0.172     0.236       87
nominalization               0.336   0.333     0.335      120
objectifying_stance          0.250   0.023     0.043       43
parallelism                  0.250   0.052     0.086       58
phatic_check                 1.000   0.286     0.444       21
phatic_filler                0.529   0.300     0.383       30
polysyndeton                 0.675   0.844     0.750       32
probability                  0.571   0.327     0.416       49
proverb                      0.222   0.065     0.100       31
qualified_assertion          0.286   0.100     0.148       60
refrain                      0.895   0.567     0.694       30
relative_chain               0.504   0.600     0.548      115
religious_formula            0.917   0.688     0.786       32
rhetorical_question          0.614   0.820     0.702      161
rhyme                        0.545   0.562     0.554       32
rhythm                       0.839   0.812     0.825       32
second_person                0.557   0.600     0.578      235
self_correction              0.895   0.567     0.694       30
sensory_detail               0.000   0.000     0.000       37
simple_conjunction           0.667   0.049     0.091       41
specific_place               1.000   0.038     0.074       26
technical_abbreviation       1.000   0.053     0.100       19
technical_term               0.489   0.571     0.527      161
temporal_anchor              0.471   0.490     0.480       49
temporal_embedding           0.448   0.481     0.464       81
third_person_reference       0.917   0.710     0.800       31
tricolon                     0.656   0.700     0.677       30
us_them                      0.882   0.484     0.625       31
vocative                     0.593   0.603     0.598       58

accuracy                                       0.515     4608
macro avg                    0.561   0.465     0.470     4608
weighted avg                 0.512   0.515     0.490     4608
```

</details>

**Top-performing subtypes (F1 ≥ 0.75):** `assonance` (0.842), `epistrophe` (0.829), `rhythm` (0.825), `concessive_connector` (0.821), `dramatic_pause` (0.820), `third_person_reference` (0.800), `conflict_frame` (0.787), `religious_formula` (0.786), `polysyndeton` (0.750).

**Near-zero F1 subtypes (≤ 0.100):** `methodological_framing` (0.000), `sensory_detail` (0.000), `objectifying_stance` (0.043), `additive_formal` (0.067), `specific_place` (0.074), `parallelism` (0.086), `conceptual_metaphor` (0.089), `simple_conjunction` (0.091), `proverb` (0.100), `technical_abbreviation` (0.100). These tend to be either semantically diffuse classes or classes with very low support.

## Class Distribution

The test set exhibits significant imbalance across the 71 classes:

| Support Range | Classes | % of Total |
|---------------|---------|------------|
| >200 | 2 (`discourse_formula`, `second_person`) | 3% |
| 100–200 | 11 | 15% |
| 50–100 | 19 | 27% |
| <50 | 39 | 55% |

## Limitations

- **Severely undertrained**: 3 epochs, with loss at 1.59 and still falling steeply. This model has the most headroom for improvement of the three span classifiers.
- **71-way classification on ~23k spans**: The data budget per class is thin, particularly for classes near the 15-example minimum. More data or class consolidation would help.
- **Semantic overlap**: Some subtypes are difficult to distinguish from surface text alone (e.g., `parallelism` vs `anaphora` vs `tricolon`; `epistemic_hedge` vs `qualified_assertion` vs `probability`). The model may benefit from hierarchical classification that conditions on type-level predictions.
- **Precision–recall tradeoff**: Many rare classes show high precision but very low recall (e.g., `footnote_reference`: P=1.000, R=0.133), suggesting the model learns narrow prototypes but misses variation.
- **Span-level only**: Requires pre-extracted spans; does not detect span boundaries.
- **128-token context window**: Longer spans are truncated.

## Theoretical Background

The 71 subtypes represent the full granularity of the Havelock taxonomy, operationalizing Ong's oral–literate framework into specific, annotatable rhetorical devices. Oral subtypes capture the textural signatures of spoken and performative discourse: repetitive structures (`anaphora`, `epistrophe`, `tricolon`), sound patterning (`alliteration`, `assonance`, `rhythm`), direct audience engagement (`vocative`, `imperative`, `rhetorical_question`), and formulas (`proverb`, `epithet`, `discourse_formula`). Literate subtypes capture the apparatus of analytic prose: complex syntax (`nested_clauses`, `relative_chain`, `conditional`), epistemic positioning (`epistemic_hedge`, `evidential`, `probability`), impersonal voice (`agentless_passive`, `institutional_subject`), and scholarly machinery (`citation`, `footnote_reference`, `metadiscourse`).

## Related Models

| Model | Task | Classes | F1 |
|-------|------|---------|-----|
| [`HavelockAI/bert-marker-category`](https://huggingface.co/HavelockAI/bert-marker-category) | Binary (oral/literate) | 2 | 0.875 |
| [`HavelockAI/bert-marker-type`](https://huggingface.co/HavelockAI/bert-marker-type) | Functional type | 25 | 0.449 |
| **This model** | Fine-grained subtype | 71 | 0.470 |
| [`HavelockAI/bert-orality-regressor`](https://huggingface.co/HavelockAI/bert-orality-regressor) | Document-level score | Regression | MAE 0.079 |
| [`HavelockAI/bert-token-classifier`](https://huggingface.co/HavelockAI/bert-token-classifier) | Span detection (BIO) | 145 | 0.461 |

## Citation

```bibtex
@misc{havelock2026subtype,
  title={Havelock Marker Subtype Classifier},
  author={Havelock AI},
  year={2026},
  url={https://huggingface.co/HavelockAI/bert-marker-subtype}
}
```

## References

- Ong, Walter J. *Orality and Literacy: The Technologizing of the Word*. Routledge, 1982.

---

*Model version: da931b4a · Trained: February 2026*
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:a5a1f8420254999b58763469bd26ef2ba803a70ee980986aca9881f290dd9bb4
 size 438170868