---
license: mit
tags:
- text-classification
- modernbert
- orality
- linguistics
- rhetorical-analysis
metrics:
- f1
- accuracy
base_model:
- answerdotai/ModernBERT-base
pipeline_tag: text-classification
library_name: transformers
datasets:
      name: Marker Type Classification
    metrics:
    - type: f1
      value: 0.573
      name: F1 (macro)
    - type: accuracy
      value: 0.584
---

# Havelock Marker Type Classifier

ModernBERT-based classifier for **18 rhetorical marker types** on the oral–literate spectrum, grounded in Walter Ong's *Orality and Literacy* (1982).

This is the mid-level of the Havelock span classification hierarchy. Given a text span identified as a rhetorical marker, the model classifies it into one of 18 functional types (e.g., `repetition`, `subordination`, `direct_address`, `hedging_qualification`).

| Property | Value |
|----------|-------|
| Base model | `answerdotai/ModernBERT-base` |
| Architecture | `ModernBertForSequenceClassification` |
| Task | Multi-class classification (18 classes) |
| Max sequence length | 128 tokens |
| Test F1 (macro) | **0.573** |
| Test Accuracy | **0.584** |
| Missing labels | **0/18** |
| Parameters | ~149M |

## Usage
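The original usage snippet is elided in this view. Below is a minimal sketch using the standard `transformers` pipeline API; the repo id `HavelockAI/bert-marker-type` is an assumption inferred from the Related Models table below, not confirmed by this card.

```python
MODEL_ID = "HavelockAI/bert-marker-type"  # assumed repo id (not shown in this view)

def classify_spans(texts, model_id=MODEL_ID):
    """Classify pre-extracted rhetorical spans into one of the 18 marker types."""
    from transformers import pipeline  # lazy import; the checkpoint downloads on first use

    clf = pipeline("text-classification", model=model_id)
    # truncation=True enforces the model's 128-token window at call time.
    return [(p["label"], round(p["score"], 3)) for p in clf(texts, truncation=True)]

# Example call (downloads the checkpoint):
# classify_spans(["Tell me, O Muse, of that ingenious hero"])
```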

### Data

22,367 span-level annotations from the Havelock corpus. Each span carries a `marker_type` field normalized against a canonical taxonomy at build time. A stratified 80/10/10 train/val/test split was used, with swap-based optimization to balance label distributions across splits. The test set contains 2,178 spans.
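The splitting step can be sketched as follows. This is a plain stratified 80/10/10 split by label; the swap-based rebalancing mentioned above is omitted, and all names are illustrative.

```python
import random

def stratified_split(items, labels, seed=0, frac=(0.8, 0.1, 0.1)):
    """Plain stratified 80/10/10 split by label (the card's swap-based
    rebalancing pass is omitted in this sketch)."""
    rng = random.Random(seed)
    by_label = {}
    for item, lab in zip(items, labels):
        by_label.setdefault(lab, []).append(item)
    train, val, test = [], [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_train = int(len(group) * frac[0])
        n_val = int(len(group) * frac[1])
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]
    return train, val, test

# Toy demo: 100 items, two labels, split 80/10/10 within each label.
tr, va, te = stratified_split(list(range(100)), ["a"] * 60 + ["b"] * 40)
print(len(tr), len(va), len(te))  # 80 10 10
```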

### Hyperparameters
| Mixed precision | FP16 |
| Min examples per class | 50 |

### Training Metrics

Best checkpoint selected at epoch 15, using the number of missing labels as the primary criterion and macro F1 as the tiebreaker (0 missing labels, F1 0.590).

### Test Set Classification Report

<details><summary>Click to expand per-class precision/recall/F1/support</summary>

```
                        precision    recall  f1-score   support

          abstraction       0.368     0.658     0.472       117
    agonistic_framing       0.857     0.750     0.800        32
  analytical_distance       0.504     0.475     0.489       120
 concrete_situational       0.509     0.385     0.438       143
       direct_address       0.671     0.689     0.680       367
    formulaic_phrases       0.205     0.608     0.307        51
hedging_qualification       0.600     0.500     0.545       114
     literate_feature       0.478     0.833     0.608        66
  logical_connectives       0.621     0.516     0.564       124
         oral_feature       0.784     0.365     0.498       159
          parallelism       0.688     0.579     0.629        19
            parataxis       0.655     0.387     0.486        93
    passive_agentless       0.721     0.500     0.590        62
  performance_markers       0.660     0.403     0.500        77
           repetition       0.738     0.705     0.721       156
       sound_patterns       0.672     0.623     0.647        69
        subordination       0.622     0.689     0.654       296
    textual_apparatus       0.718     0.655     0.685       113

             accuracy                           0.584      2178
            macro avg       0.615     0.573     0.573      2178
         weighted avg       0.624     0.584     0.587      2178
```

</details>

**Top performing types (F1 ≥ 0.60):** `agonistic_framing` (0.800), `repetition` (0.721), `textual_apparatus` (0.685), `direct_address` (0.680), `subordination` (0.654), `sound_patterns` (0.647), `parallelism` (0.629), `literate_feature` (0.608).

**Weakest types (F1 < 0.50):** `formulaic_phrases` (0.307), `concrete_situational` (0.438), `abstraction` (0.472), `parataxis` (0.486), `oral_feature` (0.498). `formulaic_phrases` suffers from severe precision collapse (P=0.205) despite reasonable recall, suggesting heavy confusion with other oral types. `oral_feature` shows the inverse pattern (P=0.784, R=0.365): the model is confident but conservative.
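The macro and weighted averages can be reproduced from the per-class rows in the report: macro F1 weights all 18 classes equally, while weighted F1 scales each class by its test support. A quick check (per-class values copied from the report; recomputing from the rounded figures happens to match the reported averages to three decimals):

```python
# (f1, support) per class, copied from the test-set classification report.
rows = {
    "abstraction": (0.472, 117), "agonistic_framing": (0.800, 32),
    "analytical_distance": (0.489, 120), "concrete_situational": (0.438, 143),
    "direct_address": (0.680, 367), "formulaic_phrases": (0.307, 51),
    "hedging_qualification": (0.545, 114), "literate_feature": (0.608, 66),
    "logical_connectives": (0.564, 124), "oral_feature": (0.498, 159),
    "parallelism": (0.629, 19), "parataxis": (0.486, 93),
    "passive_agentless": (0.590, 62), "performance_markers": (0.500, 77),
    "repetition": (0.721, 156), "sound_patterns": (0.647, 69),
    "subordination": (0.654, 296), "textual_apparatus": (0.685, 113),
}
total = sum(s for _, s in rows.values())            # total test spans
macro = sum(f for f, _ in rows.values()) / len(rows)      # unweighted mean
weighted = sum(f * s for f, s in rows.values()) / total   # support-weighted mean
print(total, round(macro, 3), round(weighted, 3))  # 2178 0.573 0.587
```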

## Class Distribution

| Support range (examples in corpus) | Classes | Count |
|------------------------------------|---------|-------|
| >2500 | `direct_address`, `subordination`, `abstraction` | 3 |
| 1000–2500 | `repetition`, `formulaic_phrases`, `hedging_qualification`, `analytical_distance`, `concrete_situational`, `logical_connectives`, `textual_apparatus` | 7 |
| 500–1000 | `sound_patterns`, `passive_agentless`, `performance_markers`, `parataxis`, `literate_feature`, `oral_feature` | 6 |
| <500 | `agonistic_framing`, `parallelism` | 2 |

## Limitations

- **Class imbalance**: `direct_address` has 367 test examples while `parallelism` has 19. Weighted F1 (0.587) is close to macro F1 (0.573), indicating reasonably balanced performance, but rare types remain harder.
- **Span-level only**: Requires pre-extracted spans; it does not detect span boundaries.
- **128-token context window**: Longer spans are truncated.
- **Abstraction underperforms**: At 0.472 F1 despite a relatively large test set (117 spans), suggesting the type may be too broad or may overlap with `analytical_distance` and `literate_feature`.
- **Precision–recall asymmetry**: Several types show a strong precision–recall imbalance (`oral_feature` P=0.784/R=0.365; `formulaic_phrases` P=0.205/R=0.608), indicating the focal-loss weighting could be further tuned.
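For the truncation limitation, spans can be pre-screened before classification. A sketch with a hypothetical whitespace tokenizer stand-in; for the real check, substitute the checkpoint's tokenizer (e.g. `AutoTokenizer.from_pretrained(...)`), whose `input_ids` length is what the 128-token limit applies to.

```python
MAX_LEN = 128  # the model's maximum sequence length

def flag_long_spans(texts, tokenizer, max_len=MAX_LEN):
    """Return indices of spans whose tokenized length exceeds the window."""
    flagged = []
    for i, text in enumerate(texts):
        n_tokens = len(tokenizer(text)["input_ids"])
        if n_tokens > max_len:
            flagged.append(i)
    return flagged

class WhitespaceTok:
    """Hypothetical stand-in: counts whitespace tokens instead of subwords."""
    def __call__(self, text):
        return {"input_ids": text.split()}

print(flag_long_spans(["short span", "word " * 200], WhitespaceTok()))  # [1]
```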

## Theoretical Background

The type level captures functional groupings within the oral–literate framework. Oral types reflect Ong's characterization of oral discourse as additive (`parataxis`), aggregative (`formulaic_phrases`), redundant (`repetition`), agonistically toned (`agonistic_framing`), empathetic and participatory (`direct_address`), and close to the human lifeworld (`concrete_situational`). Literate types capture the analytic (`abstraction`, `subordination`), distanced (`analytical_distance`, `passive_agentless`), and self-referential (`textual_apparatus`) qualities of written discourse.

## Related Models

| Model | Task | Classes | Score |
|-------|------|---------|-------|
| [`HavelockAI/bert-marker-category`](https://huggingface.co/HavelockAI/bert-marker-category) | Binary (oral/literate) | 2 | F1 0.875 |
| **This model** | Functional type | 18 | F1 0.573 |
| [`HavelockAI/bert-marker-subtype`](https://huggingface.co/HavelockAI/bert-marker-subtype) | Fine-grained subtype | 71 | F1 0.493 |
| [`HavelockAI/bert-orality-regressor`](https://huggingface.co/HavelockAI/bert-orality-regressor) | Document-level score | Regression | MAE 0.079 |
| [`HavelockAI/bert-token-classifier`](https://huggingface.co/HavelockAI/bert-token-classifier) | Span detection (BIO) | 145 | F1 0.500 |
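The span-level classifiers in the table can be cascaded coarse-to-fine. A sketch assuming each repo loads with the standard `transformers` pipeline API; `HavelockAI/bert-marker-type` is the assumed id of this model, and the subtype stage is omitted.

```python
def analyze_span(text):
    """Run one span through the category -> type hierarchy (subtype stage omitted)."""
    from transformers import pipeline  # lazy import; checkpoints download on first use

    category = pipeline("text-classification",
                        model="HavelockAI/bert-marker-category")
    marker_type = pipeline("text-classification",
                           model="HavelockAI/bert-marker-type")  # assumed repo id
    return {
        "category": category(text, truncation=True)[0]["label"],   # oral / literate
        "type": marker_type(text, truncation=True)[0]["label"],    # one of 18 types
    }
```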

## Citation

```bibtex
@misc{havelock2026type,
  title  = {Havelock Marker Type Classifier},
  author = {HavelockAI},
  year   = {2026}
}
```

## References

- Ong, Walter J. *Orality and Literacy: The Technologizing of the Word*. Routledge, 1982.
- Lee, C. et al. "Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models." ICLR 2020.
- Warner, A. et al. "Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference." 2024.

---

*Trained: February 2026*