Upload folder using huggingface_hub

- README.md +145 -0
- model.safetensors +1 -1

README.md ADDED
@@ -0,0 +1,145 @@
---
license: mit
tags:
- text-classification
- bert
- orality
- linguistics
- rhetorical-analysis
language:
- en
metrics:
- f1
- accuracy
base_model:
- google-bert/bert-base-uncased
pipeline_tag: text-classification
library_name: transformers
datasets:
- custom
model-index:
- name: bert-marker-category
  results:
  - task:
      type: text-classification
      name: Oral/Literate Span Classification
    metrics:
    - type: f1
      value: 0.8748
      name: F1 (macro)
    - type: accuracy
      value: 0.875
      name: Accuracy
---

# Havelock Marker Category Classifier

A BERT-based binary classifier that determines whether a rhetorical span is **oral** or **literate**, grounded in Walter Ong's *Orality and Literacy* (1982).

This is the coarsest level of the Havelock span classification hierarchy. Given a text span that has been identified as a rhetorical marker, the model assigns it one of two categories: oral (characteristic of spoken, performative discourse) or literate (characteristic of written, analytic discourse).
## Model Details

| Property | Value |
|----------|-------|
| Base model | `bert-base-uncased` |
| Architecture | `BertForSequenceClassification` |
| Task | Binary classification |
| Labels | 2 (`oral`, `literate`) |
| Max sequence length | 128 tokens |
| Best F1 (macro) | **0.8748** |
| Best accuracy | **0.875** |
| Parameters | ~109M |
## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "HavelockAI/bert-marker-category"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

span = "Tell me, O Muse, of that ingenious hero"
inputs = tokenizer(span, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    logits = model(**inputs).logits
pred = torch.argmax(logits, dim=1).item()

label_map = {0: "oral", 1: "literate"}
print(f"Category: {label_map[pred]}")
```
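The argmax above discards how confident the model is. A softmax over the logits yields a probability per label; a minimal sketch of that post-processing step, using illustrative logit values rather than a live model call:

```python
import math

label_map = {0: "oral", 1: "literate"}
# Illustrative logits, not real model output.
logits = [2.1, -0.7]

# Softmax: exponentiate (shifted by the max for numerical stability), then normalize.
m = max(logits)
exps = [math.exp(x - m) for x in logits]
probs = [e / sum(exps) for e in exps]

pred = probs.index(max(probs))
print(f"Category: {label_map[pred]} ({probs[pred]:.2%})")
```

The same computation is available as `torch.softmax(logits, dim=1)` when working with the tensor output directly.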

## Training

### Data

The model was trained on span-level annotations exported as JSONL, where each span is a contiguous text region identified as a rhetorical marker. Spans are drawn from documents sourced from Project Gutenberg, textfiles.com, Reddit, and Wikipedia talk pages.

A stratified 80/20 train/test split was used (random seed 42). The test set contains 4,608 spans (2,281 oral, 2,327 literate), i.e. near-perfect class balance.
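The stratified split described above can be reproduced along these lines. This is a stdlib-only sketch with synthetic labels; the card does not name the tooling actually used, and `sklearn.model_selection.train_test_split(..., stratify=labels, random_state=42)` would be the usual shortcut:

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac=0.2, seed=42):
    """Split example indices so each class keeps the same train/test ratio."""
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)                      # shuffle within each class
        cut = round(len(idxs) * test_frac)     # 20% of this class to test
        test_idx.extend(idxs[:cut])
        train_idx.extend(idxs[cut:])
    return train_idx, test_idx

# Synthetic, balanced labels for illustration only.
labels = ["oral"] * 500 + ["literate"] * 500
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 800 200
```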
### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Epochs | 3 |
| Batch size | 8 |
| Learning rate | 2e-5 |
| Optimizer | AdamW |
| LR schedule | Linear warmup (10% of total steps) |
| Gradient clipping | 1.0 |
| Loss | Cross-entropy |
| Min examples per class | 15 |
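The linear-warmup schedule in the table is a simple piecewise-linear function of the step count: ramp up to the peak learning rate over the first 10% of steps, then decay linearly to zero. A sketch of that function (the actual scheduler used, e.g. `transformers.get_linear_schedule_with_warmup`, is an assumption):

```python
def linear_warmup_lr(step, total_steps, peak_lr=2e-5, warmup_frac=0.1):
    """Linear ramp to peak_lr over the warmup steps, then linear decay to 0."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)

total = 1000
print(linear_warmup_lr(50, total))   # mid-warmup: half of peak
print(linear_warmup_lr(100, total))  # end of warmup: peak
print(linear_warmup_lr(550, total))  # halfway through the decay
```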
### Training Metrics

| Epoch | Loss | Accuracy | F1 (macro) |
|-------|------|----------|------------|
| 1 | 0.4095 | 0.8730 | 0.8730 |
| 2 | 0.2967 | 0.8748 | 0.8748 |
| 3 | 0.2126 | 0.8694 | 0.8693 |

The best checkpoint was selected by F1 at epoch 2.
### Test Set Classification Report

```
              precision    recall  f1-score   support

        oral      0.868     0.868     0.868      2281
    literate      0.871     0.871     0.871      2327

    accuracy                          0.869      4608
   macro avg      0.869     0.869     0.869      4608
weighted avg      0.869     0.869     0.869      4608
```
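In the report above, each per-class F1 is the harmonic mean of that class's precision and recall, and `macro avg` is the unweighted mean across the two classes. Checking the arithmetic against the reported rows:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Per-class F1 from the reported precision/recall values.
oral_f1 = f1(0.868, 0.868)        # equal p and r, so F1 == 0.868
literate_f1 = f1(0.871, 0.871)    # likewise 0.871

# Macro average: unweighted mean across classes.
macro_f1 = (oral_f1 + literate_f1) / 2
print(round(macro_f1, 3))
```

With a nearly balanced test set (2,281 vs. 2,327), the macro and weighted averages are essentially identical, as the report shows.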

## Limitations

- **Short training**: 3 epochs, with loss still declining; additional epochs would likely improve performance.
- **Span-level only**: This model classifies pre-extracted spans. It does not detect span boundaries; pair it with a span detection model (e.g., [`HavelockAI/bert-token-classifier`](https://huggingface.co/HavelockAI/bert-token-classifier)) for end-to-end use.
- **128-token context window**: Longer spans are truncated.
- **Domain**: Trained on historical/literary and web text; performance on other domains is untested.

## Theoretical Background

The oral–literate distinction follows Ong's framework. Oral markers include features such as direct address, formulaic phrasing, parataxis, repetition, and sound patterning. Literate markers include features such as subordination, abstraction, hedging, passive constructions, and textual apparatus (citations, cross-references). This binary classifier serves as the top level of a three-tier taxonomy: category → type → subtype.
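The feature examples above can be summarized as a small lookup structure. The grouping below simply restates the examples given in this section; it is not the model's full type/subtype inventory:

```python
# Top-level category -> example marker features named in this model card.
MARKER_EXAMPLES = {
    "oral": [
        "direct address",
        "formulaic phrasing",
        "parataxis",
        "repetition",
        "sound patterning",
    ],
    "literate": [
        "subordination",
        "abstraction",
        "hedging",
        "passive constructions",
        "textual apparatus",
    ],
}

print(sorted(MARKER_EXAMPLES))
```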

## Citation

```bibtex
@misc{havelock2026category,
  title={Havelock Marker Category Classifier},
  author={Havelock AI},
  year={2026},
  url={https://huggingface.co/HavelockAI/bert-marker-category}
}
```

## References

- Ong, Walter J. *Orality and Literacy: The Technologizing of the Word*. Routledge, 1982.

---

*Model version: da931b4a · Trained: February 2026*
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:94b5513f20b2547b72739c977cb4cade6e81c234f8b1f93470b17483784ee99f
 size 437958624