permutans committed on
Commit 76a7e72 · verified · 1 Parent(s): 55b4b72

Upload folder using huggingface_hub

Files changed (2):
  1. README.md +145 -0
  2. model.safetensors +1 -1
README.md ADDED
---
license: mit
tags:
- text-classification
- bert
- orality
- linguistics
- rhetorical-analysis
language:
- en
metrics:
- f1
- accuracy
base_model:
- google-bert/bert-base-uncased
pipeline_tag: text-classification
library_name: transformers
datasets:
- custom
model-index:
- name: bert-marker-category
  results:
  - task:
      type: text-classification
      name: Oral/Literate Span Classification
    metrics:
    - type: f1
      value: 0.8748
      name: F1 (macro)
    - type: accuracy
      value: 0.875
      name: Accuracy
---

# Havelock Marker Category Classifier

A BERT-based binary classifier that determines whether a rhetorical span is **oral** or **literate**, grounded in Walter Ong's *Orality and Literacy* (1982).

This is the coarsest level of the Havelock span classification hierarchy. Given a text span that has already been identified as a rhetorical marker, the model assigns it to one of two categories: oral (characteristic of spoken, performative discourse) or literate (characteristic of written, analytic discourse).

## Model Details

| Property | Value |
|----------|-------|
| Base model | `bert-base-uncased` |
| Architecture | `BertForSequenceClassification` |
| Task | Binary classification |
| Labels | 2 (`oral`, `literate`) |
| Max sequence length | 128 tokens |
| Best F1 (macro) | **0.8748** |
| Best Accuracy | **0.875** |
| Parameters | ~109M |

## Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "HavelockAI/bert-marker-category"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

span = "Tell me, O Muse, of that ingenious hero"
inputs = tokenizer(span, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    logits = model(**inputs).logits
    pred = torch.argmax(logits, dim=1).item()

label_map = {0: "oral", 1: "literate"}
print(f"Category: {label_map[pred]}")
```
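To report a confidence score alongside the predicted label, the logits can be passed through a softmax. A minimal sketch on a dummy logits tensor (standing in for `model(**inputs).logits`, so it runs without downloading the model):

```python
import torch
import torch.nn.functional as F

label_map = {0: "oral", 1: "literate"}

# Dummy logits standing in for model(**inputs).logits (shape: batch x 2)
logits = torch.tensor([[2.0, -1.0]])

probs = F.softmax(logits, dim=1)      # normalize logits to probabilities
conf, pred = torch.max(probs, dim=1)  # highest-probability class and its score
print(f"Category: {label_map[pred.item()]} (confidence {conf.item():.3f})")
# → Category: oral (confidence 0.953)
```

Note that these softmax scores are raw model probabilities, not calibrated confidences.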

## Training

### Data

The model was trained on span-level annotations exported as JSONL, where each span is a contiguous text region identified as a rhetorical marker. Spans are drawn from documents sourced from Project Gutenberg, textfiles.com, Reddit, and Wikipedia talk pages.

A stratified 80/20 train/test split was used (random seed 42). The test set contains 4,608 spans (2,281 oral, 2,327 literate), a near-perfect class balance.
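A stratified split of this kind can be sketched with the standard library alone: shuffle within each class with a fixed seed, then slice 80/20. The JSONL field names (`text`, `label`) below are assumptions for illustration, not taken from the actual export:

```python
import json
import random
from collections import defaultdict

# Toy JSONL records; the real field names ("text", "label") are assumptions.
jsonl_lines = [
    json.dumps({"text": f"span {i}", "label": "oral" if i % 2 else "literate"})
    for i in range(20)
]
spans = [json.loads(line) for line in jsonl_lines]

# Stratified 80/20 split: shuffle within each class, then slice.
by_label = defaultdict(list)
for span in spans:
    by_label[span["label"]].append(span)

rng = random.Random(42)  # fixed seed, as in the card
train, test = [], []
for group in by_label.values():
    rng.shuffle(group)
    cut = int(0.8 * len(group))
    train.extend(group[:cut])
    test.extend(group[cut:])

print(len(train), len(test))  # 16 4
```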

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Epochs | 3 |
| Batch size | 8 |
| Learning rate | 2e-5 |
| Optimizer | AdamW |
| LR schedule | Linear warmup (10% of total steps) |
| Gradient clipping | 1.0 |
| Loss | Cross-entropy |
| Min examples per class | 15 |
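The LR schedule in the table (linear warmup over 10% of steps, then linear decay) corresponds to what `transformers.get_linear_schedule_with_warmup` computes; a self-contained sketch of the multiplier, assuming decay to zero at the final step:

```python
def linear_warmup_lambda(step, total_steps, warmup_frac=0.1):
    """LR multiplier: ramp linearly to 1.0 over the first warmup_frac of
    training, then decay linearly to 0.0 (matching the table above)."""
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

base_lr = 2e-5
total = 1000
print(base_lr * linear_warmup_lambda(50, total))    # mid-warmup: 1e-05
print(base_lr * linear_warmup_lambda(100, total))   # warmup end, peak LR: 2e-05
print(base_lr * linear_warmup_lambda(1000, total))  # final step: 0.0
```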

### Training Metrics

| Epoch | Loss | Accuracy | F1 (macro) |
|-------|------|----------|------------|
| 1 | 0.4095 | 0.8730 | 0.8730 |
| 2 | 0.2967 | 0.8748 | 0.8748 |
| 3 | 0.2126 | 0.8694 | 0.8693 |

The best checkpoint was selected by macro F1, at epoch 2.
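The selection rule amounts to an argmax over the per-epoch F1 values in the table:

```python
# Per-epoch macro F1, copied from the table above.
epoch_f1 = {1: 0.8730, 2: 0.8748, 3: 0.8693}

# Keep the checkpoint with the highest macro F1 (earliest epoch wins ties).
best_epoch = max(sorted(epoch_f1), key=lambda e: epoch_f1[e])
print(best_epoch)  # 2
```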

### Test Set Classification Report
```
              precision    recall  f1-score   support

        oral      0.868     0.868     0.868      2281
    literate      0.871     0.871     0.871      2327

    accuracy                          0.869      4608
   macro avg      0.869     0.869     0.869      4608
weighted avg      0.869     0.869     0.869      4608
```

## Limitations

- **Short training**: Only 3 epochs, with training loss still declining. Further training might improve performance, though eval F1 dipped at epoch 3.
- **Span-level only**: This model classifies pre-extracted spans. It does not detect span boundaries; pair it with a span detection model (e.g., [`HavelockAI/bert-token-classifier`](https://huggingface.co/HavelockAI/bert-token-classifier)) for end-to-end use.
- **128-token context window**: Longer spans are truncated.
- **Domain**: Trained on historical/literary and web text; performance on other domains is untested.

## Theoretical Background

The oral–literate distinction follows Ong's framework. Oral markers include features such as direct address, formulaic phrasing, parataxis, repetition, and sound patterning; literate markers include subordination, abstraction, hedging, passive constructions, and textual apparatus (citations, cross-references). This binary classifier serves as the top level of a three-tier taxonomy: category → type → subtype.
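The top level of that taxonomy can be pictured as a mapping from category to the marker features named above; the structure below is a hypothetical sketch for illustration, since the full type/subtype inventory is not published in this card:

```python
# Hypothetical sketch of the taxonomy's top level, using only the marker
# features named in this card; actual type/subtype names are not published.
taxonomy = {
    "oral": ["direct address", "formulaic phrasing", "parataxis",
             "repetition", "sound patterning"],
    "literate": ["subordination", "abstraction", "hedging",
                 "passive constructions", "textual apparatus"],
}

# This model predicts the top-level key; downstream models refine the
# prediction into type and subtype.
print(sorted(taxonomy))  # ['literate', 'oral']
```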

## Citation
```bibtex
@misc{havelock2026category,
  title={Havelock Marker Category Classifier},
  author={Havelock AI},
  year={2026},
  url={https://huggingface.co/HavelockAI/bert-marker-category}
}
```

## References

- Ong, Walter J. *Orality and Literacy: The Technologizing of the Word*. Routledge, 1982.

---

*Model version: da931b4a · Trained: February 2026*
model.safetensors CHANGED

version https://git-lfs.github.com/spec/v1
- oid sha256:ef5cbe44a07bc9ac8660f71a6457a14bfd52313837ef3095d5f2a1fcaab628a5
+ oid sha256:94b5513f20b2547b72739c977cb4cade6e81c234f8b1f93470b17483784ee99f
size 437958624