permutans committed on
Commit 76a7e72 · verified · 1 Parent(s): 55b4b72

Upload folder using huggingface_hub

Files changed (2):
  1. README.md +145 -0
  2. model.safetensors +1 -1
README.md ADDED
---
license: mit
tags:
- text-classification
- bert
- orality
- linguistics
- rhetorical-analysis
language:
- en
metrics:
- f1
- accuracy
base_model:
- google-bert/bert-base-uncased
pipeline_tag: text-classification
library_name: transformers
datasets:
- custom
model-index:
- name: bert-marker-category
  results:
  - task:
      type: text-classification
      name: Oral/Literate Span Classification
    metrics:
    - type: f1
      value: 0.8748
      name: F1 (macro)
    - type: accuracy
      value: 0.875
      name: Accuracy
---

# Havelock Marker Category Classifier

A BERT-based binary classifier that determines whether a rhetorical span is **oral** or **literate**, grounded in Walter Ong's *Orality and Literacy* (1982).

This is the coarsest level of the Havelock span classification hierarchy. Given a text span that has already been identified as a rhetorical marker, the model assigns it to one of two categories: oral (characteristic of spoken, performative discourse) or literate (characteristic of written, analytic discourse).

## Model Details

| Property | Value |
|----------|-------|
| Base model | `bert-base-uncased` |
| Architecture | `BertForSequenceClassification` |
| Task | Binary classification |
| Labels | 2 (`oral`, `literate`) |
| Max sequence length | 128 tokens |
| Best F1 (macro) | **0.8748** |
| Best Accuracy | **0.875** |
| Parameters | ~109M |

## Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "HavelockAI/bert-marker-category"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

span = "Tell me, O Muse, of that ingenious hero"
inputs = tokenizer(span, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    logits = model(**inputs).logits
    pred = torch.argmax(logits, dim=1).item()

label_map = {0: "oral", 1: "literate"}
print(f"Category: {label_map[pred]}")
```
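To report a confidence score alongside the predicted label, the logits can be passed through a softmax. A minimal sketch on a dummy logits tensor (standing in for `model(**inputs).logits`, so it runs without downloading the model):

```python
import torch
import torch.nn.functional as F

label_map = {0: "oral", 1: "literate"}

# Dummy logits standing in for model(**inputs).logits (shape: batch x 2)
logits = torch.tensor([[2.0, -1.0]])

probs = F.softmax(logits, dim=1)      # normalize logits to probabilities
conf, pred = torch.max(probs, dim=1)  # highest-probability class and its score
print(f"Category: {label_map[pred.item()]} (confidence {conf.item():.3f})")
# → Category: oral (confidence 0.953)
```

Note that these softmax scores are raw model probabilities, not calibrated confidences.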

## Training

### Data

The model was trained on span-level annotations exported as JSONL, where each span is a contiguous text region identified as a rhetorical marker. Spans are drawn from documents sourced from Project Gutenberg, textfiles.com, Reddit, and Wikipedia talk pages.

A stratified 80/20 train/test split was used (random seed 42). The test set contains 4,608 spans (2,281 oral, 2,327 literate), a near-perfect class balance.
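A stratified split of this kind can be sketched with the standard library alone: shuffle within each class with a fixed seed, then slice 80/20. The JSONL field names (`text`, `label`) below are assumptions for illustration, not taken from the actual export:

```python
import json
import random
from collections import defaultdict

# Toy JSONL records; the real field names ("text", "label") are assumptions.
jsonl_lines = [
    json.dumps({"text": f"span {i}", "label": "oral" if i % 2 else "literate"})
    for i in range(20)
]
spans = [json.loads(line) for line in jsonl_lines]

# Stratified 80/20 split: shuffle within each class, then slice.
by_label = defaultdict(list)
for span in spans:
    by_label[span["label"]].append(span)

rng = random.Random(42)  # fixed seed, as in the card
train, test = [], []
for group in by_label.values():
    rng.shuffle(group)
    cut = int(0.8 * len(group))
    train.extend(group[:cut])
    test.extend(group[cut:])

print(len(train), len(test))  # 16 4
```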

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Epochs | 3 |
| Batch size | 8 |
| Learning rate | 2e-5 |
| Optimizer | AdamW |
| LR schedule | Linear warmup (10% of total steps) |
| Gradient clipping | 1.0 |
| Loss | Cross-entropy |
| Min examples per class | 15 |
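The LR schedule in the table (linear warmup over 10% of steps, then linear decay) corresponds to what `transformers.get_linear_schedule_with_warmup` computes; a self-contained sketch of the multiplier, assuming decay to zero at the final step:

```python
def linear_warmup_lambda(step, total_steps, warmup_frac=0.1):
    """LR multiplier: ramp linearly to 1.0 over the first warmup_frac of
    training, then decay linearly to 0.0 (matching the table above)."""
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

base_lr = 2e-5
total = 1000
print(base_lr * linear_warmup_lambda(50, total))    # mid-warmup: 1e-05
print(base_lr * linear_warmup_lambda(100, total))   # warmup end, peak LR: 2e-05
print(base_lr * linear_warmup_lambda(1000, total))  # final step: 0.0
```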

### Training Metrics

| Epoch | Loss | Accuracy | F1 (macro) |
|-------|------|----------|------------|
| 1 | 0.4095 | 0.8730 | 0.8730 |
| 2 | 0.2967 | 0.8748 | 0.8748 |
| 3 | 0.2126 | 0.8694 | 0.8693 |

The best checkpoint was selected by macro F1, at epoch 2.
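The selection rule amounts to an argmax over the per-epoch F1 values in the table:

```python
# Per-epoch macro F1, copied from the table above.
epoch_f1 = {1: 0.8730, 2: 0.8748, 3: 0.8693}

# Keep the checkpoint with the highest macro F1 (earliest epoch wins ties).
best_epoch = max(sorted(epoch_f1), key=lambda e: epoch_f1[e])
print(best_epoch)  # 2
```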

### Test Set Classification Report
```
              precision    recall  f1-score   support

        oral      0.868     0.868     0.868      2281
    literate      0.871     0.871     0.871      2327

    accuracy                          0.869      4608
   macro avg      0.869     0.869     0.869      4608
weighted avg      0.869     0.869     0.869      4608
```

## Limitations

- **Short training**: Only 3 epochs, with training loss still declining. Further training might improve performance, though eval F1 dipped at epoch 3.
- **Span-level only**: This model classifies pre-extracted spans. It does not detect span boundaries; pair it with a span detection model (e.g., [`HavelockAI/bert-token-classifier`](https://huggingface.co/HavelockAI/bert-token-classifier)) for end-to-end use.
- **128-token context window**: Longer spans are truncated.
- **Domain**: Trained on historical/literary and web text; performance on other domains is untested.

## Theoretical Background

The oral–literate distinction follows Ong's framework. Oral markers include features such as direct address, formulaic phrasing, parataxis, repetition, and sound patterning; literate markers include subordination, abstraction, hedging, passive constructions, and textual apparatus (citations, cross-references). This binary classifier serves as the top level of a three-tier taxonomy: category → type → subtype.
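The top level of that taxonomy can be pictured as a mapping from category to the marker features named above; the structure below is a hypothetical sketch for illustration, since the full type/subtype inventory is not published in this card:

```python
# Hypothetical sketch of the taxonomy's top level, using only the marker
# features named in this card; actual type/subtype names are not published.
taxonomy = {
    "oral": ["direct address", "formulaic phrasing", "parataxis",
             "repetition", "sound patterning"],
    "literate": ["subordination", "abstraction", "hedging",
                 "passive constructions", "textual apparatus"],
}

# This model predicts the top-level key; downstream models refine the
# prediction into type and subtype.
print(sorted(taxonomy))  # ['literate', 'oral']
```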

## Citation
```bibtex
@misc{havelock2026category,
  title={Havelock Marker Category Classifier},
  author={Havelock AI},
  year={2026},
  url={https://huggingface.co/HavelockAI/bert-marker-category}
}
```

## References

- Ong, Walter J. *Orality and Literacy: The Technologizing of the Word*. Routledge, 1982.

---

*Model version: da931b4a · Trained: February 2026*
model.safetensors CHANGED

version https://git-lfs.github.com/spec/v1
- oid sha256:ef5cbe44a07bc9ac8660f71a6457a14bfd52313837ef3095d5f2a1fcaab628a5
+ oid sha256:94b5513f20b2547b72739c977cb4cade6e81c234f8b1f93470b17483784ee99f
size 437958624