---
license: mit
tags:
- text-classification
- regression
- modernbert
- orality
- linguistics
- rhetorical-analysis
language:
- en
metrics:
- mae
- r2
base_model:
- answerdotai/ModernBERT-base
pipeline_tag: text-classification
library_name: transformers
datasets:
- custom
model-index:
- name: bert-orality-regressor
  results:
  - task:
      type: text-classification
      name: Orality Regression
    metrics:
    - type: mae
      value: 0.0791
      name: Mean Absolute Error
    - type: r2
      value: 0.748
      name: R² Score
---

# Havelock Orality Regressor

ModernBERT-based regression model that scores text on the **oral–literate spectrum** (0–1), grounded in Walter Ong's *Orality and Literacy* (1982).

Given a passage of text, the model outputs a continuous score where higher values indicate greater orality (spoken, performative, additive discourse) and lower values indicate greater literacy (analytic, subordinative, abstract discourse).

## Model Details

| Property | Value |
|----------|-------|
| Base model | `answerdotai/ModernBERT-base` |
| Architecture | `HavelockOralityRegressor` (custom, mean pooling → linear) |
| Task | Single-value regression (MSE loss) |
| Output range | Continuous (not clamped) |
| Max sequence length | 512 tokens |
| Best MAE | **0.0791** |
| R² (at best MAE) | **0.748** |
| Parameters | ~149M |

## Usage

```python
import os
# Disable torch.compile before torch is imported.
os.environ["TORCH_COMPILE_DISABLE"] = "1"

import warnings
warnings.filterwarnings("ignore", message="Flash Attention 2 only supports")

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "HavelockAI/bert-orality-regressor"
# trust_remote_code is required because the regressor is a custom architecture.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
model.eval()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

text = "Tell me, O Muse, of that ingenious hero who travelled far and wide"
# Inputs longer than 512 tokens are truncated.
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
inputs = {k: v.to(device) for k, v in inputs.items()}

# Autocast is only enabled on CUDA; on CPU the model runs in full precision.
with torch.no_grad(), torch.autocast(device_type=device.type, enabled=device.type == "cuda"):
    score = model(**inputs).logits.squeeze().item()

# The raw output is unbounded, so clamp it to [0, 1] for display.
print(f"Orality score: {max(0.0, min(1.0, score)):.3f}")
```

### Score Interpretation

| Score | Register |
|-------|----------|
| 0.8–1.0 | Highly oral: epic poetry, sermons, rap, oral storytelling |
| 0.6–0.8 | Oral-dominant: speeches, podcasts, conversational prose |
| 0.4–0.6 | Mixed: journalism, blog posts, dialogue-heavy fiction |
| 0.2–0.4 | Literate-dominant: essays, expository prose |
| 0.0–0.2 | Highly literate: academic papers, legal texts, philosophy |

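The bands above can be turned into a small lookup. The following is a minimal sketch, not part of the released model code: `clamp_score` and `register_label` are hypothetical helper names, and because adjacent bands share an endpoint in the table, the sketch assumes a boundary value such as 0.6 belongs to the higher band.

```python
def clamp_score(score: float) -> float:
    """Clamp a raw model output into the documented [0, 1] range."""
    return max(0.0, min(1.0, score))


# Lower bound of each band from the table above, highest band first.
# Assumption: a score exactly on a boundary is assigned to the higher band.
REGISTER_BANDS = [
    (0.8, "Highly oral"),
    (0.6, "Oral-dominant"),
    (0.4, "Mixed"),
    (0.2, "Literate-dominant"),
    (0.0, "Highly literate"),
]


def register_label(score: float) -> str:
    """Return the register band for a (possibly out-of-range) score."""
    clamped = clamp_score(score)
    for lower_bound, label in REGISTER_BANDS:
        if clamped >= lower_bound:
            return label
    return REGISTER_BANDS[-1][1]


print(register_label(0.93))   # falls in the 0.8–1.0 band
print(register_label(-0.05))  # clamped to 0.0 before lookup
```

The labels are coarse reading aids; the continuous score carries more information than the band it falls in.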

## Training

### Data

The model was trained on a curated corpus of documents annotated with orality scores using a multi-pass scoring system. Scores were originally on a 0–100 scale and normalized to 0–1 for training. The corpus draws from Project Gutenberg, textfiles.com, Reddit, and Wikipedia talk pages, representing a range of registers from highly oral to highly literate.

An 80/20 train/test split was used (random seed 42).
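
As a rough illustration of the two preprocessing steps just described, the sketch below rescales 0–100 annotations to 0–1 and performs a seeded 80/20 shuffle split. It uses only the standard library; the actual splitting utility, the helper names, and the `docs` sample are assumptions made for the example, so this split will not reproduce the one used in training.

```python
import random


def normalize_score(raw: float) -> float:
    """Map a 0–100 annotation onto the 0–1 training scale."""
    return raw / 100.0


def train_test_split(items, test_fraction=0.2, seed=42):
    """Shuffle with a fixed seed and hold out `test_fraction` of the items."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical (document id, raw score) pairs standing in for the corpus.
docs = [(f"doc-{i}", float(i)) for i in range(0, 100, 10)]
labelled = [(doc_id, normalize_score(raw)) for doc_id, raw in docs]

train, test = train_test_split(labelled)
print(len(train), len(test))
```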

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Epochs | 20 |
| Learning rate | 2e-5 |
| Optimizer | AdamW (weight decay 0.01) |
| LR schedule | Cosine with warmup (10% of total steps) |
| Gradient clipping | 1.0 |
| Loss | MSE |
| Mixed precision | FP16 |
| Regularization | Mixout (p=0.1) |

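To make the schedule row concrete, here is a minimal sketch of a cosine schedule with warmup using the peak learning rate and warmup fraction from the table. The table does not say how the warmup ramps or where the decay ends, so linear warmup from zero and decay to zero are assumptions, and `lr_at_step` is a hypothetical helper rather than the training script's own function.

```python
import math


def lr_at_step(step, total_steps, base_lr=2e-5, warmup_fraction=0.1):
    """Learning rate at `step` for linear warmup followed by cosine decay."""
    warmup_steps = int(total_steps * warmup_fraction)
    if step < warmup_steps:
        # Assumed linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total = 1000  # hypothetical number of optimizer steps
for step in (0, 50, 100, 550, 1000):
    print(step, f"{lr_at_step(step, total):.2e}")
```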

### Training Metrics

<details><summary>Click to show per-epoch metrics</summary>

| Epoch | Loss | MAE | R² |
|-------|------|-----|-----|
| 1 | 0.3496 | 0.1173 | 0.476 |
| 2 | 0.0286 | 0.0992 | 0.593 |
| 3 | 0.0215 | 0.0872 | 0.704 |
| 4 | 0.0144 | 0.0879 | 0.714 |
| 5 | 0.0169 | 0.0865 | 0.712 |
| 6 | 0.0117 | 0.0853 | 0.700 |
| 7 | 0.0096 | 0.0922 | 0.691 |
| 8 | 0.0094 | 0.0850 | 0.722 |
| 9 | 0.0086 | 0.0822 | 0.745 |
| 10 | 0.0064 | 0.0841 | 0.723 |
| 11 | 0.0054 | 0.0921 | 0.682 |
| 12 | 0.0050 | 0.0840 | 0.720 |
| 13 | 0.0044 | 0.0806 | 0.744 |
| 14 | 0.0037 | 0.0805 | 0.740 |
| **15** | **0.0034** | **0.0791** | **0.748** |
| 16 | 0.0033 | 0.0807 | 0.738 |
| 17 | 0.0031 | 0.0803 | 0.742 |
| 18 | 0.0026 | 0.0797 | 0.745 |
| 19 | 0.0027 | 0.0803 | 0.742 |
| 20 | 0.0029 | 0.0805 | 0.741 |

</details>

Best checkpoint selected at epoch 15 by lowest MAE.

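For readers who want to reproduce the two reported metrics on their own predictions, the sketch below gives the standard definitions of MAE and R² in plain Python and picks the best epoch by lowest MAE. The function names are illustrative, and the `epoch_mae` list copies only a few rows from the table above.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# (epoch, MAE) pairs copied from a subset of the per-epoch table.
epoch_mae = [(13, 0.0806), (14, 0.0805), (15, 0.0791), (16, 0.0807), (18, 0.0797)]
best_epoch, best_mae = min(epoch_mae, key=lambda row: row[1])
print(best_epoch, best_mae)
```

An R² of 0 corresponds to always predicting the mean target, which is why it is a useful companion to MAE when the score distribution is narrow.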

## Architecture

Custom `HavelockOralityRegressor` with mean pooling (ModernBERT has no pooler output):

```
ModernBERT (answerdotai/ModernBERT-base)
    └── Mean pooling over non-padded tokens
        └── Dropout (p=0.1)
            └── Linear (hidden_size → 1)
```

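The pooling and head steps in the diagram can be illustrated with plain Python lists. This is a dependency-free sketch of the arithmetic only, under the assumption that padding is marked by zeros in the attention mask; the real model operates on batched tensors, and dropout is omitted because it is inactive at inference. `masked_mean_pool` and `linear_head` are hypothetical names.

```python
def masked_mean_pool(hidden_states, attention_mask):
    """Average the token vectors whose attention-mask entry is 1.

    hidden_states: list of token vectors (seq_len x hidden_size)
    attention_mask: list of 0/1 flags, 0 marking padding
    """
    kept = [vec for vec, flag in zip(hidden_states, attention_mask) if flag == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    hidden_size = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(hidden_size)]


def linear_head(pooled, weights, bias):
    """Project the pooled vector to a single regression value."""
    return sum(x * w for x, w in zip(pooled, weights)) + bias


# Toy example: two real tokens and one padding token, hidden_size = 2.
hidden = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(hidden, mask)
score = linear_head(pooled, weights=[0.1, 0.2], bias=0.05)
print(pooled, round(score, 3))
```

Excluding padded positions matters because a plain mean over the full sequence would pull short inputs toward whatever the padding vectors happen to be.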

### Regularization

- **Mixout** (p=0.1): During training, each backbone weight element has a 10% chance of being replaced by its pretrained value on each forward pass. This acts as a stochastic L2 anchor that discourages the representation from drifting away from the pretrained weights (Lee et al., 2020)
- **Weight decay** (0.01) via AdamW
- **Gradient clipping** (max norm 1.0)

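The Mixout step can be sketched on a flat list of weights. This follows the formulation in Lee et al., where kept elements are rescaled so that the expected mixed weight equals the current weight; whether this model's training code uses exactly that rescaling is an assumption, and `mixout` here is an illustrative function, not the project's implementation.

```python
import random


def mixout(current, pretrained, p=0.1, rng=random):
    """Return a mixed copy of `current` for one forward pass.

    With probability p an element is swapped for its pretrained value;
    otherwise it is rescaled so the expectation stays equal to `current`.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    mixed = []
    for w, u in zip(current, pretrained):
        if rng.random() < p:
            mixed.append(u)  # replaced by the pretrained value
        else:
            mixed.append((w - p * u) / (1.0 - p))  # rescaled current value
    return mixed


rng = random.Random(0)
current = [1.0] * 1000      # hypothetical fine-tuned weights
pretrained = [0.0] * 1000   # hypothetical pretrained weights
mixed = mixout(current, pretrained, p=0.1, rng=rng)

replaced = sum(1 for m in mixed if m == 0.0)
print(replaced, sum(mixed) / len(mixed))
```

With p=0 the function returns the current weights unchanged, which matches how such regularizers are typically disabled at evaluation time.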

## Limitations

- **Unbounded output**: The regression head has no sigmoid or clamp, so the model can output values outside [0, 1]. Consumers should clamp if needed.
- **Domain coverage**: Training corpus skews historical/literary. Performance on modern social media, code-switched text, or non-English text is untested.
- **Document length**: Texts longer than 512 tokens are truncated. The model sees only the first ~400 words, which may not be representative of longer documents.
- **Regression target subjectivity**: Orality scores involve human judgment; inter-annotator agreement bounds the ceiling for model performance.

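One possible workaround for the length limit is to score overlapping windows of a long document and average the results. The sketch below shows only the windowing and aggregation logic on token ids. It is an untested heuristic, not something the model was evaluated with: the window of 510 assumes the tokenizer adds two special tokens per input, and `token_windows` and `aggregate_scores` are hypothetical helpers.

```python
def token_windows(token_ids, window=510, stride=255):
    """Split token ids into overlapping windows that cover the whole input."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(token_ids) <= window:
        return [list(token_ids)]
    windows = []
    start = 0
    while True:
        windows.append(list(token_ids[start:start + window]))
        if start + window >= len(token_ids):
            break  # the last window reached the end of the document
        start += stride
    return windows


def aggregate_scores(window_scores):
    """Average per-window scores, then clamp the result to [0, 1]."""
    mean = sum(window_scores) / len(window_scores)
    return max(0.0, min(1.0, mean))


windows = token_windows(list(range(1000)))  # stand-in for real token ids
print([len(w) for w in windows])
print(aggregate_scores([0.9, 1.3, 1.1]))    # hypothetical raw scores
```

Each window would be scored with the model as in the usage example; averaging treats all parts of the document as equally informative, which may not hold for texts that shift register.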

## Theoretical Background

The oral–literate spectrum follows Ong's framework, which characterizes oral discourse as additive, aggregative, redundant, agonistic, empathetic, and situational, while literate discourse is subordinative, analytic, abstract, distanced, and context-free. The model learns to place text along this continuum from document-level annotations informed by 72 specific rhetorical markers (36 oral, 36 literate).

## Citation

```bibtex
@misc{havelock2026regressor,
  title={Havelock Orality Regressor},
  author={Havelock AI},
  year={2026},
  url={https://huggingface.co/HavelockAI/bert-orality-regressor}
}
```

## References

- Ong, Walter J. *Orality and Literacy: The Technologizing of the Word*. Routledge, 1982.
- Lee, C. et al. "Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models." ICLR 2020.
- Warner, B. et al. "Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference." 2024.

---

*Trained: February 2026*