permutans committed · c8eb1a7 · verified · 1 parent: 880f21a

Upload folder using huggingface_hub

Files changed (1): README.md (+61, -20)
README.md CHANGED
@@ -3,7 +3,7 @@ license: mit
  tags:
  - text-classification
  - regression
- - bert
  - orality
  - linguistics
  - rhetorical-analysis
@@ -13,7 +13,7 @@ metrics:
  - mae
  - r2
  base_model:
- - google-bert/bert-base-uncased
  pipeline_tag: text-classification
  library_name: transformers
  datasets:
@@ -26,16 +26,16 @@ model-index:
  name: Orality Regression
  metrics:
  - type: mae
- value: 0.0786
  name: Mean Absolute Error
  - type: r2
- value: 0.756
  name: R² Score
  ---

  # Havelock Orality Regressor

- BERT-based regression model that scores text on the **oral–literate spectrum** (0–1), grounded in Walter Ong's *Orality and Literacy* (1982).

  Given a passage of text, the model outputs a continuous score where higher values indicate greater orality (spoken, performative, additive discourse) and lower values indicate greater literacy (analytic, subordinative, abstract discourse).
@@ -43,14 +43,14 @@ Given a passage of text, the model outputs a continuous score where higher value

  | Property | Value |
  |----------|-------|
- | Base model | `bert-base-uncased` |
- | Architecture | `BertForSequenceClassification` (num_labels=1) |
  | Task | Single-value regression (MSE loss) |
  | Output range | Continuous (not clamped) |
  | Max sequence length | 512 tokens |
- | Best MAE | **0.0786** |
- | R² | **0.756** |
- | Parameters | ~109M |

  ## Usage
  ```python
@@ -92,25 +92,64 @@ An 80/20 train/test split was used (random seed 42).

  | Parameter | Value |
  |-----------|-------|
- | Epochs | 3 |
- | Batch size | 8 |
  | Learning rate | 2e-5 |
- | Optimizer | AdamW |
- | LR schedule | Linear warmup (10% of total steps) |
  | Gradient clipping | 1.0 |
- | Loss | MSE (via HF `num_labels=1`) |

  ### Training Metrics

  | Epoch | Loss | MAE | R² |
  |-------|------|-----|-----|
- | 1 | 0.0382 | 0.1443 | 0.317 |
- | 2 | 0.0187 | 0.0852 | 0.722 |
- | 3 | 0.0128 | 0.0786 | 0.756 |

  ## Limitations

- - **Short training**: Only 3 epochs — likely undertrained. Further epochs or hyperparameter search would probably improve R².
  - **No sigmoid clamping**: The model can output values outside [0, 1]. Consumers should clamp if needed.
  - **Domain coverage**: Training corpus skews historical/literary. Performance on modern social media, code-switched text, or non-English text is untested.
  - **Document length**: Texts longer than 512 tokens are truncated. The model sees only the first ~400 words, which may not be representative of longer documents.
@@ -133,7 +172,9 @@ The oral–literate spectrum follows Ong's framework, which characterizes oral d
  ## References

  - Ong, Walter J. *Orality and Literacy: The Technologizing of the Word*. Routledge, 1982.

  ---

- *Model version: 33b6eccc · Trained: February 2026*
 
  tags:
  - text-classification
  - regression
+ - modernbert
  - orality
  - linguistics
  - rhetorical-analysis
 
  - mae
  - r2
  base_model:
+ - answerdotai/ModernBERT-base
  pipeline_tag: text-classification
  library_name: transformers
  datasets:
 
  name: Orality Regression
  metrics:
  - type: mae
+ value: 0.0819
  name: Mean Absolute Error
  - type: r2
+ value: 0.734
  name: R² Score
  ---

  # Havelock Orality Regressor

+ A ModernBERT-based regression model that scores text on the **oral–literate spectrum** (0–1), grounded in Walter Ong's *Orality and Literacy* (1982).

  Given a passage of text, the model outputs a continuous score where higher values indicate greater orality (spoken, performative, additive discourse) and lower values indicate greater literacy (analytic, subordinative, abstract discourse).
 

  | Property | Value |
  |----------|-------|
+ | Base model | `answerdotai/ModernBERT-base` |
+ | Architecture | `HavelockOralityRegressor` (custom, mean pooling → linear) |
  | Task | Single-value regression (MSE loss) |
  | Output range | Continuous (not clamped) |
  | Max sequence length | 512 tokens |
+ | Best MAE | **0.0819** |
+ | R² (at best MAE) | **0.734** |
+ | Parameters | ~149M |

  ## Usage
  ```python
 

  | Parameter | Value |
  |-----------|-------|
+ | Epochs | 20 |
  | Learning rate | 2e-5 |
+ | Optimizer | AdamW (weight decay 0.01) |
+ | LR schedule | Cosine with warmup (10% of total steps) |
  | Gradient clipping | 1.0 |
+ | Loss | MSE |
+ | Mixed precision | FP16 |
+ | Regularization | Mixout (p=0.1) |
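The configuration in the table above can be sketched with standard PyTorch and `transformers` utilities. This is a hedged sketch only: the card does not include its training script, so the function and `num_steps` are illustrative stand-ins.

```python
import torch
from torch.optim import AdamW
from transformers import get_cosine_schedule_with_warmup

def build_optimizer_and_schedule(model: torch.nn.Module, num_steps: int,
                                 lr: float = 2e-5, weight_decay: float = 0.01):
    """Mirror the table: AdamW (weight decay 0.01) + cosine schedule with 10% warmup."""
    optimizer = AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    scheduler = get_cosine_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.1 * num_steps),
        num_training_steps=num_steps,
    )
    return optimizer, scheduler

# Inside the training loop, per the table:
# loss = torch.nn.functional.mse_loss(preds.squeeze(-1), targets)   # MSE loss
# torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)           # gradient clipping
```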
 
  ### Training Metrics

+ <details><summary>Click to show per-epoch metrics</summary>
+
  | Epoch | Loss | MAE | R² |
  |-------|------|-----|-----|
+ | 1 | 0.3485 | 0.1151 | 0.485 |
+ | 2 | 0.0269 | 0.1145 | 0.446 |
+ | 3 | 0.0235 | 0.0962 | 0.636 |
+ | 4 | 0.0162 | 0.0937 | 0.648 |
+ | 5 | 0.0228 | 0.1099 | 0.566 |
+ | 6 | 0.0153 | 0.0971 | 0.605 |
+ | 7 | 0.0115 | 0.0883 | 0.707 |
+ | 8 | 0.0112 | 0.0906 | 0.681 |
+ | 9 | 0.0095 | 0.0872 | 0.713 |
+ | 10 | 0.0076 | 0.0898 | 0.691 |
+ | 11 | 0.0060 | 0.0840 | 0.727 |
+ | 12 | 0.0054 | 0.0850 | 0.715 |
+ | 13 | 0.0050 | 0.0821 | 0.738 |
+ | 14 | 0.0043 | 0.0820 | 0.737 |
+ | **15** | **0.0040** | **0.0819** | **0.734** |
+ | 16 | 0.0041 | 0.0891 | 0.689 |
+ | 17 | 0.0035 | 0.0829 | 0.727 |
+ | 18 | 0.0031 | 0.0825 | 0.729 |
+ | 19 | 0.0032 | 0.0831 | 0.725 |
+ | 20 | 0.0033 | 0.0833 | 0.724 |
+
+ </details>
+
+
+ Best checkpoint selected at epoch 15 by lowest MAE.
+
+ ## Architecture
+
+ Custom `HavelockOralityRegressor` with mean pooling (ModernBERT has no pooler output):
+ ```
+ ModernBERT (answerdotai/ModernBERT-base)
+ └── Mean pooling over non-padded tokens
+ └── Dropout (p=0.1)
+ └── Linear (hidden_size → 1)
+ ```
+
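The diagram above can be read as the following PyTorch sketch. It is illustrative only: the actual `HavelockOralityRegressor` source is not included in this card, so the class and argument names here are stand-ins for the real implementation.

```python
import torch
import torch.nn as nn

class MeanPoolRegressor(nn.Module):
    """Stand-in for the custom head: backbone -> masked mean pool -> dropout -> linear."""
    def __init__(self, backbone: nn.Module, hidden_size: int, dropout: float = 0.1):
        super().__init__()
        self.backbone = backbone
        self.dropout = nn.Dropout(dropout)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # ModernBERT exposes no pooler output, so pool the last hidden state manually.
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1).to(hidden.dtype)          # (batch, seq, 1)
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        return self.head(self.dropout(pooled)).squeeze(-1)            # (batch,)
```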
+
+ ### Regularization
+
+ - **Mixout** (p=0.1): During training, each backbone weight element has a 10% chance of being replaced by its pretrained value per forward pass, acting as a stochastic L2 anchor that prevents representation drift (Lee et al., 2020)
+ - **Weight decay** (0.01) via AdamW
+ - **Gradient clipping** (max norm 1.0)
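Functionally, the Mixout step in the first bullet looks like the sketch below. The card's actual implementation is not shown (Lee et al. apply Mixout by wrapping linear layers rather than as a free function); the rescaling keeps the expected weight equal to the current weight, analogous to inverted dropout.

```python
import torch

def mixout(weight: torch.Tensor, pretrained: torch.Tensor,
           p: float = 0.1, training: bool = True) -> torch.Tensor:
    """Randomly swap weight elements back to their pretrained values (Lee et al., ICLR 2020)."""
    if not training or p == 0.0:
        return weight
    # mask == 1 -> use the pretrained value for this element (probability p)
    mask = torch.bernoulli(torch.full_like(weight, p))
    mixed = mask * pretrained + (1.0 - mask) * weight
    # Rescale so E[output] == weight
    return (mixed - p * pretrained) / (1.0 - p)
```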
 
  ## Limitations

  - **No sigmoid clamping**: The model can output values outside [0, 1]. Consumers should clamp if needed.
  - **Domain coverage**: Training corpus skews historical/literary. Performance on modern social media, code-switched text, or non-English text is untested.
  - **Document length**: Texts longer than 512 tokens are truncated. The model sees only the first ~400 words, which may not be representative of longer documents.
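Given the first bullet, consumers can clamp raw scores onto [0, 1] before use. A minimal sketch; how the raw score is obtained depends on loading code not reproduced in this card.

```python
def clamp_score(raw: float) -> float:
    """Clamp a raw regression output onto the card's 0-1 orality scale."""
    return max(0.0, min(1.0, raw))

# e.g. clamp_score(1.08) -> 1.0, clamp_score(-0.03) -> 0.0, clamp_score(0.42) -> 0.42
```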
 
  ## References

  - Ong, Walter J. *Orality and Literacy: The Technologizing of the Word*. Routledge, 1982.
+ - Lee, Cheolhyoung, et al. "Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models." ICLR 2020.
+ - Warner, Benjamin, et al. "Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference." 2024.

  ---

+ *Trained: February 2026*