	---
	license: mit
	tags:
	- text-classification
	- regression
	- modernbert
	- orality
	- linguistics
	- rhetorical-analysis
	language:
	- en
	metrics:
	- mae
	- r2
	base_model:
	- answerdotai/ModernBERT-base
	pipeline_tag: text-classification
	library_name: transformers
	datasets:
	- custom
	model-index:
	- name: bert-orality-regressor
	  results:
	  - task:
	      type: text-classification
	      name: Orality Regression
	    metrics:
	    - type: mae
	      value: 0.0791
	      name: Mean Absolute Error
	    - type: r2
	      value: 0.748
	      name: R² Score
	---

	# Havelock Orality Regressor

	ModernBERT-based regression model that scores text on the oral–literate spectrum (0–1), grounded in Walter Ong's Orality and Literacy (1982).

	Given a passage of text, the model outputs a continuous score where higher values indicate greater orality (spoken, performative, additive discourse) and lower values indicate greater literacy (analytic, subordinative, abstract discourse).

	## Model Details

	| Property | Value |
	|----------|-------|
	| Base model | `answerdotai/ModernBERT-base` |
	| Architecture | `HavelockOralityRegressor` (custom, mean pooling → linear) |
	| Task | Single-value regression (MSE loss) |
	| Output range | Continuous (not clamped) |
	| Max sequence length | 512 tokens |
	| Best MAE | 0.0791 |
	| R² (at best MAE) | 0.748 |
	| Parameters | ~149M |

	## Usage
	```python
	import os
	os.environ["TORCH_COMPILE_DISABLE"] = "1"

	import warnings
	warnings.filterwarnings("ignore", message="Flash Attention 2 only supports")

	import torch
	from transformers import AutoModel, AutoTokenizer

	model_name = "HavelockAI/bert-orality-regressor"
	tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
	model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
	model.eval()

	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
	model = model.to(device)

	text = "Tell me, O Muse, of that ingenious hero who travelled far and wide"
	inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
	inputs = {k: v.to(device) for k, v in inputs.items()}

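	# Autocast is only enabled on CUDA; on CPU the forward pass runs in full precision.
	with torch.no_grad(), torch.autocast(device_type=device.type, enabled=device.type == "cuda"):
	    score = model(**inputs).logits.squeeze().item()

	# The raw output is not bounded, so clamp it to [0, 1] before reporting.
	print(f"Orality score: {max(0.0, min(1.0, score)):.3f}")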
	```

	### Score Interpretation

	| Score | Register |
	|-------|----------|
	| 0.8–1.0 | Highly oral: epic poetry, sermons, rap, oral storytelling |
	| 0.6–0.8 | Oral-dominant: speeches, podcasts, conversational prose |
	| 0.4–0.6 | Mixed: journalism, blog posts, dialogue-heavy fiction |
	| 0.2–0.4 | Literate-dominant: essays, expository prose |
	| 0.0–0.2 | Highly literate: academic papers, legal texts, philosophy |
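
The bands above can be applied programmatically. The following is a minimal sketch, not part of the released model: the function name `orality_register` is hypothetical, and assigning a score that sits exactly on a boundary (for example 0.6) to the higher band is an assumption, since the table does not say which side a boundary belongs to.

```python
def orality_register(score: float) -> str:
    """Map a raw regressor output to a register band from the table above."""
    # The model output is not clamped, so bound it to [0, 1] first.
    clamped = max(0.0, min(1.0, score))
    # (lower bound, label) pairs, checked from the highest band down.
    # Assumption: a score exactly on a boundary falls in the higher band.
    bands = [
        (0.8, "Highly oral"),
        (0.6, "Oral-dominant"),
        (0.4, "Mixed"),
        (0.2, "Literate-dominant"),
        (0.0, "Highly literate"),
    ]
    for lower, label in bands:
        if clamped >= lower:
            return label
    return "Highly literate"  # fallback; the 0.0 band already matches every clamped value
```

For example, `orality_register(1.7)` is clamped to 1.0 and lands in the "Highly oral" band.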

	## Training

	### Data

	The model was trained on a curated corpus of documents annotated with orality scores using a multi-pass scoring system. Scores were originally on a 0–100 scale and normalized to 0–1 for training. The corpus draws from Project Gutenberg, textfiles.com, Reddit, and Wikipedia talk pages, representing a range of registers from highly oral to highly literate.

	An 80/20 train/test split was used (random seed 42).
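
The score normalization and the seeded split described above might look roughly like the sketch below. The card does not say which library produced the split or whether it was stratified, so the helper names and the plain shuffle are assumptions for illustration only.

```python
import random


def normalize_score(raw: float) -> float:
    """Rescale an annotation from the 0-100 scale to the 0-1 training scale."""
    return raw / 100.0


def train_test_split_indices(n_docs: int, test_fraction: float = 0.2, seed: int = 42):
    """Return (train_indices, test_indices) for a seeded random split.

    Assumption: a simple unstratified shuffle; the original pipeline may differ.
    """
    indices = list(range(n_docs))
    # A local RNG keeps the split reproducible without touching global state.
    random.Random(seed).shuffle(indices)
    n_test = round(n_docs * test_fraction)
    return indices[n_test:], indices[:n_test]
```

Reusing the same seed returns the same partition, which is what makes the reported metrics comparable across runs.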

	### Hyperparameters

	| Parameter | Value |
	|-----------|-------|
	| Epochs | 20 |
	| Learning rate | 2e-5 |
	| Optimizer | AdamW (weight decay 0.01) |
	| LR schedule | Cosine with warmup (10% of total steps) |
	| Gradient clipping | 1.0 |
	| Loss | MSE |
	| Mixed precision | FP16 |
	| Regularization | Mixout (p=0.1) |
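
To make the learning-rate schedule concrete, here is a framework-free sketch of a cosine schedule with 10% warmup and a peak of 2e-5. Linear warmup from zero and decay to zero follow the common convention (the shape used by `get_cosine_schedule_with_warmup` in `transformers`); whether training used exactly this variant is an assumption, and `cosine_warmup_lr` is a hypothetical name.

```python
import math


def cosine_warmup_lr(step: int, total_steps: int, base_lr: float = 2e-5,
                     warmup_fraction: float = 0.1) -> float:
    """Learning rate at `step` for linear warmup followed by cosine decay."""
    warmup_steps = int(total_steps * warmup_fraction)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the warmup steps.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase completed, from 0.0 to 1.0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Half-cosine decay from base_lr down to 0.
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With 1,000 total steps, the rate peaks at step 100 and falls to half its peak at step 550, the midpoint of the decay phase.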

	### Training Metrics

	<details><summary>Click to show per-epoch metrics</summary>

	| Epoch | Loss | MAE | R² |
	|-------|------|-----|-----|
	| 1 | 0.3496 | 0.1173 | 0.476 |
	| 2 | 0.0286 | 0.0992 | 0.593 |
	| 3 | 0.0215 | 0.0872 | 0.704 |
	| 4 | 0.0144 | 0.0879 | 0.714 |
	| 5 | 0.0169 | 0.0865 | 0.712 |
	| 6 | 0.0117 | 0.0853 | 0.700 |
	| 7 | 0.0096 | 0.0922 | 0.691 |
	| 8 | 0.0094 | 0.0850 | 0.722 |
	| 9 | 0.0086 | 0.0822 | 0.745 |
	| 10 | 0.0064 | 0.0841 | 0.723 |
	| 11 | 0.0054 | 0.0921 | 0.682 |
	| 12 | 0.0050 | 0.0840 | 0.720 |
	| 13 | 0.0044 | 0.0806 | 0.744 |
	| 14 | 0.0037 | 0.0805 | 0.740 |
	| 15 | 0.0034 | 0.0791 | 0.748 |
	| 16 | 0.0033 | 0.0807 | 0.738 |
	| 17 | 0.0031 | 0.0803 | 0.742 |
	| 18 | 0.0026 | 0.0797 | 0.745 |
	| 19 | 0.0027 | 0.0803 | 0.742 |
	| 20 | 0.0029 | 0.0805 | 0.741 |

	</details>

	Best checkpoint selected at epoch 15 by lowest MAE.
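
For readers who want to reproduce the selection rule, the sketch below gives the standard definitions of MAE and R² and picks the epoch with the lowest MAE from a few rows of the table above. The card does not state how the metrics were computed, so treating them as these textbook definitions is an assumption.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# (epoch, MAE, R²) rows copied from the per-epoch table.
history = [
    (9, 0.0822, 0.745),
    (13, 0.0806, 0.744),
    (15, 0.0791, 0.748),
    (18, 0.0797, 0.745),
    (20, 0.0805, 0.741),
]

# Checkpoint selection: keep the epoch with the lowest MAE.
best_epoch, best_mae, best_r2 = min(history, key=lambda row: row[1])
```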

	## Architecture

	Custom `HavelockOralityRegressor` with mean pooling (ModernBERT has no pooler output):
	```
	ModernBERT (answerdotai/ModernBERT-base)
	    └── Mean pooling over non-padded tokens
	        └── Dropout (p=0.1)
	            └── Linear (hidden_size → 1)
	```
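
The pooling and head in the diagram can be illustrated without any framework. This sketch uses plain Python lists for one sequence so the arithmetic is easy to follow; the real model operates on batched tensors, and the function names here are hypothetical. Dropout is omitted because it is inactive at inference time.

```python
def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors, skipping positions where the mask is 0 (padding).

    hidden_states: list of per-token vectors (lists of floats)
    attention_mask: list of 0/1 flags, one per token
    """
    kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def linear_head(pooled, weight, bias):
    """Project the pooled vector to a single scalar score (hidden_size -> 1)."""
    return sum(w * x for w, x in zip(weight, pooled)) + bias
```

Excluding padded positions matters because a naive mean over all positions would pull the pooled vector toward the padding embedding for short inputs.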

	### Regularization

	- Mixout (p=0.1): During training, each backbone weight element has a 10% chance per forward pass of being replaced by its pretrained value. This acts as a stochastic L2 anchor toward the pretrained weights and limits representation drift (Lee et al., 2020).
	- Weight decay (0.01) via AdamW
	- Gradient clipping (max norm 1.0)
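
As a rough illustration of Mixout, the sketch below applies it to a flat list of weights. The card only states the replacement probability; the rescaling step, which keeps the expected result equal to the current weight, follows the formulation in the Mixout paper and is an assumption about this model's training code. The function name and the list-based representation are hypothetical.

```python
import random


def mixout(weights, pretrained, p=0.1, rng=None):
    """Return a Mixout-perturbed copy of `weights` for one forward pass."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if rng is None:
        rng = random.Random()
    mixed = []
    for w, u in zip(weights, pretrained):
        # With probability p, swap the current value for the pretrained one.
        chosen = u if rng.random() < p else w
        # Rescale so that the expectation over the random mask equals w.
        mixed.append((chosen - p * u) / (1.0 - p))
    return mixed
```

When an element is swapped, the rescaled value is exactly the pretrained weight; otherwise it is the current weight pushed slightly away from the pretrained one, so the two cases average out to the current weight.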

	## Limitations

	- No output bounding: The model applies no sigmoid or clamp, so it can output values outside [0, 1]. Consumers should clamp if needed.
	- Domain coverage: Training corpus skews historical/literary. Performance on modern social media, code-switched text, or non-English text is untested.
	- Document length: Texts longer than 512 tokens are truncated. The model sees only the first ~400 words, which may not be representative of longer documents.
	- Regression target subjectivity: Orality scores involve human judgment; inter-annotator agreement bounds the ceiling for model performance.
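
One possible way to work around the truncation limit is to score overlapping windows and average the results. This is an untested sketch, not something the model was evaluated with: the window size of 510 assumes the tokenizer adds two special tokens per sequence, and the stride, the mean aggregation, and both function names are assumptions.

```python
def window_token_ids(token_ids, max_length=510, stride=255):
    """Split a token id list into overlapping windows of at most max_length.

    Assumption: 510 leaves room for two special tokens within the 512 limit.
    """
    if max_length <= 0 or not 0 < stride <= max_length:
        raise ValueError("need max_length > 0 and 0 < stride <= max_length")
    if not token_ids:
        return []
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # this window already reaches the end of the document
        start += stride
    return windows


def aggregate_scores(window_scores):
    """Clamp each window score to [0, 1] and return the mean."""
    clamped = [max(0.0, min(1.0, s)) for s in window_scores]
    return sum(clamped) / len(clamped)
```

Each window would be scored separately with the model, and `aggregate_scores` would combine the per-window outputs into a single document-level value.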

	## Theoretical Background

	The oral–literate spectrum follows Ong's framework, which characterizes oral discourse as additive, aggregative, redundant, agonistic, empathetic, and situational, while literate discourse is subordinative, analytic, abstract, distanced, and context-free. The model learns to place text along this continuum from document-level annotations informed by 72 specific rhetorical markers (36 oral, 36 literate).

	## Citation
	```bibtex
	@misc{havelock2026regressor,
	  title={Havelock Orality Regressor},
	  author={Havelock AI},
	  year={2026},
	  url={https://huggingface.co/HavelockAI/bert-orality-regressor}
	}
	```

	## References

	- Ong, Walter J. Orality and Literacy: The Technologizing of the Word. Routledge, 1982.
	- Lee, C. et al. "Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models." ICLR 2020.
	- Warner, B. et al. "Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference." 2024.

	---

	Trained: February 2026