# Havelock Orality Analyzer

BERT-based text classifier for the oral-literate spectrum, based on Walter Ong's *Orality and Literacy* (1982).

Deployed at: huggingface.co/thestalwart/havelock-orality
## Models Included

| Model | File | Performance |
|---|---|---|
| Document Regressor | bert_orality_regressor.pt | MAE: 0.109, R²: 0.60 |
| Category Classifier | bert_marker_category.pt | 86% accuracy, F1: 0.86 |
| Subtype Classifier | bert_marker_subtype.pt | 68-class, 49% accuracy |
## Usage

```python
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel

class BertOralityRegressor(nn.Module):
    def __init__(self, bert_model_name='bert-base-uncased', dropout=0.1):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_model_name)
        self.dropout = nn.Dropout(dropout)
        self.regressor = nn.Linear(self.bert.config.hidden_size, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs.pooler_output
        pooled_output = self.dropout(pooled_output)
        logits = self.regressor(pooled_output)
        return self.sigmoid(logits).squeeze(-1)

# Load model
model = BertOralityRegressor()
model.load_state_dict(torch.load('bert_orality_regressor.pt', map_location='cpu'))
model.eval()

# Predict
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text = "Tell me, O Muse, of that ingenious hero who travelled far and wide"
inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=512, padding='max_length')
with torch.no_grad():
    score = model(inputs['input_ids'], inputs['attention_mask'])
print(f"Orality score: {score.item():.2f}")
```
## Orality Score Interpretation
| Score | Interpretation |
|---|---|
| 0.9+ | Highly oral (epic poetry, hip-hop, sermons) |
| 0.7-0.9 | Oral dominant (speeches, podcasts) |
| 0.4-0.7 | Mixed oral/literate |
| 0.1-0.4 | Literate dominant (essays, journalism) |
| <0.1 | Highly literate (academic, legal, philosophy) |
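The bands above map directly to threshold checks. A small lookup helper (a hypothetical `interpret_score` function, not part of the released code) can turn a regressor output into its label:

```python
def interpret_score(score: float) -> str:
    """Map a regressor output in [0, 1] to the interpretation bands above."""
    if score >= 0.9:
        return "Highly oral (epic poetry, hip-hop, sermons)"
    if score >= 0.7:
        return "Oral dominant (speeches, podcasts)"
    if score >= 0.4:
        return "Mixed oral/literate"
    if score >= 0.1:
        return "Literate dominant (essays, journalism)"
    return "Highly literate (academic, legal, philosophy)"

print(interpret_score(0.93))  # Highly oral (epic poetry, hip-hop, sermons)
```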
## Training Data
- 418 documents annotated with V4/V5 three-pass system
- 17,952 labeled spans with oral/literate categories and 68 subtypes
- Sources: textfiles.com, Project Gutenberg, Reddit, Wikipedia talk pages
- Based on Walter Ong's oral/literate marker taxonomy
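One way to picture a single labeled span from this dataset is as a record carrying its offsets, category, and subtype. The field names below are an illustrative guess, not the released annotation schema:

```python
from dataclasses import dataclass

@dataclass
class MarkerSpan:
    # Hypothetical record shape for one of the 17,952 labeled spans;
    # field names are illustrative, not the released schema.
    doc_id: str
    start: int     # character offset where the span begins
    end: int       # character offset where the span ends (exclusive)
    category: str  # "oral" or "literate"
    subtype: str   # one of the 68 fine-grained marker subtypes

span = MarkerSpan(doc_id="gutenberg_0001", start=0, end=14,
                  category="oral", subtype="vocative")
print(span.subtype)  # vocative
```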
Last updated: January 2025
## Marker Types (68 subtypes)
**Oral markers:** anaphora, parallelism, tricolon, rhetorical_question, vocative, second_person, imperative, inclusive_we, asyndeton, polysyndeton, simple_conjunction, list_structure, epithet, discourse_formula, proverb, religious_formula, alliteration, rhythm, refrain, aside, dramatic_pause, self_correction, phatic_check, phatic_filler, audience_response, conflict_frame, us_them, named_individual, specific_place, sensory_detail, embodied_action, temporal_anchor, everyday_example

**Literate markers:** nominalization, abstract_noun, categorical_statement, conceptual_metaphor, nested_clauses, relative_chain, conditional, concessive, temporal_embedding, causal_chain, epistemic_hedge, probability, evidential, qualified_assertion, concessive_connector, agentless_passive, agent_demoted, institutional_subject, objectifying_stance, citation, footnote_reference, cross_reference, enumeration, metadiscourse, technical_term, technical_abbreviation, methodological_framing, contrastive, causal_explicit, additive_formal, definitional_move
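As a rough illustration of how span-level subtype labels could be aggregated into a document-level signal, here is a toy heuristic over a handful of the markers listed above. This is not the trained regressor, just a sketch of the oral-vs-literate counting idea:

```python
# Small subsets of the marker taxonomy above, for illustration only.
ORAL_MARKERS = {"anaphora", "vocative", "rhetorical_question", "refrain"}
LITERATE_MARKERS = {"nominalization", "epistemic_hedge", "citation", "nested_clauses"}

def oral_fraction(subtypes):
    """Fraction of detected marker subtypes that are oral (toy heuristic)."""
    oral = sum(1 for s in subtypes if s in ORAL_MARKERS)
    literate = sum(1 for s in subtypes if s in LITERATE_MARKERS)
    total = oral + literate
    return oral / total if total else 0.5  # neutral when no markers found

print(oral_fraction(["vocative", "anaphora", "citation"]))  # 2 oral of 3 known markers
```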