Havelock Orality Analyzer

BERT-based text classifier that places text on the oral vs. literate spectrum described in Walter Ong's "Orality and Literacy" (1982).

Deployed at: huggingface.co/thestalwart/havelock-orality

Models Included

Model                 File                         Performance
Document Regressor    bert_orality_regressor.pt    MAE: 0.109, R²: 0.60
Category Classifier   bert_marker_category.pt      86% accuracy, F1: 0.86
Subtype Classifier    bert_marker_subtype.pt       68-class, 49% accuracy

Usage

import torch
from transformers import BertTokenizer, BertModel
import torch.nn as nn

class BertOralityRegressor(nn.Module):
    """BERT encoder with a single-unit regression head; a sigmoid maps the
    output to a [0, 1] orality score."""
    def __init__(self, bert_model_name='bert-base-uncased', dropout=0.1):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_model_name)
        self.dropout = nn.Dropout(dropout)
        self.regressor = nn.Linear(self.bert.config.hidden_size, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Pool via the [CLS] representation, then regress to a single score
        pooled_output = self.dropout(outputs.pooler_output)
        logits = self.regressor(pooled_output)
        return self.sigmoid(logits).squeeze(-1)

# Load model
model = BertOralityRegressor()
model.load_state_dict(torch.load('bert_orality_regressor.pt', map_location='cpu'))
model.eval()

# Predict
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text = "Tell me, O Muse, of that ingenious hero who travelled far and wide"
inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=512, padding='max_length')

with torch.no_grad():
    score = model(inputs['input_ids'], inputs['attention_mask'])
print(f"Orality score: {score.item():.2f}")
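The two marker models from the table above are not demonstrated here. Assuming they mirror the regressor's architecture with a linear classification head (the head shape and class ordering are assumptions, not confirmed by this card), loading them might look like:

```python
import torch
import torch.nn as nn
from transformers import BertModel

class BertMarkerClassifier(nn.Module):
    """Hypothetical wrapper for bert_marker_category.pt / bert_marker_subtype.pt.

    Assumes the same BERT backbone as the regressor with a linear head over
    num_classes labels; the actual checkpoint layout may differ.
    """
    def __init__(self, num_classes, bert_model_name='bert-base-uncased', dropout=0.1):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_model_name)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Raw logits; apply softmax/argmax downstream
        return self.classifier(self.dropout(outputs.pooler_output))

# Loading (requires the checkpoint file; 68 classes for the subtype model):
# model = BertMarkerClassifier(num_classes=68)
# model.load_state_dict(torch.load('bert_marker_subtype.pt', map_location='cpu'))
# model.eval()
```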

Orality Score Interpretation

Score      Interpretation
0.9+       Highly oral (epic poetry, hip-hop, sermons)
0.7-0.9    Oral dominant (speeches, podcasts)
0.4-0.7    Mixed oral/literate
0.1-0.4    Literate dominant (essays, journalism)
<0.1       Highly literate (academic, legal, philosophy)
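The bands above are easy to turn into labels; a small helper, using the table's own thresholds and names:

```python
def orality_band(score: float) -> str:
    """Map a [0, 1] orality score to the interpretation bands above."""
    if score >= 0.9:
        return "Highly oral"
    if score >= 0.7:
        return "Oral dominant"
    if score >= 0.4:
        return "Mixed oral/literate"
    if score >= 0.1:
        return "Literate dominant"
    return "Highly literate"

print(orality_band(0.93))  # Highly oral
print(orality_band(0.05))  # Highly literate
```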

Training Data

  • 418 documents annotated with V4/V5 three-pass system
  • 17,952 labeled spans with oral/literate categories and 68 subtypes
  • Sources: textfiles.com, Project Gutenberg, Reddit, Wikipedia talk pages
  • Based on Walter Ong's oral/literate marker taxonomy

Last updated: January 2025

Marker Types (68 subtypes)

Oral markers: anaphora, parallelism, tricolon, rhetorical_question, vocative, second_person, imperative, inclusive_we, asyndeton, polysyndeton, simple_conjunction, list_structure, epithet, discourse_formula, proverb, religious_formula, alliteration, rhythm, refrain, aside, dramatic_pause, self_correction, phatic_check, phatic_filler, audience_response, conflict_frame, us_them, named_individual, specific_place, sensory_detail, embodied_action, temporal_anchor, everyday_example

Literate markers: nominalization, abstract_noun, categorical_statement, conceptual_metaphor, nested_clauses, relative_chain, conditional, concessive, temporal_embedding, causal_chain, epistemic_hedge, probability, evidential, qualified_assertion, concessive_connector, agentless_passive, agent_demoted, institutional_subject, objectifying_stance, citation, footnote_reference, cross_reference, enumeration, metadiscourse, technical_term, technical_abbreviation, methodological_framing, contrastive, causal_explicit, additive_formal, definitional_move
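Given span-level subtype predictions (e.g. from the subtype classifier), the taxonomy above can summarise a document's balance. A rough sketch, using small subsets of the printed lists for brevity:

```python
# Subsets of the oral/literate subtype lists above, for illustration
ORAL = {"anaphora", "parallelism", "tricolon", "rhetorical_question", "vocative",
        "second_person", "imperative", "inclusive_we", "proverb", "refrain"}
LITERATE = {"nominalization", "abstract_noun", "nested_clauses", "epistemic_hedge",
            "agentless_passive", "citation", "technical_term", "metadiscourse"}

def marker_balance(subtypes):
    """Fraction of recognised markers that are oral; None if none matched."""
    oral = sum(1 for s in subtypes if s in ORAL)
    literate = sum(1 for s in subtypes if s in LITERATE)
    total = oral + literate
    return oral / total if total else None

print(marker_balance(["vocative", "tricolon", "citation"]))  # 0.666...
print(marker_balance(["unknown_marker"]))  # None
```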
