---
language:
- en
tags:
- text-classification
- shakespeare
- nlp
- bert
- transformers
- literary-analysis
pipeline_tag: text-classification
widget:
- text: "To be or not to be, that is the question"
  example_title: "Hamlet"
- text: "Friends, Romans, countrymen, lend me your ears"
  example_title: "Julius Caesar"
- text: "The meeting is scheduled for 2 PM tomorrow"
  example_title: "Modern Text"
---
# Shakespeare Authenticator

## Model Description
A BERT-based model fine-tuned to distinguish authentic Shakespearean text from modern imitations and synthetic Shakespearean-style writing.
- **Developed by:** Lanre Moluga
- **Model type:** BERT for Sequence Classification
- **Language(s):** English (Early Modern English & Contemporary English)
- **License:** MIT
- **Finetuned from model:** bert-base-uncased
## Model Sources

- **Repository:** [Your GitHub repo if available]
- **Demo:** https://huggingface.co/spaces/lanretto/shakespeare-authenticator
## Uses

### Direct Use

This model performs binary text classification: it predicts whether a given text sample is authentic Shakespearean writing or a modern creation/imitation.

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="lanretto/shakespeare-authenticator")
result = classifier("To be or not to be, that is the question")
print(result)
```
### Downstream Use

- Literary analysis and research tools
- Educational applications for Shakespeare studies
- Content moderation for Shakespearean text databases
- Style transfer evaluation
- Digital humanities research
### Out-of-Scope Use
- Classification of non-English text
- Professional literary authentication without human verification
- Legal or academic authentication purposes
- Texts from other historical periods or authors
## Bias, Risks, and Limitations
- **Temporal Bias:** The model is trained on Shakespearean vs. modern text only, not on other historical periods.
- **Style Limitations:** May misclassify high-quality modern Shakespearean imitations.
- **Length Sensitivity:** Performance may vary on very short text fragments.
- **Genre Limitations:** Trained primarily on dramatic dialogue; may perform differently on poetry or prose.
- **Cultural Context:** Limited to English-language texts and Western literary traditions.
### Recommendations
Users should:

- Verify critical classifications with human experts
- Use longer text samples for more reliable predictions
- Treat the model as a supplementary tool rather than definitive authentication
- Be aware of potential false positives on sophisticated modern imitations
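One practical way to follow these recommendations is to gate predictions on a confidence threshold and route low-confidence results to human review. A minimal sketch (the 0.90 threshold and the dictionary output format are illustrative assumptions, not part of the model):

```python
# Flag low-confidence predictions for human review rather than
# treating every model output as definitive authentication.
# The score format mirrors a typical text-classification pipeline
# output; the 0.90 threshold is an illustrative assumption.

def triage(prediction, threshold=0.90):
    """Return the label if confidence is high, else defer to a human."""
    if prediction["score"] >= threshold:
        return prediction["label"]
    return "NEEDS_HUMAN_REVIEW"

# Hypothetical pipeline outputs:
confident = {"label": "Authentic Shakespeare", "score": 0.97}
uncertain = {"label": "Authentic Shakespeare", "score": 0.61}

print(triage(confident))  # high confidence: accept the label
print(triage(uncertain))  # low confidence: defer to an expert
```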
## How to Get Started with the Model

Use the code below to get started with the model.

```python
# Install required packages
# pip install transformers torch

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "lanretto/shakespeare-authenticator"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example prediction
text = "Shall I compare thee to a summer's day?"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=-1).item()

labels = {0: "Modern Creation", 1: "Authentic Shakespeare"}
print(f"Prediction: {labels[predicted_class]}")
print(f"Confidence: {predictions[0][predicted_class]:.2%}")
```
## Training Details
### Training Data

- **Total Samples:** ~400,000 text samples
- **Authentic Shakespeare:** ~108,000 lines from Shakespearean plays
- **Modern Dialogue:** ~300,000 lines from modern movie scripts
- **Train/Validation/Test Split:** 80% / 10% / 10%
- **Class Distribution:** ~26% Shakespeare, ~74% Modern
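With a roughly 26/74 class split, one common mitigation during fine-tuning is to weight the loss inversely to class frequency. A minimal sketch of standard inverse-frequency weighting, using the sample counts listed above (whether this model actually used class weights is not stated; the formula here is the usual convention, not a documented training detail):

```python
# Inverse-frequency class weights for an imbalanced binary dataset.
# Counts are taken from the training-data summary above (~108k
# Shakespeare lines vs. ~300k modern lines).

counts = {"shakespeare": 108_000, "modern": 300_000}
total = sum(counts.values())
num_classes = len(counts)

# weight_c = total / (num_classes * count_c): the rarer class gets a larger weight
weights = {c: total / (num_classes * n) for c, n in counts.items()}

for label, w in weights.items():
    print(f"{label}: {w:.3f}")  # shakespeare ~1.889, modern ~0.680
```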
### Training Procedure

#### Preprocessing

- Text normalization and cleaning
- Tokenization using the BERT tokenizer (bert-base-uncased)
- Maximum sequence length: 512 tokens
- Dynamic padding during training
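Dynamic padding pads each batch only to the length of its longest sequence, rather than always padding to the 512-token maximum, which saves compute on short lines of dialogue. A pure-Python sketch of the idea (in practice `transformers` handles this via its padding collator; pad id 0 corresponds to BERT's `[PAD]` token):

```python
# Dynamic padding: pad each batch to its own longest sequence,
# not to the global 512-token maximum. Pad id 0 corresponds to
# bert-base-uncased's [PAD] token.

def pad_batch(batch, pad_id=0):
    """Pad a batch of token-id lists to the batch's max length."""
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    # Attention mask: 1 for real tokens, 0 for padding
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return input_ids, attention_mask

ids, mask = pad_batch([[101, 2000, 102], [101, 102]])
print(ids)   # [[101, 2000, 102], [101, 102, 0]]
print(mask)  # [[1, 1, 1], [1, 1, 0]]
```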
#### Training Hyperparameters

- **Training regime:** Mixed precision training
- **Optimizer:** AdamW
- **Learning Rate:** 2e-5
- **Batch Size:** 128 (with gradient accumulation)
- **Epochs:** 3
- **Weight Decay:** 0.01
- **Warmup Ratio:** 0.1
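The hyperparameters above map directly onto a `transformers` `TrainingArguments` configuration. A sketch under stated assumptions: the output directory, the 32 × 4 accumulation split (any split with an effective batch size of 128 would match), and fp16 as the mixed-precision flavor are all guesses, not documented choices.

```python
# Hypothetical TrainingArguments matching the hyperparameters above.
# output_dir, the 32 x 4 accumulation split, and fp16 are assumptions.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./shakespeare-authenticator",  # assumed path
    learning_rate=2e-5,
    per_device_train_batch_size=32,            # 32 x 4 accumulation = 128 effective
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    weight_decay=0.01,
    warmup_ratio=0.1,
    fp16=True,                                 # assuming fp16 mixed precision
)
```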
#### Speeds, Sizes, Times

- **Model Size:** 438 MB
- **Training Time:** ~2 hours on 1x Tesla T4 GPU
- **Inference Speed:** ~100 samples/second on CPU
## Evaluation

### Testing Data & Metrics

#### Testing Data

- **Test Set Size:** ~40,000 samples
- **Class Distribution:** Representative of the training distribution
- **Data Source:** Held out from the original dataset

#### Metrics

- **Accuracy:** 84.7%
- **F1 Score:** 0.8928
- **Precision (Shakespeare):** 0.8619
- **Recall (Shakespeare):** 0.8300
- **Precision (Modern):** 0.8321
- **Recall (Modern):** 0.8642
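The per-class scores follow the standard definitions of precision, recall, and F1. A minimal sketch computing them from raw prediction counts (the counts below are illustrative examples, not this model's actual confusion matrix):

```python
# Precision, recall, and F1 from binary confusion-matrix counts.
# The counts below are illustrative, not this model's results.

def prf1(tp, fp, fn):
    """Precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example: 830 true positives, 170 missed, 133 false alarms
p, r, f = prf1(tp=830, fp=133, fn=170)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```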
## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** 1x Tesla T4 GPU
- **Hours used:** ~2
- **Cloud Provider:** [More Information Needed]
- **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]
## Technical Specifications

### Model Architecture and Objective

BERT-base encoder (bert-base-uncased) with a sequence classification head, fine-tuned on a binary classification objective (Authentic Shakespeare vs. Modern Creation).

### Compute Infrastructure

#### Hardware

1x Tesla T4 GPU

#### Software

- transformers
- PyTorch
## Model Card Contact

Lanre Moluga