---
language:
- en
tags:
- text-classification
- shakespeare
- nlp
- bert
- transformers
- literary-analysis
pipeline_tag: text-classification
widget:
- text: "To be or not to be, that is the question"
  example_title: "Hamlet"
- text: "Friends, Romans, countrymen, lend me your ears"
  example_title: "Julius Caesar"
- text: "The meeting is scheduled for 2 PM tomorrow"
  example_title: "Modern Text"
---
# Shakespeare Authenticator

## Model Description
A BERT-based model fine-tuned to distinguish authentic Shakespearean text from modern imitations and synthetic Shakespearean-style writing.
- **Developed by:** Lanre Moluga
- **Model type:** BERT for Sequence Classification
- **Language(s):** English (Early Modern English & Contemporary English)
- **License:** MIT
- **Finetuned from model:** bert-base-uncased
## Model Sources

- **Repository:** [Your GitHub repo if available]
- **Demo:** https://huggingface.co/spaces/lanretto/shakespeare-authenticator
## Uses

### Direct Use

This model performs binary text classification: it predicts whether a given text sample is authentic Shakespearean writing or a modern creation/imitation.

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="lanretto/shakespeare-authenticator")
result = classifier("To be or not to be, that is the question")
print(result)
```
### Downstream Use

- Literary analysis and research tools
- Educational applications for Shakespeare studies
- Content moderation for Shakespearean text databases
- Style transfer evaluation
- Digital humanities research
### Out-of-Scope Use
- Classification of non-English text
- Professional literary authentication without human verification
- Legal or academic authentication purposes
- Texts from other historical periods or authors
## Bias, Risks, and Limitations
- **Temporal Bias:** The model is trained on Shakespearean vs. modern text only, not on other historical periods.
- **Style Limitations:** May misclassify high-quality modern Shakespearean imitations.
- **Length Sensitivity:** Performance may vary on very short text fragments.
- **Genre Limitations:** Trained primarily on dramatic dialogue; may perform differently on poetry or prose.
- **Cultural Context:** Limited to English-language texts and Western literary traditions.
### Recommendations
Users should:

- Verify critical classifications with human experts
- Use longer text samples for more reliable predictions
- Treat the model as a supplementary tool rather than definitive authentication
- Be aware of potential false positives on sophisticated modern imitations
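One practical way to follow these recommendations is to gate predictions on a confidence threshold and route low-confidence results to human review. A minimal sketch (the 0.90 threshold and the dictionary output format are illustrative assumptions, not part of the model):

```python
# Flag low-confidence predictions for human review rather than
# treating every model output as definitive authentication.
# The score format mirrors a typical text-classification pipeline
# output; the 0.90 threshold is an illustrative assumption.

def triage(prediction, threshold=0.90):
    """Return the label if confidence is high, else defer to a human."""
    if prediction["score"] >= threshold:
        return prediction["label"]
    return "NEEDS_HUMAN_REVIEW"

# Hypothetical pipeline outputs:
confident = {"label": "Authentic Shakespeare", "score": 0.97}
uncertain = {"label": "Authentic Shakespeare", "score": 0.61}

print(triage(confident))  # high confidence: accept the label
print(triage(uncertain))  # low confidence: defer to an expert
```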
## How to Get Started with the Model

Use the code below to get started with the model.

```python
# Install required packages
# pip install transformers torch

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "lanretto/shakespeare-authenticator"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example prediction
text = "Shall I compare thee to a summer's day?"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=-1).item()

labels = {0: "Modern Creation", 1: "Authentic Shakespeare"}
print(f"Prediction: {labels[predicted_class]}")
print(f"Confidence: {predictions[0][predicted_class]:.2%}")
```
## Training Details
### Training Data

- **Total Samples:** ~400,000 text samples
- **Authentic Shakespeare:** ~108,000 lines from Shakespearean plays
- **Modern Dialogue:** ~300,000 lines from modern movie scripts
- **Train/Validation/Test Split:** 80% / 10% / 10%
- **Class Distribution:** ~26% Shakespeare, ~74% Modern
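With a roughly 26/74 class split, one common mitigation during fine-tuning is to weight the loss inversely to class frequency. A minimal sketch of standard inverse-frequency weighting, using the sample counts listed above (whether this model actually used class weights is not stated; the formula here is the usual convention, not a documented training detail):

```python
# Inverse-frequency class weights for an imbalanced binary dataset.
# Counts are taken from the training-data summary above (~108k
# Shakespeare lines vs. ~300k modern lines).

counts = {"shakespeare": 108_000, "modern": 300_000}
total = sum(counts.values())
num_classes = len(counts)

# weight_c = total / (num_classes * count_c): the rarer class gets a larger weight
weights = {c: total / (num_classes * n) for c, n in counts.items()}

for label, w in weights.items():
    print(f"{label}: {w:.3f}")  # shakespeare ~1.889, modern ~0.680
```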
### Training Procedure

#### Preprocessing

- Text normalization and cleaning
- Tokenization using the BERT tokenizer (bert-base-uncased)
- Maximum sequence length: 512 tokens
- Dynamic padding during training
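Dynamic padding pads each batch only to the length of its longest sequence, rather than always padding to the 512-token maximum, which saves compute on short lines of dialogue. A pure-Python sketch of the idea (in practice `transformers` handles this via its padding collator; pad id 0 corresponds to BERT's `[PAD]` token):

```python
# Dynamic padding: pad each batch to its own longest sequence,
# not to the global 512-token maximum. Pad id 0 corresponds to
# bert-base-uncased's [PAD] token.

def pad_batch(batch, pad_id=0):
    """Pad a batch of token-id lists to the batch's max length."""
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    # Attention mask: 1 for real tokens, 0 for padding
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return input_ids, attention_mask

ids, mask = pad_batch([[101, 2000, 102], [101, 102]])
print(ids)   # [[101, 2000, 102], [101, 102, 0]]
print(mask)  # [[1, 1, 1], [1, 1, 0]]
```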
#### Training Hyperparameters

- **Training regime:** Mixed precision training
- **Optimizer:** AdamW
- **Learning Rate:** 2e-5
- **Batch Size:** 128 (with gradient accumulation)
- **Epochs:** 3
- **Weight Decay:** 0.01
- **Warmup Ratio:** 0.1
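The hyperparameters above map directly onto a `transformers` `TrainingArguments` configuration. A sketch under stated assumptions: the output directory, the 32 × 4 accumulation split (any split with an effective batch size of 128 would match), and fp16 as the mixed-precision flavor are all guesses, not documented choices.

```python
# Hypothetical TrainingArguments matching the hyperparameters above.
# output_dir, the 32 x 4 accumulation split, and fp16 are assumptions.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./shakespeare-authenticator",  # assumed path
    learning_rate=2e-5,
    per_device_train_batch_size=32,            # 32 x 4 accumulation = 128 effective
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    weight_decay=0.01,
    warmup_ratio=0.1,
    fp16=True,                                 # assuming fp16 mixed precision
)
```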
#### Speeds, Sizes, Times

- **Model Size:** 438 MB
- **Training Time:** ~2 hours on 1x Tesla T4 GPU
- **Inference Speed:** ~100 samples/second on CPU
## Evaluation

### Testing Data & Metrics

#### Testing Data

- **Test Set Size:** ~40,000 samples
- **Class Distribution:** Representative of the training distribution
- **Data Source:** Held out from the original dataset

#### Metrics

- **Accuracy:** 84.7%
- **F1 Score:** 0.8928
- **Precision (Shakespeare):** 0.8619
- **Recall (Shakespeare):** 0.8300
- **Precision (Modern):** 0.8321
- **Recall (Modern):** 0.8642
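The per-class scores follow the standard definitions of precision, recall, and F1. A minimal sketch computing them from raw prediction counts (the counts below are illustrative examples, not this model's actual confusion matrix):

```python
# Precision, recall, and F1 from binary confusion-matrix counts.
# The counts below are illustrative, not this model's results.

def prf1(tp, fp, fn):
    """Precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example: 830 true positives, 170 missed, 133 false alarms
p, r, f = prf1(tp=830, fp=133, fn=170)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```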
## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** 1x Tesla T4 GPU
- **Hours used:** ~2
- **Cloud Provider:** [More Information Needed]
- **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]
## Technical Specifications

### Model Architecture and Objective

BERT-base encoder (bert-base-uncased) with a sequence classification head, fine-tuned on a binary classification objective (Authentic Shakespeare vs. Modern Creation).

### Compute Infrastructure

#### Hardware

1x Tesla T4 GPU

#### Software

- transformers
- PyTorch
## Model Card Contact

Lanre Moluga