---
library_name: transformers
license: mit
language:
- en
tags:
- text-classification
- shakespeare
- nlp
- bert
- transformers
- literary-analysis
pipeline_tag: text-classification
widget:
- text: "To be or not to be, that is the question"
  example_title: "Hamlet"
- text: "Friends, Romans, countrymen, lend me your ears"
  example_title: "Julius Caesar"
- text: "The meeting is scheduled for 2 PM tomorrow"
  example_title: "Modern Text"
---
# Shakespeare Authenticator

## Model Description

A BERT-based model fine-tuned to distinguish authentic Shakespearean text from modern imitations and synthetic Shakespearean-style writing.

- **Developed by:** Lanre Moluga
- **Model type:** BERT for sequence classification
- **Language(s):** English (Early Modern English & contemporary English)
- **License:** MIT
- **Finetuned from model:** `bert-base-uncased`

## Model Sources

- **Repository:** [Your GitHub repo if available]
- **Demo:** [shakespeare-authenticator Space](https://huggingface.co/spaces/lanretto/shakespeare-authenticator)
## Uses

### Direct Use

This model is designed for binary text classification: determining whether a given text sample is authentic Shakespearean writing or a modern creation/imitation.

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="lanretto/shakespeare-authenticator")
result = classifier("To be or not to be, that is the question")
print(result)
```

### Downstream Use

- Literary analysis and research tools
- Educational applications for Shakespeare studies
- Content moderation for Shakespearean text databases
- Style transfer evaluation
- Digital humanities research
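For literary-analysis use cases like these, per-line predictions are typically aggregated over a whole passage. A minimal sketch of that idea follows; the `classify` stub below stands in for the real pipeline (its keyword heuristic is purely illustrative), while the label strings mirror this card's label names.

```python
# Sketch: aggregate per-line predictions into a document-level score.
# `classify` is a toy stand-in for the real text-classification pipeline.

def classify(line):
    # Hypothetical stub: pretends archaic pronouns signal Shakespeare.
    # The real model returns a label and score from learned features.
    is_shakespeare = any(w in line.lower() for w in ("thee", "thou", "hath"))
    return {"label": "Authentic Shakespeare" if is_shakespeare else "Modern Creation",
            "score": 0.9}

def shakespeare_fraction(lines):
    """Fraction of lines classified as authentic Shakespeare."""
    hits = sum(1 for line in lines if classify(line)["label"] == "Authentic Shakespeare")
    return hits / len(lines)

lines = [
    "Shall I compare thee to a summer's day?",
    "The meeting is scheduled for 2 PM tomorrow",
    "What light through yonder window breaks?",
    "Thou art more lovely and more temperate",
]
print(shakespeare_fraction(lines))  # 0.5 with this toy stub
```

Swapping the stub for the real `pipeline` call above turns this into a passage-level analysis tool.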
### Out-of-Scope Use

- Classification of non-English text
- Professional literary authentication without human verification
- Legal or academic authentication purposes
- Texts from other historical periods or authors
## Bias, Risks, and Limitations

- **Temporal bias:** The model is trained specifically on Shakespearean vs. modern text, not other historical periods.
- **Style limitations:** It may misclassify high-quality modern Shakespearean imitations.
- **Length sensitivity:** Performance may vary with very short text fragments.
- **Genre limitations:** Primarily trained on dramatic dialogue; it may perform differently on poetry or prose.
- **Cultural context:** Limited to the English language and Western literary traditions.
### Recommendations

Users should:

- Verify critical classifications with human experts
- Use longer text samples for more reliable predictions
- Treat the model as a supplementary tool rather than definitive authentication
- Be aware of potential false positives with sophisticated modern imitations
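The human-verification recommendation can be operationalized with a simple confidence gate. The sketch below is illustrative only: the threshold value and function name are assumptions, not part of the model.

```python
# Sketch: gate predictions on confidence and route the rest to human review,
# per the recommendations above. Threshold and names are illustrative only.

REVIEW_THRESHOLD = 0.90  # assumed cutoff; tune on validation data

def triage(prediction):
    """Return 'auto' for confident predictions, else flag for human review."""
    return "auto" if prediction["score"] >= REVIEW_THRESHOLD else "needs human review"

print(triage({"label": "Authentic Shakespeare", "score": 0.97}))  # auto
print(triage({"label": "Modern Creation", "score": 0.61}))        # needs human review
```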
## How to Get Started with the Model

Use the code below to get started with the model.

```python
# Install required packages:
# pip install transformers torch

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "lanretto/shakespeare-authenticator"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example prediction
text = "Shall I compare thee to a summer's day?"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

with torch.no_grad():
    outputs = model(**inputs)

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class = torch.argmax(predictions, dim=-1).item()

labels = {0: "Modern Creation", 1: "Authentic Shakespeare"}
print(f"Prediction: {labels[predicted_class]}")
print(f"Confidence: {predictions[0][predicted_class]:.2%}")
```
## Training Details

### Training Data

- **Total samples:** ~400,000 text samples
- **Authentic Shakespeare:** ~108,000 lines from Shakespearean plays
- **Modern dialogue:** ~300,000 lines from modern movie scripts
- **Train/validation/test split:** 80% / 10% / 10%
- **Class distribution:** ~26% Shakespeare, ~74% modern
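The split sizes implied by these figures can be sanity-checked in a few lines. This is a back-of-the-envelope sketch using the approximate counts above:

```python
# Approximate corpus composition from the training-data figures above.
shakespeare_lines = 108_000
modern_lines = 300_000
total = shakespeare_lines + modern_lines  # ~408,000 samples

# Class distribution: roughly 26% Shakespeare / 74% modern.
print(f"Shakespeare share: {shakespeare_lines / total:.0%}")  # 26%

# 80/10/10 train/validation/test split.
n_train, n_val, n_test = (int(total * f) for f in (0.8, 0.1, 0.1))
print(f"train={n_train:,} val={n_val:,} test={n_test:,}")
```

The ~40,800-sample test portion matches the "~40,000 samples" figure reported under Evaluation.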
### Training Procedure

#### Preprocessing

- Text normalization and cleaning
- Tokenization with the BERT tokenizer (`bert-base-uncased`)
- Maximum sequence length: 512 tokens
- Dynamic padding during training
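Dynamic padding means each batch is padded only to its own longest sequence rather than to the global 512-token maximum, which saves compute on short lines. A minimal sketch of the idea (the token IDs and pad ID below are illustrative, not from the real tokenizer):

```python
# Sketch of dynamic padding: pad each batch to its longest member,
# not to the global max length (512 here). IDs are made up for illustration.

PAD_ID = 0
MAX_LEN = 512

def pad_batch(batch):
    """Pad a batch of token-ID lists to the batch's longest sequence."""
    longest = min(max(len(seq) for seq in batch), MAX_LEN)
    return [(seq[:longest] + [PAD_ID] * (longest - len(seq)))[:longest] for seq in batch]

batch = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]
padded = pad_batch(batch)
print([len(seq) for seq in padded])  # [5, 5] – padded to the batch max, not 512
```

In practice this is what `DataCollatorWithPadding` from `transformers` does automatically.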
#### Training Hyperparameters

- **Training regime:** Mixed-precision training
- **Optimizer:** AdamW
- **Learning rate:** 2e-5
- **Batch size:** 128 (with gradient accumulation)
- **Epochs:** 3
- **Weight decay:** 0.01
- **Warmup ratio:** 0.1
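The effective batch size of 128 is reached via gradient accumulation. The card does not state the per-device batch size, so the figures below are an assumed example only:

```python
# Assumed example: a per-device batch of 16 with 8 accumulation steps
# yields the effective batch size of 128 reported above.
per_device_batch = 16   # assumption – not stated in this card
accumulation_steps = 8  # assumption – chosen so the product is 128
effective_batch = per_device_batch * accumulation_steps
print(effective_batch)  # 128
```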
#### Speeds, Sizes, Times

- **Model size:** 438 MB
- **Training time:** ~2 hours on 1x Tesla T4 GPU
- **Inference speed:** ~100 samples/second on CPU
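The reported CPU throughput translates directly into wall-clock estimates; for instance, scoring the full ~40,000-sample test set would take roughly:

```python
# Back-of-the-envelope: time to score the ~40,000-sample test set on CPU
# at the reported ~100 samples/second.
samples = 40_000
throughput = 100  # samples/second (from this card)
seconds = samples / throughput
print(f"{seconds:.0f} s ≈ {seconds / 60:.1f} min")  # 400 s ≈ 6.7 min
```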
## Evaluation

### Testing Data

- **Test set size:** ~40,000 samples
- **Class distribution:** Representative of the training distribution
- **Data source:** Held out from the original dataset

### Metrics

- **Accuracy:** 84.7%
- **F1 score:** 0.8928
- **Precision (Shakespeare):** 0.8619
- **Recall (Shakespeare):** 0.8300
- **Precision (Modern):** 0.8321
- **Recall (Modern):** 0.8642
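Per-class F1 can be derived from the precision/recall pairs above as their harmonic mean; a quick sketch:

```python
# Per-class F1 as the harmonic mean of the precision/recall pairs above.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(f"F1 (Shakespeare): {f1(0.8619, 0.8300):.4f}")
print(f"F1 (Modern):      {f1(0.8321, 0.8642):.4f}")
```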
## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** 1x Tesla T4 GPU
- **Hours used:** ~2
- **Cloud Provider:** [More Information Needed]
- **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]
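As a rough illustration of how such an estimate is formed from the ~2 GPU-hours above: the T4's ~70 W power draw and the grid carbon intensity used here are assumptions for illustration, not measured values.

```python
# Rough CO2 estimate for the reported ~2 h on one Tesla T4.
# Power draw and grid intensity are assumptions for illustration only.
gpu_power_kw = 0.070     # Tesla T4 TDP ≈ 70 W (assumed full utilization)
hours = 2                # from this card
carbon_kg_per_kwh = 0.4  # assumed grid carbon intensity
energy_kwh = gpu_power_kw * hours
emissions_g = energy_kwh * carbon_kg_per_kwh * 1000
print(f"≈ {energy_kwh:.2f} kWh, ≈ {emissions_g:.0f} g CO2eq")
```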
## Technical Specifications

### Model Architecture and Objective

BERT base (uncased) with a sequence-classification head, fine-tuned with a standard binary classification objective (authentic Shakespeare vs. modern text).

### Compute Infrastructure

#### Hardware

- 1x Tesla T4 GPU

#### Software

- `transformers`
- `torch` (PyTorch)
## Model Card Authors

Lanre Moluga

## Model Card Contact

[More Information Needed]