---
language:
- en
metrics:
- code_eval
- accuracy
library_name: adapter-transformers
pipeline_tag: text-generation
---

List of texts included in the corpus:

Beautiful Stories of Shakespeare,
As You Like It,
Hamlet,
Julius Caesar,
King Lear II,
King Richard II,
Macbeth,
A Midsummer Night's Dream,
Othello,
Shakespeare Roman Play,
Shakespearean Text,
Sonnets,
Taming of the Shrew,
The Tempest,
Tragedy of Romeo Juliet

How many tokens are in each text and the total number of tokens in the corpus:

Beautiful Stories of Shakespeare 62,537;
As You Like It 38,772;
Hamlet 30,512;
Julius Caesar 30,915;
King Lear II 25,031;
King Richard II 26,029;
Macbeth 197,328;
A Midsummer Night's Dream 248,172;
Othello 197,328;
Shakespeare Roman Play 24,582;
Shakespearean Text 32,453;
Sonnets 35,675;
Taming of the Shrew 36,794;
The Tempest 30,700;
Tragedy of Romeo Juliet 37,623;
Total: 1,054,451

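As a quick sanity check on the figures above, the per-text counts can be summed and compared with the stated total. This is only an arithmetic sketch: the dictionary restates the numbers listed in this card, and the variable names are illustrative.

```python
# Per-text token counts, copied from the list in this card.
token_counts = {
    "Beautiful Stories of Shakespeare": 62_537,
    "As You Like It": 38_772,
    "Hamlet": 30_512,
    "Julius Caesar": 30_915,
    "King Lear II": 25_031,
    "King Richard II": 26_029,
    "Macbeth": 197_328,
    "A Midsummer Night's Dream": 248_172,
    "Othello": 197_328,
    "Shakespeare Roman Play": 24_582,
    "Shakespearean Text": 32_453,
    "Sonnets": 35_675,
    "Taming of the Shrew": 36_794,
    "The Tempest": 30_700,
    "Tragedy of Romeo Juliet": 37_623,
}

# Sum the counts and report how many texts were included.
total = sum(token_counts.values())
print(f"{len(token_counts)} texts, {total:,} tokens")  # 15 texts, 1,054,451 tokens
```
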
How, when, and why the corpus was collected: The corpus was collected from Project Gutenberg (https://www.gutenberg.org/), a library of over 60,000 free ebooks. I collected 15 books of Shakespeare and combined them into one text file. The goal was a dataset of over a million tokens on which a model could be fine-tuned to generate text in the style of Shakespeare's work. The corpus was created on 18 Feb 2023.

How the text was pre-processed or tokenized: The text was preprocessed by removing all empty spaces. It was combined into a single line, broken down into paragraphs, and merged into one text file that was used for training and validating the model.

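The preprocessing steps described above could look roughly like the following. This is a minimal sketch, not the original script: the function names are hypothetical, "removing empty spaces" is interpreted as collapsing runs of whitespace, and the fixed number of words per paragraph is an assumption, since the card does not say how paragraph boundaries were chosen.

```python
import re
from typing import Iterable, List


def collapse_whitespace(text: str) -> str:
    """Join the text into a single line by collapsing all runs of whitespace."""
    return re.sub(r"\s+", " ", text).strip()


def split_into_paragraphs(line: str, words_per_paragraph: int = 100) -> List[str]:
    """Break a single-line text into chunks of a fixed number of words (assumed rule)."""
    words = line.split(" ")
    return [
        " ".join(words[i:i + words_per_paragraph])
        for i in range(0, len(words), words_per_paragraph)
    ]


def build_corpus(texts: Iterable[str], words_per_paragraph: int = 100) -> str:
    """Clean each text, split it into paragraphs, and merge everything into one string."""
    paragraphs: List[str] = []
    for text in texts:
        line = collapse_whitespace(text)
        if line:  # skip texts that are empty after cleaning
            paragraphs.extend(split_into_paragraphs(line, words_per_paragraph))
    return "\n".join(paragraphs)


# Tiny illustrative input; the real corpus would be the 15 downloaded text files.
sample = ["To be,\n\n  or not   to be", "That is\tthe question"]
corpus = build_corpus(sample, words_per_paragraph=4)
print(corpus)
```

The resulting string can then be written to a single text file and split into training and validation portions.
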
Values of hyperparameters used during fine-tuning: max_length=768, tokenizer=GPT2, batch size=2, top_p=0.95, output max_length=200

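To illustrate what the top_p=0.95 setting does when text is generated, here is a small, library-independent sketch of nucleus (top-p) sampling over a toy distribution. It is a generic description of the technique, not the code used by the model or by any library; the tokens and probabilities are made up for the example.

```python
import random
from typing import Dict


def top_p_filter(probs: Dict[str, float], top_p: float = 0.95) -> Dict[str, float]:
    """Keep the most likely tokens until their cumulative probability reaches top_p,
    then renormalize the kept probabilities so they sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


def sample_top_p(probs: Dict[str, float], top_p: float = 0.95) -> str:
    """Draw one token from the filtered (nucleus) distribution."""
    nucleus = top_p_filter(probs, top_p)
    return random.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0]


# Made-up next-token probabilities, for illustration only.
toy_probs = {"thou": 0.5, "thee": 0.3, "thy": 0.16, "zounds": 0.04}
nucleus = top_p_filter(toy_probs, top_p=0.95)
print(sorted(nucleus))  # the least likely token falls outside the nucleus
print(sample_top_p(toy_probs, top_p=0.95))
```

With these toy numbers, the three most likely tokens already cover 96% of the probability mass, so the fourth is never sampled.
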
Model Description:

This model is fine-tuned on a corpus of Shakespeare's work to generate text in Shakespearean language.

Intended uses & limitations:

This model can be used to generate text in the Shakespearean language.

How to use:

This model can be downloaded from the Hugging Face Hub and run on Google Colab.

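A possible way to load the model and generate text is sketched below. It assumes the checkpoint is a GPT-2 style causal language model that loads with the standard `transformers` auto classes; since this card lists adapter-transformers as the library, the actual loading step may differ. `MODEL_ID` is a hypothetical placeholder, not the real repository name, and `do_sample=True` is added because top_p only has an effect when sampling is enabled.

```python
# Generation settings taken from this card: top_p=0.95, output max_length=200.
GENERATION_KWARGS = {
    "do_sample": True,  # assumption: sampling must be on for top_p to apply
    "top_p": 0.95,
    "max_length": 200,
}

# Hypothetical placeholder; replace with the actual model repository id.
MODEL_ID = "your-username/your-shakespeare-model"


def generate_text(prompt: str = "The", model_id: str = MODEL_ID) -> str:
    """Load the tokenizer and model, then generate a continuation of the prompt."""
    # Imported here so the settings above can be inspected without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
        **GENERATION_KWARGS,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

In a Colab cell, after installing `transformers` and `torch`, calling `print(generate_text("The"))` would download the model and print one generated passage.
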
Training Data:

This model is trained on a corpus of over a million tokens of Shakespeare's work, collected from 15 books of Shakespeare on Gutenberg.org.

Training Procedure:

This model was run on Google Colab using a GPU. Processing took about 15 to 20 minutes.

To select a GPU, click Runtime, then Change Runtime Type. Select GPU and save. Then run the code in Colab.

Variables and metrics:

The prompt given to the model to start a sentence was "The", and max_length was set to 300.

Evaluation results:

The text generation results of this model are above satisfactory. The model was able to generate reasonable text in Shakespearean language.