Israhassan committed on
Commit 5a85b5c · 1 Parent(s): aebd536

Update README.md

Files changed (1): README.md (+1 −2)
README.md CHANGED
@@ -44,8 +44,7 @@ The Tempest 30,700;
 Tragedy of Romeo Juliet 37,623;
 Total: 1,054,451
 
-How, when, and why the corpus was collected: The corpus was collected from Project Gutenberg (https://www.gutenberg.org/), a library of over 60,000 free ebooks. I collected 15 books by Shakespeare and combined them into one text file. The corpus was collected to create a dataset of over a million characters/tokens so a model could be fine-tuned on Shakespeare's work and generate text in his style.
-
+How, when, and why the corpus was collected: The corpus was collected from Project Gutenberg (https://www.gutenberg.org/), a library of over 60,000 free ebooks. I collected 15 books by Shakespeare and combined them into one text file. The corpus was collected to create a dataset of over a million tokens so a model could be fine-tuned on Shakespeare's work and generate text in his style. This corpus was created on 18 Feb 2023.
 How the text was pre-processed or tokenized: The text was pre-processed by removing all empty spaces; it was combined into a single line, broken down into paragraphs, and merged into one text file that was used for training and validating the model.
 
 Values of hyperparameters used during fine-tuning: max_length=768, tokenizer=GPT2, batch size=2, top_p=0.95, output max_length=200
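
The pre-processing the README describes (concatenating the 15 Gutenberg texts and removing empty spaces into one training file) could be sketched roughly as below. This is an interpretation, not the repository's actual script; the `preprocess` helper and the sample inputs are hypothetical.

```python
def preprocess(raw_texts):
    """Concatenate the book texts, drop empty lines and surrounding
    whitespace, and rejoin everything into one training string."""
    combined = "\n".join(raw_texts)
    lines = [line.strip() for line in combined.splitlines()]
    return "\n".join(line for line in lines if line)

if __name__ == "__main__":
    # Stand-ins for the 15 Project Gutenberg plain-text files.
    books = ["The Tempest\n\nACT I\n", "Romeo and Juliet\n\nACT I\n"]
    corpus = preprocess(books)
    print(corpus)
```

In a real pipeline the elements of `books` would come from reading the downloaded Gutenberg files, and `corpus` would be written out as the single text file used for training and validation.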
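
Among the listed hyperparameters, top_p=0.95 controls nucleus sampling at generation time. As an illustration of what that parameter does (not the `transformers` library's implementation), a minimal top-p filter might look like this; the probability table is made up for the example.

```python
def top_p_filter(probs, top_p=0.95):
    """Keep the smallest set of highest-probability tokens whose cumulative
    probability reaches top_p, then renormalise (nucleus sampling)."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = {}, 0.0
    for token, p in ranked:
        kept[token] = p
        total += p
        if total >= top_p:
            break
    norm = sum(kept.values())
    return {token: p / norm for token, p in kept.items()}

if __name__ == "__main__":
    # Hypothetical next-token distribution; with top_p=0.9 the
    # low-probability tail ("d") is cut before sampling.
    probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
    print(top_p_filter(probs, top_p=0.9))
```

A higher top_p (such as the 0.95 used here) keeps more of the tail and produces more varied text; a lower value makes generation more conservative.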