Commit 5a85b5c · Parent: aebd536
Update README.md

README.md CHANGED
@@ -44,8 +44,7 @@ The Tempest 30,700;
 Tragedy of Romeo Juliet 37,623;
 Total: 1,054,451
 
-How, when, and why the corpus was collected: The corpus was collected from Project Gutenberg (https://www.gutenberg.org/), a library of over 60,000 free ebooks. I collected 15 books of Shakespeare and combined them into one text file. The corpus was collected to create a dataset of over a million
-
+How, when, and why the corpus was collected: The corpus was collected from Project Gutenberg (https://www.gutenberg.org/), a library of over 60,000 free ebooks. I collected 15 books of Shakespeare and combined them into one text file. The corpus was collected to create a dataset of over a million tokens so that a model could be fine-tuned on Shakespeare's work and generate text in that style. This corpus was created on 18 Feb 2023.
 How the text was pre-processed or tokenized: The text was preprocessed by removing all empty spaces; it was combined into a single line, broken back into paragraphs, and combined into one text file that was used for training and validating the model.
 
 Values of hyperparameters used during fine-tuning: max_length=768, tokenizer=GPT-2, batch size=2, top_p=0.95, output max_length=200
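The preprocessing and hyperparameters described above can be sketched as follows. This is a minimal illustration, not the repository's actual script: the `preprocess` function and the 500-character paragraph chunk size are assumptions, and the hyperparameter names in the dict simply mirror the values listed in the README.

```python
import re

def preprocess(raw_text: str) -> str:
    """Sketch of the README's preprocessing: collapse whitespace so the
    text becomes a single line, then break it back into paragraph-sized
    chunks for one combined training/validation file."""
    # Flatten the text into a single line by collapsing all whitespace runs.
    one_line = re.sub(r"\s+", " ", raw_text).strip()
    # Re-split into fixed-size "paragraphs" (the 500-character chunk size
    # is an assumption; the README does not state one).
    chunk = 500
    paragraphs = [one_line[i:i + chunk] for i in range(0, len(one_line), chunk)]
    return "\n\n".join(paragraphs)

# Hyperparameters as stated in the README.
HYPERPARAMS = {
    "max_length": 768,        # sequence length used during fine-tuning
    "tokenizer": "gpt2",      # GPT-2 BPE tokenizer
    "batch_size": 2,
    "top_p": 0.95,            # nucleus-sampling threshold at generation time
    "output_max_length": 200, # cap on generated output length
}
```

These values would typically be passed to a GPT-2 fine-tuning loop and to the model's text-generation call (e.g. as `top_p` and `max_length` sampling arguments).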