Israhassan committed on
Commit 5a85b5c · 1 Parent(s): aebd536

Update README.md

Files changed (1): README.md (+1 −2)
README.md CHANGED
@@ -44,8 +44,7 @@ The Tempest 30,700;
 Tragedy of Romeo Juliet 37,623;
 Total: 1,054,451
 
-How, when, and why the corpus was collected: The corpus was collected from Project Gutenberg (https://www.gutenberg.org/), a library of over 60,000 free ebooks. I collected 15 books by Shakespeare and combined them into one text file. The corpus was collected to create a dataset of over a million characters/tokens so a model could be fine-tuned on Shakespeare's work and generate text in his style.
-
+How, when, and why the corpus was collected: The corpus was collected from Project Gutenberg (https://www.gutenberg.org/), a library of over 60,000 free ebooks. I collected 15 books by Shakespeare and combined them into one text file. The corpus was collected to create a dataset of over a million tokens so a model could be fine-tuned on Shakespeare's work and generate text in his style. This corpus was created on 18 Feb 2023.
 How the text was pre-processed or tokenized: The text was pre-processed by removing all empty spaces; it was combined into a single line, broken down into paragraphs, and merged into one text file that was used for training and validating the model.
 
 Values of hyperparameters used during fine-tuning: max_length=768, tokenizer=GPT2, batch size=2, top_p=0.95, output max_length=200
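
The pre-processing the README describes (concatenating the 15 Gutenberg texts and removing empty spaces into one training file) could be sketched roughly as below. This is an interpretation, not the repository's actual script; the `preprocess` helper and the sample inputs are hypothetical.

```python
def preprocess(raw_texts):
    """Concatenate the book texts, drop empty lines and surrounding
    whitespace, and rejoin everything into one training string."""
    combined = "\n".join(raw_texts)
    lines = [line.strip() for line in combined.splitlines()]
    return "\n".join(line for line in lines if line)

if __name__ == "__main__":
    # Stand-ins for the 15 Project Gutenberg plain-text files.
    books = ["The Tempest\n\nACT I\n", "Romeo and Juliet\n\nACT I\n"]
    corpus = preprocess(books)
    print(corpus)
```

In a real pipeline the elements of `books` would come from reading the downloaded Gutenberg files, and `corpus` would be written out as the single text file used for training and validation.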
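
Among the listed hyperparameters, top_p=0.95 controls nucleus sampling at generation time. As an illustration of what that parameter does (not the `transformers` library's implementation), a minimal top-p filter might look like this; the probability table is made up for the example.

```python
def top_p_filter(probs, top_p=0.95):
    """Keep the smallest set of highest-probability tokens whose cumulative
    probability reaches top_p, then renormalise (nucleus sampling)."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = {}, 0.0
    for token, p in ranked:
        kept[token] = p
        total += p
        if total >= top_p:
            break
    norm = sum(kept.values())
    return {token: p / norm for token, p in kept.items()}

if __name__ == "__main__":
    # Hypothetical next-token distribution; with top_p=0.9 the
    # low-probability tail ("d") is cut before sampling.
    probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
    print(top_p_filter(probs, top_p=0.9))
```

A higher top_p (such as the 0.95 used here) keeps more of the tail and produces more varied text; a lower value makes generation more conservative.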