---
language:
- en
metrics:
- code_eval
- accuracy
library_name: adapter-transformers
pipeline_tag: text-generation
---
List of texts included in the corpus:

Beautiful Stories of Shakespeare,
As you Like It,
Hamlet,
Julius Caesar,
King Lear II,
King Richard II,
Macbeth,
Midnight Summer Dream,
Othello,
Shakespeare Roman Play,
Shakespearean Text,
Sonnets,
Taming of the Shrew,
The Tempest,
Tragedy of Romeo Juliet

How many tokens are in each text and the total number of tokens in the corpus: 

| Text | Tokens |
| --- | ---: |
| Beautiful Stories of Shakespeare | 62,537 |
| As you Like It | 38,772 |
| Hamlet | 30,512 |
| Julius Caesar | 30,915 |
| King Lear II | 25,031 |
| King Richard II | 26,029 |
| Macbeth | 197,328 |
| Midnight Summer Dream | 248,172 |
| Othello | 197,328 |
| Shakespeare Roman Play | 24,582 |
| Shakespearean Text | 32,453 |
| Sonnets | 35,675 |
| Taming of the Shrew | 36,794 |
| The Tempest | 30,700 |
| Tragedy of Romeo Juliet | 37,623 |
| Total | 1,054,451 |
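
The total can be checked by summing the per-text figures. A minimal sketch, with the counts copied from this card; the name `token_counts` is only illustrative:

```python
# Per-text token counts as reported in this card.
token_counts = {
    "Beautiful Stories of Shakespeare": 62_537,
    "As you Like It": 38_772,
    "Hamlet": 30_512,
    "Julius Caesar": 30_915,
    "King Lear II": 25_031,
    "King Richard II": 26_029,
    "Macbeth": 197_328,
    "Midnight Summer Dream": 248_172,
    "Othello": 197_328,
    "Shakespeare Roman Play": 24_582,
    "Shakespearean Text": 32_453,
    "Sonnets": 35_675,
    "Taming of the Shrew": 36_794,
    "The Tempest": 30_700,
    "Tragedy of Romeo Juliet": 37_623,
}

total = sum(token_counts.values())
print(f"{len(token_counts)} texts, {total:,} tokens")  # 15 texts, 1,054,451 tokens
```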

How, when, and why the corpus was collected: The corpus was collected from Project Gutenberg (https://www.gutenberg.org/), a library of over 60,000 free ebooks. I collected 15 books of Shakespeare's work and combined them into one text file. The goal was a dataset of over a million tokens on which a model could be fine-tuned to generate text in the style of Shakespeare. The corpus was created on 18 Feb 2023.

How the text was pre-processed or tokenized: The text was preprocessed by removing all empty spaces. It was then combined into a single line, broken down into paragraphs, and written to one text file that was used for training and validating the model.
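
The preprocessing steps might look roughly like the sketch below. It is not the original script: the function names, the fixed `words_per_paragraph` rule for paragraph boundaries, and the 10% validation share are all assumptions, since the card does not state how paragraphs were delimited or how the data was split.

```python
def preprocess(raw_text, words_per_paragraph=100):
    """Collapse the text into one run of words, then re-split into paragraphs.

    words_per_paragraph is an assumed value, not taken from the card.
    """
    # str.split() with no arguments discards blank lines and repeated spaces.
    words = raw_text.split()
    return [
        " ".join(words[i:i + words_per_paragraph])
        for i in range(0, len(words), words_per_paragraph)
    ]


def train_val_split(paragraphs, val_fraction=0.1):
    """Hold out the final share of paragraphs for validation (assumed ratio)."""
    cut = int(len(paragraphs) * (1 - val_fraction))
    return paragraphs[:cut], paragraphs[cut:]


sample = (
    "To be, or not to be,\n\n\n   that is the question:\n"
    "Whether 'tis nobler in the mind to suffer"
)
paragraphs = preprocess(sample, words_per_paragraph=6)
train, val = train_val_split(paragraphs)
```

With the sample above, `paragraphs` holds three six-word paragraphs, two of which go to `train` and one to `val`.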

Values of hyperparameters used during fine-tuning: max_length=768, tokenizer=GPT2, batch size=2, top_p=0.95, output max_length=200.
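
To illustrate what `top_p=0.95` does at generation time, here is a small pure-Python sketch of nucleus (top-p) sampling. It is a simplified illustration, not the library's implementation, and the toy probabilities and function names are invented for the example.

```python
import random


def top_p_filter(probs, top_p=0.95):
    """Keep the most probable tokens until their cumulative mass reaches top_p,
    then renormalize the kept probabilities so they sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


def sample_top_p(probs, top_p=0.95, rng=random):
    """Draw one token from the filtered (nucleus) distribution."""
    filtered = top_p_filter(probs, top_p)
    return rng.choices(list(filtered), weights=list(filtered.values()), k=1)[0]


# Toy next-token distribution (made up for illustration).
next_token_probs = {"thou": 0.5, "thee": 0.3, "thy": 0.16, "zounds": 0.04}
nucleus = top_p_filter(next_token_probs, top_p=0.95)
```

Here the three most likely tokens already cover 0.96 of the probability mass, so the least likely token is excluded from sampling.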

Model Description:
This model is fine-tuned on a corpus of Shakespeare's work to generate text in Shakespearean language.

Intended uses & limitations:
This model can be used to generate text in the Shakespearean language.

How to use:
This model can be downloaded from the Hugging Face Hub and run on Google Colab.
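
A minimal usage sketch, assuming the model loads through the standard `transformers` text-generation pipeline. The helper names are hypothetical, and `MODEL_ID` is a placeholder because this card does not state the repository name. The defaults follow the hyperparameters listed in this card (`top_p=0.95`, output `max_length=200`); the evaluation run described in this card used `max_length=300`, which can be passed as an override.

```python
def generation_settings(max_length=200, top_p=0.95):
    """Sampling settings taken from the hyperparameters listed in this card."""
    return {
        "max_length": max_length,  # total length in tokens, prompt included
        "do_sample": True,         # top_p only takes effect when sampling is on
        "top_p": top_p,
    }


def generate(model_id, prompt="The", **overrides):
    """Generate text from a prompt; model_id must be supplied by the caller."""
    # Imported here so the settings helper works without transformers installed.
    from transformers import pipeline

    generator = pipeline("text-generation", model=model_id)
    settings = {**generation_settings(), **overrides}
    return generator(prompt, **settings)[0]["generated_text"]


# Example call (MODEL_ID is a placeholder, not the real repository name):
# print(generate("MODEL_ID", prompt="The", max_length=300))
```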

Training Data:
This model is trained on a corpus of over a million tokens of Shakespeare's work, collected from 15 Shakespeare books on Gutenberg.org.

Training Procedure: 
This model was run on Google Colab using a GPU. Processing took about 15 to 20 minutes.
To select a GPU, click Runtime, then Change runtime type, select GPU, and save. Then run the code in Colab.
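
After switching the runtime, it can help to confirm that a GPU is actually visible before running anything expensive. A small sketch, assuming PyTorch is the backend (the card does not say which framework was used); `pick_device` is a hypothetical helper:

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so fall back to the CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Running on: {device}")
```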

Variables and metrics:
The prompt given to the model to start a sentence was "The", and max_length was set to 300.

Evaluation results:
The text generation results of this model are above satisfactory. The model was able to generate reasonable text in Shakespearean language.