vicclab committed on
Commit
1844e03
·
1 Parent(s): 331627d

Initial card updates.

Files changed (1)
  1. README.md +34 -8
README.md CHANGED
@@ -5,28 +5,54 @@ tags:
5
  model-index:
6
  - name: FolkGPT
7
  results: []
8
  ---
9
 
10
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
11
- should probably proofread and complete it, then remove this comment. -->
12
-
13
  # FolkGPT
14
 
15
- This model is a fine-tuned version of [gpt2](https://huggingface.co/gpt2) on an unknown dataset.
16
 
17
  ## Model description
18
 
19
- More information needed
20
 
21
  ## Intended uses & limitations
22
 
23
- More information needed
24
 
25
  ## Training and evaluation data
26
 
27
- More information needed
28
 
29
  ## Training procedure
30
 
31
  ### Training hyperparameters
32
 
@@ -52,4 +78,4 @@ The following hyperparameters were used during training:
52
  - Transformers 4.26.1
53
  - Pytorch 1.13.1+cu116
54
  - Datasets 2.10.0
55
- - Tokenizers 0.13.2
 
5
  model-index:
6
  - name: FolkGPT
7
  results: []
8
+ datasets:
9
+ - vicclab/fairy_tales
10
+ language:
11
+ - en
12
+ pipeline_tag: text-generation
13
  ---
14
 
 
 
 
15
  # FolkGPT
16
 
17
+ This model is a fine-tuned version of [gpt2](https://huggingface.co/gpt2) on the vicclab/fairy_tales dataset.
18
 
19
  ## Model description
20
 
21
+ This model is the result of fine-tuning gpt2 on a dataset of fairy tales from various cultures.
22
 
23
  ## Intended uses & limitations
24
 
25
+ This model is intended to generate text in the style of fairy tales written in the 18th and 19th centuries.
26
+
27
+ Why? Fairy tales seemed an appropriate application for text generation, as stories are usually short(ish),
28
+ self-contained, and easy to read.
29
 
30
  ## Training and evaluation data
31
 
32
+ The model was trained on the vicclab/fairy_tales dataset. The dataset consists of texts that
33
+ were downloaded from Project Gutenberg and then edited to remove all text except for the
34
+ stories themselves. These were then concatenated into a single text file and pushed to the Hugging Face Hub at
35
+ https://huggingface.co/datasets/vicclab/fairy_tales. The latest update to the dataset, which
36
+ was used to train this model, was created and uploaded on February 26th, 2023.
37
+ Texts used [and token count after removing boilerplate text]:
38
+ - https://www.gutenberg.org/files/2591/2591-0.txt [102927 tokens]
39
+ - https://www.gutenberg.org/files/503/503-0.txt [138353 tokens]
40
+ - https://www.gutenberg.org/cache/epub/69739/pg69739.txt [51035 tokens]
41
+ - https://www.gutenberg.org/files/2435/2435-0.txt [98791 tokens]
42
+ - https://www.gutenberg.org/cache/epub/7871/pg7871.txt [49410 tokens]
43
+ - https://www.gutenberg.org/files/8933/8933-0.txt [178622 tokens]
44
+ - https://www.gutenberg.org/cache/epub/30834/pg30834.txt [58359 tokens]
45
+ - https://www.gutenberg.org/cache/epub/68589/pg68589.txt [39815 tokens]
46
+ - https://www.gutenberg.org/cache/epub/34453/pg34453.txt [69365 tokens]
47
+ - https://www.gutenberg.org/cache/epub/8653/pg8653.txt [35351 tokens]
48
+
49
+ [Total tokens in actual dataset: 1002654 tokens]
50
+
51
 
52
  ## Training procedure
53
+ The dataset was loaded and sampled by paragraph. It was then split 80-20 into a training set
54
+ and a validation set, and both sets were tokenized. The model was then set up, and the trainer
55
+ was instantiated with the training arguments listed below. Finally, the model was trained.
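
The card does not include the preprocessing code, so the following is only a minimal sketch of the paragraph sampling and 80-20 split described above. It uses plain Python rather than whatever library calls were actually used. The helper names `split_paragraphs` and `train_validation_split`, the blank-line paragraph delimiter, the shuffle, and the seed are all assumptions made for illustration. Tokenization and training are left out.

```python
import random


def split_paragraphs(text):
    """Split raw text into non-empty paragraphs (assumed: separated by blank lines)."""
    normalized = text.replace("\r\n", "\n")
    return [p.strip() for p in normalized.split("\n\n") if p.strip()]


def train_validation_split(items, train_fraction=0.8, seed=0):
    """Shuffle a copy of the items and cut it into training and validation parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed is an assumption, for repeatability
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in for the concatenated fairy-tale text file (hypothetical sample data).
sample = "\n\n".join(f"Paragraph {i}." for i in range(10))
paragraphs = split_paragraphs(sample)
train, validation = train_validation_split(paragraphs)
print(len(train), len(validation))
```

Each paragraph ends up in exactly one of the two sets, so validation text is never seen during fine-tuning.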
56
 
57
  ### Training hyperparameters
58
 
 
78
  - Transformers 4.26.1
79
  - Pytorch 1.13.1+cu116
80
  - Datasets 2.10.0
81
+ - Tokenizers 0.13.2