vicclab/FolkGPT

@@ -5,28 +5,54 @@ tags:
 model-index:
 - name: FolkGPT
   results: []
 ---
-<!-- This model card has been generated automatically according to the information the Trainer had access to. You
-should probably proofread and complete it, then remove this comment. -->
 # FolkGPT
-This model is a fine-tuned version of [gpt2](https://huggingface.co/gpt2) on an unknown dataset.
 ## Model description
-More information needed
 ## Intended uses & limitations
-More information needed
 ## Training and evaluation data
-More information needed
 ## Training procedure
 ### Training hyperparameters
@@ -52,4 +78,4 @@ The following hyperparameters were used during training:
 - Transformers 4.26.1
 - Pytorch 1.13.1+cu116
 - Datasets 2.10.0
-- Tokenizers 0.13.2

 model-index:
 - name: FolkGPT
   results: []
+datasets:
+- vicclab/fairy_tales
+language:
+- en
+pipeline_tag: text-generation
 ---
 # FolkGPT
+This model is a fine-tuned version of [gpt2](https://huggingface.co/gpt2) on the vicclab/fairy_tales dataset.
 ## Model description
+This model is the result of fine-tuning gpt2 on a dataset of fairy tales from various cultures.
 ## Intended uses & limitations
+The idea behind this model is to generate text in the style of fairy tales written in the 18th and 19th centuries.
+Why? Fairy tales seemed an appropriate application for text generation, as the stories are usually short(ish),
+self-contained, and easy to read.
 ## Training and evaluation data
+The model was trained on the vicclab/fairy_tales dataset. The dataset consists of a number of texts that
+were downloaded from Project Gutenberg and then edited to remove all text except for the
+stories themselves. These were then concatenated into a single text file and pushed to the Hugging Face Hub at
+https://huggingface.co/datasets/vicclab/fairy_tales. The latest update to the dataset, which
+was used to train this model, was created and uploaded on February 26th, 2023.
+Texts used [and token count after removing boilerplate text]:
+- https://www.gutenberg.org/files/2591/2591-0.txt [102927 tokens]
+- https://www.gutenberg.org/files/503/503-0.txt [138353 tokens]
+- https://www.gutenberg.org/cache/epub/69739/pg69739.txt [51035 tokens]
+- https://www.gutenberg.org/files/2435/2435-0.txt [98791 tokens]
+- https://www.gutenberg.org/cache/epub/7871/pg7871.txt [49410 tokens]
+- https://www.gutenberg.org/files/8933/8933-0.txt [178622 tokens]
+- https://www.gutenberg.org/cache/epub/30834/pg30834.txt [58359 tokens]
+- https://www.gutenberg.org/cache/epub/68589/pg68589.txt [39815 tokens]
+- https://www.gutenberg.org/cache/epub/34453/pg34453.txt [69365 tokens]
+- https://www.gutenberg.org/cache/epub/8653/pg8653.txt [35351 tokens]
+
+[Total tokens in actual dataset: 1002654 tokens]
 ## Training procedure
+The dataset was loaded and sampled by paragraph, then split 80-20 into a training dataset
+and a validation dataset. Both were then tokenized. The model was set up, and the trainer
+was instantiated with the training arguments listed below. Training was then run.
 ### Training hyperparameters
 - Transformers 4.26.1
 - Pytorch 1.13.1+cu116
 - Datasets 2.10.0
+- Tokenizers 0.13.2
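
The "Training procedure" text in the card describes sampling the dataset by paragraph and splitting it 80-20 into training and validation sets. The card does not include the code that did this, so the following is only a minimal sketch under stated assumptions: paragraphs are taken to be separated by blank lines, the split is a seeded random shuffle, and the function names `split_paragraphs` and `train_validation_split` are hypothetical. Tokenization and the trainer setup would follow this step and are not shown.

```python
import random


def split_paragraphs(text):
    """Split raw text into non-empty paragraphs (assumed: blank-line separated)."""
    blocks = text.replace("\r\n", "\n").split("\n\n")
    # Collapse internal whitespace and drop empty blocks left by extra blank lines.
    return [" ".join(block.split()) for block in blocks if block.strip()]


def train_validation_split(paragraphs, validation_fraction=0.2, seed=42):
    """Shuffle paragraphs and return (train, validation) lists.

    The 0.2 default mirrors the 80-20 split described in the card;
    the seed value is an arbitrary choice for reproducibility.
    """
    shuffled = list(paragraphs)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


# Tiny stand-in corpus; the real input would be the concatenated fairy tale text.
sample = "\n\n".join(f"Once upon a time, tale number {i}." for i in range(10))
paragraphs = split_paragraphs(sample)
train, validation = train_validation_split(paragraphs)
print(len(train), len(validation))  # prints "8 2"
```

Splitting at the paragraph level before tokenizing keeps each training example a coherent unit of story text, which is one plausible reading of "sampling by paragraph"; the original run may have used a different mechanism.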