# GPT2-Spanish

GPT2-Spanish is a language generation model trained from scratch with 11.5GB of Spanish texts and with a Byte Pair Encoding (BPE) tokenizer that was trained for this purpose. The parameters used are the same as the medium version of the original OpenAI GPT2 model.
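For concreteness, the GPT-2 medium hyperparameters referred to above are the following (values taken from the public OpenAI GPT-2 medium release, not from this repository; the field names follow the Hugging Face `GPT2Config` convention and the dict itself is only illustrative):

```python
# GPT-2 "medium" hyperparameters (public OpenAI release), which this model
# reuses according to the description above. Illustrative reference only.
GPT2_MEDIUM = {
    "n_layer": 24,        # transformer blocks
    "n_head": 16,         # attention heads per block
    "n_embd": 1024,       # hidden (embedding) size
    "n_positions": 1024,  # maximum input length, matching the 1024-token inputs
    "vocab_size": 50257,  # matching the BPE vocabulary size described below
}
print(GPT2_MEDIUM)
```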
## Corpus

This model was trained with a corpus of 11.5GB of texts corresponding to 3.5GB of Wikipedia articles and 8GB of books (narrative, short stories, theater, poetry, essays, and popularization).

## Tokenizer

The texts are tokenized using a byte-level version of Byte Pair Encoding (BPE) (for Unicode characters) and a vocabulary size of 50257. The inputs are sequences of 1024 consecutive tokens.

Apart from the special token "<|endoftext|>" for text ending in the OpenAI GPT-2
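As a rough sketch of what the byte-level tokenization and the 1024-token input windows described above involve (a toy illustration in plain Python; the merge table here is invented, and the real tokenizer's learned merges and vocabulary are not reproduced):

```python
def bpe_encode(text, merges):
    """Toy byte-level BPE: start from the raw UTF-8 bytes of the text, then
    apply merge rules in priority order. Starting from bytes means any
    Unicode character can be encoded without an <unk> token."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    for pair in merges:  # hypothetical learned merges, in priority order
        i = 0
        while i < len(tokens) - 1:
            if (tokens[i], tokens[i + 1]) == pair:
                tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
            else:
                i += 1
    return tokens

def to_training_windows(token_ids, block_size=1024):
    """Cut a token-id stream into the fixed-length consecutive windows the
    model consumes (the trailing remainder is dropped in this sketch)."""
    return [token_ids[i:i + block_size]
            for i in range(0, len(token_ids) - block_size + 1, block_size)]

# Invented merge table, for demonstration only.
merges = [(b"e", b"s"), (b"es", b"t")]
print(bpe_encode("estes", merges))  # → [b'est', b'es']
```

The byte-level fallback is what makes the "(for Unicode characters)" note above work: characters outside the learned vocabulary still decompose into known single-byte tokens.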
The model and tokenizer were trained using the Hugging Face libraries with an Nvidia Tesla V100 GPU with 16GB of memory on Google Colab servers.

## Authors

The authors of this model have been anonymized because it is currently under review for publication at INLG2021.

Thanks to the members of the community who collaborated with funding for the initial tests.