Commit 9f6d74c · Parent: 87071b6

added citation
README.md CHANGED
@@ -63,7 +63,7 @@ Follow the guide linked [here](https://towardsdatascience.com/fine-tuning-gpt2-o
 
 ## Finetuning using our code with TF 1.15.4:
 
-
+Create the Training TFRecords:
 ```bash
 python create_pretraining_data.py
 --input_file=<RAW TEXT FILE with documents/article sperated by an empty line>
@@ -71,7 +71,7 @@ python create_pretraining_data.py
 --tokenizer_dir=<Directory with the GPT2 Tokenizer files>
 ```
 
-
+Finetuning:
 ```bash
 python3 run_pretraining.py \
 --input_file="gs://<GS_BUCKET>/pretraining_data/*" \
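The `--input_file` passed to `create_pretraining_data.py` above is plain text with one document/article per block, blocks separated by an empty line. A minimal sketch of producing such a file (not part of the commit; the file name and sample documents are illustrative assumptions):

```python
# Illustrative only: write a raw corpus file in the layout described by the
# --input_file flag above -- documents separated by an empty line.
# The file name and the two sample documents are made up for this sketch.
docs = [
    "هذا هو المستند الأول. يمكن أن يمتد لعدة جمل.",
    "هذا هو المستند الثاني، يفصله سطر فارغ عن الأول.",
]
with open("raw_corpus.txt", "w", encoding="utf-8") as f:
    f.write("\n\n".join(docs) + "\n")  # blank line between documents
```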
@@ -119,7 +119,7 @@ The pretraining data used for the new AraGPT2 model is also used for **AraBERTv2
 
 The dataset consists of 77GB or 200,095,961 lines or 8,655,948,860 words or 82,232,988,358 chars (before applying Farasa Segmentation)
 
-For the new dataset we added the unshuffled OSCAR corpus
+For the new dataset we added the unshuffled OSCAR corpus after we thoroughly filter it, to the dataset used in AraBERTv1 but without the websites that we previously crawled:
 - OSCAR unshuffled and filtered.
 - [Arabic Wikipedia dump](https://archive.org/details/arwiki-20190201) from 2020/09/01
 - [The 1.5B words Arabic Corpus](https://www.semanticscholar.org/paper/1.5-billion-words-Arabic-Corpus-El-Khair/f3eeef4afb81223df96575adadf808fe7fe440b4)
@@ -133,13 +133,18 @@ The text generated by AraGPT2 is automatically generated by a neural network mod
 # If you used this model please cite us as :
 
 ```
-@
-
-
-
-
-
-
+@inproceedings{antoun-etal-2021-aragpt2,
+    title = "{A}ra{GPT}2: Pre-Trained Transformer for {A}rabic Language Generation",
+    author = "Antoun, Wissam and
+      Baly, Fady and
+      Hajj, Hazem",
+    booktitle = "Proceedings of the Sixth Arabic Natural Language Processing Workshop",
+    month = apr,
+    year = "2021",
+    address = "Kyiv, Ukraine (Virtual)",
+    publisher = "Association for Computational Linguistics",
+    url = "https://www.aclweb.org/anthology/2021.wanlp-1.21",
+    pages = "196--207",
 }
 ```
 
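For context on using the model this commit documents: once finetuning completes and the checkpoint is available for Hugging Face `transformers`, generation looks roughly like the sketch below. The hub ID `aubmindlab/aragpt2-base`, the prompt, and the sampling settings are assumptions for illustration, not taken from the commit.

```python
# A minimal generation sketch with Hugging Face transformers. Assumptions:
# the hub ID below is where a GPT2-compatible checkpoint lives; the prompt
# and sampling settings are arbitrary illustration values.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aubmindlab/aragpt2-base"  # assumed model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("يحكى أن مزارعا", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=64,   # total tokens, prompt included
    do_sample=True,  # sample rather than greedy decode
    top_p=0.95,      # nucleus sampling
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```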