Commit ce67844 · Parent(s): ff757c0 · "added citation"

README.md (CHANGED)
## Finetuning using our code with TF 1.15.4:

Create the Training TFRecords:

```bash
python create_pretraining_data.py \
    --input_file=<RAW TEXT FILE with documents/articles separated by an empty line> \
    --tokenizer_dir=<Directory with the GPT2 Tokenizer files>
```
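The `--input_file` flag above expects plain text with one document (or article) per block, blocks separated by an empty line. A minimal Python sketch of producing and re-reading that format (file name and document contents are hypothetical):

```python
# Hypothetical documents; in practice these would be your raw articles.
documents = [
    "First article text.\nIt can span multiple lines.",
    "Second article text.",
]

# Write one document per block, separated by a single empty line,
# which is the layout create_pretraining_data.py expects.
with open("raw_corpus.txt", "w", encoding="utf-8") as f:
    f.write("\n\n".join(documents) + "\n")

# Sanity check: splitting on blank lines recovers the documents.
with open("raw_corpus.txt", encoding="utf-8") as f:
    recovered = [d for d in f.read().split("\n\n") if d.strip()]

print(len(recovered))  # 2
```
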

Finetuning:

```bash
python3 run_pretraining.py \
    --input_file="gs://<GS_BUCKET>/pretraining_data/*" \
    --output_dir="gs://<GS_BUCKET>/pretraining_model/" \
    --config_file="config/small_hparams.json" \
    --batch_size=128 \
    --eval_batch_size=8 \
    --num_train_steps= \
    --num_warmup_steps= \
    --learning_rate= \
    --save_checkpoints_steps= \
    --max_seq_length=1024 \
    --max_eval_steps= \
    --optimizer="lamb" \
    --iterations_per_loop=5000 \
    --keep_checkpoint_max=10 \
    --use_tpu=True \
    --tpu_name=<TPU NAME> \
    --do_train=True \
    --do_eval=False
```
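The `--num_train_steps`, `--num_warmup_steps`, and `--learning_rate` values are left for you to fill in. For intuition, BERT-style pretraining scripts like this one typically ramp the learning rate up linearly over the warmup steps, then decay it linearly to zero by the last training step. A sketch of that schedule, with hypothetical values, not the repo's actual implementation:

```python
def learning_rate_at(step, base_lr, num_train_steps, num_warmup_steps):
    """Linear warmup followed by linear decay (illustrative only)."""
    if step < num_warmup_steps:
        # Ramp linearly from 0 up to base_lr over the warmup phase.
        return base_lr * step / num_warmup_steps
    # Decay linearly back to 0 by the final training step.
    return base_lr * (num_train_steps - step) / (num_train_steps - num_warmup_steps)

# Hypothetical settings: halfway through warmup, the rate is half of base_lr.
lr = learning_rate_at(step=500, base_lr=1e-4, num_train_steps=10000, num_warmup_steps=1000)
print(lr)
```
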

# Model Sizes

# If you used this model please cite us as:

```
@inproceedings{antoun-etal-2021-aragpt2,
    title = "{A}ra{GPT}2: Pre-Trained Transformer for {A}rabic Language Generation",
    author = "Antoun, Wissam and
      Baly, Fady and
      Hajj, Hazem",
    booktitle = "Proceedings of the Sixth Arabic Natural Language Processing Workshop",
    month = apr,
    year = "2021",
    address = "Kyiv, Ukraine (Virtual)",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.wanlp-1.21",
    pages = "196--207",
}
```

# Acknowledgments

Thanks to the TensorFlow Research Cloud (TFRC) program for free access to Cloud TPUs; we couldn't have done this without it. Thanks also to the [AUB MIND Lab](https://sites.aub.edu.lb/mindlab/) members for their continuous support, to [Yakshof](https://www.yakshof.com/#/) and Assafir for data and storage access, and to [Habib Rahal](https://www.behance.net/rahalhabib) for putting a face to AraBERT.

# Contacts

**Wissam Antoun**: [Linkedin](https://www.linkedin.com/in/wissam-antoun-622142b4/) | [Twitter](https://twitter.com/wissam_antoun) | [Github](https://github.com/WissamAntoun) | <wfa07@mail.aub.edu> | <wissam.antoun@gmail.com>