Update README.md
---
license: afl-3.0
language:
- en
---

## T5-like span-masked language modeling

In the following, we demonstrate how to train a T5 model using the span-masked language model
objective as proposed in [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683).
More specifically, we demonstrate how JAX/Flax can be leveraged
to pre-train [**`google/t5-v1_1-base`**](https://huggingface.co/google/t5-v1_1-base)
in Norwegian on a single TPUv3-8 pod.
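
Concretely, the span-masked objective replaces contiguous spans of the input with sentinel tokens and trains the model to generate the dropped-out spans. A toy illustration using T5's sentinel-token convention (the sentence is just an illustrative example; the actual script picks spans at random):

```python
# Toy illustration of span corruption (not part of the training script)
original = "Thank you for inviting me to your party last week"

# Encoder input: each masked span is replaced by a unique sentinel token
inputs = "Thank you <extra_id_0> me to your party <extra_id_1> week"

# Decoder target: the dropped-out spans, delimited by the same sentinels
targets = "<extra_id_0> for inviting <extra_id_1> last <extra_id_2>"
```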

The example script uses the 🤗 Datasets library. You can easily customize it to your needs if you need extra processing on your datasets.
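
For example, if your corpus is a set of local text files rather than OSCAR, 🤗 Datasets can load it directly (the file name below is a placeholder, not part of the original walkthrough):

```python
import datasets

# Hypothetical local corpus instead of OSCAR; one example per line
dataset = datasets.load_dataset(
    "text", data_files={"train": "my_norwegian_corpus.txt"}, split="train"
)
```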

Let's start by creating a model repository to save the trained model and logs.
Here we call the model `"norwegian-t5-base"`, but you can change the model name as you like.

To set up all relevant files for training, let's create a directory.

```bash
mkdir ./norwegian-t5-base
```

### Train tokenizer

In the first step, we train a tokenizer to efficiently process the text input for the model.
We make use of the [tokenizers](https://github.com/huggingface/tokenizers) library to train
a sentencepiece unigram tokenizer as shown in [t5_tokenizer_model.py](https://github.com/huggingface/transformers/tree/main/examples/flax/language-modeling/t5_tokenizer_model.py),
which is heavily inspired by [yandex-research/DeDLOC's tokenizer model](https://github.com/yandex-research/DeDLOC/blob/5c994bc64e573702a9a79add3ecd68b38f14b548/sahajbert/tokenizer/tokenizer_model.py).

The tokenizer is trained on the complete Norwegian dataset of OSCAR
and subsequently saved in the model directory we just created.
This can take up to 120 minutes depending on your hardware ☕☕☕.

```python
import datasets

from t5_tokenizer_model import SentencePieceUnigramTokenizer


vocab_size = 32_000
input_sentence_size = None

# Initialize a dataset
dataset = datasets.load_dataset("oscar", name="unshuffled_deduplicated_no", split="train")

tokenizer = SentencePieceUnigramTokenizer(unk_token="<unk>", eos_token="</s>", pad_token="<pad>")


# Build an iterator over this dataset
def batch_iterator(input_sentence_size=None):
    if input_sentence_size is None:
        input_sentence_size = len(dataset)
    batch_length = 100
    for i in range(0, input_sentence_size, batch_length):
        yield dataset[i: i + batch_length]["text"]


# Train tokenizer
tokenizer.train_from_iterator(
    iterator=batch_iterator(input_sentence_size=input_sentence_size),
    vocab_size=vocab_size,
    show_progress=True,
)

# Save files to disk
tokenizer.save("./norwegian-t5-base/tokenizer.json")
```
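
As an optional sanity check (not part of the original walkthrough), the saved `tokenizer.json` can be loaded back with the `tokenizers` library and tried on a sample sentence:

```python
from tokenizers import Tokenizer

# Reload the tokenizer we just saved and inspect its output on a sample sentence
tokenizer = Tokenizer.from_file("./norwegian-t5-base/tokenizer.json")
print(tokenizer.encode("Dette er en test.").tokens)
```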

### Create configuration

Next, we create the model's configuration file. This is as simple
as loading and storing [**`google/t5-v1_1-base`**](https://huggingface.co/google/t5-v1_1-base)
in the local model folder:

```python
from transformers import T5Config

config = T5Config.from_pretrained("google/t5-v1_1-base", vocab_size=tokenizer.get_vocab_size())
config.save_pretrained("./norwegian-t5-base")
```
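
If you want to smoke-test the pipeline on smaller hardware first, the same configuration can be scaled down before saving; the sizes below are arbitrary assumptions for a test run, not tuned values:

```python
from transformers import T5Config

# Hypothetical scaled-down config for a quick local test run
small_config = T5Config.from_pretrained(
    "google/t5-v1_1-base",
    vocab_size=32_000,
    d_model=256,            # hidden size, down from 768
    d_ff=1024,              # feed-forward size, down from 2048
    num_layers=4,           # encoder layers, down from 12
    num_decoder_layers=4,   # decoder layers, down from 12
    num_heads=4,            # attention heads, down from 12
)
small_config.save_pretrained("./norwegian-t5-base-small")
```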

Great, we have set up our model repository. During training, we will automatically
push the training logs and model weights to the repo.

### Train model

Next, we can run the example script to pretrain the model:

```bash
python run_t5_mlm_flax.py \
    --output_dir="./norwegian-t5-base" \
    --model_type="t5" \
    --config_name="./norwegian-t5-base" \
    --tokenizer_name="./norwegian-t5-base" \
    --dataset_name="oscar" \
    --dataset_config_name="unshuffled_deduplicated_no" \
    --max_seq_length="512" \
    --per_device_train_batch_size="32" \
    --per_device_eval_batch_size="32" \
    --adafactor \
    --learning_rate="0.005" \
    --weight_decay="0.001" \
    --warmup_steps="2000" \
    --overwrite_output_dir \
    --logging_steps="500" \
    --save_steps="10000" \
    --eval_steps="2500" \
    --push_to_hub
```

Training should converge at a loss and accuracy
of 2.36 and 57.0 respectively after 3 epochs on a single TPUv3-8.
This should take around 4.5 hours.
Training statistics can be accessed directly on the 🤗 [hub](https://huggingface.co/patrickvonplaten/t5-base-norwegian/tensorboard).
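
Once training has finished, the saved checkpoint can be loaded back for evaluation or fine-tuning; a minimal sketch, assuming the default Flax checkpoint layout in the output directory above:

```python
from transformers import FlaxT5ForConditionalGeneration

# Load the pretrained Flax T5 weights from the local output directory
model = FlaxT5ForConditionalGeneration.from_pretrained("./norwegian-t5-base")
```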