ugiugi committed
Commit 7b44c52 · 1 Parent(s): fcd27e8

Update README.md

Files changed (1): README.md (+113 -1)

README.md CHANGED

---
license: afl-3.0
language:
- en
---
## T5-like span-masked language modeling

In the following, we demonstrate how to train a T5 model using the span-masked language model
objective as proposed in [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683).
More specifically, we demonstrate how JAX/Flax can be leveraged
to pre-train [**`google/t5-v1_1-base`**](https://huggingface.co/google/t5-v1_1-base)
in Norwegian on a single TPUv3-8 pod.

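To make the objective concrete: contiguous spans of the input are dropped out and replaced by sentinel tokens, and the model is trained to reconstruct the dropped spans. Below is a minimal sketch of one corrupted example, adapted from the paper; the `<extra_id_*>` sentinel names follow the 🤗 Transformers T5 convention, and the actual masking is performed by the data collator inside the example script.

```python
# Illustrative only: T5-style span corruption on a single sentence.
original = "Thank you for inviting me to your party last week."

# Encoder input: masked spans are replaced by unique sentinel tokens.
model_input = "Thank you <extra_id_0> me to your party <extra_id_1> week."

# Decoder target: each sentinel followed by the span it replaced,
# terminated by a final sentinel.
target = "<extra_id_0> for inviting <extra_id_1> last <extra_id_2>"
```
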
The example script uses the 🤗 Datasets library. You can easily customize it to your needs if you require extra processing on your datasets.

Let's start by creating a model repository to save the trained model and logs.
Here we call the model `"norwegian-t5-base"`, but you can change the model name as you like.

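You can create the repository directly on the Hub, or let the `--push_to_hub` flag used later create it for you. As one optional alternative (an assumption about your setup, not part of the original steps), the Hugging Face CLI can create it from the command line, provided you are logged in via `huggingface-cli login`:

```bash
# Optional: create the Hub repository up front (requires prior `huggingface-cli login`).
huggingface-cli repo create norwegian-t5-base
```
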
To set up all relevant files for training, let's create a directory and change into it:

```bash
mkdir norwegian-t5-base
cd ./norwegian-t5-base
```

### Train tokenizer

In the first step, we train a tokenizer to efficiently process the text input for the model.
We make use of the [tokenizers](https://github.com/huggingface/tokenizers) library to train
a SentencePiece unigram tokenizer as shown in [t5_tokenizer_model.py](https://github.com/huggingface/transformers/tree/main/examples/flax/language-modeling/t5_tokenizer_model.py),
which is heavily inspired by [yandex-research/DeDLOC's tokenizer model](https://github.com/yandex-research/DeDLOC/blob/5c994bc64e573702a9a79add3ecd68b38f14b548/sahajbert/tokenizer/tokenizer_model.py).

The tokenizer is trained on the complete Norwegian dataset of OSCAR
and subsequently saved in the model directory we just created.
This can take up to 120 minutes depending on your hardware ☕☕☕.

```python
import datasets

from t5_tokenizer_model import SentencePieceUnigramTokenizer


vocab_size = 32_000
input_sentence_size = None

# Initialize a dataset
dataset = datasets.load_dataset("oscar", name="unshuffled_deduplicated_no", split="train")

tokenizer = SentencePieceUnigramTokenizer(unk_token="<unk>", eos_token="</s>", pad_token="<pad>")


# Build an iterator over this dataset
def batch_iterator(input_sentence_size=None):
    if input_sentence_size is None:
        input_sentence_size = len(dataset)
    batch_length = 100
    for i in range(0, input_sentence_size, batch_length):
        yield dataset[i: i + batch_length]["text"]


# Train tokenizer
tokenizer.train_from_iterator(
    iterator=batch_iterator(input_sentence_size=input_sentence_size),
    vocab_size=vocab_size,
    show_progress=True,
)

# Save files to disk
tokenizer.save("./norwegian-t5-base/tokenizer.json")
```

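As an optional sanity check (not part of the original example), the saved `tokenizer.json` can be loaded back with the `tokenizers` library and used to encode a sample sentence:

```python
from tokenizers import Tokenizer

# Reload the trained tokenizer and encode a short Norwegian sentence.
tokenizer = Tokenizer.from_file("./norwegian-t5-base/tokenizer.json")
encoding = tokenizer.encode("Dette er en test.")
print(encoding.tokens)
print(encoding.ids)
```
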
### Create configuration

Next, we create the model's configuration file. This is as simple
as loading and storing [**`google/t5-v1_1-base`**](https://huggingface.co/google/t5-v1_1-base)
in the local model folder:

```python
from transformers import T5Config

config = T5Config.from_pretrained("google/t5-v1_1-base", vocab_size=tokenizer.get_vocab_size())
config.save_pretrained("./norwegian-t5-base")
```

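Optionally, individual configuration values can be overridden at this point. For example, the original T5 v1.1 checkpoints were pre-trained without dropout (dropout is typically re-enabled for fine-tuning), so a sketch that mirrors this recipe, assuming you want to follow it, would be:

```python
from transformers import T5Config

# Same as above, but with dropout disabled for pre-training (optional).
config = T5Config.from_pretrained(
    "google/t5-v1_1-base", vocab_size=tokenizer.get_vocab_size(), dropout_rate=0.0
)
config.save_pretrained("./norwegian-t5-base")
```
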
Great, we have set up our model repository. During training, we will automatically
push the training logs and model weights to the repo.

### Train model

Next, we can run the example script to pretrain the model:

```bash
python run_t5_mlm_flax.py \
    --output_dir="./norwegian-t5-base" \
    --model_type="t5" \
    --config_name="./norwegian-t5-base" \
    --tokenizer_name="./norwegian-t5-base" \
    --dataset_name="oscar" \
    --dataset_config_name="unshuffled_deduplicated_no" \
    --max_seq_length="512" \
    --per_device_train_batch_size="32" \
    --per_device_eval_batch_size="32" \
    --adafactor \
    --learning_rate="0.005" \
    --weight_decay="0.001" \
    --warmup_steps="2000" \
    --overwrite_output_dir \
    --logging_steps="500" \
    --save_steps="10000" \
    --eval_steps="2500" \
    --push_to_hub
```

Training should converge at a loss and accuracy
of 2.36 and 57.0, respectively, after 3 epochs on a single TPUv3-8.
This should take around 4.5 hours.
Training statistics can be accessed directly on the 🤗 [hub](https://huggingface.co/patrickvonplaten/t5-base-norwegian/tensorboard).

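Once training has finished, the checkpoint saved in `./norwegian-t5-base` (and pushed to the Hub) can be loaded for evaluation or fine-tuning. A minimal sketch, assuming the default Flax weights produced by the script:

```python
from transformers import FlaxT5ForConditionalGeneration

# Load the pre-trained checkpoint from the local output directory
# (a Hub model id such as "<your-username>/norwegian-t5-base" works as well).
model = FlaxT5ForConditionalGeneration.from_pretrained("./norwegian-t5-base")
print(model.config.vocab_size)  # should match the tokenizer's vocabulary size
```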