Nuri-Tas committed
Commit 8d3c2bf · 1 Parent(s): 66fb711

Update README.md

Files changed (1): README.md +4 -3
README.md CHANGED
@@ -1,5 +1,5 @@
-RoBERTurk is pretrained on Oscar Turkish Split (27GB) and a small chunk of C4 Turkish Split (1GB) with sentencepiece BPE tokenizer that is trained on randomly selected 30M sentences from the training data which is composed of 90M sentences. The training data in total contains 5.3B tokens and the vocabulary size is 50K.
-
+RoBERTurk is pretrained on the OSCAR Turkish split (27GB) and a small chunk of the C4 Turkish split (1GB) with a SentencePiece BPE tokenizer trained on 30M sentences randomly selected from the training data, which comprises 90M sentences. In total, the training data contains 5.3B tokens, and the vocabulary size is 50K.
+The learning rate is warmed up to a peak value of 1e-5 over the first 10K updates and then linearly decayed at a rate of 0.01. The model is pretrained for at most 600K updates, using only sequences of length at most T=256.
 
 ## Tokenizer
 
@@ -26,5 +26,6 @@ Additional TODOs are (although some of them can take some time and I may include
 
 - Using Zemberek as an alternative tokenizer
 - Adjusting masking algorithm to be able to mask morphologies besides only complete words
-- Preferably pretraining BPE on the complete training data
+- Preferably pretraining BPE on the whole training data
 - Pretraining with 512 max sequence length + more data
+
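The learning-rate schedule described in the updated README (linear warmup to a peak of 1e-5 over the first 10K updates, then linear decay across a 600K-update budget) can be sketched as follows. Decaying to exactly zero at the final update is an assumption for illustration; the README only says the rate is "linearly decayed", and the function name `lr_at` is hypothetical.

```python
# Sketch of the schedule from the README: linear warmup to the peak
# learning rate over the first 10K updates, then linear decay over the
# remaining updates. Decaying to 0 at step 600K is an assumption.

PEAK_LR = 1e-5
WARMUP_STEPS = 10_000
TOTAL_STEPS = 600_000


def lr_at(step: int) -> float:
    """Learning rate at a given update step."""
    if step < WARMUP_STEPS:
        # Linear warmup from 0 to the peak value.
        return PEAK_LR * step / WARMUP_STEPS
    # Linear decay from the peak down to 0 at the final update.
    remaining = max(TOTAL_STEPS - step, 0)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)
```

The peak is reached exactly at update 10K, after which the rate falls linearly to zero by update 600K.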