# SlEng-bert

SlEng-bert is a bilingual, Slovene-English masked language model.

The model can be loaded directly with the Hugging Face `transformers` library:

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("cjvt/sleng-bert")
model = AutoModelForMaskedLM.from_pretrained("cjvt/sleng-bert")
```
SlEng-bert was trained from scratch on conversational, non-standard, and slang language in Slovene and English. The model has 12 transformer layers and is roughly equal in size to the BERT and RoBERTa base models. The only pre-training task was masked language modeling; no auxiliary tasks (such as next sentence prediction, NSP) were used.

The tokenizer and training corpora of SlEng-bert were also used to train the SloBERTa-SlEng model. The difference between the two is that SlEng-bert was trained from scratch for 40 epochs, while SloBERTa-SlEng is SloBERTa further pre-trained for 2 epochs on the new corpora.
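To make the masked language modeling objective concrete, here is a minimal, illustrative sketch of BERT-style input corruption (the `mask_tokens` helper, the 15% rate, and the 80/10/10 split are standard BERT conventions assumed here, not taken from the SlEng-bert training code):

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", vocab=None, mlm_prob=0.15, rng=None):
    """Illustrative BERT-style MLM corruption: each token is selected
    with probability mlm_prob; a selected token is replaced by [MASK]
    80% of the time, by a random vocabulary token 10% of the time, and
    left unchanged 10% of the time. Labels mark the positions the model
    must predict; unselected positions are not scored."""
    rng = rng or random.Random(0)
    vocab = vocab or tokens  # fall back to the input tokens as a toy vocabulary
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mlm_prob:
            labels.append(tok)  # the model must recover the original token here
            r = rng.random()
            if r < 0.8:
                corrupted.append(mask_token)
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))
            else:
                corrupted.append(tok)
        else:
            labels.append(None)  # this position does not contribute to the loss
            corrupted.append(tok)
    return corrupted, labels
```

During pre-training, the model receives the corrupted sequence and is trained to predict the original tokens at the labeled positions only.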
## Training corpora

The model was trained on English and Slovene tweets, the Slovene corpora MaCoCu and FRENK, and a small subset of the English OSCAR corpus. We tried to keep the English and Slovene portions as equal in size as possible. In total, the training corpora contained about 2.7 billion words.
The model can also be used through the high-level `fill-mask` pipeline:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("fill-mask", model="cjvt/sleng-bert")
# e.g. pipe(f"Ljubljana je glavno mesto {pipe.tokenizer.mask_token}.")
```