# SlEng-bert

SlEng-bert is a bilingual, Slovene-English masked language model.

The model can be loaded directly with the Hugging Face `transformers` library:

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("cjvt/sleng-bert")
model = AutoModelForMaskedLM.from_pretrained("cjvt/sleng-bert")
```
SlEng-bert was trained from scratch on conversational, non-standard, and slang language in Slovene and English. The model has 12 transformer layers and is roughly equal in size to the BERT and RoBERTa base models. The only pre-training task was masked language modeling; no auxiliary tasks (such as next sentence prediction, NSP) were used.

The tokenizer and training corpora of SlEng-bert were also used to train the SloBERTa-SlEng model. The difference between the two is that SlEng-bert was trained from scratch for 40 epochs, while SloBERTa-SlEng is SloBERTa further pre-trained for 2 epochs on the new corpora.
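To make the masked language modeling objective concrete, here is a minimal, illustrative sketch of BERT-style input corruption (the `mask_tokens` helper, the 15% rate, and the 80/10/10 split are standard BERT conventions assumed here, not taken from the SlEng-bert training code):

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", vocab=None, mlm_prob=0.15, rng=None):
    """Illustrative BERT-style MLM corruption: each token is selected
    with probability mlm_prob; a selected token is replaced by [MASK]
    80% of the time, by a random vocabulary token 10% of the time, and
    left unchanged 10% of the time. Labels mark the positions the model
    must predict; unselected positions are not scored."""
    rng = rng or random.Random(0)
    vocab = vocab or tokens  # fall back to the input tokens as a toy vocabulary
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mlm_prob:
            labels.append(tok)  # the model must recover the original token here
            r = rng.random()
            if r < 0.8:
                corrupted.append(mask_token)
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))
            else:
                corrupted.append(tok)
        else:
            labels.append(None)  # this position does not contribute to the loss
            corrupted.append(tok)
    return corrupted, labels
```

During pre-training, the model receives the corrupted sequence and is trained to predict the original tokens at the labeled positions only.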
## Training corpora

The model was trained on English and Slovene tweets, the Slovene corpora MaCoCu and FRENK, and a small subset of the English OSCAR corpus. We tried to keep the English and Slovene portions as equal in size as possible. In total, the training corpora contained about 2.7 billion words.
The model can also be used through the high-level `fill-mask` pipeline:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("fill-mask", model="cjvt/sleng-bert")
# e.g. pipe(f"Ljubljana je glavno mesto {pipe.tokenizer.mask_token}.")
```