How to use vesteinn/ScandiBERT with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("fill-mask", model="vesteinn/ScandiBERT")

# Load model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("vesteinn/ScandiBERT")
model = AutoModelForMaskedLM.from_pretrained("vesteinn/ScandiBERT")

Note: The model was updated on 2022-09-27.
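As a usage sketch for the fill-mask pipeline, the helper below fills one masked slot and returns the top candidates. The names `top_predictions` and `demo` and the Danish example sentence are illustrative assumptions, not part of the model card. The mask token is read from `pipe.tokenizer.mask_token` rather than hard-coded, since the exact token string depends on the tokenizer.

```python
from typing import Callable, List, Tuple


def top_predictions(
    pipe: Callable, template: str, mask_token: str, k: int = 5
) -> List[Tuple[str, float]]:
    """Fill the single `{mask}` slot in `template`; return (token, score) pairs.

    `pipe` is expected to behave like a Transformers fill-mask pipeline:
    called on a string with one mask, it returns a list of dicts that
    contain at least the keys "token_str" and "score".
    """
    text = template.format(mask=mask_token)
    results = pipe(text, top_k=k)
    return [(r["token_str"].strip(), float(r["score"])) for r in results]


def demo() -> None:
    """Run the helper against the real model (downloads weights on first use)."""
    from transformers import pipeline

    pipe = pipeline("fill-mask", model="vesteinn/ScandiBERT")
    # Hypothetical example sentence (Danish): "Copenhagen is the capital of ___."
    template = "København er hovedstaden i {mask}."
    for token, score in top_predictions(pipe, template, pipe.tokenizer.mask_token):
        print(f"{token}\t{score:.3f}")
```

Calling `demo()` prints one candidate token per line with its score. Passing the pipeline in as an argument keeps the helper easy to test with a stub and to reuse with other fill-mask models.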
The model was trained on the data shown in the table below. The batch size was 8.8k, and the model was trained for 72 epochs on 24 V100 cards for about two weeks.
| Language | Data | Size |
|---|---|---|
| Icelandic | See IceBERT paper | 16 GB |
| Danish | Danish Gigaword Corpus (incl. Twitter) | 4.7 GB |
| Norwegian | NCC corpus | 42 GB |
| Swedish | Swedish Gigaword Corpus | 3.4 GB |
| Faroese | FC3 + Sosialurinn + Bible | 69 MB |
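To make the language balance of the training data concrete, the following sketch computes each language's share of the total corpus from the sizes in the table. It assumes that 69 MB can be taken as 0.069 GB, and that "about 2 weeks" means 14 full days on all 24 cards, so the GPU-hour figure is only a rough upper-level estimate, not a reported number.

```python
# Corpus sizes in GB, copied from the table (69 MB taken as 0.069 GB).
corpus_gb = {
    "Icelandic": 16.0,
    "Danish": 4.7,
    "Norwegian": 42.0,
    "Swedish": 3.4,
    "Faroese": 0.069,
}

total_gb = sum(corpus_gb.values())
shares = {lang: size / total_gb for lang, size in corpus_gb.items()}

# Print languages from largest to smallest share of the training data.
for lang, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang:<10} {corpus_gb[lang]:>7.3f} GB  {share:6.1%}")
print(f"{'Total':<10} {total_gb:>7.3f} GB")

# Rough compute estimate: 24 cards * 14 days * 24 hours (assumed continuous use).
gpu_hours = 24 * 14 * 24
print(f"Approximate GPU-hours: {gpu_hours}")
```

Under these assumptions, Norwegian accounts for well over half of the data by size, while Faroese makes up a small fraction of one percent.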
Note: A half-trained model was uploaded here at an earlier date. It has since been removed, and the model has been updated.
This is a Scandinavian BERT model trained on a large collection of Danish, Faroese, Icelandic, Norwegian and Swedish text. It is currently the highest-ranking model on the ScandEval leaderboard: https://scandeval.github.io/pretrained/
If you find this model useful, please cite:
@inproceedings{snaebjarnarson-etal-2023-transfer,
title = "{T}ransfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese",
author = "Snæbjarnarson, Vésteinn and
Simonsen, Annika and
Glavaš, Goran and
Vulić, Ivan",
booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
month = "may 22--24",
year = "2023",
address = "Tórshavn, Faroe Islands",
publisher = {Link{\"o}ping University Electronic Press, Sweden},
}