Model Card for AIaLT-IICT/bert_bg_lit_web_base_cased

A cased BERT model trained on Bulgarian literature, Web, and other datasets.

Model Details

A 124M-parameter BERT model trained on 29B tokens (35B depending on tokenization) for 3 epochs with a Masked Language Modelling objective.

Uses

The model is intended to be used as a base model for fine-tuning tasks in NLP.

Direct Use

>>> from transformers import (
...     PreTrainedTokenizerFast,
...     BertForMaskedLM,
...     pipeline,
... )

>>> model = BertForMaskedLM.from_pretrained('AIaLT-IICT/bert_bg_lit_web_base_cased')
>>> tokenizer = PreTrainedTokenizerFast.from_pretrained('AIaLT-IICT/bert_bg_lit_web_base_cased')

>>> fill_mask = pipeline(
...     "fill-mask",
...     model=model,
...     tokenizer=tokenizer,
... )


>>> fill_mask("Заради 3 завода няма да [MASK] нито есенниците неподхранени, нито зърното да поскъпне заради тях.")

[{'score': 0.32779741287231445,
  'token': 17316,
  'token_str': 'останат',
  'sequence': 'Заради 3 завода няма да останат нито есенниците неподхранени, нито зърното да поскъпне заради тях.'},
 {'score': 0.17678387463092804,
  'token': 9978,
  'token_str': 'има',
  'sequence': 'Заради 3 завода няма да има нито есенниците неподхранени, нито зърното да поскъпне заради тях.'},
 {'score': 0.16635966300964355,
  'token': 10396,
  'token_str': 'бъдат',
  'sequence': 'Заради 3 завода няма да бъдат нито есенниците неподхранени, нито зърното да поскъпне заради тях.'},
 {'score': 0.053633157163858414,
  'token': 29795,
  'token_str': 'оставим',
  'sequence': 'Заради 3 завода няма да оставим нито есенниците неподхранени, нито зърното да поскъпне заради тях.'},
 {'score': 0.031064892187714577,
  'token': 9858,
  'token_str': 'са',
  'sequence': 'Заради 3 завода няма да са нито есенниците неподхранени, нито зърното да поскъпне заради тях.'}]

Out-of-Scope Use

The model is not trained with the Next Sentence Prediction objective, so the [CLS] token embedding will not be useful out of the box. If you want to use the model for sequence classification, it is recommended to fine-tune it.
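A minimal fine-tuning sketch for sequence classification; the number of labels, the datasets, and the hyperparameters below are illustrative placeholders, not values from this card (train_dataset and eval_dataset are assumed to be 🤗 Datasets with 'text' and 'label' columns):

from transformers import (
    BertForSequenceClassification,
    PreTrainedTokenizerFast,
    Trainer,
    TrainingArguments,
)

# Load the pretrained encoder with a freshly initialized
# classification head; num_labels=2 is a placeholder for your task.
model = BertForSequenceClassification.from_pretrained(
    'AIaLT-IICT/bert_bg_lit_web_base_cased', num_labels=2
)
tokenizer = PreTrainedTokenizerFast.from_pretrained('AIaLT-IICT/bert_bg_lit_web_base_cased')

def tokenize(batch):
    return tokenizer(batch['text'], truncation=True, max_length=512)

# train_dataset and eval_dataset are assumed to exist with
# 'text' and 'label' columns, e.g. loaded via datasets.load_dataset.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir='bert_bg_cls', num_train_epochs=3),
    train_dataset=train_dataset.map(tokenize, batched=True),
    eval_dataset=eval_dataset.map(tokenize, batched=True),
    tokenizer=tokenizer,
)
trainer.train()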

Recommendations

It is recommended to use the model for token classification and sequence classification fine-tuning tasks. The model can also be used within the SentenceTransformers framework for producing embeddings; see the sketch below.
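A minimal SentenceTransformers wrapper; mean pooling is an assumption that fits the note above, since the [CLS] embedding is not trained:

from sentence_transformers import SentenceTransformer, models

# Wrap the checkpoint as a SentenceTransformer; mean pooling is used
# because the [CLS] token was not trained (no Next Sentence Prediction).
word_embedding = models.Transformer('AIaLT-IICT/bert_bg_lit_web_base_cased', max_seq_length=512)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), pooling_mode='mean')
st_model = SentenceTransformer(modules=[word_embedding, pooling])

embeddings = st_model.encode(['Примерно изречение.', 'Още едно изречение.'])
print(embeddings.shape)  # (2, hidden_size)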

Training Details

Training Data

Trained on 29B tokens consisting of a deduplicated union of Bulgarian literature, Web, and other datasets.

Training Procedure

Trained with the Masked Language Modelling objective, masking 20% of tokens, for 3 epochs with bf16 mixed precision, a 512-token context, and a batch size of 256*512 tokens (256 sequences of 512 tokens).
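The masking setup corresponds to a data collator configured roughly as below; this is a sketch, not the authors' actual training script (note the 20% rate versus the transformers default of 15%):

from transformers import DataCollatorForLanguageModeling, PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained('AIaLT-IICT/bert_bg_lit_web_base_cased')

# MLM collator with the 20% masking rate described above.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.2,
)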

Evaluation

The model is evaluated with the Masked Language Modelling objective on the test split, with 20% of tokens randomly masked. It achieves a test loss of 1.33 and a test accuracy of 71.73%.
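A sketch of how such an evaluation can be computed, reusing the model and tokenizer from Direct Use and the collator from Training Procedure; test_texts is an assumed list of test-split strings:

import torch

# Mask 20% of tokens and score the model's predictions at the
# masked positions; labels are -100 everywhere else.
batch = collator([tokenizer(text) for text in test_texts])
with torch.no_grad():
    outputs = model(input_ids=batch['input_ids'],
                    attention_mask=batch['attention_mask'],
                    labels=batch['labels'])

masked = batch['labels'] != -100
predictions = outputs.logits.argmax(dim=-1)
accuracy = (predictions[masked] == batch['labels'][masked]).float().mean()
print(f"loss: {outputs.loss.item():.2f}, accuracy: {accuracy.item():.2%}")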

Model Card Authors

Nikolay Paev, Kiril Simov

Model Card Contact

nikolay.paev@iict.bas.bg
