| license: apache-2.0 | |
| language: | |
| - lv | |
| # Latvian BERT base model (cased) | |
| A BERT model pretrained on Latvian language data using the masked language modeling and next sentence prediction objectives. | |
| It was introduced in [this paper](http://ebooks.iospress.nl/volumearticle/55531) and first released via a [GitHub repository](https://github.com/LUMII-AILab/LVBERT). | |
| The current HF repository contains an improved version of LVBERT. | |
| This model is case-sensitive. It is primarily intended to be fine-tuned on downstream natural language understanding tasks such as text classification, named entity recognition, and question answering. | |
| However, the model can also be used as is to compute contextual embeddings for tasks such as text similarity, clustering, and semantic search. | |
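As a rough illustration of the embedding use case, the sketch below mean-pools the final-layer token vectors into one vector per text and compares texts by cosine similarity. Mean pooling is one common choice, not something this card prescribes. `MODEL_ID` is an assumed Hub identifier and should be replaced with this repository's actual id. The helper names (`mean_pool`, `cosine_similarity`, `embed`) are hypothetical, and `embed` needs the `transformers` and `torch` packages.

```python
import math

# Assumed Hub id, used for illustration only; replace with this repository's id.
MODEL_ID = "AiLab-IMCS-UL/lvbert"


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1, skipping padding."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    return [sum(column) / len(kept) for column in zip(*kept)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def embed(texts, model_id=MODEL_ID):
    """Return one mean-pooled vector (a list of floats) per input text."""
    # Imported lazily so the helpers above work without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()

    batch = tokenizer(
        texts, padding=True, truncation=True, max_length=512, return_tensors="pt"
    )
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden)

    return [
        mean_pool(vectors, mask)
        for vectors, mask in zip(hidden.tolist(), batch["attention_mask"].tolist())
    ]
```

With the packages installed and the model downloaded, `vectors = embed(["Rīga ir Latvijas galvaspilsēta.", "Latvijas galvaspilsēta ir Rīga."])` followed by `cosine_similarity(vectors[0], vectors[1])` gives a similarity score for the two sentences. Converting tensors to lists keeps the sketch simple; for larger batches the pooling would normally stay in `torch`.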
| ## Training data | |
| LVBERT was pretrained on texts from the [Balanced Corpus of Modern Latvian](https://korpuss.lv/en/id/LVK2018), [Latvian Wikipedia](https://korpuss.lv/en/id/Vikipēdija), the [Corpus of News Portal Articles](https://korpuss.lv/en/id/Ziņas), and the [Corpus of News Portal Comments](https://korpuss.lv/en/id/Barometrs), around 500M tokens in total. | |
| ## Tokenization | |
| A SentencePiece model was trained on the training dataset, producing a vocabulary of 32,000 tokens. | |
| It was then converted to the WordPiece format used by BERT. | |
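The exact conversion procedure is not described here. The sketch below shows one common way to map a SentencePiece vocabulary to WordPiece form, under these assumptions: pieces that start with the word-boundary marker `▁` become word-initial tokens, all other pieces become `##` continuation tokens, SentencePiece control symbols are dropped, and the usual BERT special tokens are placed first. The function name and the choice of special tokens are illustrative, not taken from the LVBERT code.

```python
# Usual BERT special tokens; their presence and order here are assumptions.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

# SentencePiece control symbols that have no WordPiece counterpart.
CONTROL_SYMBOLS = {"<unk>", "<s>", "</s>", "<pad>"}

# SentencePiece prefixes word-initial pieces with U+2581 ("▁").
WORD_START = "\u2581"


def sentencepiece_to_wordpiece(pieces):
    """Map SentencePiece pieces to a WordPiece-style vocabulary list."""
    vocab = list(SPECIAL_TOKENS)
    seen = set(vocab)
    for piece in pieces:
        if piece in CONTROL_SYMBOLS:
            continue
        if piece.startswith(WORD_START):
            # Word-initial piece: drop the marker, keep the text as is.
            token = piece[len(WORD_START):]
        else:
            # Word-internal piece: WordPiece marks it with a "##" prefix.
            token = "##" + piece
        # Skip the bare marker (empty token) and any duplicates.
        if token and token not in seen:
            seen.add(token)
            vocab.append(token)
    return vocab
```

Writing the returned list one token per line produces a file in the layout of BERT's `vocab.txt`.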
| ## Pretraining | |
| We used the BERT-base configuration: 12 layers, 768 hidden units, 12 attention heads, a maximum sequence length of 512, a mini-batch size of 128, and a 32k-token vocabulary. | |
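To make these numbers concrete, the sketch below collects them in a small dataclass and derives two quantities: the per-head dimension (768 / 12 = 64) and the upper bound on tokens per mini-batch (128 × 512 = 65,536). The class and field names are hypothetical and chosen for illustration; they are not the authors' training configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PretrainingSetup:
    """Hyperparameters as stated in this card; names are illustrative."""

    num_layers: int = 12
    hidden_size: int = 768
    num_attention_heads: int = 12
    max_sequence_length: int = 512
    batch_size: int = 128
    vocab_size: int = 32_000

    @property
    def head_size(self) -> int:
        """Dimension of each attention head."""
        if self.hidden_size % self.num_attention_heads:
            raise ValueError("hidden_size must be divisible by num_attention_heads")
        return self.hidden_size // self.num_attention_heads

    @property
    def max_tokens_per_batch(self) -> int:
        """Upper bound on tokens per mini-batch (all sequences at full length)."""
        return self.batch_size * self.max_sequence_length


setup = PretrainingSetup()
print(setup.head_size, setup.max_tokens_per_batch)
```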
| ## Citation | |
| Please cite this paper if you use LVBERT: | |
| ```bibtex | |
| @inproceedings{Znotins-Barzdins:2020:BalticHLT, | |
| author = {Arturs Znotins and Guntis Barzdins}, | |
| title = {{LVBERT: Transformer-Based Model for Latvian Language Understanding}}, | |
| booktitle = {Human Language Technologies - The Baltic Perspective}, | |
| series = {Frontiers in Artificial Intelligence and Applications}, | |
| volume = {328}, | |
| publisher = {IOS Press}, | |
| year = {2020}, | |
| pages = {111-115}, | |
| doi = {10.3233/FAIA200610}, | |
| url = {http://ebooks.iospress.nl/volumearticle/55531} | |
| } | |
| ``` |