AiLab-IMCS-UL/lvbert


normundsg committed on May 1, 2024

Commit 4a8c2a8 (verified) · 1 Parent(s): 4dfa048

Updated README

Files changed (1)

README.md +6 -6


@@ -6,16 +6,16 @@ language:
 # Latvian BERT base model (cased)
-A BERT model pretrained on the Latvian language using the masked language modeling and next sentence prediction objectives.
-It was introduced in [this paper](http://ebooks.iospress.nl/volumearticle/55531) and first released via [this repository](https://github.com/LUMII-AILab/LVBERT).
-This model is case-sensitive. It is primarily intended to be fine-tuned on downstream natural language understanding (NLU) tasks.
-Developed at [AiLab.lv](https://ailab.lv)
 ## Training data
-LVBERT was pretrained on texts from the [Balanced Corpus of Modern Latvian](https://korpuss.lv/en/id/LVK2018), [Latvian Wikipedia](https://korpuss.lv/en/id/Vikipēdija), [Corpus of News Portal Articles](https://korpuss.lv/en/id/Ziņas), as well as [Corpus of News Portal Comments](https://korpuss.lv/en/id/Barometrs); 500M tokens in total.
 ## Tokenization

 # Latvian BERT base model (cased)
+A BERT model pretrained on Latvian language data using the masked language modeling and next sentence prediction objectives.
+It was introduced in [this paper](http://ebooks.iospress.nl/volumearticle/55531) and first released via a [GitHub repository](https://github.com/LUMII-AILab/LVBERT).
+The current HF repository contains an improved version of LVBERT.
+This model is case-sensitive. It is primarily intended to be fine-tuned on downstream natural language understanding tasks like text classification, named entity recognition, question answering.
+However, the model can be used as is to compute contextual embeddings for tasks like text similarity and clustering, semantic search.
 ## Training data
+LVBERT was pretrained on texts from the [Balanced Corpus of Modern Latvian](https://korpuss.lv/en/id/LVK2018), [Latvian Wikipedia](https://korpuss.lv/en/id/Vikipēdija), [Corpus of News Portal Articles](https://korpuss.lv/en/id/Ziņas), as well as [Corpus of News Portal Comments](https://korpuss.lv/en/id/Barometrs); around 500M tokens in total.
 ## Tokenization
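
The updated README says the model can be used as is to compute contextual embeddings for text similarity, clustering, and semantic search, but does not say how to turn per-token vectors into one vector per text. The sketch below illustrates one common approach, mask-aware mean pooling followed by cosine similarity. The pooling strategy is an assumption, not something the README prescribes, and the function names `mean_pool` and `cosine_similarity` are hypothetical helpers. Toy vectors stand in for real model output so the example runs without downloading anything.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens, skipping padding (mask == 0)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            for i, value in enumerate(vector):
                total[i] += value
            count += 1
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [component / count for component in total]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-ins for per-token hidden states (assumption: in practice these
# would come from the model's last hidden layer, one vector per token).
sentence_a = [[1.0, 0.0], [3.0, 0.0], [9.0, 9.0]]
mask_a = [1, 1, 0]  # the third position is padding and must be ignored
sentence_b = [[0.0, 2.0], [0.0, 4.0]]
mask_b = [1, 1]

embedding_a = mean_pool(sentence_a, mask_a)
embedding_b = mean_pool(sentence_b, mask_b)
similarity = cosine_similarity(embedding_a, embedding_b)
```

With the `transformers` library, the token vectors would typically be taken from the `last_hidden_state` of `AutoModel.from_pretrained("AiLab-IMCS-UL/lvbert")`, and the mask from the tokenizer's `attention_mask`. Whether mean pooling or the `[CLS]` vector works better for a given task is worth checking empirically.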