Updated README
README.md
CHANGED
@@ -6,16 +6,16 @@ language:

 # Latvian BERT base model (cased)

-A BERT model pretrained on
-It was introduced in [this paper](http://ebooks.iospress.nl/volumearticle/55531) and first released via [
+A BERT model pretrained on Latvian language data using the masked language modeling and next sentence prediction objectives.
+It was introduced in [this paper](http://ebooks.iospress.nl/volumearticle/55531) and first released via a [GitHub repository](https://github.com/LUMII-AILab/LVBERT).
+The current HF repository contains an improved version of LVBERT.

-This model is case-sensitive. It is primarily intended to be fine-tuned on downstream natural language understanding
-
-Developed at [AiLab.lv](https://ailab.lv)
+This model is case-sensitive. It is primarily intended to be fine-tuned on downstream natural language understanding tasks like text classification, named entity recognition, question answering.
+However, the model can be used as is to compute contextual embeddings for tasks like text similarity and clustering, semantic search.

 ## Training data

-LVBERT was pretrained on texts from the [Balanced Corpus of Modern Latvian](https://korpuss.lv/en/id/LVK2018), [Latvian Wikipedia](https://korpuss.lv/en/id/Vikipēdija), [Corpus of News Portal Articles](https://korpuss.lv/en/id/Ziņas), as well as [Corpus of News Portal Comments](https://korpuss.lv/en/id/Barometrs); 500M tokens in total.
+LVBERT was pretrained on texts from the [Balanced Corpus of Modern Latvian](https://korpuss.lv/en/id/LVK2018), [Latvian Wikipedia](https://korpuss.lv/en/id/Vikipēdija), [Corpus of News Portal Articles](https://korpuss.lv/en/id/Ziņas), as well as [Corpus of News Portal Comments](https://korpuss.lv/en/id/Barometrs); around 500M tokens in total.

 ## Tokenization

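The README line added in this commit says the model can be used as is to compute contextual embeddings for text similarity, clustering and semantic search. A minimal sketch of one common way to do that follows: mean-pool the last hidden states over non-padding tokens, then compare sentences by cosine similarity. The Hub id in `MODEL_ID`, the choice of mean pooling, and the helper names `mean_pool`, `cosine_similarity` and `embed` are assumptions made for illustration; the README does not prescribe a pooling strategy.

```python
import math

# Assumption: the Hub repository id is not stated in the README text above.
MODEL_ID = "AiLab-IMCS-UL/lvbert"


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose attention-mask entry is 1."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cannot compare a zero vector")
    return dot / norm


def embed(texts, model_id=MODEL_ID):
    """Return one mean-pooled sentence vector per input text.

    Needs `torch` and `transformers`; the first call downloads the model.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()

    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # shape: (batch, tokens, hidden)

    masks = batch["attention_mask"].tolist()
    return [mean_pool(h, m) for h, m in zip(hidden.tolist(), masks)]
```

With the dependencies installed, `a, b = embed(["Rīga ir Latvijas galvaspilsēta.", "Latvijas galvaspilsēta ir Rīga."])` followed by `cosine_similarity(a, b)` would give a similarity score for the two sentences. Embeddings from a model that has not been fine-tuned for similarity tend to be only a rough signal, so the scores are best read relative to each other.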