Minor updates in README.md

README.md CHANGED

@@ -32,7 +32,7 @@ pipeline_tag: fill-mask

**Model description:** LT-MLKM-modernBERT is a Lithuanian masked language model developed as part of the national project “Development of a general Lithuanian language corpus and vectorized models.” The model builds on the ModernBERT-base architecture and was pre-trained on the BLKT Lithuanian Text Corpus Stage 3, which includes over 1.87 billion words and 49 billion training tokens from diverse Lithuanian sources such as news, legal, academic, and public sector texts. With a context length of 8,192 tokens, it efficiently processes long documents while maintaining linguistic precision and coherence. The model advances the project’s goal of providing high-quality Lithuanian language resources and pre-trained neural models to support AI, research, and digital innovation.

- It serves as a base model for fine-tuning and domain adaptation in both public and private sector projects that require robust Lithuanian text processing. However, the model is not
+ It serves as a base model for fine-tuning and domain adaptation in both public and private sector projects that require robust Lithuanian text processing. However, the model is not task-specialized; it requires fine-tuning for specific downstream tasks.

It uses the `ModernBertForMaskedLM` implementation from the Hugging Face Transformers library (v4.54.1) with bfloat16 precision for efficient training and inference. The model employs a custom Lithuanian tokenizer specially built for this project, featuring a vocabulary size of 64,000 tokens optimized for Lithuanian morphology and subword segmentation. It supports a maximum context length of 8,192 tokens, allowing effective modelling of long documents and complex sentence structures.

## How to Get Started with the Model

@@ -134,7 +134,7 @@ The dataset was pre-processed to normalize text, remove duplicates, and analyze

### Evaluation

- **Evaluation metrics:**
+ **Evaluation metrics:** cross-entropy, perplexity, and GLUE (NER task).

**Evaluation details:**
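The two intrinsic metrics added above are directly related: perplexity is the exponential of the mean per-token cross-entropy (when the loss is measured in nats). A minimal sketch of that identity:

```python
import math

def perplexity(mean_cross_entropy_nats: float) -> float:
    """Perplexity is exp(mean per-token cross-entropy), with the loss in nats."""
    return math.exp(mean_cross_entropy_nats)

# A masked-LM loss of 2.0 nats/token corresponds to a perplexity of about 7.39.
print(round(perplexity(2.0), 2))  # → 7.39
```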
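The fill-mask usage implied by the card (pipeline tag `fill-mask`, `ModernBertForMaskedLM`) can be sketched with the standard Transformers pipeline API. This is an illustrative sketch, not the card's own quickstart: the repository id below is a placeholder assumption, and `[MASK]` is assumed to be the custom tokenizer's mask token.

```python
# Sketch of the fill-mask usage this card describes. MODEL_ID is a placeholder
# assumption -- substitute the model's actual Hugging Face repository id.
MODEL_ID = "your-org/LT-MLKM-modernBERT"  # hypothetical identifier

def build_masked_prompt(sentence: str, mask_token: str = "[MASK]") -> str:
    """Swap a '___' placeholder for the tokenizer's mask token (assumed [MASK])."""
    return sentence.replace("___", mask_token)

def top_predictions(prompt: str, k: int = 5) -> list[str]:
    """Return the k most likely fillers for the masked position."""
    # Imported lazily so build_masked_prompt stays usable without transformers.
    from transformers import pipeline
    fill = pipeline("fill-mask", model=MODEL_ID)
    return [result["token_str"] for result in fill(prompt, top_k=k)]

# Example (downloads the model, so it is left commented out):
#   "Vilnius yra Lietuvos ___." -- "Vilnius is Lithuania's ___."
# print(top_predictions(build_masked_prompt("Vilnius yra Lietuvos ___.")))
```

With an 8,192-token context window, the same call also works on long documents without truncating them first.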