---
language:
- lv
base_model:
- FacebookAI/xlm-roberta-base
license: cc-by-sa-4.0
datasets:
- universal_dependencies
metrics:
- accuracy
- uas
- las
---
# Latvian SpaCy Model: lv_roberta_base
## GitHub Repo
https://github.com/LazyBomb-SIA/LV_RoBERTa_Base
---
## Overview
This is a **spaCy transformer-based pipeline for Latvian**, built with the **XLM-RoBERTa-base backbone**.
**Performance Comparison**
| Model | POS | Tag | Morph | UAS | LAS | Lemma Acc | Summary (equal weights) |
| ------------ | ------ | ------ | ------ | ------ | ------ | --------- | ------ |
| spaCy (this model) | 0.9748 | 0.9215 | 0.9550 | 0.9104 | 0.8753 | 0.8203 | 90.96% |
| Stanza | 0.9688 | 0.8987 | 0.9449 | 0.8791 | 0.8354 | 0.9539 | 91.35% |
| UDPipe | 0.9207 | 0.7960 | 0.3403 | 0.0791 | 0.0660 | 0.8911 | 51.55% |
For details, see cells 12 and 13 of the evaluation notebook:
https://github.com/LazyBomb-SIA/LV_RoBERTa_Base/blob/main/lv_roberta_base.ipynb
It includes the following components:
- **Transformer** (XLM-RoBERTa-base)
- **Tagger**
- **Morphologizer**
- **Parser**
- **Sentence Segmenter (senter)**
- **Lemmatizer**
- (Note: the downstream components share the transformer's output through listener layers)
**Model type:** Transformer pipeline (XLM-RoBERTa-base backbone)
**Language:** Latvian (lv)
**Recommended hardware:** CPU is sufficient for small-scale use; a GPU is recommended for faster inference.
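To run on a GPU, spaCy can be pointed at it with its standard helpers before the pipeline is loaded; a minimal sketch (no model download needed to try it):

```python
import spacy

# Try to activate the GPU; returns True on success, False if spaCy
# falls back to CPU. Call this before spacy.load().
on_gpu = spacy.prefer_gpu()
print("GPU enabled:", on_gpu)
```

`spacy.require_gpu()` can be used instead to raise an error rather than silently falling back when no GPU is present.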
---
## Training Data
The model was trained on the **Latvian UD Treebank v2.16**, which is derived from the **Latvian Treebank (LVTB)** created at the University of Latvia, Institute of Mathematics and Computer Science, Artificial Intelligence Laboratory (AI Lab).
- **Dataset source:** [UD Latvian LVTB](https://github.com/UniversalDependencies/UD_Latvian-LVTB)
- **License:** [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/)
- **Data splits:**
- Train: 15,055 sentences
- Dev: 2,080 sentences
- Test: 2,396 sentences
---
## Acknowledgements
- Thanks to the **University of Latvia, AI Lab**, and all contributors of the **Latvian UD Treebank**.
- Model development supported by **LazyBomb SIA**.
- Inspired by the **spaCy ecosystem** and training framework.
- The Latvian UD Treebank was developed with support from multiple grants, including:
- European Regional Development Fund (Grant No. 1.1.1.1/16/A/219, 1.1.1.2/VIAA/1/16/188)
- State Research Programme "National Identity"
- State Research Programme "Digital Resources for the Humanities" (Grant No. VPP-IZM-DH-2020/1-0001)
- State Research Programme "Research on Modern Latvian Language and Development of Language Technology" (Grant No. VPP-LETONIKA-2021/1-0006)
---
## Special Thanks
Special thanks to all contributors who participated in the beta test, and especially those who provided valuable feedback.
**Contributor list to be announced.**
---
## License
This model is released under the [Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0)](https://creativecommons.org/licenses/by-sa/4.0/).
You are free to:
- **Share** — copy and redistribute the material in any medium or format, for any purpose, even commercially.
- **Adapt** — remix, transform, and build upon the material for any purpose, even commercially.
Under the following terms:
- **Attribution** — You must give appropriate credit, provide a link to the license, and indicate if changes were made.
- **ShareAlike** — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
---
## References
- Pretkalniņa, L., Rituma, L., Saulīte, B., et al. (2016–2025). Universal Dependencies Latvian Treebank (LVTB).
- Grūzītis, N., Znotiņš, A., Nešpore-Bērzkalne, G., Paikens, P., et al. (2018). Creation of a Balanced State-of-the-Art Multilayer Corpus for NLU. *LREC 2018*.
- Pretkalniņa, L., Rituma, L., Saulīte, B. (2016). Universal Dependency Treebank for Latvian: A Pilot. *Baltic Perspective Workshop*.
---
## Usage
You can either:
1. **Download the model directly from the Hugging Face Hub**
Using `huggingface_hub.snapshot_download`, you can fetch the model files automatically and cache them locally.
```python
import spacy
from huggingface_hub import snapshot_download
# Load the pipeline
model_dir = snapshot_download(repo_id="JesseHuang922/lv_roberta_base", repo_type="model")
nlp = spacy.load(model_dir)
```
2. **Install from the pre-built wheel package**
Download the wheel file (**lv_roberta_base-1.0.0-py3-none-any.whl**) and install it into your virtual environment with:
```bash
pip install lv_roberta_base-1.0.0-py3-none-any.whl
```
Once installed, the pipeline can typically be loaded by its package name: `spacy.load("lv_roberta_base")`.
---
## Dependencies
The following Python packages are required to run the Latvian XLM-RoBERTa spaCy pipeline:
| Package | Minimum Version | Notes |
| ---------------------- | --------------- | -------------------------------------------------------------------------------------- |
| **spaCy** | 3.8.7 | Main NLP framework |
| **spacy-transformers** | 1.3.9 | Integrates spaCy with Hugging Face Transformers |
| **transformers** | 4.49.0 | Hugging Face Transformers library |
| **torch** | 2.8.0 | PyTorch backend for transformers |
| **tokenizers** | 0.21.4 | Fast tokenizer support |
| **safetensors** | 0.6.2 | Secure tensor storage for transformer weights |
| **huggingface-hub** | 0.34.4 | Download and manage the model files from the Hugging Face Hub |
## Optional but recommended
| Package | Minimum Version | Notes |
| ---------------------- | --------------- | -------------------------------------------------------------------------------------- |
| **hf-xet** | 1.1.10 | Faster large-file downloads and uploads when the Hugging Face Hub uses the Xet storage backend |
## Install all dependencies with one command
Note that the version specifiers are quoted so the shell does not interpret `>` as output redirection:
```bash
pip install \
  "spacy>=3.8.7" \
  "spacy-transformers>=1.3.9" \
  "transformers>=4.49.0" \
  "torch>=2.8.0" \
  "tokenizers>=0.21.4" \
  "safetensors>=0.6.2" \
  "huggingface-hub>=0.34.4" \
  "hf-xet>=1.1.10"
```
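To confirm that the environment actually satisfies these requirements, here is a quick sanity check using only the standard library (the package names are the PyPI names from the tables above):

```python
from importlib.metadata import version, PackageNotFoundError

# PyPI distribution names of the required and optional packages
required = ["spacy", "spacy-transformers", "transformers", "torch",
            "tokenizers", "safetensors", "huggingface-hub", "hf-xet"]

for pkg in required:
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: NOT INSTALLED")
```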
## Example Code
```python
import spacy
import numpy as np
from huggingface_hub import snapshot_download
# Load the pipeline
model_dir = snapshot_download(repo_id="JesseHuang922/lv_roberta_base", repo_type="model")
nlp = spacy.load(model_dir)
# Example text
text = """Baltijas jūras nosaukums ir devis nosaukumu baltu valodām un Baltijas valstīm.
Terminu "Baltijas jūra" (Mare Balticum) pirmoreiz lietoja vācu hronists Brēmenes Ādams 11. gadsimtā."""
# Process text
doc = nlp(text)
# ------------------------
# Tokenization
# ------------------------
print("Tokens:")
print([token.text for token in doc])
# ------------------------
# Lemmatization
# ------------------------
print("Lemmas:")
print([token.lemma_ for token in doc])
# ------------------------
# Part-of-Speech Tagging
# ------------------------
print("POS tags:")
for token in doc:
print(f"{token.text}: {token.pos_} ({token.tag_})")
# ------------------------
# Morphological Features
# ------------------------
print("Morphological features:")
for token in doc:
print(f"{token.text}: {token.morph}")
# ------------------------
# Dependency Parsing
# ------------------------
print("Dependency parsing:")
for token in doc:
print(f"{token.text} <--{token.dep_}-- {token.head.text}")
# ------------------------
# Sentence Segmentation
# ------------------------
print("Sentences:")
for sent in doc.sents:
print(sent.text)
# ------------------------
# Check Pipeline Components
# ------------------------
print("Pipeline components:")
print(nlp.pipe_names)
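# ------------------------
# Batch Processing (optional)
# ------------------------
# For many texts, nlp.pipe() processes documents in batches and is faster
# than calling nlp() in a loop; batch_size is tunable, and larger batches
# help most on GPU.
for d in nlp.pipe(["Rīga ir Latvijas galvaspilsēta.", "Sveika, pasaule!"], batch_size=32):
    print(d.text)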
# ------------------------
# Transformer vectors
# ------------------------
vectors = np.vstack([token.vector for token in doc])
print("Token vectors shape:", vectors.shape)
```