# Latvian spaCy Model: lv_roberta_base

GitHub repo: https://github.com/LazyBomb-SIA/LV_RoBERTa_Base
## Overview
This is a spaCy transformer-based pipeline for Latvian, built with the XLM-RoBERTa-base backbone.
## Performance Comparison
| Model | POS | Tag | Morph | UAS | LAS | Lemma Acc | Summary (equal weights) |
|---|---|---|---|---|---|---|---|
| spaCy (this model) | 0.9748 | 0.9215 | 0.9550 | 0.9104 | 0.8753 | 0.8203 | 90.96% |
| Stanza | 0.9688 | 0.8987 | 0.9449 | 0.8791 | 0.8354 | 0.9539 | 91.35% |
| UDPipe | 0.9207 | 0.7960 | 0.3403 | 0.0791 | 0.0660 | 0.8911 | 51.55% |
For details, see cells 12 and 13 of the notebook:
https://github.com/LazyBomb-SIA/LV_RoBERTa_Base/blob/main/lv_roberta_base.ipynb
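The summary column in the table above is the unweighted mean of the six per-task scores, reported as a percentage. A quick sanity check, using the values copied from the table:

```python
# Each row: model name -> [POS, Tag, Morph, UAS, LAS, Lemma Acc] from the table above
rows = {
    "spaCy (this model)": [0.9748, 0.9215, 0.9550, 0.9104, 0.8753, 0.8203],
    "Stanza":             [0.9688, 0.8987, 0.9449, 0.8791, 0.8354, 0.9539],
    "UDPipe":             [0.9207, 0.7960, 0.3403, 0.0791, 0.0660, 0.8911],
}

# Equal-weight summary = unweighted mean of the six scores, as a percentage
for model, scores in rows.items():
    summary = 100 * sum(scores) / len(scores)
    print(f"{model}: {summary:.2f}%")
```

Running this reproduces the summary column (90.96%, 91.35%, 51.55%), which also shows why UDPipe's score drops so sharply: the mean is pulled down by its near-zero UAS and LAS.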
It includes the following components:
- Transformer (XLM-RoBERTa-base)
- Tagger
- Morphologizer
- Parser
- Sentence Segmenter (senter)
- Lemmatizer
- (Note: the transformer component internally uses a tok2vec listener)
- Model type: Transformer pipeline (XLM-RoBERTa-base backbone)
- Language: Latvian (lv)
- Recommended hardware: CPU works for small-scale use; a GPU is recommended for faster inference.
## Training Data
The model was trained on the Latvian UD Treebank v2.16, which is derived from the Latvian Treebank (LVTB) created at the University of Latvia, Institute of Mathematics and Computer Science, Artificial Intelligence Laboratory (AI Lab).
- Dataset source: UD Latvian LVTB
- License: CC BY-SA 4.0
- Data splits:
- Train: 15,055 sentences
- Dev: 2,080 sentences
- Test: 2,396 sentences
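The splits above correspond to roughly a 77/11/12 train/dev/test division. A quick check of the totals and proportions:

```python
# Sentence counts from the UD Latvian LVTB splits listed above
splits = {"train": 15055, "dev": 2080, "test": 2396}

total = sum(splits.values())
print(f"Total: {total} sentences")  # 19531

# Share of each split, as a percentage of all sentences
for name, count in splits.items():
    print(f"{name}: {count / total:.1%}")
```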
## Acknowledgements
- Thanks to the University of Latvia, AI Lab, and all contributors of the Latvian UD Treebank.
- Model development supported by LazyBomb SIA.
- Inspired by the spaCy ecosystem and training framework.
- The Latvian UD Treebank was developed with support from multiple grants, including:
- European Regional Development Fund (Grant No. 1.1.1.1/16/A/219, 1.1.1.2/VIAA/1/16/188)
- State Research Programme "National Identity"
- State Research Programme "Digital Resources for the Humanities" (Grant No. VPP-IZM-DH-2020/1-0001)
- State Research Programme "Research on Modern Latvian Language and Development of Language Technology" (Grant No. VPP-LETONIKA-2021/1-0006)
## Special Thanks

Special thanks to all contributors who participated in the beta test, and especially to those who provided valuable feedback. The full list of contributors is forthcoming.
## License
This model is released under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0).
You are free to:
- Share — copy and redistribute the material in any medium or format, for any purpose, even commercially.
- Adapt — remix, transform, and build upon the material for any purpose, even commercially.
Under the following terms:
- Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made.
- ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
## References
- Pretkalniņa, L., Rituma, L., Saulīte, B., et al. (2016–2025). Universal Dependencies Latvian Treebank (LVTB).
- Grūzītis, N., Znotiņš, A., Nešpore-Bērzkalne, G., Paikens, P., et al. (2018). Creation of a Balanced State-of-the-Art Multilayer Corpus for NLU. LREC 2018.
- Pretkalniņa, L., Rituma, L., Saulīte, B. (2016). Universal Dependency Treebank for Latvian: A Pilot. Baltic Perspective Workshop.
## Usage
You can either:

**1. Download the model directly from the Hugging Face Hub**

Using `huggingface_hub.snapshot_download`, the model files are automatically fetched and cached locally:

```python
import spacy
from huggingface_hub import snapshot_download

# Load the pipeline
model_dir = snapshot_download(repo_id="JesseHuang922/lv_roberta_base", repo_type="model")
nlp = spacy.load(model_dir)
```

**2. Install from the pre-built wheel package**

Download the wheel file (`lv_roberta_base-1.0.0-py3-none-any.whl`) and install it into your virtual environment with:

```bash
pip install lv_roberta_base-1.0.0-py3-none-any.whl
```
## Dependencies
The following Python packages are required to run the Latvian XLM-RoBERTa spaCy pipeline:
| Package | Minimum Version | Notes |
|---|---|---|
| spaCy | 3.8.7 | Main NLP framework |
| spacy-transformers | 1.3.9 | Integrates spaCy with Hugging Face Transformers |
| transformers | 4.49.0 | Hugging Face Transformers library |
| torch | 2.8.0 | PyTorch backend for transformers |
| tokenizers | 0.21.4 | Fast tokenizer support |
| safetensors | 0.6.2 | Secure tensor storage for transformer weights |
| huggingface-hub | 0.34.4 | Download and manage the model files from the Hugging Face Hub |
### Optional but recommended
| Package | Minimum Version | Notes |
|---|---|---|
| hf-xet | 1.1.10 | if you need to download or upload large files from the Hugging Face Hub and use the Xet storage backend |
Install all dependencies with a single command (the version specifiers are quoted so the shell does not interpret `>` as output redirection):

```bash
pip install \
  "spacy>=3.8.7" \
  "spacy-transformers>=1.3.9" \
  "transformers>=4.49.0" \
  "torch>=2.8.0" \
  "tokenizers>=0.21.4" \
  "safetensors>=0.6.2" \
  "huggingface-hub>=0.34.4" \
  "hf-xet>=1.1.10"
```
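If imports fail at runtime, a quick way to find missing or outdated packages is to compare the installed versions against the minimums in the table above. A minimal sketch using only the standard library (the naive tuple comparison below does not handle pre-release or local version tags such as `2.8.0+cu121`; use `packaging.version` for robust parsing):

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions from the dependency table above
MINIMUMS = {
    "spacy": "3.8.7",
    "spacy-transformers": "1.3.9",
    "transformers": "4.49.0",
    "torch": "2.8.0",
    "tokenizers": "0.21.4",
    "safetensors": "0.6.2",
    "huggingface-hub": "0.34.4",
}

def version_tuple(v):
    """Convert a dotted version string like '3.8.7' into a comparable tuple of ints."""
    return tuple(int(part) for part in v.split(".") if part.isdigit())

def check_installed(minimums=MINIMUMS):
    """Print, for each required package, whether it is missing or older than the minimum."""
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            print(f"{name}: NOT INSTALLED (need >= {minimum})")
            continue
        ok = version_tuple(installed) >= version_tuple(minimum)
        print(f"{name}: {installed} {'ok' if ok else f'too old (need >= {minimum})'}")

check_installed()
```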
## Example Code

```python
import spacy
import numpy as np
from huggingface_hub import snapshot_download

# Load the pipeline
model_dir = snapshot_download(repo_id="JesseHuang922/lv_roberta_base", repo_type="model")
nlp = spacy.load(model_dir)

# Example text
text = """Baltijas jūras nosaukums ir devis nosaukumu baltu valodām un Baltijas valstīm.
Terminu "Baltijas jūra" (Mare Balticum) pirmoreiz lietoja vācu hronists Brēmenes Ādams 11. gadsimtā."""

# Process text
doc = nlp(text)

# ------------------------
# Tokenization
# ------------------------
print("Tokens:")
print([token.text for token in doc])

# ------------------------
# Lemmatization
# ------------------------
print("Lemmas:")
print([token.lemma_ for token in doc])

# ------------------------
# Part-of-Speech Tagging
# ------------------------
print("POS tags:")
for token in doc:
    print(f"{token.text}: {token.pos_} ({token.tag_})")

# ------------------------
# Morphological Features
# ------------------------
print("Morphological features:")
for token in doc:
    print(f"{token.text}: {token.morph}")

# ------------------------
# Dependency Parsing
# ------------------------
print("Dependency parsing:")
for token in doc:
    print(f"{token.text} <--{token.dep_}-- {token.head.text}")

# ------------------------
# Sentence Segmentation
# ------------------------
print("Sentences:")
for sent in doc.sents:
    print(sent.text)

# ------------------------
# Check Pipeline Components
# ------------------------
print("Pipeline components:")
print(nlp.pipe_names)

# Transformer vectors
vectors = np.vstack([token.vector for token in doc])
print("Token vectors shape:", vectors.shape)
```
## Model Tree

Base model for JesseHuang922/lv_roberta_base: FacebookAI/xlm-roberta-base