Latvian SpaCy Model: lv_roberta_base

Github Repo:

https://github.com/LazyBomb-SIA/LV_RoBERTa_Base


Overview

This is a spaCy transformer-based pipeline for Latvian, built with the XLM-RoBERTa-base backbone.

Performance Comparison

Model POS Tag Morph UAS LAS Lemma Acc Summary (equal weights)
spaCy (this model) 0.9748 0.9215 0.9550 0.9104 0.8753 0.8203 90.96%
Stanza 0.9688 0.8987 0.9449 0.8791 0.8354 0.9539 91.35%
UDPipe 0.9207 0.7960 0.3403 0.0791 0.0660 0.8911 51.55%

Details please check cell 12 and 13 here:

https://github.com/LazyBomb-SIA/LV_RoBERTa_Base/blob/main/lv_roberta_base.ipynb

It includes the following components:

  • Transformer (XLM-RoBERTa-base)
  • Tagger
  • Morphologizer
  • Parser
  • Sentence Segmenter (senter)
  • Lemmatizer
  • (Note: Transformer component internally uses a tok2vec listener)

Model type: Transformer pipeline (XLM-RoBERTa-base backbone)
Language: Latvian (lv)
Recommended hardware: CPU for small-scale use, GPU recommended for faster inference.


Training Data

The model was trained on the Latvian UD Treebank v2.16, which is derived from the Latvian Treebank (LVTB) created at the University of Latvia, Institute of Mathematics and Computer Science, Artificial Intelligence Laboratory (AI Lab).

  • Dataset source: UD Latvian LVTB
  • License: CC BY-SA 4.0
  • Data splits:
    • Train: 15,055 sentences
    • Dev: 2,080 sentences
    • Test: 2,396 sentences

Acknowledgements

  • Thanks to the University of Latvia, AI Lab, and all contributors of the Latvian UD Treebank.
  • Model development supported by [LazyBomb.SIA].
  • Inspired by the spaCy ecosystem and training framework.
  • The Latvian UD Treebank was developed with support from multiple grants, including:
    • European Regional Development Fund (Grant No. 1.1.1.1/16/A/219, 1.1.1.2/VIAA/1/16/188)
    • State Research Programme "National Identity"
    • State Research Programme "Digital Resources for the Humanities" (Grant No. VPP-IZM-DH-2020/1-0001)
    • State Research Programme "Research on Modern Latvian Language and Development of Language Technology" (Grant No. VPP-LETONIKA-2021/1-0006)

Special Thanks

Special Thanks to all contributors who participated in the Beta test and espically those who provided valuable feedback

The list is waiting


License

This model is released under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0).

You are free to:

  • Share — copy and redistribute the material in any medium or format, for any purpose, even commercially.
  • Adapt — remix, transform, and build upon the material for any purpose, even commercially.

Under the following terms:

  • Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made.
  • ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

References

  • Pretkalniņa, L., Rituma, L., Saulīte, B., et al. (2016–2025). Universal Dependencies Latvian Treebank (LVTB).
  • Grūzītis, N., Znotiņš, A., Nešpore-Bērzkalne, G., Paikens, P., et al. (2018). Creation of a Balanced State-of-the-Art Multilayer Corpus for NLU. LREC 2018.
  • Pretkalniņa, L., Rituma, L., Saulīte, B. (2016). Universal Dependency Treebank for Latvian: A Pilot. Baltic Perspective Workshop.


Usage

You can either:

  1. Download the model directly from the Hugging Face Hub
    Using huggingface_hub.snapshot_download, the model files will be automatically fetched and cached locally.

    import spacy
    from huggingface_hub import snapshot_download
    
    # Load the pipeline
    model_dir = snapshot_download(repo_id="JesseHuang922/lv_roberta_base", repo_type="model")
    nlp = spacy.load(model_dir)
    
  2. Install from the pre-built wheel package
    Download the wheel file (lv_roberta_base-1.0.0-py3-none-any.whl) and install it into your virtual environment with:

    ```bash
    pip install lv_roberta_base-1.0.0-py3-none-any.whl
    

Dependencies

The following Python packages are required to run the Latvian XLM-RoBERTa spaCy pipeline:

Package Minimum Version Notes
spaCy 3.8.7 Main NLP framework
spacy-transformers 1.3.9 Integrates spaCy with Hugging Face Transformers
transformers 4.49.0 Hugging Face Transformers library
torch 2.8.0 PyTorch backend for transformers
tokenizers 0.21.4 Fast tokenizer support
safetensors 0.6.2 Secure tensor storage for transformer weights
huggingface-hub 0.34.4 Download and manage the model files from the Hugging Face Hub

Optional but recommended

Package Minimum Version Notes
hf-xet 1.1.10 if you need to download or upload large files from the Hugging Face Hub and use the Xet storage backend

Download all dependencies with just one command line:

pip install \
spacy>=3.8.7 \
spacy-transformers>=1.3.9 \
transformers>=4.49.0 \
torch>=2.8.0 \
tokenizers>=0.21.4 \
safetensors>=0.6.2 \
huggingface-hub>=0.34.4 \
hf-xet>=1.1.10

Example Code

import spacy
import numpy as np
from huggingface_hub import snapshot_download

# Load the pipeline
model_dir = snapshot_download(repo_id="JesseHuang922/lv_roberta_base", repo_type="model")
nlp = spacy.load(model_dir)

# Example text
text = """Baltijas jūras nosaukums ir devis nosaukumu baltu valodām un Baltijas valstīm.
Terminu "Baltijas jūra" (Mare Balticum) pirmoreiz lietoja vācu hronists Brēmenes Ādams 11. gadsimtā."""

# Process text
doc = nlp(text)

# ------------------------
# Tokenization 
# ------------------------
print("Tokens:")
print([token.text for token in doc])

# ------------------------
# Lemmatization
# ------------------------
print("Lemmas:")
print([token.lemma_ for token in doc])

# ------------------------
# Part-of-Speech Tagging
# ------------------------
print("POS tags:")
for token in doc:
    print(f"{token.text}: {token.pos_} ({token.tag_})")

# ------------------------
# Morphological Features
# ------------------------
print("Morphological features:")
for token in doc:
    print(f"{token.text}: {token.morph}")

# ------------------------
# Dependency Parsing
# ------------------------
print("Dependency parsing:")
for token in doc:
    print(f"{token.text} <--{token.dep_}-- {token.head.text}")

# ------------------------
# Sentence Segmentation
# ------------------------
print("Sentences:")
for sent in doc.sents:
    print(sent.text)

# ------------------------
# Check Pipeline Components
# ------------------------
print("Pipeline components:")
print(nlp.pipe_names)

# Transformer vectors
vectors = np.vstack([token.vector for token in doc])
print("Token vectors shape:", vectors.shape)
Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for JesseHuang922/lv_roberta_base

Finetuned
(3740)
this model

Dataset used to train JesseHuang922/lv_roberta_base