---
language:
- lv
base_model:
- FacebookAI/xlm-roberta-base
license: cc-by-sa-4.0
datasets:
- universal_dependencies
metrics:
- accuracy
- uas
- las
---

# Latvian spaCy Model: lv_roberta_base

## GitHub Repo: https://github.com/LazyBomb-SIA/LV_RoBERTa_Base

---

## Overview

This is a **spaCy transformer-based pipeline for Latvian**, built with the **XLM-RoBERTa-base backbone**.

**Performance Comparison**

| Model              | POS    | Tag    | Morph  | UAS    | LAS    | Lemma Acc | Summary (equal weights) |
| ------------------ | ------ | ------ | ------ | ------ | ------ | --------- | ----------------------- |
| spaCy (this model) | 0.9748 | 0.9215 | 0.9550 | 0.9104 | 0.8753 | 0.8203    | 90.96%                  |
| Stanza             | 0.9688 | 0.8987 | 0.9449 | 0.8791 | 0.8354 | 0.9539    | 91.35%                  |
| UDPipe             | 0.9207 | 0.7960 | 0.3403 | 0.0791 | 0.0660 | 0.8911    | 51.55%                  |

For details, see cells 12 and 13 of the notebook: https://github.com/LazyBomb-SIA/LV_RoBERTa_Base/blob/main/lv_roberta_base.ipynb

The pipeline includes the following components:

- **Transformer** (XLM-RoBERTa-base)
- **Tagger**
- **Morphologizer**
- **Parser**
- **Sentence Segmenter (senter)**
- **Lemmatizer**
- (Note: the Transformer component internally uses a `tok2vec` listener)

**Model type:** Transformer pipeline (XLM-RoBERTa-base backbone)

**Language:** Latvian (lv)

**Recommended hardware:** CPU for small-scale use; a GPU is recommended for faster inference.

---

## Training Data

The model was trained on the **Latvian UD Treebank v2.16**, which is derived from the **Latvian Treebank (LVTB)** created at the University of Latvia, Institute of Mathematics and Computer Science, Artificial Intelligence Laboratory (AI Lab).

- **Dataset source:** [UD Latvian LVTB](https://github.com/UniversalDependencies/UD_Latvian-LVTB)
- **License:** [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/)
- **Data splits:**
  - Train: 15,055 sentences
  - Dev: 2,080 sentences
  - Test: 2,396 sentences

---

## Acknowledgements

- Thanks to the **University of Latvia, AI Lab**, and all contributors to the **Latvian UD Treebank**.
- Model development supported by [LazyBomb.SIA].
- Inspired by the **spaCy ecosystem** and training framework.
- The Latvian UD Treebank was developed with support from multiple grants, including:
  - European Regional Development Fund (Grant No. 1.1.1.1/16/A/219, 1.1.1.2/VIAA/1/16/188)
  - State Research Programme "National Identity"
  - State Research Programme "Digital Resources for the Humanities" (Grant No. VPP-IZM-DH-2020/1-0001)
  - State Research Programme "Research on Modern Latvian Language and Development of Language Technology" (Grant No. VPP-LETONIKA-2021/1-0006)

---

## Special Thanks

Special thanks to all contributors who participated in the beta test, and especially to those who provided valuable feedback.

**The list is forthcoming.**

---

## License

This model is released under the [Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0)](https://creativecommons.org/licenses/by-sa/4.0/).

You are free to:

- **Share**: copy and redistribute the material in any medium or format, for any purpose, even commercially.
- **Adapt**: remix, transform, and build upon the material for any purpose, even commercially.

Under the following terms:

- **Attribution**: You must give appropriate credit, provide a link to the license, and indicate if changes were made.
- **ShareAlike**: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

---

## References

- Pretkalniņa, L., Rituma, L., Saulīte, B., et al. (2016–2025). Universal Dependencies Latvian Treebank (LVTB).
- Grūzītis, N., Znotiņš, A., Nešpore-Bērzkalne, G., Paikens, P., et al. (2018). Creation of a Balanced State-of-the-Art Multilayer Corpus for NLU. *LREC 2018*.
- Pretkalniņa, L., Rituma, L., Saulīte, B. (2016). Universal Dependency Treebank for Latvian: A Pilot. *Baltic Perspective Workshop*.

---

## Usage

You can either:

1. **Download the model directly from the Hugging Face Hub**

   `huggingface_hub.snapshot_download` fetches the model files automatically and caches them locally.

   ```python
   import spacy
   from huggingface_hub import snapshot_download

   # Download the model (or reuse the cached copy), then load the pipeline
   model_dir = snapshot_download(repo_id="JesseHuang922/lv_roberta_base", repo_type="model")
   nlp = spacy.load(model_dir)
   ```

2. **Install from the pre-built wheel package**

   Download the wheel file (**lv_roberta_base-1.0.0-py3-none-any.whl**) and install it into your virtual environment with:

   ```bash
   pip install lv_roberta_base-1.0.0-py3-none-any.whl
   ```

---

## Dependencies

The following Python packages are required to run the Latvian XLM-RoBERTa spaCy pipeline:

| Package                | Minimum Version | Notes                                                         |
| ---------------------- | --------------- | ------------------------------------------------------------- |
| **spaCy**              | 3.8.7           | Main NLP framework                                            |
| **spacy-transformers** | 1.3.9           | Integrates spaCy with Hugging Face Transformers               |
| **transformers**       | 4.49.0          | Hugging Face Transformers library                             |
| **torch**              | 2.8.0           | PyTorch backend for transformers                              |
| **tokenizers**         | 0.21.4          | Fast tokenizer support                                        |
| **safetensors**        | 0.6.2           | Secure tensor storage for transformer weights                 |
| **huggingface-hub**    | 0.34.4          | Download and manage the model files from the Hugging Face Hub |

## Optional but recommended

| Package    | Minimum Version | Notes                                                                                                     |
| ---------- | --------------- | --------------------------------------------------------------------------------------------------------- |
| **hf-xet** | 1.1.10          | Needed only if you download or upload large files on the Hugging Face Hub through the Xet storage backend |

## Install all dependencies with a single command:

```bash
# Quote each requirement so the shell does not treat ">" as output redirection
pip install \
  "spacy>=3.8.7" \
  "spacy-transformers>=1.3.9" \
  "transformers>=4.49.0" \
  "torch>=2.8.0" \
  "tokenizers>=0.21.4" \
  "safetensors>=0.6.2" \
  "huggingface-hub>=0.34.4" \
  "hf-xet>=1.1.10"
```

## Example Code

```python
import spacy
import numpy as np
from huggingface_hub import snapshot_download

# Load the pipeline
model_dir = snapshot_download(repo_id="JesseHuang922/lv_roberta_base", repo_type="model")
nlp = spacy.load(model_dir)

# Example text
text = """Baltijas jūras nosaukums ir devis nosaukumu baltu valodām un Baltijas valstīm.
Terminu "Baltijas jūra" (Mare Balticum) pirmoreiz lietoja vācu hronists Brēmenes Ādams 11. gadsimtā."""

# Process text
doc = nlp(text)

# ------------------------
# Tokenization
# ------------------------
print("Tokens:")
print([token.text for token in doc])

# ------------------------
# Lemmatization
# ------------------------
print("Lemmas:")
print([token.lemma_ for token in doc])

# ------------------------
# Part-of-Speech Tagging
# ------------------------
print("POS tags:")
for token in doc:
    print(f"{token.text}: {token.pos_} ({token.tag_})")

# ------------------------
# Morphological Features
# ------------------------
print("Morphological features:")
for token in doc:
    print(f"{token.text}: {token.morph}")

# ------------------------
# Dependency Parsing
# ------------------------
print("Dependency parsing:")
for token in doc:
    print(f"{token.text} <--{token.dep_}-- {token.head.text}")

# ------------------------
# Sentence Segmentation
# ------------------------
print("Sentences:")
for sent in doc.sents:
    print(sent.text)

# ------------------------
# Check Pipeline Components
# ------------------------
print("Pipeline components:")
print(nlp.pipe_names)

# Token vectors. Note: in a transformer pipeline token.vector may be empty;
# the raw transformer output is stored in doc._.trf_data.
vectors = np.vstack([token.vector for token in doc])
print("Token vectors shape:", vectors.shape)
```
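The per-token attributes printed in the example above can also be collected into a single table. The following is a minimal sketch, not part of spaCy or of this model: `to_conllu_rows` is a hypothetical helper name, and it assumes only the standard spaCy attributes already used above (`doc.sents`, `token.i`, `token.head`, `token.lemma_`, `token.pos_`, `token.tag_`, `token.morph`, `token.dep_`). It emits CoNLL-U-style rows with the labels exactly as spaCy reports them, and leaves the DEPS and MISC columns empty.

```python
def to_conllu_rows(doc):
    """Return CoNLL-U-style lines: one per token, plus a blank line after each sentence.

    Hypothetical helper for illustration; works on any object that exposes
    spaCy's Doc/Span/Token attributes.
    """
    lines = []
    for sent in doc.sents:
        for token in sent:
            # Token IDs are 1-based and restart in every sentence.
            token_id = token.i - sent.start + 1
            # spaCy marks the root as its own head; CoNLL-U writes head 0 for it.
            if token.head.i == token.i:
                head_id = 0
            else:
                head_id = token.head.i - sent.start + 1
            fields = [
                str(token_id),            # ID
                token.text,               # FORM
                token.lemma_ or "_",      # LEMMA
                token.pos_ or "_",        # UPOS
                token.tag_ or "_",        # XPOS
                str(token.morph) or "_",  # FEATS
                str(head_id),             # HEAD
                token.dep_ or "_",        # DEPREL
                "_",                      # DEPS (not produced here)
                "_",                      # MISC (not produced here)
            ]
            lines.append("\t".join(fields))
        lines.append("")  # blank line separates sentences
    return lines
```

With the `doc` from the example above, `print("\n".join(to_conllu_rows(doc)))` prints the table, which makes it easier to compare the output side by side with the treebank files.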
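For readers who want to check the "Summary (equal weights)" column of the Performance Comparison table, the sketch below recomputes it. It assumes the summary is the unweighted mean of the six metric columns (POS, Tag, Morph, UAS, LAS, Lemma Acc); the card does not state the formula, but the published numbers are consistent with that reading. The name `equal_weight_summary` is illustrative only.

```python
# Scores copied from the Performance Comparison table, in column order:
# POS, Tag, Morph, UAS, LAS, Lemma Acc
scores = {
    "spaCy (this model)": [0.9748, 0.9215, 0.9550, 0.9104, 0.8753, 0.8203],
    "Stanza": [0.9688, 0.8987, 0.9449, 0.8791, 0.8354, 0.9539],
    "UDPipe": [0.9207, 0.7960, 0.3403, 0.0791, 0.0660, 0.8911],
}


def equal_weight_summary(values):
    """Unweighted arithmetic mean of the scores, expressed as a percentage."""
    return 100 * sum(values) / len(values)


for model, values in scores.items():
    print(f"{model}: {equal_weight_summary(values):.3f}%")
```

The unrounded means are about 90.955, 91.347, and 51.553, which agree with the 90.96%, 91.35%, and 51.55% reported in the table. Because every metric carries the same weight, Stanza's higher lemma accuracy is enough to offset this model's lead on the other five metrics.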