---
language:
- lv
base_model:
- FacebookAI/xlm-roberta-base
license: cc-by-sa-4.0
datasets:
- universal_dependencies
metrics:
- accuracy
- uas
- las
---

# Latvian spaCy Model: lv_roberta_base

## GitHub Repo
https://github.com/LazyBomb-SIA/LV_RoBERTa_Base

---

## Overview

This is a **spaCy transformer-based pipeline for Latvian**, built with the **XLM-RoBERTa-base backbone**.

**Performance Comparison**

| Model              | POS    | Tag    | Morph  | UAS    | LAS    | Lemma Acc | Summary (equal weights) |
| ------------------ | ------ | ------ | ------ | ------ | ------ | --------- | ----------------------- |
| spaCy (this model) | 0.9748 | 0.9215 | 0.9550 | 0.9104 | 0.8753 | 0.8203    | 90.96%                  |
| Stanza             | 0.9688 | 0.8987 | 0.9449 | 0.8791 | 0.8354 | 0.9539    | 91.35%                  |
| UDPipe             | 0.9207 | 0.7960 | 0.3403 | 0.0791 | 0.0660 | 0.8911    | 51.55%                  |

For details, see cells 12 and 13 of the evaluation notebook:

https://github.com/LazyBomb-SIA/LV_RoBERTa_Base/blob/main/lv_roberta_base.ipynb

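As a rough guide, the spaCy scores above can be reproduced with spaCy's built-in scorer. The sketch below is not the notebook code itself; it assumes the UD Latvian LVTB test split has already been converted to spaCy's binary format (the `lv_lvtb-ud-test.spacy` file name is hypothetical):

```python
import spacy
from huggingface_hub import snapshot_download
from spacy.tokens import DocBin
from spacy.training import Example

model_dir = snapshot_download(repo_id="JesseHuang922/lv_roberta_base", repo_type="model")
nlp = spacy.load(model_dir)

# Gold-standard test documents, e.g. created with:
#   python -m spacy convert lv_lvtb-ud-test.conllu . --converter conllu
doc_bin = DocBin().from_disk("lv_lvtb-ud-test.spacy")
gold_docs = list(doc_bin.get_docs(nlp.vocab))

# Pair each reference doc with an unannotated copy; nlp.evaluate runs the
# pipeline over the predictions and scores them against the references.
examples = [Example(nlp.make_doc(gold.text), gold) for gold in gold_docs]
scores = nlp.evaluate(examples)

print(scores["pos_acc"], scores["tag_acc"], scores["morph_acc"])
print(scores["dep_uas"], scores["dep_las"], scores["lemma_acc"])
```
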
The pipeline includes the following components:

- **Transformer** (XLM-RoBERTa-base)
- **Tagger**
- **Morphologizer**
- **Parser**
- **Sentence Segmenter (senter)**
- **Lemmatizer**
- (Note: the transformer component internally uses a `tok2vec` listener)

**Model type:** Transformer pipeline (XLM-RoBERTa-base backbone)  
**Language:** Latvian (lv)  
**Recommended hardware:** CPU for small-scale use; a GPU is recommended for faster inference.

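If a CUDA-capable GPU and a GPU-enabled PyTorch build are available, you can ask spaCy to use it before loading the pipeline; a minimal sketch:

```python
import spacy
from huggingface_hub import snapshot_download

# Request the GPU before loading so the transformer weights are allocated there;
# this returns False and silently falls back to CPU when no GPU is available.
spacy.prefer_gpu()

model_dir = snapshot_download(repo_id="JesseHuang922/lv_roberta_base", repo_type="model")
nlp = spacy.load(model_dir)
```
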
---

## Training Data

The model was trained on the **Latvian UD Treebank v2.16**, which is derived from the **Latvian Treebank (LVTB)** created at the University of Latvia, Institute of Mathematics and Computer Science, Artificial Intelligence Laboratory (AI Lab).

- **Dataset source:** [UD Latvian LVTB](https://github.com/UniversalDependencies/UD_Latvian-LVTB)
- **License:** [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/)
- **Data splits:**
  - Train: 15,055 sentences
  - Dev: 2,080 sentences
  - Test: 2,396 sentences

---

## Acknowledgements

- Thanks to the **University of Latvia, AI Lab**, and all contributors to the **Latvian UD Treebank**.
- Model development supported by [LazyBomb.SIA].
- Inspired by the **spaCy ecosystem** and training framework.
- The Latvian UD Treebank was developed with support from multiple grants, including:
  - European Regional Development Fund (Grant No. 1.1.1.1/16/A/219, 1.1.1.2/VIAA/1/16/188)
  - State Research Programme "National Identity"
  - State Research Programme "Digital Resources for the Humanities" (Grant No. VPP-IZM-DH-2020/1-0001)
  - State Research Programme "Research on Modern Latvian Language and Development of Language Technology" (Grant No. VPP-LETONIKA-2021/1-0006)

---

## Special Thanks

Special thanks to all contributors who participated in the beta test, especially those who provided valuable feedback.

**The list of contributors will be added here.**

---

## License

This model is released under the [Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0)](https://creativecommons.org/licenses/by-sa/4.0/).

You are free to:
- **Share** — copy and redistribute the material in any medium or format, for any purpose, even commercially.
- **Adapt** — remix, transform, and build upon the material for any purpose, even commercially.

Under the following terms:
- **Attribution** — You must give appropriate credit, provide a link to the license, and indicate if changes were made.
- **ShareAlike** — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

---

## References

- Pretkalniņa, L., Rituma, L., Saulīte, B., et al. (2016–2025). Universal Dependencies Latvian Treebank (LVTB).
- Grūzītis, N., Znotiņš, A., Nešpore-Bērzkalne, G., Paikens, P., et al. (2018). Creation of a Balanced State-of-the-Art Multilayer Corpus for NLU. *LREC 2018*.
- Pretkalniņa, L., Rituma, L., Saulīte, B. (2016). Universal Dependency Treebank for Latvian: A Pilot. *Baltic Perspective Workshop*.

---

## Usage

You can either:

1. **Download the model directly from the Hugging Face Hub**
Using `huggingface_hub.snapshot_download`, the model files will be automatically fetched and cached locally.

```python
import spacy
from huggingface_hub import snapshot_download

# Load the pipeline
model_dir = snapshot_download(repo_id="JesseHuang922/lv_roberta_base", repo_type="model")
nlp = spacy.load(model_dir)
```

2. **Install from the pre-built wheel package**
Download the wheel file (**lv_roberta_base-1.0.0-py3-none-any.whl**) and install it into your virtual environment with:

```bash
pip install lv_roberta_base-1.0.0-py3-none-any.whl
```

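After installing the wheel, the pipeline can also be loaded by its package name (a minimal sketch, assuming the wheel was built with `spacy package` and therefore registers the package `lv_roberta_base`):

```python
import spacy

# Installed as a regular Python package, so no download from the Hub is needed
nlp = spacy.load("lv_roberta_base")
print(nlp.pipe_names)
```
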
---

## Dependencies

The following Python packages are required to run the Latvian XLM-RoBERTa spaCy pipeline:

| Package                | Minimum Version | Notes                                                          |
| ---------------------- | --------------- | -------------------------------------------------------------- |
| **spaCy**              | 3.8.7           | Main NLP framework                                              |
| **spacy-transformers** | 1.3.9           | Integrates spaCy with Hugging Face Transformers                 |
| **transformers**       | 4.49.0          | Hugging Face Transformers library                               |
| **torch**              | 2.8.0           | PyTorch backend for transformers                                |
| **tokenizers**         | 0.21.4          | Fast tokenizer support                                          |
| **safetensors**        | 0.6.2           | Secure tensor storage for transformer weights                   |
| **huggingface-hub**    | 0.34.4          | Download and manage the model files from the Hugging Face Hub   |

## Optional but recommended

| Package    | Minimum Version | Notes                                                                                                    |
| ---------- | --------------- | -------------------------------------------------------------------------------------------------------- |
| **hf-xet** | 1.1.10          | Needed only for downloading or uploading large files from the Hugging Face Hub via the Xet storage backend |

## Install all dependencies with a single command

```bash
# Quoting keeps the shell from interpreting ">=" as a redirection
pip install \
  "spacy>=3.8.7" \
  "spacy-transformers>=1.3.9" \
  "transformers>=4.49.0" \
  "torch>=2.8.0" \
  "tokenizers>=0.21.4" \
  "safetensors>=0.6.2" \
  "huggingface-hub>=0.34.4" \
  "hf-xet>=1.1.10"
```

## Example Code

```python
import spacy
import numpy as np
from huggingface_hub import snapshot_download

# Load the pipeline
model_dir = snapshot_download(repo_id="JesseHuang922/lv_roberta_base", repo_type="model")
nlp = spacy.load(model_dir)

# Example text
text = """Baltijas jūras nosaukums ir devis nosaukumu baltu valodām un Baltijas valstīm.
Terminu "Baltijas jūra" (Mare Balticum) pirmoreiz lietoja vācu hronists Brēmenes Ādams 11. gadsimtā."""

# Process text
doc = nlp(text)

# ------------------------
# Tokenization
# ------------------------
print("Tokens:")
print([token.text for token in doc])

# ------------------------
# Lemmatization
# ------------------------
print("Lemmas:")
print([token.lemma_ for token in doc])

# ------------------------
# Part-of-Speech Tagging
# ------------------------
print("POS tags:")
for token in doc:
    print(f"{token.text}: {token.pos_} ({token.tag_})")

# ------------------------
# Morphological Features
# ------------------------
print("Morphological features:")
for token in doc:
    print(f"{token.text}: {token.morph}")

# ------------------------
# Dependency Parsing
# ------------------------
print("Dependency parsing:")
for token in doc:
    print(f"{token.text} <--{token.dep_}-- {token.head.text}")

# ------------------------
# Sentence Segmentation
# ------------------------
print("Sentences:")
for sent in doc.sents:
    print(sent.text)

# ------------------------
# Check Pipeline Components
# ------------------------
print("Pipeline components:")
print(nlp.pipe_names)

# Transformer vectors
vectors = np.vstack([token.vector for token in doc])
print("Token vectors shape:", vectors.shape)
```

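For processing many texts, `nlp.pipe` batches documents through the transformer and is usually much faster than calling `nlp` on each text individually, especially on a GPU. A minimal sketch (the example sentences are placeholders):

```python
import spacy
from huggingface_hub import snapshot_download

model_dir = snapshot_download(repo_id="JesseHuang922/lv_roberta_base", repo_type="model")
nlp = spacy.load(model_dir)

# Placeholder texts; replace with your own corpus
texts = [
    "Rīga ir Latvijas galvaspilsēta.",
    "Latviešu valoda pieder baltu valodu grupai.",
]

# Batching lets the transformer process several documents per forward pass
for doc in nlp.pipe(texts, batch_size=32):
    print([(token.text, token.pos_) for token in doc])
```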