---
language:
- lv
base_model:
- FacebookAI/xlm-roberta-base
license: cc-by-sa-4.0
datasets:
- universal_dependencies
metrics:
- accuracy
- uas
- las
---

# Latvian spaCy Model: lv_roberta_base

## GitHub Repo
https://github.com/LazyBomb-SIA/LV_RoBERTa_Base

---

## Overview

This is a **spaCy transformer-based pipeline for Latvian**, built with the **XLM-RoBERTa-base backbone**.

**Performance Comparison**

| Model              | POS    | Tag    | Morph  | UAS    | LAS    | Lemma Acc | Summary (equal weights) |
| ------------------ | ------ | ------ | ------ | ------ | ------ | --------- | ----------------------- |
| spaCy (this model) | 0.9748 | 0.9215 | 0.9550 | 0.9104 | 0.8753 | 0.8203    | 90.96%                  |
| Stanza             | 0.9688 | 0.8987 | 0.9449 | 0.8791 | 0.8354 | 0.9539    | 91.35%                  |
| UDPipe             | 0.9207 | 0.7960 | 0.3403 | 0.0791 | 0.0660 | 0.8911    | 51.55%                  |

For details, see cells 12 and 13 of the evaluation notebook:

https://github.com/LazyBomb-SIA/LV_RoBERTa_Base/blob/main/lv_roberta_base.ipynb

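As a rough guide, the spaCy scores above can be reproduced with spaCy's built-in scorer. The sketch below is not the notebook code itself; it assumes the UD Latvian LVTB test split has already been converted to spaCy's binary format (the `lv_lvtb-ud-test.spacy` file name is hypothetical):

```python
import spacy
from huggingface_hub import snapshot_download
from spacy.tokens import DocBin
from spacy.training import Example

model_dir = snapshot_download(repo_id="JesseHuang922/lv_roberta_base", repo_type="model")
nlp = spacy.load(model_dir)

# Gold-standard test documents, e.g. created with:
#   python -m spacy convert lv_lvtb-ud-test.conllu . --converter conllu
doc_bin = DocBin().from_disk("lv_lvtb-ud-test.spacy")
gold_docs = list(doc_bin.get_docs(nlp.vocab))

# Pair each reference doc with an unannotated copy; nlp.evaluate runs the
# pipeline over the predictions and scores them against the references.
examples = [Example(nlp.make_doc(gold.text), gold) for gold in gold_docs]
scores = nlp.evaluate(examples)

print(scores["pos_acc"], scores["tag_acc"], scores["morph_acc"])
print(scores["dep_uas"], scores["dep_las"], scores["lemma_acc"])
```
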
The pipeline includes the following components:

- **Transformer** (XLM-RoBERTa-base)
- **Tagger**
- **Morphologizer**
- **Parser**
- **Sentence Segmenter (senter)**
- **Lemmatizer**
- (Note: the transformer component internally uses a `tok2vec` listener)

**Model type:** Transformer pipeline (XLM-RoBERTa-base backbone)  
**Language:** Latvian (lv)  
**Recommended hardware:** CPU for small-scale use; a GPU is recommended for faster inference.

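If a CUDA-capable GPU and a GPU-enabled PyTorch build are available, you can ask spaCy to use it before loading the pipeline; a minimal sketch:

```python
import spacy
from huggingface_hub import snapshot_download

# Request the GPU before loading so the transformer weights are allocated there;
# this returns False and silently falls back to CPU when no GPU is available.
spacy.prefer_gpu()

model_dir = snapshot_download(repo_id="JesseHuang922/lv_roberta_base", repo_type="model")
nlp = spacy.load(model_dir)
```
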
---

## Training Data

The model was trained on the **Latvian UD Treebank v2.16**, which is derived from the **Latvian Treebank (LVTB)** created at the University of Latvia, Institute of Mathematics and Computer Science, Artificial Intelligence Laboratory (AI Lab).

- **Dataset source:** [UD Latvian LVTB](https://github.com/UniversalDependencies/UD_Latvian-LVTB)
- **License:** [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/)
- **Data splits:**
  - Train: 15,055 sentences
  - Dev: 2,080 sentences
  - Test: 2,396 sentences

---

## Acknowledgements

- Thanks to the **University of Latvia, AI Lab**, and all contributors to the **Latvian UD Treebank**.
- Model development supported by [LazyBomb.SIA].
- Inspired by the **spaCy ecosystem** and training framework.
- The Latvian UD Treebank was developed with support from multiple grants, including:
  - European Regional Development Fund (Grant No. 1.1.1.1/16/A/219, 1.1.1.2/VIAA/1/16/188)
  - State Research Programme "National Identity"
  - State Research Programme "Digital Resources for the Humanities" (Grant No. VPP-IZM-DH-2020/1-0001)
  - State Research Programme "Research on Modern Latvian Language and Development of Language Technology" (Grant No. VPP-LETONIKA-2021/1-0006)

---

## Special Thanks

Special thanks to all contributors who participated in the beta test, especially those who provided valuable feedback.

**The list of contributors will be added here.**

---

## License

This model is released under the [Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0)](https://creativecommons.org/licenses/by-sa/4.0/).

You are free to:
- **Share** — copy and redistribute the material in any medium or format, for any purpose, even commercially.
- **Adapt** — remix, transform, and build upon the material for any purpose, even commercially.

Under the following terms:
- **Attribution** — You must give appropriate credit, provide a link to the license, and indicate if changes were made.
- **ShareAlike** — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

---

## References

- Pretkalniņa, L., Rituma, L., Saulīte, B., et al. (2016–2025). Universal Dependencies Latvian Treebank (LVTB).
- Grūzītis, N., Znotiņš, A., Nešpore-Bērzkalne, G., Paikens, P., et al. (2018). Creation of a Balanced State-of-the-Art Multilayer Corpus for NLU. *LREC 2018*.
- Pretkalniņa, L., Rituma, L., Saulīte, B. (2016). Universal Dependency Treebank for Latvian: A Pilot. *Baltic Perspective Workshop*.

---

## Usage

You can either:

1. **Download the model directly from the Hugging Face Hub**
Using `huggingface_hub.snapshot_download`, the model files will be automatically fetched and cached locally.

```python
import spacy
from huggingface_hub import snapshot_download

# Load the pipeline
model_dir = snapshot_download(repo_id="JesseHuang922/lv_roberta_base", repo_type="model")
nlp = spacy.load(model_dir)
```

2. **Install from the pre-built wheel package**
Download the wheel file (**lv_roberta_base-1.0.0-py3-none-any.whl**) and install it into your virtual environment with:

```bash
pip install lv_roberta_base-1.0.0-py3-none-any.whl
```

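After installing the wheel, the pipeline can also be loaded by its package name (a minimal sketch, assuming the wheel was built with `spacy package` and therefore registers the package `lv_roberta_base`):

```python
import spacy

# Installed as a regular Python package, so no download from the Hub is needed
nlp = spacy.load("lv_roberta_base")
print(nlp.pipe_names)
```
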
---

## Dependencies

The following Python packages are required to run the Latvian XLM-RoBERTa spaCy pipeline:

| Package                | Minimum Version | Notes                                                          |
| ---------------------- | --------------- | -------------------------------------------------------------- |
| **spaCy**              | 3.8.7           | Main NLP framework                                              |
| **spacy-transformers** | 1.3.9           | Integrates spaCy with Hugging Face Transformers                 |
| **transformers**       | 4.49.0          | Hugging Face Transformers library                               |
| **torch**              | 2.8.0           | PyTorch backend for transformers                                |
| **tokenizers**         | 0.21.4          | Fast tokenizer support                                          |
| **safetensors**        | 0.6.2           | Secure tensor storage for transformer weights                   |
| **huggingface-hub**    | 0.34.4          | Download and manage the model files from the Hugging Face Hub   |

## Optional but recommended

| Package    | Minimum Version | Notes                                                                                                    |
| ---------- | --------------- | -------------------------------------------------------------------------------------------------------- |
| **hf-xet** | 1.1.10          | Needed only for downloading or uploading large files from the Hugging Face Hub via the Xet storage backend |

## Install all dependencies with a single command

```bash
# Quoting keeps the shell from interpreting ">=" as a redirection
pip install \
  "spacy>=3.8.7" \
  "spacy-transformers>=1.3.9" \
  "transformers>=4.49.0" \
  "torch>=2.8.0" \
  "tokenizers>=0.21.4" \
  "safetensors>=0.6.2" \
  "huggingface-hub>=0.34.4" \
  "hf-xet>=1.1.10"
```

## Example Code

```python
import spacy
import numpy as np
from huggingface_hub import snapshot_download

# Load the pipeline
model_dir = snapshot_download(repo_id="JesseHuang922/lv_roberta_base", repo_type="model")
nlp = spacy.load(model_dir)

# Example text
text = """Baltijas jūras nosaukums ir devis nosaukumu baltu valodām un Baltijas valstīm.
Terminu "Baltijas jūra" (Mare Balticum) pirmoreiz lietoja vācu hronists Brēmenes Ādams 11. gadsimtā."""

# Process text
doc = nlp(text)

# ------------------------
# Tokenization
# ------------------------
print("Tokens:")
print([token.text for token in doc])

# ------------------------
# Lemmatization
# ------------------------
print("Lemmas:")
print([token.lemma_ for token in doc])

# ------------------------
# Part-of-Speech Tagging
# ------------------------
print("POS tags:")
for token in doc:
    print(f"{token.text}: {token.pos_} ({token.tag_})")

# ------------------------
# Morphological Features
# ------------------------
print("Morphological features:")
for token in doc:
    print(f"{token.text}: {token.morph}")

# ------------------------
# Dependency Parsing
# ------------------------
print("Dependency parsing:")
for token in doc:
    print(f"{token.text} <--{token.dep_}-- {token.head.text}")

# ------------------------
# Sentence Segmentation
# ------------------------
print("Sentences:")
for sent in doc.sents:
    print(sent.text)

# ------------------------
# Check Pipeline Components
# ------------------------
print("Pipeline components:")
print(nlp.pipe_names)

# Transformer vectors
vectors = np.vstack([token.vector for token in doc])
print("Token vectors shape:", vectors.shape)
```

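For processing many texts, `nlp.pipe` batches documents through the transformer and is usually much faster than calling `nlp` on each text individually, especially on a GPU. A minimal sketch (the example sentences are placeholders):

```python
import spacy
from huggingface_hub import snapshot_download

model_dir = snapshot_download(repo_id="JesseHuang922/lv_roberta_base", repo_type="model")
nlp = spacy.load(model_dir)

# Placeholder texts; replace with your own corpus
texts = [
    "Rīga ir Latvijas galvaspilsēta.",
    "Latviešu valoda pieder baltu valodu grupai.",
]

# Batching lets the transformer process several documents per forward pass
for doc in nlp.pipe(texts, batch_size=32):
    print([(token.text, token.pos_) for token in doc])
```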