---
language: ti
license: apache-2.0
datasets:
  - NLLB
library_name: transformers
tags:
  - tigrinya
  - masked-language-modeling
  - xlmr
  - low-resource
  - multilingual
model_name: XLM-Roberta fine-tuned on Tigrinya (MLM)
---

# XLM-Roberta Fine-Tuned on Tigrinya (MLM)

This model is a fine-tuned version of xlm-roberta-base for the Tigrinya language (ትግርኛ), trained with the Masked Language Modeling (MLM) objective. It uses a custom BPE tokenizer adapted to Tigrinya, with new token embeddings initialized via FastText-informed weighted averages of the pretrained XLM-R embeddings.
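The card does not spell out the masking procedure. Assuming the standard BERT/XLM-R recipe (select 15% of token positions; of those, replace 80% with the mask token, 10% with a random token, and leave 10% unchanged), the corruption step can be sketched as:

```python
import random

def mlm_mask(token_ids, mask_id, vocab_size, p=0.15, seed=0):
    """Sketch of the standard MLM corruption (assumed recipe, not this repo's code).
    Returns (corrupted_ids, labels); labels are -100 at unselected positions,
    the index ignored by the cross-entropy loss in transformers."""
    rng = random.Random(seed)
    ids, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < p:          # select ~15% of positions
            labels[i] = tok           # the model must predict the original token here
            roll = rng.random()
            if roll < 0.8:            # 80%: replace with the mask token
                ids[i] = mask_id
            elif roll < 0.9:          # 10%: replace with a random token
                ids[i] = rng.randrange(vocab_size)
            # remaining 10%: keep the original token unchanged
    return ids, labels
```

In practice, `DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)` from transformers applies this corruption at batching time.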

## 🔧 Details

- **Base model:** xlm-roberta-base
- **Language:** Tigrinya (`ti`)
- **Tokenizer:** custom BPE tokenizer (non-morpheme-aware)
- **Adaptation:** embedding initialization using weighted averages of pretrained XLM-R embeddings, guided by Tigrinya FastText word vectors
- **Training data:** Tigrinya side of the NLLB (No Language Left Behind) parallel corpus
- **Objective:** Masked Language Modeling (MLM)
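The embedding-transfer step above can be illustrated in a few lines. This is a minimal numpy sketch of the idea, not the repo's actual code (the function name, `k`, and the nearest-neighbour weighting are assumptions): for each new Tigrinya token, find its nearest old-vocabulary tokens in FastText space and initialize its XLM-R embedding as their similarity-weighted average.

```python
import numpy as np

def init_new_embeddings(xlmr_emb, ft_old, ft_new, k=5):
    """Hypothetical sketch of FastText-guided embedding initialization.
    xlmr_emb: (V_old, d) pretrained XLM-R input embeddings
    ft_old:   (V_old, f) FastText vectors for tokens already in the XLM-R vocab
    ft_new:   (V_new, f) FastText vectors for the new Tigrinya tokens
    Returns (V_new, d) initial embeddings: similarity-weighted averages of the
    XLM-R embeddings of each new token's k nearest FastText neighbours."""
    # Cosine similarity in FastText space.
    a = ft_new / np.linalg.norm(ft_new, axis=1, keepdims=True)
    b = ft_old / np.linalg.norm(ft_old, axis=1, keepdims=True)
    sim = a @ b.T                          # (V_new, V_old)
    out = np.empty((ft_new.shape[0], xlmr_emb.shape[1]))
    for i, row in enumerate(sim):
        nn = np.argsort(row)[-k:]          # k most similar old tokens
        w = np.maximum(row[nn], 0.0)       # clip negative similarities
        w = w / w.sum() if w.sum() > 0 else np.full(len(nn), 1.0 / len(nn))
        out[i] = w @ xlmr_emb[nn]          # weighted average of their embeddings
    return out
```

A new token whose FastText vector coincides with an existing token's thus starts exactly at that token's pretrained embedding, giving the model a warm start instead of a random one.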

## 🧪 Usage

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Hailay/xlmr-tigriyna-mlm")
model = AutoModelForMaskedLM.from_pretrained("Hailay/xlmr-tigriyna-mlm")

text = "ትግራይ ብምትሕብባር ንህዝቢ ግብሪ ቀጺሉ።"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)  # outputs.logits has shape (batch, seq_len, vocab_size)
```
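Since this is a masked-LM checkpoint, the natural next step is to insert the tokenizer's mask token into a sentence and read the top candidates out of `outputs.logits`. A sketch (the helper `top_k_fills` is illustrative, not part of the repo):

```python
import torch

def top_k_fills(model, tokenizer, text, k=5):
    """Return the k most likely replacements for the first mask token in `text`."""
    inputs = tokenizer(text, return_tensors="pt")
    # Locate the mask position in the encoded sequence.
    mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)
    with torch.no_grad():
        logits = model(**inputs).logits      # (batch, seq_len, vocab_size)
    row = logits[0, mask_positions[1][0]]    # scores over the vocab at the mask
    top_ids = row.topk(k).indices.tolist()
    return [tokenizer.decode([i]).strip() for i in top_ids]
```

Call it with any Tigrinya sentence containing `tokenizer.mask_token`; the transformers `fill-mask` pipeline performs the same steps.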
## 📌 Intended Use

- Pretraining for Tigrinya NLP tasks
- Fine-tuning on classification, NER, QA, and other downstream tasks in Tigrinya
- Research on low-resource Semitic and morphologically rich languages

## 📖 Citation

```bibtex
@misc{hailay2025tigrinya,
  title={Tigrinya MLM with XLM-R and FastText-Informed Embedding Initialization},
  author={Hailay Kidu},
  year={2025},
  url={https://huggingface.co/Hailay/xlmr-tigriyna-mlm}
}
```
## 🏷️ License

Apache License 2.0