# Yo-ByT5
This model is a fine-tuned version of google/byt5-small on a Yoruba dataset. It is designed to automatically restore diacritics (tone marks and underdots) to Yoruba text, which is crucial for lexical disambiguation and proper pronunciation in downstream tasks.
## Model Description
- Model Type: Byte-level T5 (ByT5) for sequence-to-sequence generation.
- Language(s): Yoruba (yo)
- Task: Diacritic Restoration (Automatic Diacritization)
- Developed by: Gali Ahmad Samuel (lazymonster)
- Shared by: Gali Ahmad Samuel (lazymonster)
Yoruba is a tonal language where the meaning of a word relies heavily on tone marks (acute and grave accents) and underdots. This model takes non-diacritized (or partially diacritized) text as input and outputs the fully diacritized text.
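At the Unicode level, the tone marks and underdots are combining characters, so undiacritized input (for testing, or for building training pairs) can be produced mechanically. This is an illustrative sketch, not part of the model itself:

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    # Decompose precomposed characters (e.g. ọ -> o + COMBINING DOT BELOW),
    # then drop every combining mark, leaving the bare letters.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics("ọwọ́"))  # -> "owo", the ambiguous form the model must resolve
```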
## Intended Uses & Limitations
### Intended Uses
- Preprocessing: Cleaning text for Text-to-Speech (TTS) or Machine Translation (MT) systems where accurate diacritics are mandatory.
- Search Engines: Normalizing user queries in Yoruba.
- Linguistic Analysis: Assisting in the annotation of low-resource language datasets.
### Limitations
- The model may struggle with proper nouns or ambiguous context where multiple valid diacritization patterns exist for the same character sequence (e.g., owo could be owó [money], ọwọ́ [hand], or ọ̀wọ̀ [honor]).
- Inference is slower than with word- or subword-level models, because ByT5's byte-level tokenization produces much longer input and output sequences.
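To make the byte-level cost concrete: ByT5 consumes UTF-8 bytes, and each diacritized Yoruba character expands to several bytes, so sequences grow well beyond the visible character count. A quick illustrative check:

```python
text = "Ẹ̀kọ́ ni kọ́kọ́rọ́ àṣeyọrí."
# A visible character like Ẹ̀ is a base letter plus a combining tone mark,
# and each of those code points costs 2-3 bytes in UTF-8.
print(len(text), "code points ->", len(text.encode("utf-8")), "bytes")
```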
## Training and Evaluation Data
More information needed
## Training Procedure
The model was trained using the Hugging Face Seq2SeqTrainer on Google Cloud TPUs.
### Training Hyperparameters
The following hyperparameters were used during training:
- Learning Rate: 2e-4
- Effective Train Batch Size: 32
- Eval Batch Size: 16
- Seed: 42
- Optimizer: AdamW (betas=(0.9,0.999), epsilon=1e-08)
- LR Scheduler: Linear
- Num Epochs: 20
- Hardware: Google Cloud TPU v6e-8
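The training script itself is not published; a minimal `Seq2SeqTrainingArguments` configuration matching the hyperparameters above might look like the sketch below. The per-device batch-size split across the 8 TPU cores is an assumption, as is the output directory name:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="yobyt5-restoration",
    learning_rate=2e-4,
    per_device_train_batch_size=4,  # assumed 4 x 8 TPU cores = effective batch size 32
    per_device_eval_batch_size=2,   # assumed 2 x 8 = 16
    num_train_epochs=20,
    seed=42,
    lr_scheduler_type="linear",
    optim="adamw_torch",            # AdamW with default betas/epsilon
    predict_with_generate=True,
)
```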
### Framework Versions
- Transformers 4.53.3
- PyTorch 2.6.0+cu124
- Datasets 4.4.1
- Tokenizers 0.21.2
- torch_xla (TPU support)
## Evaluation Results
The model was evaluated on a held-out test set using beam search (num_beams=5).
| Metric | Value | Description |
|---|---|---|
| Word Accuracy | 83.79% | Percentage of words perfectly reconstructed. |
| Underdot Accuracy | 92.35% | Accuracy of restoring sub-character underdots. |
| WER | 0.1628 | Word Error Rate (lower is better). |
| CER | 0.0558 | Character Error Rate (lower is better). |
| Yoruba DER | 0.0397 | Diacritic Error Rate specific to Yoruba markers. |
| BLEU | 0.6875 | Bilingual Evaluation Understudy score. |
| ChrF | 83.91 | Character n-gram F-score. |
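The evaluation scripts are not included with the card; the error-rate metrics above all reduce to Levenshtein edit distance, which the following illustrative sketch shows (the helper names are my own, not the actual evaluation code):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance over characters or word lists.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(reference, hypothesis):
    # Character Error Rate: character edit distance over reference length.
    return levenshtein(reference, hypothesis) / len(reference)

def wer(reference, hypothesis):
    # Word Error Rate: word-level edit distance over reference word count.
    ref, hyp = reference.split(), hypothesis.split()
    return levenshtein(ref, hyp) / len(ref)

def word_accuracy(reference, hypothesis):
    # Fraction of aligned words reproduced exactly (position-wise comparison).
    ref, hyp = reference.split(), hypothesis.split()
    return sum(r == h for r, h in zip(ref, hyp)) / len(ref)
```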
## Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("lazymonster/yobyt5-restoration")
model = AutoModelForSeq2SeqLM.from_pretrained("lazymonster/yobyt5-restoration")

text = "Eko ni kokoro aseyori. A gbodo sise kara ki ojo ola wa le dara. Omo ti o ba kawe re daadaa yoo mu inu awon obi re dun. Nitori naa, ko ye ki a fi owo yepere mu eko wa rara."
# English: Education is the key to success. We must work hard so that our future
# can be bright. A child who studies well will make their parents happy.
# Therefore, we should not take our education lightly at all.

inputs = tokenizer(text, return_tensors="pt", max_length=1024, truncation=True)
outputs = model.generate(
    inputs["input_ids"],
    max_length=1024,
    num_beams=1,  # greedy decoding; the evaluation above used num_beams=5
)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
# Expected output: "Ẹ̀kọ́ ni kọ́kọ́rọ́ àṣeyọrí. A gbọ́dọ̀ ṣiṣẹ́ kára kí ọjọ́ ọ̀la wa lè dára. Ọmọ tí ó bá kàwé rẹ̀ dáadáa yóò mú inú àwọn òbí rẹ̀ dùn. Nítorí náà, kò yẹ kí a fi ọwọ́ yẹpẹrẹ mú ẹ̀kọ́ wa rárá."
```
## Acknowledgments
This model was trained by a member of the HausaNLP Research Group using compute resources generously provided by the Google TPU Research Cloud (TRC).