# Yo-ByT5
This model is a fine-tuned version of google/byt5-small on a Yoruba dataset. It is designed to automatically restore diacritics (tone marks and underdots) to Yoruba text, which is crucial for lexical disambiguation and proper pronunciation in downstream tasks.
## Model Description
- Model Type: Byte-level T5 (ByT5) for sequence-to-sequence generation.
- Language(s): Yoruba (yo)
- Task: Diacritic Restoration (Automatic Diacritization)
- Developed by: Gali Ahmad Samuel (lazymonster)
- Shared by: Gali Ahmad Samuel (lazymonster)
Yoruba is a tonal language where the meaning of a word relies heavily on tone marks (acute and grave accents) and underdots. This model takes non-diacritized (or partially diacritized) text as input and outputs the fully diacritized text.
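At the Unicode level, the tone marks and underdots are combining characters, so undiacritized input (for testing, or for building training pairs) can be produced mechanically. This is an illustrative sketch, not part of the model itself:

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    # Decompose precomposed characters (e.g. ọ -> o + COMBINING DOT BELOW),
    # then drop every combining mark, leaving the bare letters.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics("ọwọ́"))  # -> "owo", the ambiguous form the model must resolve
```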
## Intended Uses & Limitations
### Intended Uses
- Preprocessing: Cleaning text for Text-to-Speech (TTS) or Machine Translation (MT) systems where accurate diacritics are mandatory.
- Search Engines: Normalizing user queries in Yoruba.
- Linguistic Analysis: Assisting in the annotation of low-resource language datasets.
### Limitations
- The model may struggle with proper nouns or ambiguous context where multiple valid diacritization patterns exist for the same character sequence (e.g., owo could be owó [money], ọwọ́ [hand], or ọ̀wọ̀ [honor]).
- Inference is slower than with word- or subword-level models, because ByT5's byte-level tokenization produces much longer input and output sequences.
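To make the byte-level cost concrete: ByT5 consumes UTF-8 bytes, and each diacritized Yoruba character expands to several bytes, so sequences grow well beyond the visible character count. A quick illustrative check:

```python
text = "Ẹ̀kọ́ ni kọ́kọ́rọ́ àṣeyọrí."
# A visible character like Ẹ̀ is a base letter plus a combining tone mark,
# and each of those code points costs 2-3 bytes in UTF-8.
print(len(text), "code points ->", len(text.encode("utf-8")), "bytes")
```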
## Training and Evaluation Data
More information needed
## Training Procedure
The model was trained using the Hugging Face Seq2SeqTrainer on Google Cloud TPUs.
### Training Hyperparameters
The following hyperparameters were used during training:
- Learning Rate: 2e-4
- Effective Train Batch Size: 32
- Eval Batch Size: 16
- Seed: 42
- Optimizer: AdamW (betas=(0.9,0.999), epsilon=1e-08)
- LR Scheduler: Linear
- Num Epochs: 20
- Hardware: Google Cloud TPU v6e-8
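The training script itself is not published; a minimal `Seq2SeqTrainingArguments` configuration matching the hyperparameters above might look like the sketch below. The per-device batch-size split across the 8 TPU cores is an assumption, as is the output directory name:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="yobyt5-restoration",
    learning_rate=2e-4,
    per_device_train_batch_size=4,  # assumed 4 x 8 TPU cores = effective batch size 32
    per_device_eval_batch_size=2,   # assumed 2 x 8 = 16
    num_train_epochs=20,
    seed=42,
    lr_scheduler_type="linear",
    optim="adamw_torch",            # AdamW with default betas/epsilon
    predict_with_generate=True,
)
```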
### Framework Versions
- Transformers 4.53.3
- PyTorch 2.6.0+cu124
- Datasets 4.4.1
- Tokenizers 0.21.2
- torch_xla (TPU support)
## Evaluation Results
The model was evaluated on a held-out test set using beam search (num_beams=5).
| Metric | Value | Description |
|---|---|---|
| Word Accuracy | 83.79% | Percentage of words perfectly reconstructed. |
| Underdot Accuracy | 92.35% | Accuracy of restoring sub-character underdots. |
| WER | 0.1628 | Word Error Rate (lower is better). |
| CER | 0.0558 | Character Error Rate (lower is better). |
| Yoruba DER | 0.0397 | Diacritic Error Rate specific to Yoruba markers. |
| BLEU | 0.6875 | Bilingual Evaluation Understudy score. |
| ChrF | 83.91 | Character n-gram F-score. |
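The evaluation scripts are not included with the card; the error-rate metrics above all reduce to Levenshtein edit distance, which the following illustrative sketch shows (the helper names are my own, not the actual evaluation code):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance over characters or word lists.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(reference, hypothesis):
    # Character Error Rate: character edit distance over reference length.
    return levenshtein(reference, hypothesis) / len(reference)

def wer(reference, hypothesis):
    # Word Error Rate: word-level edit distance over reference word count.
    ref, hyp = reference.split(), hypothesis.split()
    return levenshtein(ref, hyp) / len(ref)

def word_accuracy(reference, hypothesis):
    # Fraction of aligned words reproduced exactly (position-wise comparison).
    ref, hyp = reference.split(), hypothesis.split()
    return sum(r == h for r, h in zip(ref, hyp)) / len(ref)
```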
## Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("lazymonster/yobyt5-restoration")
model = AutoModelForSeq2SeqLM.from_pretrained("lazymonster/yobyt5-restoration")

text = "Eko ni kokoro aseyori. A gbodo sise kara ki ojo ola wa le dara. Omo ti o ba kawe re daadaa yoo mu inu awon obi re dun. Nitori naa, ko ye ki a fi owo yepere mu eko wa rara."
# English: Education is the key to success. We must work hard so that our future
# can be bright. A child who studies well will make their parents happy.
# Therefore, we should not take our education lightly at all.

inputs = tokenizer(text, return_tensors="pt", max_length=1024, truncation=True)
outputs = model.generate(
    inputs["input_ids"],
    max_length=1024,
    num_beams=1,  # greedy decoding; the evaluation above used num_beams=5
)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
# Expected output: "Ẹ̀kọ́ ni kọ́kọ́rọ́ àṣeyọrí. A gbọ́dọ̀ ṣiṣẹ́ kára kí ọjọ́ ọ̀la wa lè dára. Ọmọ tí ó bá kàwé rẹ̀ dáadáa yóò mú inú àwọn òbí rẹ̀ dùn. Nítorí náà, kò yẹ kí a fi ọwọ́ yẹpẹrẹ mú ẹ̀kọ́ wa rárá."
```
## Acknowledgments
This model was trained by a member of the HausaNLP Research Group using compute resources generously provided by the Google TPU Research Cloud (TRC).