DiacNetYorX

DiacNetYorX is a transformer-based sequence classifier fine-tuned from castorini/afriberta_large for Yoruba tonal diacritization, i.e. restoring the tone marks and underdots that are missing from plain text.

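Instead of predicting over the full vocabulary, the model predicts, for each plain (undiacritized) word, the index (0 to 7) of the correct form in that word's candidate list. This narrows the search space to at most eight options per word, which is intended to reduce overfitting and to handle rare tokens more gracefully.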
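
The candidate-index idea can be sketched as follows. This is a minimal illustration, not the model's actual code: the candidates table, the pick_candidate helper, and the score values are hypothetical, and the real schema of diacnet_yor_x_vocab.json may differ. The sketch assumes the classifier emits eight scores per word and that slots beyond a word's candidate count are ignored.

```python
# Hypothetical candidate table: plain word -> diacritized forms (at most 8).
# The real mapping ships in diacnet_yor_x_vocab.json; this layout is an assumption.
MAX_CANDIDATES = 8

candidates = {
    "oja": ["ọjà", "ọ̀já"],
    "ojo": ["ọjọ́", "òjò", "ojo"],
}


def pick_candidate(plain_word, scores):
    """Return the highest-scoring candidate, or the plain word if it is unknown."""
    if len(scores) != MAX_CANDIDATES:
        raise ValueError(f"expected {MAX_CANDIDATES} scores, got {len(scores)}")
    options = candidates.get(plain_word)
    if not options:
        # No candidate list for this word: leave it unchanged.
        return plain_word
    # Only the first len(options) slots correspond to real candidates.
    valid = scores[: len(options)]
    best = max(range(len(valid)), key=valid.__getitem__)
    return options[best]


# Index 3 has the largest raw score, but "ojo" has only three candidates,
# so the choice is made among indices 0 to 2.
example_scores = [0.1, 2.0, 0.3, 9.0, 0.0, 0.0, 0.0, 0.0]
print(pick_candidate("ojo", example_scores))
```

Restricting the argmax to the valid slots is what keeps the output inside the word's own candidate list, whatever scores the unused slots receive.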

Model Details

Base Model: castorini/afriberta_large (125M parameters)
Model Type: Transformer Sequence Classification (Candidate Index Ranking)
File Size: 503.56 MB (diacnet_yor_x.pt)
Metrics:
- Validation Word Accuracy: 82.46%
- Test Word Accuracy: 78.26%
Dependencies: PyTorch, Transformers
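
The card does not state how word accuracy is computed. A plausible reading, assumed here, is the fraction of whitespace-separated words whose predicted form exactly matches the reference after Unicode NFC normalization. The word_accuracy helper below is a hypothetical sketch of that definition, not the evaluation script used for the reported numbers.

```python
import unicodedata


def word_accuracy(predicted, reference):
    """Fraction of words in `predicted` that exactly match `reference`.

    Assumes both strings have the same number of whitespace-separated words,
    which holds when diacritization only changes characters inside words.
    """
    # Normalize so precomposed and combining-mark spellings compare equal.
    pred_words = unicodedata.normalize("NFC", predicted).split()
    ref_words = unicodedata.normalize("NFC", reference).split()
    if len(pred_words) != len(ref_words):
        raise ValueError("predicted and reference must have the same word count")
    if not ref_words:
        return 0.0
    correct = sum(p == r for p, r in zip(pred_words, ref_words))
    return correct / len(ref_words)


# Three of four words match, so the score is 0.75.
print(word_accuracy("a b c d", "a b x d"))
```

Normalizing first matters for Yoruba, since the same diacritized letter can be encoded either as one precomposed code point or as a base letter plus combining marks.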

Usage

Load and use the model through the unified olaverse SDK wrapper, which automatically downloads the weights and loads the Transformer model in the background:

from olaverse.nlp.diacritizer import Diacritizer

diacritizer = Diacritizer(model="diacnet-yor-x")
text = "Ojo lo si oja lana"
print(diacritizer.restore(text))
# Output: "Ọjọ́ ló sí ọjà lànà"

Files

diacnet_yor_x.pt: PyTorch model weights.
diacnet_yor_x_vocab.json: The word candidate list mapping.
