DiacNetYor

DiacNetYor is a character-level bidirectional LSTM (BiLSTM) model that restores full tonal marks and dot-below (underdot) diacritics to undiacritized Yoruba (yo) text.
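To make the character-level framing concrete, the sketch below shows one way diacritization can be cast as sequence labeling: each base character gets a label made of the combining marks attached to it. This label scheme is an assumption for illustration; the model's actual label set is not documented here, and the helper names `to_char_labels`, `from_char_labels`, and `strip_diacritics` are hypothetical.

```python
import unicodedata


def to_char_labels(diacritized: str) -> list[tuple[str, str]]:
    """Split text into (base character, combining-mark label) pairs.

    Assumed label scheme: the label is the string of combining marks that
    follow the base character after NFD decomposition ("" means no marks).
    """
    pairs: list[tuple[str, str]] = []
    for ch in unicodedata.normalize("NFD", diacritized):
        if unicodedata.combining(ch) and pairs:
            # Attach the combining mark to the preceding base character.
            base, marks = pairs[-1]
            pairs[-1] = (base, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def from_char_labels(pairs: list[tuple[str, str]]) -> str:
    """Rebuild diacritized text (NFC) from (base, label) pairs."""
    return unicodedata.normalize("NFC", "".join(b + m for b, m in pairs))


def strip_diacritics(text: str) -> str:
    """Keep only the base characters, i.e. the model's input side."""
    return "".join(base for base, _ in to_char_labels(text))
```

Under this framing, a sequence labeler reads the output of `strip_diacritics` and predicts one label per character; `from_char_labels` then reassembles the diacritized string.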

Model Details

Model Type: Character-level BiLSTM Sequence Labeler
Parameters: ~1.2M
File Size: 2.42 MB (diacnet_yor.pt)
Vocabulary Size: 119 characters
Supported Languages: Yoruba (yo)
Metrics:
- Validation Word Accuracy: 81.81%
- Test Character Accuracy: 93.35%
- Test Word Accuracy: 78.32%
Dependencies: PyTorch
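For readers who want to compute comparable metrics on their own data, here is a minimal sketch of character and word accuracy. The exact evaluation protocol behind the numbers above is not specified, so the details are assumptions: characters are compared as base-plus-marks units, whitespace counts as a character, and words are split on whitespace. The function names are hypothetical.

```python
import unicodedata


def char_units(text: str) -> list[str]:
    """Group each base character with its combining marks into one unit."""
    units: list[str] = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and units:
            units[-1] += ch
        else:
            units.append(ch)
    return units


def char_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of reference character units reproduced at the same position."""
    correct = total = 0
    for pred, ref in zip(predictions, references):
        ref_units = char_units(ref)
        total += len(ref_units)
        correct += sum(p == r for p, r in zip(char_units(pred), ref_units))
    return correct / total if total else 0.0


def word_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of whitespace-separated reference words reproduced exactly."""
    correct = total = 0
    for pred, ref in zip(predictions, references):
        ref_words = unicodedata.normalize("NFC", ref).split()
        pred_words = unicodedata.normalize("NFC", pred).split()
        total += len(ref_words)
        correct += sum(p == r for p, r in zip(pred_words, ref_words))
    return correct / total if total else 0.0
```

Because a single wrong tone mark makes the whole word count as incorrect, word accuracy is always the stricter of the two measures, which is consistent with the gap between the character and word figures reported above.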

Usage

Load and use the model through the unified olaverse SDK wrapper, which automatically downloads the weights and loads the PyTorch model in the background:

from olaverse.nlp.diacritizer import Diacritizer

diacritizer = Diacritizer(model="diacnet-yor")
text = "Ojo lo si oja lana"
print(diacritizer.restore(text))
# Output: "Ọjọ́ ló sí ọjà lànà"

Post-Processing

During evaluation, a candidate-constrained post-processing step maps each predicted character sequence to a valid, dictionary-based diacritization candidate. Restricting the output to known word forms significantly boosts word-level accuracy.
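The idea can be sketched as follows. This is not the SDK's implementation: the candidate dictionary layout (undiacritized key mapped to a list of valid forms), the mismatch-count scoring, and the names `constrain_word` and `constrain_text` are all assumptions made for illustration.

```python
import unicodedata


def char_units(text: str) -> list[str]:
    """Group each base character with its combining marks into one unit."""
    units: list[str] = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and units:
            units[-1] += ch
        else:
            units.append(ch)
    return units


def strip_diacritics(word: str) -> str:
    """Drop all combining marks, keeping only the base characters."""
    return "".join(unit[0] for unit in char_units(word))


def constrain_word(predicted: str, candidates: dict[str, list[str]]) -> str:
    """Snap a predicted word to the closest known candidate, if any exist.

    `candidates` is a hypothetical mapping from an undiacritized word to its
    valid diacritized forms. Lookup is case-sensitive in this sketch.
    """
    options = candidates.get(strip_diacritics(predicted), [])
    if not options:
        return predicted  # unknown word: keep the raw model prediction
    pred_units = char_units(predicted)
    # Candidates share the same base characters, so units align one-to-one;
    # choose the candidate with the fewest differing units (ties: first listed).
    return min(
        options,
        key=lambda cand: sum(p != c for p, c in zip(pred_units, char_units(cand))),
    )


def constrain_text(text: str, candidates: dict[str, list[str]]) -> str:
    """Apply the word-level constraint to every whitespace-separated token."""
    return " ".join(constrain_word(word, candidates) for word in text.split())
```

With a toy candidate list such as `{"oja": ["ọjà", "ọ̀já"]}`, a prediction of "ọja" (tone mark missing on the final vowel) differs from "ọjà" in one unit and from "ọ̀já" in two, so it is snapped to "ọjà". Words with no dictionary entry pass through unchanged.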

Usage

The diacnet-yor-x model is loaded through the same olaverse SDK wrapper, which automatically downloads the weights and loads the Transformer model in the background:

from olaverse.nlp.diacritizer import Diacritizer

diacritizer = Diacritizer(model="diacnet-yor-x")
text = "Ojo lo si oja lana"
print(diacritizer.restore(text))
# Output: "Ọjọ́ ló sí ọjà lànà"

Files

diacnet_yor.pt: The PyTorch model state dict and configurations.
diacnet_yor_vocab.json: The character vocabulary maps and word candidate lists used for constrained decoding.
