DiacNetYor

GitHub Repository

DiacNetYor is a character-level Bidirectional LSTM (BiLSTM) model that restores full tonal and dot-below diacritics in Yoruba (yo) text.

Model Details

  • Model Type: Character-level BiLSTM Sequence Labeler
  • Parameters: ~1.2M
  • File Size: 2.42 MB (diacnet_yor.pt)
  • Vocabulary Size: 119 characters
  • Supported Languages: Yoruba (yo)
  • Metrics:
    • Validation Word Accuracy: 81.81%
    • Test Character Accuracy: 93.35%
    • Test Word Accuracy: 78.32%
  • Dependencies: PyTorch
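
The character and word accuracy figures above can be read as position-wise and token-wise match rates against a gold diacritized reference. The following is a minimal sketch of how such metrics might be computed, not the evaluation script used for this model. The function names are hypothetical, and it assumes that a "character" is a base letter together with its combining marks and that spaces count as positions.

```python
import unicodedata


def clusters(text: str) -> list[str]:
    """Split text into base characters, each with its trailing combining marks."""
    out: list[str] = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and out:
            out[-1] += ch  # attach tone mark / dot-below to its base letter
        else:
            out.append(ch)
    return out


def char_accuracy(pred: str, gold: str) -> float:
    """Share of base-character positions whose diacritics match the reference."""
    p, g = clusters(pred), clusters(gold)
    if len(p) != len(g):
        raise ValueError("prediction and reference must have the same base length")
    return sum(a == b for a, b in zip(p, g)) / len(g) if g else 0.0


def word_accuracy(pred: str, gold: str) -> float:
    """Share of whitespace-separated words that match the reference exactly."""
    p = [unicodedata.normalize("NFC", w) for w in pred.split()]
    g = [unicodedata.normalize("NFC", w) for w in gold.split()]
    if len(p) != len(g):
        raise ValueError("prediction and reference must have the same word count")
    return sum(a == b for a, b in zip(p, g)) / len(g) if g else 0.0


gold = "Ọjọ́ ló sí ọjà lànà"
pred = "Ọjọ́ lo sí ọjà lànà"  # one word is missing its tone mark
print(char_accuracy(pred, gold), word_accuracy(pred, gold))
```

Word accuracy is the stricter of the two, because a single wrong mark invalidates the whole word. That is consistent with the gap between the reported character and word scores.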

Usage

Load and run the model through the unified olaverse SDK wrapper, which downloads the weights and loads the PyTorch model in the background:

from olaverse.nlp.diacritizer import Diacritizer

diacritizer = Diacritizer(model="diacnet-yor")
text = "Ojo lo si oja lana"
print(diacritizer.restore(text))
# Output: "Ọjọ́ ló sí ọjà lànà"
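
The input in the example above is the expected output with every tone mark and dot-below removed. The sketch below shows one way to derive such undiacritized input from diacritized text with the Python standard library, for example to build evaluation pairs. The helper name strip_diacritics is hypothetical and is not part of the olaverse SDK.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove all combining marks (tone marks and dot-below) from text."""
    decomposed = unicodedata.normalize("NFD", text)  # split letters from marks
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


print(strip_diacritics("Ọjọ́ ló sí ọjà lànà"))
# Output: "Ojo lo si oja lana"
```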

Post-Processing

During evaluation, a candidate-constrained post-processing step maps each predicted character sequence to a valid diacritization candidate drawn from a dictionary. Constraining the output to known word forms improves word-level accuracy.
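
To illustrate the idea, here is a minimal sketch of candidate-constrained post-processing. It is not the model's actual decoding code. The candidate table is toy data, the function names are hypothetical, and the selection rule (pick the candidate with the fewest character mismatches against the raw prediction) is an assumption, since the rule used by DiacNetYor is not documented here.

```python
import unicodedata


def clusters(word: str) -> list[str]:
    """Split a word into base letters, each with its combining marks."""
    out: list[str] = []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and out:
            out[-1] += ch
        else:
            out.append(ch)
    return out


def base_form(word: str) -> str:
    """Undiacritized lookup key for a word."""
    nfd = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in nfd if not unicodedata.combining(ch))


def constrain(pred_word: str, candidates: dict[str, list[str]]) -> str:
    """Snap a predicted word to the closest valid candidate, if any exist."""
    options = candidates.get(base_form(pred_word))
    if not options:
        return pred_word  # unknown word: keep the raw prediction
    pred = clusters(pred_word)

    def mismatches(cand: str) -> int:
        return sum(a != b for a, b in zip(pred, clusters(cand)))

    return min(options, key=mismatches)


# Toy candidate table (illustrative only), keyed by undiacritized form.
CANDIDATES = {"oja": ["ọjà", "ọ̀já"]}

print(constrain("ọja", CANDIDATES))   # snaps to a listed candidate
print(constrain("lana", CANDIDATES))  # not in the table, returned unchanged
```

A prediction that is already a listed candidate has zero mismatches and is returned as is, so the step only changes words the model got partly wrong.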

Files

  • diacnet_yor.pt: The PyTorch model state dict and configurations.
  • diacnet_yor_vocab.json: The character vocabulary maps and word candidate lists used for constrained decoding.
