# NairaBERT Tokenizer
Standard BERT tokenizers (such as `bert-base-uncased`) often struggle with Nigerian linguistic nuances: they tend to break local words into meaningless sub-tokens (e.g., "Owanbe" might become "Ow", "##an", "##be").

The NairaBERT Tokenizer was trained to recognize such words as single, high-frequency units, so that downstream models preserve the semantic meaning of Nigerian-centric text.
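You can see the fragmentation directly with a standard tokenizer. The exact sub-word splits depend on its vocabulary, so the comments below are illustrative:

```python
from transformers import AutoTokenizer

# A standard English-centric tokenizer fragments unfamiliar Nigerian words.
# The exact sub-tokens depend on its vocabulary; treat these as illustrative.
standard = AutoTokenizer.from_pretrained("bert-base-uncased")
print(standard.tokenize("owanbe"))  # split into sub-word pieces, e.g. ['ow', '##an', '##be']
```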
## Technical Details
- **Model Type:** WordPiece
- **Vocab Size:** 32,000
- **Base:** Follows the `bert-base-uncased` tokenization scheme (lowercased WordPiece), but the vocabulary was retrained on local corpora (a training sketch follows this list).
- **Languages:** English, Nigerian English, Pidgin English
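The exact training setup is not documented here, but a WordPiece vocabulary like this one can be retrained with the Hugging Face `tokenizers` library. A minimal sketch, assuming a hypothetical local corpus file `nigerian_corpus.txt` and otherwise default hyperparameters:

```python
from tokenizers import BertWordPieceTokenizer

# Sketch only: the real NairaBERT corpus and hyperparameters are not published here.
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=["nigerian_corpus.txt"],  # hypothetical local corpus file
    vocab_size=32_000,
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model(".", "nairabert")  # writes nairabert-vocab.txt
```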
## How to use
You can load this tokenizer directly using the transformers library:
```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("sijirama/nairabert-tokenizer")

text = "the lagos traffic today was very bad, no cap."
tokens = tokenizer.tokenize(text)
print(tokens)
```

The output should show better preservation of local slang and context (fewer "##" sub-word splits) than a standard English-centric tokenizer.
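To prepare text for a model rather than just inspect tokens, encode it as usual. A minimal sketch (assumes PyTorch is installed; any downstream model must share this vocabulary):

```python
# Encode to input IDs and an attention mask for model input.
encoded = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
print(encoded["input_ids"].shape)
print(tokenizer.decode(encoded["input_ids"][0]))  # round-trip back to text
```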
## Observations
This tokenizer was trained on a scraped corpus of approximately 15M tokens. It is a beta, proof-of-concept release, but early testing suggests it noticeably reduces sequence length for Nigerian text compared to standard Western-centric tokenizers.
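You can check the sequence-length claim on your own data; a quick sketch (the sample sentences are illustrative only):

```python
from transformers import AutoTokenizer

# Compare token counts between a standard tokenizer and NairaBERT on sample text.
baseline = AutoTokenizer.from_pretrained("bert-base-uncased")
naira = AutoTokenizer.from_pretrained("sijirama/nairabert-tokenizer")

samples = [
    "the lagos traffic today was very bad, no cap.",
    "we dey go owanbe this weekend, e go sweet die.",
]
for s in samples:
    print(len(baseline.tokenize(s)), len(naira.tokenize(s)), s)
```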