# NairaBERT Tokenizer
Standard BERT tokenizers (such as `bert-base-uncased`) often struggle with Nigerian linguistic nuances: they tend to break local words into meaningless sub-tokens (e.g., "Owanbe" might become "Ow", "##an", "##be").

The NairaBERT Tokenizer was trained to recognize such words as single, high-frequency units, so that downstream models preserve the semantic meaning of Nigerian-centric text.
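You can see the fragmentation directly with a standard tokenizer. The exact sub-word splits depend on its vocabulary, so the comments below are illustrative:

```python
from transformers import AutoTokenizer

# A standard English-centric tokenizer fragments unfamiliar Nigerian words.
# The exact sub-tokens depend on its vocabulary; treat these as illustrative.
standard = AutoTokenizer.from_pretrained("bert-base-uncased")
print(standard.tokenize("owanbe"))  # split into sub-word pieces, e.g. ['ow', '##an', '##be']
```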
## Technical Details
- **Model Type:** WordPiece
- **Vocab Size:** 32,000
- **Base:** Follows the `bert-base-uncased` tokenization scheme (lowercased WordPiece), but the vocabulary was retrained on local corpora (a training sketch follows this list).
- **Languages:** English, Nigerian English, Pidgin English
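The exact training setup is not documented here, but a WordPiece vocabulary like this one can be retrained with the Hugging Face `tokenizers` library. A minimal sketch, assuming a hypothetical local corpus file `nigerian_corpus.txt` and otherwise default hyperparameters:

```python
from tokenizers import BertWordPieceTokenizer

# Sketch only: the real NairaBERT corpus and hyperparameters are not published here.
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=["nigerian_corpus.txt"],  # hypothetical local corpus file
    vocab_size=32_000,
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model(".", "nairabert")  # writes nairabert-vocab.txt
```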
## How to use
You can load this tokenizer directly using the transformers library:
```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("sijirama/nairabert-tokenizer")

text = "the lagos traffic today was very bad, no cap."
tokens = tokenizer.tokenize(text)
print(tokens)
```

The output should show better preservation of local slang and context (fewer "##" sub-word splits) than a standard English-centric tokenizer.
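To prepare text for a model rather than just inspect tokens, encode it as usual. A minimal sketch (assumes PyTorch is installed; any downstream model must share this vocabulary):

```python
# Encode to input IDs and an attention mask for model input.
encoded = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
print(encoded["input_ids"].shape)
print(tokenizer.decode(encoded["input_ids"][0]))  # round-trip back to text
```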
## Observations
This tokenizer was trained on a scraped corpus of approximately 15M tokens. It is a beta, proof-of-concept release, but early testing suggests it noticeably reduces sequence length for Nigerian text compared to standard Western-centric tokenizers.
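You can check the sequence-length claim on your own data; a quick sketch (the sample sentences are illustrative only):

```python
from transformers import AutoTokenizer

# Compare token counts between a standard tokenizer and NairaBERT on sample text.
baseline = AutoTokenizer.from_pretrained("bert-base-uncased")
naira = AutoTokenizer.from_pretrained("sijirama/nairabert-tokenizer")

samples = [
    "the lagos traffic today was very bad, no cap.",
    "we dey go owanbe this weekend, e go sweet die.",
]
for s in samples:
    print(len(baseline.tokenize(s)), len(naira.tokenize(s)), s)
```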