Nija Pidgin Tokenizer

A custom tokenizer trained for Nigerian Pidgin English (Naija).
It is designed for NLP pipelines, chatbots, language modelling, and text processing where Pidgin-specific vocabulary and slang must be captured accurately.

Model Overview

Type: Byte-Pair Encoding (BPE) tokenizer
Special tokens: <|startoftext|>, <|endoftext|>, <|pad|>, <|user|>, <|assistant|>, <|system|>, <|unk|>, <|endofprompt|>
Vocabulary size: 55,000 tokens
Language: Nigerian Pidgin English (Naija)
Trained on: BBC Pidgin dataset
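
For readers unfamiliar with the BPE type listed above, the sketch below shows the general idea on a toy scale: repeatedly merge the most frequent adjacent pair of symbols. It is a simplified illustration, not this tokenizer's actual training code; the function names (learn_bpe_merges, apply_merges) and the four-word corpus are hypothetical.

```python
from collections import Counter


def _merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Each distinct word starts as a tuple of characters, weighted by its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word with the chosen pair merged.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(_merge_pair(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def apply_merges(word, merges):
    """Split `word` into subword pieces by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = _merge_pair(symbols, pair)
    return symbols


# Hypothetical toy corpus: "dey" is frequent, "wey" is rare.
toy_merges = learn_bpe_merges(["dey", "dey", "dey", "wey"], num_merges=2)
print(toy_merges)                       # [('e', 'y'), ('d', 'ey')]
print(apply_merges("dey", toy_merges))  # ['dey']
print(apply_merges("wey", toy_merges))  # ['w', 'ey']
```

The frequent word ends up as a single piece while the rarer one is split. A large vocabulary gives the same effect for common Pidgin words at full scale.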

Intended Use

Conversational AI in Naija Pidgin
Chatbots and virtual assistants
Fine-tuning LLMs for Nigerian Pidgin text
NLP pipelines requiring tokenisation of colloquial and code-mixed text
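
For the conversational use cases above, the role-style special tokens can be used to mark up a dialogue before tokenisation. The sketch below is one possible layout. The token strings come from the special tokens list, but their ordering, the helper name build_chat_prompt, and the use of <|endofprompt|> as a terminator are assumptions, not a documented chat template for this tokenizer.

```python
# Role tags taken from the special tokens list; the prompt layout itself is hypothetical.
ROLE_TOKENS = {
    "system": "<|system|>",
    "user": "<|user|>",
    "assistant": "<|assistant|>",
}


def build_chat_prompt(messages):
    """Join (role, text) pairs into one string marked up with special tokens."""
    parts = ["<|startoftext|>"]
    for role, text in messages:
        if role not in ROLE_TOKENS:
            raise ValueError(f"unknown role: {role!r}")
        parts.append(ROLE_TOKENS[role] + text)
    # Assumed convention: close the prompt so a model knows where generation starts.
    parts.append("<|endofprompt|>")
    return "".join(parts)


prompt = build_chat_prompt([
    ("system", "You be helpful assistant."),
    ("user", "How you dey my guy?"),
])
print(prompt)
```

The resulting string can be passed to the tokenizer's encode method like any other text. Whatever layout is chosen should match the one used when fine-tuning the downstream model.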

Usage Example

from transformers import PreTrainedTokenizerFast

# Load the tokenizer files from the Hugging Face Hub.
tokenizer = PreTrainedTokenizerFast.from_pretrained("Arkintea/Nija_Pidgin_Tokenizer")

text = "How you dey my guy?"
token_ids = tokenizer.encode(text)     # list of integer token IDs
decoded = tokenizer.decode(token_ids)  # map the IDs back to a string

print("Token IDs:", token_ids)
print("Decoded:", decoded)
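
To see how a sentence is actually split, it can help to look at the subword pieces as well as the integer IDs. The helper below is a small sketch for that. The name inspect_tokenization is hypothetical, and it relies only on the encode, decode, and convert_ids_to_tokens methods that PreTrainedTokenizerFast provides.

```python
def inspect_tokenization(tokenizer, text):
    """Return the IDs, the subword pieces, and whether decoding recovers the input."""
    ids = tokenizer.encode(text)
    pieces = tokenizer.convert_ids_to_tokens(ids)
    # Skip special tokens so any automatically added markers do not affect the comparison.
    decoded = tokenizer.decode(ids, skip_special_tokens=True)
    return {
        "ids": ids,
        "pieces": pieces,
        "round_trip_ok": decoded == text,
    }
```

Calling inspect_tokenization(tokenizer, "How you dey my guy?") with the tokenizer loaded above returns a dictionary to print or log. A false round_trip_ok is not necessarily a bug, since some tokenizers normalise whitespace or casing, but it is worth checking before relying on decoded text.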