# Nija Pidgin Tokenizer

A custom tokenizer trained for **Nigerian Pidgin English** (Naija). It is designed for NLP pipelines, chatbots, language modelling, and text processing where Pidgin-specific vocabulary and slang need to be captured accurately.

---

## Model Overview

- **Type:** Byte-Pair Encoding (BPE) tokenizer
- **Special tokens:** `<|startoftext|>`, `<|endoftext|>`, `<|pad|>`, `<|user|>`, `<|assistant|>`, `<|system|>`, `<|unk|>`, `<|endofprompt|>`
- **Vocabulary size:** 55,000 tokens
- **Language:** Nija (Naija) Pidgin English
- **Trained on:** BBC Pidgin dataset

---

## Intended Use

- Conversational AI in Nija Pidgin
- Chatbots and virtual assistants
- Fine-tuning LLMs for Nigerian Pidgin text
- NLP pipelines requiring tokenisation of colloquial and code-mixed text

---

## Usage Example

```python
from transformers import PreTrainedTokenizerFast

# Load the tokenizer from the Hugging Face Hub
tokenizer = PreTrainedTokenizerFast.from_pretrained("Arkintea/Nija_Pidgin_Tokenizer")

text = "How you dey my guy?"

# Encode the text to token IDs, then decode the IDs back to a string
tokens = tokenizer.encode(text)
decoded = tokenizer.decode(tokens)

print("Tokens:", tokens)
print("Decoded:", decoded)
```
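The role-style special tokens listed in the Model Overview suggest a chat-oriented layout. The snippet below is a minimal sketch of how a conversation might be framed with them before tokenisation. The ordering of tokens and the `build_prompt` helper are assumptions made for illustration; this card does not document an official chat template.

```python
# Special tokens as listed in the Model Overview.
START = "<|startoftext|>"
END_OF_PROMPT = "<|endofprompt|>"
ROLE_TOKENS = {
    "system": "<|system|>",
    "user": "<|user|>",
    "assistant": "<|assistant|>",
}


def build_prompt(messages):
    """Join (role, text) pairs into one string framed by special tokens.

    Hypothetical layout for illustration only, not a documented template.
    """
    parts = [START]
    for role, text in messages:
        # Prefix each turn with the token for its role.
        parts.append(ROLE_TOKENS[role] + text)
    # Mark the end of the prompt so generation would start after it.
    parts.append(END_OF_PROMPT)
    return "".join(parts)


prompt = build_prompt([
    ("system", "You be helpful assistant."),
    ("user", "How you dey my guy?"),
])
print(prompt)
```

The resulting string can be passed to `tokenizer.encode(prompt)`. Whether each special token maps to a single ID depends on how it was registered in the tokenizer, so it is worth checking the encoded output before relying on this layout.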