# Nija Pidgin Tokenizer

A custom tokenizer trained for **Nigerian Pidgin English** (Naija).  
It is designed for NLP pipelines, chatbots, language modelling, and text processing where Pidgin-specific vocabulary and slang must be captured accurately.

---

## Model Overview

- **Type:** Byte-Pair Encoding (BPE) tokenizer  
- **Special tokens:** `<|startoftext|>`, `<|endoftext|>`, `<|pad|>`, `<|user|>`, `<|assistant|>`, `<|system|>`, `<|unk|>`, `<|endofprompt|>`  
- **Vocabulary size:** 55,000 tokens  
- **Language:** Nigerian Pidgin English (Naija)  
- **Trained on:** BBC Pidgin dataset  
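
As a rough illustration of how the special tokens listed above might frame a conversation, the sketch below assembles a prompt string in plain Python. The layout (start token, role-tagged turns, then `<|endofprompt|>`) and the helper name `build_prompt` are assumptions made for this example; this card does not document an official chat template, so adapt the order to whatever your model was trained on.

```python
# Special tokens copied from the list above.
START = "<|startoftext|>"
END_OF_PROMPT = "<|endofprompt|>"
ROLE_TOKENS = {
    "system": "<|system|>",
    "user": "<|user|>",
    "assistant": "<|assistant|>",
}


def build_prompt(messages):
    """Join (role, text) pairs into a single prompt string.

    Hypothetical layout, for illustration only:
    START, then each turn as <role token> + text, then END_OF_PROMPT.
    """
    parts = [START]
    for role, text in messages:
        if role not in ROLE_TOKENS:
            raise ValueError(f"unknown role: {role!r}")
        parts.append(ROLE_TOKENS[role] + text.strip())
    parts.append(END_OF_PROMPT)
    return "".join(parts)


prompt = build_prompt([
    ("system", "You be helpful assistant."),
    ("user", "How you dey my guy?"),
])
print(prompt)
```

The resulting string can be passed to the tokenizer like any other text. Provided the special tokens are registered in the tokenizer, each should map to a single ID rather than being split into sub-word pieces.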

---

## Intended Use

- Conversational AI in Naija Pidgin  
- Chatbots and virtual assistants  
- Fine-tuning LLMs for Nigerian Pidgin text  
- NLP pipelines requiring tokenisation of colloquial and code-mixed text  

---

## Usage Example

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("Arkintea/Nija_Pidgin_Tokenizer")

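text = "How you dey my guy?"

# encode() returns token IDs (integers), not token strings
token_ids = tokenizer.encode(text)
# Map the IDs back to their sub-word pieces for inspection
pieces = tokenizer.convert_ids_to_tokens(token_ids)
# decode() turns the IDs back into text
decoded = tokenizer.decode(token_ids)

print("Token IDs:", token_ids)
print("Pieces:", pieces)
print("Decoded:", decoded)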