| # Nija Pidgin Tokenizer |
|
|
| A custom tokenizer trained for **Nigerian Pidgin English** (Naija). |
| It is designed for NLP pipelines, chatbots, language modelling, and text processing where Pidgin-specific vocabulary and slang must be tokenised accurately. |
|
|
| --- |
|
|
| ## Model Overview |
|
|
| - **Type:** Byte-Pair Encoding (BPE) tokenizer |
| - **Special tokens:** `<|startoftext|>`, `<|endoftext|>`, `<|pad|>`, `<|user|>`, `<|assistant|>`, `<|system|>`, `<|unk|>`, `<|endofprompt|>` |
| - **Vocabulary size:** 55,000 tokens |
| - **Language:** Nigerian Pidgin English (Naija) |
| - **Trained on:** BBC Pidgin dataset |
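
To make the BPE entry above more concrete, here is a minimal pure-Python sketch of how byte-pair merges are learned. It is illustrative only: the toy corpus, the `learn_bpe_merges` helper, and the merge count are assumptions made for this example, not the pipeline that was used to train this tokenizer.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words (toy sketch)."""
    # Represent each distinct word as a tuple of symbols, weighted by frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Merge the most frequent pair into one new symbol.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


# Hypothetical toy corpus, not drawn from the training data.
corpus = ["dey", "dey", "dey", "de", "de", "wey"]
merges = learn_bpe_merges(corpus, num_merges=2)
print(merges)  # [('d', 'e'), ('de', 'y')]
```

A production tokenizer typically works on bytes or pre-tokenised text and learns tens of thousands of merges, but the core loop is the same: count adjacent pairs, merge the most frequent one, repeat.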
|
|
| --- |
|
|
| ## Intended Use |
|
|
| - Conversational AI in Nigerian Pidgin |
| - Chatbots and virtual assistants |
| - Fine-tuning LLMs for Nigerian Pidgin text |
| - NLP pipelines requiring tokenisation of colloquial and code-mixed text |
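
For the conversational use cases above, the role tokens listed in the Model Overview could be used to mark up a dialogue before tokenisation. The sketch below is only one possible layout: this card does not define a chat template, so the ordering of tokens and the `build_chat_prompt` helper are assumptions made for illustration.

```python
# Special tokens as listed in the Model Overview.
BOS = "<|startoftext|>"
ROLE_TOKENS = {
    "system": "<|system|>",
    "user": "<|user|>",
    "assistant": "<|assistant|>",
}


def build_chat_prompt(messages):
    """Join (role, text) pairs into one prompt string.

    Hypothetical layout: the card does not specify a chat template.
    """
    parts = [BOS]
    for role, text in messages:
        parts.append(ROLE_TOKENS[role] + text)
    # End with the assistant marker so a model would continue from there.
    parts.append(ROLE_TOKENS["assistant"])
    return "".join(parts)


prompt = build_chat_prompt([("user", "How you dey my guy?")])
print(prompt)  # <|startoftext|><|user|>How you dey my guy?<|assistant|>
```

If a model fine-tuned with this tokenizer expects a different layout, follow that model's template instead.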
|
|
| --- |
|
|
| ## Usage Example |
|
|
| ```python |
| from transformers import PreTrainedTokenizerFast |
| |
| tokenizer = PreTrainedTokenizerFast.from_pretrained("Arkintea/Nija_Pidgin_Tokenizer") |
| |
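| text = "How you dey my guy?" |
| |
| # encode() returns token IDs (integers), not token strings |
| token_ids = tokenizer.encode(text) |
| # Map the IDs back to their subword strings for inspection |
| tokens = tokenizer.convert_ids_to_tokens(token_ids) |
| decoded = tokenizer.decode(token_ids) |
| |
| print("Token IDs:", token_ids) |
| print("Tokens:", tokens) |
| print("Decoded:", decoded) |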
| |