# Nija Pidgin Tokenizer
A custom tokenizer trained for **Nigerian Pidgin English** (Naija).
It is designed for NLP pipelines, chatbots, language modelling, and text processing where Pidgin-specific vocabulary and slang must be captured accurately.
---
## Model Overview
- **Type:** Byte-Pair Encoding (BPE) tokenizer
- **Special tokens:** `<|startoftext|>`, `<|endoftext|>`, `<|pad|>`, `<|user|>`, `<|assistant|>`, `<|system|>`, `<|unk|>`, `<|endofprompt|>`
- **Vocabulary size:** 55,000 tokens
- **Language:** Nija (Naija) Pidgin English
- **Trained on:** BBC Pidgin dataset
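
For readers unfamiliar with BPE, the following is a minimal, character-level sketch of how merge rules are learned and applied. It is a toy illustration only, not this tokenizer's training code: the function names (`learn_bpe_merges`, `merge_pair`, `bpe_encode`), the tiny corpus, and the tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy sketch)."""
    # Each distinct word starts as a tuple of characters, weighted by frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken lexicographically (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def bpe_encode(word, merges):
    """Apply the learned merges, in order, to a single word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical mini-corpus, for illustration only.
corpus = "how you dey my guy i dey fine you dey try".split()
merges = learn_bpe_merges(corpus, num_merges=2)
print(merges)
print(bpe_encode("dey", merges))
```

Because "dey" is the most frequent word in this toy corpus, its characters are merged first, so it ends up as a single token. A production tokenizer applies the same idea at a much larger scale to reach its full vocabulary.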
---
## Intended Use
- Conversational AI in Nija Pidgin
- Chatbots and virtual assistants
- Fine-tuning LLMs for Nigerian Pidgin text
- NLP pipelines requiring tokenisation of colloquial and code-mixed text
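
For conversational use, the role tokens listed under Model Overview can delimit the turns of a dialogue. This README does not specify a chat template, so the layout below is purely an assumption: `format_chat` is a hypothetical helper, and the ordering of `<|startoftext|>`, the role tokens, and `<|endofprompt|>` should be adjusted to match whatever format the downstream model was actually trained on.

```python
# Special tokens as listed in the Model Overview section.
BOS = "<|startoftext|>"
END_OF_PROMPT = "<|endofprompt|>"
ROLE_TOKENS = {
    "system": "<|system|>",
    "user": "<|user|>",
    "assistant": "<|assistant|>",
}


def format_chat(messages):
    """Hypothetical helper: join role-tagged turns into one prompt string."""
    parts = [BOS]
    for message in messages:
        role = message["role"]
        if role not in ROLE_TOKENS:
            raise ValueError(f"unknown role: {role!r}")
        parts.append(ROLE_TOKENS[role] + message["content"])
    # Assumed convention: close the prompt, then cue the assistant's reply.
    parts.append(END_OF_PROMPT + ROLE_TOKENS["assistant"])
    return "".join(parts)


prompt = format_chat([
    {"role": "system", "content": "You be helpful assistant."},
    {"role": "user", "content": "How you dey my guy?"},
])
print(prompt)
```

The resulting string can then be passed to the tokenizer's `encode` method like any other text.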
---
## Usage Example
```python
from transformers import PreTrainedTokenizerFast

# Load the tokenizer from the Hugging Face Hub
tokenizer = PreTrainedTokenizerFast.from_pretrained("Arkintea/Nija_Pidgin_Tokenizer")

text = "How you dey my guy?"
tokens = tokenizer.encode(text)     # list of integer token IDs
decoded = tokenizer.decode(tokens)  # token IDs back to a string

print("Token IDs:", tokens)
print("Decoded:", decoded)
```