Nija Pidgin Tokenizer

A custom tokenizer trained for Nigerian Pidgin English (Naija).
It is designed for NLP pipelines, chatbots, language modelling, and text processing where Pidgin-specific vocabulary and slang need to be captured accurately.


Model Overview

  • Type: Byte-Pair Encoding (BPE) tokenizer
  • Special tokens: <|startoftext|>, <|endoftext|>, <|pad|>, <|user|>, <|assistant|>, <|system|>, <|unk|>, <|endofprompt|>
  • Vocabulary size: 55,000 tokens
  • Language: Nija (Naija) Pidgin English
  • Trained on: BBC Pidgin dataset
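For readers unfamiliar with Byte-Pair Encoding, the idea can be sketched in a few lines of plain Python: start from single characters and repeatedly merge the most frequent adjacent pair into a new symbol. This is a toy illustration only, not the procedure used to train this tokenizer; the function names and the tiny corpus are hypothetical.

```python
from collections import Counter

def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, weighted by word count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    # On ties, the pair seen first wins (dicts keep insertion order).
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_merges(corpus, num_merges):
    """Learn up to `num_merges` BPE merge rules from a whitespace-split corpus."""
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:  # nothing left to merge
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges

# Toy corpus (hypothetical): "dey" appears three times, "wey" once.
merges = learn_merges("dey dey dey wey", 2)
print(merges)
```

A production tokenizer applies the same idea at much larger scale, which is how frequent Pidgin words can end up as single tokens in the vocabulary.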

Intended Use

  • Conversational AI in Nija Pidgin
  • Chatbots and virtual assistants
  • Fine-tuning LLMs for Nigerian Pidgin text
  • NLP pipelines requiring tokenisation of colloquial and code-mixed text
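For the conversational use cases above, the role tokens listed in the model overview can be used to mark up a dialogue before encoding. The sketch below is an assumption: this README does not define a chat template, so the ordering and the `build_prompt` helper are hypothetical and should be adapted to whatever format a downstream model was fine-tuned on.

```python
# Special tokens copied from the model overview.
START = "<|startoftext|>"
END_OF_PROMPT = "<|endofprompt|>"
ROLE_TOKENS = {
    "system": "<|system|>",
    "user": "<|user|>",
    "assistant": "<|assistant|>",
}

def build_prompt(messages):
    """Join (role, content) pairs into one string.

    Assumed layout (not specified by the README):
    start token, role-tagged turns, then the end-of-prompt marker.
    """
    parts = [START]
    for role, content in messages:
        if role not in ROLE_TOKENS:
            raise ValueError(f"unknown role: {role!r}")
        parts.append(ROLE_TOKENS[role] + content)
    parts.append(END_OF_PROMPT)
    return "".join(parts)

prompt = build_prompt([("user", "How you dey my guy?")])
print(prompt)
```

The resulting string can then be passed to the tokenizer's `encode` method like any other text.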

Usage Example

from transformers import PreTrainedTokenizerFast

# Load the tokenizer files from the Hugging Face Hub
tokenizer = PreTrainedTokenizerFast.from_pretrained("Arkintea/Nija_Pidgin_Tokenizer")

text = "How you dey my guy?"
tokens = tokenizer.encode(text)     # list of integer token IDs
decoded = tokenizer.decode(tokens)  # token IDs converted back to a string

print("Tokens:", tokens)
print("Decoded:", decoded)