Arkintea
/

Nija_Pidgin_Tokenizer

Model card Files Files and versions

Nija_Pidgin_Tokenizer / README.md

Arkintea's picture

Update README.md

efaa598 verified 7 months ago

|

History Blame Contribute Delete

1.13 kB

	# Nija Pidgin Tokenizer

	A custom tokenizer trained for Nigerian Pidgin English (Naija).
	Designed for NLP pipelines, chatbots, language modelling, and text processing where Pidgin-specific vocabulary and slang need to be accurately captured.

	---

	## Model Overview

	- Type: Byte-Pair Encoding (BPE) tokenizer
	- Special tokens: `<\|startoftext\|>`, `<\|endoftext\|>`, `<\|pad\|>`, `<\|user\|>`, `<\|assistant\|>`, `<\|system\|>`, `<\|unk\|>`, `<\|endofprompt\|>`
	- Vocabulary size: 55,000 tokens
	- Language: Nija (Naija) Pidgin English
	- Trained on: BBC Pidgin dataset

	---

	## Intended Use

	- Conversational AI in Nija Pidgin
	- Chatbots and virtual assistants
	- Fine-tuning LLMs for Nigerian Pidgin text
	- NLP pipelines requiring tokenisation of colloquial and code-mixed text

	---

	## Usage Example

	```python
	from transformers import PreTrainedTokenizerFast

	tokenizer = PreTrainedTokenizerFast.from_pretrained("Arkintea/Nija_Pidgin_Tokenizer")

	text = "How you dey my guy?"
	tokens = tokenizer.encode(text)
	decoded = tokenizer.decode(tokens)

	print("Tokens:", tokens)
	print("Decoded:", decoded)