metadata
library_name: transformers
license: cc0-1.0
datasets:
- agentlans/multilingual-text
tags:
- tokenization
new_version: qikp/pika-2
pika
pika is a simple tokenizer released under CC0 1.0, a public-domain-equivalent license.
Special Tokens
- Unknown token: [UNK]
- End-of-Sequence token: [EOS]
- Padding token: [PAD]
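To check the tokens listed above, a minimal sketch along these lines may help. It assumes the tokenizer loads through `AutoTokenizer` from `transformers` and that the repository id is `qikp/pika`; that id is a guess based on the `new_version` field and is not stated in this card. The helper names `load_pika` and `special_tokens` are hypothetical.

```python
def load_pika(repo_id="qikp/pika"):
    """Load the tokenizer from the Hub.

    Assumption: "qikp/pika" is a guessed repository id, not confirmed by this card.
    """
    # Imported lazily so the helpers can be defined without transformers installed.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(repo_id)


def special_tokens(tokenizer):
    """Return the three special tokens named in this card as a name -> token dict."""
    return {
        "unk_token": tokenizer.unk_token,
        "eos_token": tokenizer.eos_token,
        "pad_token": tokenizer.pad_token,
    }


# Usage (requires network access and the transformers package):
#   tokenizer = load_pika()
#   print(special_tokens(tokenizer))
```

If the card is accurate, the three values should be [UNK], [EOS], and [PAD].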
Training
pika was trained on the first 1,000 rows of each language in the agentlans/multilingual-text dataset.
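The per-language cap described above could be sketched as follows. This is not the author's training script: it assumes each row is a dict carrying a language label under a key named `language`, which is a hypothetical column name, and the dataset's actual layout may differ (for example, one subset per language).

```python
from collections import defaultdict


def first_rows_per_language(rows, limit=1000, language_key="language"):
    """Yield at most `limit` rows for each language, in their original order.

    Assumption: `language_key` names the field holding the language label.
    """
    counts = defaultdict(int)
    for row in rows:
        lang = row[language_key]
        if counts[lang] < limit:
            counts[lang] += 1
            yield row
```

The generator streams its input, so it can be fed an iterable of rows without loading the whole dataset into memory first.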
Limitations
Because it was trained on a small corpus, pika may split words into smaller pieces. Some uncommon special tokens are also absent; add them manually if you need them.
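One way to add missing special tokens is sketched below. It assumes a `transformers` tokenizer object, whose `add_special_tokens` method accepts a dict keyed by names such as `bos_token` or `cls_token`. The helper `add_missing_special_tokens` and the token strings in the usage comment are hypothetical choices, not part of pika.

```python
def add_missing_special_tokens(tokenizer, wanted):
    """Add only the special tokens the tokenizer does not already define.

    `wanted` maps attribute names (e.g. "bos_token") to token strings.
    Returns the dict of tokens that were actually added.
    """
    missing = {
        name: token
        for name, token in wanted.items()
        if getattr(tokenizer, name, None) is None
    }
    if missing:
        # add_special_tokens registers the tokens and extends the vocabulary.
        tokenizer.add_special_tokens(missing)
    return missing


# Usage with a loaded tokenizer (token strings here are illustrative):
#   added = add_missing_special_tokens(
#       tokenizer, {"bos_token": "[BOS]", "cls_token": "[CLS]"}
#   )
```

If the tokenizer is paired with a model, the model's embedding matrix would also need resizing after new tokens are added.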