metadata
library_name: transformers
license: cc0-1.0
datasets:
  - agentlans/multilingual-text
tags:
  - tokenization
new_version: qikp/pika-2

pika

pika is a simple tokenizer released under CC0 1.0, which makes it effectively public domain.

Special Tokens

  • Unknown token: [UNK]
  • End-of-Sequence token: [EOS]
  • Padding token: [PAD]

Training

pika was trained on the first 1,000 rows of each language in the agentlans/multilingual-text dataset.
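The row selection described above can be illustrated with a small sketch. This is not the script used to train pika. The `language` and `text` field names are assumptions made for illustration, and the real dataset may organize its languages differently (for example as separate configurations), so adapt the key accordingly.

```python
from collections import defaultdict


def first_rows_per_language(rows, limit=1000, language_key="language"):
    """Return the first `limit` rows seen for each language, keeping input order.

    `language_key` is a hypothetical column name; check the dataset's schema.
    """
    counts = defaultdict(int)
    selected = []
    for row in rows:
        lang = row[language_key]
        if counts[lang] < limit:
            counts[lang] += 1
            selected.append(row)
    return selected


# Toy rows standing in for the real dataset.
rows = [
    {"language": "en", "text": "hello"},
    {"language": "fr", "text": "bonjour"},
    {"language": "en", "text": "world"},
    {"language": "en", "text": "again"},
]
subset = first_rows_per_language(rows, limit=2)
print([r["text"] for r in subset])  # ['hello', 'bonjour', 'world']
```

Because the function only iterates once and never looks ahead, it should also work on a streamed dataset, where rows arrive one at a time.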

Limitations

Because it was trained on a small corpus, pika may split words into smaller pieces than a tokenizer trained on more data would. Some less common special tokens are also not included; add them manually if you need them.
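One possible way to add missing special tokens is sketched below, assuming pika loads through `AutoTokenizer` from the `qikp/pika` repository. The helper names and the tokens `[CLS]`, `[SEP]`, and `[MASK]` are illustrative choices, not something pika itself prescribes.

```python
def load_pika():
    """Download pika from the Hugging Face Hub (needs network access).

    Assumes the repository id is "qikp/pika" and that it loads via AutoTokenizer.
    """
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained("qikp/pika")


def add_special_tokens_if_missing(tokenizer, tokens):
    """Add every token not yet in the vocabulary as a special token.

    Returns the list of tokens that were actually added.
    """
    vocab = tokenizer.get_vocab()
    missing = [tok for tok in tokens if tok not in vocab]
    if missing:
        # special_tokens=True marks the new tokens as special, so they are
        # kept whole instead of being split into smaller pieces.
        tokenizer.add_tokens(missing, special_tokens=True)
    return missing


# Example usage (commented out because it downloads files):
# tokenizer = load_pika()
# added = add_special_tokens_if_missing(tokenizer, ["[CLS]", "[SEP]", "[MASK]"])
# print(added, len(tokenizer))
```

If the tokenizer is paired with a model, the model's embedding matrix typically has to be resized to the new vocabulary size after tokens are added.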