metadata
library_name: tokenizers
language:
- tr
tags:
- turkish
- tokenizer
- byte-level-bpe
- rslm
RSLM Tokenizer 65K
CPU-safe Byte-Level BPE tokenizer for RSLM.
Training data
Dataset: turkish-nlp-suite/BellaTurca
Subsets:
- AkademikDerlem
- OzenliDerlem
- temiz-OSCAR
- temiz-mC4
Column: text
Target estimated tokens: 700,000,000 total, approximately 175,000,000 per subset.
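The per-subset figure follows from dividing the total target evenly across four subsets. A minimal sketch of that arithmetic, plus one way a sampler could stop once a subset's budget is reached, is shown here. The helper names (`per_subset_budget`, `estimate_tokens`, `take_until_budget`) are hypothetical, and the whitespace-based estimator is an assumption: the card does not say how tokens were estimated.

```python
from typing import Callable, Dict, Iterable, Iterator, List

TOTAL_TARGET_TOKENS = 700_000_000
# Subset names as read from the card (assumed to be four separate subsets).
SUBSETS: List[str] = ["AkademikDerlem", "OzenliDerlem", "temiz-OSCAR", "temiz-mC4"]


def per_subset_budget(total: int, subsets: List[str]) -> Dict[str, int]:
    """Split the total token target evenly across the given subsets."""
    share = total // len(subsets)
    return {name: share for name in subsets}


def estimate_tokens(text: str) -> int:
    """Hypothetical estimator: whitespace word count (not the card's method)."""
    return len(text.split())


def take_until_budget(
    texts: Iterable[str],
    budget: int,
    estimator: Callable[[str], int] = estimate_tokens,
) -> Iterator[str]:
    """Yield texts until the running token estimate reaches the budget."""
    used = 0
    for text in texts:
        if used >= budget:
            break
        used += estimator(text)
        yield text


budgets = per_subset_budget(TOTAL_TARGET_TOKENS, SUBSETS)
```

In this sketch the last yielded text may push the running estimate slightly past the budget, which matches the card's wording of an approximate per-subset target.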
Vocab
- Requested vocab size: 65,536
- Actual vocab size: 65,536
- BPE min frequency: 3
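The settings above map onto the `tokenizers` library's BPE trainer roughly as follows. This is a sketch under assumptions, not the script used to build this tokenizer: the function name `train_tokenizer`, the `add_prefix_space=False` choice, and seeding the trainer with the full byte-level alphabet are illustrative choices that the card does not confirm. The special tokens are the ones listed under Special tokens.

```python
from typing import Iterable

from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Special tokens in the order listed on the card.
SPECIAL_TOKENS = [
    "<|pad|>", "<|bos|>", "<|eos|>", "<|unk|>",
    "<|system|>", "<|user|>", "<|assistant|>", "<|answer|>", "<|end|>",
    "<think>", "</think>",
]


def train_tokenizer(
    texts: Iterable[str],
    vocab_size: int = 65_536,
    min_frequency: int = 3,
) -> Tokenizer:
    """Train a byte-level BPE tokenizer (hypothetical helper, assumed settings)."""
    tokenizer = Tokenizer(models.BPE())
    # Byte-level pre-tokenization maps every input byte to a visible symbol,
    # so any UTF-8 text can be encoded without an out-of-vocabulary fallback.
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
    tokenizer.decoder = decoders.ByteLevel()

    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,          # requested vocab size from the card
        min_frequency=min_frequency,    # BPE min frequency from the card
        special_tokens=SPECIAL_TOKENS,  # added to the vocabulary first, in order
        initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
        show_progress=False,
    )
    tokenizer.train_from_iterator(texts, trainer=trainer)
    return tokenizer
```

To try it, pass any iterable of strings, for example the `text` column of a dataset subset, and call `tokenizer.save("tokenizer.json")` on the result. On a small corpus the actual vocab size can come out below the requested size, because there may not be enough pairs meeting the minimum frequency.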
Special tokens
- <|pad|>
- <|bos|>
- <|eos|>
- <|unk|>
- <|system|>
- <|user|>
- <|assistant|>
- <|answer|>
- <|end|>
- <think>
- </think>