---
library_name: tokenizers
language:
- tr
tags:
- turkish
- tokenizer
- byte-level-bpe
- rslm
---
# RSLM Tokenizer 65K
CPU-safe Byte-Level BPE tokenizer for RSLM.
## Training data
- Dataset: `turkish-nlp-suite/BellaTurca`
- Subsets:
  - `AkademikDerlem`
  - `OzenliDerlem`
  - `temiz-OSCAR`
  - `temiz-mC4`
- Text column: `text`
- Target size: an estimated `700,000,000` tokens in total, approximately `175,000,000` per subset
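The per-subset figure is the total target split evenly across the four subsets. The sketch below shows that arithmetic and one possible way to stop reading a subset once its budget is reached. This card does not say how tokens were estimated, so `estimate_tokens` (a whitespace word count) and `take_until_budget` are hypothetical helpers for illustration, not the method actually used.

```python
from typing import Dict, Iterable, Iterator, List

SUBSETS = ["AkademikDerlem", "OzenliDerlem", "temiz-OSCAR", "temiz-mC4"]
TOTAL_TARGET_TOKENS = 700_000_000


def per_subset_budget(total: int, subsets: List[str]) -> Dict[str, int]:
    """Split the total token target evenly across the subsets."""
    share = total // len(subsets)
    return {name: share for name in subsets}


def estimate_tokens(text: str) -> int:
    """Assumption: approximate the token count by whitespace-separated words."""
    return len(text.split())


def take_until_budget(texts: Iterable[str], budget: int) -> Iterator[str]:
    """Yield texts until the running estimate reaches the budget."""
    used = 0
    for text in texts:
        if used >= budget:
            break
        used += estimate_tokens(text)
        yield text


budgets = per_subset_budget(TOTAL_TARGET_TOKENS, SUBSETS)
```

Because the check happens before each document is yielded, the last document may push the running estimate slightly past the budget, which fits an approximate target.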
## Vocab
- Requested vocab size: `65,536`
- Actual vocab size: `65,536`
- BPE min frequency: `3`
## Special tokens
- `<|pad|>`
- `<|bos|>`
- `<|eos|>`
- `<|unk|>`
- `<|system|>`
- `<|user|>`
- `<|assistant|>`
- `<|answer|>`
- `<|end|>`
- `<think>`
- `</think>`
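
The vocab settings and special tokens above could be passed to the `tokenizers` library roughly as follows. This is a minimal sketch, assuming the `ByteLevelBPETokenizer` convenience class and an in-memory iterator of strings. The actual training script is not part of this card, and `train_tokenizer` is a hypothetical helper name.

```python
from typing import Iterable

from tokenizers import ByteLevelBPETokenizer

# Special tokens in the order listed in this card.
SPECIAL_TOKENS = [
    "<|pad|>", "<|bos|>", "<|eos|>", "<|unk|>",
    "<|system|>", "<|user|>", "<|assistant|>",
    "<|answer|>", "<|end|>", "<think>", "</think>",
]


def train_tokenizer(
    texts: Iterable[str],
    vocab_size: int = 65_536,
    min_frequency: int = 3,
) -> ByteLevelBPETokenizer:
    """Train a byte-level BPE tokenizer on an iterable of strings."""
    tokenizer = ByteLevelBPETokenizer()
    tokenizer.train_from_iterator(
        texts,
        vocab_size=vocab_size,        # requested vocab size
        min_frequency=min_frequency,  # pairs seen fewer times are not merged
        special_tokens=SPECIAL_TOKENS,
        show_progress=False,
    )
    return tokenizer
```

After training, `tokenizer.get_vocab_size()` gives the actual vocab size to compare against the requested one. On a small corpus it can come out lower than requested, because `min_frequency` stops merges before the target is reached.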