---
library_name: tokenizers
language:
- tr
tags:
- turkish
- tokenizer
- byte-level-bpe
- rslm
---
# RSLM Tokenizer 65K

CPU-safe Byte-Level BPE tokenizer for RSLM.

## Training data

Dataset: `turkish-nlp-suite/BellaTurca`

Subsets:

- `AkademikDerlem`
- `OzenliDerlem`
- `temiz-OSCAR`
- `temiz-mC4`

Column: `text`

Target estimated tokens: `700,000,000` total, approximately `175,000,000` per subset.
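The per-subset figure follows from splitting the total budget evenly across the four subsets. The sketch below illustrates that arithmetic and one possible way to cap a text stream at an estimated token budget. The helper names and the characters-per-token heuristic are assumptions made for illustration; the card does not say how tokens were estimated.

```python
from typing import Callable, Iterable, Iterator

# Subset names and total budget as listed in this card.
SUBSETS = ["AkademikDerlem", "OzenliDerlem", "temiz-OSCAR", "temiz-mC4"]
TOTAL_TOKEN_BUDGET = 700_000_000


def per_subset_budget(total: int, n_subsets: int) -> int:
    """Split the total token budget evenly across subsets."""
    return total // n_subsets


def estimate_tokens(text: str) -> int:
    """Hypothetical estimator: roughly 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def take_until_budget(
    texts: Iterable[str],
    budget: int,
    estimate: Callable[[str], int] = estimate_tokens,
) -> Iterator[str]:
    """Yield texts until the running token estimate reaches the budget."""
    used = 0
    for text in texts:
        if used >= budget:
            break
        used += estimate(text)
        yield text


if __name__ == "__main__":
    budget = per_subset_budget(TOTAL_TOKEN_BUDGET, len(SUBSETS))
    for name in SUBSETS:
        print(f"{name}: {budget:,} estimated tokens")
```

Because `take_until_budget` accepts any iterable of strings, it could wrap a streamed `text` column without loading a whole subset into memory.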
## Vocab

- Requested vocab size: `65,536`
- Actual vocab size: `65,536`
- BPE min frequency: `3`
## Special tokens

- `<|pad|>`
- `<|bos|>`
- `<|eos|>`
- `<|unk|>`
- `<|system|>`
- `<|user|>`
- `<|assistant|>`
- `<|answer|>`
- `<|end|>`
- `<think>`
- `</think>`
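For reference, a tokenizer with the settings listed in this card could be trained with the `tokenizers` library roughly as follows. This is a minimal sketch, not the script used for this tokenizer: the function name `train_byte_level_bpe`, the choice of `add_prefix_space=False`, and the order of the special tokens are assumptions.

```python
from typing import Iterable

# Values taken from this card.
VOCAB_SIZE = 65_536
MIN_FREQUENCY = 3
SPECIAL_TOKENS = [
    "<|pad|>", "<|bos|>", "<|eos|>", "<|unk|>",
    "<|system|>", "<|user|>", "<|assistant|>", "<|answer|>", "<|end|>",
    "<think>", "</think>",
]


def train_byte_level_bpe(texts: Iterable[str]):
    """Train a byte-level BPE tokenizer on an iterable of strings.

    Hypothetical helper; the pre-tokenizer options are assumptions.
    """
    # Imported here so the constants above are usable without the library.
    from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

    tokenizer = Tokenizer(models.BPE(unk_token="<|unk|>"))
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
    tokenizer.decoder = decoders.ByteLevel()

    trainer = trainers.BpeTrainer(
        vocab_size=VOCAB_SIZE,
        min_frequency=MIN_FREQUENCY,
        special_tokens=SPECIAL_TOKENS,
        # Seed the vocabulary with all 256 byte symbols so any input is encodable.
        initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    )
    tokenizer.train_from_iterator(texts, trainer=trainer)
    return tokenizer


if __name__ == "__main__":
    # Tiny illustrative corpus; the resulting vocab will be far below VOCAB_SIZE.
    sample = ["Merhaba dünya.", "Türkçe bir cümle.", "Merhaba dünya."]
    tok = train_byte_level_bpe(sample)
    print(tok.get_vocab_size())
    print(tok.encode("Merhaba dünya.").tokens)
```

On a small corpus the trainer stops well short of the requested size, which is why the card reports both the requested and the actual vocab size.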