---
library_name: tokenizers
language:
- tr
tags:
- turkish
- tokenizer
- byte-level-bpe
- rslm
---

# RSLM Tokenizer 65K

CPU-safe Byte-Level BPE tokenizer for RSLM.

## Training data

Dataset: `turkish-nlp-suite/BellaTurca`

Subsets:

- `AkademikDerlem`
- `OzenliDerlem`
- `temiz-OSCAR`
- `temiz-mC4`

Column: `text`

Target estimated tokens: `700,000,000` total, approximately `175,000,000` per subset.

## Vocab

- Requested vocab size: `65,536`
- Actual vocab size: `65,536`
- BPE min frequency: `3`

## Special tokens

- `<|pad|>`
- `<|bos|>`
- `<|eos|>`
- `<|unk|>`
- `<|system|>`
- `<|user|>`
- `<|assistant|>`
- `<|answer|>`
- `<|end|>`
- ``
- ``
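
## Example: reproducing the settings (sketch)

The settings listed above can be collected into a short Python sketch. This is not the script used to build this tokenizer; it is a minimal illustration, assuming the Hugging Face `tokenizers` library and an even split of the token budget across the four subsets. The names `per_subset_budget` and `train_tokenizer` are hypothetical helpers, and `text_iterator` is assumed to yield strings from the `text` column. The two blank entries in the special-token list are omitted because their values are not given here.

```python
# Values taken from this model card.
SUBSETS = ["AkademikDerlem", "OzenliDerlem", "temiz-OSCAR", "temiz-mC4"]
TOTAL_TOKEN_BUDGET = 700_000_000
VOCAB_SIZE = 65_536
MIN_FREQUENCY = 3

# Only the nine named special tokens; the two blank entries are left out.
SPECIAL_TOKENS = [
    "<|pad|>", "<|bos|>", "<|eos|>", "<|unk|>",
    "<|system|>", "<|user|>", "<|assistant|>",
    "<|answer|>", "<|end|>",
]


def per_subset_budget(total=TOTAL_TOKEN_BUDGET, subsets=SUBSETS):
    """Split the token budget evenly across subsets (assumed split)."""
    return total // len(subsets)


def train_tokenizer(text_iterator):
    """Train a byte-level BPE tokenizer on an iterator of strings.

    Hypothetical helper: the pre-tokenizer and decoder options below are
    assumptions, not confirmed settings of the released tokenizer.
    """
    # Imported lazily so the constants above are usable without the library.
    from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

    tokenizer = Tokenizer(models.BPE(unk_token="<|unk|>"))
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
    tokenizer.decoder = decoders.ByteLevel()

    trainer = trainers.BpeTrainer(
        vocab_size=VOCAB_SIZE,
        min_frequency=MIN_FREQUENCY,
        special_tokens=SPECIAL_TOKENS,
        # Seed the vocabulary with all 256 byte symbols.
        initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    )
    tokenizer.train_from_iterator(text_iterator, trainer=trainer)
    return tokenizer
```

Calling `per_subset_budget()` gives the per-subset figure quoted under "Training data". Passing special tokens to the trainer places them at the start of the vocabulary in the order given, which is why the list order matters if token IDs need to stay stable.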