new-model / README.md
Efe2898's picture
Add CPU-safe 65K RSLM tokenizer trained on BellaTurca
ad8b8fc verified
metadata
library_name: tokenizers
language:
  - tr
tags:
  - turkish
  - tokenizer
  - byte-level-bpe
  - rslm

RSLM Tokenizer 65K

CPU-safe Byte-Level BPE tokenizer for RSLM.

Training data

Dataset: turkish-nlp-suite/BellaTurca

Subsets:

  • AkademikDerlem
  • OzenliDerlem
  • temiz-OSCAR
  • temiz-mC4

Column: text

Target estimated tokens: 700,000,000 total, approximately 175,000,000 per subset.

Vocab

  • Requested vocab size: 65,536
  • Actual vocab size: 65,536
  • BPE min frequency: 3

Special tokens

  • <|pad|>
  • <|bos|>
  • <|eos|>
  • <|unk|>
  • <|system|>
  • <|user|>
  • <|assistant|>
  • <|answer|>
  • <|end|>
  • <think>
  • </think>