---
license: mit
task_categories:
- text-generation
language:
- tr
tags:
- tokenizer
- bpe
- turkish
- wikipedia
size_categories:
- 10K<n<100K
---
# Turkish BPE Tokenizer for Mamba & GPT Models

This is a custom Byte-Pair Encoding (BPE) tokenizer trained specifically for the Turkish language to support the mamba-tr-project research.
## Model Details
- Vocabulary Size: 32,000 tokens
- Training Corpus: Turkish Wikipedia (2023 Dump)
- Algorithm: Byte-level BPE
- Special Tokens: `<|endoftext|>`, `<|padding|>`
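For reference, a comparable byte-level BPE tokenizer can be trained with the `tokenizers` library. The sketch below uses a tiny in-memory stand-in corpus rather than the actual Wikipedia dump, but mirrors the configuration described above (byte-level BPE, 32,000-token vocabulary cap, the two special tokens):

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Tiny in-memory stand-in corpus; the published tokenizer was trained
# on the 2023 Turkish Wikipedia dump instead.
corpus = [
    "Yapay zeka günümüzde çok gelişti.",
    "Türkçe doğal dil işleme için özel bir tokenizer.",
]

# Byte-level BPE, matching the setup described in Model Details.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32000,  # matches the published vocabulary size
    special_tokens=["<|endoftext|>", "<|padding|>"],
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Byte-level BPE round-trips text losslessly, which matters for
# Turkish characters such as ç, ğ, ı, ö, ş, ü.
ids = tokenizer.encode("Yapay zeka günümüzde çok gelişti.").ids
print(tokenizer.decode(ids))
```

On a corpus this small the trainer stops well short of 32,000 merges; the cap only binds on a real corpus.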
## Usage
```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("oguzatas/mamba-tr-project-tokenizer")

text = "Yapay zeka günümüzde çok gelişti."
tokens = tokenizer.encode(text)
decoded = tokenizer.decode(tokens)

print(tokens)
# Output: [Token IDs...]
print(decoded)
# Output: "Yapay zeka günümüzde çok gelişti."
```