---
license: mit
task_categories:
- text-generation
language:
- tr
tags:
- tokenizer
- bpe
- turkish
- wikipedia
size_categories:
- 10K<n<100K
---
# Turkish BPE Tokenizer for Mamba & GPT Models
This is a custom **Byte-Pair Encoding (BPE)** tokenizer trained specifically for the Turkish language to support the `mamba-tr-project` research.
## Model Details
- **Vocabulary Size:** 32,000 tokens
- **Training Corpus:** Turkish Wikipedia (2023 Dump)
- **Algorithm:** BPE (Byte-Level)
- **Special Tokens:** `<|endoftext|>`, `<|padding|>`
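
If needed, these details can be sanity-checked after loading the tokenizer (a minimal sketch; the repository ID matches the Usage section below, and the expected values assume the tokenizer was uploaded with this configuration):

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("oguzatas/mamba-tr-project-tokenizer")

print(tokenizer.vocab_size)          # expected: 32000
print(tokenizer.special_tokens_map)  # should list <|endoftext|> and <|padding|>
```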
## Usage
```python
from transformers import PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast.from_pretrained("oguzatas/mamba-tr-project-tokenizer")
text = "Yapay zeka günümüzde çok gelişti."
tokens = tokenizer.encode(text)
decoded = tokenizer.decode(tokens)
print(tokens)
# Output: [Token IDs...]
print(decoded)
# Output: "Yapay zeka günümüzde çok gelişti."
```
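
For batched inputs, `<|padding|>` can serve as the pad token. Below is a minimal sketch (it assumes PyTorch is installed for `return_tensors="pt"`; if the pad token was not stored in the tokenizer config, it is registered explicitly):

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("oguzatas/mamba-tr-project-tokenizer")

# Register the pad token if it was not saved with the tokenizer config.
if tokenizer.pad_token is None:
    tokenizer.pad_token = "<|padding|>"

# Encode a batch of sentences, padding shorter ones to the longest in the batch.
batch = tokenizer(
    ["Yapay zeka günümüzde çok gelişti.", "Merhaba dünya!"],
    padding=True,
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # (2, length of the longest sequence)
```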