---
license: mit
task_categories:
- text-generation
language:
- tr
tags:
- tokenizer
- bpe
- turkish
- wikipedia
size_categories:
- 10K<n<100K
---

# Turkish BPE Tokenizer for Mamba & GPT Models

This is a custom **Byte-Pair Encoding (BPE)** tokenizer trained specifically for the Turkish language to support the `mamba-tr-project` research.

## Model Details

- **Vocabulary Size:** 32,000 tokens
- **Training Corpus:** Turkish Wikipedia (2023 dump)
- **Algorithm:** BPE (byte-level)
- **Special Tokens:** `<|endoftext|>`, `<|padding|>`
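
For intuition, the core BPE training loop can be sketched in plain Python: count adjacent symbol pairs across the corpus, merge the most frequent pair, and repeat. This is a character-level toy on made-up word frequencies, not the byte-level implementation or training script actually used for this tokenizer:

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across all words (each word is a tuple of symbols)."""
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Rewrite every word, replacing occurrences of `pair` with one merged symbol."""
    a, b = pair
    merged = {}
    for word, freq in words.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and word[i] == a and word[i + 1] == b:
                out.append(a + b)  # fuse the pair into a single symbol
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: hypothetical word frequencies, each word split into characters.
corpus = {tuple("gelişti"): 3, tuple("gelişme"): 2, tuple("geliyor"): 2}
merges = []
for _ in range(3):  # a real vocabulary needs thousands of merges
    pairs = get_pair_counts(corpus)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    merges.append(best)
    corpus = merge_pair(corpus, best)
print(merges)  # the shared stem "gel" is learned first
```

The learned merge list is exactly what a BPE tokenizer stores: applying the merges in order to new text reproduces the training-time segmentation.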

## Usage

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("oguzatas/mamba-tr-project-tokenizer")

text = "Yapay zeka günümüzde çok gelişti."
tokens = tokenizer.encode(text)
decoded = tokenizer.decode(tokens)

print(tokens)
# Output: [Token IDs...]
print(decoded)
# Output: "Yapay zeka günümüzde çok gelişti."
```