---
language:
- tr
---

## Tokenizer Summary

TR Tokenizer is a FastTokenizer that splits Turkish words into semantically coherent units, combining current natural language processing methods with Turkish grammar rules. It analyzes each word morphologically and semantically to separate roots from suffixes. For example, the sentence "akademisyenler ve aileleri ile birlikte aktif çalışıyorlar" (academics and their families are actively working together) is split into the following parts:

```json
["akademisyen", "ler", "ve", "aile", "leri", "ile", "birlikte", "aktif", "çalış", "ı", "yor", "lar"]
```

## Languages

This tokenizer focuses on the **Turkish** language and is designed to support Turkish's rich morphological structure.

## Tokenizer Details

TR Tokenizer is implemented as a FastTokenizer, which provides high-performance tokenization capabilities. It combines Turkish grammar rules and current NLP methods to separate words into their roots and suffixes. The tokenizer is trained with predefined word and suffix lists and analyzes words while preserving the semantic integrity of Turkish.

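To make the idea of root and suffix separation concrete, here is a minimal, self-contained sketch of list-based segmentation. It is not the tokenizer's actual implementation: the `ROOTS` and `SUFFIXES` sets and the `segment` and `tokenize` helpers are hypothetical, and the toy lists were chosen only so that the example sentence can be covered.

```python
# Hypothetical toy vocabularies; the real tokenizer's word and suffix
# lists are not reproduced here.
ROOTS = {"akademisyen", "aile", "çalış", "ve", "ile", "birlikte", "aktif"}
SUFFIXES = {"ler", "lar", "leri", "ı", "yor"}


def split_suffixes(rest):
    """Split `rest` into known suffixes, trying the longest match first.

    Returns a list of suffixes, or None if `rest` cannot be fully covered.
    """
    if not rest:
        return []
    for end in range(len(rest), 0, -1):
        if rest[:end] in SUFFIXES:
            tail = split_suffixes(rest[end:])
            if tail is not None:
                return [rest[:end]] + tail
    return None


def segment(word):
    """Return [root, suffix, ...] for `word`, or [word] if no analysis fits."""
    for end in range(len(word), 0, -1):  # prefer the longest known root
        if word[:end] in ROOTS:
            suffixes = split_suffixes(word[end:])
            if suffixes is not None:
                return [word[:end]] + suffixes
    return [word]


def tokenize(sentence):
    """Segment every whitespace-separated word of `sentence`."""
    return [piece for word in sentence.split() for piece in segment(word)]


print(tokenize("akademisyenler ve aileleri ile birlikte aktif çalışıyorlar"))
```

With these toy lists the sketch reproduces the split shown in the summary. Words that have no analysis under the lists are returned unchanged, which is one simple fallback among several possible choices.
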
### Example Usage

```python
# Load the tokenizer directly from the Hugging Face Hub
from transformers import AutoTokenizer

# Initialize the FastTokenizer
tokenizer = AutoTokenizer.from_pretrained("umutckmk2/tr_tokenizer", use_fast=True)

sentence = "akademisyenler ve aileleri ile birlikte aktif çalışıyorlar"

# Split the sentence into root and suffix tokens
tokens = tokenizer.tokenize(sentence)
print(tokens)
# Output: ['akademisyen', 'ler', 've', 'aile', 'leri', 'ile', 'birlikte', 'aktif', 'çalış', 'ı', 'yor', 'lar']

# Encode the text to token IDs
encoded = tokenizer(sentence)
print(encoded)
```