Turkish MFT Tokenizer

This is a custom Turkish tokenizer based on Meaningful Formal Tokenization (MFT). It is compatible with Hugging Face Transformers.

Usage

from transformers import AutoTokenizer

# trust_remote_code is required because the tokenizer implementation
# lives in this repository (tokenization_turkish_mft.py).
tokenizer = AutoTokenizer.from_pretrained("alibayram/turkish-mft-tokenizer", trust_remote_code=True)

text = "Merhaba nasılsın?"
tokens = tokenizer.tokenize(text)  # split the text into MFT tokens
print(tokens)

ids = tokenizer.encode(text)  # convert the text to token ids
print(ids)

decoded_text = tokenizer.decode(ids)  # map token ids back to text
print(decoded_text)

🚀 High-Performance Version (Rust)

If you need higher performance (~100x faster) for large-scale processing, you can use the Rust-based PyPI package, which implements the same tokenization logic.

Installation:

pip install turkish-tokenizer

Usage:

from turkish_tokenizer import TurkishTokenizer

# Initialize the Rust-backed tokenizer
tokenizer = TurkishTokenizer()

tokens = tokenizer.encode("Bugün hava çok güzel.")
print(tokens)
# Output: [2, 0, 1234, ... ]

decoded = tokenizer.decode(tokens)
print(decoded)
# Output: Bugün hava çok güzel.

See turkish-tokenizer on PyPI for more details.

Structure

  • tokenization_turkish_mft.py: The tokenizer implementation and HF wrapper.
  • vocabs/: Directory containing vocabulary and morphological rules JSON files.
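To give a rough intuition for how a morphology-aware tokenizer splits Turkish words into meaningful units, here is a minimal greedy longest-match sketch. The vocabulary below is hypothetical and purely illustrative; the actual tokenizer loads its roots and morphological rules from the JSON files in vocabs/ and is not limited to greedy matching.

```python
# Hypothetical root/suffix vocabulary for illustration only; the real
# tokenizer reads its vocabulary and rules from vocabs/.
HYPOTHETICAL_VOCAB = {
    "merhaba", "nasıl", "sın", "bugün", "hava", "çok", "güzel",
}

def tokenize_word(word: str, vocab: set[str]) -> list[str]:
    """Split a lowercase word into the longest vocabulary matches."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest possible match starting at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No match: fall back to a single-character token.
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenize_word("nasılsın", HYPOTHETICAL_VOCAB))  # ['nasıl', 'sın']
```

This illustrates the idea of decomposing a word like "nasılsın" into a root ("nasıl") plus a suffix ("sın"); the real MFT implementation additionally applies Turkish morphological rules when choosing split points.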