File size: 657 Bytes
ca41c16 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 | """
TurkTokenizer — Turkish morphological tokenizer.
TR-MMLU world record: 92%
Usage:
from turk_tokenizer import TurkTokenizer
tok = TurkTokenizer()
tokens = tok("İstanbul'da meeting'e katılamadım")
# Each token dict contains:
# token : str — token string (with leading space if word-initial)
# token_type : str — ROOT | SUFFIX | FOREIGN | BPE | PUNCT |
# NUM | DATE | UNIT | URL | MENTION | HASHTAG | EMOJI
# morph_pos : int — 0=root/word-initial, 1=first suffix, 2=second...
"""
from .tokenizer import TurkTokenizer
__all__ = ["TurkTokenizer"]
__version__ = "1.0.0"
|