---
language:
- tr
tags:
- tokenizer
- morphology
- turkish
- nlp
license: mit
library_name: turk-tokenizer
---
# TurkTokenizer
**Turkish morphological tokenizer — TR-MMLU world record 92%**
TurkTokenizer performs linguistically aware tokenization of Turkish text using morphological rules. Unlike BPE-based tokenizers, it produces meaningful morphological units (roots and suffixes) aligned with Turkish grammar.
## Installation
```bash
pip install git+https://huggingface.co/Ethosoft/turk-tokenizer
```
**Java is required** (for Zemberek morphological analysis):
| OS | Command |
|---|---|
| Ubuntu / Debian | `sudo apt install default-jre` |
| Fedora / RHEL | `sudo dnf install java-latest-openjdk` |
| macOS | `brew install openjdk` |
| Windows | `winget install Microsoft.OpenJDK.21` |
## Quick Start
```python
from turk_tokenizer import TurkTokenizer

tok = TurkTokenizer()
tokens = tok("İstanbul'da meeting'e katılamadım")
for t in tokens:
    print(t["token"], t["token_type"], t["morph_pos"])
```
Output:
```
<uppercase_word> ROOT 0
istanbul ROOT 0
da SUFFIX 1
meeting FOREIGN 0
e SUFFIX 1
katılama ROOT 0
dı SUFFIX 1
m SUFFIX 2
```
## Output Fields
Each token is a dict with the following guaranteed fields:
| Field | Type | Description |
|---|---|---|
| `token` | `str` | Token string (leading space = word-initial) |
| `token_type` | `str` | See types below |
| `morph_pos` | `int` | `0` = root/word-initial, `1` = first suffix, `2` = second… |
### Token Types
| Type | Description |
|---|---|
| `ROOT` | Turkish root word |
| `SUFFIX` | Turkish morphological suffix |
| `FOREIGN` | Foreign/loanword root (e.g. "meeting", "zoom") |
| `BPE` | Unknown subword (fallback) |
| `PUNCT` | Punctuation mark |
| `NUM` | Number |
| `DATE` | Date |
| `UNIT` | Measurement unit |
| `URL` | Web URL |
| `MENTION` | @username |
| `HASHTAG` | #topic |
| `EMOJI` | Emoji |
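The three guaranteed fields are enough to regroup the token stream back into surface words: a token with `morph_pos == 0` starts a new word, and each suffix attaches to the preceding root. The sketch below runs on the Quick Start output shown above (the `<uppercase_word>` capitalization marker is omitted for simplicity); the `regroup` helper is illustrative, not part of the library API.

```python
# Token dicts as produced in the Quick Start example above, using only
# the three guaranteed fields.
tokens = [
    {"token": "istanbul", "token_type": "ROOT", "morph_pos": 0},
    {"token": "da", "token_type": "SUFFIX", "morph_pos": 1},
    {"token": "meeting", "token_type": "FOREIGN", "morph_pos": 0},
    {"token": "e", "token_type": "SUFFIX", "morph_pos": 1},
    {"token": "katılama", "token_type": "ROOT", "morph_pos": 0},
    {"token": "dı", "token_type": "SUFFIX", "morph_pos": 1},
    {"token": "m", "token_type": "SUFFIX", "morph_pos": 2},
]

def regroup(tokens):
    """Rejoin morpheme tokens into surface words: morph_pos == 0 opens
    a new word, any other value appends to the previous one."""
    words = []
    for t in tokens:
        if t["morph_pos"] == 0:
            words.append(t["token"])
        else:
            words[-1] += t["token"]
    return words

print(regroup(tokens))  # ['istanbulda', 'meetinge', 'katılamadım']
```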
### Optional Metadata Fields
| Field | Description |
|---|---|
| `_canonical` | Canonical morpheme ID (e.g. `"PL"`, `"ACC"`, `"DAT"`) |
| `_suffix_label` | Detailed morphological label (e.g. `"-PL+ACC"`) |
| `_foreign` | `True` if foreign root |
| `_caps` | `True` if originally ALL CAPS |
| `_domain` | `True` if medical/sports/tourism domain |
| `_compound` | `True` if compound word |
| `_parts` | Compound word parts |
| `_expansion` | Acronym expansion (e.g. `"CMV"` → `"Sitomegalovirüs"`) |
| `_pos` | POS tag from Zemberek (Noun, Verb, Adj…) |
| `_lemma` | Lemma from Zemberek |
| `_disambiguated` | `True` if context disambiguation was applied |
| `_root_corrected` | `True` if root was corrected by Zemberek |
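Since these fields are optional, read them with `dict.get` rather than indexing (this sketch assumes absent fields are simply omitted from the dict; the `describe` helper is illustrative, not a library function):

```python
def describe(token):
    """Format a token using guaranteed fields plus any optional
    metadata that happens to be present."""
    parts = [token["token"], token["token_type"]]
    label = token.get("_suffix_label")  # may be absent
    if label:
        parts.append(label)
    lemma = token.get("_lemma")  # may be absent
    if lemma:
        parts.append(f"lemma={lemma}")
    return " ".join(parts)

t = {"token": "ler", "token_type": "SUFFIX", "morph_pos": 1,
     "_canonical": "PL", "_suffix_label": "-PL"}
print(describe(t))  # ler SUFFIX -PL
```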
## Batch Tokenization
```python
texts = ["Ankara'da kar yağıyor.", "Meeting'e katılacak mısın?"]
results = tok.batch_tokenize(texts, workers=4)
```
## Statistics
```python
tokens = tok("Türk dili zengin bir morfolojiye sahiptir.")
s = tok.stats(tokens)
print(f"TR coverage: {s['tr_pct']}%")
```
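As a rough sketch of what a coverage metric like `tr_pct` might compute (assumed semantics, not the library's exact formula): the share of tokens classified as Turkish units. Per fix 5 in the list below, `PUNCT` counts toward TR coverage alongside `ROOT` and `SUFFIX`.

```python
# Assumption: "TR coverage" = fraction of tokens whose type counts as
# a Turkish unit. The exact set of counted types is an assumption here.
TR_TYPES = {"ROOT", "SUFFIX", "PUNCT"}

def tr_pct(tokens):
    """Percentage of tokens whose token_type is in TR_TYPES."""
    tr = sum(t["token_type"] in TR_TYPES for t in tokens)
    return round(100 * tr / len(tokens), 1)

sample = [
    {"token": "türk", "token_type": "ROOT"},
    {"token": "zoom", "token_type": "FOREIGN"},
    {"token": "da", "token_type": "SUFFIX"},
    {"token": ".", "token_type": "PUNCT"},
]
print(tr_pct(sample))  # 75.0
```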
## Morphological Fixes Applied
1. **ALL CAPS** — `"İSTANBUL"` → 2 tokens instead of 16
2. **Apostrophe splitting** — `"meeting'e"` → `[meeting:FOREIGN][e:SUFFIX]`
3. **BPE→SUFFIX** — 260+ suffix patterns reclassified
4. **Zemberek root validation** — phonetic root correction (`"gök"` → `"göğüs"`)
5. **Punctuation** — classified as PUNCT (counted in TR coverage)
6. **Domain vocabulary** — 500+ medical/sports/tourism roots
7. **TDK FOREIGN detection** — 76K+ Turkish words used as reference
8. **Special token normalization** — NUM, DATE, URL, MENTION, HASHTAG, EMOJI
9. **Allomorph canonicalization** — `"lar"/"ler"` → `PL`, `"dan"/"den"` → `ABL`
10. **Compound decomposition** — `"başbakan"` → `["baş", "bakan"]`
11. **Acronym expansion** — `"CMV"` → `"Sitomegalovirüs"`
12. **Context disambiguation** — Zemberek sentence-level POS selection
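Allomorph canonicalization (fix 9) can be sketched as a lookup from surface suffix variants to one canonical morpheme ID. This is an illustrative table only; the library's real mapping covers far more suffixes and is driven by vowel-harmony and consonant-assimilation rules rather than a flat dict.

```python
# Illustrative allomorph table: several vowel/consonant-harmony variants
# of a suffix map to a single canonical ID, as in fix 9 above.
CANONICAL = {
    "lar": "PL", "ler": "PL",                      # plural
    "dan": "ABL", "den": "ABL",                    # ablative
    "tan": "ABL", "ten": "ABL",                    # ablative after voiceless consonants
    "da": "LOC", "de": "LOC",                      # locative
}

def canonicalize(suffix):
    """Map a surface suffix to its canonical morpheme ID, or return it
    unchanged if it is not in the table."""
    return CANONICAL.get(suffix, suffix)

print(canonicalize("ler"))  # PL
print(canonicalize("ten"))  # ABL
```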
## Benchmark
| Benchmark | Score |
|---|---|
| TR-MMLU | **92%** (world record) |
## License
MIT