nmstech Claude Opus 4.6 committed on
Commit cfffd93 · 1 Parent(s): e430fca

Rename project from TurkTokenizer to NedoTurkishTokenizer

- Rename module directory: turk_tokenizer/ -> nedo_turkish_tokenizer/
- Rename HF wrapper: tokenization_turk.py -> tokenization_nedo_turkish.py
- Update class name: TurkTokenizer -> NedoTurkishTokenizer
- Update PyPI package name: turk-tokenizer -> nedo-turkish-tokenizer
- Update all HuggingFace URLs to Ethosoft/NedoTurkishTokenizer
- Update log messages, cache paths, and config references

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

.gitattributes CHANGED
@@ -33,4 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
- turk_tokenizer/data/zemberek-full.jar filter=lfs diff=lfs merge=lfs -text
+ nedo_turkish_tokenizer/data/zemberek-full.jar filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -8,15 +8,15 @@ tags:
    - nlp
    - transformers
  license: mit
- library_name: turk-tokenizer
+ library_name: nedo-turkish-tokenizer
  pipeline_tag: token-classification
  ---
 
- # TurkTokenizer
+ # NedoTurkishTokenizer
 
  **Turkish morphological tokenizer — TR-MMLU world record 95.45%**
 
- TurkTokenizer performs linguistically-aware tokenization of Turkish text using morphological rules. Unlike BPE-based tokenizers, it produces meaningful morphological units (roots and suffixes) aligned with Turkish grammar, powered by [Zemberek NLP](https://github.com/ahmetaa/zemberek-nlp).
+ NedoTurkishTokenizer performs linguistically-aware tokenization of Turkish text using morphological rules. Unlike BPE-based tokenizers, it produces meaningful morphological units (roots and suffixes) aligned with Turkish grammar, powered by [Zemberek NLP](https://github.com/ahmetaa/zemberek-nlp).
 
  ## Model Details
 
@@ -35,7 +35,7 @@ TurkTokenizer performs linguistically-aware tokenization of Turkish text using m
  ### Installation
 
  ```bash
- pip install git+https://huggingface.co/Ethosoft/turk-tokenizer
+ pip install git+https://huggingface.co/Ethosoft/NedoTurkishTokenizer
  ```
 
  > **Java is required** for Zemberek morphological analysis.
@@ -50,12 +50,12 @@ pip install git+https://huggingface.co/Ethosoft/turk-tokenizer
 
  ---
 
- ### With 🤗 Transformers (`AutoTokenizer`)
+ ### With Transformers (`AutoTokenizer`)
 
  ```python
  from transformers import AutoTokenizer
 
- tok = AutoTokenizer.from_pretrained("Ethosoft/turk-tokenizer", trust_remote_code=True)
+ tok = AutoTokenizer.from_pretrained("Ethosoft/NedoTurkishTokenizer", trust_remote_code=True)
 
  out = tok("Türk dili, morfolojik açıdan zengin bir dildir.")
  print(out["input_ids"])   # hash-stable int IDs
@@ -69,7 +69,7 @@ for t in out["morphological_tokens"]:
  **Batch tokenization:**
  ```python
  out = tok(["Türkçe metin.", "Another sentence with code-switching."])
- # out["input_ids"] → list of lists
+ # out["input_ids"] -> list of lists
  ```
 
  **Direct morphological tokenization:**
@@ -79,7 +79,7 @@ for t in tokens:
      print(f"{t['token']:20s} {t['token_type']:8s} pos={t['morph_pos']}", end="")
      if t.get("_canonical"): print(f" [{t['_canonical']}]", end="")
      if t.get("_compound"): print(f" compound={t['_parts']}", end="")
-     if t.get("_expansion"): print(f" → {t['_expansion']}", end="")
+     if t.get("_expansion"): print(f" -> {t['_expansion']}", end="")
      print()
  ```
 
@@ -88,9 +88,9 @@ for t in tokens:
  ### Standalone (without Transformers)
 
  ```python
- from turk_tokenizer import TurkTokenizer
+ from nedo_turkish_tokenizer import NedoTurkishTokenizer
 
- tok = TurkTokenizer()
+ tok = NedoTurkishTokenizer()
 
  # Single text
  tokens = tok("İSTANBUL'da meeting'e katılamadım")
@@ -132,7 +132,7 @@ Every token dict contains:
  |---|---|---|
  | `token` | `str` | Token string — leading space means word-initial |
  | `token_type` | `str` | Morphological type (see table below) |
- | `morph_pos` | `int` | Position within word: `0`=root, `1`=1st suffix, `2`=2nd suffix… |
+ | `morph_pos` | `int` | Position within word: `0`=root, `1`=1st suffix, `2`=2nd suffix... |
 
  ### Token Types
 
@@ -149,42 +149,42 @@ Every token dict contains:
  | `URL` | Web address | `https://...` |
  | `MENTION` | @username | `@ethosoft` |
  | `HASHTAG` | #topic | `#NLP` |
- | `EMOJI` | Emoji | 😊 |
+ | `EMOJI` | Emoji | |
 
  ### Optional Metadata Fields
 
  | Field | Description |
  |---|---|
- | `_canonical` | Canonical morpheme: `"lar"/"ler"` → `"PL"`, `"dan"/"den"` → `"ABL"` |
- | `_suffix_label` | Detailed morphological label: `-PL+ACC`, `-P3+LOC`, … |
+ | `_canonical` | Canonical morpheme: `"lar"/"ler"` -> `"PL"`, `"dan"/"den"` -> `"ABL"` |
+ | `_suffix_label` | Detailed morphological label: `-PL+ACC`, `-P3+LOC`, ... |
  | `_foreign` | `True` — foreign root detected by TDK lookup |
  | `_caps` | `True` — originally ALL CAPS word |
  | `_domain` | `True` — medical / sports / tourism domain word |
  | `_compound` | `True` — compound word (e.g. `başbakan`) |
  | `_parts` | Compound parts: `["baş", "bakan"]` |
- | `_expansion` | Acronym expansion: `"CMV"` → `"Sitomegalovirüs"` |
- | `_pos` | POS tag from Zemberek: `Noun`, `Verb`, `Adj`, `Num`… |
- | `_lemma` | Lemma from Zemberek: `"gelir"` → `"gelmek"` (when verb) |
- | `_disambiguated` | `True` — context disambiguation applied (`"yüz"`, `"gelir"`…) |
- | `_root_corrected` | `True` — phonetic root correction: `"gök"` → `"göğüs"` |
+ | `_expansion` | Acronym expansion: `"CMV"` -> `"Sitomegalovirüs"` |
+ | `_pos` | POS tag from Zemberek: `Noun`, `Verb`, `Adj`, `Num`... |
+ | `_lemma` | Lemma from Zemberek: `"gelir"` -> `"gelmek"` (when verb) |
+ | `_disambiguated` | `True` — context disambiguation applied (`"yüz"`, `"gelir"`...) |
+ | `_root_corrected` | `True` — phonetic root correction: `"gök"` -> `"göğüs"` |
 
  ---
 
  ## How It Works
 
- TurkTokenizer wraps the base `turkish-tokenizer` BPE model with **12 sequential morphological fixes**:
+ NedoTurkishTokenizer wraps the base `turkish-tokenizer` BPE model with **12 sequential morphological fixes**:
 
  | Fix | Problem | Solution |
  |---|---|---|
- | 1 | `İSTANBUL` → 16 BPE tokens | Lowercase before tokenization, restore `<uppercase_word>` marker |
- | 2 | `meeting'e` → broken BPE | Detect foreign base + Turkish suffix, split at apostrophe |
- | 3 | Turkish suffixes classified as BPE | 260+ suffix patterns reclassified → SUFFIX |
+ | 1 | `İSTANBUL` -> 16 BPE tokens | Lowercase before tokenization, restore `<uppercase_word>` marker |
+ | 2 | `meeting'e` -> broken BPE | Detect foreign base + Turkish suffix, split at apostrophe |
+ | 3 | Turkish suffixes classified as BPE | 260+ suffix patterns reclassified -> SUFFIX |
  | 4 | Wrong roots (`gök` for `göğüs`) | Zemberek phonetic root validation & correction |
  | 5 | Punctuation counted as BPE | Classify as PUNCT |
  | 6 | Medical/domain terms as BPE | 500+ medical, sports, tourism root vocabulary |
- | 7 | Foreign words as BPE | TDK 76K+ word lookup → FOREIGN ROOT |
+ | 7 | Foreign words as BPE | TDK 76K+ word lookup -> FOREIGN ROOT |
  | 8 | Numbers, URLs, mentions fragmented | Pre-tokenization placeholder normalization |
- | 9 | `lar`/`ler` different IDs for same morpheme | Allomorph canonicalization (`PL`, `ACC`, `DAT`…) |
+ | 9 | `lar`/`ler` different IDs for same morpheme | Allomorph canonicalization (`PL`, `ACC`, `DAT`...) |
  | 10 | `başbakan` as single unknown ROOT | Compound word decomposition |
  | 11 | `CMV`, `NATO` without meaning | Acronym expansion dictionary (100+ entries) |
  | 12 | `yüz` = 100 or face or swim? | Zemberek sentence-level context disambiguation |
@@ -193,4 +193,4 @@ TurkTokenizer wraps the base `turkish-tokenizer` BPE model with **12 sequential
 
  ## License
 
- MIT © [Ethosoft](https://huggingface.co/Ethosoft)
+ MIT (c) [Ethosoft](https://huggingface.co/Ethosoft)
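Fix 9 and the `_canonical` field above describe allomorph canonicalization: vowel-harmony variants of one morpheme collapse to a single canonical label, so they can share one token ID. A minimal sketch of the idea, using only the mappings quoted in the README (`lar`/`ler` -> `PL`, `dan`/`den` -> `ABL`); the real tables in `_allomorph.py` are presumably far larger:

```python
# Toy allomorph table: surface suffix -> canonical morpheme label.
# Only the pairs quoted in the README; illustrative, not the real module.
CANONICAL = {
    "lar": "PL", "ler": "PL",    # plural allomorphs
    "dan": "ABL", "den": "ABL",  # ablative allomorphs
}

def canonicalize(suffix: str) -> str:
    """Map a surface suffix to its canonical label (identity if unknown)."""
    return CANONICAL.get(suffix, suffix)
```

Deriving token IDs from the canonical label rather than the surface form is what makes `lar` and `ler` land on the same ID.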
{turk_tokenizer → nedo_turkish_tokenizer}/__init__.py RENAMED
@@ -1,11 +1,11 @@
  """
- TurkTokenizer — Turkish morphological tokenizer.
+ NedoTurkishTokenizer — Turkish morphological tokenizer.
  TR-MMLU world record: 92%
 
  Usage:
-     from turk_tokenizer import TurkTokenizer
+     from nedo_turkish_tokenizer import NedoTurkishTokenizer
 
-     tok = TurkTokenizer()
+     tok = NedoTurkishTokenizer()
      tokens = tok("İstanbul'da meeting'e katılamadım")
 
  # Each token dict contains:
@@ -15,7 +15,7 @@ Usage:
  # morph_pos : int — 0=root/word-initial, 1=first suffix, 2=second...
  """
 
- from .tokenizer import TurkTokenizer
+ from .tokenizer import NedoTurkishTokenizer
 
- __all__ = ["TurkTokenizer"]
+ __all__ = ["NedoTurkishTokenizer"]
  __version__ = "1.0.0"
{turk_tokenizer → nedo_turkish_tokenizer}/_acronym_dict.py RENAMED
File without changes
{turk_tokenizer → nedo_turkish_tokenizer}/_allomorph.py RENAMED
File without changes
{turk_tokenizer → nedo_turkish_tokenizer}/_compound.py RENAMED
File without changes
{turk_tokenizer → nedo_turkish_tokenizer}/_context_aware.py RENAMED
File without changes
{turk_tokenizer → nedo_turkish_tokenizer}/_java_check.py RENAMED
@@ -24,7 +24,7 @@ def ensure_java() -> None:
      raise RuntimeError(
          "\n"
          "╔══════════════════════════════════════════════════════════════╗\n"
-         "║ TurkTokenizer requires Java (JVM) — not found on this system ║\n"
+         "║ NedoTurkishTokenizer requires Java (JVM) — not found on this system ║\n"
          "╠══════════════════════════════════════════════════════════════╣\n"
          f"║ Install Java with: ║\n"
          f"║ {_install_cmd:<58}║\n"
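The hunk above changes only the banner text; the requirement itself is unchanged: Zemberek runs on the JVM, so `ensure_java()` has to fail loudly when no Java runtime is present. A hedged sketch of the kind of check this boils down to (the actual `_java_check.py` logic may differ, e.g. it also builds a per-platform install command):

```python
import shutil

def java_available() -> bool:
    """True if a `java` executable is on PATH."""
    return shutil.which("java") is not None

def ensure_java_sketch() -> None:
    """Raise with an actionable message when the JVM is missing (sketch)."""
    if not java_available():
        raise RuntimeError("Java (JVM) not found - install a JRE to enable Zemberek")
```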
{turk_tokenizer → nedo_turkish_tokenizer}/_medical_vocab.py RENAMED
File without changes
{turk_tokenizer → nedo_turkish_tokenizer}/_normalizer.py RENAMED
File without changes
{turk_tokenizer → nedo_turkish_tokenizer}/_preprocessor.py RENAMED
File without changes
{turk_tokenizer → nedo_turkish_tokenizer}/_root_validator.py RENAMED
@@ -19,7 +19,7 @@ def _init_zemberek() -> None:
 
      if not JAR_PATH.exists():
          print(
-             f"[TurkTokenizer] zemberek-full.jar not found at {JAR_PATH}\n"
+             f"[NedoTurkishTokenizer] zemberek-full.jar not found at {JAR_PATH}\n"
              "    Root validation disabled — morphological fixes will be limited."
          )
          return
@@ -40,9 +40,9 @@ def _init_zemberek() -> None:
          ZEMBEREK_AVAILABLE = True
 
      except ImportError:
-         print("[TurkTokenizer] jpype1 not installed → pip install jpype1")
+         print("[NedoTurkishTokenizer] jpype1 not installed → pip install jpype1")
      except Exception as exc:  # noqa: BLE001
-         print(f"[TurkTokenizer] Zemberek init failed: {exc}")
+         print(f"[NedoTurkishTokenizer] Zemberek init failed: {exc}")
 
 
  _init_zemberek()
{turk_tokenizer → nedo_turkish_tokenizer}/_suffix_expander.py RENAMED
File without changes
{turk_tokenizer → nedo_turkish_tokenizer}/_tdk_vocab.py RENAMED
@@ -6,7 +6,7 @@ import json
  import os
  from pathlib import Path
 
- _CACHE_DIR = Path.home() / ".cache" / "turk_tokenizer"
+ _CACHE_DIR = Path.home() / ".cache" / "nedo_turkish_tokenizer"
  _CACHE_DIR.mkdir(parents=True, exist_ok=True)
  TDK_CACHE_FILE = str(_CACHE_DIR / "tdk_words.txt")
 
@@ -16,8 +16,8 @@ _TDK_WORDS: set | None = None
 
 
  _HF_TDK_URL = (
-     "https://huggingface.co/Ethosoft/turk-tokenizer/resolve/main"
-     "/turk_tokenizer/data/tdk_words.txt"
+     "https://huggingface.co/Ethosoft/NedoTurkishTokenizer/resolve/main"
+     "/nedo_turkish_tokenizer/data/tdk_words.txt"
  )
 
 
@@ -27,7 +27,7 @@ def load_tdk_words() -> set:
          return _TDK_WORDS
 
      if not os.path.exists(TDK_CACHE_FILE):
-         print("[TurkTokenizer] TDK word list not found — downloading...")
+         print("[NedoTurkishTokenizer] TDK word list not found — downloading...")
          words = _download_from_hf() or _download_from_tdk()
          if not words:
              _TDK_WORDS = set()
@@ -35,7 +35,7 @@
 
      with open(TDK_CACHE_FILE, encoding="utf-8") as f:
          _TDK_WORDS = {line.strip().lower() for line in f if line.strip()}
-     print(f"[TurkTokenizer] TDK: {len(_TDK_WORDS):,} words loaded ✓")
+     print(f"[NedoTurkishTokenizer] TDK: {len(_TDK_WORDS):,} words loaded ✓")
      return _TDK_WORDS
 
 
@@ -51,11 +51,11 @@ def _download_from_hf() -> list[str]:
          with open(TDK_CACHE_FILE, "w", encoding="utf-8") as f:
              f.write("\n".join(words))
 
-         print(f"[TurkTokenizer] TDK: {len(words):,} words downloaded from HuggingFace ✓")
+         print(f"[NedoTurkishTokenizer] TDK: {len(words):,} words downloaded from HuggingFace ✓")
          return words
 
      except Exception as exc:  # noqa: BLE001
-         print(f"[TurkTokenizer] HuggingFace download failed: {exc} — trying TDK API...")
+         print(f"[NedoTurkishTokenizer] HuggingFace download failed: {exc} — trying TDK API...")
          return []
 
 
@@ -72,11 +72,11 @@ def _download_from_tdk() -> list[str]:
          with open(TDK_CACHE_FILE, "w", encoding="utf-8") as f:
              f.write("\n".join(words))
 
-         print(f"[TurkTokenizer] TDK: {len(words):,} words downloaded from TDK API ✓")
+         print(f"[NedoTurkishTokenizer] TDK: {len(words):,} words downloaded from TDK API ✓")
          return words
 
      except Exception as exc:  # noqa: BLE001
-         print(f"[TurkTokenizer] TDK API also failed: {exc}")
+         print(f"[NedoTurkishTokenizer] TDK API also failed: {exc}")
          print("    FOREIGN detection will be disabled for this session.")
          return []
 
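The download logic above rests on a simple short-circuit fallback: `_download_from_hf() or _download_from_tdk()` tries the HuggingFace mirror first and falls back to the TDK API only when the first call returns an empty list (the failure value in this module). A self-contained illustration of that pattern, with stub functions standing in for the real downloads:

```python
def download_from_hf() -> list[str]:
    """Stub for _download_from_hf(): simulate a failed mirror download."""
    return []  # the real function also returns [] on failure

def download_from_tdk() -> list[str]:
    """Stub for _download_from_tdk(): simulate a successful API download."""
    return ["kelime", "sözcük"]

# [] is falsy, so the expression falls through to the second source.
words = download_from_hf() or download_from_tdk()
```

The same expression silently prefers the cheap source when it works, with no explicit branching at the call site.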
{turk_tokenizer → nedo_turkish_tokenizer}/data/tdk_words.txt RENAMED
File without changes
{turk_tokenizer → nedo_turkish_tokenizer}/data/turkish_proper_nouns.txt RENAMED
File without changes
{turk_tokenizer → nedo_turkish_tokenizer}/data/zemberek-full.jar RENAMED
File without changes
{turk_tokenizer → nedo_turkish_tokenizer}/tokenizer.py RENAMED
@@ -1,5 +1,5 @@
  """
- TurkTokenizer — production-ready Turkish morphological tokenizer.
+ NedoTurkishTokenizer — production-ready Turkish morphological tokenizer.
 
  Applies 12 sequential fixes on top of the base turkish-tokenizer:
      1. ALL CAPS inflation fix
@@ -68,12 +68,12 @@ _TYPE_SYM = {
 
  # ── Parallel worker helpers ───────────────────────────────────────────────────
 
- _worker_tok: "TurkTokenizer | None" = None
+ _worker_tok: "NedoTurkishTokenizer | None" = None
 
 
  def _init_worker() -> None:
      global _worker_tok
-     _worker_tok = TurkTokenizer()
+     _worker_tok = NedoTurkishTokenizer()
 
 
  def _tokenize_one(text: str) -> list[dict]:
@@ -83,15 +83,15 @@ def _tokenize_one(text: str) -> list[dict]:
 
  # ══════════════════════════════════════════════════════════════════════════════
 
- class TurkTokenizer:
+ class NedoTurkishTokenizer:
      """
      Turkish morphological tokenizer with HuggingFace-compatible interface.
 
      Example::
 
-         from turk_tokenizer import TurkTokenizer
+         from nedo_turkish_tokenizer import NedoTurkishTokenizer
 
-         tok = TurkTokenizer()
+         tok = NedoTurkishTokenizer()
          tokens = tok("İstanbul'da meeting'e katılamadım")
          for t in tokens:
              print(t["token"], t["token_type"], t["morph_pos"])
@@ -210,14 +210,14 @@ class TurkTokenizer:
              results[i] = fut.result()
          except Exception as exc:  # noqa: BLE001
              results[i] = self._base.tokenize_text(texts[i])
-             print(f"[TurkTokenizer] fallback at idx={i}: {exc}")
+             print(f"[NedoTurkishTokenizer] fallback at idx={i}: {exc}")
 
      return results  # type: ignore[return-value]
 
  # ── HuggingFace-style helpers ─────────────────────────────────────────────
 
  @classmethod
- def from_pretrained(cls, _model_id: str = "Ethosoft/turk-tokenizer") -> "TurkTokenizer":
+ def from_pretrained(cls, _model_id: str = "Ethosoft/NedoTurkishTokenizer") -> "NedoTurkishTokenizer":
      """Load tokenizer (rules-based, no weights to download)."""
      return cls()
 
@@ -227,8 +227,8 @@ class TurkTokenizer:
      path = Path(save_directory)
      path.mkdir(parents=True, exist_ok=True)
      config = {
-         "tokenizer_class": "TurkTokenizer",
-         "model_type": "turk-tokenizer",
+         "tokenizer_class": "NedoTurkishTokenizer",
+         "model_type": "nedo-turkish-tokenizer",
          "version": "1.0.0",
          "zemberek_available": self.zemberek_available,
      }
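The `from_pretrained` hunk above shows a design choice worth noting: because the tokenizer is rule-based, there are no weights to fetch, so the model id is accepted purely for interface compatibility and then ignored. A stripped-down illustration of the pattern (class name hypothetical):

```python
class RulesBasedTokenizerSketch:
    """Hypothetical stand-in showing the no-download from_pretrained pattern."""

    @classmethod
    def from_pretrained(cls, _model_id: str = "Ethosoft/NedoTurkishTokenizer"):
        # Rules-based: nothing to download, just construct a fresh instance.
        return cls()

tok = RulesBasedTokenizerSketch.from_pretrained("any/model-id")
```

Callers written against the usual HuggingFace idiom keep working, while the constructor stays the single source of truth.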
pyproject.toml CHANGED
@@ -3,7 +3,7 @@ requires = ["setuptools>=61", "wheel"]
  build-backend = "setuptools.build_meta"
 
  [project]
- name = "turk-tokenizer"
+ name = "nedo-turkish-tokenizer"
  version = "1.0.0"
  description = "Turkish morphological tokenizer — TR-MMLU world record %92"
  readme = "README.md"
@@ -28,12 +28,12 @@ dependencies = [
  dev = ["pytest", "huggingface_hub"]
 
  [project.urls]
- Homepage = "https://huggingface.co/Ethosoft/turk-tokenizer"
- Repository = "https://huggingface.co/Ethosoft/turk-tokenizer"
+ Homepage = "https://huggingface.co/Ethosoft/NedoTurkishTokenizer"
+ Repository = "https://huggingface.co/Ethosoft/NedoTurkishTokenizer"
 
  [tool.setuptools.packages.find]
  where = ["."]
- include = ["turk_tokenizer*"]
+ include = ["nedo_turkish_tokenizer*"]
 
  [tool.setuptools.package-data]
- turk_tokenizer = ["data/*.jar"]
+ nedo_turkish_tokenizer = ["data/*.jar"]
tokenization_turk.py → tokenization_nedo_turkish.py RENAMED
@@ -1,10 +1,10 @@
  """
- TurkTokenizer — HuggingFace AutoTokenizer compatible class.
+ NedoTurkishTokenizer — HuggingFace AutoTokenizer compatible class.
 
  Usage:
      from transformers import AutoTokenizer
 
-     tok = AutoTokenizer.from_pretrained("Ethosoft/turk-tokenizer", trust_remote_code=True)
+     tok = AutoTokenizer.from_pretrained("Ethosoft/NedoTurkishTokenizer", trust_remote_code=True)
      out = tok("İstanbul'da meeting'e katılamadım")
 
      out["input_ids"]   # hash-stable int IDs of morphological tokens
@@ -42,7 +42,7 @@ def _stable_hash(s: str) -> int:
      return int(hashlib.md5(s.encode("utf-8")).hexdigest()[:6], 16)
 
 
- class TurkTokenizer(PreTrainedTokenizer):
+ class NedoTurkishTokenizer(PreTrainedTokenizer):
      """
      Turkish morphological tokenizer — HuggingFace compatible.
 
@@ -62,11 +62,11 @@ class TurkTokenizer(PreTrainedTokenizer):
 
      def __init__(self, **kwargs: Any) -> None:
          super().__init__(**kwargs)
-         self._morph: "TurkTokenizer_core | None" = None  # lazy init
+         self._morph: "NedoTurkishTokenizer_core | None" = None  # lazy init
 
      def _get_morph(self):
          if self._morph is None:
-             from turk_tokenizer import TurkTokenizer as _Core  # noqa: PLC0415
+             from nedo_turkish_tokenizer import NedoTurkishTokenizer as _Core  # noqa: PLC0415
              self._morph = _Core()
          return self._morph
 
@@ -160,7 +160,7 @@ class TurkTokenizer(PreTrainedTokenizer):
          return self._tokenize(text)
 
      def morphological_tokenize(self, text: str) -> list[dict]:
-         """Return full morphological token dicts (main TurkTokenizer output)."""
+         """Return full morphological token dicts (main NedoTurkishTokenizer output)."""
          return self._get_morph().tokenize(text)
 
      def batch_tokenize(self, texts: list[str], workers: int | None = None) -> list[list[dict]]:
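The `_stable_hash` helper visible (unchanged) in the hunks above is what makes the wrapper's `input_ids` "hash-stable": IDs come from a content hash rather than a learned vocabulary, so they are identical across runs and machines. Reproduced standalone:

```python
import hashlib

def stable_hash(s: str) -> int:
    """First 6 hex digits of the MD5 digest as an int, i.e. an ID in [0, 16**6)."""
    return int(hashlib.md5(s.encode("utf-8")).hexdigest()[:6], 16)
```

Unlike Python's built-in `hash()`, which is salted per process via `PYTHONHASHSEED`, MD5 is deterministic; the 6-hex-digit slice caps the ID space at 16,777,216, so distinct tokens can in principle collide.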
tokenizer_config.json CHANGED
@@ -1,8 +1,8 @@
  {
-   "tokenizer_class": "TurkTokenizer",
-   "model_type": "turk-tokenizer",
+   "tokenizer_class": "NedoTurkishTokenizer",
+   "model_type": "nedo-turkish-tokenizer",
    "auto_map": {
-     "AutoTokenizer": ["tokenization_turk.TurkTokenizer", null]
+     "AutoTokenizer": ["tokenization_nedo_turkish.NedoTurkishTokenizer", null]
    },
    "version": "1.0.0",
    "language": "tr",
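The `auto_map` entry is what makes `AutoTokenizer.from_pretrained(..., trust_remote_code=True)` keep working after the rename: it tells transformers which `module.Class` to import from the repo (the second slot, `null`, is the unused fast-tokenizer class). A quick sanity check on the updated fragment, closed off with just enough to be valid JSON:

```python
import json

# Config fragment as shown in the diff above (trailing comma dropped for valid JSON).
config = json.loads("""
{
  "tokenizer_class": "NedoTurkishTokenizer",
  "model_type": "nedo-turkish-tokenizer",
  "auto_map": {
    "AutoTokenizer": ["tokenization_nedo_turkish.NedoTurkishTokenizer", null]
  },
  "version": "1.0.0",
  "language": "tr"
}
""")

slow_ref, fast_ref = config["auto_map"]["AutoTokenizer"]
module_name, class_name = slow_ref.rsplit(".", 1)  # file stem, class to import
```

If the module file or class name drifts out of sync with this string, remote-code loading breaks, which is why the commit updates the wrapper filename and `auto_map` together.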