nmstech commited on
Commit
864ffd2
·
verified ·
1 Parent(s): be2f46e

Add AutoTokenizer support (trust_remote_code)

Browse files
Files changed (1) hide show
  1. README.md +17 -0
README.md CHANGED
@@ -33,6 +33,7 @@ pip install git+https://huggingface.co/Ethosoft/turk-tokenizer
33
 
34
  ## Quick Start
35
 
 
36
  ```python
37
  from turk_tokenizer import TurkTokenizer
38
 
@@ -43,6 +44,22 @@ for t in tokens:
43
  print(t["token"], t["token_type"], t["morph_pos"])
44
  ```
45
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
46
  Output:
47
  ```
48
  <uppercase_word> ROOT 0
 
33
 
34
  ## Quick Start
35
 
36
+ **Direct usage:**
37
  ```python
38
  from turk_tokenizer import TurkTokenizer
39
 
 
44
  print(t["token"], t["token_type"], t["morph_pos"])
45
  ```
46
 
47
+ **HuggingFace AutoTokenizer:**
48
+ ```python
49
+ from transformers import AutoTokenizer
50
+
51
+ tok = AutoTokenizer.from_pretrained("Ethosoft/turk-tokenizer", trust_remote_code=True)
52
+ out = tok("İstanbul'da meeting'e katılamadım")
53
+
54
+ out["input_ids"] # hash-stable int IDs
55
+ out["attention_mask"] # [1, 1, 1, ...]
56
+ out["token_type_ids"] # 0=root, 1=suffix, 2=bpe, 3=punct, 4=num, 5=url/social
57
+ out["morphological_tokens"] # full morphological dicts
58
+
59
+ # Batch:
60
+ out = tok(["Türkçe metin.", "Another sentence."])
61
+ ```
62
+
63
  Output:
64
  ```
65
  <uppercase_word> ROOT 0