Chinese
raptorkwok committed
Commit e5f7d58 · verified · 1 Parent(s): 517091d

Create README.md

Files changed (1)
  1. README.md +24 -0
README.md ADDED
@@ -0,0 +1,24 @@
+ ---
+ license: apache-2.0
+ datasets:
+ - raptorkwok/cantonese_sentences
+ - raptorkwok/cantonese-traditional-chinese-parallel-corpus-gen3
+ language:
+ - zh
+ ---
+
+ # Fast Cantonese Tokenizer
+
+ This Fast Cantonese Tokenizer has a vocabulary of 161,279 tokens and is fine-tuned on Cantonese sentences. It splits Cantonese sentences into vocabulary entries that can span multiple characters.
+
+ ## Usage
+
+ ```
+ from transformers import BertTokenizerFast
+ tokenizer_fast = BertTokenizerFast.from_pretrained("raptorkwok/cantonese_tokenizer_fast")
+ print(tokenizer_fast.tokenize("我哋去咗尖沙咀睇醫生呀!"))
+ ```
+ The output is:
+ ```
+ ['我哋', '去', '咗', '尖沙咀', '睇醫生', '呀', '!']
+ ```
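
One way to picture how a vocabulary with multi-character entries produces the segmentation in the README is a greedy longest-match lookup. The sketch below is an illustration only: the helper `longest_match_tokenize` and the three-entry `toy_vocab` are hypothetical, it needs no download, and it is not claimed to be the algorithm this tokenizer uses. Punctuation handling and text normalization are left out.

```python
def longest_match_tokenize(text, vocab, max_len=4):
    """Greedy longest-match-first segmentation over a toy vocabulary.

    Hypothetical helper for illustration; not the tokenizer's real algorithm.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy vocabulary; the real tokenizer reports 161,279 entries.
toy_vocab = {"我哋", "尖沙咀", "睇醫生"}

print(longest_match_tokenize("我哋去咗尖沙咀睇醫生呀", toy_vocab))
```

With this toy vocabulary, characters that are not part of any listed entry fall back to single-character tokens, while "我哋", "尖沙咀", and "睇醫生" stay intact.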