raptorkwok
/

cantonese_tokenizer_fast


raptorkwok committed on Jan 21

Commit

e5f7d58


1 Parent(s): 517091d

Create README.md

Files changed (1)

README.md +24 -0

README.md ADDED

	@@ -0,0 +1,24 @@

+---
+license: apache-2.0
+datasets:
+- raptorkwok/cantonese_sentences
+- raptorkwok/cantonese-traditional-chinese-parallel-corpus-gen3
+language:
+- zh
+---
+# Fast Cantonese Tokenizer
+This Fast Cantonese Tokenizer has a vocabulary of 161,279 tokens and is fine-tuned on Cantonese sentences. It segments Cantonese sentences into tokens that can span multiple characters.
+## Usage
+```python
+from transformers import BertTokenizerFast
+tokenizer_fast = BertTokenizerFast.from_pretrained("raptorkwok/cantonese_tokenizer_fast")
+print(tokenizer_fast.tokenize("我哋去咗尖沙咀睇醫生呀！"))
+```
+The output is:
+```
+['我哋', '去', '咗', '尖沙咀', '睇醫生', '呀', '！']
+```
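
To illustrate what segmenting into multi-character tokens means, here is a minimal sketch of greedy longest-match segmentation over a toy vocabulary. It is only an illustration and makes no claim about how `raptorkwok/cantonese_tokenizer_fast` works internally: the function `longest_match_tokenize`, the `max_len` parameter, and the three-entry `toy_vocab` are hypothetical and are not taken from the real tokenizer.

```python
def longest_match_tokenize(text, vocab, max_len=4):
    """Greedy longest-match-first segmentation over a toy vocabulary."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a vocabulary hit.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy vocabulary, NOT the real 161,279-token vocabulary.
toy_vocab = {"我哋", "尖沙咀", "睇醫生"}
print(longest_match_tokenize("我哋去咗尖沙咀睇醫生呀！", toy_vocab))
```

With this toy vocabulary the sketch groups `我哋`, `尖沙咀`, and `睇醫生` as single tokens and falls back to single characters elsewhere.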