raptorkwok/cantonese-chinese-parallel-corpus
This fast Cantonese tokenizer has a vocabulary of 161,279 tokens and is fine-tuned on Cantonese sentences. It splits a Cantonese sentence into multi-character words from that vocabulary instead of into single characters.
from transformers import BertTokenizerFast
tokenizer_fast = BertTokenizerFast.from_pretrained("raptorkwok/cantonese_tokenizer_fast")
print(tokenizer_fast.tokenize("我哋去咗尖沙咀睇醫生呀！"))
The output is:
['我哋', '去', '咗', '尖沙咀', '睇醫生', '呀', '！']
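To illustrate what segmenting into multi-character vocabulary entries means, here is a minimal sketch of greedy longest-match-first segmentation. It is not this tokenizer's actual implementation: the function `longest_match_tokenize`, the three-word `toy_vocab`, and the sample sentence are all assumptions made for illustration, and the real vocabulary has 161,279 entries. Characters that match no vocabulary entry are emitted on their own.

```python
def longest_match_tokenize(text, vocab, max_len=None):
    """Greedy longest-match-first segmentation (illustrative sketch only)."""
    if max_len is None:
        # Longest entry in the vocabulary bounds the candidate window.
        max_len = max((len(word) for word in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy vocabulary, far smaller than the real one.
toy_vocab = {"我哋", "尖沙咀", "睇醫生"}
tokens = longest_match_tokenize("我哋去咗尖沙咀睇醫生", toy_vocab)
print(tokens)
```

With a larger vocabulary, more of the sentence is covered by whole words and fewer characters fall back to single-character tokens.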