Fast Cantonese Tokenizer
This Fast Cantonese Tokenizer has a vocabulary of 161,279 tokens and was fine-tuned on Cantonese sentences. It tokenizes Cantonese text into multi-character tokens rather than splitting every character individually.
Usage
from transformers import BertTokenizerFast
tokenizer_fast = BertTokenizerFast.from_pretrained("raptorkwok/cantonese_tokenizer_fast")
print(tokenizer_fast.tokenize("我哋去咗尖沙咀"))
The output is:
['我哋', '去', '咗', '尖沙咀']
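The multi-character behaviour shown above comes from greedy longest-match (MaxMatch) segmentation, the scheme WordPiece-style tokenizers use: at each position, the longest string present in the vocabulary is taken as one token. A minimal sketch of the idea, using a tiny hypothetical vocabulary rather than the tokenizer's real 161,279-token one:

```python
# Toy illustration of greedy longest-match (MaxMatch) segmentation.
# VOCAB is a hypothetical mini-vocabulary, not the real tokenizer's.
VOCAB = {"我哋", "去", "咗", "尖沙咀", "尖", "沙", "咀", "我"}

def greedy_tokenize(text, vocab, unk="[UNK]"):
    tokens, i = [], 0
    while i < len(text):
        # Try the longest possible match first, then shrink the window.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            tokens.append(unk)  # no vocabulary entry covers this character
            i += 1
    return tokens

print(greedy_tokenize("我哋去咗尖沙咀", VOCAB))
# → ['我哋', '去', '咗', '尖沙咀']
```

Because "尖沙咀" is in the vocabulary, the three characters stay together as one token; with only the single characters present, the same span would fall back to "尖", "沙", "咀".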