
Fast Cantonese Tokenizer

This Fast Cantonese Tokenizer has a vocabulary of 161,279 tokens and is fine-tuned on Cantonese sentences. It segments Cantonese text into multi-character vocabulary items rather than single characters.

Usage

from transformers import BertTokenizerFast
tokenizer_fast = BertTokenizerFast.from_pretrained("raptorkwok/cantonese_tokenizer_fast")
print(tokenizer_fast.tokenize("ζˆ‘ε“‹εŽ»ε’—ε°–ζ²™ε’€η‡ι†«η”Ÿε‘€οΌ"))

The output is:

['ζˆ‘ε“‹', '去', 'ε’—', 'ε°–ζ²™ε’€', 'η‡ι†«η”Ÿ', 'ε‘€', '!']