Fast Cantonese Tokenizer

This Fast Cantonese Tokenizer has a vocabulary of 161,279 tokens and is fine-tuned on Cantonese sentences. It segments Cantonese sentences into words, including multi-character words.

Usage

from transformers import BertTokenizerFast

# Load the tokenizer from the Hugging Face Hub
tokenizer_fast = BertTokenizerFast.from_pretrained("raptorkwok/cantonese_tokenizer_fast")

# Segment a Cantonese sentence into tokens
print(tokenizer_fast.tokenize("我哋去咗尖沙咀睇醫生呀！"))

The output is:

['我哋', '去', '咗', '尖沙咀', '睇醫生', '呀', '！']
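To illustrate how a multi-character vocabulary can produce a segmentation like the one above, here is a minimal sketch of greedy longest-match-first lookup, the matching strategy that WordPiece (on which BertTokenizerFast is based) applies. This is not the tokenizer's actual implementation: the function greedy_segment, the max_len parameter, and the four-entry toy_vocab are hypothetical, and the real tokenizer's normalization, pre-tokenization, "##" continuation prefixes, and unknown-token handling are left out.

```python
def greedy_segment(text, vocab, max_len=4):
    """Split text by repeatedly taking the longest vocabulary entry
    that starts at the current position (hypothetical helper)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for j in range(min(len(text), i + max_len), i, -1):
            piece = text[i:j]
            # A single character is always accepted as a fallback,
            # so the loop is guaranteed to advance.
            if j - i == 1 or piece in vocab:
                tokens.append(piece)
                i = j
                break
    return tokens


# Toy vocabulary for illustration only; the real tokenizer has 161,279 entries.
toy_vocab = {"我哋", "尖沙咀", "睇醫生", "醫生"}

print(greedy_segment("我哋去咗尖沙咀睇醫生呀！", toy_vocab))
# ['我哋', '去', '咗', '尖沙咀', '睇醫生', '呀', '！']
```

With this toy vocabulary, "睇醫生" is preferred over the shorter entry "醫生" because the longer match is tried first.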
