---
license: apache-2.0
datasets:
- raptorkwok/cantonese-traditional-chinese-parallel-corpus-gen3
language:
- zh
base_model:
- OpenMOSS-Team/bart-base-chinese
pipeline_tag: translation
---
# Cantonese Tokenizer

This is a Cantonese sentence tokenizer based on BART Chinese. It can be used together with our CCPC parallel corpus dataset.

- Token count: 161,278 (including 110,007 multi-character Chinese vocabulary entries)
## Usage

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("raptorkwok/pyctokenizer")
print(tokenizer.tokenize("我哋去咗小沙灘睇靚景喇！"))
# Output: ['我哋', '去', '咗', '小沙灘', '睇靚景', '喇', '！']
```
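The multi-character vocabulary entries are what let whole Cantonese words such as 我哋 and 小沙灘 come out as single tokens. `BertTokenizer` segments text with a greedy longest-match-first (WordPiece-style) strategy; the sketch below illustrates that idea on the example sentence above, using a hypothetical toy vocabulary rather than the tokenizer's real 161,278-entry one, and ignoring details like `##` continuation prefixes and unknown-token handling.

```python
# Toy multi-character vocabulary (hypothetical; not the real pyctokenizer vocab).
TOY_VOCAB = {"我哋", "小沙灘", "睇靚景"}

def greedy_tokenize(text, vocab, max_len=4):
    """Greedy longest-match-first segmentation:
    at each position, emit the longest vocabulary entry that matches,
    falling back to a single character when nothing matches."""
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:
                tokens.append(piece)
                i += length
                break
    return tokens

print(greedy_tokenize("我哋去咗小沙灘睇靚景喇！", TOY_VOCAB))
# → ['我哋', '去', '咗', '小沙灘', '睇靚景', '喇', '！']
```

Because matching is longest-first, 小沙灘 is emitted as one token instead of three single characters, mirroring the output shown above.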