metadata
license: apache-2.0
datasets:
  - raptorkwok/cantonese-traditional-chinese-parallel-corpus-gen3
language:
  - zh
base_model:
  - OpenMOSS-Team/bart-base-chinese
pipeline_tag: translation

Cantonese Tokenizer

This is a tokenizer for Cantonese sentences, based on BART Chinese (OpenMOSS-Team/bart-base-chinese). It can be used together with our CCPC Parallel Corpus dataset.

  • Token count: 161,278 (including 110,007 multi-character Chinese vocabulary entries)

Usage

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("raptorkwok/pyctokenizer")
print(tokenizer.tokenize("我哋去咗尖沙咀睇醫生呀!"))
# Output: ['我哋', '去', '咗', '尖沙咀', '睇醫生', '呀', '!']
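
One way to picture how multi-character vocabulary entries can yield word-level tokens is greedy longest-match segmentation. The sketch below is a toy illustration only: `longest_match_tokenize` and `toy_vocab` are hypothetical names, the three-entry vocabulary is made up for the example, and this is not claimed to be how the tokenizer is actually implemented.

```python
def longest_match_tokenize(text, vocab, max_len=4):
    """Greedy forward longest-match segmentation over a known vocabulary."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a vocabulary hit.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as the fallback.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy vocabulary (assumption); the real vocabulary is far larger.
toy_vocab = {"我哋", "尖沙咀", "睇醫生"}

tokens = longest_match_tokenize("我哋去咗尖沙咀睇醫生呀", toy_vocab)
print(tokens)
# ['我哋', '去', '咗', '尖沙咀', '睇醫生', '呀']
```

Characters without a multi-character match fall back to single-character tokens, which is why a larger multi-character vocabulary tends to produce coarser, more word-like segments.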