---
license: apache-2.0
datasets:
- raptorkwok/cantonese-traditional-chinese-parallel-corpus-gen3
language:
- zh
base_model:
- OpenMOSS-Team/bart-base-chinese
pipeline_tag: translation
---
# Cantonese Tokenizer
This is a Cantonese sentence tokenizer based on [BART Chinese](https://huggingface.co/OpenMOSS-Team/bart-base-chinese/).
It can be used along with our [CCPC Parallel Corpus dataset](https://huggingface.co/datasets/raptorkwok/cantonese-traditional-chinese-parallel-corpus-gen3).
- Token count: 161,278 (including 110,007 multi-character Chinese vocabulary entries)
---
## Usage
```python
from transformers import BertTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = BertTokenizer.from_pretrained("raptorkwok/pyctokenizer")

# Cantonese input: "We went to Tsim Sha Tsui to see the doctor!"
print(tokenizer.tokenize("我哋去咗尖沙咀睇醫生呀!"))
# Output: ['我哋', '去', '咗', '尖沙咀', '睇醫生', '呀', '!']
```
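The token count above distinguishes multi-character Chinese entries from the rest of the vocabulary. As a rough sketch of how such entries could be picked out of a token list, the helper below keeps tokens that have at least two characters, all of them CJK ideographs. The function names `is_cjk` and `multi_char_chinese` and the Unicode ranges are illustrative assumptions; the model card does not state how its figure was counted.

```python
def is_cjk(ch: str) -> bool:
    """Return True if ch falls in one of the assumed CJK ideograph ranges."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF         # CJK Unified Ideographs
        or 0x3400 <= cp <= 0x4DBF      # Extension A
        or 0x20000 <= cp <= 0x2A6DF    # Extension B
    )


def multi_char_chinese(tokens):
    """Keep tokens with two or more characters, all of them CJK ideographs."""
    return [t for t in tokens if len(t) > 1 and all(is_cjk(c) for c in t)]


# Token list copied from the usage example output.
tokens = ["我哋", "去", "咗", "尖沙咀", "睇醫生", "呀", "!"]
words = multi_char_chinese(tokens)
print(words)                    # ['我哋', '尖沙咀', '睇醫生']
print(len(words), len(tokens))  # 3 7
```

The same helper could be applied to the full vocabulary with `multi_char_chinese(tokenizer.get_vocab().keys())`. Because the counting rule behind the 110,007 figure is not documented here, the result may not match it exactly.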