---
license: apache-2.0
datasets:
- raptorkwok/cantonese-traditional-chinese-parallel-corpus-gen3
language:
- zh
base_model:
- OpenMOSS-Team/bart-base-chinese
pipeline_tag: translation
---

# Cantonese Tokenizer

This is a Cantonese sentence tokenizer based on [BART Chinese](https://huggingface.co/OpenMOSS-Team/bart-base-chinese/).
It can be used along with our [CCPC Parallel Corpus dataset](https://huggingface.co/datasets/raptorkwok/cantonese-traditional-chinese-parallel-corpus-gen3).

- Token Count: 161,278 (including 110,007 multi-character Chinese vocabulary entries)
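
The multi-character figure above can be checked against any vocabulary mapping. The following is a minimal sketch, not the author's counting script: it assumes that "multi-character Chinese" means a token of two or more characters, all inside the basic CJK Unified Ideographs block (U+4E00 to U+9FFF). The helper names `is_cjk` and `count_multichar_chinese` are hypothetical, and the toy vocabulary is made up for illustration. With the real tokenizer loaded, `count_multichar_chinese(tokenizer.get_vocab())` would apply the same count to its vocabulary, although the result may differ from the published figure if a different definition was used.

```python
from typing import Mapping


def is_cjk(ch: str) -> bool:
    """True if ch lies in the basic CJK Unified Ideographs block (assumed definition)."""
    return "\u4e00" <= ch <= "\u9fff"


def count_multichar_chinese(vocab: Mapping[str, int]) -> int:
    """Count tokens with two or more characters, all of them CJK ideographs."""
    return sum(
        1
        for token in vocab
        if len(token) >= 2 and all(is_cjk(c) for c in token)
    )


# Toy vocabulary for illustration only; not taken from the real tokenizer.
toy_vocab = {"[PAD]": 0, "我": 1, "我哋": 2, "尖沙咀": 3, "##ing": 4, "？": 5}
print(count_multichar_chinese(toy_vocab))  # 2
```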

---

## Usage

```python
from transformers import BertTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = BertTokenizer.from_pretrained("raptorkwok/pyctokenizer")

# Split a Cantonese sentence into tokens
print(tokenizer.tokenize("ζεε»εε°ζ²εηι«ηεοΌ"))
# Output: ['ζε', 'ε»', 'ε', 'ε°ζ²ε', 'ηι«η', 'ε', 'οΌ']
```
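
To pass the tokens to a model, they need to be mapped to vocabulary IDs. The sketch below assumes the tokenizer exposes the standard `transformers` methods `tokenize`, `convert_tokens_to_ids`, and `convert_ids_to_tokens`. The helper `round_trip` is a hypothetical name, and it accepts any object that provides those three methods.

```python
from typing import Any, List, Tuple


def round_trip(tokenizer: Any, text: str) -> Tuple[List[str], List[int], List[str]]:
    """Tokenize text, map the tokens to IDs, then map the IDs back to tokens."""
    tokens = tokenizer.tokenize(text)
    ids = tokenizer.convert_tokens_to_ids(tokens)
    restored = tokenizer.convert_ids_to_tokens(ids)
    return tokens, ids, restored
```

Calling `round_trip(tokenizer, text)` with the `tokenizer` loaded above returns the token strings, their IDs, and the tokens recovered from those IDs. When every token is in the vocabulary, the recovered list should match the original tokens.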