Create README.md
README.md
ADDED
@@ -0,0 +1,24 @@
---
license: apache-2.0
datasets:
- raptorkwok/cantonese_sentences
- raptorkwok/cantonese-traditional-chinese-parallel-corpus-gen3
language:
- zh
---

# Fast Cantonese Tokenizer

This fast Cantonese tokenizer has a vocabulary of 161,279 tokens and was fine-tuned on Cantonese sentences. It segments Cantonese sentences into multi-character tokens.

## Usage

```
from transformers import BertTokenizerFast
tokenizer_fast = BertTokenizerFast.from_pretrained("raptorkwok/cantonese_tokenizer_fast")
print(tokenizer_fast.tokenize("我哋去咗尖沙咀睇醫生呀！"))
```
The output is:
```
['我哋', '去', '咗', '尖沙咀', '睇醫生', '呀', '！']
```
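To build intuition for how a vocabulary with multi-character entries produces this kind of segmentation, here is a minimal sketch of greedy longest-match tokenization. It is an illustration only and not necessarily how this tokenizer works internally. The function name `longest_match_tokenize`, the `[UNK]` marker, and the toy vocabulary are all hypothetical.

```python
def longest_match_tokenize(text, vocab, unk="[UNK]"):
    """Greedy longest-match segmentation against a fixed vocabulary."""
    max_len = max(len(entry) for entry in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            # No vocabulary entry starts here: emit the unknown marker.
            tokens.append(unk)
            i += 1
    return tokens


# Hypothetical toy vocabulary, far smaller than the real 161,279-entry one.
toy_vocab = {"我哋", "我", "去", "尖沙咀", "尖", "沙", "咀"}
print(longest_match_tokenize("我哋去尖沙咀", toy_vocab))
```

Because longer entries are tried first, a place name such as 尖沙咀 comes out as a single token even though each of its characters is also in the toy vocabulary.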