This model has not been trained on any Cantonese material.
|
|
It is simply a base model whose tokenizer and embeddings were patched to cover Cantonese characters. The original model is [gpt2-tiny-chinese](https://huggingface.co/ckiplab/gpt2-tiny-chinese).
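
Loading the patched model for generation should look something like the following. The model id below is a placeholder, not this repo's actual id, and since the new characters' embeddings are untrained, output involving them will not be meaningful until the model is fine-tuned:

```python
from transformers import pipeline

# "<this-repo-id>" is a placeholder for this model card's actual id
generator = pipeline("text-generation", model="<this-repo-id>")
print(generator("香港", max_new_tokens=20))
```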
|
|
I used this [repo](https://github.com/ayaka14732/bert-tokenizer-cantonese) to identify the Cantonese characters missing from the tokenizer (the core idea is sketched below).
|
|
[My forked and modified version](https://github.com/jedcheng/bert-tokenizer-cantonese)
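
Conceptually, the identification step boils down to scanning Cantonese text for characters the original tokenizer can only map to `[UNK]`. A minimal sketch (not the repo's actual code; the sample sentence is illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ckiplab/gpt2-tiny-chinese")

# Any Cantonese corpus works here; this sentence is only an illustration
corpus = "佢哋琴日喺屋企睇咗套戲"

# Characters the tokenizer cannot represent map to the [UNK] id
missing = sorted(
    ch for ch in set(corpus)
    if tokenizer.convert_tokens_to_ids(ch) == tokenizer.unk_token_id
)
print(missing)
```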
|
|
Once the missing characters have been identified, the Hugging Face library provides a very high-level API for modifying the tokenizer and the embeddings.
|
|
| ``` |
| Download a tokenizer and a model from the Huggingface library. Then: |
| |
| tokenizer.add_tokens("your new tokens") |
| model.resize_token_embeddings(len(tokenizer)) |
| |
| tokenizer.push_to_hub("your model name") |
| ``` |