The extended Mistral vocabulary contains only the intersection with the Ministry of Education's list of 4,808 commonly used characters (教育部常用4808字).

- Removed dummy tokens
- Added `<|func_start|>` and `<|func_end|>`

```python
from transformers import AutoTokenizer

# Load the extended tokenizer; <unk> doubles as the padding token.
tokenizer = AutoTokenizer.from_pretrained(
    'ocisd4/mistral_tokenizer_ext',
    pad_token='<unk>',
    add_bos_token=True,   # prepend <s> when encoding
    add_eos_token=False   # do not append </s>
)

print('vocab size:', tokenizer.vocab_size)
# vocab size: 35686

# Sample sentence: "The weather is really nice today!"
print(tokenizer.tokenize('今天天氣真好!'))
# ['▁', '今', '天', '天', '氣', '真', '好', '!']

print(tokenizer.encode('今天天氣真好!'))
# [1, 28705, 30316, 29354, 29354, 32004, 29974, 29530, 29267]

print(tokenizer.decode(tokenizer.encode('今天天氣真好!')))
# <s> 今天天氣真好!
```
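The intersection rule described above can be illustrated with a small, self-contained sketch. It is not the script used to build this tokenizer. The function name `select_extension_tokens` and the toy sets below are hypothetical stand-ins for the real Mistral vocabulary and the 4,808-character list, and the sketch assumes the extension works at the level of single characters.

```python
def select_extension_tokens(candidates, common_chars, base_vocab):
    """Return candidates that are in common_chars and not yet in base_vocab.

    Order of first appearance is preserved and duplicates are dropped.
    All three arguments are hypothetical inputs used for illustration only.
    """
    seen = set()
    selected = []
    for tok in candidates:
        if tok in common_chars and tok not in base_vocab and tok not in seen:
            seen.add(tok)
            selected.append(tok)
    return selected


# Toy stand-ins (assumptions): a tiny "base vocabulary", a tiny
# "commonly used characters" list, and candidate tokens for the extension.
base_vocab = {"今", "天", "好"}
common_chars = {"今", "天", "氣", "真", "好"}
candidates = ["氣", "真", "天", "龘", "氣"]

new_tokens = select_extension_tokens(candidates, common_chars, base_vocab)
print(new_tokens)  # ['氣', '真']
```

In this toy run, `天` is skipped because the base vocabulary already covers it, `龘` is skipped because it is outside the common-character list, and the repeated `氣` is added only once.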