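The extended Mistral vocabulary includes only characters in the intersection with the Ministry of Education's list of 4,808 commonly used characters.

- Removed dummy tokens
- Added `<|func_start|>` and `<|func_end|>`

```python
from transformers import AutoTokenizer

# Load the extended tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained(
    'ocisd4/mistral_tokenizer_ext',
    pad_token='',
    add_bos_token=True,
    add_eos_token=False
)

print('vocab size:', tokenizer.vocab_size)
# vocab size: 35686

# Each Chinese character is split into its own token.
print(tokenizer.tokenize('今天天氣真好!'))
# ['▁', '今', '天', '天', '氣', '真', '好', '!']

# With add_bos_token=True, the BOS id (1) is prepended.
print(tokenizer.encode('今天天氣真好!'))
# [1, 28705, 30316, 29354, 29354, 32004, 29974, 29530, 29267]

# Round trip: decoding the ids recovers the input text.
print(tokenizer.decode(tokenizer.encode('今天天氣真好!')))
# 今天天氣真好!
```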
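
The "intersection" rule above can be illustrated with a small, self-contained sketch. This is not the procedure actually used to build this tokenizer, which the card does not describe. It only shows one plausible way to filter candidate tokens against a list of common characters. The function `select_extension_tokens` and the tiny `base_vocab`, `candidates`, and `common_chars` values are hypothetical stand-ins; the real list has 4,808 characters.

```python
def select_extension_tokens(candidates, common_chars, base_vocab):
    """Keep candidates that are in the common-character list and
    not already present in the base vocabulary (order preserved)."""
    common = set(common_chars)
    existing = set(base_vocab)
    selected = []
    for tok in candidates:
        if tok in common and tok not in existing and tok not in selected:
            selected.append(tok)
    return selected


# Hypothetical stand-in data, for illustration only.
base_vocab = ['▁', '今', '天', '!']               # tokens the base model already has
candidates = ['今', '氣', '真', '好', '龘', '氣']  # characters proposed for addition
common_chars = '今天氣真好'                        # tiny stand-in for the common-character list

# The two function-call markers named in this card are appended as-is.
special_tokens = ['<|func_start|>', '<|func_end|>']

new_tokens = select_extension_tokens(candidates, common_chars, base_vocab) + special_tokens
print(new_tokens)
# ['氣', '真', '好', '<|func_start|>', '<|func_end|>']
```

Here `今` is skipped because the base vocabulary already contains it, `龘` is skipped because it is outside the common-character list, and the duplicate `氣` is added only once.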