`CpmTokenizer` is different from the original CPM-1 tokenizer in GitHub
#1 by ShaneTian · opened
`transformers.CpmTokenizer` is based on `transformers.XLNetTokenizer`, but the original CPM-1 tokenizer is not.
I found the following while fine-tuning:
- The original tokenizer always adds an `eod_token = <eod>` at the end of the sentence, see here.
- `transformers.CpmTokenizer` always adds `sep_token = <sep>` and `cls_token = <cls>` at the end of the sentence, see here.
I am confused.
In LM fine-tuning, how should I prepare the input data?
- `[token_id_1, token_id_2, ..., eod_token_id]`, where `eod_token_id` is the id of the `<eod>` token in `transformers.CpmTokenizer`
- `[token_id_1, token_id_2, ..., eos_token_id]`, where `eos_token_id` is the id of the `</s>` token in `transformers.CpmTokenizer`
- `[token_id_1, token_id_2, ..., eos_token_id]`, where `eos_token_id` is the id of the `<|endoftext|>` token in `transformers.GPT2Tokenizer`
- `[token_id_1, token_id_2, ..., sep_token_id, cls_token_id]`, i.e. just call `CpmTokenizer` with `add_special_tokens=True`
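The difference between the two conventions can be sketched in plain Python (the token ids below are made up for illustration; the real ids come from the tokenizer's vocab):

```python
# Hypothetical special-token ids, for illustration only.
EOD_ID, SEP_ID, CLS_ID = 7, 8, 9

def cpm_original_inputs(token_ids):
    # Original CPM-1 convention: append <eod> at the end of the text.
    return token_ids + [EOD_ID]

def cpm_tokenizer_inputs(token_ids):
    # transformers.CpmTokenizer (XLNet-style): append <sep> then <cls>.
    return token_ids + [SEP_ID, CLS_ID]

ids = [10, 11, 12]
print(cpm_original_inputs(ids))   # → [10, 11, 12, 7]
print(cpm_tokenizer_inputs(ids))  # → [10, 11, 12, 8, 9]
```

So the question is which of these endings the LM fine-tuning data should use.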
Wow, so sorry for the very late reply! You are right, we should probably correct the `build_inputs_with_special_tokens` function, which is used when you set `add_special_tokens=True` (to format the inputs).
You can also change the template processor if you are using a fast tokenizer.
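For a fast tokenizer, that means swapping the post-processor. A minimal sketch with the `tokenizers` library, using a toy word-level vocab (the vocab and ids here are invented for illustration) to append `<eod>` instead of `<sep>`/`<cls>`:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

# Toy vocab for illustration; a real CPM tokenizer has its own vocab.
vocab = {"<unk>": 0, "<eod>": 1, "hello": 2, "world": 3}
tok = Tokenizer(WordLevel(vocab, unk_token="<unk>"))
tok.pre_tokenizer = Whitespace()

# Replace the template so a single sequence ends with <eod>,
# instead of the XLNet-style "<sep> <cls>" ending.
tok.post_processor = TemplateProcessing(
    single="$A <eod>",
    special_tokens=[("<eod>", 1)],
)

enc = tok.encode("hello world")
print(enc.tokens)  # → ['hello', 'world', '<eod>']
print(enc.ids)     # → [2, 3, 1]
```

On a `transformers` fast tokenizer, the equivalent change would go on the backing `tokenizers.Tokenizer` object.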
Thanks
ShaneTian changed discussion status to closed