`CpmTokenizer` is different from the original CPM-1 tokenizer in GitHub
#1 by ShaneTian · opened
`transformers.CpmTokenizer` is based on `transformers.XLNetTokenizer`, but the original CPM-1 tokenizer is not.
I found the following while fine-tuning:
- The original tokenizer always adds an `eod_token = <eod>` at the end of the sentence, see here.
- `transformers.CpmTokenizer` always adds `sep_token = <sep>` and `cls_token = <cls>` at the end of the sentence, see here.
I am confused.
In LM fine-tuning, how should I prepare the input data?
- `[token_id_1, token_id_2, ..., eod_token_id]`, where `eod_token_id` is the id of the `<eod>` token in `transformers.CpmTokenizer`
- `[token_id_1, token_id_2, ..., eos_token_id]`, where `eos_token_id` is the id of the `</s>` token in `transformers.CpmTokenizer`
- `[token_id_1, token_id_2, ..., eos_token_id]`, where `eos_token_id` is the id of the `<|endoftext|>` token in `transformers.GPT2Tokenizer`
- `[token_id_1, token_id_2, ..., sep_token_id, cls_token_id]`, i.e. just call `CpmTokenizer` with `add_special_tokens=True`
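The difference between the two conventions can be sketched in plain Python (the token ids below are made up for illustration; the real ids come from the tokenizer's vocab):

```python
# Hypothetical special-token ids, for illustration only.
EOD_ID, SEP_ID, CLS_ID = 7, 8, 9

def cpm_original_inputs(token_ids):
    # Original CPM-1 convention: append <eod> at the end of the text.
    return token_ids + [EOD_ID]

def cpm_tokenizer_inputs(token_ids):
    # transformers.CpmTokenizer (XLNet-style): append <sep> then <cls>.
    return token_ids + [SEP_ID, CLS_ID]

ids = [10, 11, 12]
print(cpm_original_inputs(ids))   # → [10, 11, 12, 7]
print(cpm_tokenizer_inputs(ids))  # → [10, 11, 12, 8, 9]
```

So the question is which of these endings the LM fine-tuning data should use.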
Wow, so sorry for the very late reply! You are right, we should probably correct the `build_inputs_with_special_tokens` function, which is used when you set `add_special_tokens=True` (to format the inputs).
You can also change the template processor if you are using a fast tokenizer.
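For a fast tokenizer, that means swapping the post-processor. A minimal sketch with the `tokenizers` library, using a toy word-level vocab (the vocab and ids here are invented for illustration) to append `<eod>` instead of `<sep>`/`<cls>`:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

# Toy vocab for illustration; a real CPM tokenizer has its own vocab.
vocab = {"<unk>": 0, "<eod>": 1, "hello": 2, "world": 3}
tok = Tokenizer(WordLevel(vocab, unk_token="<unk>"))
tok.pre_tokenizer = Whitespace()

# Replace the template so a single sequence ends with <eod>,
# instead of the XLNet-style "<sep> <cls>" ending.
tok.post_processor = TemplateProcessing(
    single="$A <eod>",
    special_tokens=[("<eod>", 1)],
)

enc = tok.encode("hello world")
print(enc.tokens)  # → ['hello', 'world', '<eod>']
print(enc.ids)     # → [2, 3, 1]
```

On a `transformers` fast tokenizer, the equivalent change would go on the backing `tokenizers.Tokenizer` object.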
Thanks
ShaneTian changed discussion status to closed