Tokenizer and Model Config Mismatch

#10

by keremturgutlu - opened May 30, 2023

May 30, 2023

The config and the tokenizer have different special token IDs, which can be a problem for fine-tuning.

from transformers import AutoConfig, AutoTokenizer

pretrained_config = AutoConfig.from_pretrained("tiiuae/falcon-7b", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")

(pretrained_config.eos_token_id, tokenizer.eos_token_id, 
pretrained_config.bos_token_id, tokenizer.bos_token_id)
>>
(2, 11, 1, None)
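The comparison above can be wrapped in a small helper that reports every special token ID on which the two objects disagree. This is only a sketch: `find_special_token_mismatches` is a hypothetical name, not part of `transformers`, and the `SimpleNamespace` objects are stand-ins carrying the values reported above so the example runs without downloading the model. It relies only on attribute access, so it should work the same way on the objects returned by `AutoConfig.from_pretrained` and `AutoTokenizer.from_pretrained`.

```python
from types import SimpleNamespace


def find_special_token_mismatches(config, tokenizer,
                                  attrs=("bos_token_id", "eos_token_id")):
    """Return {attr: (config_value, tokenizer_value)} for each attribute that differs.

    Hypothetical helper, not a transformers API. A missing attribute is
    treated the same as None.
    """
    mismatches = {}
    for attr in attrs:
        config_value = getattr(config, attr, None)
        tokenizer_value = getattr(tokenizer, attr, None)
        if config_value != tokenizer_value:
            mismatches[attr] = (config_value, tokenizer_value)
    return mismatches


# Stand-ins with the values reported in this post (not loaded from the Hub).
reported_config = SimpleNamespace(eos_token_id=2, bos_token_id=1)
reported_tokenizer = SimpleNamespace(eos_token_id=11, bos_token_id=None)

mismatches = find_special_token_mismatches(reported_config, reported_tokenizer)
print(mismatches)
# {'bos_token_id': (1, None), 'eos_token_id': (2, 11)}
```

An empty result means the two agree on the checked IDs, so a check like this can be run once before starting a fine-tuning job.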

lucasjin

May 30, 2023

Yes, this is really ridiculous.

dimaischenko

May 30, 2023

I agree, and I actually don't understand which of these values we are supposed to use.

lucasjin

May 30, 2023

@tiiuae Please avoid uploading a model with the wrong tokenizer. This will mislead a lot of people.

FalconLLM changed discussion status to closed May 30, 2023

lucasjin

May 31, 2023

@FalconLLM Please fix the issue, or at least post an explanation. Otherwise your behaviour might be against the Hugging Face community rules.
Users might get confused by the uploaded model, and that is not good for you either.

dimaischenko

May 31, 2023

@lucasjin They fixed config.json:

  "bos_token_id": 11,
  "eos_token_id": 11,
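The fixed config.json now uses the tokenizer's ID (11). For a config that still disagrees with its tokenizer, one possible approach before fine-tuning is to copy the tokenizer's IDs onto the config. This is a sketch under the assumption that the tokenizer is the source of truth, since it decides which IDs actually appear in the training data; whether that holds for a given model is for the model authors to confirm. `align_config_to_tokenizer` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins with the pre-fix values from this thread.

```python
from types import SimpleNamespace


def align_config_to_tokenizer(config, tokenizer,
                              attrs=("bos_token_id", "eos_token_id")):
    """Copy each special token ID the tokenizer defines onto the config.

    Hypothetical helper, not a transformers API. IDs the tokenizer leaves
    as None are skipped, so the config keeps its own value for those.
    Returns the names of the attributes that were changed.
    """
    changed = []
    for attr in attrs:
        tokenizer_value = getattr(tokenizer, attr, None)
        if tokenizer_value is None:
            continue  # the tokenizer has no opinion on this ID
        if getattr(config, attr, None) != tokenizer_value:
            setattr(config, attr, tokenizer_value)
            changed.append(attr)
    return changed


# Stand-ins with the original (pre-fix) values reported in this thread.
old_config = SimpleNamespace(bos_token_id=1, eos_token_id=2)
old_tokenizer = SimpleNamespace(bos_token_id=None, eos_token_id=11)

changed = align_config_to_tokenizer(old_config, old_tokenizer)
print(changed, old_config.bos_token_id, old_config.eos_token_id)
# ['eos_token_id'] 1 11
```

Note that this leaves `bos_token_id` untouched when the tokenizer reports `None`, so it does not reproduce the `"bos_token_id": 11` choice in the fixed config.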

lucasjin

May 31, 2023

@dimaischenko OK, but this still confuses me. Why is bos 11? Very strange.

lucasjin

May 31, 2023

And bos is the same as eos... Very strange.
