added_tokens_decoder Seems to Cause Index Errors

by SeaZeeHech - opened Jan 20, 2024

Jan 20, 2024

I think <|im_start|> and <|im_end|> were added to the tokenizer after another user mentioned they weren't natively in the model vocab. But the current config throws indexing errors any time it is used, presumably because the IDs of the added tokens fall outside the model's embedding table. The code below, which increases the vocab size by 2, makes it run without error, but the loss is very high (~10 when other models are ~3 on my data), I presume because the embeddings for the new tokens are just noise. I'm not sure how to just remove the tokens from the tokenizer once it is instantiated.

It seems like these tokens shouldn't be added if the model wasn't trained with them and hasn't learned them. Am I missing something? Does it work out of the box for others?

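import torch
from transformers import AutoConfig, AutoModelForCausalLM

model_name = 'leveldevai/MarcBeagle-7B'

# Grow the vocab by 2 so the added token IDs have embedding rows.
config = AutoConfig.from_pretrained(model_name)
config.vocab_size += 2

generator = AutoModelForCausalLM.from_pretrained(
    model_name,
    config=config,
    ignore_mismatched_sizes=True,  # checkpoint weights with mismatched shapes are not loaded
    torch_dtype=torch.bfloat16,
    attn_implementation='flash_attention_2',
    trust_remote_code=True,
)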
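
For anyone hitting the same error, one way to confirm the mismatch is to compare the IDs listed under added_tokens_decoder with the model's vocab_size. Below is a minimal sketch, not taken from the repo: the helper name out_of_range_added_tokens is hypothetical, and the example IDs (32000 and 32001 against a vocab size of 32000) are assumed for illustration rather than read from this model's files.

```python
def out_of_range_added_tokens(added_tokens_decoder, vocab_size):
    """Return {token_id: content} for added tokens that have no embedding row.

    Hypothetical helper. Accepts the mapping as stored in tokenizer_config.json
    (string keys, dict entries with a "content" field) or any mapping whose
    values can be converted to the token string with str().
    """
    missing = {}
    for raw_id, entry in added_tokens_decoder.items():
        token_id = int(raw_id)  # JSON keys are strings
        if token_id >= vocab_size:  # valid embedding indices are 0..vocab_size-1
            content = entry["content"] if isinstance(entry, dict) else str(entry)
            missing[token_id] = content
    return missing


# Illustrative values only (assumed, not copied from the actual config files).
example_decoder = {
    "0": {"content": "<unk>"},
    "1": {"content": "<s>"},
    "2": {"content": "</s>"},
    "32000": {"content": "<|im_start|>"},
    "32001": {"content": "<|im_end|>"},
}

bad = out_of_range_added_tokens(example_decoder, vocab_size=32000)
print(bad)
```

If the extra tokens are actually wanted, loading the model with its original config and then calling model.resize_token_embeddings(len(tokenizer)) should keep the pretrained embedding rows and only append new ones, although the appended rows would still be untrained.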

leveldevai

Owner Jan 21, 2024

Thanks for noticing.
This file seems to come from one of the models used in the merge. I updated a few things and it appears to be working well for me. Please let me know if you see anything.
