V4.22.0 model update

#1
by lewtun HF Staff - opened
Hugging Face Internal Testing Organization org
edited Sep 8, 2022

This PR:

  • updates the CLIP model to be compatible with transformers v4.22. The previous version throws an error when loading the tokenizer (it requires from_slow=True)
  • sets the vocab size to the default value associated with the checkpoint this model was derived from (https://huggingface.co/openai/clip-vit-base-patch32/blob/main/config.json#L79). With this change, the model can actually run inference without hitting index-out-of-range errors
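The index-out-of-range failure can be illustrated without transformers at all: an embedding lookup is just a row index into a (vocab_size, hidden_size) matrix, so any token ID at or above vocab_size fails the bounds check. A minimal sketch (the `embed` helper and its sizes are illustrative, not the actual testing-repo code):

```python
# Toy embedding lookup mirroring nn.Embedding's bounds behaviour.
def embed(token_ids, vocab_size, hidden_size=4):
    """Look up each token ID in a (vocab_size, hidden_size) matrix;
    raises IndexError for any ID >= vocab_size."""
    matrix = [[0.0] * hidden_size for _ in range(vocab_size)]
    return [matrix[i] for i in token_ids]

# A tokenizer with 1000 tokens can emit ID 999; a model built with the
# tester config (vocab_size=99) cannot embed it.
embed([5, 42], vocab_size=99)        # fine
try:
    embed([999], vocab_size=99)
except IndexError:
    print("index out of range")       # prints "index out of range"
```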

cc @ydshieh

Hugging Face Internal Testing Organization org

Hi @lewtun I understand that the fix avoids the index-out-of-range error, but it also makes the model not so tiny, since the embedding matrix becomes somewhat larger.
The issue comes from the fact that the tokenizer created here has 1000 tokens, while the tiny model was created with the model tester config (where the vocab size is 99).

I believe you can change 49408 to 1000, and that will fix the index error. Let me know if you still encounter issues.
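The suggested change amounts to editing vocab_size in the repo's config.json. A hedged sketch, assuming a CLIP-style layout where vocab_size lives under a text_config key (the path and key names follow openai/clip-vit-base-patch32's config and may differ for other models):

```python
import json

def set_vocab_size(config_path, new_size):
    """Rewrite vocab_size in a CLIP-style config.json.
    The "text_config" nesting is an assumption based on the layout of
    openai/clip-vit-base-patch32; adjust the key path for other models."""
    with open(config_path) as f:
        cfg = json.load(f)
    cfg["text_config"]["vocab_size"] = new_size  # e.g. 49408 -> 1000
    with open(config_path, "w") as f:
        json.dump(cfg, f, indent=2)
```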

The tiny model creation task needs to be improved - I am working on it.

lewtun changed pull request title from V4.22.0 update to V4.22.0 model update
Hugging Face Internal Testing Organization org

As discussed offline, resizing the vocab size in the model config isn't enough - the tokenizer length must also match to ensure the correct input IDs are sent to the model.

One alternative is to:

  • Train a new tokenizer from scratch on a tiny corpus of vocab size ~100 tokens
  • Use that new vocab size in the model
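The alternative above can be sketched with the tokenizers library: train a small BPE tokenizer on a toy corpus with a ~100-token budget, then reuse its actual vocab size in the model config. The corpus, trainer settings, and special tokens here are illustrative assumptions, not the testing org's actual setup:

```python
# Hedged sketch: train a tiny BPE tokenizer from scratch so that the
# tokenizer length and the model's vocab_size agree by construction.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(vocab_size=100, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(
    ["a tiny corpus of text", "for a tiny tokenizer"],  # toy corpus
    trainer,
)

# Use the *trained* size (it can come out below the 100 budget on a
# small corpus) as vocab_size in the tiny model's config.
vocab_size = tokenizer.get_vocab_size()
```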

In the interest of being pragmatic, we will address the resizing issue in separate PRs so this one can focus on speeding up the ONNX test suite (which is the motivation for this PR).

Hugging Face Internal Testing Organization org

OK, thanks!

ydshieh changed pull request status to merged
