Tokenizer fix
#1
by justinbarton - opened
Presently, loading the tokenizer via:
from transformers import T5Tokenizer
tokeniser = T5Tokenizer.from_pretrained("Exscientia/IgT5", do_lower_case=False)
yields the following error:
ValueError: Non-consecutive added token '<extra_id_99>' found. Should have index 128 but has index 28 in saved vocabulary.
This PR should resolve the issue.
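For context, the error comes from a sanity check on the saved vocabulary: added tokens (such as the `<extra_id_*>` sentinels) are expected to occupy consecutive indices immediately after the base vocabulary, and older transformers versions raised exactly this ValueError when an index was out of place. A minimal sketch of that check (the `check_added_tokens` helper below is hypothetical, not the actual transformers implementation):

```python
def check_added_tokens(added_tokens, vocab_size):
    """Verify added tokens occupy consecutive indices starting at vocab_size.

    added_tokens: mapping of token string -> index in the saved vocabulary.
    Raises ValueError in the same style as the tokenizer loader above.
    """
    for expected, (token, index) in enumerate(
            sorted(added_tokens.items(), key=lambda kv: kv[1]),
            start=vocab_size):
        if index != expected:
            raise ValueError(
                f"Non-consecutive added token '{token}' found. "
                f"Should have index {expected} but has index {index} "
                "in saved vocabulary.")

# Consecutive indices pass the check silently.
check_added_tokens({"<extra_id_0>": 128, "<extra_id_1>": 129}, vocab_size=128)

# A token saved at the wrong index reproduces the error above.
try:
    check_added_tokens({"<extra_id_99>": 28}, vocab_size=128)
except ValueError as e:
    print(e)
```

In other words, the fix is simply to save the added-token entries with indices that line up with the base vocabulary size, which is what this PR does.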
justinbarton changed pull request status to open
Hi @justinbarton , thank you for your interest in our work! What version of transformers are you using? I tried this line in a Colab notebook with both the transformers version we developed in (4.35.2) and the latest version (4.39.3), and both loaded the tokeniser without any errors.
How odd. I was using 4.30.2.
exs-fdreyer changed pull request status to closed