Adds the tokenizer configuration file

#25

by lysandre HF Staff - opened Feb 19, 2024

base: refs/heads/main ← from: refs/pr/25


This PR is in draft mode

lysandre

T5 community org Feb 19, 2024

The tokenizer configuration file is missing or incorrect, which has led to unforeseen errors since the migration of the canonical models.

Refer to the following issue for more information: transformers#29050

The current failing code is the following:

>>> from transformers import AutoTokenizer
>>> previous_tokenizer = AutoTokenizer.from_pretrained("t5-base")
>>> current_tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-base")
>>> print(previous_tokenizer.model_max_length, current_tokenizer.model_max_length)
1000000000000000019884624838656, 512
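The very large number above equals `int(1e30)`, which to my knowledge is the fallback that `transformers` assigns when no maximum length is configured for a tokenizer. A minimal sketch for detecting that situation follows; the names `UNSET_MAX_LENGTH` and `has_explicit_max_length` are hypothetical helpers, not part of the library.

```python
# Assumption: transformers falls back to int(1e30) when a tokenizer has no
# configured maximum length. The names below are illustrative only.
UNSET_MAX_LENGTH = int(1e30)


def has_explicit_max_length(model_max_length: int) -> bool:
    """Return True when the value looks like a real limit, not the fallback."""
    return model_max_length < UNSET_MAX_LENGTH


if __name__ == "__main__":
    # The two values reported in the failing output above.
    for value in (1000000000000000019884624838656, 512):
        print(value, has_explicit_max_length(value))
```

A check like this could be run on `tokenizer.model_max_length` to flag repositories whose tokenizer configuration does not set a limit.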

This is the result after the fix:

>>> from transformers import AutoTokenizer
>>> previous_tokenizer = AutoTokenizer.from_pretrained("t5-base")
>>> current_tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-base")
>>> print(previous_tokenizer.model_max_length, current_tokenizer.model_max_length)
512, 512
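The contents of the added file are not shown in this discussion. As a rough illustration only, the sketch below writes a hypothetical minimal `tokenizer_config.json` containing a `model_max_length` entry and reads it back. The value 512 is taken from the output above; the helper names and the single-key layout are assumptions, not the actual file in this PR.

```python
import json
import tempfile
from pathlib import Path


def write_tokenizer_config(directory, model_max_length=512):
    """Write a hypothetical minimal tokenizer_config.json and return its path."""
    config = {"model_max_length": model_max_length}  # assumed single-key layout
    path = Path(directory) / "tokenizer_config.json"
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return path


def read_model_max_length(path):
    """Return the model_max_length entry, or None when the key is absent."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    return config.get("model_max_length")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        config_path = write_tokenizer_config(tmp)
        print(read_model_max_length(config_path))
```

The real file most likely carries additional tokenizer settings; only the `model_max_length` entry is relevant to the discrepancy described here.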

Adds tokenizer_config.json file (commit 4efb15d5)
