
Bug with Tokenizer

#9
by dmakhervaks - opened

Hi, I think there may be a bug in your tokenizer.

Some strings containing repeated characters do not round-trip correctly: encoding and then decoding them does not reproduce the original string.

Here is an example:

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("google/medasr")

input_string = "pevesca plus is a combination medication used for treating certain types of pain and inflammation"
encoded = processor.tokenizer(input_string)
decoded_string = processor.decode(encoded["input_ids"], skip_special_tokens=True)

# Fails: the decoded string no longer matches the input
assert input_string == decoded_string
```
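For what it's worth, this looks like the behavior of a CTC-style decoder that merges consecutive identical token ids before mapping them back to text. The sketch below is a hypothetical, character-level illustration of that mechanism (the `ctc_collapse`, `encode`, and `decode` helpers are my own, not part of the medasr tokenizer) showing how a doubled letter such as the "mm" in "inflammation" gets lost on decode:

```python
import itertools

# Toy character-level vocabulary for illustration only
vocab = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz ")}
inv_vocab = {i: c for c, i in vocab.items()}

def encode(text):
    # One token id per character
    return [vocab[c] for c in text]

def ctc_collapse(ids):
    # CTC-style grouping: consecutive identical ids are merged into one
    return [token_id for token_id, _ in itertools.groupby(ids)]

def decode(ids):
    # Decoding with repeat-grouping, as a CTC tokenizer would do
    return "".join(inv_vocab[i] for i in ctc_collapse(ids))

print(decode(encode("inflammation")))  # "inflamation" - the doubled 'm' collapses
```

If the tokenizer applies this kind of grouping on decode without inserting a blank/separator token between repeated characters on encode, any string with doubled letters will fail the round-trip assertion above.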
