Bug with Tokenizer #9
opened by dmakhervaks
Hi, I think there may be a bug with your tokenizer.
Some strings containing repeated characters do not round-trip correctly: decoding the encoded input IDs does not reproduce the original string.
Here is an example:
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("google/medasr")

input_string = "pevesca plus is a combination medication used for treating certain types of pain and inflammation"
encoded = processor.tokenizer(input_string)
decoded_string = processor.decode(encoded['input_ids'], skip_special_tokens=True)

# This assertion fails: the decoded string does not match the input
assert input_string == decoded_string
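To narrow down which part of an input triggers the mismatch, a small round-trip harness can scan prefixes of the string until decoding first diverges. The sketch below is generic (it takes any `encode`/`decode` pair); the `toy_encode`/`toy_decode` functions are a hypothetical stand-in, not the medasr tokenizer, that collapses repeated adjacent characters to mimic the kind of failure described above.

```python
def find_failing_prefix(text, encode, decode):
    """Return the shortest prefix of `text` whose round trip fails, or None."""
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if decode(encode(prefix)) != prefix:
            return prefix
    return None

# Hypothetical stand-in for a buggy tokenizer: encoding maps each character
# to its code point, but decoding collapses repeated adjacent characters,
# so any string with a doubled letter fails to round-trip.
def toy_encode(s):
    return [ord(c) for c in s]

def toy_decode(ids):
    out = []
    for i in ids:
        c = chr(i)
        if not out or out[-1] != c:
            out.append(c)
    return "".join(out)

print(find_failing_prefix("inflammation", toy_encode, toy_decode))  # -> inflamm
```

Running the same harness with the real `processor.tokenizer` as `encode` and `processor.decode` as `decode` would pinpoint exactly where the medasr round trip breaks.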