
Bug with Tokenizer

#9
by dmakhervaks - opened

Hi, I think there may be a bug in your tokenizer.

Some strings containing repeated characters do not round-trip correctly: encoding and then decoding them does not reproduce the original string.

Here is an example:

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("google/medasr")

input_string = "pevesca plus is a combination medication used for treating certain types of pain and inflammation"
encoded = processor.tokenizer(input_string)
decoded_string = processor.decode(encoded["input_ids"], skip_special_tokens=True)

# Fails: the decoded string no longer matches the input
assert input_string == decoded_string
```
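For what it's worth, this looks like the behavior of a CTC-style decoder that merges consecutive identical token ids before mapping them back to text. The sketch below is a hypothetical, character-level illustration of that mechanism (the `ctc_collapse`, `encode`, and `decode` helpers are my own, not part of the medasr tokenizer) showing how a doubled letter such as the "mm" in "inflammation" gets lost on decode:

```python
import itertools

# Toy character-level vocabulary for illustration only
vocab = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz ")}
inv_vocab = {i: c for c, i in vocab.items()}

def encode(text):
    # One token id per character
    return [vocab[c] for c in text]

def ctc_collapse(ids):
    # CTC-style grouping: consecutive identical ids are merged into one
    return [token_id for token_id, _ in itertools.groupby(ids)]

def decode(ids):
    # Decoding with repeat-grouping, as a CTC tokenizer would do
    return "".join(inv_vocab[i] for i in ctc_collapse(ids))

print(decode(encode("inflammation")))  # "inflamation" - the doubled 'm' collapses
```

If the tokenizer applies this kind of grouping on decode without inserting a blank/separator token between repeated characters on encode, any string with doubled letters will fail the round-trip assertion above.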
