tokenizer does not lowercase input (uncased model)

#3
by nikolina-p - opened
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/MiniLM-L12-H384-uncased")

print(tokenizer.tokenize("Hello world!"))
print(tokenizer.tokenize("Beautiful world!"))
print(tokenizer.tokenize("Terrible world!"))
# ['[UNK]', 'world', '!']
# ['[UNK]', 'world', '!']
# ['[UNK]', 'world', '!']

print(tokenizer.tokenize("hello world!"))
print(tokenizer.tokenize("terrible world!"))
print(tokenizer.tokenize("beautiful world!"))
# ['hello', 'world', '!']
# ['terrible', 'world', '!']
# ['beautiful', 'world', '!']

Observation:
Capitalized input is not lowercased, so common words such as "Hello" become [UNK] even though their lowercase forms are in the vocabulary.
Fully lowercase input tokenizes as expected.
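
To make the mechanism concrete, here is a minimal sketch with a toy lowercase-only vocabulary. It is not the real tokenizer: TOY_VOCAB and toy_tokenize are hypothetical names invented for illustration, and the real model uses a much larger vocabulary with subword splitting. It only shows why skipping the lowercasing step turns a capitalized word into [UNK].

```python
import re

# Toy lowercase-only vocabulary (hypothetical; stands in for an uncased vocab).
TOY_VOCAB = {"hello", "beautiful", "terrible", "world", "!"}

def toy_tokenize(text, do_lower_case=False):
    """Split into words and punctuation; map out-of-vocabulary pieces to [UNK]."""
    if do_lower_case:
        text = text.lower()
    pieces = re.findall(r"\w+|[^\w\s]", text)
    return [p if p in TOY_VOCAB else "[UNK]" for p in pieces]

print(toy_tokenize("Hello world!"))
# ['[UNK]', 'world', '!']
print(toy_tokenize("Hello world!", do_lower_case=True))
# ['hello', 'world', '!']
```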

Workaround:
Pass do_lower_case=True when loading the tokenizer:

tokenizer = AutoTokenizer.from_pretrained("microsoft/MiniLM-L12-H384-uncased", do_lower_case=True)
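
To check whether a given tokenizer is affected, one option is to compare [UNK] counts for the original text and a hand-lowercased copy. The helper below is a sketch under assumptions: unk_count, lowercasing_looks_missing, and fake_cased_tokenize are hypothetical names, and the stand-in tokenizer exists only so the example runs without downloading a model.

```python
from typing import Callable, List

def unk_count(tokens: List[str], unk_token: str = "[UNK]") -> int:
    """Count how many tokens are the unknown-token placeholder."""
    return sum(1 for t in tokens if t == unk_token)

def lowercasing_looks_missing(
    tokenize: Callable[[str], List[str]],
    samples: List[str],
    unk_token: str = "[UNK]",
) -> bool:
    """Return True if lowercasing a sample by hand reduces its [UNK] count."""
    return any(
        unk_count(tokenize(s), unk_token) > unk_count(tokenize(s.lower()), unk_token)
        for s in samples
    )

# Stand-in tokenizer for demonstration only (not the real model's tokenizer).
def fake_cased_tokenize(text: str) -> List[str]:
    vocab = {"hello", "world!"}
    return [w if w in vocab else "[UNK]" for w in text.split()]

print(lowercasing_looks_missing(fake_cased_tokenize, ["Hello world!"]))
# True
print(lowercasing_looks_missing(lambda t: fake_cased_tokenize(t.lower()), ["Hello world!"]))
# False
```

With a real tokenizer, tokenizer.tokenize can be passed as the first argument. If do_lower_case=True has taken effect, the helper should return False for capitalized samples like the ones above.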
nikolina-p changed discussion title from (beware) tokenizer does not lowercase input (uncased model) to tokenizer does not lowercase input (uncased model)
nikolina-p changed discussion status to closed
nikolina-p changed discussion status to open
