tokenizer does not lowercase input (uncased model)
#3
by nikolina-p - opened
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/MiniLM-L12-H384-uncased")
print(tokenizer.tokenize("Hello world!"))
print(tokenizer.tokenize("Beautiful world!"))
print(tokenizer.tokenize("Terrible world!"))
# ['[UNK]', 'world', '!']
# ['[UNK]', 'world', '!']
# ['[UNK]', 'world', '!']
print(tokenizer.tokenize("hello world!"))
print(tokenizer.tokenize("terrible world!"))
print(tokenizer.tokenize("beautiful world!"))
# ['hello', 'world', '!']
# ['terrible', 'world', '!']
# ['beautiful', 'world', '!']
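Observation:
Inputs containing uppercase letters are not lowercased before tokenization, so common words such as "Hello" or "Beautiful" become [UNK] even though this is an uncased model.
All-lowercase inputs are tokenized as expected.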
Workaround:
Pass do_lower_case=True explicitly when loading the tokenizer:
tokenizer = AutoTokenizer.from_pretrained("microsoft/MiniLM-L12-H384-uncased", do_lower_case=True)
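
A quick way to confirm whether a tokenizer lowercases its input is to compare the tokens for a string against the tokens for its lowercased form. The sketch below is only an illustration: `lowercases_input`, `toy_tokenize`, and `TOY_VOCAB` are hypothetical names, and the toy tokenizer is a simplified whole-word lookup standing in for the real one so the snippet runs without downloading the model.

```python
from typing import Callable, List


def lowercases_input(tokenize: Callable[[str], List[str]], text: str) -> bool:
    """Return True if `text` and its lowercased form yield the same tokens."""
    return tokenize(text) == tokenize(text.lower())


# Hypothetical toy vocabulary: lowercase entries only, like an uncased vocab.
TOY_VOCAB = {"hello", "world", "!"}


def toy_tokenize(text: str, do_lower_case: bool = False) -> List[str]:
    """Simplified stand-in tokenizer (assumption: whole-word lookup, no subwords)."""
    if do_lower_case:
        text = text.lower()
    # Crudely separate "!" from the preceding word, then split on whitespace.
    words = text.replace("!", " ! ").split()
    # Any word missing from the vocabulary becomes [UNK].
    return [w if w in TOY_VOCAB else "[UNK]" for w in words]


# Without lowercasing, "Hello" is not in the vocabulary and becomes [UNK].
print(toy_tokenize("Hello world!"))
print(lowercases_input(toy_tokenize, "Hello world!"))

# With lowercasing enabled, both forms give identical tokens.
print(lowercases_input(lambda t: toy_tokenize(t, do_lower_case=True), "Hello world!"))
```

With the real tokenizer, the same check would be lowercases_input(tokenizer.tokenize, "Hello world!"); based on the outputs reported above, it should return False without the workaround.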
nikolina-p changed discussion title from (beware) tokenizer does not lowercase input (uncased model) to tokenizer does not lowercase input (uncased model)
nikolina-p changed discussion status to closed
nikolina-p changed discussion status to open