Instructions to use dbmdz/bert-base-german-uncased with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
How to use dbmdz/bert-base-german-uncased with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("fill-mask", model="dbmdz/bert-base-german-uncased")
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-uncased")
model = AutoModelForMaskedLM.from_pretrained("dbmdz/bert-base-german-uncased")
```
Bug in tokenizer? (#2, opened by birgitbrause)
I may have found a bug in the tokenizer. Its normalizer appears to set strip_accents to True when loaded, because the option is not explicitly set to False in tokenizer_config.json.
As a result, the tokenizer strips German umlauts. The tokenizer's vocabulary, however, contains tokens with umlauts.
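For illustration, the accent stripping in BERT's tokenizer amounts to NFD-decomposing the text and dropping combining marks; a minimal stdlib sketch of that step (the function name is mine, not from transformers):

```python
import unicodedata

def bert_strip_accents(text: str) -> str:
    # Mirror what BERT's basic tokenizer does when strip_accents is on:
    # NFD-decompose, then drop all combining marks (Unicode category "Mn")
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(bert_strip_accents("grüße aus münchen"))  # -> "gruße aus munchen"
```

Note that ß survives (it is a letter, not an accented character), while ü, ö, ä collapse to u, o, a, so the stripped forms no longer match the umlaut tokens in the vocabulary.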
When loaded from the Hub, the tokenizer strips the umlauts; if I deactivate the normalization, they are kept.
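The toggle can be reproduced with the tokenizers library's BertNormalizer directly (no model download needed, assuming the tokenizers package is installed); when strip_accents is left unset it defaults to following lowercase, which matches the behaviour described above:

```python
from tokenizers.normalizers import BertNormalizer

text = "Grüße aus München"

# strip_accents unset: defaults to stripping whenever lowercase=True
default_norm = BertNormalizer(lowercase=True)
# strip_accents explicitly disabled: umlauts survive
no_strip = BertNormalizer(lowercase=True, strip_accents=False)

print(default_norm.normalize_str(text))  # -> "gruße aus munchen"
print(no_strip.normalize_str(text))      # -> "grüße aus münchen"
```

With transformers, the same override should be possible by passing strip_accents=False to AutoTokenizer.from_pretrained, since BertTokenizer and BertTokenizerFast accept it as an init argument.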
Given the umlauts in the vocabulary, I assume strip_accents=False was the originally intended behaviour?
Do you know whether the model was trained with or without seeing umlauts?
I tested this with transformers versions 2.3.0, 4.6.1, and 4.25.1.