Instructions to use dbmdz/bert-base-german-uncased with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
How to use dbmdz/bert-base-german-uncased with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("fill-mask", model="dbmdz/bert-base-german-uncased")
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-uncased")
model = AutoModelForMaskedLM.from_pretrained("dbmdz/bert-base-german-uncased")
```
Bug in tokenizer? (#2, opened by birgitbrause)
I may have found a bug in the tokenizer. Its normalizer appears to set strip_accents to True when loaded, because the option is not explicitly set to False in tokenizer_config.json.
As a result, the tokenizer strips German umlauts. The tokenizer's vocabulary, however, contains tokens with umlauts.
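For illustration, the accent stripping in BERT's tokenizer amounts to NFD-decomposing the text and dropping combining marks; a minimal stdlib sketch of that step (the function name is mine, not from transformers):

```python
import unicodedata

def bert_strip_accents(text: str) -> str:
    # Mirror what BERT's basic tokenizer does when strip_accents is on:
    # NFD-decompose, then drop all combining marks (Unicode category "Mn")
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(bert_strip_accents("grüße aus münchen"))  # -> "gruße aus munchen"
```

Note that ß survives (it is a letter, not an accented character), while ü, ö, ä collapse to u, o, a, so the stripped forms no longer match the umlaut tokens in the vocabulary.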
When loaded from the Hub, the tokenizer strips the umlauts; if I deactivate the normalization, they are kept.
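The toggle can be reproduced with the tokenizers library's BertNormalizer directly (no model download needed, assuming the tokenizers package is installed); when strip_accents is left unset it defaults to following lowercase, which matches the behaviour described above:

```python
from tokenizers.normalizers import BertNormalizer

text = "Grüße aus München"

# strip_accents unset: defaults to stripping whenever lowercase=True
default_norm = BertNormalizer(lowercase=True)
# strip_accents explicitly disabled: umlauts survive
no_strip = BertNormalizer(lowercase=True, strip_accents=False)

print(default_norm.normalize_str(text))  # -> "gruße aus munchen"
print(no_strip.normalize_str(text))      # -> "grüße aus münchen"
```

With transformers, the same override should be possible by passing strip_accents=False to AutoTokenizer.from_pretrained, since BertTokenizer and BertTokenizerFast accept it as an init argument.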
Given the umlauts in the vocabulary, I assume strip_accents=False was the originally intended behaviour?
Do you know whether the model was trained with or without seeing umlauts?
I tested this with transformers versions 2.3.0, 4.6.1, and 4.25.1.