m2mtokenizer doesn't know the word "wouldn't"

by anzorq - opened Aug 10, 2022


I accidentally discovered that the tokenizer tokenizes the word "wouldn't" as ['<unk>', "'", 't'].

It doesn't seem to affect the model's performance, but it makes me wonder what else is missing from the tokenizer's vocabulary.
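
One way to look for other gaps is to run a list of candidate words through the tokenizer and collect the ones that produce the unknown token. The following is a minimal sketch, not a definitive check. The helper names `find_unk_words` and `check_m2m100` are hypothetical, and the checkpoint name `facebook/m2m100_418M` is an assumption, since the post does not say which M2M100 checkpoint was used.

```python
from typing import Iterable, List


def find_unk_words(tokenizer, words: Iterable[str]) -> List[str]:
    """Return the words whose tokenization contains the unknown token.

    `tokenizer` is any object exposing `tokenize(text)` and `unk_token`,
    as Hugging Face tokenizers do.
    """
    unk = tokenizer.unk_token
    return [word for word in words if unk in tokenizer.tokenize(word)]


def check_m2m100(words: Iterable[str],
                 checkpoint: str = "facebook/m2m100_418M") -> List[str]:
    """Hypothetical convenience wrapper around find_unk_words.

    The default checkpoint is an assumption; swap in the one you use.
    Requires the `transformers` and `sentencepiece` packages.
    """
    # Imported lazily so find_unk_words stays usable without transformers.
    from transformers import M2M100Tokenizer

    tokenizer = M2M100Tokenizer.from_pretrained(checkpoint)
    return find_unk_words(tokenizer, words)
```

Calling `check_m2m100(["wouldn't", "couldn't", "don't", "hello"])` would list whichever of those words contain a piece the tokenizer maps to `<unk>`. The result depends on the checkpoint's vocabulary, so none is shown here.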

