Tokenizer doesn't distinguish dash and hyphen

#10

by nshmyrevgmail - opened May 28, 2025

он шутит - сказал человек - амфибия. (English: "he is joking - said the man - an amphibian.")

['[CLS]', '-', 'он', 'шутит', '-', 'сказал', 'человек', '-', 'амфи', '##бия', '.', '[SEP]']

он шутит - сказал человек-амфибия. (English: "he is joking - said the amphibian man.")

['[CLS]', '-', 'он', 'шутит', '-', 'сказал', 'человек', '-', 'амфи', '##бия', '.', '[SEP]']

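Both inputs produce identical tokens, so the spaced dash and the hyphen inside a compound word cannot be told apart after tokenization. While this is a common issue, it is a bigger problem for Russian, where the hyphen is used much more actively than in English.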
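
To make the distinction concrete, here is a minimal sketch of how the two cases could be told apart on the raw text before tokenization. It assumes that a hyphen is a `-` sitting directly between two word characters and that anything else is a dash. `classify_dashes` is a hypothetical helper written for illustration, not part of this tokenizer or any library.

```python
import re


def classify_dashes(text):
    """Label each '-' in text as 'hyphen' or 'dash' (hypothetical helper).

    Assumption: a hyphen joins two word characters with no whitespace,
    as in "человек-амфибия"; every other '-' is treated as a dash.
    """
    labels = []
    for match in re.finditer(r"-", text):
        i = match.start()
        # Treat the string boundaries as whitespace.
        before = text[i - 1] if i > 0 else " "
        after = text[i + 1] if i + 1 < len(text) else " "
        if before.isalnum() and after.isalnum():
            labels.append("hyphen")
        else:
            labels.append("dash")
    return labels


print(classify_dashes("он шутит - сказал человек - амфибия."))  # ['dash', 'dash']
print(classify_dashes("он шутит - сказал человек-амфибия."))    # ['dash', 'hyphen']
```

The raw strings still carry the information (whitespace around the character), so one possible workaround would be to record or re-encode it before the text reaches the tokenizer. Whether that fits this model's vocabulary is not something the sketch addresses.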
