Mongolian tokenizer mapping seems incorrect for ө and ш
I found what looks like an incorrect character mapping in the tokenizer for facebook/mms-tts-mon.
The vocab does not contain these modern Mongolian Cyrillic letters:
ө (U+04E9, CYRILLIC SMALL LETTER BARRED O)
ш (U+0448, CYRILLIC SMALL LETTER SHA)
But it does contain visually/symbolically confusing alternatives:
ѳ (U+0473, CYRILLIC SMALL LETTER FITA)
щ (U+0449, CYRILLIC SMALL LETTER SHCHA)
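For anyone who wants to reproduce the check, here is a minimal sketch of how the four characters can be compared against a vocab. The `sample_vocab` set is a hypothetical stand-in so the snippet runs offline; the helper name `report_chars` is also mine. In practice I would expect the real vocab to come from something like `AutoTokenizer.from_pretrained("facebook/mms-tts-mon").get_vocab()`, but treat that as an assumption about your setup.

```python
import unicodedata


def report_chars(chars, vocab):
    """Return (char, code point, Unicode name, present-in-vocab) per character."""
    rows = []
    for ch in chars:
        rows.append((ch, f"U+{ord(ch):04X}", unicodedata.name(ch), ch in vocab))
    return rows


# Hypothetical stand-in for the tokenizer vocab (NOT the real vocab contents).
# It only mirrors the situation described above: fita and shcha are present,
# barred o and sha are absent.
sample_vocab = {"ѳ", "щ", "р", "г", "н", "а", "и", " "}

for row in report_chars("өшѳщ", sample_vocab):
    print(row)
```

Printing the code points this way makes it clear that the look-alike pairs are distinct characters, even when the font renders them almost identically.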
Two simple examples:
өргөн drops ө unless I remap ө -> ѳ
нар шиг drops ш unless I remap ш -> щ
After applying these remaps locally, the pronunciations become correct in practice (though it sometimes pronounces ш as 'ye').
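The local workaround is roughly the following sketch. It only covers the two lowercase letters reported here; the function name `remap_for_mms_mon` is my own, and the remapped string is what I pass to the tokenizer in place of the original text.

```python
# Workaround table: map the standard Mongolian letters to the look-alike
# characters that the vocab actually contains.
REMAP = str.maketrans({
    "\u04e9": "\u0473",  # ө (barred o) -> ѳ (fita)
    "\u0448": "\u0449",  # ш (sha)      -> щ (shcha)
})


def remap_for_mms_mon(text):
    """Replace ө and ш with the characters present in the vocab."""
    return text.translate(REMAP)


print(remap_for_mms_mon("өргөн"))
print(remap_for_mms_mon("нар шиг"))
```

This is a stopgap on the input side only. It does not change the vocab itself, and any real щ in the input becomes indistinguishable from a remapped ш.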
This suggests that the tokenizer/artifact mapping is wrong, rather than the acoustic model itself.
Also, in Mongolian, щ is generally not used except in Russian-derived words, so having щ in the vocab while missing basic Mongolian ш seems especially suspicious.
Could this be fixed upstream?