Mongolian tokenizer mapping seems incorrect for ө and ш

#10
by Munyeong - opened

I found what looks like an incorrect character mapping in the tokenizer for facebook/mms-tts-mon.

The vocab does not contain two letters of modern Mongolian Cyrillic:

  • ө (U+04E9, CYRILLIC SMALL LETTER BARRED O)

  • ш (U+0448, CYRILLIC SMALL LETTER SHA)

But it does contain visually similar, easily confused alternatives:

  • ѳ (U+0473, CYRILLIC SMALL LETTER FITA)

  • щ (U+0449, CYRILLIC SMALL LETTER SHCHA)

Two simple examples:

  • өргөн drops ө unless I remap ө -> ѳ

  • нар шиг drops ш unless I remap ш -> щ

After applying these remaps locally, the pronunciation is correct in practice, although ш is occasionally pronounced as 'ye'.
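
For reference, a minimal sketch of the local workaround described above. The names `REMAP` and `remap_for_mms_mon` are only illustrative, and lowercasing the input first is an assumption on my part (so that capital Ө and Ш are also covered), not something confirmed for this tokenizer.

```python
# Hypothetical workaround: replace the two modern Mongolian letters with the
# lookalike code points that are present in the vocab, before tokenizing.
REMAP = str.maketrans({
    "\u04e9": "\u0473",  # ө (BARRED O) -> ѳ (FITA)
    "\u0448": "\u0449",  # ш (SHA)      -> щ (SHCHA)
})


def remap_for_mms_mon(text: str) -> str:
    """Lowercase the text (assumption) and apply the two-character remap."""
    return text.lower().translate(REMAP)


if __name__ == "__main__":
    for sample in ("өргөн", "нар шиг"):
        print(sample, "->", remap_for_mms_mon(sample))
```

The remapped string is what then goes to the tokenizer. One side effect is that ш and щ become indistinguishable after the remap, so this is only a stopgap.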

This suggests that the tokenizer vocabulary shipped with the model has the wrong mapping, rather than the acoustic model itself.

Also, in Mongolian, щ is generally not used except in Russian-derived words, so having щ in the vocab while missing basic Mongolian ш seems especially suspicious.
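
The gap is easy to check for any input text. Below is a rough sketch that lists which characters of a string are absent from a given vocab, together with their Unicode names. The helper name `missing_chars` is hypothetical, and the vocab here is a toy stand-in rather than the real file, so the sketch runs without downloading anything.

```python
import unicodedata


def missing_chars(text, vocab):
    """Map each non-space character of `text` that is absent from `vocab`
    to a 'U+XXXX NAME' description, in order of first appearance."""
    report = {}
    for ch in text:
        if ch.isspace() or ch in vocab:
            continue
        name = unicodedata.name(ch, "UNKNOWN")
        report.setdefault(ch, f"U+{ord(ch):04X} {name}")
    return report


if __name__ == "__main__":
    # Toy vocab (assumption): contains ѳ and щ but not ө and ш.
    toy_vocab = set("нарщигѳ")
    for ch, desc in missing_chars("өргөн нар шиг", toy_vocab).items():
        print(ch, desc)
```

With transformers installed, the real vocab could presumably be obtained with something like `AutoTokenizer.from_pretrained("facebook/mms-tts-mon").get_vocab()` and passed in place of the toy set.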

Could this be fixed upstream?
