More like Zero Cross-Lingual Transfer

#22

by Markobes - opened 7 days ago

Hi,
I ran a test, and the model cloned the voice in English perfectly.
In a second test, I gave the model an English sample and a Russian text. Instead of the text, the synthesized audio said: "Tai-tai-tai-tai!..."
In a third experiment, I gave it a Russian sample and a Russian text. This time I heard not "Tai!" but "Ton-ton-ton!..."
The voice itself was similar in every experiment, but the results led me to a conclusion.
This is a classic, well-known artifact of autoregressive TTS models: the neural network got stuck in a repetition loop because of a critical language mismatch.

The reason lies in the model's architecture, described in its paper (linked in the interface header, arXiv:2506.21619):

No tokenizer support for Cyrillic: Before converting text to speech, the model breaks it down into phonemes using a special dictionary (tokenizer) [source: 2506.21619]. IndexTTS2 was trained strictly on large Chinese and English corpora [source: 2506.21619]. When you send it Russian text, its tokenizer simply doesn't know these letters. To the model, Cyrillic looks like a sequence of unknown characters (placeholders) [source: 2506.21619].
Illusion of voice similarity: The similar timbre I heard at the start of the loop ("Ton-ton...") comes from the division of responsibilities inside the neural network [source: 2506.21619]. The voice encoding block (Voice Reference) successfully read the acoustic characteristics of the Russian sample (pitch, timbre, and noise) and passed them on [source: 2506.21619]. However, the text block could not associate these characteristics with the letters, so it produced an endless repetition of the first sound it encountered, colored by the timbre of the reference sample [source: 2506.21619].
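
The tokenizer gap described above can be caught before synthesis with a simple pre-check. This is a minimal sketch, not part of IndexTTS2: the function names `cyrillic_chars` and `is_safe_for_zh_en_model` are hypothetical, and it assumes that any Cyrillic character is unsupported by a model trained only on Chinese and English.

```python
import unicodedata


def cyrillic_chars(text: str) -> list[str]:
    """Return the Cyrillic characters found in text, in order of appearance."""
    # unicodedata.name() returns e.g. "CYRILLIC CAPITAL LETTER PE" for "П";
    # the second argument is the fallback for characters without a name.
    return [ch for ch in text if "CYRILLIC" in unicodedata.name(ch, "")]


def is_safe_for_zh_en_model(text: str) -> bool:
    """Hypothetical guard: True if the text contains no Cyrillic letters."""
    # Assumption: the model's tokenizer covers only Chinese and English input.
    return not cyrillic_chars(text)
```

A caller could refuse to synthesize, or switch to transliteration, whenever `is_safe_for_zh_en_model` returns `False`.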

How to make it speak Russian?
Full-fledged support for the Russian language in such models is usually implemented by developers of third-party software. This is done in two ways:

Transliteration into English phonemes (a workaround): Russian text is converted into a phonetic transcription (ARPAbet / IPA) that the English model is able to read. For example, instead of "Привет", the model is given a phonetic record like P r i v e t. It will speak with a strong American accent, but the looping will stop.
Fine-tuning: The model is trained further on a small dataset of Russian speech, so that its text decoder "learns" the connection between Cyrillic letters and sounds [source: 2506.21619].
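
The transliteration approach can be sketched in a few lines of Python. This is a plain letter-by-letter romanization, not a true ARPAbet or IPA conversion, and the mapping table and the `transliterate` function are my own illustrative assumptions rather than anything shipped with the model. It ignores stress and vowel reduction, which is one reason the result sounds accented.

```python
# Hypothetical letter-level mapping from Russian Cyrillic to Latin.
TRANSLIT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "yo",
    "ж": "zh", "з": "z", "и": "i", "й": "y", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "kh", "ц": "ts", "ч": "ch", "ш": "sh", "щ": "shch",
    "ъ": "", "ы": "y", "ь": "", "э": "e", "ю": "yu", "я": "ya",
}


def transliterate(text: str) -> str:
    """Replace Russian letters with Latin equivalents; keep everything else."""
    out = []
    for ch in text:
        latin = TRANSLIT.get(ch.lower())
        if latin is None:
            out.append(ch)  # not a Russian letter: pass through unchanged
        elif ch.isupper():
            out.append(latin.capitalize())  # preserve initial capitals
        else:
            out.append(latin)
    return "".join(out)
```

For example, `transliterate("Привет")` returns `"Privet"`, which can then be passed to the model as if it were English text.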

Since this is not the first time I've encountered this issue, I conducted another experiment by feeding the model the transliteration of the Russian text.
Yes, I heard intelligible speech, but it was heavily distorted.

Thank you
