More like Zero Cross-Lingual Transfer
Hi,
I ran a test, and the model cloned the voice perfectly in English.
Then I ran a second test and gave the model an English sample and a Russian text. After synthesis, instead of the text I heard: "Tai-tai-tai-tai!..."
Then I ran another experiment with a Russian sample and a Russian text. This time I heard not "Tai!" but "Ton-ton-ton!..".
The voice itself was similar in all of the experiments, but I came to the following conclusion.
This is a classic and very well-known failure mode of autoregressive TTS models. The network got stuck repeating the same output, apparently because the input language does not match anything it was trained on.
The reason lies in the architecture of the model described in their scientific article (the link to which is visible in the interface header, arXiv:2506.21619):
- Lack of a tokenizer for Cyrillic: Before converting text to speech, the model breaks it down into phonemes using a special dictionary (tokenizer) [source: 2506.21619]. IndexTTS2 was trained strictly on large Chinese and English datasets [source: 2506.21619]. When you send it Russian text, its tokenizer simply doesn't know these letters. To the tokenizer, Cyrillic looks like a run of unknown characters (placeholders) [source: 2506.21619].
- Illusion of voice similarity: The fact that I heard a similar timbre at the beginning of the loop ("Ton-ton...") is due to the division of responsibilities in the neural network [source: 2506.21619]. The voice encoding block (Voice Reference) successfully read the acoustic characteristics of the Russian sample (pitch, timbre, and noise) and passed them on [source: 2506.21619]. However, the text block was unable to associate these characteristics with letters and produced an endless repetition of the first sound it encountered, colored by the sample's timbre [source: 2506.21619].
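To illustrate the first point, here is a minimal sketch of what happens when text falls outside a tokenizer's vocabulary. It assumes a toy character-level vocabulary of my own invention (`VOCAB`, `UNK`, `tokenize`, and `unk_ratio` are all hypothetical names); it is not the actual IndexTTS2 tokenizer, which I have not inspected.

```python
import string

# Hypothetical placeholder token for characters outside the vocabulary.
UNK = "<unk>"

# Toy vocabulary (an assumption for illustration): lowercase Latin letters,
# space, and a little punctuation. No Cyrillic at all.
VOCAB = set(string.ascii_lowercase) | set(" .,!?'")


def tokenize(text):
    """Map each character to itself if known, otherwise to the UNK token."""
    return [ch if ch in VOCAB else UNK for ch in text.lower()]


def unk_ratio(text):
    """Fraction of tokens that the toy vocabulary could not represent."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(t == UNK for t in tokens) / len(tokens)


if __name__ == "__main__":
    for sample in ("Hello, world!", "Привет, мир!"):
        print(sample, tokenize(sample), unk_ratio(sample))
```

With this toy vocabulary, every Cyrillic letter collapses into the same placeholder, so the text side carries almost no information about what should be spoken. A check like `unk_ratio` could also be used to reject unsupported text before synthesis instead of producing a loop.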
How to make it speak Russian?
Full-fledged support for the Russian language in such models is usually implemented by developers of third-party software. This is done in two ways:
- Transliteration into English phonemes (a workaround): Russian text is converted into a phonetic transcription (Arpabet / IPA) that the English model is able to read. For example, instead of "Привет" ("Hello"), it is given a phonetic spelling like P r i v e t. It will speak with a strong American accent, but the looping will stop.
- Fine-tuning: The model is trained further on a small dataset of Russian speech, so that its text decoder "learns" the connection between Cyrillic letters and sounds [source: 2506.21619].
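The transliteration workaround can be sketched roughly as follows. This is a simplified letter-by-letter romanization that I made up for illustration (the `TRANSLIT` table and `transliterate` function are hypothetical); it is neither Arpabet nor IPA, and it is not part of IndexTTS2.

```python
# Simplified Russian-to-Latin mapping (an assumption for illustration only;
# real systems would use a proper grapheme-to-phoneme step with stress marks).
TRANSLIT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "yo",
    "ж": "zh", "з": "z", "и": "i", "й": "y", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "kh", "ц": "ts", "ч": "ch", "ш": "sh", "щ": "shch",
    "ъ": "", "ы": "y", "ь": "", "э": "e", "ю": "yu", "я": "ya",
}


def transliterate(text):
    """Replace Cyrillic letters with Latin approximations, keep the rest."""
    out = []
    for ch in text:
        latin = TRANSLIT.get(ch.lower())
        if latin is None:
            # Not a Russian letter (Latin, digit, punctuation): pass through.
            out.append(ch)
        elif ch.isupper():
            # Preserve capitalization of the original letter.
            out.append(latin.capitalize())
        else:
            out.append(latin)
    return "".join(out)


if __name__ == "__main__":
    print(transliterate("Привет, мир!"))
```

The output is plain Latin text that an English-only model can at least tokenize. Because the mapping ignores stress, vowel reduction, and soft consonants, heavily accented or distorted speech is to be expected.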
Since this is not the first time I've encountered this issue, I ran one more experiment and fed the model a transliteration of the Russian text.
Yes, I heard intelligible speech, but it was heavily distorted.
Thank you