Code-switching??
I wonder if the model can switch between different languages, given a text with some parts in English and some in another language, for example.
Yes, it supports code-switching between languages. You can try it on our Hugging Face demo: https://huggingface.co/spaces/k2-fsa/OmniVoice.
I tried and it didn't work well??? The second example actually works really well! @zhu-han
The first is Voice Design; the second is what happens if I feed in HighKeyHateMe and have it speak English and Romanian (Voice Cloning).
But the voice cloning was really state-of-the-art!
I wonder if you could train OmniVoice-Large (based on Qwen3 1.7B); that would be better and maybe support more languages!
Hi, could you describe what the issue is? Is it that it cannot generate the corresponding text properly? If so, could you provide the text and the corresponding generated speech? Also, since I don’t speak Romanian, if the issue is related to Romanian pronunciation, could you describe it in detail?
Thanks for your suggestion. In our preliminary experiments, Qwen3 1.7B did not show clear improvement over Qwen3 0.6B. But we'll consider a larger version in the future.
"Qwen3 1.7B did not show clear improvement over Qwen3 0.6B."
What? You say that it didn't make a huge improvement over Qwen 3 0.6B?
"Hi, could you describe what the issue is?"
"If the issue is related to Romanian pronunciation, could you describe it in detail?"
I put "română" in the first Voice Design sample, and the model says something closer to "romana", without the ə and ɨ sounds.
Gemini 3.1 Pro confirms this:
"You explicitly typed "română" and "français". However, the transcript (and presumably the audio output) resolved these to "Romana" and "Francais"."
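For what it's worth, the diacritic loss is easy to reproduce outside the model. This is a plain-Python sketch (no OmniVoice code involved, just the standard library) showing how "română" and "français" collapse to their ASCII forms when combining marks are stripped, which matches the pronunciations described above:

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    # Decompose characters (NFD), then drop the combining marks:
    # "ă" becomes "a" + U+0306, and the U+0306 is filtered out.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics("română"))    # prints "romana"
print(strip_diacritics("français"))  # prints "francais"
```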
Ah, it DOES support code-switching! But it actually works when you put in sentences, not just an English word, a Romanian word, and a French word, for example.
No, no, no. It works best if there's an English phrase, a Romanian phrase, and a French phrase at the end, for example.
A word is too little for code-switching to work; a phrase or a sentence works way better!
So, GOOD JOB WITH CODE-SWITCHING!
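To make the phrase-level observation above concrete, here's a minimal sketch of the two kinds of input; the texts and the segment structure are purely illustrative and not part of any OmniVoice API:

```python
# Illustrative only: word-level vs phrase-level code-switched inputs.
# Neither string comes from the OmniVoice API; they just show the contrast.

# Word-level mixing: reportedly too little context per language.
word_level = "hello bonjour bună"

# Phrase-level mixing: one full phrase per language, which works much better.
phrase_level = [
    ("en", "Hello, how are you today?"),
    ("ro", "Sper că ai o zi frumoasă."),
    ("fr", "À bientôt, mon ami !"),
]

# Join the phrases into one prompt for the demo's text box.
prompt = " ".join(text for _, text in phrase_level)
print(prompt)
```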
In terms of the comparison between Qwen3 0.6B and Qwen3 1.7B: in our preliminary TTS experiments, using Qwen3 1.7B as the backbone did not yield clear performance improvements over Qwen3 0.6B. Considering inference speed, we retained Qwen3 0.6B as our backbone. We want to emphasize that this is not a rigorous conclusion, as reaching a solid result would require far more experiments and extensive tuning of hyper-parameters. With sufficient tuning, larger models can theoretically deliver better performance. This is why I mentioned that we'll consider a larger model version in future work.
Yes, so Qwen3 1.7B is not much better, but it's slower and eats more disk space and RAM?
At least Qwen3 0.6B is very small, and in the case of OmniVoice, it's actually really good.
OmniVoice v2 could get better speaker similarity (so if you take a real clip of HighKeyHateMe and an AI-generated clip using the same voice, even he can't tell the difference), better quality, and possibly a smaller and cheaper version TOO!
Yeah, we'll try to develop a better version in the future.
But can I train my own version of OmniVoice by swapping Qwen 3 0.6B for another small LLM like SmolLM 2 360M?
Yeah, you can try it. However, one potential limitation is that SmolLM 2 360M only supports English, so it may not generalize well to other languages.
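Since there's no official recipe for swapping the backbone, a hypothetical sketch of what the swap might look like as configuration; every key name below is an assumption for illustration, not OmniVoice's actual training config:

```python
# Hypothetical backbone-swap sketch. The key names are assumptions;
# OmniVoice's real training config may look completely different.
backbone_config = {
    # Real Hugging Face repo id for the suggested replacement model.
    "model_name_or_path": "HuggingFaceTB/SmolLM2-360M",
    # Caveat from the reply above: SmolLM2 is English-only, so
    # multilingual TTS quality would likely suffer.
    "languages": ["en"],
}
print(backbone_config["model_name_or_path"])
```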
But can an English-only brain outperform a multilingual one on English?
And if so, where's the training/fine-tuning script? @zhu-han
The model often reads non-verbal tags aloud instead of producing the corresponding sound. Is there a required inference flag to enable this feature?
We fixed this issue two hours ago. You can try our latest code or demo.
Hello, I have a small problem using OmniVoice TTS. I tried the cloning feature, but when I generated many audios from multiple texts, the voice was not stable between generations. I cloned a woman's voice, but for some generations I got a man's voice in the audio.