Code-switching??
I wonder if the model can switch between different languages, given a text with some parts in English and some in another language, for example.
Yes, it supports code-switching between languages. You can try it on our Hugging Face demo: https://huggingface.co/spaces/k2-fsa/OmniVoice.
I tried and it didn't work well??? The second example actually works really well! @zhu-han
The first is Voice Design; the second is what happens if I feed in HighKeyHateMe and have it speak English and Romanian (Voice Cloning).
But the voice cloning was really state-of-the-art!
I wonder if you could train OmniVoice-Large (based on Qwen3 1.7B); that would be better and maybe support more languages!
Hi, could you describe what the issue is? Is it that it cannot generate the corresponding text properly? If so, could you provide the text and the corresponding generated speech? Also, since I don’t speak Romanian, if the issue is related to Romanian pronunciation, could you describe it in detail?
Thanks for your suggestion. In our preliminary experiments, Qwen3 1.7B did not show clear improvement over Qwen3 0.6B. But we'll consider a larger version in the future.
"Qwen3 1.7B did not show clear improvement over Qwen3 0.6B."
What? You say that it didn't make a huge improvement over Qwen 3 0.6B?
"Hi, could you describe what the issue is?"
"If the issue is related to Romanian pronunciation, could you describe it in detail?"
I put "română" in the first Voice Design sample, and the model says something closer to "romana", without the ə and ɨ sounds.
Gemini 3.1 Pro confirms this:
"You explicitly typed "română" and "français". However, the transcript (and presumably the audio output) resolved these to "Romana" and "Francais"."
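For what it's worth, the diacritic loss is easy to reproduce outside the model. This is a plain-Python sketch (no OmniVoice code involved, just the standard library) showing how "română" and "français" collapse to their ASCII forms when combining marks are stripped, which matches the pronunciations described above:

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    # Decompose characters (NFD), then drop the combining marks:
    # "ă" becomes "a" + U+0306, and the U+0306 is filtered out.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics("română"))    # prints "romana"
print(strip_diacritics("français"))  # prints "francais"
```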
Ah, it DOES support code-switching! But it actually works when you put in sentences, not just an English word, a Romanian word, and a French word, for example.
No, no, no. It works best if there's an English phrase, a Romanian phrase, and a French phrase at the end, for example.
A word is too little for code-switching to work; a phrase or a sentence works way better!
So, GOOD JOB WITH CODE-SWITCHING!
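To make the phrase-level observation above concrete, here's a minimal sketch of the two kinds of input; the texts and the segment structure are purely illustrative and not part of any OmniVoice API:

```python
# Illustrative only: word-level vs phrase-level code-switched inputs.
# Neither string comes from the OmniVoice API; they just show the contrast.

# Word-level mixing: reportedly too little context per language.
word_level = "hello bonjour bună"

# Phrase-level mixing: one full phrase per language, which works much better.
phrase_level = [
    ("en", "Hello, how are you today?"),
    ("ro", "Sper că ai o zi frumoasă."),
    ("fr", "À bientôt, mon ami !"),
]

# Join the phrases into one prompt for the demo's text box.
prompt = " ".join(text for _, text in phrase_level)
print(prompt)
```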
In terms of the comparison between Qwen3 0.6B and Qwen3 1.7B: in our preliminary TTS experiments, using Qwen3 1.7B as the backbone did not yield clear performance improvements over Qwen3 0.6B. Considering inference speed, we retained Qwen3 0.6B as our backbone. We want to emphasize that this is not a rigorous conclusion, as reaching a solid result would require far more experiments and extensive tuning of hyper-parameters. With sufficient tuning, larger models can theoretically deliver better performance. This is why I mentioned that we'll consider a larger model version in future work.
Yes, so Qwen3 1.7B is not much better, but it's slower and eats more disk space and RAM?
At least Qwen3 0.6B is very small, and in the case of OmniVoice, it's actually really good.
OmniVoice v2 could get better speaker similarity (so if you take a real clip of HighKeyHateMe and an AI-generated clip using the same voice, even he can't tell the difference), better quality, and possibly a smaller and cheaper version TOO!
Yeah, we'll try to develop a better version in the future.
But can I train my own version of OmniVoice by swapping Qwen 3 0.6B for another small LLM like SmolLM 2 360M?
Yeah, you can try it. However, one potential limitation is that SmolLM 2 360M only supports English, so it may not generalize well to other languages.
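Since there's no official recipe for swapping the backbone, a hypothetical sketch of what the swap might look like as configuration; every key name below is an assumption for illustration, not OmniVoice's actual training config:

```python
# Hypothetical backbone-swap sketch. The key names are assumptions;
# OmniVoice's real training config may look completely different.
backbone_config = {
    # Real Hugging Face repo id for the suggested replacement model.
    "model_name_or_path": "HuggingFaceTB/SmolLM2-360M",
    # Caveat from the reply above: SmolLM2 is English-only, so
    # multilingual TTS quality would likely suffer.
    "languages": ["en"],
}
print(backbone_config["model_name_or_path"])
```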
But can an English-only brain outperform a multilingual one on English?
And if so, where's the training/fine-tuning script? @zhu-han
The model often reads non-verbal tags aloud instead of producing the corresponding sound. Is there a required inference flag to enable this feature?
We fixed this issue two hours ago. You can try our latest code or demo.
Hello, I have a small problem using OmniVoice TTS. I tried the cloning feature, but when I generated many audios from multiple texts, the voice was not stable between generations. I cloned a woman's voice, but for some generations I got a man's voice in the audio.