Zero Shot > Cross-lingual
Surprisingly, I've found that cross-lingual generation sometimes works better for me in Zero Shot mode than in cross-lingual mode.
For example, let's say I have a German speaker with a German accent, plus a German reference sample and its transcript, and I want that speaker to say something in English while keeping the German accent.
Here's what has worked best for getting the German speaker to speak English with a German accent (using Zero Shot mode).
System Prompt (German for 'Please speak with a German accent.', followed by a common German male name or your speaker's own German name)
Bitte sprechen Sie mit deutschem Akzent. Friedrich.<|endofprompt|>
Prompt Transcript (German transcript of reference audio)
Hallo, mein Name ist Friedrich. Möchten Sie, dass ich Ihnen demonstriere, wie cool es klingt, mit einem deutschen Akzent zu sprechen?
Text to synthesize (the English text you want CosyVoice 3 to speak in the German speaker's natural accent)
Never leave until tomorrow what you can do today. He who chases two rabbits at once will catch none.
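As a rough sketch, the three pieces above could be put together in Python like this. The helper name `build_zero_shot_inputs` is made up for illustration and is not part of CosyVoice. I'm also assuming the system prompt and the prompt transcript end up concatenated into a single prompt-text string with `<|endofprompt|>` between them. The commented-out call at the bottom assumes `inference_zero_shot` takes the target text, the prompt text, and the reference audio in that order, so check that against the version you have installed.

```python
END_OF_PROMPT = "<|endofprompt|>"


def build_zero_shot_inputs(system_prompt: str, prompt_transcript: str, tts_text: str) -> dict:
    """Hypothetical helper: assemble the text inputs for a Zero Shot run.

    Assumption: the system prompt and the reference transcript are joined
    into one prompt-text string, separated by the <|endofprompt|> token.
    """
    system_prompt = system_prompt.strip()
    # Append the end-of-prompt token only if it is not already there.
    if not system_prompt.endswith(END_OF_PROMPT):
        system_prompt += END_OF_PROMPT
    return {
        "prompt_text": system_prompt + prompt_transcript.strip(),
        "tts_text": tts_text.strip(),
    }


inputs = build_zero_shot_inputs(
    system_prompt="Bitte sprechen Sie mit deutschem Akzent. Friedrich.",
    prompt_transcript=(
        "Hallo, mein Name ist Friedrich. Möchten Sie, dass ich Ihnen demonstriere, "
        "wie cool es klingt, mit einem deutschen Akzent zu sprechen?"
    ),
    tts_text=(
        "Never leave until tomorrow what you can do today. "
        "He who chases two rabbits at once will catch none."
    ),
)
print(inputs["prompt_text"])

# Assumed call shape (not verified here); `cosyvoice` is your loaded model
# and `prompt_audio` is the German reference sample:
# for chunk in cosyvoice.inference_zero_shot(
#         inputs["tts_text"], inputs["prompt_text"], prompt_audio, stream=False):
#     ...
```

The point of the helper is just to keep the German instruction, the `<|endofprompt|>` token, and the German transcript in a fixed order, so the only thing that changes between runs is the English text.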
Doing this in Zero Shot mode has given me much better results than using cross-lingual mode. I'm curious what everyone else's experience has been with other languages.
In inference_cross_lingual, the prompt is given to the CFM but not to the LLM. So if your target text and prompt text are in different languages, you do not want the prompt wav to affect the rhythm of the generated wav, and inference_cross_lingual avoids that. But if they are in the same language, use inference_zero_shot.
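The rule of thumb above, written out as a tiny sketch. `pick_inference_mode` is a hypothetical helper, not something CosyVoice provides, and the language codes are just illustrative labels. It only encodes the same-language vs. different-language advice. It does not cover the case from the original post, where carrying the prompt's accent over is the goal.

```python
def pick_inference_mode(prompt_lang: str, target_lang: str) -> str:
    """Hypothetical helper: name the inference method suggested above.

    Same language  -> inference_zero_shot (prompt goes to both LLM and CFM).
    Different ones -> inference_cross_lingual (prompt goes to the CFM only,
                      so the prompt wav should not shape the rhythm).
    """
    same_language = prompt_lang.strip().lower() == target_lang.strip().lower()
    return "inference_zero_shot" if same_language else "inference_cross_lingual"


print(pick_inference_mode("de", "de"))
print(pick_inference_mode("de", "en"))
```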