Zero Shot > Cross-lingual

by dolphinfan - opened Dec 17, 2025

Dec 17, 2025

Surprisingly, I've found that cross-lingual generation sometimes works better for me in Zero Shot mode than in cross-lingual mode.

For example, let's say I have a German speaker with a German accent and a German transcript/sample, and I want that speaker to say something in English while keeping the German accent.

Here's what has worked best for the German speaker to speak in English with a German accent (using Zero Shot mode).

System Prompt (German for 'Please speak with a German accent.', followed by a common German male name or your speaker's German name)

Bitte sprechen Sie mit deutschem Akzent. Friedrich.<|endofprompt|>

Prompt Transcript (German transcript of reference audio)

Hallo, mein Name ist Friedrich. Möchten Sie, dass ich Ihnen demonstriere, wie cool es klingt, mit einem deutschen Akzent zu sprechen?

Text to synthesize (the English text you want CosyVoice 3 to speak in the German speaker's natural accent)

Never leave until tomorrow what you can do today. He who chases two rabbits at once will catch none.
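
As a rough sketch, the three fields above could be assembled in Python like this. The helper name `build_zero_shot_inputs` and the returned keys are made up for illustration. It also assumes that the system prompt and the reference transcript are joined into a single prompt string around `<|endofprompt|>`, which may not match how your front end passes them.

```python
END_OF_PROMPT = "<|endofprompt|>"  # marker copied from the system prompt above


def build_zero_shot_inputs(system_prompt, prompt_transcript, target_text):
    """Assemble the text fields for a zero-shot request.

    Assumption: the system prompt and the reference transcript are joined
    into one prompt string, separated by the end-of-prompt marker.
    """
    system_prompt = system_prompt.strip()
    # Avoid doubling the marker if the caller already appended it.
    if system_prompt.endswith(END_OF_PROMPT):
        system_prompt = system_prompt[: -len(END_OF_PROMPT)].rstrip()
    prompt_text = f"{system_prompt}{END_OF_PROMPT}{prompt_transcript.strip()}"
    return {"prompt_text": prompt_text, "tts_text": target_text.strip()}


inputs = build_zero_shot_inputs(
    "Bitte sprechen Sie mit deutschem Akzent. Friedrich.",
    "Hallo, mein Name ist Friedrich. Möchten Sie, dass ich Ihnen demonstriere, "
    "wie cool es klingt, mit einem deutschen Akzent zu sprechen?",
    "Never leave until tomorrow what you can do today. "
    "He who chases two rabbits at once will catch none.",
)
print(inputs["prompt_text"])
print(inputs["tts_text"])
```

The resulting `tts_text` and `prompt_text` would then go to the zero-shot call together with the German reference audio. Check the CosyVoice repository for the exact call signature of your version.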

Doing this in Zero Shot mode has given me much better results than using cross-lingual mode. I'm curious what everyone else's experience has been with other languages.

aluminumbox

FunAudioLLM org Dec 23, 2025

In inference_cross_lingual, the prompt is passed to the CFM but not to the LLM. So if your target text and prompt text are in different languages and you do not want the prompt wav to affect the rhythm of the generated wav, use inference_cross_lingual. If they are in the same language, use inference_zero_shot.
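
A minimal sketch of that rule of thumb, combined with the accent-keeping case reported at the top of this thread. The function `choose_mode` and its `keep_prompt_accent` flag are hypothetical and not part of CosyVoice. Only the two method names come from the library.

```python
def choose_mode(prompt_lang, target_lang, keep_prompt_accent=False):
    """Return the name of the inference method to try first.

    Same language: zero-shot, so the LLM also sees the prompt.
    Different languages: cross-lingual, so the prompt wav does not shape
    the rhythm, unless the accent should carry over (as reported above).
    """
    same_language = prompt_lang.strip().lower() == target_lang.strip().lower()
    if same_language or keep_prompt_accent:
        return "inference_zero_shot"
    return "inference_cross_lingual"


print(choose_mode("de", "de"))
print(choose_mode("de", "en"))
print(choose_mode("de", "en", keep_prompt_accent=True))
```

This only encodes the advice given in this thread. Which mode actually sounds better for a given language pair still needs listening tests.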

juang3d

Dec 30, 2025

When doing voice cloning, can we affect the expressiveness in some way?

I mean having the text spoken in a loud voice or a soft voice, or sounding angry or happy.

dolphinfan

Dec 31, 2025

> When doing voice cloning, can we affect the expressiveness in some way?
>
> I mean having the text spoken in a loud voice or a soft voice, or sounding angry or happy.

I haven't tried it yet, but you can experiment with wrapping the text you want to synthesize in HTML-style opening and closing tags:
<angry>SYNTHESIZED TEXT GOES HERE</angry>

https://github.com/FunAudioLLM/CosyVoice/issues/1729
https://funaudiollm.github.io/cosyvoice3/#Instructed%20Voice%20Generation
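
If you want to try several tags in a row, a small helper like the one below could build the wrapped strings. The name `wrap_with_tag` is made up for illustration. Whether the model responds to a tag such as `angry` is untested here, so treat the tag names as guesses to experiment with.

```python
def wrap_with_tag(text, tag):
    """Wrap text in an opening and closing tag, e.g. <angry>...</angry>."""
    # Keep tag names simple so the markup stays well-formed.
    if not tag.isidentifier():
        raise ValueError(f"unsupported tag name: {tag!r}")
    return f"<{tag}>{text}</{tag}>"


tts_text = wrap_with_tag("Never leave until tomorrow what you can do today.", "angry")
print(tts_text)
```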

juang3d

Dec 31, 2025

Oh, interesting, I'll try it.
