Clarification on YourTTS Training Procedure with VCTK

#2
by ZSJ123 - opened

Hello, I am currently training YourTTS using the official code (https://github.com/coqui-ai/TTS/blob/dev/recipes/vctk/yourtts/train_yourtts.py) provided for VCTK. However, I am not achieving the expected results. Could you please clarify the intended training procedure?
Specifically, should the model be first trained on LJSpeech for 1M steps and then fine-tuned on VCTK for 200K steps, as described in the paper, or should it be trained from scratch on VCTK only? Additionally, could you confirm the total number of training steps recommended?
Thank you very much for your guidance!

Hi,

You should follow the paper’s transfer-learning schedule, not VCTK-from-scratch.

Intended procedure (Experiment 1 in the paper):
Start from a model pre-trained 1M steps on LJSpeech, then continue training 200k steps on VCTK.
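In practice, the two stages can be chained by warm-starting the VCTK recipe from the LJSpeech checkpoint. A minimal sketch, assuming the `--restore_path` flag exposed by the Coqui 👩‍🏫 Trainer and hypothetical paths (the official recipe targets VCTK, so the LJSpeech stage needs an adapted copy of the script):

```shell
# Stage 1 (assumption): pre-train ~1M steps on LJSpeech with an
# LJSpeech-adapted copy of the YourTTS recipe.
python train_yourtts_ljspeech.py

# Stage 2: continue ~200k steps on VCTK, restoring the LJSpeech weights.
# --restore_path is the Trainer flag for warm-starting from a checkpoint;
# the checkpoint path below is illustrative.
python recipes/vctk/yourtts/train_yourtts.py \
    --restore_path /path/to/ljspeech_run/best_model.pth
```

With `--restore_path` the optimizer and step counter start fresh while the model weights are loaded, which is what you want for the transfer-learning stage.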

Extra fine-tune used in the paper:
After each experiment, we ran an additional ~50k-step fine-tune with the Speaker Consistency Loss (SCL). Include this stage if you want to match the paper's numbers exactly.
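For the SCL stage, the VITS/YourTTS model in Coqui TTS exposes a speaker-encoder loss through its model args. A hedged sketch of the relevant config fields (field names and the alpha value are assumptions; check the `VitsArgs` of your TTS version):

```python
# Hypothetical config fragment for the ~50k-step SCL fine-tune.
# `model_args` is the VitsArgs instance inside your VitsConfig.
model_args.use_speaker_encoder_as_loss = True   # turn on SCL
model_args.speaker_encoder_loss_alpha = 9.0     # loss weight (assumed value)
```

You would then resume training from the end-of-Experiment checkpoint with this config for roughly 50k more steps.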

Total steps if you replicate Exp 1 exactly:
~1.25M steps (≈1.0M LJS + 0.2M VCTK + 0.05M SCL).

Why not train only on VCTK?
VCTK alone has limited speaker and recording-condition diversity, so the model doesn't generalize as well, and the audio quality is not great either. We used transfer learning explicitly to address this. Expect weaker zero-shot results if you start from scratch on VCTK.

For even better results, I would recommend pre-training on a cleaner corpus than LJSpeech, if you have one available.

cshulby changed discussion status to closed

Thanks for the clarification, much appreciated!(●'◡'●)
