Interested in Supporting Improvements

#3
by Krajta5 - opened

Hey Thomcles,

Great work on the Chatterbox-TTS-Czech model! I've been testing it and the voice cloning is solid. The main issues I'm running into are:

  • Pronunciation of Czech diacritics — háčky like č, ř, š, ž, ě
  • Number reading — the model reads digits as raw values instead of spoken words (e.g. "1 2 3" instead of "jedna dva tři")

Both of these would need to be resolved for my use case.

I saw you mentioned the model was limited by available training data. I found this dataset that might help: https://datacollective.mozillafoundation.org/datasets/cmj8u3oyh004xnxxbd9uih97g — it's Mozilla Common Voice Czech with approximately 270 hours of speech.

Two quick questions:

  • Are you still working on improving this model?
  • If you'd be willing to push it to a more usable state, I'd be happy to support the effort financially.

Let me know what you think! And if you're Czech too, we can switch to Czech.

Cheers,
David

Thanks for your feedback. I did train the model on a very small dataset, and it is still under-trained.
I created this dataset for a Czech colleague who produces audiobooks. He said that, with normalization, it sounded as good as ElevenLabs.

Indeed, the base dataset does not contain any numbers, so the model has not learned to associate digits with their spoken forms. The same goes for diacritics.
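
Until the model sees digits in training, one workaround is to spell numbers out before the text reaches the TTS. Here is a minimal sketch of that kind of normalization. The function names `number_to_czech` and `normalize_numbers` are made up for illustration and are not part of Chatterbox. It only handles 0 to 99, always uses the basic cardinal form (no gender or case inflection, so "1 kniha" would come out as "jedna kniha" but "2 knihy" as "dva knihy"), and leaves longer digit runs untouched.

```python
import re

# Basic Czech cardinals (nominative, default forms only).
UNITS = ["nula", "jedna", "dva", "tři", "čtyři",
         "pět", "šest", "sedm", "osm", "devět"]
TEENS = ["deset", "jedenáct", "dvanáct", "třináct", "čtrnáct",
         "patnáct", "šestnáct", "sedmnáct", "osmnáct", "devatenáct"]
TENS = {2: "dvacet", 3: "třicet", 4: "čtyřicet", 5: "padesát",
        6: "šedesát", 7: "sedmdesát", 8: "osmdesát", 9: "devadesát"}


def number_to_czech(n: int) -> str:
    """Spell out an integer from 0 to 99 in Czech (hypothetical helper)."""
    if not 0 <= n <= 99:
        raise ValueError("this sketch only covers 0..99")
    if n < 10:
        return UNITS[n]
    if n < 20:
        return TEENS[n - 10]
    tens, unit = divmod(n, 10)
    if unit == 0:
        return TENS[tens]
    return f"{TENS[tens]} {UNITS[unit]}"


def normalize_numbers(text: str) -> str:
    """Replace 1- and 2-digit numbers with words; leave longer runs as-is."""
    def repl(match: re.Match) -> str:
        token = match.group(0)
        if len(token) > 2:
            return token  # out of range for this sketch
        return number_to_czech(int(token))

    return re.sub(r"[0-9]+", repl, text)


print(normalize_numbers("1 2 3"))
```

A real pipeline would also need larger numbers, ordinals, dates, and inflection, so treat this as a starting point rather than a finished normalizer.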

The model still has a lot of room for improvement, but I am no longer working on it.

If you want to discuss your specific case, contact me at my email address, and I can see if I can help you: cyprienoucortex@gmail.com

By the way, my Czech is very bad; I'm working on improving it ;)

Thomcles
