About the architecture
Hi @YatharthS,
The model is great, but I have some questions. I'm experimenting with an LLM-based TTS architecture and currently getting very poor and unstable speech, so I'm trying to understand the right architecture choices:
- From your experience, which neural audio codec is best suited for LLM prediction (especially lightweight setups with minimal or single codebooks that are LLM-friendly)?
- What properties or approaches make it easier for an LLM to learn and generate intelligible speech?
- How much training data (in hours) is realistically required to reach intelligible quality?
- What are the most effective techniques to speed up training?
- If you were building an LLM-based TTS system today with limited compute, what end-to-end architecture would you recommend?
Thanks in advance🙏🙏
Hey @PatnaikAshish,
I would say a codec like neucodec would be good. It has a good compression factor (50 tokens/s with a single codebook) and a permissive license. You can also check out FocalCodec, Kanade-tokenizer, or LSCodec. I'm working on LinaCodec v2, which should have a much higher compression factor with similar if not better quality, but it hasn't been released yet.
A smaller codebook size would definitely be faster to learn. Higher compression (i.e. fewer tokens/s) should make learning faster as well. Codecs like Kanade and LinaCodec v2 compress audio into semantic tokens only, which are far easier for the model to learn since it doesn't need to predict acoustic information.
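To make the compression point concrete, here is some rough sequence-length math. The 50 tokens/s single-codebook figure is from the neucodec recommendation above; the multi-codebook and higher-compression numbers are just illustrative assumptions:

```python
# Why fewer tokens/s and fewer codebooks make the LLM's job easier:
# the autoregressive sequence it must model gets much shorter.

def audio_tokens(seconds: float, tokens_per_sec: int, codebooks: int = 1) -> int:
    """Tokens the LLM must predict for one clip, assuming codebooks are flattened."""
    return int(seconds * tokens_per_sec) * codebooks

clip = 10.0  # a 10-second utterance
print(audio_tokens(clip, 75, codebooks=8))  # hypothetical multi-codebook acoustic codec
print(audio_tokens(clip, 50, codebooks=1))  # single codebook at 50 tokens/s
print(audio_tokens(clip, 25, codebooks=1))  # hypothetical higher-compression semantic codec
```

The 8-codebook setup produces 6000 tokens for the same 10 seconds that the single-codebook codec covers in 500, which is a big difference in both training cost and error accumulation at inference.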
Depends on the codec and model size, but roughly 100 hours should be enough for the model to learn intelligible speech. You definitely need 1k+ hours to get low WER, however, and if you want usable voice cloning, at least 10k+ hours of diverse audio.
Use a good library like LitGPT for training. You can also use modded-nanogpt for the fastest training possible, but that will definitely require modifications to work.
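One generic speed-up worth knowing regardless of library (this is a sketch, not LitGPT- or modded-nanogpt-specific code): sequence packing. TTS clips vary a lot in length, so packing several short token sequences into each fixed-length training row avoids wasting compute on padding. The EOS id is a made-up placeholder:

```python
# Greedy sequence packing: concatenate clips (each terminated by EOS)
# into rows of exactly seq_len tokens, padding only the row remainder.
# Assumes every individual clip fits within seq_len.

EOS = 0  # hypothetical end-of-sequence / pad id

def pack_sequences(seqs, seq_len):
    rows, cur = [], []
    for seq in seqs:
        item = seq + [EOS]
        if len(cur) + len(item) > seq_len:
            rows.append(cur + [EOS] * (seq_len - len(cur)))  # close out this row
            cur = []
        cur.extend(item)
    if cur:
        rows.append(cur + [EOS] * (seq_len - len(cur)))
    return rows

rows = pack_sequences([[5, 6, 7], [8, 9], [1, 2, 3, 4]], seq_len=8)
# every row comes out exactly seq_len long
```

With padding to the longest clip in a batch instead, a lot of each step can be spent on pad tokens; packing keeps nearly every position a real training signal.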
I think a basic LLM TTS model architecture is fine, since there are lots of resources and libraries for training the LLM part and for running inference as well.
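For what "basic LLM TTS" means in terms of the data layout, here is a minimal sketch: a plain decoder-only LM trained on text tokens followed by codec audio tokens in one shared vocabulary. All the ids and special tokens below are hypothetical:

```python
# One training example is "<bos> text <sep> audio <eos>" in a single
# flat vocab: audio codec ids are shifted past the text ids so the
# same LM head can predict both.

BOS, SEP, EOS = 0, 1, 2        # hypothetical special tokens
TEXT_VOCAB = 1000              # text ids occupy 3..1002 (illustrative)
AUDIO_OFFSET = 3 + TEXT_VOCAB  # codec ids start at 1003

def build_sequence(text_ids, audio_codes):
    """Flatten one (text, audio) pair into a single LM training sequence."""
    audio_ids = [AUDIO_OFFSET + c for c in audio_codes]
    return [BOS] + text_ids + [SEP] + audio_ids + [EOS]

seq = build_sequence([10, 42, 7], [100, 205, 3, 3])
```

At inference you feed `<bos> text <sep>` and sample audio ids autoregressively until `<eos>`, then hand the de-offset codes to the codec decoder.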
@YatharthS Thanks for guiding me, bro