Space: psando/nanoTTS

Apply for a GPU community grant: Academic project

by psando - opened May 17

psando (Owner), May 17, edited May 19

I'm running a number of experiments to understand the limits of text-to-speech modeling. In particular, I'm interested in:

- How simple can the model be architecturally?
- How small can the model be?
- Is less than 500 hrs of speech data enough?

Currently, I am using a 93M-parameter nanoGPT to model LibriTTS audio tokens from WavTokenizer. The model in the demo Space is trained on only 245 hrs of LibriTTS speech. The training code and checkpoints are on GitHub and in the Hugging Face demo Space. I am also sharing my results on Twitter. Here is a sample Twitter thread where I demo and describe a model trained on 54 hrs of LibriTTS speech data. Eventually, I expect to submit the findings from these experiments to a conference.

I'm a PhD student at the University of Maryland, and I believe the community would benefit from hearing the audio quality that limited, smaller models can produce. It currently takes ~9 minutes to generate a short ~5-7 second clip on CPU. The model is only 360 MiB, so any GPU (even one with 8 GB of VRAM) would be fine.
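
For anyone who wants to sanity-check the numbers above, here is a rough back-of-the-envelope sketch. It assumes the weights are stored as 32-bit floats (4 bytes per parameter), which I have not stated explicitly; the helper names `weights_size_mib` and `real_time_factor` are just illustrative and are not part of the training code.

```python
# Back-of-the-envelope check of the figures quoted in this post.
# Assumption: weights are stored in fp32 (4 bytes per parameter).

def weights_size_mib(n_params: int, bytes_per_param: int = 4) -> float:
    """Approximate size of the raw weights in MiB."""
    return n_params * bytes_per_param / 2**20


def real_time_factor(generation_seconds: float, clip_seconds: float) -> float:
    """Seconds of compute spent per second of generated audio."""
    return generation_seconds / clip_seconds


# 93M parameters, as stated above.
size_mib = weights_size_mib(93_000_000)

# ~9 minutes of CPU time for a ~5-7 second clip.
rtf_long_clip = real_time_factor(9 * 60, 7)   # best case: 7-second clip
rtf_short_clip = real_time_factor(9 * 60, 5)  # worst case: 5-second clip

print(f"weights: about {size_mib:.0f} MiB")
print(f"CPU real-time factor: about {rtf_long_clip:.0f}x to {rtf_short_clip:.0f}x")
```

Under that assumption the raw weights come out close to the 360 MiB figure, and CPU generation is on the order of a hundred times slower than real time, which is why even a small GPU would make the demo much more usable.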
