Apply for a GPU community grant: Academic project

#1
by psando - opened

I'm running a number of experiments to understand the limits of text-to-speech modeling. In particular, I'm interested in:

  • how simple can the model be architecturally?
  • how small can the model be?
  • is less than 500 hrs of speech data enough?

Currently, I am using a 93M-parameter nanoGPT to model WavTokenizer audio tokens of LibriTTS speech. The model in the demo Space is trained on only 245 hrs of speech from LibriTTS. The training code and checkpoints are on GitHub and on the Hugging Face demo Space. I am also sharing my results on Twitter. Here is a sample Twitter thread where I demo and describe a model trained on 54 hrs of LibriTTS speech data. Eventually, I expect to submit the findings from these experiments to a conference.

I'm a PhD student at the University of Maryland, and I believe the community would benefit from hearing the audio quality that small models trained on limited data can achieve. It currently takes ~9 minutes to generate a short ~5-7 second clip on CPU. The model is only 360 MiB, so any GPU (even one with 8 GB of VRAM) would be fine.
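
As a rough sanity check on the memory claim, here is a minimal sketch that estimates the size of the weights from the parameter count. It assumes the weights are stored as 32-bit floats (4 bytes each), which is an assumption and not something stated above; `estimate_weight_mib` is a hypothetical helper name. It ignores activations, the KV cache, and framework overhead, so it is a lower bound on actual GPU memory use.

```python
def estimate_weight_mib(num_params: int, bytes_per_param: int = 4) -> float:
    """Approximate size of the model weights in MiB.

    bytes_per_param=4 assumes fp32 storage (an assumption);
    use 2 for fp16/bf16.
    """
    return num_params * bytes_per_param / 2**20


if __name__ == "__main__":
    num_params = 93_000_000  # "93M parameter" is a rounded figure
    print(f"fp32 weights: {estimate_weight_mib(num_params):.1f} MiB")
    print(f"fp16 weights: {estimate_weight_mib(num_params, 2):.1f} MiB")
```

Under the fp32 assumption this gives roughly 355 MiB, in the same ballpark as the 360 MiB figure, and well under 8 GB either way.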
