Made a Gradio demo for Veena TTS
I made a Gradio demo for Veena TTS. It is simple and meant for testing, and it has some limitations. You can type or paste text into a textbox, choose from 4 speakers, and generate audio that plays on the Gradio page itself and is also saved as an MP3 in a specific folder. I added a slider to control speed, but it only changes the playback speed, so a lower speed gives a male-like voice and a higher speed gives a fast, cartoonish, female-like voice. If anyone is interested in trying it out, let me know.
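To illustrate why the speed slider also shifts the voice, here is a minimal sketch of a naive playback-speed change. It is not the demo's actual code: the function names, the test tone, and the 24000 Hz sample rate are assumptions chosen for illustration. Resampling the waveform and playing it back at the original rate scales duration by 1/speed and pitch by speed.

```python
import math

def change_playback_speed(samples, speed):
    """Naive speed change by linear-interpolation resampling.

    Played back at the original sample rate, the result is shorter and
    higher-pitched for speed > 1, longer and lower-pitched for speed < 1.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    n_out = max(1, round(len(samples) / speed))
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = min(i * speed, last)  # read position in the original signal
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out

def estimate_frequency(samples, sample_rate):
    """Rough pitch estimate: upward zero crossings per second."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    return crossings * sample_rate / len(samples)

sample_rate = 24000  # assumed value, for illustration only
# One second of a 220 Hz sine wave standing in for a voice.
tone = [math.sin(2 * math.pi * 220.0 * n / sample_rate) for n in range(sample_rate)]

fast = change_playback_speed(tone, 1.5)   # shorter, pitch rises toward 330 Hz
slow = change_playback_speed(tone, 0.75)  # longer, pitch drops toward 165 Hz

print(len(fast), round(estimate_frequency(fast, sample_rate)))
print(len(slow), round(estimate_frequency(slow, sample_rate)))
```

Changing speed without changing pitch would need a time-stretching method instead of plain resampling, which this sketch does not attempt.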
Limitations:
1.) It can only generate up to 19 seconds of audio, mostly due to the token limit. I guess that would not be the case in their commercial model.
2.) Hallucinations, which sometimes cause words or sentences to be skipped.
3.) Sometimes it generates very rapid speech, because it does not always recognize the Hindi punctuation mark. Using " . " as punctuation instead of " । " helps sometimes.
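The punctuation workaround from limitation 3 can be applied automatically before the text reaches the model. This is only a sketch under assumptions: `normalize_hindi_punctuation` is a hypothetical helper, and the exact replacement string (a period followed by one space) is a guess at what works best, since results reportedly vary.

```python
import re

DANDA = "\u0964"         # Devanagari danda "।"
DOUBLE_DANDA = "\u0965"  # Devanagari double danda "॥"

def normalize_hindi_punctuation(text):
    """Replace danda marks (and any spaces around them) with '. '."""
    text = re.sub(f"\\s*[{DANDA}{DOUBLE_DANDA}]+\\s*", ". ", text)
    return text.strip()

print(normalize_hindi_punctuation("नमस्ते। आप कैसे हैं।"))
```

Text that contains no danda passes through unchanged apart from trimmed outer whitespace.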
Besides all this, I think you guys are on the right path to making something big, because Suno started the same way (initially it was called "Bark", I think) and later improved a lot.
How to increase the length of the audio file?
If I give it multiple sentences for one audio file, it cuts the output short. I have only tried English text, though.
I played with this model a lot, but there is something strange about it: the voice changes every time, no matter which speaker you have assigned. I could generate longer audio by limiting each generation to one sentence and running the sentences as a batch, but every sentence comes out in a different voice. I think this is intentional, to keep people from generating long audio.
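The one-sentence-at-a-time approach described above might look roughly like the sketch below. Everything here is an assumption for illustration: `split_sentences` and `synthesize_long` are hypothetical helpers, and `synthesize` stands for whatever function wraps the actual model call and returns a list of audio samples. The sketch does nothing about the voice changing between sentences.

```python
import re

def split_sentences(text):
    """Naive split on ., !, ? or the danda, keeping the terminator."""
    parts = re.findall(r"[^.!?\u0964]+[.!?\u0964]*", text)
    return [p.strip() for p in parts if p.strip()]

def synthesize_long(text, synthesize, sample_rate, pause_seconds=0.25):
    """Generate one clip per sentence and join them with short silences."""
    silence = [0.0] * int(sample_rate * pause_seconds)
    audio = []
    for i, sentence in enumerate(split_sentences(text)):
        if i > 0:
            audio.extend(silence)
        audio.extend(synthesize(sentence))
    return audio

def fake_synthesize(sentence):
    """Stand-in for the real model call: 10 samples per character."""
    return [0.1] * (10 * len(sentence))

clip = synthesize_long("One. Two! तीन।", fake_synthesize, sample_rate=100)
print(split_sentences("One. Two! तीन।"), len(clip))
```

The splitter is deliberately simple and will also break on decimals such as "3.5" or on abbreviations, so real text may need a more careful sentence segmenter.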