Generate audio and subtitles from text
Convert a source voice to a target voice
Generate singing voice from lyrics, duration, and pitch
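
As a rough illustration of the singing-voice task, the three inputs (lyrics, duration, pitch) can be thought of as parallel sequences with one entry per note. The sketch below is only an assumption about how such input might be organized, not any particular toolkit's API: the names `Note`, `build_score`, `note_onsets`, and `midi_to_hz` are hypothetical, durations are assumed to be in seconds, and pitch is assumed to be given as MIDI note numbers.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Note:
    """One sung note (hypothetical container, not a library type)."""
    lyric: str        # syllable or word sung on this note
    duration: float   # length in seconds (assumed unit)
    midi_pitch: int   # MIDI note number, where 69 is A4 (assumed encoding)


def midi_to_hz(midi_pitch: int) -> float:
    """Convert a MIDI note number to frequency in Hz (equal temperament, A4 = 440 Hz)."""
    return 440.0 * 2.0 ** ((midi_pitch - 69) / 12)


def build_score(lyrics: Sequence[str],
                durations: Sequence[float],
                pitches: Sequence[int]) -> List[Note]:
    """Zip the three parallel inputs into a list of notes, checking that they align."""
    if not (len(lyrics) == len(durations) == len(pitches)):
        raise ValueError("lyrics, durations, and pitches must have the same length")
    return [Note(l, d, p) for l, d, p in zip(lyrics, durations, pitches)]


def note_onsets(score: Sequence[Note]) -> List[float]:
    """Return the start time in seconds of each note, assuming notes are sung back to back."""
    onsets = []
    t = 0.0
    for note in score:
        onsets.append(t)
        t += note.duration
    return onsets
```

A score built this way would then be handed to whatever synthesis model is in use; the point of the sketch is only that the three inputs must stay aligned note by note.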