Generate captions for images
Generate images from text prompts
Generate audio from text using the VITS model