Speech-to-speech model
Multilingual model
Generate embeddings of images
Transform text into a 768-dimensional vector
Generate image embeddings
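
To make the text-to-vector entry above concrete, here is a minimal sketch of the interface it implies: a string goes in, a fixed-length list of 768 floats comes out. It is not the model these notes describe. The function name embed_text and the hashing scheme are assumptions introduced for illustration, with feature hashing standing in for a learned encoder so the example runs with the standard library alone.

```python
import hashlib
import math

DIM = 768  # vector size named in the list above


def embed_text(text: str, dim: int = DIM) -> list[float]:
    """Toy stand-in for a learned encoder (hypothetical helper).

    Each whitespace-separated token is hashed into one of `dim` buckets
    with a pseudo-random sign (the "hashing trick").
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim  # bucket for this token
        sign = 1.0 if digest[4] % 2 == 0 else -1.0       # sign taken from the hash
        vec[index] += sign
    norm = math.sqrt(sum(x * x for x in vec))
    # L2-normalize so dot products behave like cosine similarity;
    # an all-zero vector (e.g. empty input) is returned unchanged.
    return [x / norm for x in vec] if norm > 0 else vec


vector = embed_text("a multilingual speech model")
print(len(vector))  # 768
```

A real encoder would replace the body of embed_text with a model call; the shape of the result (one vector of length 768 per input text) is the only part taken from the entry above.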