Multimodal

Process images, audio, and video with open-source multimodal models

Recipes

Replace GPT-4V with LLaVA for image understanding tasks

Coming Soon

Generate images using Stable Diffusion instead of DALL-E

Coming Soon

Use a local Whisper implementation for speech-to-text

Coming Soon

Process video content with open-source video models

Coming Soon