Multimodal

Process images, audio, and video with open-source multimodal models

Recipes

GPT-4V → LLaVA

Replace GPT-4V with LLaVA for image understanding tasks

Coming Soon

DALL-E → Stable Diffusion

Generate images using Stable Diffusion instead of DALL-E

Coming Soon

Whisper → Whisper.cpp

Use local Whisper implementation for speech-to-text

Coming Soon

Video Understanding

Process video content with open-source video models

Coming Soon