Multimodal AI Models - a adarshzolekar Collection

updated Jan 23

Purpose: Models that understand text + image + audio together.

llava-hf/llava-1.5-7b-hf
Image-Text-to-Text • 7B params • Updated Jun 6, 2025 • 3.54M downloads • 362 likes

Salesforce/blip-image-captioning-base
Image-to-Text • Updated Feb 3, 2025 • 2.46M downloads • 857 likes

google/pix2struct-base
Image-to-Text • 0.3B params • Updated Dec 24, 2023 • 3.66k downloads • 79 likes

microsoft/kosmos-2-patch14-224
Image-to-Text • 2B params • Updated Nov 28, 2023 • 168k downloads • 184 likes

openbmb/MiniCPM-V-4_5
Image-Text-to-Text • 9B params • Updated Mar 10 • 131k downloads • 1.09k likes