Purpose: Models that jointly understand text, images, and audio.
- llava-hf/llava-1.5-7b-hf
  Image-Text-to-Text • 7B • 3.01M downloads • 359 likes
- Salesforce/blip-image-captioning-base
  Image-to-Text • 2.26M downloads • 851 likes
- google/pix2struct-base
  Image-to-Text • 0.3B • 2.7k downloads • 79 likes
- microsoft/kosmos-2-patch14-224
  Image-to-Text • 168k downloads • 184 likes
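
A minimal sketch of how one of the listed models might be loaded, assuming the Hugging Face `transformers` library is installed. `MODEL_TASKS` and `load_pipeline` are hypothetical names introduced here for illustration. The pipeline task strings are assumed to be the lowercase form of the task tags in the list, and whether a given task name is available depends on the installed `transformers` version.

```python
# Model IDs and task tags copied from the list above.
# Assumption: each task tag, lowercased, is a valid transformers pipeline task.
MODEL_TASKS = {
    "llava-hf/llava-1.5-7b-hf": "image-text-to-text",
    "Salesforce/blip-image-captioning-base": "image-to-text",
    "google/pix2struct-base": "image-to-text",
    "microsoft/kosmos-2-patch14-224": "image-to-text",
}


def load_pipeline(model_id: str):
    """Build a transformers pipeline for a listed model (hypothetical helper)."""
    task = MODEL_TASKS[model_id]  # raises KeyError for models not in the list
    # Imported lazily so the lookup table works without transformers installed.
    from transformers import pipeline

    # Downloads the model weights on first use.
    return pipeline(task, model=model_id)


# Example usage (commented out because it downloads weights):
# captioner = load_pipeline("Salesforce/blip-image-captioning-base")
# print(captioner("photo.jpg"))  # "photo.jpg" is a placeholder image path
```

The usage lines show only the BLIP captioning model, which takes a bare image. The other entries may need a text prompt, a chat-style message list, or their model-specific classes, so treat the helper as a starting point, not a uniform interface.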