adarshzolekar's Collections
Multimodal AI Models
Updated 2 days ago
Purpose: Models that jointly understand text, images, and audio.
llava-hf/llava-1.5-7b-hf • Image-Text-to-Text • 7B • Updated Jun 6, 2025 • 1.15M downloads • 336 likes
Salesforce/blip-image-captioning-base • Image-to-Text • Updated Feb 3, 2025 • 1.97M downloads • 839 likes
google/pix2struct-base • Image-to-Text • 0.3B • Updated Dec 24, 2023 • 4.44k downloads • 76 likes
microsoft/kosmos-2-patch14-224 • Image-to-Text • 2B • Updated Nov 28, 2023 • 140k downloads • 182 likes
openbmb/MiniCPM-V-4_5 • Image-Text-to-Text • 9B • Updated Dec 18, 2025 • 64k downloads • 1.05k likes