adarshzolekar's Collections

Multimodal AI Models

updated Jan 23

Purpose: Models that jointly process text, images, and audio.


  • llava-hf/llava-1.5-7b-hf

    Image-Text-to-Text • 7B • Updated Jun 6, 2025 • 4.07M downloads • 344 likes

  • Salesforce/blip-image-captioning-base

    Image-to-Text • Updated Feb 3, 2025 • 3.34M downloads • 845 likes

  • google/pix2struct-base

    Image-to-Text • 0.3B • Updated Dec 24, 2023 • 3.07k downloads • 76 likes

  • microsoft/kosmos-2-patch14-224

    Image-to-Text • Updated Nov 28, 2023 • 176k downloads • 184 likes

  • openbmb/MiniCPM-V-4_5

    Image-Text-to-Text • 9B • Updated 2 days ago • 76.6k downloads • 1.07k likes