adarshzolekar's Collections

Multimodal AI Models

updated 2 days ago

Purpose: Models that understand text, images, and audio together. The models listed below are all tagged Image-to-Text or Image-Text-to-Text.


  • llava-hf/llava-1.5-7b-hf

    Image-Text-to-Text • 7B params • Updated Jun 6, 2025 • 1.15M downloads • 336 likes

  • Salesforce/blip-image-captioning-base

    Image-to-Text • Updated Feb 3, 2025 • 1.97M downloads • 839 likes

  • google/pix2struct-base

    Image-to-Text • 0.3B params • Updated Dec 24, 2023 • 4.44k downloads • 76 likes

  • microsoft/kosmos-2-patch14-224

    Image-to-Text • 2B params • Updated Nov 28, 2023 • 140k downloads • 182 likes

  • openbmb/MiniCPM-V-4_5

    Image-Text-to-Text • 9B params • Updated Dec 18, 2025 • 64k downloads • 1.05k likes
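
To show how an Image-to-Text entry from this list might be used, here is a minimal sketch that captions a local image with Salesforce/blip-image-captioning-base. It assumes the `transformers`, `torch`, and `Pillow` packages are installed. The helper name `caption_image`, the command-line usage, and the 30-token limit are illustrative choices, not part of the collection.

```python
import sys

# Model ID taken from the collection list above.
MODEL_ID = "Salesforce/blip-image-captioning-base"


def caption_image(image_path: str, max_new_tokens: int = 30) -> str:
    """Return a short caption for the image at image_path (hypothetical helper)."""
    # Imported lazily so the module can be loaded without the heavy dependencies.
    from PIL import Image
    from transformers import BlipForConditionalGeneration, BlipProcessor

    # Downloads the weights from the Hub on first use.
    processor = BlipProcessor.from_pretrained(MODEL_ID)
    model = BlipForConditionalGeneration.from_pretrained(MODEL_ID)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output_ids[0], skip_special_tokens=True)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python caption.py photo.jpg
    print(caption_image(sys.argv[1]))
```

The other models in the list use different processor and model classes, so this sketch applies to the BLIP entry only; each model's card describes its own loading code.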