VLMs - a shail-2512 Collection

VLMs

Updated Dec 2, 2024

HuggingFaceTB/SmolVLM-Instruct

Image-Text-to-Text • 2B • Updated Apr 8, 2025 • 27.3k • 587

microsoft/OmniParser

Image-Text-to-Text • Updated Dec 2, 2024 • 230 • 1.71k

vidore/colsmolvlm-v0.1

Visual Document Retrieval • Updated Mar 14, 2025 • 76 • 55

meta-llama/Llama-3.2-11B-Vision-Instruct

Image-Text-to-Text • 11B • Updated Dec 4, 2024 • 217k • 1.59k

Qwen/Qwen2-VL-7B-Instruct

Image-Text-to-Text • 8B • Updated Feb 6, 2025 • 3.23M • 1.28k

mistral-experimental/pixtral-12b

Image-Text-to-Text • 13B • Updated Jan 27, 2025 • 131k • 105

HuggingFaceM4/Idefics3-8B-Llama3

Image-Text-to-Text • Updated Dec 2, 2024 • 394k • 304

allenai/Molmo-7B-O-0924

Image-Text-to-Text • 8B • Updated Oct 9, 2025 • 1.17k • 163