Vision Language Models - a sbarman25 Collection

Vision Language Models

Updated Feb 17

Qwen/Qwen2-VL-7B-Instruct

Image-Text-to-Text • 8B • Updated Feb 6, 2025 • 1.82M • 1.28k
vidore/colpali_train_set

Viewer • Updated Jun 20, 2025 • 119k • 6.27k • 91

Note: Dataset for training or fine-tuning VLMs.
vidore/colpali

Visual Document Retrieval • Updated Nov 24, 2025 • 7.53k • 482

Note: Not a generative model. A retrieval model that combines ColBERT-style late interaction with PaliGemma.
JPEG-LM: LLMs as Image Generators with Canonical Codec Representations

Paper • 2408.08459 • Published Aug 15, 2024 • 45
pixparse/idl-wds

Viewer • Updated Mar 29, 2024 • 3.41M • 7.59k • 194
pixparse/pdfa-eng-wds

Viewer • Updated Mar 29, 2024 • 7.1k • 7.06k • 159
stepfun-ai/GOT-OCR2_0

Image-Text-to-Text • 0.7B • Updated Feb 4, 2025 • 185k • 1.54k
microsoft/Florence-2-large

Image-Text-to-Text • 0.8B • Updated Aug 4, 2025 • 660k • 1.83k
microsoft/Phi-3.5-vision-instruct

Image-Text-to-Text • 4B • Updated Dec 10, 2025 • 1.44M • 736
yifeihu/TFT-ID-1.0

Image-Text-to-Text • 0.8B • Updated Sep 29, 2024 • 12 • 106
mistralai/Pixtral-12B-2409

Updated 28 days ago • 5.67k • 689
MMMU/MMMU_Pro

Benchmark • Updated 21 days ago • 5.19k • 18.3k • 60
allenai/MolmoE-1B-0924

Image-Text-to-Text • Updated Apr 24, 2025 • 1.14k • 158
MrLight/dse-qwen2-2b-mrl-v1

Visual Document Retrieval • Updated Feb 26, 2025 • 36.4k • 68
microsoft/Phi-3-vision-128k-instruct

Text Generation • 4B • Updated Dec 10, 2025 • 247k • 970
Qwen/Qwen2-VL-2B-Instruct

Image-Text-to-Text • 2B • Updated Jan 12, 2025 • 3.7M • 512
vidore/colqwen2-v0.1

Visual Document Retrieval • Updated Mar 21, 2025 • 6.8k • 195
nvidia/NVLM-D-72B

Image-Text-to-Text • 79B • Updated Jan 14, 2025 • 148k • 776
deepseek-ai/DeepSeek-OCR

Image-Text-to-Text • 3B • Updated Nov 4, 2025 • 2.2M • 3.29k
allenai/olmOCR-2-7B-1025-FP8

Image-Text-to-Text • 8B • Updated Feb 19 • 663k • 242
SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion

Paper • 2503.11576 • Published Mar 14, 2025 • 164