Datasets
  MMInstruction/Clevr_CoGenT_TrainA_R1 Viewer • Updated Feb 13, 2025 • 37.8k • 67 • 48
Embedding
  nvidia/MM-Embed 8B • Updated Nov 6, 2024 • 327 • 66
  jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images Paper • 2412.08802 • Published Dec 11, 2024 • 7
  nvidia/NV-Embed-v2 Feature Extraction • 8B • Updated Jul 21, 2025 • 24.5k • 513
  Qwen/Qwen3-VL-Embedding-2B Sentence Similarity • 2B • Updated Apr 16 • 978k • 426
VLMs
  Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution Paper • 2409.12191 • Published Sep 18, 2024 • 80
  Multimodal Latent Language Modeling with Next-Token Diffusion Paper • 2412.08635 • Published Dec 11, 2024 • 50
  ATH-MaaS/Ovis2-2B Image-Text-to-Text • 2B • Updated Aug 15, 2025 • 221 • 60
  DAMO-NLP-SG/VideoLLaMA3-2B Video-Text-to-Text • 2B • Updated Sep 3, 2025 • 2.38k • 21
Text-to-Image
  black-forest-labs/FLUX.1-dev Text-to-Image • Updated Jun 27, 2025 • 959k • 13.5k
CLIP series
  Wasserstein Contrastive Representation Distillation Paper • 2012.08674 • Published Dec 15, 2020
  nvidia/MM-Embed 8B • Updated Nov 6, 2024 • 327 • 66
  google/siglip2-base-patch16-224 Zero-Shot Image Classification • 0.4B • Updated Feb 21, 2025 • 410k • 111
LLMs
  Phi-4 Technical Report Paper • 2412.08905 • Published Dec 12, 2024 • 124
  Qwen/Qwen2.5-Omni-7B Any-to-Any • 11B • Updated Apr 30, 2025 • 614k • 1.91k