MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset Paper • 2605.21272 • Published May 20 • 4
MONET - Massive Open Non-redundant, Enriched, Text-to-image Collection A curated, deduped & recaptioned open image–text dataset of 104.9M samples released under the Apache 2.0 licence. https://huggingface.co/blog/jasperai/ • 4 items • Updated 29 days ago • 11
Jagle Collection Jagle: Building a Large-Scale Japanese Multimodal Post-Training Dataset for Vision–Language Models • 5 items • Updated Apr 12 • 2
MobileCLIP2 Collection MobileCLIP2: Mobile-friendly image-text models with SOTA zero-shot capabilities trained on DFNDR-2B • 30 items • Updated Apr 23 • 64
STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation Paper • 2605.08029 • Published May 8 • 12
Continuous-Time Distribution Matching for Few-Step Diffusion Distillation Paper • 2605.06376 • Published May 7 • 27
SenseNova-U1 Collection SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-Unify Architecture • 10 items • Updated 14 days ago • 74
GenLIP Collection Model weights of paper "Let ViT Speak: Generative Language-Image Pre-training" • 6 items • Updated May 5 • 8
World-R1: Reinforcing 3D Constraints for Text-to-Video Generation Paper • 2604.24764 • Published Apr 27 • 119
AVControl: Efficient Framework for Training Audio-Visual Controls Paper • 2603.24793 • Published Mar 25 • 30