ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder • Paper 2510.18795 • Published Oct 21
InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue • Paper 2510.13747 • Published Oct 15
Decoupled Global-Local Alignment for Improving Compositional Understanding • Paper 2504.16801 • Published Apr 23