Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding • Paper • 2601.10611 • Published 13 days ago • 26
LTX-2: Efficient Joint Audio-Visual Foundation Model • Paper • 2601.03233 • Published 22 days ago • 137
Moshi: a speech-text foundation model for real-time dialogue • Paper • 2410.00037 • Published Sep 17, 2024 • 11
Vision-Speech Models: Teaching Speech Models to Converse about Images • Paper • 2503.15633 • Published Mar 19, 2025 • 2
ARC-Encoder: learning compressed text representations for large language models • Paper • 2510.20535 • Published Oct 23, 2025 • 8
CASA: Cross-Attention via Self-Attention for Efficient Vision-Language Fusion • Paper • 2512.19535 • Published Dec 22, 2025 • 12