The Script is All You Need: An Agentic Framework for Long-Horizon Dialogue-to-Cinematic Video Generation Paper • 2601.17737 • Published 6 days ago • 53
Can LLMs Clean Up Your Mess? A Survey of Application-Ready Data Preparation with LLMs Paper • 2601.17058 • Published 9 days ago • 176
daVinci-Dev: Agent-native Mid-training for Software Engineering Paper • 2601.18418 • Published 5 days ago • 120
Depth Anything 3: Recovering the Visual Space from Any Views Paper • 2511.10647 • Published Nov 13, 2025 • 99
Visual Representation Alignment for Multimodal Large Language Models Paper • 2509.07979 • Published Sep 9, 2025 • 84
OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning Paper • 2509.01644 • Published Sep 1, 2025 • 34
MolmoAct: Action Reasoning Models that can Reason in Space Paper • 2508.07917 • Published Aug 11, 2025 • 44
Enhanced Arabic Text Retrieval with Attentive Relevance Scoring Paper • 2507.23404 • Published Jul 31, 2025 • 3
Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers Paper • 2507.10787 • Published Jul 14, 2025 • 13
AnimaX: Animating the Inanimate in 3D with Joint Video-Pose Diffusion Models Paper • 2506.19851 • Published Jun 24, 2025 • 60
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset Paper • 2505.09568 • Published May 14, 2025 • 99
Vision-Language-Action Models: Concepts, Progress, Applications and Challenges Paper • 2505.04769 • Published May 7, 2025 • 10