VaseVQA: Multimodal Agent and Benchmark for Ancient Greek Pottery Paper • 2509.17191 • Published Sep 21, 2025 • 1
3D CoCa v2: Contrastive Learners with Test-Time Search for Generalizable Spatial Intelligence Paper • 2601.06496 • Published 27 days ago • 1
CogFlow: Bridging Perception and Reasoning through Knowledge Internalization for Visual Mathematical Problem Solving Paper • 2601.01874 • Published Jan 5 • 19
DriveGen3D: Boosting Feed-Forward Driving Scene Generation with Efficient Video Diffusion Paper • 2510.15264 • Published Oct 17, 2025 • 4
Few-Step Distillation for Text-to-Image Generation: A Practical Guide Paper • 2512.13006 • Published Dec 15, 2025 • 8
VLSA: Vision-Language-Action Models with Plug-and-Play Safety Constraint Layer Paper • 2512.11891 • Published Dec 9, 2025 • 9
ReViSE: Towards Reason-Informed Video Editing in Unified Models with Self-Reflective Learning Paper • 2512.09924 • Published Dec 10, 2025 • 4
Unicorn: Text-Only Data Synthesis for Vision Language Model Training Paper • 2503.22655 • Published Mar 28, 2025 • 39
OpenHelix: A Short Survey, Empirical Analysis, and Open-Source Dual-System VLA Model for Robotic Manipulation Paper • 2505.03912 • Published May 6, 2025 • 9
SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning Paper • 2505.12448 • Published May 18, 2025 • 10