Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling • Paper • 2604.28185 • Published 11 days ago • 87
Vision-Language-Action Safety: Threats, Challenges, Evaluations, and Mechanisms • Paper • 2604.23775 • Published 15 days ago • 44
Gated Condition Injection without Multimodal Attention: Towards Controllable Linear-Attention Transformers • Paper • 2603.27666 • Published Mar 29 • 18
Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps • Paper • 2505.18675 • Published May 24, 2025 • 27
Anatomy of a Lie: A Multi-Stage Diagnostic Framework for Tracing Hallucinations in Vision-Language Models • Paper • 2603.15557 • Published Mar 16 • 29
ViFeEdit: A Video-Free Tuner of Your Video Diffusion Transformer • Paper • 2603.15478 • Published Mar 16 • 24
WorldWarp: Propagating 3D Geometry with Asynchronous Video Diffusion • Paper • 2512.19678 • Published Dec 22, 2025 • 32
In-Video Instructions: Visual Signals as Generative Control • Paper • 2511.19401 • Published Nov 24, 2025 • 32
Discrete Diffusion in Large Language and Multimodal Models: A Survey • Paper • 2506.13759 • Published Jun 16, 2025 • 43