Reinforcing Dual-Path Reasoning in Spatial Vision Language Models Paper • 2606.17539 • Published 9 days ago • 15
Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation Paper • 2507.10524 • Published Jul 14, 2025 • 74
SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer Paper • 2605.15178 • Published May 14 • 91
Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation Paper • 2604.24763 • Published Apr 27 • 71
VGGRPO: Towards World-Consistent Video Generation with 4D Latent Reward Paper • 2603.26599 • Published Mar 27 • 67
HiStream: Efficient High-Resolution Video Generation via Redundancy-Eliminated Streaming Paper • 2512.21338 • Published Dec 24, 2025 • 23
OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory Paper • 2512.07802 • Published Dec 8, 2025 • 46
TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models Paper • 2512.02014 • Published Dec 1, 2025 • 78
From Denoising to Refining: A Corrective Framework for Vision-Language Diffusion Model Paper • 2510.19871 • Published Oct 22, 2025 • 30
INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats Paper • 2510.25602 • Published Oct 29, 2025 • 81
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model Paper • 2502.10248 • Published Feb 14, 2025 • 57