You Only Judge Once: Multi-response Reward Modeling in a Single Forward Pass Paper • 2604.10966 • Published 26 days ago • 11
CapNav: Benchmarking Vision Language Models on Capability-conditioned Indoor Navigation Paper • 2602.18424 • Published Feb 20
PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning Paper • 2603.26653 • Published Mar 27 • 18
VFIG: Vectorizing Complex Figures in SVG with Vision-Language Models Paper • 2603.24575 • Published Mar 25 • 18
Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos Paper • 2602.23543 • Published Feb 26 • 9
Patient-Similarity Cohort Reasoning in Clinical Text-to-SQL Paper • 2601.09876 • Published Jan 14 • 7
Structure From Tracking: Distilling Structure-Preserving Motion for Video Generation Paper • 2512.11792 • Published Dec 12, 2025 • 10
LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation Paper • 2412.15188 • Published Dec 19, 2024 • 2
MADFormer: Mixed Autoregressive and Diffusion Transformers for Continuous Image Generation Paper • 2506.07999 • Published Jun 9, 2025 • 2
TV2TV: A Unified Framework for Interleaved Language and Video Generation Paper • 2512.05103 • Published Dec 4, 2025 • 20
VisualSphinx: Large-Scale Synthetic Vision Logic Puzzles for RL Paper • 2505.23977 • Published May 29, 2025 • 10