Does Hearing Help Seeing? Investigating Audio-Video Joint Denoising for Video Generation Paper • 2512.02457 • Published 10 days ago • 13
UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist Paper • 2511.08521 • Published about 1 month ago • 37
Dr.V: A Hierarchical Perception-Temporal-Cognition Framework to Diagnose Video Hallucination by Fine-grained Spatial-Temporal Grounding Paper • 2509.11866 • Published Sep 15 • 1
VimoRAG: Video-based Retrieval-augmented 3D Motion Generation for Motion Language Models Paper • 2508.12081 • Published Aug 16
DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World Paper • 2506.24102 • Published Jun 30
Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence Paper • 2510.20579 • Published Oct 23 • 55
Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs Paper • 2510.18876 • Published Oct 21 • 36
Mixed-R1: Unified Reward Perspective For Reasoning Capability in Multimodal Large Language Models Paper • 2505.24164 • Published May 30
UltraVideo: High-Quality UHD Video Dataset with Comprehensive Captions Paper • 2506.13691 • Published Jun 16 • 2
Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology Paper • 2507.07999 • Published Jul 10 • 49
Towards Semantic Equivalence of Tokenization in Multimodal LLM Paper • 2406.05127 • Published Jun 7, 2024
So-Fake: Benchmarking and Explaining Social Media Image Forgery Detection Paper • 2505.18660 • Published May 24 • 1
Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model Paper • 2505.23606 • Published May 29 • 14
Conditional Panoramic Image Generation via Masked Autoregressive Modeling Paper • 2505.16862 • Published May 22
MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query Paper • 2506.03144 • Published Jun 3 • 7