Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding Paper • 2604.05015 • Published 3 days ago • 193
When Thinking Hurts: Mitigating Visual Forgetting in Video Reasoning via Frame Repetition Paper • 2603.16256 • Published 23 days ago • 1
Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion Paper • 2603.06577 • Published Mar 6 • 48
DiG: Differential Grounding for Enhancing Fine-Grained Perception in Multimodal Large Language Model Paper • 2512.12633 • Published Dec 14, 2025
RISE-Video: Can Video Generators Decode Implicit World Rules? Paper • 2602.05986 • Published Feb 5 • 26
Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration Paper • 2309.01131 • Published Sep 3, 2023 • 1
Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models Paper • 2402.19014 • Published Feb 29, 2024
Talk With Human-like Agents: Empathetic Dialogue Through Perceptible Acoustic Reception and Reaction Paper • 2406.12707 • Published Jun 18, 2024
Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy Paper • 2502.05177 • Published Feb 7, 2025 • 2
VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model Paper • 2505.03739 • Published May 6, 2025 • 9
VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation Paper • 2510.09607 • Published Oct 10, 2025 • 2
VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting Paper • 2510.21817 • Published Oct 21, 2025 • 41
Input Domain Aware MoE: Decoupling Routing Decisions from Task Optimization in Mixture of Experts Paper • 2510.16448 • Published Oct 18, 2025
TACO: Think-Answer Consistency for Optimized Long-Chain Reasoning and Efficient Data Learning via Reinforcement Learning in LVLMs Paper • 2505.20777 • Published May 27, 2025
Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision Paper • 2601.19798 • Published Jan 27 • 43
BASIC: Boosting Visual Alignment with Intrinsic Refined Embeddings in Multimodal Large Language Models Paper • 2508.06895 • Published Aug 9, 2025
Youtu-Parsing: Perception, Structuring and Recognition via High-Parallelism Decoding Paper • 2601.20430 • Published Jan 28 • 16