caohaoyu

rechy

AI & ML interests

None yet

Recent Activity

authored a paper about 9 hours ago

Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

authored a paper about 9 hours ago

When Thinking Hurts: Mitigating Visual Forgetting in Video Reasoning via Frame Repetition

authored a paper about 9 hours ago

Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion

Organizations

None yet

authored 5 papers about 9 hours ago

Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

Paper • 2604.05015 • Published 3 days ago • 193

When Thinking Hurts: Mitigating Visual Forgetting in Video Reasoning via Frame Repetition

Paper • 2603.16256 • Published 23 days ago • 1

Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion

Paper • 2603.06577 • Published Mar 6 • 48

DiG: Differential Grounding for Enhancing Fine-Grained Perception in Multimodal Large Language Model

Paper • 2512.12633 • Published Dec 14, 2025

RISE-Video: Can Video Generators Decode Implicit World Rules?

Paper • 2602.05986 • Published Feb 5 • 26

authored 12 papers 2 months ago

Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration

Paper • 2309.01131 • Published Sep 3, 2023 • 1

Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models

Paper • 2402.19014 • Published Feb 29, 2024

Talk With Human-like Agents: Empathetic Dialogue Through Perceptible Acoustic Reception and Reaction

Paper • 2406.12707 • Published Jun 18, 2024

Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy

Paper • 2502.05177 • Published Feb 7, 2025 • 2

VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model

Paper • 2505.03739 • Published May 6, 2025 • 9

VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation

Paper • 2510.09607 • Published Oct 10, 2025 • 2

VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting

Paper • 2510.21817 • Published Oct 21, 2025 • 41

Input Domain Aware MoE: Decoupling Routing Decisions from Task Optimization in Mixture of Experts

Paper • 2510.16448 • Published Oct 18, 2025

TACO: Think-Answer Consistency for Optimized Long-Chain Reasoning and Efficient Data Learning via Reinforcement Learning in LVLMs

Paper • 2505.20777 • Published May 27, 2025

Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision

Paper • 2601.19798 • Published Jan 27 • 43

BASIC: Boosting Visual Alignment with Intrinsic Refined Embeddings in Multimodal Large Language Models

Paper • 2508.06895 • Published Aug 9, 2025

Youtu-Parsing: Perception, Structuring and Recognition via High-Parallelism Decoding

Paper • 2601.20430 • Published Jan 28 • 16
