Translation as a Bridging Action: Transferring Manipulation Skills from Humans to Robots Paper • 2606.28133 • Published 9 days ago • 39
PhysisForcing: Physics Reinforced World Simulator for Robotic Manipulation Paper • 2606.28128 • Published 9 days ago • 50
Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models Paper • 2606.25041 • Published 12 days ago • 115
OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs Paper • 2510.10689 • Published Oct 12, 2025 • 46
VideoPrism: A Foundational Visual Encoder for Video Understanding Paper • 2402.13217 • Published Feb 20, 2024 • 41
Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations Paper • 2506.18898 • Published Jun 23, 2025 • 35
NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks Paper • 2504.19854 • Published Apr 28, 2025 • 7
LongVILA: Scaling Long-Context Visual Language Models for Long Videos Paper • 2408.10188 • Published Aug 19, 2024 • 52
FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance Paper • 2408.08189 • Published Aug 15, 2024 • 17
MusiConGen: Rhythm and Chord Control for Transformer-Based Text-to-Music Generation Paper • 2407.15060 • Published Jul 21, 2024 • 9
YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information Paper • 2402.13616 • Published Feb 21, 2024 • 49
LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens Paper • 2402.13753 • Published Feb 21, 2024 • 116
MusicRL: Aligning Music Generation to Human Preferences Paper • 2402.04229 • Published Feb 6, 2024 • 17
EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision Paper • 2311.02077 • Published Nov 3, 2023 • 15