AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation Paper • 2410.00371 • Published Oct 1, 2024
Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning Paper • 2506.13654 • Published Jun 16, 2025 • 43
LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models Paper • 2407.12772 • Published Jul 17, 2024 • 35
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos Paper • 2501.13826 • Published Jan 23, 2025 • 23
Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models Paper • 2412.09645 • Published Dec 10, 2024 • 36
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models Paper • 2411.14432 • Published Nov 21, 2024 • 25
Octopus: Embodied Vision-Language Programmer from Environmental Feedback Paper • 2310.08588 • Published Oct 12, 2023 • 38
Otter: A Multi-Modal Model with In-Context Instruction Tuning Paper • 2305.03726 • Published May 5, 2023 • 6