Article VLX-Go: Vision-Language Short-Horizon Waypoint Prediction for Embodied Navigation omlab • about 5 hours ago • 6
Article VLX-Seek: Improving VLM Fine-Grained Perception via Region Reference Instead of Coordinate Generation omlab • 1 day ago • 8
Article VLX-Flow: Continuous Video Understanding for Real-Time Multimodal Interaction omlab • 1 day ago • 9
Which Pretraining Paradigm Better Serves Spatial Intelligence? An Empirical Comparison of Vision-Language and Video Generation Models Paper • 2605.28132 • Published May 27 • 25
VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs Paper • 2509.25916 • Published Sep 30, 2025 • 6
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model Paper • 2504.07615 • Published Apr 10, 2025 • 36
Article Improving Object Detection through Reinforcement Learning with VLM-R1 omlab • Mar 25, 2025 • 3
Article Trials, Errors, and Breakthroughs: Our Rocky Road to OVD SOTA with Reinforcement Learning omlab • Mar 25, 2025 • 3
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning Paper • 2501.12948 • Published Jan 22, 2025 • 454
OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer Paper • 2406.16620 • Published Jun 24, 2024 • 3
RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing Paper • 2306.11300 • Published Jun 20, 2023 • 2
Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head Paper • 2403.06892 • Published Mar 11, 2024 • 2
GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection Paper • 2312.15043 • Published Dec 22, 2023 • 2
VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations Paper • 2207.00221 • Published Jul 1, 2022 • 2
OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network Paper • 2209.05946 • Published Sep 10, 2022 • 2
OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding Paper • 2407.04923 • Published Jul 6, 2024 • 2
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling Paper • 2412.05271 • Published Dec 6, 2024 • 162