VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents • Paper • arXiv:2601.16973 • Published 3 days ago
CoVT: Chain-of-Visual-Thought Collection • Enrich VLMs' vision-centric reasoning capabilities via Chain-of-Visual-Thought • 7 items • Updated Nov 25, 2025
Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens • Paper • arXiv:2511.19418 • Published Nov 24, 2025
UnSAMv2: Self-Supervised Learning Enables Segment Anything at Any Granularity • Paper • arXiv:2511.13714 • Published Nov 17, 2025
Constantly Improving Image Models Need Constantly Improving Benchmarks • Paper • arXiv:2510.15021 • Published Oct 16, 2025
Reconstruction Alignment Improves Unified Multimodal Models • Paper • arXiv:2509.07295 • Published Sep 8, 2025
Describe Anything: Detailed Localized Image and Video Captioning • Paper • arXiv:2504.16072 • Published Apr 22, 2025
Cut and Learn for Unsupervised Object Detection and Instance Segmentation • Paper • arXiv:2301.11320 • Published Jan 26, 2023
See, Say, and Segment: Teaching LMMs to Overcome False Premises • Paper • arXiv:2312.08366 • Published Dec 13, 2023
InstanceDiffusion: Instance-level Control for Image Generation • Paper • arXiv:2402.03290 • Published Feb 5, 2024