VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents Paper • 2601.16973 • Published 5 days ago • 30
CoVT: Chain-of-Visual-Thought Collection • Enrich VLMs’ vision-centric reasoning capabilities via Chain-of-Visual-Thought! • 7 items • Updated Nov 25, 2025 • 6
Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens Paper • 2511.19418 • Published Nov 24, 2025 • 29
UnSAMv2: Self-Supervised Learning Enables Segment Anything at Any Granularity Paper • 2511.13714 • Published Nov 17, 2025 • 12
Constantly Improving Image Models Need Constantly Improving Benchmarks Paper • 2510.15021 • Published Oct 16, 2025 • 7
Reconstruction Alignment Improves Unified Multimodal Models Paper • 2509.07295 • Published Sep 8, 2025 • 40
Describe Anything: Detailed Localized Image and Video Captioning Paper • 2504.16072 • Published Apr 22, 2025 • 63
InstanceDiffusion: Instance-level Control for Image Generation Paper • 2402.03290 • Published Feb 5, 2024 • 1