PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models Paper • 2606.19534 • Published 9 days ago • 61
Skill-3D: Evolving Scene-Aware Skills for Agentic 3D Spatial Reasoning Paper • 2606.07436 • Published 21 days ago • 24
Describe Anything Collection Multimodal Large Language Models for Detailed Localized Image and Video Captioning • 7 items • Updated 14 days ago • 63
DeH4R: A Decoupled and Hybrid Method for Road Network Graph Extraction Paper • 2508.13669 • Published Aug 19, 2025 • 1
SIU3R: Simultaneous Scene Understanding and 3D Reconstruction Beyond Feature Alignment Paper • 2507.02705 • Published Jul 3, 2025 • 2
ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing Paper • 2506.19848 • Published Jun 24, 2025 • 27
Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding Paper • 2504.10465 • Published Apr 14, 2025 • 27
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding Paper • 2406.19389 • Published Jun 27, 2024 • 54