VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents Paper • 2601.16973 • Published 6 days ago • 34
VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents Paper • 2601.16973 • Published 6 days ago • 34
ThreadWeaver: Adaptive Threading for Efficient Parallel Reasoning in Language Models Paper • 2512.07843 • Published Nov 24, 2025 • 22
Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling Paper • 2504.13169 • Published Apr 17, 2025 • 39
Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint Paper • 2505.23759 • Published May 29, 2025 • 5
Learning Adaptive Parallel Reasoning with Language Models Paper • 2504.15466 • Published Apr 21, 2025 • 44
Describe Anything: Detailed Localized Image and Video Captioning Paper • 2504.16072 • Published Apr 22, 2025 • 63
Atlas: Multi-Scale Attention Improves Long Context Image Modeling Paper • 2503.12355 • Published Mar 16, 2025 • 12
Diffusion Hyperfeatures: Searching Through Time and Space for Semantic Correspondence Paper • 2305.14334 • Published May 23, 2023 • 1
Readout Guidance: Learning Control from Diffusion Features Paper • 2312.02150 • Published Dec 4, 2023 • 3
Multi-Stage Multi-Modal Pre-Training for Automatic Speech Recognition Paper • 2403.19822 • Published Mar 28, 2024
ALOHa: A New Measure for Hallucination in Captioning Models Paper • 2404.02904 • Published Apr 3, 2024
Virtual Personas for Language Models via an Anthology of Backstories Paper • 2407.06576 • Published Jul 9, 2024 • 1
Visual Haystacks: Answering Harder Questions About Sets of Images Paper • 2407.13766 • Published Jul 18, 2024 • 2