arxiv:2607.00115

PixelEyes: Decoupling Perception and Reasoning for Pinpoint Visual Evidence Seeking

Published on Jun 30

Submitted by GongDengxian on Jul 2

Wuhan University

Authors:

Abstract

Multi-turn visual reasoning agents suffer from entangled reasoning and perception that cause redundant trajectories; PixelEyes addresses this by decoupling these processes through mask-guided search and semantic-region breadth-first search, demonstrated on a new benchmark with expert-resynthesized data.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

This paper explores multi-turn visual reasoning and observes that MLLMs repeatedly fail to localize the target, leading to long, redundant trajectories. We attribute this failure to the entanglement of reasoning and perception within a single model: the MLLM reasons and localizes simultaneously, and inaccurate localization triggers additional reasoning turns that bloat the trajectory. To solve this problem, we propose PixelEyes, a multi-turn visual reasoning agent that explicitly decouples reasoning from perception: the reasoner decides what to look for, while a specialized perception tool answers where it is. Specifically, PixelEyes introduces two components. 1) Mask-guided Visual Search: a referring segmentation model is invoked to provide mask-precise localization, freeing the reasoner from the need to compensate for imprecise grounding. 2) Semantic-region Breadth-first Search (BFS): to eliminate redundant loops caused by repeatedly cropping incorrect sub-regions, we organize exploration as a breadth-first search over semantic regions. To internalize these capabilities, we construct the PixelEyes-6K dataset by resynthesizing expert trajectories from existing data, which explicitly embeds our mask-guided search and BFS logic into the model. We further introduce Pinpoint-Bench, a zero-hint visual search benchmark (no location cues are provided in the question) with instance-level masks and bounding boxes that separate localization failures from reasoning failures, enabling fine-grained analysis of failure modes such as inattentional blindness. Recent state-of-the-art MLLMs and visual reasoning agents leave large headroom on Pinpoint-Bench, demonstrating its quality and difficulty. Code and models are open-sourced.
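
The abstract does not say how semantic regions are produced or how a candidate region is tested, so the following is only a minimal sketch of the general idea of a breadth-first search over nested regions, not the authors' implementation. The `Region` class, the `bfs_regions` function, the `is_target` predicate (a stand-in for whatever perception tool confirms a match), and the toy scene are all hypothetical names introduced for illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Optional

Box = tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


@dataclass
class Region:
    """Hypothetical semantic region: a labelled box with nested sub-regions."""
    label: str
    box: Box
    children: list["Region"] = field(default_factory=list)


def bfs_regions(
    root: Region, is_target: Callable[[Region], bool]
) -> tuple[Optional[Region], list[str]]:
    """Visit regions level by level; return the first match and the visit order."""
    queue = deque([root])
    seen: set[Box] = set()
    order: list[str] = []
    while queue:
        region = queue.popleft()
        if region.box in seen:  # never inspect the same crop twice
            continue
        seen.add(region.box)
        order.append(region.label)
        if is_target(region):  # assumption: stand-in for a perception check
            return region, order
        queue.extend(region.children)  # finer regions wait behind coarser ones
    return None, order


# Toy scene, invented for illustration only.
scene = Region("scene", (0, 0, 1000, 800), [
    Region("desk", (0, 400, 600, 800), [
        Region("mug", (100, 450, 180, 540)),
        Region("keyboard", (200, 600, 500, 700)),
    ]),
    Region("shelf", (600, 0, 1000, 400), [
        Region("clock", (700, 50, 780, 130)),
    ]),
])

found, visited = bfs_regions(scene, lambda r: r.label == "clock")
print(found.box if found else None)
print(visited)
```

Because every coarse region is examined before any of its sub-regions, and each box is visited at most once, the search cannot loop by repeatedly re-cropping the same incorrect sub-region, which is the failure pattern the abstract describes.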

Community

godx7 (paper submitter), about 5 hours ago:

PixelEyes enhances active visual search in MLLMs by delegating fine-grained localization to a specialized perception tool, thereby achieving efficient and accurate multi-turn visual reasoning.

🤗 Models: https://huggingface.co/collections/godx7/pixeleyes
🤗 Pinpoint-Bench: https://huggingface.co/datasets/godx7/Pinpoint-Bench

Get this paper in your agent:

hf papers read 2607.00115

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 2
