
Title: Leveraging Uncertainty-Awareness to Minimize Human-Agent Dialogues

URL Source: https://arxiv.org/html/2412.01250


License: arXiv.org perpetual non-exclusive license

arXiv:2412.01250v3 [cs.AI] 18 Mar 2025

Collaborative Instance Object Navigation: Leveraging Uncertainty-Awareness to Minimize Human-Agent Dialogues

Francesco Taioli1,2, Edoardo Zorzi2, Gianni Franchi3, Alberto Castellini2, Alessandro Farinelli2, Marco Cristani2, Yiming Wang4

1Polytechnic of Turin, 2University of Verona, 3U2IS, ENSTA Paris, Institut Polytechnique de Paris, 4Fondazione Bruno Kessler

francesco.taioli@polito.it, {name.surname}@univr.it, gianni.franchi@ensta-paris.fr, ywang@fbk.eu

https://intelligolabs.github.io/CoIN

Abstract

Language-driven instance object navigation assumes that human users initiate the task by providing a detailed description of the target instance to the embodied agent. While this description is crucial for distinguishing the target from visually similar instances in a scene, providing it prior to navigation can be demanding for humans. To bridge this gap, we introduce Collaborative Instance object Navigation (CoIN), a new task setting where the agent actively resolves uncertainties about the target instance during navigation in natural, template-free, open-ended dialogues with humans. We propose a novel training-free method, Agent-user Interaction with UncerTainty Awareness (AIUTA), which operates independently from the navigation policy, and focuses on the human-agent interaction reasoning with Vision-Language Models (VLMs) and Large Language Models (LLMs). First, upon object detection, a Self-Questioner model initiates a self-dialogue within the agent to obtain a complete and accurate observation description with a novel uncertainty estimation technique. Then, an Interaction Trigger module determines whether to ask the human a question, continue navigation, or halt it, minimizing user input. For evaluation, we introduce CoIN-Bench, with a curated dataset designed for challenging multi-instance scenarios. CoIN-Bench supports both online evaluation with humans and reproducible experiments with simulated user-agent interactions. On CoIN-Bench, we show that AIUTA serves as a competitive baseline, while existing language-driven instance navigation methods struggle in complex multi-instance scenes. Code and benchmark will be available upon acceptance.

Figure 1: Sketched episode of the proposed CoIN task. The human user (bottom left) provides a request (“Find the picture”) in natural language. The agent has to locate the object in a completely unknown environment without any target image as input, interacting with the user only when needed via template-free, open-ended natural-language dialogue. Our method, Agent-user Interaction with UncerTainty Awareness (AIUTA), addresses this challenging task, minimizing user interactions by equipping the agent with two modules: a Self-Questioner and an Interaction Trigger, whose output is shown in the blue boxes along the agent’s path (① to ⑤), and whose inner workings are shown on the right. The Self-Questioner leverages an LLM and a VLM in a self-dialogue to initially describe the agent’s observation, and then extract additional relevant details, with a novel entropy-based technique to reduce hallucinations and inaccuracies, producing a refined detection description. The Interaction Trigger uses this refined description to decide whether to pose a question to the user (①,③,④), continue the navigation (②) or halt the exploration (⑤).

1 Introduction

Recent advances in Large Language Models (LLMs) and Vision-Language Models (VLMs) have significantly reinvigorated research on language-driven navigation tasks [3, 46, 1, 21, 17], where humans engage with embodied agents via natural language only, the most intuitive form of human-agent interaction among the alternatives (e.g., visual reference [62]). In this paper, we focus on the language-driven Instance Object Navigation (InstanceObjectNav) task [21, 17], a practical task where the agent aims to locate a specific instance within an unknown 3D scene, based on a detailed instance description (unlike ObjectNav [6], where any object of a category can be located). The instance description typically contains nuanced details about the intrinsic (e.g., color, material) and extrinsic (e.g., context, spatial relations) attributes of the searched object instance, which are essential for uniquely identifying the target amid visual ambiguity. However, the standard language-driven InstanceObjectNav task assumes that the detailed instance description is provided upfront, before navigation begins. This assumption can be demanding and impractical in the real world, as users may not be able or willing to supply all details in advance.

We introduce the Collaborative Instance object Navigation (CoIN) task, which engages a human user via natural-language dialogues to resolve instance visual ambiguity during navigation. CoIN enables human users to initiate the InstanceObjectNav task without providing an extensive instance description. For instance, the user can just specify the instance category, e.g., “Find the picture”, a challenging minimal-guidance scenario. Notably, CoIN introduces, for the first time, template-free, open-ended human-agent dialogues, a significant departure from the templated question-answer pairs used in prior work [12]. Instead, our agent engages in dialogue solely based on the understanding gained during navigation. Within CoIN, two key research questions arise: 1) When and 2) How should agent-user interaction occur? To address the “When”, the agent must develop an internal model of its perceived environment to determine the optimal moments for seeking assistance from the user, resolving ambiguities effectively. To address the “How”, the agent must formulate the most informative questions to maximize its chances of locating the target.

We introduce a novel zero-shot approach called Agent-user Interaction with UncerTainty Awareness (AIUTA). AIUTA equips the agent with two onboard modules, the Self-Questioner and the Interaction Trigger, leveraging pre-trained VLMs and LLMs without additional training. The Self-Questioner enables the agent to autonomously generate self-dialogues to inquire about additional target-relevant details, and to verify essential details with a novel technique for uncertainty estimation. As shown in Fig. 1, upon detection, the LLM first prompts the VLM to obtain an initial detection description, which can be incomplete and inaccurate. To enrich it with target-relevant details, the LLM further generates questions for the VLM, whose responses complement the initial description. However, since VLMs cannot guarantee accurate responses grounded in the visual counterpart [48, 25, 35], we further prompt the LLM to generate sanity-check questions about all relevant details (e.g., “Is the wall light-blue?”). We instruct the VLM to respond with either Yes, No or I don’t know, and propose a novel Normalized-Entropy-based technique to quantify the VLM uncertainty. Finally, the Interaction Trigger module leverages the LLM to predict an alignment score between the refined detection description and the known target’s facts acquired from previous agent-human dialogues, if any. With the score, the module decides whether to continue navigation, terminate it, or ask the human clarifying questions.

To evaluate CoIN, we propose the first benchmark, CoIN-Bench, with a curated dataset that specifically focuses on the visual ambiguity challenge in CoIN. The dataset is created on top of the recent large-scale GOAT-Bench [17], where we only consider episodes involving multiple instances in a scene, with high-quality visual observations on the target. In total, CoIN-Bench consists of 1,649 evaluation episodes, with on average five distractors (non-target instances of the same category) per episode. Our benchmark supports online evaluation with humans, as well as reproducible evaluation via simulated user-agent interactions. We empirically show that the simulated user-agent interaction yields results that are in line with human evaluation. With CoIN-Bench, AIUTA, while being training-free, outperforms state-of-the-art InstanceObjectNav methods that are trained on the dataset in the zero-shot setting, in terms of success rate and path efficiency. Finally, to evaluate VLM uncertainty estimation in the context of CoIN, we introduce the “I Don’t Know Visual Question Answering” (IDKVQA) dataset, with human annotations. On IDKVQA, our proposed Normalized-Entropy-based technique outperforms recent competitors [26], being a more reliable uncertainty measure.

Paper Contributions are summarized as follows:

• We introduce CoIN, a practical setting for InstanceObjectNav, enabling minimal human input via agent-human dialogues during navigation.

• We propose AIUTA, a training-free method addressing CoIN, using self-dialogues within the agent to reduce perception uncertainty and minimize agent-user interactions.

• We introduce a novel Normalized-Entropy-based technique to quantify VLM perception uncertainty, along with a dedicated IDKVQA dataset, demonstrating its superior reliability compared to recent competitors.

• We introduce CoIN-Bench, a new benchmark featuring the challenging multi-instance scenarios of CoIN, supporting evaluation with both human and simulated user-agent interactions for reproducibility.

Figure 2: Graphical depiction of AIUTA: left shows its interaction cycle with the user, and right provides an exploded view of our method. ① The agent receives an initial instruction $I$: “Find a $c$ <object category>”. ② At each timestep $t$, a zero-shot policy $\pi$ [53], comprising a frozen object detection module [24], selects the optimal action $a_t$. ③ Upon detection, the agent performs the proposed AIUTA. Specifically, ④ the agent first obtains an initial scene description of observation $O_t$ from a VLM. Then, a Self-Questioner module leverages an LLM to automatically generate attribute-specific questions to the VLM, acquiring more information and refining the scene description with reduced attribute-level uncertainty, producing $S_{refined}$. ⑤ The Interaction Trigger module then evaluates $S_{refined}$ against the “facts” related to the target, to determine whether to terminate the navigation (if the agent believes it has located the target object ⑥), or to pose template-free, natural-language questions to a human ⑦, updating the “facts” based on the response ⑧.

2 Related Works

Instance Object Navigation. Instance Object Navigation (InstanceObjectNav) is an extension of Object-Goal navigation (ObjectNav) [2, 5]. Unlike ObjectNav, which seeks any instance of a given category, InstanceObjectNav requires locating a specific instance defined by the user, making it a more practical and user-centered task. While the instance can be specified via an image (InstanceImageNav) [18], we focus on the more intuitive setting where users describe the target instance only in natural language. Recent policies can be divided into two categories: training-based [17, 38, 28, 45, 10, 54] and zero-shot policies [11, 60, 58, 19, 53, 55]. Trained policies rely either exclusively on reinforcement learning [17, 45, 28] or on reinforcement learning in conjunction with behavioral cloning [38]. Vision-language-aligned embeddings offer a promising alternative by enabling policies to incorporate detailed natural language descriptions as input. For instance, GOAT-Bench [17] employs CLIP embeddings as the goal modality, while methods like [28, 45] train on image-goal navigation [62] and evaluate on the ObjectNav task. Among zero-shot policies, several methods extend frontier-based exploration [52] by incorporating LLM reasoning [60, 55, 58, 19], CLIP-based localization [11] or vision-language maps for frontier selection [53]. The recent [4] extends InstanceObjectNav to the context of personalization, providing the agent with multimodal instance references, i.e., a set of images and textual descriptions. In contrast, we feature human-agent interactions during navigation, with no access to any target image.

Interactive Embodied AI. Common approaches for human-agent interaction involve agents asking users for assistance, with responses typically consisting of shortest-path actions for reaching target objects [44, 7] or simpler sub-goals expressed in natural language to guide navigation [31, 32, 27, 34, 39]. In [41], the authors propose a framework to measure the uncertainty of an LLM-based planner, enabling the agent to determine the next action or ask for help. Both [12, 47] include a dialog-guided task completion benchmark using human-annotated question-answer pairs collected via Amazon Mechanical Turk. FindThis [29] requires locating a specific object instance through dialogue with the user. However, the agent only responds with images of candidate objects, lacking the ability to ask questions or engage in free-form natural language interactions, limiting its interactivity. In [9], the Zero-Shot Interactive Personalized Object Navigation task is proposed, where agents must navigate to personalized objects (e.g., “Find Alice’s computer”). However, personalized goals are manually annotated, and the user, simulated by an LLM, can only respond with this ground-truth data. Both [29, 9] rely on a pre-built top-down semantic/occupancy map to locate the objects of interest; in contrast, our agent identifies the target instance only through open-ended, template-free, natural language dialogue with the user.

Vision-Language Models Uncertainty. Hallucinations, biases, reasoning failures and the generation of unfaithful text by LLMs are well-known issues [16]. Research by [33] shows that truthful information tends to concentrate on specific tokens, which can be leveraged to enhance error detection performance. However, these error detectors fail to generalize across datasets. Similarly, recent studies highlight systematic limitations in the visual capabilities of large vision-language models [48, 25], leading them to respond to unanswerable or misleading questions with hallucinated or inaccurate content [35]. To this end, PAI [25] proposes adjusting and amplifying attention weights assigned to image tokens, encouraging the model to prioritize visual information. In [59], a linear probing on the logits distribution of the first tokens determines whether visual questions are answerable/unanswerable. A relevant work [61] introduces a VLM-LLM dialogue for image captioning. Differently, AIUTA leverages the self-dialogue for an embodied task, incorporating both observation and target facts for question generation, with a novel uncertainty estimation technique.

3 Collaborative Instance Object Navigation

Collaborative Instance Object Navigation (CoIN) introduces a novel setting for the InstanceObjectNav task, where an agent navigates in an unknown environment to locate a specific target instance in collaboration with a human user via template-free, open-ended and natural-language interactions. The agent decides whether an interaction is needed to gather necessary target information from the user during the navigation. The objective of CoIN is to successfully locate the target instance with minimal user input, reducing the effort for the user in providing a detailed description.

Initially, the agent is positioned randomly in an unknown 3D environment [37]. The navigation starts upon receiving a user request $I$ in natural language, which can be as minimal as specifying only an open-set category $c$, e.g., “Find the <category>”. The agent does not have access to any visual reference of the target instance. We assume that the user is: (i) aware of the full details about the target instance, and (ii) collaborative, providing the true response when asked by the agent. At each time step $t$, the agent perceives a visual observation $O_t$ of the scene, allowing it to guide a policy $\pi$ to pick an action $a_t \in A = \{$Forward 0.25m, Turn Right 15°, Turn Left 15°, Stop, Ask$\}$, where Ask is the novel action that comes with our CoIN task. When invoked, the agent asks the user a template-free open-ended question $q_{a \to u}$ in natural language to gather more information about the target. With the user response $r_{u \to a}$, the agent updates the set of facts (set of attributes and characteristics) $F_t$, representing information derived exclusively from the interaction. Formally, the updated set of facts is represented as $F_t = F_{t-1} \cup r_{u \to a}$. The navigation terminates when certain criteria are met, e.g., the agent selects the Stop action or exceeds the maximum number of allowed actions. Notably, the agent can move anywhere in the continuous environment [42]. CoIN is particularly relevant in challenging scenarios where many visually ambiguous instances co-exist.
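As a minimal sketch of the task interface described above, the action set and the fact update $F_t = F_{t-1} \cup r_{u \to a}$ could be represented as follows. The class names `Action` and `Facts`, and the example user response, are hypothetical and chosen only for illustration; seeding the facts with the initial request follows the initialization used in Sec. 4.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Action(Enum):
    """Action set A of the CoIN agent, as listed in the task definition."""
    FORWARD = "Forward 0.25m"
    TURN_RIGHT = "Turn Right 15°"
    TURN_LEFT = "Turn Left 15°"
    STOP = "Stop"
    ASK = "Ask"  # the action introduced by CoIN


@dataclass(frozen=True)
class Facts:
    """F_t: information obtained from the user (hypothetical container)."""
    items: Tuple[str, ...] = ()

    def updated_with(self, user_response: str) -> "Facts":
        # F_t = F_{t-1} ∪ r_{u→a}: a set union, so a repeated response adds nothing.
        if user_response in self.items:
            return self
        return Facts(self.items + (user_response,))


# Illustrative episode: the facts start from the request and grow after each Ask.
facts = Facts(("Find the picture",))
facts = facts.updated_with("It shows a family portrait.")
facts = facts.updated_with("It shows a family portrait.")  # duplicate, ignored
```

Keeping the facts immutable makes each $F_t$ a separate value, which mirrors the time-indexed notation of the task definition.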

4 Proposed Method

Our proposed Agent-user Interaction with UncerTainty Awareness (AIUTA), a module that enriches the agent, is illustrated in Fig. 2. Upon receiving an initial user request $I$ with minimal guidance that only specifies the category, e.g., “Find the picture” (① in Fig. 2), AIUTA updates the known facts regarding the target instance, i.e., $F_{t=0} = \{I\}$. Then, it activates a zero-shot navigation method, VLFM [53], perceiving the scene observation $O_t$ and providing the navigation policy (② in Fig. 2). VLFM constructs an occupancy map to identify frontiers in the explored space, and a value map that quantifies the semantic relevance of these frontiers for target object localization using the pre-trained BLIP-2 [20] model. Object detection is performed by Grounding-DINO, an open-set object detector [24]. More details about VLFM [53] are provided in the Supp. Mat.

AIUTA is triggered upon the detection of an object belonging to the target class (③ in Fig. 2), executing two key components sequentially. First, the Self-Questioner (Sec. 4.1) leverages a Vision Language Model (VLM) and a Large Language Model (LLM) to obtain an accurate and detailed understanding of the observed object via self-questioning, enabling reliable verification of the detection against the target (④ in Fig. 2). Next, the Interaction Trigger (Sec. 4.2) determines, based on the observed object and the known target facts $F_t$, whether an agent-user interaction is necessary (in which case it triggers the action Ask), and whether the agent should halt (i.e., Stop) or proceed with the navigation (⑤ in Fig. 2). In the case of Ask (⑦ in Fig. 2), AIUTA updates the target facts $F_t$ with the response from the user (⑧ in Fig. 2). The agent terminates the navigation task once the target instance is deemed to be found. The complete algorithm can be found in the Supp. Mat. In the following, the Self-Questioner and the Interaction Trigger are fully detailed.

4.1 Self-Questioner

Upon detection, the Self-Questioner component aims to obtain a thorough and accurate description of the detected object. As suggested by previous studies [48, 25, 35], generative VLMs may produce descriptions that are not fully grounded on the visual content, leading to inaccuracy or hallucination. To mitigate this issue, we leverage an LLM to automatically generate attribute-specific questions for the VLM. In particular, we propose a novel technique for estimating uncertainty in VLM perception, enabling the refinement of detection descriptions. The technique has three steps: (i) generating an initial detection description with detailed information relevant to target identification; (ii) estimating VLM perception uncertainty to validate object detection; and (iii) refining the detection description by filtering out uncertain attributes. Each step is detailed below.

Generation of the initial detection description. The agent initially prompts the VLM for an initial description $S_{init}$ of the observation $O_t$ by providing the prompt $P_{init}$ = “Describe the <category> in the provided image.” Formally, $S_{init} = \text{VLM}(O_t, P_{init})$. The description $S_{init}$ returned by the VLM could miss essential details for locating the specific instance, e.g., when looking for a picture, the content of the picture itself may not be specified in the description. To mitigate this issue, we prompt the LLM to create a list of questions $Q^{details}_{a \to a} = \{q_j\}$ given $S_{init}$ and $F_t$ (the symbol $Q_{a \to a}$ is used to represent the self-dialogue performed by the agent). Formally, $Q^{details}_{a \to a} = \text{LLM}(P_{details}, S_{init}, F_t)$, where $P_{details}$ is the prompt guiding the question generation to obtain more details (see Supp. Mat.). The questions of $Q^{details}_{a \to a}$ are subsequently answered by the VLM. Specifically, it answers each question $q_j \in Q^{details}_{a \to a}$ with a response $r_j = \text{VLM}(O_t, q_j)$ given the observation $O_t$. Finally, we concatenate all responses to the initial description $S_{init}$, obtaining an enriched detection description $S_{enriched}$.

Perception uncertainty estimation. VLMs can generate hallucinated or inaccurate content [48, 25, 35], impacting the performance of AIUTA. To address this, we propose a novel technique for estimating their perception uncertainty. Direct evaluation of this aspect is challenging and often requires architectural modifications. Instead, we employ a prompt-guided Shannon entropy-based method for effective assessment. Our goal is to measure the uncertainty $u \in [0, 1]$ of the VLM in perceiving specific aspects of a given image through visual question answering: the VLM answers a specific question $q$ with a response $r$ and an associated uncertainty estimation $u$, i.e., $(r, u) = \text{VLM}(O_t, q)$. Following the notation from [25], we consider an auto-regressive VLM, where $\mathbf{X}_I$ is the image representation (i.e., image tokens), $\mathbf{X}_P$ is the prompt representation (i.e., prompt text tokens), and $\mathbf{X}_H$ is the history representation (tokens generated at previous time steps). During inference, the VLM generates a conditional probability distribution $p$ over the vocabulary $\mathbf{y} \in \mathbb{R}^{w}$ at each time step, expressed as:

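$$\mathbf{y} \sim p_{\text{VLM}}(\mathbf{y} \mid \mathbf{X}_I, \mathbf{X}_P, \mathbf{X}_H) \propto \text{softmax}\big(\text{logit}_{\text{VLM}}(\mathbf{y} \mid \mathbf{X}_I, \mathbf{X}_P, \mathbf{X}_H)\big). \tag{1}$$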

Estimating the uncertainty of the VLM response is non-trivial as the VLM has an unbounded output space and its output probability distribution is over a (large) vocabulary of size $w$. To address this issue, we leverage the standard instruction-tuning [22] procedures for VLMs, utilizing a predefined set of templated answers to restrict the vocabulary size to a fixed, small $w$. In particular, during inference, we use the following prompt: “<question>? You must answer with Yes, No, or ?=I don’t know.” In this way, we: (i) bound the auto-regressive generation to be essentially a one-step prediction, thus avoiding length-normalization; (ii) bound the vocabulary size, i.e., $w = 3$. We then compute the Shannon entropy [43] $H$ of a probability distribution $p$ over a vocabulary of size $w$:

$$H(p_{\text{VLM}}) = -\sum_{i=1}^{w} p(y_i) \log p(y_i). \tag{2}$$

The VLM uncertainty $u$ is then obtained by normalizing the entropy $H$ within the range $[0, 1]$ as $u = H / H_{max}$, where $H_{max} = \log(w)$ is the maximum entropy (i.e., maximum uncertainty) over a vocabulary of size $w$.
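The normalized entropy of Eq. 2 can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: the helper names `softmax` and `normalized_entropy` are hypothetical, and the three input logits are assumed to be the VLM scores of the first generated token for the answers Yes, No and ? (I don't know), renormalized over that restricted vocabulary.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over the restricted answer vocabulary."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs: Sequence[float]) -> float:
    """u = H(p) / log(w), with the convention 0 * log(0) = 0."""
    w = len(probs)
    if w < 2:
        raise ValueError("need at least two possible answers")
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    # Clamp to [0, 1] to absorb floating-point rounding.
    return min(1.0, max(0.0, h / math.log(w)))


# Hypothetical logits for ("Yes", "No", "?") on one sanity-check question.
confident = normalized_entropy(softmax([4.0, 1.0, 0.5]))
undecided = normalized_entropy(softmax([0.1, 0.0, 0.0]))
```

A distribution concentrated on one answer gives $u$ close to 0, while a uniform distribution over the three answers gives $u = 1$; any logarithm base may be used because the base cancels in the ratio $H / H_{max}$.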

Given a threshold $\tau$, we can indicate if the answer is Certain or Uncertain, namely:

$$C(u, \tau) = \begin{cases} \text{Certain}, & u \le \tau \\ \text{Uncertain}, & u > \tau \end{cases} \tag{3}$$

To reduce false positives, we use the prompt $P_{check}$ = “Does the image contain a <category>? Answer with Yes, No or ?=I don’t know.” (see Supp. Mat.). This allows us to confirm the presence of the object, which we formally express as $(r_{check}, u_{check}) = \text{VLM}(O_t, P_{check})$. Following Eq. 3, we continue the AIUTA pipeline if the response $r_{check}$ = “Yes” and the uncertainty is classified as $C(u_{check}, \tau)$ = Certain; otherwise, we continue exploring.
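The thresholding of Eq. 3 and the detection check can be sketched as follows. The function names are hypothetical and the threshold value used in the example is an arbitrary placeholder, not a value taken from the paper.

```python
def certainty(u: float, tau: float) -> str:
    """Eq. 3: an answer is Certain when its normalized entropy is at most tau."""
    return "Certain" if u <= tau else "Uncertain"


def keep_detection(r_check: str, u_check: float, tau: float) -> bool:
    """Continue the pipeline only for a confident 'Yes' to the detection check."""
    return r_check == "Yes" and certainty(u_check, tau) == "Certain"


TAU = 0.5  # placeholder threshold, chosen only for this example

accepted = keep_detection("Yes", 0.2, TAU)        # confident Yes: keep
too_uncertain = keep_detection("Yes", 0.8, TAU)   # uncertain Yes: keep exploring
not_present = keep_detection("No", 0.1, TAU)      # confident No: keep exploring
```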

To remove uncertain attributes, we prompt the LLM to extract a set of attributes and values $K_t = \{(k_j, v_j)\}$ from the detection description $S_{enriched}$, where each attribute $k_j$ is associated with a value $v_j$, e.g., (“frame”, “black”); (“content”, “RGB image of a family”), etc. For each attribute $k_j$, we then prompt the LLM to generate a list of $J$ questions, $Q^{attribute}_{a \to a} = \{q_j\}_{j=1}^{J}$, to be answered by the agent itself. Formally, we extract the attribute list and the self-questions in one prompt, $Q^{attribute}_{a \to a} = \text{LLM}(P_{selfquestions}, F_t, S_{enriched})$, where $P_{selfquestions}$ is the prompt for the LLM (see Supp. Mat.). For each question $q_j$, we access both the response $r_j$ and the associated uncertainty $u_j$ by evaluating $(r_j, u_j) = \text{VLM}(O_t, q_j)$. This process allows us to confirm or refine the attributes based on the VLM’s responses, obtaining a final detailed description $S_{refined}$.

Detection description refinement. To obtain the final detailed description $S_{refined}$, we let the LLM filter out uncertain attributes, given the enriched description $S_{enriched}$ and the set of questions, responses, and uncertainties $\{q_j, r_j, u_j\}$. More formally, $S_{refined} = \text{LLM}(P_{refined}, \{q_j, r_j, u_j\}, S_{enriched})$, where $P_{refined}$ is the prompt for the LLM (see Supp. Mat.).
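The three steps of the Self-Questioner can be summarized in one sketch. Everything here is an assumption made for illustration: the VLM and LLM are passed in as plain callables with hypothetical signatures, the prompt strings paraphrase the ones quoted in the text, and the position of the detection check after the enrichment step follows the order of presentation rather than a confirmed implementation detail.

```python
from typing import Callable, List, Optional, Tuple

Check = Tuple[str, str, float]  # (question q_j, response r_j, uncertainty u_j)


def self_questioner(
    observation: object,
    category: str,
    facts: List[str],
    vlm_describe: Callable[[object, str], str],
    vlm_answer: Callable[[object, str], Tuple[str, float]],
    llm_detail_questions: Callable[[str, List[str]], List[str]],
    llm_attribute_questions: Callable[[str, List[str]], List[str]],
    llm_refine: Callable[[str, List[Check]], str],
    tau: float,
) -> Optional[str]:
    """Return S_refined, or None when the detection check fails."""
    # Step (i): initial description, then enrichment with free-form VLM answers.
    s_init = vlm_describe(observation, f"Describe the {category} in the provided image.")
    answers = [vlm_describe(observation, q) for q in llm_detail_questions(s_init, facts)]
    s_enriched = " ".join([s_init, *answers])

    # Detection check: proceed only on a confident "Yes" (Eq. 3).
    r_check, u_check = vlm_answer(observation, f"Does the image contain a {category}?")
    if r_check != "Yes" or u_check > tau:
        return None

    # Step (ii): sanity-check questions, each answered with an uncertainty.
    checks: List[Check] = []
    for q in llm_attribute_questions(s_enriched, facts):
        r, u = vlm_answer(observation, q)
        checks.append((q, r, u))

    # Step (iii): the LLM filters uncertain attributes out of the description.
    return llm_refine(s_enriched, checks)


# Stub models, for illustration only.
def fake_describe(obs: object, prompt: str) -> str:
    if prompt.startswith("Describe the"):
        return "A framed picture on a wall."
    return "The frame is black."


refined = self_questioner(
    observation=None,
    category="picture",
    facts=["Find the picture"],
    vlm_describe=fake_describe,
    vlm_answer=lambda obs, q: ("Yes", 0.1),
    llm_detail_questions=lambda s, f: ["What color is the frame?"],
    llm_attribute_questions=lambda s, f: ["Is the frame black?"],
    llm_refine=lambda s, checks: s,  # stub: nothing is uncertain, keep everything
    tau=0.5,
)
```

Injecting the models as callables keeps the sketch independent of any particular VLM or LLM, in the same spirit as the method being independent of the navigation policy.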

4.2 Interaction Trigger

Using the accurate and detailed description $S_{refined}$ of the detected object, the Interaction Trigger prompts the LLM to decide whether to pose a question to the human user or continue the navigation. Specifically, we prompt the LLM to estimate a similarity score $s$ between the scene description $S_{refined}$ and the target object facts $F_t$, based on the alignment between the detection description and the known facts. Formally, $s = \text{LLM}(P_{score}, S_{refined}, F_t)$, where $P_{score}$ is the prompt instructing the LLM to produce the similarity score (see Supp. Mat.). Based on the LLM-estimated similarity score, the agent takes the corresponding action according to the following intuition: (i) if $s \ge \tau_{stop}$, the navigation terminates as the agent deems the instance has been found; (ii) if $s < \tau_{skip}$, the agent deems the detected object significantly different from the known target facts, and therefore skips the agent-user interaction to reduce the user's effort in providing input, continuing with the environment exploration; and (iii) if $\tau_{skip} \le s < \tau_{stop}$, the description and facts are somewhat aligned, suggesting that posing a question to the user can effectively reduce uncertainty.
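The three-way rule on the similarity score can be written compactly. This is a minimal sketch: the `Decision` enum and function name are hypothetical, and the score scale and threshold values in the example are placeholders, since this section does not state them.

```python
from enum import Enum


class Decision(Enum):
    STOP = "Stop"          # s >= tau_stop: the target is deemed found
    ASK = "Ask"            # tau_skip <= s < tau_stop: ask the user a question
    CONTINUE = "Continue"  # s < tau_skip: skip the interaction, keep exploring


def interaction_trigger(s: float, tau_stop: float, tau_skip: float) -> Decision:
    """Map the LLM-estimated similarity score s to one of the three outcomes."""
    if tau_skip > tau_stop:
        raise ValueError("tau_skip must not exceed tau_stop")
    if s >= tau_stop:
        return Decision.STOP
    if s < tau_skip:
        return Decision.CONTINUE
    return Decision.ASK


# Placeholder thresholds on a hypothetical 0-10 score scale.
TAU_STOP, TAU_SKIP = 8.0, 3.0
decisions = [interaction_trigger(s, TAU_STOP, TAU_SKIP) for s in (9.0, 5.0, 1.0)]
```

The gap between the two thresholds controls how often the user is questioned: widening it increases the number of Ask actions, while narrowing it makes the agent rely more on its own description.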

When taking the action Ask, we further leverage the capability of the LLM to compose an effective question to the user, $q_{a \to u}$, aimed at maximizing information gain about the target instance, conditioned on the known target object facts $F_t$ and the refined observation description $S_{refined}$. To minimize the number of LLM calls, we incorporate such question generation inside the $P_{score}$ prompt. After receiving the corresponding response from the human, $r_{u \to a}$, we update the target object facts $F_t$ with the new information, maximizing the effectiveness of later agent-human interactions.

5 CoIN-Bench

To facilitate the evaluation of CoIN, we introduce CoIN-Bench, a curated dataset that features challenging multi-instance scenarios, supports both human evaluation and simulated agent-user interactions, and includes a new performance metric that accounts for agent-user interactions.

Dataset Construction. Our dataset is built upon the large-scale GOAT-Bench [17], which spans diverse scenarios from HM3DSem [37] using the Habitat simulator [42]. GOAT-Bench provides instance references in various formats, including category names and natural-language descriptions, making it a suitable source dataset. GOAT-Bench consists of a large Train split for policy training, and three evaluation splits: Val Seen, Val Seen Synonyms and Val Unseen. Specifically, Val Seen includes objects seen in Train, Val Seen Synonyms introduces synonymous object names, and Val Unseen contains only novel objects absent from Train. Since GOAT-Bench’s Train split is dedicated to policy training, we design CoIN-Bench exclusively for evaluation.

We select episodes from the evaluation splits of GOAT-Bench, i.e., Val Seen, Val Seen Synonyms and Val Unseen, to ensure fair comparison with methods trained on GOAT-Bench. Since CoIN focuses on scenarios with multiple instances of the same target category (i.e., distractors), we apply a filtering procedure to discard episodes with fewer than $d_{min} = 2$ distractors. After filtering the episodes, the simulator [42] sets random start positions for the agent, ensuring a geodesic distance in the range [5 m, 20 m] between the start and target locations to vary navigation difficulty. Moreover, since visual observations are 3D renderings whose quality depends on the scene reconstruction, we manually filter out episodes to ensure high-quality visual observations, removing those where target instances have insufficient resolution, limited visual coverage, or are indistinguishable from distractors. Additionally, we ensure episodes are navigable without crossing floors, following [51, 17]. The CoIN-Bench dataset includes 831 episodes in Val Seen, 459 in Val Unseen and 359 in Val Seen Synonyms, for a total of 1,649 evaluation episodes, in line with the evaluation scale of well-known datasets [18, 17, 5]. As shown in Tab. 1, CoIN-Bench features an average of ∼5 distractors per episode, and a mean path length > 7, forming a highly challenging multi-instance evaluation set. More details and statistics are provided in the Supp. Mat.
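The automatic part of this construction can be sketched as a simple predicate over episodes. This is a simplification under stated assumptions: the `Episode` fields are hypothetical, the distance constraint is applied as a filter although the text describes it as a property of the sampled start positions, and the manual quality filtering has no programmatic counterpart here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Episode:
    """Hypothetical episode record; field names are not from GOAT-Bench."""
    episode_id: str
    num_distractors: int        # non-target instances of the target category
    geodesic_distance_m: float  # start-to-target geodesic distance
    crosses_floors: bool


D_MIN = 2                            # minimum number of distractors
MIN_DIST_M, MAX_DIST_M = 5.0, 20.0   # allowed geodesic distance range


def keep_episode(ep: Episode) -> bool:
    """Apply the distractor, distance, and single-floor constraints."""
    return (
        ep.num_distractors >= D_MIN
        and MIN_DIST_M <= ep.geodesic_distance_m <= MAX_DIST_M
        and not ep.crosses_floors
    )


candidates: List[Episode] = [
    Episode("a", 4, 9.3, False),   # kept
    Episode("b", 1, 9.0, False),   # too few distractors
    Episode("c", 5, 25.0, False),  # start too far from the target
    Episode("d", 3, 6.0, True),    # requires crossing floors
]
kept = [ep.episode_id for ep in candidates if keep_episode(ep)]
```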

| Statistics | Val Seen | Val Seen Synonyms | Val Unseen |
|---|---|---|---|
| Avg. (std) number of distractors | 4.58 (1.93) | 6.01 (1.96) | 5.15 (1.51) |
| Avg. (std) length (Geodesic) | 9.32 (3.43) | 9.13 (3.14) | 9.86 (3.73) |
| Avg. (std) length (Euclidean) | 7.48 (2.88) | 7.50 (2.75) | 7.78 (3.39) |

Table 1: Avg. (std) number of distractors and distance to the goal.

| Method | Input | Training-free | Val Seen SR ↑ | Val Seen SPL ↑ | Val Seen NQ ↓ | Val Seen Synonyms SR ↑ | Val Seen Synonyms SPL ↑ | Val Seen Synonyms NQ ↓ | Val Unseen SR ↑ | Val Unseen SPL ↑ | Val Unseen NQ ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Monolithic† [17] (CVPR-24) | d | ✗ | 6.62† | 3.11 | - | 13.09† | 6.45 | - | 0.22† | 0.05 | - |
| PSL [45] (ECCV-24) | d | ✗ | 8.78 | 3.30 | - | 8.91 | 2.83 | - | 4.58 | 1.39 | - |
| OVON† [54] (IROS-24) | c | ✗ | 8.18† | 5.24 | - | 15.88† | 11.35 | - | 2.61† | 1.29 | - |
| VLFM [53] (ICRA-24) | c | ✔ | 0.36 | 0.28 | - | 0.00 | 0.00 | - | 0.00 | 0.00 | - |
| AIUTA (ours) | c | ✔ | 7.42 | 2.92 | 1.67 | 14.38 | 7.99 | 1.36 | 6.67 | 2.30 | 1.13 |

Table 2: CoIN-Bench is challenging. AIUTA, while being training-free, achieves strong performance by outperforming trained policies (top rows) and significantly surpassing the zero-shot VLFM, across all splits, through effective user interaction. In contrast, policies trained on GOAT-Bench (denoted with †), the foundation of CoIN-Bench, fail to generalize to novel categories (Val Unseen). We report the SR (main metric, in bold w.r.t. training-free methods), SPL, and the number of questions NQ. The Input and Training-free columns describe the model condition. Input types: c for object category, d for its description.

Evaluation protocol. CoIN-Bench supports evaluation with both real humans, to assess the potential and limitations of genuine agent-human interactions, and simulated user-agent interactions, to enable extensive, reproducible and large-scale experiments. Simulating agent-human interactions is challenging due to: (i) the agent’s open-ended, template-free questions about any target attribute, making it impractical to predefine a comprehensive question-answer dataset, and (ii) the huge question space in the simulated continuous environment [42]. To address this, we propose to simulate user responses via a VLM with access to a high-resolution image of the target object (1024 × 1024) for each episode. This setup is more effective than relying solely on instance descriptions [9], as the comprehensive visual coverage allows for diverse responses to the agent’s questions.

Metrics. An episode is successful if the agent stops within 0.25 m of the target goal viewpoints. If the target is not located, the exploration ends after 500 actions. Following [2, 51], we use: Success Rate, SR (↑), our primary metric (in gray), and Success rate weighted by Path Length, SPL (↑). Additionally, we introduce the average Number of Questions asked, NQ (↓), in successful episodes to measure the amount of user input.
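As a minimal sketch of how these three metrics can be computed from per-episode logs, the snippet below uses the standard SPL definition from [2], i.e. the mean over episodes of S_i · l_i / max(p_i, l_i), where S_i is the success flag, l_i the shortest-path length and p_i the length of the path the agent actually took. The `EpisodeResult` record and function names are hypothetical, and reporting values as percentages is an assumption made to match the scale of the tables.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class EpisodeResult:
    """Hypothetical per-episode log used only for this sketch."""
    success: bool
    shortest_path_m: float  # l_i: geodesic shortest path from start to goal
    agent_path_m: float     # p_i: length of the path the agent travelled
    num_questions: int      # questions asked to the user in this episode


def success_rate(results: Sequence[EpisodeResult]) -> float:
    """SR: percentage of successful episodes."""
    return 100.0 * sum(r.success for r in results) / len(results)


def spl(results: Sequence[EpisodeResult]) -> float:
    """SPL: success weighted by the ratio of shortest to travelled path length."""
    total = 0.0
    for r in results:
        denom = max(r.agent_path_m, r.shortest_path_m)
        if r.success and denom > 0:
            total += r.shortest_path_m / denom
    return 100.0 * total / len(results)


def avg_questions(results: Sequence[EpisodeResult]) -> float:
    """NQ: mean number of questions, computed over successful episodes only."""
    successful = [r for r in results if r.success]
    if not successful:
        return 0.0
    return sum(r.num_questions for r in successful) / len(successful)


if __name__ == "__main__":
    logs = [
        EpisodeResult(True, 10.0, 20.0, 2),
        EpisodeResult(False, 8.0, 30.0, 4),
        EpisodeResult(True, 5.0, 5.0, 0),
    ]
    print(success_rate(logs), spl(logs), avg_questions(logs))
```

Restricting NQ to successful episodes follows the definition in the text; failed episodes therefore do not inflate or deflate the question count.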

6 Experiments

We first benchmark AIUTA against state-of-the-art (SOTA) methods [45, 17, 53, 54] on CoIN-Bench, with simulated user-agent interactions, highlighting that CoIN-Bench presents a challenging evaluation set for both training-free and training-based methods. Next, we conduct an evaluation on a small validation set using both real humans and simulated user-agent interactions, demonstrating that the simulation setup serves as a viable alternative to real human evaluation, enabling scalable and reproducible experiments. Finally, ablation studies validate AIUTA's design choices and showcase the effectiveness of the Normalized-Entropy-based technique for estimating VLM uncertainty, which outperforms recent baselines [59, 26] on the IDKVQA dataset.

Implementation Details. We use [23] (LLaVA 1.6, Mistral 7B) as the VLM and GPT-4o [15] as the LLM. User interaction is limited to a maximum of 4 rounds. We empirically set $\tau = 0.75$ (Eq. 3), $\tau_{stop} = 7$ and $\tau_{skip} = 5$, as they yield the best results. In Supp. Mat., see Sec. I for all prompts and Sec. D for AIUTA’s computational analysis.

Baselines. We compare AIUTA against SOTA Instance Navigation and ObjectNav methods: the SenseAct-NN Monolithic Policy (Monolithic) [17], PSL [45], OVON [54] and the zero-shot, training-free VLFM [53].

To demonstrate the challenging nature of our dataset, we include two baselines, Monolithic [17] and OVON [54], which are trained on GOAT-Bench. Again, note that the “Seen” splits contain categories seen during training (Sec. 5). PSL is trained on the ImageNav task and transferred to the language-driven instance navigation task. Notably, both Monolithic and PSL take a fully detailed description d of the target instance as input, while OVON [54] takes the target category c. Finally, VLFM operates in a zero-shot, training-free manner and takes the category c as input. All baselines are detailed in Supp. Mat. Sec. C. Tab. 2 summarizes the input types and training conditions on CoIN-Bench.

Results with simulated user-agent interaction. As shown in Tab. 2, training-based methods perform better on Val Seen and Val Seen Synonyms than on Val Unseen, highlighting their poor generalization to novel categories. This phenomenon is particularly pronounced for policies trained on GOAT-Bench (denoted with †), whose performance drops significantly: OVON’s SR decreases from a maximum of 15.88 to 2.61, and Monolithic’s SR drops from 13.09 to 0.22. In contrast, AIUTA, while being training-free, outperforms training-based methods on Val Unseen, with consistent and strong results in all the splits. Interestingly, on Val Seen Synonyms, AIUTA is inferior to OVON but outperforms PSL and Monolithic in SR and SPL. This is surprising, as PSL and Monolithic are training-based and operate with detailed instance descriptions. One possible explanation is that CLIP-based approaches are limited in encoding fine-grained instance descriptions compared to categories [17, 11]. Moreover, compared to the results reported on GOAT-Bench, the lower SR of the baselines on CoIN-Bench, e.g. Monolithic [17], highlights the introduced challenge of multi-instance ambiguity.

In particular, our closest competitor VLFM [53], when using only the instance category as input, fails nearly all evaluation episodes, with almost 0% SR across all splits. This is expected, as the large number of distractor objects (Tab. 1) poses significant challenges for ObjectNav methods, which lack instance-level discrimination capabilities. Further analysis of VLFM’s near-0% SR can be found in Supp. Mat. Sec. F. In contrast, despite being built on top of VLFM and taking only the instance category, AIUTA effectively gathers additional information from the user to identify the correct instance, requiring minimal agent-user interaction (NQ < 2 for all splits). This results in a substantial improvement in SR over VLFM (Tab. 2): about 14 percentage points on Val Seen Synonyms (14.38 vs. 0.00), and about 7 points on both Val Seen (7.42 vs. 0.36) and Val Unseen (6.67 vs. 0.00). We illustrate the diversity of AIUTA-generated questions in Supp. Mat. Sec. G.

Validation with real humans. To validate that simulated user-agent interactions yield credible results, we further conduct an evaluation with real humans on a small subset of CoIN-Bench. We randomly select 40 episodes with detectable target instances across all splits to minimize time and cognitive load. As a result, the SR for this set is higher than that reported in Tab. 2. We engage 20 participants of varying ages and backgrounds, each evaluating two episodes. Participants are provided with an image depicting the target instance and interact with the agent via a chat-like interface (see aiuta_demo.mp4 in the supplementary materials). They initiate the navigation via the fixed template “Find the ”, and answer the agent’s questions in natural language. More details about the human evaluation are given in Supp. Mat. Sec. E. The human results are compared against those obtained with simulated user-agent interactions in Tab. 3. We observe no significant differences in the main metrics, confirming that the simulation setup is reliable for reproducible evaluation.

| User type | SR ↑ | SPL ↑ | NQ ↓ |
|---|---|---|---|
| Simulated | 42.50 | 15.48 | 1.10 |
| Real Human | 42.50 | 17.44 | 1.29 |

Table 3: Real human vs simulated user-agent interaction (CoIN-Bench subset).

| Self-Questioner | Skip-Question | SR ↑ | SPL ↑ | NQ ↓ |
|---|---|---|---|---|
| ✗ | ✗ | 9.21 | 5.86 | 3.57 |
| ✗ | ✔ | 8.55 | 4.84 | 2.69 |
| ✔ | ✗ | 9.87 | 6.5 | 4.6 |
| ✔ | ✔ | 14.47 | 7.22 | 1.68 |

Table 4: Ablation of components in AIUTA on the Train split. Metrics are reported on the Ablation split.

Ablation I: AIUTA components. We introduce the Ablation split, derived from the largest GOAT-Bench Train split, following the procedure in Sec. 5. We select GOAT-Bench Train as it covers more semantic categories. Since AIUTA is training-free, validation remains fair. Tab. 4 highlights the importance of the Self-Questioner and Skip-Question (within the Interaction Trigger). Without both (row 1), SR drops to 9.21%, with a high number of questions NQ. Removing only the Self-Questioner (row 2) lowers the SR while reducing NQ, as expected. Enabling only the Self-Questioner (row 3) improves SR to 9.87%, but keeps NQ high. With both components active (row 4), SR peaks at 14.47% and NQ drops to 1.68, demonstrating both effectiveness and efficiency.

Ablation II: VLM uncertainty estimation on IDKVQA. VLM uncertainty estimation is crucial for the Self-Questioner module, helping the agent to mitigate hallucinations and inaccuracies. To validate these techniques, we introduce IDKVQA, a VQA dataset with 502 questions and 102 images from GOAT-Bench [17]. Each question is answered by three annotators who choose from {Yes, No, I Don’t Know}, allowing the agent to abstain when information is insufficient. We compare our Normalized-Entropy-based technique against three recent techniques: MaxProb (selects the answer with the highest predicted probability); an energy score-based framework for out-of-distribution detection [26]; and LP [59], a recent logistic regression model trained as a linear probe on the logits distribution of the first generated token. Tab. 5 reports the performance using the Effective Reliability metric $\Phi_c$ proposed in [50]. Our proposed technique achieves the best $\Phi_{c=1}$ score of 21.12, demonstrating its effectiveness. Further details are given in the Supp. Mat. (Sec. H).
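To make the Effective Reliability metric concrete, the sketch below follows the definition in [50]: an answered question contributes its accuracy when correct and a penalty of $-c$ when wrong, while an abstention contributes 0. The binary notion of correctness, the percentage scaling and the function name are simplifying assumptions of this illustration; how IDKVQA maps the three annotator votes to a correctness score is not modeled here.

```python
from typing import Sequence


def effective_reliability(
    correct: Sequence[bool],
    answered: Sequence[bool],
    cost: float = 1.0,
) -> float:
    """Effective Reliability Phi_c, reported as a percentage.

    correct[i]  -- whether the model's answer to question i is right
                   (binary simplification of the VQA accuracy)
    answered[i] -- whether the selection function chose to answer (True)
                   or to abstain (False)
    cost        -- penalty c applied to every wrong answer that was given
    """
    if len(correct) != len(answered):
        raise ValueError("correct and answered must have the same length")
    if not correct:
        raise ValueError("at least one question is required")

    total = 0.0
    for is_correct, did_answer in zip(correct, answered):
        if not did_answer:
            continue  # abstaining is neither rewarded nor penalized
        total += 1.0 if is_correct else -cost
    return 100.0 * total / len(correct)


if __name__ == "__main__":
    correct = [True, False, True, False]
    # Answering everything: two rewards and two penalties cancel out.
    print(effective_reliability(correct, [True, True, True, True]))
    # Abstaining on the two wrong answers removes both penalties.
    print(effective_reliability(correct, [True, False, True, False]))
```

With $c = 1$, a wrong answer costs as much as a correct one earns, so a selection function only improves the score if it abstains mostly on questions the model would get wrong.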

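| VLM | Model | Selection Function | $\Phi_{c=1}$ |
|---|---|---|---|
| LLaVA | llava-v1.6-mistral-7b-hf | MaxProb | 15.94 |
| LLaVA | llava-v1.6-mistral-7b-hf | LP [59] | 14.01 |
| LLaVA | llava-v1.6-mistral-7b-hf | Energy Score [26] | 20.45 |
| LLaVA | llava-v1.6-mistral-7b-hf | Normalized Entropy (ours) | 21.12 |

Table 5: Results of different selection functions and their corresponding Effective Reliability rate $\Phi_{c=1}$ on the IDKVQA dataset.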

Ablation III: $\tau$. We analyze the sensitivity of the threshold ($\tau$ in Eq. 3) for our Normalized-Entropy-based technique and the second-best performing Energy Score [26]. We subsample the dataset to 50%, 70%, and 100% of its original size. For each subsampled dataset, we find the optimal threshold $\tau^*$ and evaluate its sensitivity by testing $\Phi_{c=1}$ on 30 alternative thresholds around $\tau^*$, normalizing it between 0 and 1. As shown in Fig. 3, our technique has a smaller interquartile range and a tighter distribution of $\Phi_{c=1}$, while [26] exhibits a greater degradation from $\tau^*$, which worsens as the dataset size decreases. This indicates that our technique is more robust in data-scarce situations and less sensitive to small variations in $\tau$. Moreover, [26] depends on logits and is therefore unbounded. On the contrary, our uncertainty is normalized, i.e. $u \in [0, 1]$, making optimal $\tau$ selection more efficient.
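The contrast between a bounded and an unbounded uncertainty score can be illustrated with a small sketch. It computes a normalized Shannon entropy over the logits of the candidate answers and, for comparison, the energy score $-T \log \sum_i e^{f_i / T}$ of [26]. Restricting the distribution to the three answer options, abstaining when $u > \tau$, and all function names are assumptions of this illustration rather than a reproduction of Eq. 3.

```python
import math
from typing import Optional, Sequence


def softmax(logits: Sequence[float]) -> list:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def normalized_entropy(logits: Sequence[float]) -> float:
    """Shannon entropy of softmax(logits) divided by log(K); lies in [0, 1]."""
    probs = softmax(logits)
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def energy_score(logits: Sequence[float], temperature: float = 1.0) -> float:
    """Energy score -T * logsumexp(logits / T); unbounded, scales with the logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    return -temperature * (m + math.log(sum(math.exp(x - m) for x in scaled)))


def answer_or_abstain(
    logits: Sequence[float], labels: Sequence[str], tau: float = 0.75
) -> Optional[str]:
    """Return the most likely label, or None (abstain) if uncertainty exceeds tau."""
    if normalized_entropy(logits) > tau:
        return None
    best = max(range(len(logits)), key=lambda i: logits[i])
    return labels[best]


if __name__ == "__main__":
    labels = ["Yes", "No", "I don't know"]
    print(answer_or_abstain([5.0, 0.0, 0.0], labels))  # confident prediction
    print(answer_or_abstain([0.1, 0.0, 0.0], labels))  # near-uniform, abstains
    # Rescaling the logits changes the energy range but not the entropy bounds.
    for scale in (1.0, 10.0, 100.0):
        logits = [scale, 0.0, 0.0]
        print(scale, normalized_entropy(logits), energy_score(logits))
```

Because the normalized entropy always falls in $[0, 1]$, a threshold can be searched on a fixed grid, whereas a threshold on the energy score has to be tuned to the logit scale of the specific model.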

Figure 3: $\tau$ sensitivity results. For each method, 30 new $\tau$ values are sampled symmetrically around the optimal threshold $\tau^*$. The x-axis shows the set size as a percentage of the original IDKVQA dataset size, while the y-axis displays the normalized ER $\Phi_{c=1}$.

7 Conclusion

We introduced the CoIN task, where the agent collaborates with the user during navigation to resolve uncertainties about the target instance. Through extensive experiments, we show that existing trained methods fail to generalize to unseen categories, while our training-free AIUTA, using a novel self-dialogue mechanism and uncertainty estimation, achieves strong performance across all validation splits. Moreover, our simulated user-agent interaction is in line with human evaluation, enabling scalable and reproducible experiments.

AIUTA relies on current LLMs, where larger models improve performance but at a high inference cost, limiting real-time on-board processing. Future work will investigate model optimization for embodied deployment and extend the interaction scope to action instructions.

References

[1] Dong An, Hanqing Wang, Wenguan Wang, Zun Wang, Yan Huang, Keji He, and Liang Wang. Etpnav: Evolving topological planning for vision-language navigation in continuous environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
[2] Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, et al. On Evaluation of Embodied Navigation Agents. arXiv preprint arXiv:1807.06757, 2018.
[3] Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3674–3683, 2018.
[4] Luca Barsellotti, Roberto Bigazzi, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. Personalized Instance-based Navigation Toward User-Specific Objects in Realistic Environments. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024.
[5] Dhruv Batra, Aaron Gokaslan, Aniruddha Kembhavi, Oleksandr Maksymets, Roozbeh Mottaghi, Manolis Savva, Alexander Toshev, and Erik Wijmans. ObjectNav Revisited: On Evaluation of Embodied Agents Navigating to Objects. arXiv preprint arXiv:2006.13171, 2020.
[6] Devendra Singh Chaplot, Dhiraj Prakashchand Gandhi, Abhinav Gupta, and Russ R Salakhutdinov. Object Goal Navigation using Goal-Oriented Semantic Exploration. In Advances in Neural Information Processing Systems, pages 4247–4258. Curran Associates, Inc., 2020.
[7] Ta-Chung Chi, Minmin Shen, Mihail Eric, Seokhwan Kim, and Dilek Hakkani-tur. Just Ask: An Interactive Learning Framework for Vision and Language Navigation. Proceedings of the AAAI Conference on Artificial Intelligence, 34(03):2459–2466, 2020.
[8] Hao-Tien Lewis Chiang, Zhuo Xu, Zipeng Fu, Mithun George Jacob, Tingnan Zhang, Tsang-Wei Edward Lee, Wenhao Yu, Connor Schenck, David Rendleman, Dhruv Shah, et al. Mobility vla: Multimodal instruction navigation with long-context vlms and topological graphs. arXiv preprint arXiv:2407.07775, 2024.
[9] Yinpei Dai, Run Peng, Sikai Li, and Joyce Chai. Think, Act, and Ask: Open-World Interactive Personalized Robot Navigation. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 3296–3303, 2024.
[10] Kiana Ehsani, Tanmay Gupta, Rose Hendrix, Jordi Salvador, Luca Weihs, Kuo-Hao Zeng, Kunal Pratap Singh, Yejin Kim, Winson Han, Alvaro Herrasti, Ranjay Krishna, Dustin Schwenk, Eli VanderBilt, and Aniruddha Kembhavi. Spoc: Imitating shortest paths in simulation enables effective navigation and manipulation in the real world. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16238–16250, 2024.
[11] Samir Yitzhak Gadre, Mitchell Wortsman, Gabriel Ilharco, Ludwig Schmidt, and Shuran Song. CoWs on Pasture: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2023.
[12] Qiaozi Gao, Govind Thattai, Suhaila Shakiah, Xiaofeng Gao, Shreyas Pansare, Vasu Sharma, Gaurav Sukhatme, Hangjie Shi, Bofei Yang, Desheng Zhang, Lucy Hu, Karthika Arumugam, Shui Hu, Matthew Wen, Dinakar Guthy, Shunan Chung, Rohan Khanna, Osman Ipek, Leslie Ball, Kate Bland, Heather Rocker, Michael Johnston, Reza Ghanadan, Dilek Hakkani-Tur, and Prem Natarajan. Alexa Arena: A User-Centric Interactive Platform for Embodied AI. In Advances in Neural Information Processing Systems, pages 19170–19194. Curran Associates, Inc., 2023.
[13] Groq. Groq - Accelerated AI Inference. https://groq.com/, 2024. Accessed: Mar. 7, 2025.
[14] Danna Gurari, Qing Li, Abigale J. Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P. Bigham. VizWiz Grand Challenge: Answering Visual Questions from Blind People. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2018.
[15] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024.
[16] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of Hallucination in Natural Language Generation. ACM Comput. Surv., 55(12), 2023.
[17] Mukul Khanna, Ram Ramrakhya, Gunjan Chhablani, Sriram Yenamandra, Theophile Gervet, Matthew Chang, Zsolt Kira, Devendra Singh Chaplot, Dhruv Batra, and Roozbeh Mottaghi. GOAT-Bench: A Benchmark for Multi-Modal Lifelong Navigation. In CVPR, pages 16373–16383. IEEE, 2024.
[18] Jacob Krantz, Stefan Lee, Jitendra Malik, Dhruv Batra, and Devendra Singh Chaplot. Instance-Specific Image Goal Navigation: Training Embodied Agents to Find Object Instances. arXiv preprint arXiv:2211.15876, 2022.
[19] Yuxuan Kuang, Hai Lin, and Meng Jiang. OpenFMNav: Towards Open-Set Zero-Shot Object Navigation via Vision-Language Foundation Models. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 338–351, Mexico City, Mexico, 2024. Association for Computational Linguistics.
[20] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In Proceedings of the 40th International Conference on Machine Learning. JMLR.org, 2023.
[21] Weijie Li, Xinhang Song, Yubing Bai, Sixian Zhang, and Shuqiang Jiang. ION: Instance-level Object Navigation. In ACM MM, pages 4343–4352. ACM, 2021.
[22] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual Instruction Tuning. In Advances in Neural Information Processing Systems, pages 34892–34916. Curran Associates, Inc., 2023.
[23] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, 2024.
[24] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. arXiv preprint arXiv:2303.05499, 2023.
[25] Shi Liu, Kecheng Zheng, and Wei Chen. Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs. In Computer Vision - ECCV 2024. Springer Nature Switzerland, 2025.
[26] Weitang Liu, Xiaoyun Wang, John Owens, and Yixuan Li. Energy-based Out-of-distribution Detection. In Advances in Neural Information Processing Systems, pages 21464–21475. Curran Associates, Inc., 2020.
[27] Xiulong Liu, Sudipta Paul, Moitreya Chatterjee, and Anoop Cherian. CAVEN: An Embodied Conversational Agent for Efficient Audio-Visual Navigation in Noisy Environments. Proceedings of the AAAI Conference on Artificial Intelligence, 38(4):3765–3773, 2024.
[28] Arjun Majumdar, Gunjan Aggarwal, Bhavika Devnani, Judy Hoffman, and Dhruv Batra. ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings. In Advances in Neural Information Processing Systems, pages 32340–32352. Curran Associates, Inc., 2022.
[29] Arjun Majumdar, Fei Xia, Brian Ichter, Dhruv Batra, and Leonidas Guibas. FindThis: Language-Driven Object Disambiguation in Indoor Environments. In Proceedings of The 7th Conference on Robot Learning, pages 1335–1347. PMLR, 2023.
[30] Leland McInnes, John Healy, Nathaniel Saul, and Lukas Großberger. UMAP: Uniform Manifold Approximation and Projection. Journal of Open Source Software, 3(29):861, 2018.
[31] Khanh Nguyen and Hal Daumé III. Help, Anna! Visual Navigation with Natural Multimodal Assistance via Retrospective Curiosity-Encouraging Imitation Learning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, 2019.
[32] Khanh Nguyen, Debadeepta Dey, Chris Brockett, and Bill Dolan. Vision-Based Navigation With Language-Based Assistance via Imitation Learning With Indirect Intervention. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2019.
[33] Hadas Orgad, Michael Toker, Zorik Gekhman, Roi Reichart, Idan Szpektor, Hadas Kotek, and Yonatan Belinkov. LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations. arXiv preprint arXiv:2410.02707, 2024.
[34] Sudipta Paul, Amit Roy-Chowdhury, and Anoop Cherian. AVLEN: Audio-Visual-Language Embodied Navigation in 3D Environments. In Advances in Neural Information Processing Systems, pages 6236–6249. Curran Associates, Inc., 2022.
[35] Yusu Qian, Haotian Zhang, Yinfei Yang, and Zhe Gan. How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts. In Neurips Safe Generative AI Workshop 2024, 2024.
[36] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[37] Santhosh Kumar Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alexander Clegg, John Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel Chang, Manolis Savva, Yili Zhao, and Dhruv Batra. Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, 2021.
[38] Ram Ramrakhya, Dhruv Batra, Erik Wijmans, and Abhishek Das. PIRLNav: Pretraining with Imitation and RL Finetuning for OBJECTNAV. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2023.
[39] Niyati Rawal, Roberto Bigazzi, Lorenzo Baraldi, and Rita Cucchiara. UNMuTe: Unifying Navigation and Multimodal Dialogue-like Text Generation. arXiv preprint arXiv:2408.04423, 2024.
[40] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China, 2019. Association for Computational Linguistics.
[41] Allen Z. Ren, Anushri Dixit, Alexandra Bodrova, Sumeet Singh, Stephen Tu, Noah Brown, Peng Xu, Leila Takayama, Fei Xia, Jake Varley, Zhenjia Xu, Dorsa Sadigh, Andy Zeng, and Anirudha Majumdar. Robots That Ask For Help: Uncertainty Alignment for Large Language Model Planners. In 7th Annual Conference on Robot Learning, 2023.
[42] Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A Platform for Embodied AI Research. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
[43] Claude Elwood Shannon. A mathematical theory of communication. The Bell system technical journal, 27(3):379–423, 1948.
[44] Kunal Pratap Singh, Luca Weihs, Alvaro Herrasti, Jonghyun Choi, Aniruddha Kembhavi, and Roozbeh Mottaghi. Ask4Help: Learning to Leverage an Expert for Embodied Tasks. In Advances in Neural Information Processing Systems, pages 16221–16232. Curran Associates, Inc., 2022.
[45] Xander Sun, Louis Lau, Hoyard Zhi, Ronghe Qiu, and Junwei Liang. Prioritized Semantic Learning for Zero-shot Instance Navigation. In Computer Vision - ECCV 2024. Springer Nature Switzerland, 2025.
[46] Francesco Taioli, Stefano Rosa, Alberto Castellini, Lorenzo Natale, Alessio Del Bue, Alessandro Farinelli, Marco Cristani, and Yiming Wang. Mind the Error! Detection and Localization of Instruction Errors in Vision-and-Language Navigation. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 12993–13000, 2024.
[47] Jesse Thomason, Michael Murray, Maya Cakmak, and Luke Zettlemoyer. Vision-and-Dialog Navigation. In Proceedings of the Conference on Robot Learning, pages 394–406. PMLR, 2020.
[48] Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9568–9578. IEEE, 2024.
[49] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is All you Need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008, 2017.
[50] Spencer Whitehead, Suzanne Petryk, Vedaad Shakib, Joseph Gonzalez, Trevor Darrell, Anna Rohrbach, and Marcus Rohrbach. Reliable Visual Question Answering: Abstain Rather Than Answer Incorrectly. In Computer Vision – ECCV 2022, pages 148–166. Springer Nature Switzerland, 2022.
[51] Karmesh Yadav, Jacob Krantz, Ram Ramrakhya, Santhosh Kumar Ramakrishnan, Jimmy Yang, Austin Wang, John Turner, Aaron Gokaslan, Vincent-Pierre Berges, Roozbeh Mottaghi, Oleksandr Maksymets, Angel X Chang, Manolis Savva, Alexander Clegg, Devendra Singh Chaplot, and Dhruv Batra. Habitat Challenge 2023, 2023.
[52] B. Yamauchi. A frontier-based approach for autonomous exploration. In Proceedings 1997 IEEE International Symposium on Computational Intelligence in Robotics and Automation CIRA’97. ’Towards New Computational Principles for Robotics and Automation’, pages 146–151, 1997.
[53] Naoki Yokoyama, Sehoon Ha, Dhruv Batra, Jiuguang Wang, and Bernadette Bucher. VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 42–48. IEEE, 2024.
[54] Naoki Yokoyama, Ram Ramrakhya, Abhishek Das, Dhruv Batra, and Sehoon Ha. HM3D-OVON: A Dataset and Benchmark for Open-Vocabulary Object Goal Navigation. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5543–5550, 2024.
[55] Bangguo Yu, Hamidreza Kasaei, and Ming Cao. L3MVN: Leveraging Large Language Models for Visual Target Navigation. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2023.
[56] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid Loss for Language Image Pre-Training. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pages 11941–11952. IEEE, 2023.
[57] Chaoning Zhang, Dongshen Han, Yu Qiao, Jung Uk Kim, Sung-Ho Bae, Seungkyu Lee, and Choong Seon Hong. Faster Segment Anything: Towards Lightweight SAM for Mobile Applications. arXiv preprint arXiv:2306.14289, 2023.
[58] Lingfeng Zhang, Qiang Zhang, Hao Wang, Erjia Xiao, Zixuan Jiang, Honglei Chen, and Renjing Xu. TriHelper: Zero-Shot Object Navigation with Dynamic Assistance. arXiv preprint arXiv:2403.15223, 2024.
[59] Qinyu Zhao, Ming Xu, Kartik Gupta, Akshay Asthana, Liang Zheng, and Stephen Gould. The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models? In Computer Vision - ECCV 2024. Springer Nature Switzerland, 2025.
[60] Kaiwen Zhou, Kaizhi Zheng, Connor Pryor, Yilin Shen, Hongxia Jin, Lise Getoor, and Xin Eric Wang. ESC: Exploration with Soft Commonsense Constraints for Zero-shot Object Navigation. In Proceedings of the 40th International Conference on Machine Learning, pages 42829–42842. PMLR, 2023.
[61] Deyao Zhu, Jun Chen, Kilichbek Haydarov, Xiaoqian Shen, Wenxuan Zhang, and Mohamed Elhoseiny. ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions. Transactions on Machine Learning Research, 2024.
[62] Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J. Lim, Abhinav Gupta, Li Fei-Fei, and Ali Farhadi. Target-driven Visual Navigation in Indoor Scenes using Deep Reinforcement Learning. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 3357–3364. IEEE, 2017.

Supplementary Material

In this supplementary material, we first provide additional details regarding the CoIN-Bench dataset (Sec. A), including an overview of the GOAT-Bench dataset on which CoIN-Bench is based, as well as statistics and examples of target instances within CoIN-Bench. Additionally, in Sec. B, we provide a visualization of the evaluation setup, clarifying the role of the user responses in our evaluation, whether from real humans or from simulation. Next, we elaborate on the implementation details for baseline comparisons in Sec. C, and present a computational analysis of AIUTA in Sec. D. Further details on the real human experiments are provided in Sec. E. Then, in Sec. F we investigate the low performance of the original VLFM, and in Sec. G we provide a UMAP visualization of the questions generated by the agent. Next, in Sec. H, we detail the evaluation conducted on IDKVQA, including the dataset creation, evaluation metric, and state-of-the-art baselines used for comparison. Finally, we include all the prompts in Sec. I and the full algorithm of AIUTA in Sec. J.

For a demonstration of AIUTA in action, engaging with a real human through natural language dialogues to collaboratively localize a target instance, please refer to the accompanying video ( aiuta_demo.mp4) provided in the supplementary material.

A Additional details of CoIN-Bench

A.1 CoIN-Bench

Figure 1: CoIN-Bench can be very challenging when the agent is given only the instance category. We highlight the target instance with red borders, while the distractor instances that exist in the same scene are marked with blue borders.

Instance examples. The CoIN-Bench benchmark poses a significant challenge, since multiple distractor objects are present alongside each target. To illustrate this, Fig. 1 provides examples where the target instance is highlighted with red borders, while distractors in the same scene are marked with blue borders. As demonstrated, agent-user collaboration is crucial to gather the details necessary for uniquely identifying the target instance among other visually similar objects of the same category, such as the armchair or the plant.

Dataset statistics. We provide additional statistics for the CoIN-Bench dataset. Fig. 2 shows the shortest-path statistics, namely the Euclidean and geodesic distances for all the splits, as well as the number of distractors. Next, Fig. 3 illustrates the distribution of instance categories across different splits. These splits are ordered by dataset size, from the largest at the top (Val Seen) to the smallest at the bottom (Val Seen Synonyms). The number of distinct categories decreases as the dataset size decreases. The Val Seen split, being the largest, also contains the highest number of distinct categories, with “cabinet”, “bed”, and “table” being the 3 most common categories. Val Seen Synonyms, being the smallest, contains only 3 categories.

Figure 2: Distribution of the path length to the goal, in both Euclidean and geodesic terms.

Figure 3: Distribution of instance categories for each evaluation split.

Figure 4: CoIN-Bench evaluation setup. (Left) Real human responding to the agent’s question. (Right) Simulated user-agent interactions, where the user responses are provided by a VLM with access to a high-resolution target instance image for scalable and reproducible experimentation.

A.2 GOAT-Bench

Dataset. GOAT-Bench provides agents with a sequence of targets specified either by category name c (using episodes from [54]), language description d, or image in an open-vocabulary fashion, using the HM3DSem [37] scene dataset and the Habitat simulator [42]. Natural-language descriptions d are created with an automatic pipeline that leverages ground-truth semantic and spatial information from the simulator [42] along with the capabilities of VLMs and LLMs. Specifically, for each object-goal instance, a viewpoint image is sampled to maximize frame coverage. From this sampled image, the names and 2D bounding box coordinates of visible objects are extracted. Then, spatial information is extracted with the BLIP-2 [20] model, while ChatGPT-3.5 is prompted to output the final language description.

Splits. GOAT-Bench baselines are trained on the Train split and evaluated on the validation splits. Notably, the evaluation splits are divided into Val Seen (i.e., object categories seen during training), Val Seen Synonyms (i.e., object categories that are synonyms of those seen during training) and Val Unseen (i.e., novel object categories).

\AlphAlphEvaluation Setup

In Fig. 4, we show the two evaluation setups, highlighting the differences between the human-user setup and the simulated user-agent interactions. Fig. 4 (Left) shows how a human user answers the agent’s queries based on their knowledge of the target instance. However, relying on human responses for large-scale evaluations is impractical due to variability, limited scalability, and high cost. To address this, we introduce a simulated user-agent interaction setup, shown in Fig. 4 (Right). The user responses are simulated via a VLM with access to the high-resolution target instance image, which is never available to the agent. Because this image provides visual coverage of the target instance, the simulated user can answer the agent’s diverse, open-ended, template-free questions about any attribute of the target instance. This is preferable to previous work [9], whose simulation setup leverages an LLM with access to the instance description, particularly when the description misses fine details that the agent deems important. For instance, for the picture in Fig. 1 of the main paper, the instance description may not mention that “the person is shirtless”, but this detail is critical for the agent to eventually disambiguate the target instance from distractors.

\AlphAlphBaselines

In this section, we provide a description of the different baselines for Instance Navigation and Object Navigation used throughout the paper: VLFM (Sec. \AlphAlph.1), Monolithic (Sec. \AlphAlph.2), PSL (Sec. \AlphAlph.3), and OVON (Sec. \AlphAlph.4).

\AlphAlph.1VLFM

VLFM [53] is a zero-shot, state-of-the-art object-goal navigation policy that does not require model training, pre-built maps, or prior knowledge about the environment. The core of the approach involves two maps: a frontier map (see Fig. 5 (a)) and a value map (see Fig. 5 (b)).

Frontier map. The frontier map is a top-down 2D map built from depth and odometry information. The explored area within the map is updated based on the robot’s location, heading, and obstacles by reconstructing the environment into a point cloud with the depth images, and then projecting them onto a 2D grid. The role of the frontier map is to identify each boundary separating the explored and unexplored areas, thus identifying the frontiers (see the blue dots in Fig. 5 (a)).

Value map. The value map is a 2D map similar to the frontier map. For each point within the explored area, a value is assigned by quantifying its relevance in locating the target object (see Fig. 5 (b)). At each timestep, frontiers are extracted from the frontier map, and the frontier with the highest value on the value map is selected as the next goal for exploration. To efficiently guide the navigation, VLFM projects the cosine similarity between the current visual observation and a textual prompt (e.g., “Find the picture”) onto the value map. This similarity is computed using the BLIP-2 model [20], which achieves state-of-the-art performance in image-to-text retrieval. To verify whether a target instance is present in the current observation, VLFM employs Grounding-DINO [24], an open-vocabulary object detector. Once a candidate target is detected, Mobile-SAM [57] refines the detection by segmenting the object’s contour within the bounding box. The segmented contour is paired with depth information to determine the closest point on the object relative to the agent’s position. This point serves as a waypoint for the agent to navigate toward the object.

At each timestep, the action 𝑎 𝑡 is selected using a PointGoal navigation (PointNav) policy [2], which can navigate to either a frontier or a waypoint, depending on the context.
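The frontier selection rule described above (pick the frontier with the highest value on the value map) can be sketched as follows. This is a minimal illustration, not VLFM's implementation: the grid representation, the `select_frontier` helper, and the toy values are assumptions made for the example.

```python
from typing import List, Tuple

Cell = Tuple[int, int]  # (row, col) index into the 2D value map


def select_frontier(frontiers: List[Cell], value_map: List[List[float]]) -> Cell:
    """Return the frontier cell with the highest value on the value map."""
    if not frontiers:
        raise ValueError("no frontiers left to explore")
    return max(frontiers, key=lambda rc: value_map[rc[0]][rc[1]])


# Toy value map: each entry stands in for a projected image-text similarity.
value_map = [
    [0.10, 0.20, 0.05],
    [0.30, 0.80, 0.15],
    [0.25, 0.40, 0.60],
]
frontiers = [(0, 1), (1, 1), (2, 2)]
best = select_frontier(frontiers, value_map)
```

In this toy grid, `best` is the cell whose value is 0.80; the selected cell would then be handed to the PointNav policy as the next exploration goal.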

Figure 5: (a) Frontier map and (b) value map constructed by VLFM [53]. The blue dots in (a) (as well as the red dots in (b)) are the identified frontiers.

\AlphAlph.2Monolithic

The Monolithic baseline (SenseAct-NN Monolithic Policy) is a single, end-to-end reinforcement learning (RL) policy designed for multimodal tasks, leveraging the implicit memory and goal encoding proposed in [17]. RGB observations are encoded using a frozen CLIP [36] ResNet50 encoder. Additionally, the agent integrates GPS and compass inputs, representing location $(\Delta x, \Delta y, \Delta z)$ and orientation $(\Delta\theta)$. These inputs are embedded into 32-dimensional vectors using an encoder with fully connected layers. To model multimodal inputs, a 1024-dimensional goal embedding is derived using a frozen CLIP image or CLIP text encoder, depending on the subtask modality (object, image, or language). All input features (image, location, orientation, and goal embedding) are concatenated into an observation embedding, which is processed through a two-layer, 512-dimensional GRU. At each timestep, the GRU predicts a distribution over a set of actions based on the current observation and the hidden state. The policy is trained using 4 × A40 GPUs for approximately 500 million steps.

\AlphAlph.3PSL

PSL [45] is a zero-shot policy for instance navigation, which is pre-trained on the ImageNav task and transferred to achieve object goal navigation without using object annotations for training.

Built on top of ZSON [28], PSL processes observations with a learned ResNet50 encoder and a frozen CLIP encoder, obtaining observation embeddings and semantic-level embeddings, respectively. To encode the goal modality, an additional frozen CLIP encoder is used to obtain the goal embedding. The goal and semantic-level embeddings are further processed by a semantic perception module, which reduces their dimensionality to condense critical information and emphasizes reasoning about the semantic differences between the goal and the observation. Based on the condensed embeddings and the observation embeddings, the authors train a navigation policy using reinforcement learning. Specifically, the PSL agent is trained for 1G steps following ZSON [28], on 16 Nvidia RTX-3090 GPUs.

\AlphAlph.4OVON

OVON [54] is a transformer-based policy designed for the open-vocabulary object navigation task. At each timestep $t$, it constructs a 1568-dimensional latent observation $o_t$ of the current navigation state by concatenating the encodings of the current image $I_t$, the object description $c$, and the previous action $a_{t-1}$, using two SigLIP encoders [56] and an action embedding layer. This latent observation is then passed, along with the previous 100 observations, to a 4-layer decoder-only transformer [49], which outputs a feature vector. This vector is then used to produce a categorical distribution over the action space via a simple linear layer. The policy is trained on the proposed HM3D-OVON dataset using various methods, such as RL and BC, for 150M to 300M steps, with 16 environments on each of 8 TITAN Xp GPUs, for a total of 128 environments. Note that the categories seen during training in HM3D-OVON overlap with those of GOAT-Bench.

\AlphAlphComputational analysis

In the following, we report the average inference time of AIUTA over 20 episodes on an NVIDIA 4090 GPU, following the steps outlined in the algorithm in Sec. \AlphAlph.2: Step 1, Detailed Detection Description: 11.3s; Step 2, Perception Uncertainty Estimation: 8.29s; Step 3 + Interaction Trigger: 6.13s. We emphasize that our code was not optimized for speed, as this is outside the scope of our study: we did not apply model compilation (e.g., torch.compile) or quantization, leaving room for further efficiency improvements. Moreover, we identify the LLM call as the primary bottleneck in AIUTA. As discussed in the Conclusion (Sec. 7) and in [8], reducing model dimensionality while maintaining similar reasoning performance is both a necessity and a promising direction for future work. Additionally, our inference time is in line with other works [8]. Finally, the emerging field of Language Processing Units (LPUs) offers potential solutions, promising near-instant inference, high affordability, and energy efficiency at scale [13].

\AlphAlphEvaluation with real human

To demonstrate the reliability and reproducibility of our simulated setup, we run a human study comparing the performance of AIUTA when user responses are provided by (i) real humans and (ii) simulation (Fig. 4). A total of 20 volunteers participated in the study (12 males and 8 females), with ages ranging from 20 to 40 years. All participants have backgrounds in electronic engineering, computer science, or other relevant fields, minimizing expertise barriers to conducting the experiments. At the start of each episode, participants are given an image depicting the final target instance, which remains accessible throughout the experiment. Again, note that this image is never seen by the agent. This setup simulates a real-world scenario where a human has a reference image in mind, enabling them to answer questions correctly. The human user then initiates the navigation by sending the initial instruction to the agent (using the fixed template “Find the ”) via a chat-like User Interface (UI) that we developed for the evaluation (as demonstrated in the supplementary video, aiuta_demo.mp4). Next, the human user is encouraged to respond to the questions posed by AIUTA in natural language and to truthfully reflect the facts about the target instance. For this evaluation, we selected 40 episodes across the CoIN-Bench dataset, randomly distributed among participants, with each participant conducting two evaluations. Compared with the simulated setting, we found no statistically significant differences in the results, showing that our simulation-based evaluation is reliable and reproducible.

\AlphAlphVLFM results

In this section, we investigate the low SR results of VLFM in Tab. 2 of the main paper. To better understand this behavior, we introduce an additional metric, Distractor Success, which mirrors the success rate but considers an episode successful if the agent stops at a distractor object instead of the target. As we can see in Tab. 6, VLFM successfully locates the correct category instance (Distractor Success) but struggles to discern its attributes and differentiate between instances (low SR). Furthermore, this analysis highlights that the presence of sufficient distractors is well realized with our dataset construction procedure.
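To make the relation between SR and Distractor Success concrete, the sketch below computes both from per-episode stop outcomes. The outcome labels (`"target"`, `"distractor"`, `"other"`) and the `success_rates` helper are hypothetical names introduced for illustration; the function returns plain fractions and makes no assumption about the units used in the paper's tables.

```python
from typing import Dict, List


def success_rates(stops: List[str]) -> Dict[str, float]:
    """Fraction of episodes ending at the target vs. at a distractor.

    Each entry of `stops` is a hypothetical label describing where the agent
    stopped: "target", "distractor" (same category, wrong instance), or "other".
    """
    if not stops:
        raise ValueError("at least one episode is required")
    n = len(stops)
    return {
        "SR": sum(s == "target" for s in stops) / n,
        "Distractor Success": sum(s == "distractor" for s in stops) / n,
    }


rates = success_rates(["target", "distractor", "distractor", "other"])
```

A policy that reliably finds the right category but cannot tell instances apart would show a low `SR` together with a noticeably higher `Distractor Success`.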

| Statistics | Val Seen | Val Seen Synonyms | Val Unseen |
|---|---|---|---|
| SR | 0.36 | 0.00 | 0.00 |
| Distractor Success | 3.37 | 0.84 | 4.58 |

Table 6: SR and Distractor Success comparison.

\AlphAlphUMAP visualization

To illustrate the diversity of the questions that AIUTA poses to the user, we collect 414 question samples generated by the agent, compute their embeddings using Sentence-BERT [40], and visualize them using UMAP [30] for dimensionality reduction. The results, shown in Fig. 6, demonstrate that AIUTA generates questions covering a wide range of attributes, such as color, material, style, and spatial arrangement.

Figure 6: AIUTA generates questions covering a wide range of attributes, such as color, material, style, and spatial arrangement.

\AlphAlphIDKVQA dataset

Figure 7: Examples from IDKVQA, showing images and the questions generated by the LLM.

\AlphAlph.1Dataset

An essential feature of the Self-Questioner is its ability to generate self-questions aimed at extracting additional attributes from the observation $O_t$ and assessing the uncertainty of the VLM. However, no existing dataset in this context allows us to assess how reliable a technique is for VLM uncertainty estimation.

For this purpose, we introduce IDKVQA, a dataset specifically designed and annotated for visual question answering using the agent’s observations during navigation, where the answer includes not only Yes and No, but also I don’t know. Specifically, we sample 102 images from the training split of GOAT-Bench. Then, for each image, we leverage the Self-Questioner pipeline to generate a set of questions. Each question is annotated by three annotators, each of whom picks one answer from the set {Yes, No, I don’t know}. Fig. 7 illustrates sample images and their questions generated by the Self-Questioner module.

\AlphAlph.2VLM uncertainty estimation on IDKVQA.

In this section, we present a detailed analysis of VLM uncertainty estimation on IDKVQA, focusing on the evaluation metric and baseline methods.

Metric. We evaluate the performance using the Effective Reliability metric Φ 𝑐 proposed in [50]. This metric captures the trade-off between risk and coverage in a VQA model by assigning a reward to questions that are answered correctly, a penalty 𝑐 to questions answered incorrectly, and a zero reward to the model abstaining. Formally:

$$
\Phi_c(x) =
\begin{cases}
\mathrm{Acc}(x), & \text{if } g(x) = 1 \text{ and } \mathrm{Acc}(x) > 0, \\
-c, & \text{if } g(x) = 1 \text{ and } \mathrm{Acc}(x) = 0, \\
0, & \text{if } g(x) = 0.
\end{cases}
$$

Here, $x = (i, q) \in \mathcal{X}$ is the input pair where $i$ is the image and $q$ is the question. The function $g(x)$ is equal to 1 if the model is answering and 0 if it abstains. The parameter $c$ denotes the cost for an incorrect answer, and the VQA accuracy Acc is:

$$
\mathrm{Acc}(f(x, y)) = \min\left(\frac{\#\,\text{annotations that match } f(x)}{3},\ 1\right)
$$

where the function $f : \mathcal{X} \to V$ outputs a response $r \in \mathcal{R}$ for each input pair $x$.
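The two definitions above translate directly into code. The following is a minimal sketch under stated assumptions: abstention is represented by `None` (i.e., $g(x) = 0$), answers are compared as exact strings, and the helper names are hypothetical rather than taken from the authors' code.

```python
from typing import List, Optional, Sequence


def vqa_accuracy(prediction: str, annotations: Sequence[str]) -> float:
    """min(#annotations that match the prediction / 3, 1)."""
    matches = sum(a == prediction for a in annotations)
    return min(matches / 3.0, 1.0)


def effective_reliability(
    prediction: Optional[str], annotations: Sequence[str], cost: float = 1.0
) -> float:
    """Phi_c for one sample; `prediction is None` encodes abstention (g(x) = 0)."""
    if prediction is None:
        return 0.0  # abstaining earns no reward and no penalty
    acc = vqa_accuracy(prediction, annotations)
    return acc if acc > 0 else -cost  # any wrong answer costs c


def mean_effective_reliability(
    predictions: List[Optional[str]],
    annotations: List[Sequence[str]],
    cost: float = 1.0,
) -> float:
    """Average Phi_c over a dataset."""
    scores = [
        effective_reliability(p, a, cost) for p, a in zip(predictions, annotations)
    ]
    return sum(scores) / len(scores)
```

With `cost=1.0`, a confidently wrong answer cancels out a fully correct one, which is why abstaining on uncertain questions can raise the average score.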

Baselines. We evaluate our proposed Normalized Entropy against three baseline methods:

(i) MaxProb, which selects the response $r$ with the highest predicted probability from the VLM, given image $i$ and question $q$. Formally, $r = \mathrm{VLM}(i, q)$. It does not incorporate uncertainty estimation.

(ii) LP [59], a recently proposed Logistic Regression model trained as a linear probe on the logits distribution of the first generated token. The model is trained on the Answerable/Unanswerable classification task using the VizWiz VQA dataset [14], which includes 23,954 images for training. When applied to IDKVQA, the logistic regression model first predicts whether the question $q$ is Answerable or Unanswerable. If the question is deemed answerable, the response $r$ with the highest probability is selected among {Yes, No}; otherwise, the response I don’t know is returned.

(iii) Energy score, an energy-based framework for out-of-distribution (OOD) detection [26]. Following the implementation in [26], an energy score is computed to identify whether the given question-image pair is OOD. If the pair is classified as OOD, the response I don’t know is returned; otherwise, the response with the highest probability is selected among { Yes , No } .
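For reference, the energy score of [26] is the negative temperature-scaled log-sum-exp of the logits, with higher energy indicating a more likely OOD input. The sketch below illustrates the abstention rule of baseline (iii) under assumptions not stated in the text: the logits are taken over the candidate answers, the temperature defaults to 1, and the threshold `tau` in the tests is an arbitrary illustrative value.

```python
import math
from typing import Sequence


def energy_score(logits: Sequence[float], temperature: float = 1.0) -> float:
    """E(x) = -T * log(sum_k exp(logit_k / T)), computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max before exponentiating to avoid overflow
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -temperature * log_sum_exp


def energy_answer_or_abstain(
    logits: Sequence[float], answers: Sequence[str], tau: float
) -> str:
    """Abstain when the energy exceeds tau (treated as OOD), else take the argmax."""
    if energy_score(logits) > tau:
        return "I don't know"
    best = max(range(len(logits)), key=lambda k: logits[k])
    return answers[best]
```

Peaked logits give a strongly negative energy and an answer, whereas flat, low-magnitude logits give an energy closer to zero and trigger abstention.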

Finally, for our proposed Normalized Entropy estimation, we link the abstention function $g(x)$ (i.e., determining whether the model abstains from answering) to Eq. 3 in the main paper. Specifically, $g(x) = 1$ if the Normalized Entropy classifies the model as certain, and $g(x) = 0$ otherwise. Then, if the model is deemed certain, we return the most probable answer among {Yes, No}; otherwise, the response I don’t know is selected.
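The abstention rule can be sketched as follows. Since Eq. 3 is not reproduced in this section, the sketch makes explicit assumptions: the entropy is normalized by $\log K$ for $K$ candidate answers, the model counts as certain when the normalized entropy is at most $\tau$, and the default `tau=0.75` is used purely as an illustrative value. Function names are hypothetical.

```python
import math
from typing import Dict


def normalized_entropy(probs: Dict[str, float]) -> float:
    """Shannon entropy of the answer distribution divided by log(K), in [0, 1]."""
    if len(probs) < 2:
        raise ValueError("need at least two candidate answers")
    total = sum(probs.values())
    entropy = 0.0
    for p in probs.values():
        p = p / total  # renormalize in case the inputs do not sum to 1
        if p > 0:
            entropy -= p * math.log(p)
    return entropy / math.log(len(probs))


def entropy_answer_or_abstain(probs: Dict[str, float], tau: float = 0.75) -> str:
    """Return the most probable of Yes/No if certain, otherwise abstain."""
    if normalized_entropy(probs) > tau:  # assumed: uncertain means g(x) = 0
        return "I don't know"
    yes_no = {k: probs[k] for k in ("Yes", "No")}
    return max(yes_no, key=yes_no.get)


confident = {"Yes": 0.9, "No": 0.05, "I don't know": 0.05}
uniform = {"Yes": 1 / 3, "No": 1 / 3, "I don't know": 1 / 3}
```

Here `confident` yields an answer, while `uniform` (maximal entropy) leads to abstention for any `tau` below 1.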

\AlphAlph.3Sensitivity analysis of the threshold 𝜏

This section provides additional details about how small variations of the threshold parameter $\tau$ affect both our Normalized Entropy technique (Eq. 3) and the Energy Score [26], with respect to the target metric $\Phi_{c=1}$.

To conduct this analysis, we perform an ablation study on datasets of varying sizes, obtained by randomly sub-sampling CoIN-Bench. Specifically, we create five sets containing 50 % of the question-answer pairs from CoIN-Bench, five sets comprising 70 % of the question-answer pairs, and also use the full dataset ( 100 % ) for a total of 11 datasets.

For each dataset, we identify the optimal threshold $\tau^*$ for each method through an exhaustive search over predefined ranges, resulting in 22 optimal thresholds (11 per method).

Around each $\tau^*$, we define a neighborhood $\boldsymbol{\tau}$ comprising 30 new thresholds $\tau$ sampled symmetrically around it. Our goal is to analyze how $\Phi_{c=1}$ changes across these neighborhoods: if the values are spread out, the method is very sensitive to small changes of $\tau$ near the optimal value, whereas if they are more tightly distributed, it is more robust.

Therefore, for each method and related neighborhood $\boldsymbol{\tau}$, we compute 30 $\Phi_c$ values, one for each $\tau \in \boldsymbol{\tau}$, and normalize them to the range $[0, 1]$ by dividing each value by the best $\Phi_{c=1}$ found in $\boldsymbol{\tau}$. We do so to measure only the distribution of the $\Phi_{c=1}$ values, not their absolute values, and to help the comparison across datasets of the same size (otherwise, due to chance, they could have distributions of different values). Finally, we aggregate all these normalized $\Phi_{c=1}$ scores across dataset size, resulting in Fig. 3 (main paper).

From the figure, we can see that our technique has smaller interquartile ranges and tighter distributions of $\Phi_{c=1}$, while the Energy Score [26] exhibits larger tails, indicating more variance. Moreover, our method shows distributions more biased toward higher values (which would indicate smaller degradation w.r.t. the best $\Phi_{c=1}$) than those of the Energy Score, and this gap increases as the dataset size decreases. This shows that our technique is generally more robust, especially in data-scarce situations, and less sensitive to small variations in $\tau$.
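The neighborhood construction and normalization used in this analysis can be sketched as below. The text does not specify how the 30 thresholds are spaced, so the even spacing, the `half_width` parameter, and the toy metric in the usage lines are assumptions made for illustration only.

```python
from typing import Callable, List


def neighborhood(tau_star: float, half_width: float, n: int = 30) -> List[float]:
    """n thresholds placed symmetrically around tau_star (tau_star itself excluded)."""
    if n < 2 or n % 2 != 0:
        raise ValueError("n must be a positive even number")
    half = n // 2
    step = half_width / half
    offsets = [step * k for k in range(1, half + 1)]
    return sorted([tau_star - o for o in offsets] + [tau_star + o for o in offsets])


def normalized_scores(taus: List[float], phi: Callable[[float], float]) -> List[float]:
    """Evaluate phi at every threshold and divide by the best value found."""
    scores = [phi(t) for t in taus]
    best = max(scores)
    if best <= 0:
        raise ValueError("normalization assumes a positive best score")
    return [s / best for s in scores]


# Toy metric peaking at the optimum; it stands in for Phi_{c=1} on a dataset.
taus = neighborhood(tau_star=0.75, half_width=0.15)
scores = normalized_scores(taus, lambda t: 20.0 - 10.0 * abs(t - 0.75))
```

The spread of `scores` (for example its interquartile range) is then the quantity compared across methods: a tighter spread means less sensitivity to the choice of threshold.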

\AlphAlphPrompts

\AlphAlph.1 $P_{init}$

Initial Description

    P_init = """Describe the {target_object} in the provided image."""

\AlphAlph.2 $P_{details}$
Gather Additional Information

    P_details = """You are an intelligent embodied agent equipped with an RGB sensor, an object detector, and a Visual Question Answering (VQA) model.
    Your task is to explore an indoor environment to find a specific target {target_object}.
    The detector has identified a {target_object}. The VQA model has provided the following description of the scene:


    {distractor_object_description}


    Based on your past interactions with the user, you know the following facts about the target picture:

    {facts_about_the_target_picture}


    Your task is to:
    - ask more question to the VQA model on the detected {target_object} to maximize information gain.

    Ensure your output follows the following format:

    YAML_START # must be present to get the information back
    attributes_of_the_image:
     : "" # summarize all the known attributes from the description, enclosed in " "
    questions:
     : ""
    YAML_END # must be present to get the information back

    Provide your reasoning step-by-step, after the YAML_END tag."""

\AlphAlph.3 $P_{check}$
Check detection with LVML

    P_check = """Does the image contain a {target_object}? Answer with Yes, No or ?=I don’t know."""

\AlphAlph.4 $P_{selfquestion}$
Extract attributes and generate Self-Questions

    P_ATTRIBUTES_AND_SELF_QUESTIONS = """
    You are an intelligent embodied agent equipped with an RGB sensor, an object detector, and a Visual Question Answering (VQA) model. Your task is to explore an indoor environment to find a specific target {target_object}.
    The detector has identified a {target_object}. The VQA model has provided the following description of the scene:


    {distractor_object_description}


    Based on your past interactions with the user, you know the following facts about the target picture: {facts_about_the_target_picture}

    Assume that the detected image description contains hallucinations. Your goal is to verify every attribute of the detected {target_object} description through questions. Formally:
    - Detect possible hallucinations in the VQA model’s description
    - Get more information about the detected object.
    Every question should be in this format: "? You must answer only with Yes, No, or ?=I don’t know." This allows us to assess the likelihood of the answers.


    Ensure your output follows the following format:
    YAML_START # must be present to get the information back
    attributes_of_the_image:
     : "" # summarize all the known attributes from the description, enclosed in " "

    questions_for_detected_object: # question for the detected object, if any
     : "? You must answer only with Yes, No, or ?=I don’t know."
    reasoning_for_detected_object:
     :
    YAML_END # must be present to get the information back

    Provide your reasoning step-by-step, after the YAML_END tag."""

\AlphAlph.5 $P_{refined}$
Refined image description

    P_refined = """
    You are an intelligent embodied agent equipped with an RGB sensor, an object detector, and a Visual Question Answering (VQA) model.
    Your task is to refine an image description based on certainty estimates and user interactions.

    Scenario:
    The detector has identified a scene with a {target_object}. The VQA model provided this initial scene description:


    {distractor_object_description}



    Questions asked and responses:

    {list_questions_answers_uncertainty_labels}


    Task:
    Using the questions/answer pairs with uncertainty labels, refine the image description.
    Since we have to find a {target_object}, put enphasis on it. Do not include in the description information that is labeled as uncertain.

    Ensure your response follows the format below:
    YAML_START # must be present to get the information back
    attributes_of_the_image:
     : "" # summarize all the known attributes from the description, enclosed in " "
    image_description_refined: # Ensure that the string does not contain a newline (\n) after the tag image_description_refined:
    YAML_END # must be present to get the information back

    Provide your reasoning step-by-step, after the YAML_END tag."""

\AlphAlph.6 $P_{score}$
Alignment score

    P_score = """
    You are an intelligent agent equipped with an RGB sensor, object detector, and Visual Question Answering (VQA) model.
    Your goal is to identify a target {target_object} based on a scene description and prior knowledge of the target.

    Scenario:
    The object detector has identified a scene containing a {target_object}, and the VQA model has provided the following description:


    {distractor_object_description}


    Target object information:
    Based on previous interactions, you know the target picture has the following characteristics:

    {facts_about_the_target_picture}


    Task:
    Similarity analysis.
    Analyze how closely the detected scene description aligns with the known facts about the target {target_object}. Provide a similarity score between 0 and 10, where:
    - 0 = The detected {target_object} is not the target object.
    - 10 = The detected {target_object} is definitely the target object.
    - If no information about the target is available, the score should be -1.

    Question Generation:
    - The question is for the target object, not the detected one.
    - Ask exactly one specific, relevant, and human-answerable question related to the target object that maximizes information gain for identifying the target {target_object}.
    - Do not ask speculative or irrelevant questions
    - The question should be grounded in observable or known details from the scene, focusing on key characteristics that can help confirm or refute the identity of the target object.

    Ensure your response follows the format below:
    YAML_START # must be present to get the information back
    similarity_score:
    questions:
     :
    YAML_END # must be present to get the information back

    Provide your reasoning step-by-step for the similarity score and questions, after the YAML_END tag."""

\AlphAlphAlgorithm

We first present the complete AIUTA algorithm in Sec. \AlphAlph.1. As outlined in the main paper (Sec. 4), AIUTA enriches the zero-shot, training-free policy VLFM [53]. Specifically, we detail the input/output structure of AIUTA with respect to the VLFM policy $\pi$, as well as AIUTA’s main components, i.e., the Self Questioner (see Sec. \AlphAlph.2) and the Interaction Trigger (Sec. \AlphAlph.3).
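The prompts listed in the Prompts section ask the LLM to wrap its structured output between the YAML_START and YAML_END tags, with free-form reasoning after the closing tag. The sketch below shows one way such a reply could be post-processed for the alignment-score prompt. It is an assumption-laden illustration rather than the authors' parser: the helper names, the regular expressions, and the sample reply (including the key `q1`) are hypothetical.

```python
import re
from typing import Optional


def extract_block(llm_output: str) -> Optional[str]:
    """Return the text between YAML_START and YAML_END, or None if a tag is missing."""
    match = re.search(r"YAML_START(.*?)YAML_END", llm_output, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def parse_similarity_score(llm_output: str) -> Optional[int]:
    """Read the integer following `similarity_score:` inside the tagged block."""
    block = extract_block(llm_output)
    if block is None:
        return None
    match = re.search(r"^similarity_score:[ \t]*(-?\d+)[ \t]*$", block, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


# Hypothetical reply shaped like the format requested by the alignment-score prompt.
reply = (
    "YAML_START\n"
    "similarity_score: 6\n"
    "questions:\n"
    '  q1: "Is the frame of the picture black?"\n'
    "YAML_END\n"
    "The description matches the known facts only partially."
)
score = parse_similarity_score(reply)
```

Returning `None` when the tags or the score are missing lets the caller decide how to handle a malformed reply, for example by retrying the LLM call.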

\AlphAlph.1AIUTA Algorithm

Algorithm 1 outlines the complete AIUTA pipeline. Upon detecting a candidate object, AIUTA first invokes the Self Questioner module (shown in Sec. \AlphAlph.2) to obtain an accurate and detailed understanding of the observed object and to reduce inaccuracies and hallucinations, obtaining a refined observation description $S_{refined}$. Then, with the known facts about the target instance and the refined description, AIUTA invokes the Interaction Trigger module (Sec. \AlphAlph.3) for up to 4 interaction rounds (i.e., Max_Iteration_Number = 4), as specified in Sec. 6 under Implementation Details. Within each interaction round, if AIUTA returns the STOP action, then the policy $\pi$ terminates the navigation since the target instance is found; otherwise, the policy $\pi$ continues the navigation process.

Algorithm 1 AIUTA

    Require: Target object facts F, Observation O_t, policy π, Candidate Object Detection, Max Iteration number
    ▷ Upon candidate object detection
    S_refined ← Self_Questioner(F, O_t)            ▷ enrich details and reduce inaccuracy, obtain a refined description
    if S_refined = “” then
        π(CONTINUE_EXPLORING)                      ▷ VQA detection failed, signal to policy π to continue exploration
    end if
    for each iteration in Max_Iteration_Number do
        aiuta_action ← Interaction_Trigger(F, S_refined)
        if aiuta_action = STOP then
            π(STOP)                                ▷ Signal to policy π that the object is found! Terminate exploration
        else
            π(CONTINUE_EXPLORING)                  ▷ Signal to policy π to continue exploration
        end if
    end for

\AlphAlph.2Self Questioner

Algorithm 2 Self Questioner Module

    Require: Target object facts F, Uncertainty Threshold τ, Observation O_t, P_init, P_details, P_check, P_selfquestions, P_refined
    Step 1: Detailed Detection Description, from S_init to S_enriched
    Initial scene description: S_init ← VLM(O_t, P_init)
    Self-generate questions to enrich description
    Q^details_{a→a} ← LLM(P_details, S_init, F)
    for each question q_j in Q^details_{a→a} do
        r_{a→a} ← VLM(O_t, q_j)                    ▷ Get answers
        S_init ← concatenate(S_init, r_{a→a})
    end for
    S_enriched ← S_init                            ▷ Updated scene description
    Step 2: Perception Uncertainty Estimation
    (r_check, u_check) ← VLM(O_t, P_check)         ▷ Check detection with uncertainty
    if NOT (r_check = “Yes” AND u_check = “Certain”) then
        return “”                                  ▷ empty string, thus continue exploring
    end if
    Q^attribute_{a→a} ← LLM(P_selfquestions, F, S_enriched)   ▷ Generate self-questions to verify attributes
    Container ← {}                                 ▷ Store question, answer, uncertainty
    for each question q_j in Q^attribute_{a→a} do
        (r_j, u_j) ← VLM(O_t, q_j)                 ▷ Get answers and uncertainties
        Container ← concatenate(Container, {q_j, r_j, u_j})
    end for
    Step 3: Detection Description Refinement
    S_refined ← LLM(P_refined, Container, S_enriched)         ▷ Filter out uncertain attributes
    return S_refined

\AlphAlph.3Interaction Trigger

Algorithm 3 Interaction Trigger

    Require: Target object facts F, Refined observation description S_refined, P_score, τ_stop and τ_skip
    (s, q_{a→u}) ← LLM(P_score, S_refined, F)      ▷ get alignment score s, and question for the human q_{a→u}
    if s ≥ τ_stop then
        return STOP                                ▷ target found, stop navigation.
    else if s < τ_skip then
        return CONTINUE_EXPLORING                  ▷ skip the question and continue exploring
    else
        r_{u→a} ← Ask_Human(q_{a→u})               ▷ posing clarifying question q_{a→u} from the agent to the human.
        F ← Update_Facts(F, r_{u→a})               ▷ update target object facts F
    end if
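The thresholding logic of the Interaction Trigger, together with the bounded number of interaction rounds, can be sketched as follows. This is one possible reading of Algorithms 1 and 3, not the authors' code: the LLM and the human are replaced by caller-supplied stubs (`score_fn`, `ask_human`), the `ASK_HUMAN` label and the fallback to continued exploration after the last round are assumptions, and the threshold values in the usage lines are illustrative only.

```python
from typing import Callable, List, Tuple

STOP = "STOP"
CONTINUE_EXPLORING = "CONTINUE_EXPLORING"
ASK_HUMAN = "ASK_HUMAN"  # hypothetical label for the "else" branch of Algorithm 3


def interaction_trigger(score: float, tau_stop: float, tau_skip: float) -> str:
    """Map an alignment score to an action using the two thresholds."""
    if score >= tau_stop:
        return STOP  # target found, stop navigation
    if score < tau_skip:
        return CONTINUE_EXPLORING  # skip the question and keep exploring
    return ASK_HUMAN  # ambiguous: pose a clarifying question


def run_rounds(
    facts: List[str],
    refined_description: str,
    score_fn: Callable[[str, List[str]], Tuple[float, str]],
    ask_human: Callable[[str], str],
    tau_stop: float,
    tau_skip: float,
    max_rounds: int = 4,
) -> str:
    """Repeat scoring and questioning for at most max_rounds interaction rounds."""
    for _ in range(max_rounds):
        score, question = score_fn(refined_description, facts)
        decision = interaction_trigger(score, tau_stop, tau_skip)
        if decision != ASK_HUMAN:
            return decision
        facts.append(ask_human(question))  # update the known target facts
    return CONTINUE_EXPLORING  # assumed fallback once the rounds are exhausted


# Stubs: the score grows as more facts about the target are collected.
facts: List[str] = []
action = run_rounds(
    facts,
    "a framed picture of a person on a beach",
    score_fn=lambda desc, f: (4 + 2 * len(f), "Is the person shirtless?"),
    ask_human=lambda q: "Yes",
    tau_stop=7,
    tau_skip=3,
)
```

In this toy run the agent asks two questions before the score crosses `tau_stop`, which mirrors the intended trade-off: questions are posed only while the score sits between the two thresholds.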


Figure 2:Graphical depiction of AIUTA: left shows its interaction cycle with the user, and right provides an exploded view of our method. ① The agent receives an initial instruction 𝐼 : “Find a 𝑐

0

Generation of the initial detection description. The agent initially prompts the VLM for an initial description 𝑆 𝑖 ⁢ 𝑛 ⁢ 𝑖 ⁢ 𝑡 of the observation 𝑂 𝑡 by providing the prompt 𝑃 𝑖 ⁢ 𝑛 ⁢ 𝑖 ⁢ 𝑡

“Describe the in the provided image.” Formally, 𝑆 𝑖 ⁢ 𝑛 ⁢ 𝑖 ⁢ 𝑡

{ 𝑞 𝑗 } given 𝑆 𝑖 ⁢ 𝑛 ⁢ 𝑖 ⁢ 𝑡 and 𝐹 𝑡 (the symbol 𝑄 𝑎 → 𝑎 is used to represent the self-dialogue performed by the agent). Formally, 𝑄 𝑎 → 𝑎 𝑑 ⁢ 𝑒 ⁢ 𝑡 ⁢ 𝑎 ⁢ 𝑖 ⁢ 𝑙 ⁢ 𝑠

𝐻 ⁢ ( 𝑝 VLM )

− ∑ 𝑖

The VLM uncertainty 𝑢 is then obtained by normalizing the entropy 𝐻 within the range [ 0 , 1 ] as 𝑢

𝐻 𝐻 max , where 𝐻 𝑚 ⁢ 𝑎 ⁢ 𝑥

𝐶 ⁢ ( 𝑢 , 𝜏 )

To reduce false positives, we use the prompt 𝑃 𝑐 ⁢ ℎ ⁢ 𝑒 ⁢ 𝑐 ⁢ 𝑘

“Does the image contain a ? Answer with Yes, No or ?=I don’t know.” (see Supp. Mat. Sec. \AlphAlph.3). This allows us to confirm the presence of the object, which we formally express as ( 𝑟 𝑐 ⁢ ℎ ⁢ 𝑒 ⁢ 𝑐 ⁢ 𝑘 , 𝑢 𝑐 ⁢ ℎ ⁢ 𝑒 ⁢ 𝑐 ⁢ 𝑘 )

VLM ⁢ ( 𝑂 𝑡 , 𝑃 𝑐 ⁢ ℎ ⁢ 𝑒 ⁢ 𝑐 ⁢ 𝑘 ) . Following Eq. 3, we continue the AIUTA pipeline if response 𝑟 𝑐 ⁢ ℎ ⁢ 𝑒 ⁢ 𝑐 ⁢ 𝑘

“Yes” and uncertainty 𝑢 𝑐 ⁢ ℎ ⁢ 𝑒 ⁢ 𝑐 ⁢ 𝑘

To remove uncertain attributes, we prompt the LLM to extract a set of attributes and values 𝐾 𝑡

{ 𝑞 𝑗 } 𝑗

1 𝐽 to be answered by the agent itself. Formally, we extract attributes list and self-questions in one prompt, 𝑄 𝑎 → 𝑎 𝑎 ⁢ 𝑡 ⁢ 𝑡 ⁢ 𝑟 ⁢ 𝑖 ⁢ 𝑏 ⁢ 𝑢 ⁢ 𝑡 ⁢ 𝑒

Implementation Details. We use [23] (LLaVA 1.6, Mistral 7B) as the VLM and GPT-4o [15] as the LLM. User interaction is limited to a maximum of 4 rounds. We empirically set 𝜏

0.75 (Eq. 3), 𝜏 𝑠 ⁢ 𝑡 ⁢ 𝑜 ⁢ 𝑝

7 and 𝜏 𝑠 ⁢ 𝑘 ⁢ 𝑖 ⁢ 𝑝

| VLM | Model | Selection Function | $\Phi_{c=1}$ |
|---|---|---|---|
| LLaVA | llava-v1.6-mistral-7b-hf | MaxProb | 15.94 |
| | | LP [59] | 14.01 |
| | | Energy Score [26] | 20.45 |
| | | Normalized Entropy (ours) | 21.12 |

Table 5: Results of different selection functions and their corresponding Effective Reliability rate $\Phi_{c=1}$.

1 on 30 alternative thresholds around 𝜏 ∗ , normalizing it between 0 and 1 . As shown in Fig. 3, our technique has a smaller interquartile range and a tighter distribution of Φ 𝑐

Figure 3: 𝜏 sensitivity results. For each method, 30 new 𝜏 values are sampled symmetrically around the optimal threshold 𝜏 ∗ . The 𝑥 -axis shows the set size as a percentage of the original IDKVQA dataset size, while the 𝑦 -axis displays the normalized ER Φ 𝑐
