arxiv:2602.18940

DREAM: Deep Research Evaluation with Agentic Metrics

Published on Feb 21

Submitted by Roy Ganz on Feb 25

Amazon Web Services

Abstract

Deep Research Agents generate analyst-grade reports, yet evaluating them remains challenging due to the absence of a single ground truth and the multidimensional nature of research quality. Recent benchmarks propose distinct methodologies, yet they suffer from the Mirage of Synthesis, where strong surface-level fluency and citation alignment can obscure underlying factual and reasoning defects. We characterize this gap by introducing a taxonomy across four verticals that exposes a critical capability mismatch: static evaluators inherently lack the tool-use capabilities required to assess temporal validity and factual correctness. To address this, we propose DREAM (Deep Research Evaluation with Agentic Metrics), a framework that instantiates the principle of capability parity by making evaluation itself agentic. DREAM structures assessment through an evaluation protocol combining query-agnostic metrics with adaptive metrics generated by a tool-calling agent, enabling temporally aware coverage, grounded verification, and systematic reasoning probes. Controlled evaluations demonstrate DREAM is significantly more sensitive to factual and temporal decay than existing benchmarks, offering a scalable, reference-free evaluation paradigm.

Community

proy

Paper submitter about 4 hours ago

Evaluating Deep Research Agents (DRAs) with static LLM judges often leads to the "Mirage of Synthesis," where fluent writing and accurate-looking citations mask underlying factual errors and flawed logic. The authors attribute this to a capability mismatch: static evaluators lack the active retrieval tools and temporal awareness needed to independently verify the claims they assess. To solve this, the paper introduces DREAM (Deep Research Evaluation with Agentic Metrics), a framework that establishes "capability parity" by making the evaluation process itself agentic. DREAM operates in two phases: first, a tool-equipped agent creates a custom evaluation protocol combining static standards with query-specific adaptive metrics (like Key-Information Coverage and Reasoning Quality), and second, it routes these metrics to the appropriate LLM, workflow, or agent evaluators for execution. Controlled experiments demonstrate that by actively cross-referencing external evidence, DREAM is significantly more sensitive to temporal degradation and well-cited falsehoods than existing benchmarks.
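
To make the two-phase flow described above concrete, here is a minimal Python sketch. It is not the authors' implementation. The names `Metric`, `build_protocol`, `run_protocol`, and the three evaluator functions are hypothetical, the "Writing Quality" static metric is a placeholder (the static metrics are not listed here), and the assignment of each adaptive metric to a workflow or agent evaluator is an assumption. The evaluators return dummy scores so that only the routing logic is shown.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Metric:
    name: str
    kind: str       # "static" (query-agnostic) or "adaptive" (query-specific)
    evaluator: str  # "llm", "workflow", or "agent"


# Hypothetical query-agnostic metric; a placeholder, not taken from the paper.
STATIC_METRICS = [Metric("Writing Quality", "static", "llm")]


def build_protocol(query: str) -> List[Metric]:
    """Phase 1 stand-in. A real system would have a tool-calling agent
    research `query` before proposing adaptive metrics; this stub ignores it."""
    adaptive = [
        # Metric names come from the summary; evaluator assignments are assumed.
        Metric("Key-Information Coverage", "adaptive", "workflow"),
        Metric("Reasoning Quality", "adaptive", "agent"),
    ]
    return STATIC_METRICS + adaptive


# Placeholder evaluators: each returns a fixed dummy score, not a real judgment.
def llm_judge(metric: Metric, report: str) -> float:
    return 0.0


def workflow_check(metric: Metric, report: str) -> float:
    return 0.0


def agent_verify(metric: Metric, report: str) -> float:
    return 0.0


EVALUATORS: Dict[str, Callable[[Metric, str], float]] = {
    "llm": llm_judge,
    "workflow": workflow_check,
    "agent": agent_verify,
}


def run_protocol(protocol: List[Metric], report: str) -> Dict[str, Tuple[str, float]]:
    """Phase 2: route each metric to the evaluator type it names.
    Returns metric name -> (evaluator type used, score)."""
    results: Dict[str, Tuple[str, float]] = {}
    for metric in protocol:
        evaluate = EVALUATORS[metric.evaluator]  # KeyError on an unknown type
        results[metric.name] = (metric.evaluator, evaluate(metric, report))
    return results


if __name__ == "__main__":
    protocol = build_protocol("example research query")
    for name, (evaluator, score) in run_protocol(protocol, "example report").items():
        print(f"{name}: routed to {evaluator}, score={score}")
```

Keeping the protocol as plain data (a list of `Metric` records) separates the decision of what to measure from how each metric is executed, which mirrors the split between the two phases.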
