arxiv:2605.29861

Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation

Published on May 28

Submitted by Chenghao Zhang on May 29

Renmin University of China


Authors: Chenghao Zhang

AI-generated summary

Ptah is a multi-agent system that generates reliable, visually informative multimodal reports by interleaving textual and visual evidence through specialized agents and verification mechanisms.

Abstract

Large Language Models (LLMs) have advanced autonomous agents from deep search, which retrieves concise factual answers, to deep research, which synthesizes scattered evidence into long-form reports. However, verifiable multimodal deep research remains challenging due to open-ended synthesis without deterministic ground truth and the need to interleave textual arguments with visual evidence. We propose Ptah, a multi-agent harness for interleaved report generation. Ptah orchestrates the lifecycle from user query to rendered web report through planning, research, and writing stages, where specialized agents construct visual-aware plans, collect claim-grounded evidence, maintain source-aligned images in a Visual Working Memory, and compose reports through declarative multimodal tool use. A verifier agent serves as the harness's acceptance function, enforcing factual grounding, citation fidelity, and cross-modal consistency throughout the workflow. We further introduce PtahEval, an evaluation protocol that augments existing benchmarks with image-level and presentation-level assessments. Experiments on deep research benchmarks show that Ptah produces more reliable, visually informative, and usable human-facing multimodal reports than strong baselines.
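To make the staged workflow described above more concrete, here is a minimal Python sketch of a harness in which a verifier acts as the acceptance function for each stage. It is not the authors' implementation. Every name in it (`run_harness`, `Verdict`, `toy_verifier`, the state keys, the retry policy, the example URL) is a hypothetical stand-in, and the toy stages only imitate planning, research, and writing.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Verdict:
    """Hypothetical verifier output: accept or reject, with reasons."""
    accepted: bool
    reasons: List[str] = field(default_factory=list)


State = Dict[str, object]
Stage = Callable[[State], State]
Verifier = Callable[[str, State], Verdict]


def run_harness(query: str, stages: List[Tuple[str, Stage]],
                verifier: Verifier, max_retries: int = 2) -> State:
    """Run stages in order; a stage's output is kept only if the verifier accepts it."""
    state: State = {"query": query, "feedback": []}
    for name, stage in stages:
        for _ in range(max_retries + 1):
            candidate = stage(dict(state))      # stage works on a shallow copy
            verdict = verifier(name, candidate)
            if verdict.accepted:
                candidate["feedback"] = []      # clear feedback once accepted
                state = candidate
                break
            state["feedback"] = verdict.reasons  # pass reasons to the retry
        else:
            raise RuntimeError(f"stage {name!r} rejected: {state['feedback']}")
    return state


# Toy stages (assumptions for illustration only).
def plan(state: State) -> State:
    state["plan"] = ["overview", "key figure"]
    return state


def research(state: State) -> State:
    # The first attempt omits the source on purpose; the retry fixes it.
    source = "https://example.org/a" if state["feedback"] else ""
    state["evidence"] = [{"claim": "c1", "source": source, "image": "img-1"}]
    state["visual_memory"] = {"img-1": source}  # image id -> source it came from
    return state


def write(state: State) -> State:
    state["report"] = [("text", "c1 [1]"), ("image", "img-1")]
    return state


def toy_verifier(stage: str, state: State) -> Verdict:
    reasons: List[str] = []
    if stage == "research":  # every claim needs a source
        reasons += [f"unsourced claim: {e['claim']}"
                    for e in state["evidence"] if not e["source"]]
    if stage == "write":     # every image must exist in the visual memory
        reasons += [f"unknown image: {ref}" for kind, ref in state["report"]
                    if kind == "image" and ref not in state["visual_memory"]]
    return Verdict(accepted=not reasons, reasons=reasons)


result = run_harness(
    "demo query",
    [("plan", plan), ("research", research), ("write", write)],
    toy_verifier,
)
```

In this sketch the research stage is rejected once for an unsourced claim and then accepted on retry, which illustrates the acceptance-function idea: no stage output reaches the next stage until the verifier approves it.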


Community

SnowNation

Paper author and submitter, about 12 hours ago

Interleaved image-text reports are an important format for presenting complex multimodal information, yet generating them in a trustworthy and well-grounded way remains challenging.

In this work, we introduce Ptah, an agentic harness for producing reliable multimodal reports by coordinating textual research, claim-grounded evidence, and source-aligned visual evidence.

Evaluating multimodal reports is also difficult, as factual grounding, citation fidelity, visual relevance, cross-modal consistency, and presentation quality all matter. To address this, we propose PtahEval, an evaluation protocol for assessing multimodal report quality at both the image-content and presentation levels.


