---
license: apache-2.0
base_model:
- ByteDance-Seed/BAGEL-7B-MoT
pipeline_tag: any-to-any
library_name: bagel-mot
task_categories:
- text-to-image
---
# Unify-Agent

[**Paper**](https://arxiv.org/abs/2603.29620) | [**Code**](https://github.com/shawn0728/Unify-Agent)

This repository contains the official resources for [**Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis**](https://arxiv.org/abs/2603.29620).

# 👀 Intro

<div align="center">
  <img src="https://github.com/shawn0728/Unify-Agent/blob/main/images/showcase.png?raw=true" alt="Unify-Agent Overview" width="80%">
</div>

We introduce **Unify-Agent**, an end-to-end unified multimodal agent for **world-grounded image synthesis**. Unlike conventional text-to-image models that rely only on frozen parametric knowledge, Unify-Agent can actively **reason, search, and integrate external world knowledge at inference time**, enabling more faithful generation of real people, cultural symbols, rare IPs, historical scenes, scientific concepts, and other long-tail entities.

Unify-Agent unifies four core capabilities within a single model:

- **THINK**: understand the prompt and identify missing knowledge  
- **RESEARCH**: retrieve relevant textual and visual evidence  
- **RECAPTION**: convert retrieved evidence into grounded generation guidance  
- **GENERATE**: synthesize the final image  

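The four capabilities compose into a single agent loop. The sketch below illustrates that control flow only; every function in it (`think`, `research`, `recaption`, `generate`) is a placeholder stub with an assumed name, not the released implementation, which performs these steps with learned multimodal reasoning and live search.

```python
from dataclasses import dataclass

# NOTE: all names below are illustrative placeholders; the released
# Unify-Agent code may expose a different interface.

@dataclass
class Evidence:
    text: list[str]    # retrieved textual snippets
    images: list[str]  # paths/URLs of retrieved reference images

def think(prompt: str) -> list[str]:
    """THINK: identify entities whose appearance the model may not know.
    Stub heuristic: treat capitalized tokens as candidate entities."""
    return [w for w in prompt.split() if w[:1].isupper()]

def research(entities: list[str]) -> Evidence:
    """RESEARCH: retrieve textual and visual evidence (stubbed)."""
    return Evidence(text=[f"facts about {e}" for e in entities], images=[])

def recaption(prompt: str, evidence: Evidence) -> str:
    """RECAPTION: fold retrieved evidence into a grounded caption."""
    return prompt + " | grounded details: " + "; ".join(evidence.text)

def generate(caption: str) -> str:
    """GENERATE: call the underlying image generator (stubbed)."""
    return f"<image synthesized from: {caption}>"

def unify_agent(prompt: str) -> str:
    entities = think(prompt)
    evidence = research(entities)
    caption = recaption(prompt, evidence)
    return generate(caption)

print(unify_agent("A portrait of Hua Mulan in Tang-dynasty armor"))
```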
To train this agent, we construct a tailored multimodal data pipeline and curate **143K high-quality agent trajectories** for world-grounded image synthesis.

We further introduce **FactIP**, a new benchmark for factual and knowledge-intensive image generation, covering **12 categories** of culturally significant and long-tail concepts that explicitly require external knowledge grounding.

As an early exploration of agent-based modeling for image generation, Unify-Agent highlights the value of tightly coupling **reasoning, searching, and generation** for reliable open-world visual synthesis.

## πŸ” FactIP Benchmark

Our **FactIP** benchmark is designed to evaluate search-grounded and knowledge-intensive image generation in real-world settings.

<div align="center">
  <img src="https://github.com/shawn0728/Unify-Agent/blob/main/images/construction.png?raw=true" alt="FactIP Benchmark Categories" width="80%">
</div>

FactIP contains **three major groups** (**Character**, **Scene**, and **Object**) and **12 fine-grained subcategories**, covering diverse factual generation scenarios such as celebrities, animated characters, landmarks, cultural relics, food, toys, and mythology.

The full benchmark contains **2,462 prompts**, and we also provide a mini test subset with category proportions aligned to the full benchmark.
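A proportion-aligned mini subset of this kind can be built by stratified sampling per subcategory. The sketch below is a generic illustration of that idea, assuming each prompt record carries a `category` field; it is not the official split procedure.

```python
import random
from collections import defaultdict

def stratified_mini_subset(prompts, fraction=0.1, seed=0):
    """Sample the same fraction from every category so the mini subset
    preserves the full benchmark's category proportions."""
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for p in prompts:
        by_cat[p["category"]].append(p)
    subset = []
    for cat, items in by_cat.items():
        k = max(1, round(len(items) * fraction))  # at least one per category
        subset.extend(rng.sample(items, k))
    return subset

# Toy example with three hypothetical subcategories of 20 prompts each.
full = [{"id": i, "category": c}
        for c in ("celebrity", "landmark", "food")
        for i in range(20)]
mini = stratified_mini_subset(full, fraction=0.2)
print(len(mini))  # 12: 4 per category
```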

## πŸ† Performance

Unify-Agent substantially improves factual visual synthesis over its base unified model and strong open-source baselines across **FactIP**, **WiSE**, **KiTTEN**, and **T2I-FactualBench**.

<div align="center">
  <img src="https://github.com/shawn0728/Unify-Agent/blob/main/images/comparison.png?raw=true" alt="Performance Comparison" width="85%">
</div>

Our method produces images that better preserve:

- **subject identity**
- **fine-grained visual attributes**
- **prompt-specific details**
- **real-world factual grounding**

while maintaining strong visual quality and broad stylistic versatility.

## 🧠 Pipeline

<div align="center">
  <img src="https://github.com/shawn0728/Unify-Agent/blob/main/images/method.png?raw=true" alt="Unify-Agent Pipeline" width="85%">
</div>

Given an input prompt, Unify-Agent first performs **prompt understanding** and **cognitive gap detection** to identify missing but visually critical attributes. It then acquires complementary evidence through both **textual evidence search** and **visual evidence search**.

Based on the collected evidence, the model grounds the generation process with:

- **identity-preserving constraints** for character-specific visual traits  
- **scene-compositional constraints** for pose, environment, clothing, and mood  

These grounded constraints are then integrated into an **evidence-grounded recaptioning** module, which produces a detailed caption for the downstream image generator.
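The constraint-to-caption step can be pictured as merging two structured records into one grounded caption. This is a minimal sketch of that fusion; the field names and the join format are assumptions, not the released schema.

```python
from dataclasses import dataclass, field

@dataclass
class IdentityConstraints:
    # Character-specific visual traits distilled from retrieved evidence.
    subject: str
    traits: list[str] = field(default_factory=list)

@dataclass
class SceneConstraints:
    # Compositional attributes: pose, environment, clothing, mood.
    pose: str = ""
    environment: str = ""
    clothing: str = ""
    mood: str = ""

def evidence_grounded_recaption(prompt, identity, scene):
    """Fuse both constraint sets into one detailed caption for the generator."""
    parts = [prompt, f"{identity.subject}: " + ", ".join(identity.traits)]
    for name, value in vars(scene).items():
        if value:  # skip attributes the evidence did not constrain
            parts.append(f"{name}: {value}")
    return " | ".join(parts)

caption = evidence_grounded_recaption(
    "A heroic portrait",
    IdentityConstraints("the subject", ["braided hair", "jade pendant"]),
    SceneConstraints(pose="standing", environment="mountain pass",
                     mood="resolute"),
)
print(caption)
```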

## 📦 Release Status

The repository is now available, and the **code, benchmark, and checkpoints** are being prepared for full release.

Please stay tuned for upcoming updates.

## Citation

If you find this work helpful, please consider citing:

```bibtex
@article{chen2026unify,
  title={Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis},
  author={Chen, Shuang and Shou, Quanxin and Chen, Hangting and Zhou, Yucheng and Feng, Kaituo and Hu, Wenbo and Zhang, Yi-Fan and Lin, Yunlong and Huang, Wenxuan and Song, Mingyang and others},
  journal={arXiv preprint arXiv:2603.29620},
  year={2026}
}
```