---
license: apache-2.0
base_model:
- ByteDance-Seed/BAGEL-7B-MoT
pipeline_tag: any-to-any
library_name: bagel-mot
task_categories:
- text-to-image
---

# Unify-Agent

[**Paper**](https://arxiv.org/abs/2603.29620) | [**Code**](https://github.com/shawn0728/Unify-Agent)

This repository contains the official resources for [**Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis**](https://arxiv.org/abs/2603.29620).

## Intro

<div align="center">
<img src="https://github.com/shawn0728/Unify-Agent/blob/main/images/showcase.png?raw=true" alt="Unify-Agent Overview" width="80%">
</div>

We introduce **Unify-Agent**, an end-to-end unified multimodal agent for **world-grounded image synthesis**. Unlike conventional text-to-image models that rely only on frozen parametric knowledge, Unify-Agent can actively **reason, search, and integrate external world knowledge at inference time**, enabling more faithful generation of real people, cultural symbols, rare IPs, historical scenes, scientific concepts, and other long-tail entities.

Unify-Agent unifies four core capabilities within a single model (see the control-flow sketch after this list):

- **THINK**: understand the prompt and identify missing knowledge
- **RESEARCH**: retrieve relevant textual and visual evidence
- **RECAPTION**: convert retrieved evidence into grounded generation guidance
- **GENERATE**: synthesize the final image
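
These four stages form a single inference-time loop. The Python sketch below shows only that control flow; every callable it receives (`detect_gaps`, `search_text`, `search_images`, `recaption`, `generate`) is a hypothetical placeholder standing in for part of the unified model, not the released API.

```python
# Control-flow sketch of THINK -> RESEARCH -> RECAPTION -> GENERATE.
# The stage implementations are injected as callables because they are
# hypothetical placeholders, not the released Unify-Agent API.

def unify_agent(prompt, detect_gaps, search_text, search_images,
                recaption, generate):
    # THINK: parse the prompt and spot missing, visually critical knowledge.
    gaps = detect_gaps(prompt)

    # RESEARCH: gather textual and visual evidence for each knowledge gap.
    evidence = []
    for gap in gaps:
        evidence.extend(search_text(gap))    # e.g. encyclopedic descriptions
        evidence.extend(search_images(gap))  # e.g. reference images

    # RECAPTION: convert the evidence into grounded generation guidance.
    grounded_caption = recaption(prompt, evidence)

    # GENERATE: synthesize the final image from the grounded caption.
    return generate(grounded_caption)
```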

To train this agent, we construct a tailored multimodal data pipeline and curate **143K high-quality agent trajectories** for world-grounded image synthesis.

We further introduce **FactIP**, a new benchmark for factual and knowledge-intensive image generation, covering **12 categories** of culturally significant and long-tail concepts that explicitly require external knowledge grounding.

As an early exploration of agent-based modeling for image generation, Unify-Agent highlights the value of tightly coupling **reasoning, searching, and generation** for reliable open-world visual synthesis.

## FactIP Benchmark

Our **FactIP** benchmark is designed to evaluate search-grounded and knowledge-intensive image generation in real-world settings.

<div align="center">
<img src="https://github.com/shawn0728/Unify-Agent/blob/main/images/construction.png?raw=true" alt="FactIP Benchmark Categories" width="80%">
</div>

FactIP contains **three major groups** (**Character**, **Scene**, and **Object**) and **12 fine-grained subcategories**, covering diverse factual generation scenarios such as celebrities, animated characters, landmarks, cultural relics, food, toys, and mythology.

The full benchmark contains **2,462 prompts**, and we also provide a mini test subset with category proportions aligned to the full benchmark.
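
To illustrate what "category proportions aligned to the full benchmark" means, the sketch below draws a proportional mini subset from a list of (subcategory, prompt) pairs. That data layout is an assumption for illustration; the released benchmark files define their own format.

```python
import random
from collections import defaultdict

def sample_mini_subset(prompts, fraction=0.1, seed=0):
    """Draw a subset whose per-category share matches the full set.

    `prompts` is assumed to be a list of (subcategory, prompt) pairs;
    this layout is hypothetical, not the released FactIP format.
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for category, prompt in prompts:
        by_category[category].append(prompt)

    mini = []
    for category, items in by_category.items():
        k = max(1, round(len(items) * fraction))  # keep proportions aligned
        mini.extend((category, p) for p in rng.sample(items, k))
    return mini
```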

## Performance

Unify-Agent substantially improves factual visual synthesis over its base unified model and strong open-source baselines across **FactIP**, **WiSE**, **KiTTEN**, and **T2I-FactualBench**.

<div align="center">
<img src="https://github.com/shawn0728/Unify-Agent/blob/main/images/comparison.png?raw=true" alt="Performance Comparison" width="85%">
</div>

Our method produces images that better preserve:

- **subject identity**
- **fine-grained visual attributes**
- **prompt-specific details**
- **real-world factual grounding**

while maintaining strong visual quality and broad stylistic versatility.

## Pipeline

<div align="center">
<img src="https://github.com/shawn0728/Unify-Agent/blob/main/images/method.png?raw=true" alt="Unify-Agent Pipeline" width="85%">
</div>

Given an input prompt, Unify-Agent first performs **prompt understanding** and **cognitive gap detection** to identify missing but visually critical attributes. It then acquires complementary evidence through both **textual evidence search** and **visual evidence search**.

Based on the collected evidence, the model grounds the generation process with:

- **identity-preserving constraints** for character-specific visual traits
- **scene-compositional constraints** for pose, environment, clothing, and mood

These grounded constraints are then integrated into an **evidence-grounded recaptioning** module, which produces a detailed caption for the downstream image generator.
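
To make that last step concrete, here is a minimal, runnable sketch of how the two constraint sets could be merged into one detailed caption. The function signature and caption format are assumptions for illustration; the actual recaptioning module is part of the unified model, not a standalone helper.

```python
# Hypothetical sketch of evidence-grounded recaptioning: merge the prompt
# with identity and scene constraints into one detailed caption for the
# downstream image generator. Format and names are illustrative only.

def recaption(prompt, identity_constraints, scene_constraints):
    parts = [prompt]
    if identity_constraints:  # character-specific visual traits
        parts.append("Identity: " + "; ".join(identity_constraints))
    if scene_constraints:     # pose, environment, clothing, mood
        parts.append("Scene: " + "; ".join(scene_constraints))
    return ". ".join(parts)

caption = recaption(
    "A bronze statue of a mythological guardian in a temple courtyard",
    ["lion-like head", "scaled torso"],
    ["dawn lighting", "stone pavement", "solemn mood"],
)
print(caption)
```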

## Release Status

The repository is now available, and the **code, benchmark, and checkpoints** are being prepared for full release.

Please stay tuned for upcoming updates.
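
Until then, the files already published in this repository can be fetched with the standard `huggingface_hub` client. The snippet below only assumes that future checkpoints will land under the same repo id.

```python
from huggingface_hub import snapshot_download

# Download everything currently in the repo (and, once released, the
# checkpoints) to a local cache directory and print its path.
local_dir = snapshot_download(repo_id="csfufu/Unify-Agent")
print(local_dir)
```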

## Citation

If you find this work helpful, please consider citing:

```bibtex
@article{chen2026unify,
  title={Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis},
  author={Chen, Shuang and Shou, Quanxin and Chen, Hangting and Zhou, Yucheng and Feng, Kaituo and Hu, Wenbo and Zhang, Yi-Fan and Lin, Yunlong and Huang, Wenxuan and Song, Mingyang and others},
  journal={arXiv preprint arXiv:2603.29620},
  year={2026}
}
```