---
license: apache-2.0
base_model:
- ByteDance-Seed/BAGEL-7B-MoT
pipeline_tag: any-to-any
library_name: bagel-mot
task_categories:
- text-to-image
---

# Unify-Agent

[**Paper**](https://arxiv.org/abs/2603.29620) | [**Code**](https://github.com/shawn0728/Unify-Agent)

This repository contains the official resources for [**Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis**](https://arxiv.org/abs/2603.29620).

## Intro

<div align="center">
<img src="https://github.com/shawn0728/Unify-Agent/blob/main/images/showcase.png?raw=true" alt="Unify-Agent Overview" width="80%">
</div>

We introduce **Unify-Agent**, an end-to-end unified multimodal agent for **world-grounded image synthesis**. Unlike conventional text-to-image models that rely only on frozen parametric knowledge, Unify-Agent can actively **reason, search, and integrate external world knowledge at inference time**, enabling more faithful generation of real people, cultural symbols, rare IPs, historical scenes, scientific concepts, and other long-tail entities.

Unify-Agent unifies four core capabilities within a single model (see the control-flow sketch after this list):

- **THINK**: understand the prompt and identify missing knowledge
- **RESEARCH**: retrieve relevant textual and visual evidence
- **RECAPTION**: convert retrieved evidence into grounded generation guidance
- **GENERATE**: synthesize the final image
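
These four stages form a single inference-time loop. The Python sketch below shows only that control flow; every callable it receives (`detect_gaps`, `search_text`, `search_images`, `recaption`, `generate`) is a hypothetical placeholder standing in for part of the unified model, not the released API.

```python
# Control-flow sketch of THINK -> RESEARCH -> RECAPTION -> GENERATE.
# The stage implementations are injected as callables because they are
# hypothetical placeholders, not the released Unify-Agent API.

def unify_agent(prompt, detect_gaps, search_text, search_images,
                recaption, generate):
    # THINK: parse the prompt and spot missing, visually critical knowledge.
    gaps = detect_gaps(prompt)

    # RESEARCH: gather textual and visual evidence for each knowledge gap.
    evidence = []
    for gap in gaps:
        evidence.extend(search_text(gap))    # e.g. encyclopedic descriptions
        evidence.extend(search_images(gap))  # e.g. reference images

    # RECAPTION: convert the evidence into grounded generation guidance.
    grounded_caption = recaption(prompt, evidence)

    # GENERATE: synthesize the final image from the grounded caption.
    return generate(grounded_caption)
```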

To train this agent, we construct a tailored multimodal data pipeline and curate **143K high-quality agent trajectories** for world-grounded image synthesis.

We further introduce **FactIP**, a new benchmark for factual and knowledge-intensive image generation, covering **12 categories** of culturally significant and long-tail concepts that explicitly require external knowledge grounding.

As an early exploration of agent-based modeling for image generation, Unify-Agent highlights the value of tightly coupling **reasoning, searching, and generation** for reliable open-world visual synthesis.

## FactIP Benchmark

Our **FactIP** benchmark is designed to evaluate search-grounded and knowledge-intensive image generation in real-world settings.

<div align="center">
<img src="https://github.com/shawn0728/Unify-Agent/blob/main/images/construction.png?raw=true" alt="FactIP Benchmark Categories" width="80%">
</div>

FactIP contains **three major groups** (**Character**, **Scene**, and **Object**) and **12 fine-grained subcategories**, covering diverse factual generation scenarios such as celebrities, animated characters, landmarks, cultural relics, food, toys, and mythology.

The full benchmark contains **2,462 prompts**, and we also provide a mini test subset with category proportions aligned to the full benchmark.
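
To illustrate what "category proportions aligned to the full benchmark" means, the sketch below draws a proportional mini subset from a list of (subcategory, prompt) pairs. That data layout is an assumption for illustration; the released benchmark files define their own format.

```python
import random
from collections import defaultdict

def sample_mini_subset(prompts, fraction=0.1, seed=0):
    """Draw a subset whose per-category share matches the full set.

    `prompts` is assumed to be a list of (subcategory, prompt) pairs;
    this layout is hypothetical, not the released FactIP format.
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for category, prompt in prompts:
        by_category[category].append(prompt)

    mini = []
    for category, items in by_category.items():
        k = max(1, round(len(items) * fraction))  # keep proportions aligned
        mini.extend((category, p) for p in rng.sample(items, k))
    return mini
```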

## Performance

Unify-Agent substantially improves factual visual synthesis over its base unified model and strong open-source baselines across **FactIP**, **WiSE**, **KiTTEN**, and **T2I-FactualBench**.

<div align="center">
<img src="https://github.com/shawn0728/Unify-Agent/blob/main/images/comparison.png?raw=true" alt="Performance Comparison" width="85%">
</div>

Our method produces images that better preserve:

- **subject identity**
- **fine-grained visual attributes**
- **prompt-specific details**
- **real-world factual grounding**

while maintaining strong visual quality and broad stylistic versatility.

## Pipeline

<div align="center">
<img src="https://github.com/shawn0728/Unify-Agent/blob/main/images/method.png?raw=true" alt="Unify-Agent Pipeline" width="85%">
</div>

Given an input prompt, Unify-Agent first performs **prompt understanding** and **cognitive gap detection** to identify missing but visually critical attributes. It then acquires complementary evidence through both **textual evidence search** and **visual evidence search**.

Based on the collected evidence, the model grounds the generation process with:

- **identity-preserving constraints** for character-specific visual traits
- **scene-compositional constraints** for pose, environment, clothing, and mood

These grounded constraints are then integrated into an **evidence-grounded recaptioning** module, which produces a detailed caption for the downstream image generator.
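
To make that last step concrete, here is a minimal, runnable sketch of how the two constraint sets could be merged into one detailed caption. The function signature and caption format are assumptions for illustration; the actual recaptioning module is part of the unified model, not a standalone helper.

```python
# Hypothetical sketch of evidence-grounded recaptioning: merge the prompt
# with identity and scene constraints into one detailed caption for the
# downstream image generator. Format and names are illustrative only.

def recaption(prompt, identity_constraints, scene_constraints):
    parts = [prompt]
    if identity_constraints:  # character-specific visual traits
        parts.append("Identity: " + "; ".join(identity_constraints))
    if scene_constraints:     # pose, environment, clothing, mood
        parts.append("Scene: " + "; ".join(scene_constraints))
    return ". ".join(parts)

caption = recaption(
    "A bronze statue of a mythological guardian in a temple courtyard",
    ["lion-like head", "scaled torso"],
    ["dawn lighting", "stone pavement", "solemn mood"],
)
print(caption)
```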

## Release Status

The repository is now available, and the **code, benchmark, and checkpoints** are being prepared for full release.

Please stay tuned for upcoming updates.
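
Until then, the files already published in this repository can be fetched with the standard `huggingface_hub` client. The snippet below only assumes that future checkpoints will land under the same repo id.

```python
from huggingface_hub import snapshot_download

# Download everything currently in the repo (and, once released, the
# checkpoints) to a local cache directory and print its path.
local_dir = snapshot_download(repo_id="csfufu/Unify-Agent")
print(local_dir)
```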

## Citation

If you find this work helpful, please consider citing:

```bibtex
@article{chen2026unify,
  title={Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis},
  author={Chen, Shuang and Shou, Quanxin and Chen, Hangting and Zhou, Yucheng and Feng, Kaituo and Hu, Wenbo and Zhang, Yi-Fan and Lin, Yunlong and Huang, Wenxuan and Song, Mingyang and others},
  journal={arXiv preprint arXiv:2603.29620},
  year={2026}
}
```