Update README.md
README.md CHANGED
````diff
@@ -1,118 +1,101 @@
 ---
 license: apache-2.0
 base_model:
-
 pipeline_tag: any-to-any
 library_name: bagel-mot
 ---

-<
-<img src="https://
-</
-
-<a href="https://bagel-ai.org/">
-  <img
-    src="https://img.shields.io/badge/BAGEL-Website-0A66C2?logo=safari&logoColor=white" style="display: inline-block; vertical-align: middle;"
-    alt="BAGEL Website"
-  />
-</a>
-<a href="https://arxiv.org/abs/2505.14683">
-  <img
-    src="https://img.shields.io/badge/BAGEL-Paper-red?logo=arxiv&logoColor=red" style="display: inline-block; vertical-align: middle;"
-    alt="BAGEL Paper on arXiv"
-  />
-</a>
-<a href="https://github.com/bytedance-seed/BAGEL" target="_blank" style="margin: 2px;">
-  <img
-    alt="Github" src="https://img.shields.io/badge/BAGEL-Codebase-536af5?color=536af5&logo=github" style="display: inline-block; vertical-align: middle;"
-    alt="BAGEL Codebase"
-  />
-</a>
-<a href="https://demo.bagel-ai.org/">
-  <img
-    src="https://img.shields.io/badge/BAGEL-Demo-blue?logo=googleplay&logoColor=white" style="display: inline-block; vertical-align: middle;"
-    alt="BAGEL Demo"
-  />
-</a>
-<a href="https://discord.com/invite/Z836xxzy">
-  <img
-    src="https://img.shields.io/badge/BAGEL-Discord-green?logo=discord&logoColor=white" style="display: inline-block; vertical-align: middle;"
-    alt="BAGEL Discord"
-  />
-</a>
-
-</p>

-Moreover, BAGEL demonstrates superior qualitative results in classical image-editing scenarios compared with the leading open-source models. More importantly, it extends to free-form visual manipulation, multiview synthesis, and world navigation, capabilities that constitute "world-modeling" tasks beyond the scope of previous image-editing models.

-<p align="left"><img src="https://github.com/ByteDance-Seed/Bagel/raw/main/assets/emerging_curves.png" width="50%"></p>

-##
-### 1. Visual Understanding
-| Model         | MME ↑    | MMBench ↑ | MMMU ↑   | MM-Vet ↑ | MathVista ↑ |
-| ------------- | -------: | --------: | -------: | -------: | ----------: |
-| Janus-Pro-7B  | -        | 79.2      | 41.0     | 50.0     | -           |
-| Qwen2.5-VL-7B | 2347     | 83.5      | **58.6** | 67.1     | 68.2        |
-| **BAGEL**     | **2388** | **85.0**  | 55.3     | **67.2** | **73.1**    |
-### 2. Text-to-Image Generation · GenEval
-| Model        | Overall ↑ |
-| ------------ | --------- |
-| FLUX-1-dev   | 0.82      |
-| SD3-Medium   | 0.74      |
-| Janus-Pro-7B | 0.80      |
-| **BAGEL**    | **0.88**  |
-### 3. Image Editing
-| Model         | GEdit-Bench-EN (SC) ↑ | GEdit-Bench-EN (PQ) ↑ | GEdit-Bench-EN (O) ↑ | IntelligentBench ↑ |
-| ------------- | --------------------- | --------------------- | -------------------- | ------------------ |
-| Step1X-Edit   | 7.09                  | 6.76                  | **6.70**             | 14.9               |
-| Gemini-2-exp. | 6.73                  | 6.61                  | 6.32                 | **57.6**           |
-| **BAGEL**     | **7.36**              | **6.83**              | 6.52                 | 44.0               |
-| **BAGEL+CoT** | -                     | -                     | -                    | 55.3               |

-BAGEL is licensed under the Apache 2.0 license. It is finetuned from the [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) and [siglip-so400m-14-384-flash-attn2](https://huggingface.co/HuggingFaceM4/siglip-so400m-14-384-flash-attn2) models, and uses the [FLUX.1-schnell VAE](https://huggingface.co/black-forest-labs/FLUX.1-schnell), all under Apache 2.0.

-## Citation
 ```bibtex
-@article{
-  title
-  author
-  journal
-  year
 }
-```
````
---
license: apache-2.0
base_model:
- ByteDance-Seed/BAGEL-7B-MoT
pipeline_tag: any-to-any
library_name: bagel-mot
task_categories:
- text-to-image
---

# Unify-Agent

[**Paper**](https://arxiv.org/abs/2603.29620) | [**Code**](https://github.com/shawn0728/Unify-Agent)

This repository contains the official resources for [**Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis**](https://arxiv.org/abs/2603.29620).

## Intro

<div align="center">
<img src="https://github.com/shawn0728/Unify-Agent/blob/main/images/showcase.png?raw=true" alt="Unify-Agent Overview" width="80%">
</div>

We introduce **Unify-Agent**, an end-to-end unified multimodal agent for **world-grounded image synthesis**. Unlike conventional text-to-image models, which rely only on frozen parametric knowledge, Unify-Agent can actively **reason, search, and integrate external world knowledge at inference time**, enabling more faithful generation of real people, cultural symbols, rare IPs, historical scenes, scientific concepts, and other long-tail entities.

Unify-Agent unifies four core capabilities within a single model:

- **THINK**: understand the prompt and identify missing knowledge
- **RESEARCH**: retrieve relevant textual and visual evidence
- **RECAPTION**: convert retrieved evidence into grounded generation guidance
- **GENERATE**: synthesize the final image
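The four-stage loop above can be pictured as a simple staged pipeline over shared agent state. The sketch below is purely illustrative: the function and field names are hypothetical, since the official implementation has not yet been released.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the THINK -> RESEARCH -> RECAPTION -> GENERATE loop.
# Only the stage names come from the model card; everything else is illustrative.

@dataclass
class AgentState:
    prompt: str
    gaps: list = field(default_factory=list)      # missing knowledge found by THINK
    evidence: list = field(default_factory=list)  # snippets found by RESEARCH
    caption: str = ""                             # grounded caption from RECAPTION

def think(state):
    # Stub heuristic: treat capitalized words as entities needing grounding.
    state.gaps = [w for w in state.prompt.split() if w.istitle()]

def research(state):
    # Stand-in for textual/visual evidence search; a real agent would call tools.
    state.evidence = [f"facts about {g}" for g in state.gaps]

def recaption(state):
    # Fold retrieved evidence into a detailed, grounded generation caption.
    state.caption = state.prompt + " | grounded on: " + "; ".join(state.evidence)

def generate(state):
    # Placeholder for the unified model's image-synthesis step.
    return f"<image for: {state.caption}>"

state = AgentState("Terracotta Army at sunrise")
for stage in (think, research, recaption, generate):
    result = stage(state)
print(result)
```

The point of the structure is that later stages consume what earlier stages wrote into the shared state, which is what lets searching and generation stay tightly coupled.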
To train this agent, we construct a tailored multimodal data pipeline and curate **143K high-quality agent trajectories** for world-grounded image synthesis.

We further introduce **FactIP**, a new benchmark for factual and knowledge-intensive image generation, covering **12 categories** of culturally significant and long-tail concepts that explicitly require external knowledge grounding.
As an early exploration of agent-based modeling for image generation, Unify-Agent highlights the value of tightly coupling **reasoning, searching, and generation** for reliable open-world visual synthesis.

## FactIP Benchmark

Our **FactIP** benchmark is designed to evaluate search-grounded and knowledge-intensive image generation in real-world settings.

<div align="center">
<img src="https://github.com/shawn0728/Unify-Agent/blob/main/images/construction.png?raw=true" alt="FactIP Benchmark Categories" width="80%">
</div>

FactIP contains **three major groups** (**Character**, **Scene**, and **Object**) and **12 fine-grained subcategories**, covering diverse factual generation scenarios such as celebrities, animated characters, landmarks, cultural relics, food, toys, and mythology.

The full benchmark contains **2,462 prompts**, and we also provide a mini test subset with category proportions aligned to the full benchmark.
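A proportion-aligned mini subset of this kind is ordinary stratified sampling. The sketch below illustrates the idea; the category names and counts are invented for the example, and FactIP's real distribution over its 12 subcategories is defined by the benchmark itself.

```python
import random

# Illustrative stratified sampling for a proportion-aligned mini subset.
# Categories and counts below are made up for demonstration.
full = {
    "celebrities": [f"celeb-{i}" for i in range(300)],
    "landmarks":   [f"landmark-{i}" for i in range(200)],
    "food":        [f"food-{i}" for i in range(100)],
}

def mini_subset(full, target_size, seed=0):
    rng = random.Random(seed)
    total = sum(len(v) for v in full.values())
    subset = {}
    for cat, prompts in full.items():
        # Keep each category's share of the mini set equal to
        # its share of the full benchmark.
        k = round(target_size * len(prompts) / total)
        subset[cat] = rng.sample(prompts, k)
    return subset

mini = mini_subset(full, target_size=60)
print({cat: len(v) for cat, v in mini.items()})
# -> {'celebrities': 30, 'landmarks': 20, 'food': 10}
```

Fixing the seed keeps the subset reproducible, which matters when different systems are compared on the same mini split.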
## Performance

Unify-Agent substantially improves factual visual synthesis over its base unified model and strong open-source baselines across **FactIP**, **WiSE**, **KiTTEN**, and **T2I-FactualBench**.

<div align="center">
<img src="https://github.com/shawn0728/Unify-Agent/blob/main/images/comparison.png?raw=true" alt="Performance Comparison" width="85%">
</div>

Our method produces images that better preserve:

- **subject identity**
- **fine-grained visual attributes**
- **prompt-specific details**
- **real-world factual grounding**

while maintaining strong visual quality and broad stylistic versatility.
## Pipeline

<div align="center">
<img src="https://github.com/shawn0728/Unify-Agent/blob/main/images/method.png?raw=true" alt="Unify-Agent Pipeline" width="85%">
</div>

Given an input prompt, Unify-Agent first performs **prompt understanding** and **cognitive gap detection** to identify missing but visually critical attributes. It then acquires complementary evidence through both **textual evidence search** and **visual evidence search**.

Based on the collected evidence, the model grounds the generation process with:

- **identity-preserving constraints** for character-specific visual traits
- **scene-compositional constraints** for pose, environment, clothing, and mood

These grounded constraints are then integrated into an **evidence-grounded recaptioning** module, which produces a detailed caption for the downstream image generator.
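One way to picture how the two constraint families feed the recaptioning step is as structured records merged into a single detailed caption. This is a sketch only: the class and field names are hypothetical, not the released interface.

```python
from dataclasses import dataclass

# Sketch: fold the two constraint families into one evidence-grounded
# recaption for the downstream generator. All names are illustrative.

@dataclass
class IdentityConstraints:   # character-specific visual traits
    subject: str
    traits: tuple

@dataclass
class SceneConstraints:      # pose, environment, clothing, mood
    pose: str
    environment: str
    clothing: str
    mood: str

def recaption(prompt, identity, scene):
    """Merge grounded constraints into a detailed generation caption."""
    parts = [
        prompt,
        f"{identity.subject} with " + ", ".join(identity.traits),
        f"pose: {scene.pose}; environment: {scene.environment}; "
        f"clothing: {scene.clothing}; mood: {scene.mood}",
    ]
    return ". ".join(parts)

caption = recaption(
    "A portrait of Ada Lovelace",
    IdentityConstraints("Ada Lovelace", ("dark ringlets", "19th-century dress")),
    SceneConstraints("seated", "study with notebooks", "velvet gown", "contemplative"),
)
print(caption)
```

Keeping identity and scene constraints as separate records mirrors the split described above: identity fields pin down who is depicted, while scene fields control how the composition is staged.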
## Release Status

The repository is now available, and the **code, benchmark, and checkpoints** are being prepared for full release.

Please stay tuned for upcoming updates.

## Citation
If you find this work helpful, please consider citing:

```bibtex
@article{chen2026unify,
  title={Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis},
  author={Chen, Shuang and Shou, Quanxin and Chen, Hangting and Zhou, Yucheng and Feng, Kaituo and Hu, Wenbo and Zhang, Yi-Fan and Lin, Yunlong and Huang, Wenxuan and Song, Mingyang and others},
  journal={arXiv preprint arXiv:2603.29620},
  year={2026}
}
```