csfufu committed
Commit 2b109ee · verified · 1 Parent(s): afef90d

Update README.md

Files changed (1)
  1. README.md +62 -79
README.md CHANGED
@@ -1,118 +1,101 @@
1
  ---
2
  license: apache-2.0
3
  base_model:
4
- - Qwen/Qwen2.5-7B-Instruct
5
  pipeline_tag: any-to-any
6
  library_name: bagel-mot
7
  ---
8
9
 
10
- <p align="left">
11
- <img src="https://lf3-static.bytednsdoc.com/obj/eden-cn/nuhojubrps/banner.png" alt="BAGEL" width="480"/>
12
- </p>
13
 
 
14
 
15
- # 🥯 BAGEL • Unified Model for Multimodal Understanding and Generation
16
 
 
 
 
 
17
 
 
18
 
19
- <p align="left">
20
- <a href="https://bagel-ai.org/">
21
- <img
22
- src="https://img.shields.io/badge/BAGEL-Website-0A66C2?logo=safari&logoColor=white" style="display: inline-block; vertical-align: middle;"
23
- alt="BAGEL Website"
24
- />
25
- </a>
26
- <a href="https://arxiv.org/abs/2505.14683">
27
- <img
28
- src="https://img.shields.io/badge/BAGEL-Paper-red?logo=arxiv&logoColor=red" style="display: inline-block; vertical-align: middle;"
29
- alt="BAGEL Paper on arXiv"
30
- />
31
- </a>
32
- <a href="https://github.com/bytedance-seed/BAGEL" target="_blank" style="margin: 2px;">
33
- <img
34
- src="https://img.shields.io/badge/BAGEL-Codebase-536af5?color=536af5&logo=github" style="display: inline-block; vertical-align: middle;"
35
- alt="BAGEL Codebase"
36
- />
37
- </a>
38
- <a href="https://demo.bagel-ai.org/">
39
- <img
40
- src="https://img.shields.io/badge/BAGEL-Demo-blue?logo=googleplay&logoColor=white" style="display: inline-block; vertical-align: middle;"
41
- alt="BAGEL Demo"
42
- />
43
- </a>
44
- <a href="https://discord.com/invite/Z836xxzy">
45
- <img
46
- src="https://img.shields.io/badge/BAGEL-Discord-green?logo=discord&logoColor=white" style="display: inline-block; vertical-align: middle;"
47
- alt="BAGEL Discord"
48
- />
49
- </a>
50
 
51
-
52
- </p>
53
 
 
54
 
55
- > We present **BAGEL**, an open-source multimodal foundation model with 7B active parameters (14B total) trained on large-scale interleaved multimodal data. BAGEL outperforms current top-tier open-source VLMs such as Qwen2.5-VL and InternVL-2.5 on standard multimodal understanding leaderboards, and delivers text-to-image quality that is competitive with strong specialist generators such as SD3.
56
- Moreover, BAGEL shows better qualitative results in classical image-editing scenarios than the leading open-source models. More importantly, it extends to free-form visual manipulation, multiview synthesis, and world navigation, capabilities that constitute "world-modeling" tasks beyond the scope of previous image-editing models.
57
 
 
 
 
58
 
59
- This repository hosts the model weights for **BAGEL**. For installation, usage instructions, and further documentation, please visit our [GitHub repository](https://github.com/bytedance-seed/BAGEL).
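For quick reference, the weights can be pulled with the standard Hugging Face Hub client. This is a minimal sketch only: the repo id mirrors this card's metadata, the local directory is an arbitrary choice, and actual inference code is documented in the GitHub repository.

```python
# Minimal download sketch using the Hugging Face Hub client.
# repo_id follows this card's metadata; local_dir is an arbitrary choice.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="ByteDance-Seed/BAGEL-7B-MoT",
    local_dir="BAGEL-7B-MoT",
)
```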
60
 
 
61
 
 
62
 
63
- <p align="left"><img src="https://github.com/ByteDance-Seed/Bagel/raw/main/assets/teaser.webp" width="80%"></p>
64
 
 
 
 
65
 
 
66
 
 
 
 
 
67
 
 
68
 
 
69
 
70
- ## 🧠 Method
71
- BAGEL adopts a Mixture-of-Transformer-Experts (MoT) architecture to maximize the model's capacity to learn from richly diverse multimodal information. Following the same principle of capacity maximization, it utilizes two separate encoders to capture pixel-level and semantic-level features of an image. The overall framework follows a Next Group of Token Prediction paradigm, where the model is trained to predict the next group of language or visual tokens as a compression target.
 
72
 
73
- BAGEL scales MoT's capacity through Pre-training, Continued Training, and Supervised Finetuning on trillions of interleaved multimodal tokens spanning language, image, video, and web data. It surpasses open models on standard understanding and generation benchmarks and demonstrates advanced in-context multimodal abilities like free-form image editing, future frame prediction, 3D manipulation, world navigation, and sequential reasoning.
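As a rough illustration of the MoT idea described above (not the released implementation; the two-expert split and hard per-token routing below are simplifying assumptions), a block can be pictured as shared self-attention over the interleaved sequence followed by per-modality expert FFNs:

```python
# Conceptual sketch of a modality-routed Mixture-of-Transformer-Experts block.
# Illustration only: expert count, hard routing, and layer layout are assumptions,
# not the released BAGEL code.
import torch
import torch.nn as nn

class MoTBlock(nn.Module):
    def __init__(self, dim: int = 64, n_heads: int = 4):
        super().__init__()
        # self-attention is shared across the whole interleaved sequence
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # separate expert FFNs, e.g. one for understanding-side and one for generation-side tokens
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(2)
        ])

    def forward(self, x: torch.Tensor, expert_id: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq, dim]; expert_id: [batch, seq] with values in {0, 1}
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        h = self.norm2(x)
        out = torch.zeros_like(h)
        for i, expert in enumerate(self.experts):
            mask = expert_id == i          # hard routing by token type
            if mask.any():
                out[mask] = expert(h[mask])
        return x + out

# toy usage: 6 text tokens routed to expert 0, 4 visual tokens to expert 1
tokens = torch.randn(1, 10, 64)
routing = torch.tensor([[0] * 6 + [1] * 4])
print(MoTBlock()(tokens, routing).shape)  # torch.Size([1, 10, 64])
```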
74
 
75
- <p align="left"><img src="https://github.com/ByteDance-Seed/Bagel/raw/main/assets/arch.png" width="50%"></p>
76
 
 
 
77
 
78
- ## 🌱 Emerging Properties
79
- <p align="left"><img src="https://github.com/ByteDance-Seed/Bagel/raw/main/assets/emerging_curves.png" width="50%"></p>
80
 
81
- As we scale up BAGEL's pretraining with more multimodal tokens, we observe consistent performance gains across understanding, generation, and editing tasks. Different capabilities emerge at distinct training stages: multimodal understanding and generation appear early, followed by basic editing, while complex, intelligent editing emerges later. This staged progression suggests an emergent pattern, where advanced multimodal reasoning builds on well-formed foundational skills. Ablation studies further show that combining VAE and ViT features significantly improves intelligent editing, underscoring the importance of visual-semantic context for complex multimodal reasoning and the emergence of advanced capabilities.
82
 
 
83
 
 
84
 
85
- ## 📊 Benchmarks
86
- ### 1. Visual Understanding
87
- | Model | MME ↑ | MMBench ↑ | MMMU ↑ | MM-Vet ↑ | MathVista ↑ |
88
- | ------------------- | ----------: | ----------: | -------: | -------: | ----------: |
89
- | Janus-Pro-7B | – | 79.2 | 41.0 | 50.0 | – |
90
- | Qwen2.5-VL-7B | 2347 | 83.5 | **58.6** | 67.1 | 68.2 |
91
- | **BAGEL** | **2388** | **85.0** | 55.3 | **67.2** | **73.1** |
92
- ### 2. Text-to-Image Generation · GenEval
93
- | Model | Overall ↑ |
94
- | ------------ | --------- |
95
- | FLUX-1-dev | 0.82 |
96
- | SD3-Medium | 0.74 |
97
- | Janus-Pro-7B | 0.80 |
98
- | **BAGEL** | **0.88** |
99
- ### 3. Image Editing
100
- | Model | GEdit-Bench-EN (SC) ↑ | GEdit-Bench-EN (PQ) ↑ | GEdit-Bench-EN (O) ↑ | IntelligentBench ↑ |
101
- | ------------- | --------------------- | --------------------- | ------------------- | ------------------ |
102
- | Step1X-Edit | 7.09 | 6.76 | **6.70** | 14.9 |
103
- | Gemini-2-exp. | 6.73 | 6.61 | 6.32 | **57.6** |
104
- | **BAGEL** | **7.36** | **6.83** | 6.52 | 44.0 |
105
- | **BAGEL+CoT** | – | – | – | 55.3 |
106
 
107
- ## License
108
- BAGEL is licensed under the Apache 2.0 license. It is finetuned from the [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) and [siglip-so400m-14-384-flash-attn2](https://huggingface.co/HuggingFaceM4/siglip-so400m-14-384-flash-attn2) models, and uses the [FLUX.1-schnell VAE](https://huggingface.co/black-forest-labs/FLUX.1-schnell), all under Apache 2.0.
109
 
110
- ## ✍️ Citation
111
  ```bibtex
112
- @article{deng2025bagel,
113
- title = {Emerging Properties in Unified Multimodal Pretraining},
114
- author = {Deng, Chaorui and Zhu, Deyao and Li, Kunchang and Gou, Chenhui and Li, Feng and Wang, Zeyu and Zhong, Shu and Yu, Weihao and Nie, Xiaonan and Song, Ziang and Shi, Guang and Fan, Haoqi},
115
- journal = {arXiv preprint arXiv:2505.14683},
116
- year = {2025}
117
  }
118
- ```
 
1
  ---
2
  license: apache-2.0
3
  base_model:
4
+ - ByteDance-Seed/BAGEL-7B-MoT
5
  pipeline_tag: any-to-any
6
  library_name: bagel-mot
7
  ---
8
 
9
+ ---
10
+ task_categories:
11
+ - text-to-image
12
+ ---
13
+ # Unify-Agent
14
+
15
+ [**Paper**](https://arxiv.org/abs/2603.29620) | [**Code**](https://github.com/shawn0728/Unify-Agent)
16
+
17
+ This repository contains the official resources for [**Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis**](https://arxiv.org/abs/2603.29620).
18
+
19
+ ## 👀 Intro
20
 
21
+ <div align="center">
22
+ <img src="https://github.com/shawn0728/Unify-Agent/blob/main/images/showcase.png?raw=true" alt="Unify-Agent Overview" width="80%">
23
+ </div>
24
 
25
+ We introduce **Unify-Agent**, an end-to-end unified multimodal agent for **world-grounded image synthesis**. Unlike conventional text-to-image models that rely only on frozen parametric knowledge, Unify-Agent can actively **reason, search, and integrate external world knowledge at inference time**, enabling more faithful generation of real people, cultural symbols, rare IPs, historical scenes, scientific concepts, and other long-tail entities.
26
 
27
+ Unify-Agent unifies four core capabilities within a single model:
28
 
29
+ - **THINK**: understand the prompt and identify missing knowledge
30
+ - **RESEARCH**: retrieve relevant textual and visual evidence
31
+ - **RECAPTION**: convert retrieved evidence into grounded generation guidance
32
+ - **GENERATE**: synthesize the final image
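A hypothetical end-to-end sketch of this four-stage loop follows; every name below is a placeholder for illustration, not the released Unify-Agent API.

```python
# Hypothetical sketch of the THINK -> RESEARCH -> RECAPTION -> GENERATE loop.
# All names below are placeholders for illustration, not the released API.
from dataclasses import dataclass, field

@dataclass
class Evidence:
    text_snippets: list[str] = field(default_factory=list)
    reference_images: list[str] = field(default_factory=list)  # paths or URLs

def think(prompt: str) -> list[str]:
    """THINK: flag entities whose appearance the base model may not know."""
    return [w for w in prompt.split() if w.istitle()]  # toy heuristic

def research(gaps: list[str]) -> Evidence:
    """RESEARCH: retrieve textual and visual evidence (stubbed here)."""
    return Evidence(text_snippets=[f"retrieved facts about {g}" for g in gaps])

def recaption(prompt: str, evidence: Evidence) -> str:
    """RECAPTION: turn retrieved evidence into grounded generation guidance."""
    return prompt + " | grounding: " + "; ".join(evidence.text_snippets)

def generate(caption: str, evidence: Evidence) -> str:
    """GENERATE: call the image generator (stubbed as a string here)."""
    return f"<image synthesized from: {caption}>"

def unify_agent(prompt: str) -> str:
    gaps = think(prompt)                   # THINK
    evidence = research(gaps)              # RESEARCH
    caption = recaption(prompt, evidence)  # RECAPTION
    return generate(caption, evidence)     # GENERATE

print(unify_agent("A photo of the Mogao Caves at sunrise"))
```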
33
 
34
+ To train this agent, we construct a tailored multimodal data pipeline and curate **143K high-quality agent trajectories** for world-grounded image synthesis.
35
 
36
+ We further introduce **FactIP**, a new benchmark for factual and knowledge-intensive image generation, covering **12 categories** of culturally significant and long-tail concepts that explicitly require external knowledge grounding.
37
 
38
+ As an early exploration of agent-based modeling for image generation, Unify-Agent highlights the value of tightly coupling **reasoning, searching, and generation** for reliable open-world visual synthesis.
 
39
 
40
+ ## 🔍 FactIP Benchmark
41
 
42
+ Our **FactIP** benchmark is designed to evaluate search-grounded and knowledge-intensive image generation in real-world settings.
 
43
 
44
+ <div align="center">
45
+ <img src="https://github.com/shawn0728/Unify-Agent/blob/main/images/construction.png?raw=true" alt="FactIP Benchmark Categories" width="80%">
46
+ </div>
47
 
48
+ FactIP contains **three major groups** (**Character**, **Scene**, and **Object**) and **12 fine-grained subcategories**, covering diverse factual generation scenarios such as celebrities, animated characters, landmarks, cultural relics, food, toys, and mythology.
49
 
50
+ The full benchmark contains **2,462 prompts**, and we also provide a mini test subset with category proportions aligned to the full benchmark.
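As an illustration only, such a category-aligned mini subset can be drawn with proportional stratified sampling; the per-record layout assumed below is hypothetical, not the released FactIP format.

```python
# Illustrative sketch: draw a mini subset whose category proportions roughly
# match the full benchmark. The {"prompt": ..., "category": ...} record layout
# is an assumption for this example, not the released FactIP format.
import random
from collections import defaultdict

def stratified_mini_subset(records: list[dict], target_size: int, seed: int = 0) -> list[dict]:
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for rec in records:
        by_category[rec["category"]].append(rec)
    total = len(records)
    subset = []
    for items in by_category.values():
        # keep each category's share roughly proportional to the full set
        k = max(1, round(target_size * len(items) / total))
        subset.extend(rng.sample(items, min(k, len(items))))
    return subset

# toy usage with three of the twelve subcategories
toy = [{"prompt": f"p{i}", "category": c}
       for i, c in enumerate(["celebrity"] * 6 + ["landmark"] * 3 + ["mythology"] * 1)]
print(len(stratified_mini_subset(toy, target_size=5)))  # roughly 5; exact count varies with rounding
```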
51
 
52
+ ## 🏆 Performance
53
 
54
+ Unify-Agent substantially improves factual visual synthesis over its base unified model and strong open-source baselines across **FactIP**, **WiSE**, **KiTTEN**, and **T2I-FactualBench**.
55
 
56
+ <div align="center">
57
+ <img src="https://github.com/shawn0728/Unify-Agent/blob/main/images/comparison.png?raw=true" alt="Performance Comparison" width="85%">
58
+ </div>
59
 
60
+ Our method produces images that better preserve:
61
 
62
+ - **subject identity**
63
+ - **fine-grained visual attributes**
64
+ - **prompt-specific details**
65
+ - **real-world factual grounding**
66
 
67
+ while maintaining strong visual quality and broad stylistic versatility.
68
 
69
+ ## 🧠 Pipeline
70
 
71
+ <div align="center">
72
+ <img src="https://github.com/shawn0728/Unify-Agent/blob/main/images/method.png?raw=true" alt="Unify-Agent Pipeline" width="85%">
73
+ </div>
74
 
75
+ Given an input prompt, Unify-Agent first performs **prompt understanding** and **cognitive gap detection** to identify missing but visually critical attributes. It then acquires complementary evidence through both **textual evidence search** and **visual evidence search**.
76
 
77
+ Based on the collected evidence, the model grounds the generation process with:
78
 
79
+ - **identity-preserving constraints** for character-specific visual traits
80
+ - **scene-compositional constraints** for pose, environment, clothing, and mood
81
 
82
+ These grounded constraints are then integrated into an **evidence-grounded recaptioning** module, which produces a detailed caption for the downstream image generator.
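A hypothetical sketch of how such constraints could be laid out and folded into the recaptioned prompt; the field names and the caption template are assumptions for illustration, not the released Unify-Agent interface.

```python
# Hypothetical layout for the grounded constraints described above.
# Field names and the caption template are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class IdentityConstraints:
    subject: str       # who/what must be preserved
    traits: list[str]  # character-specific visual traits

@dataclass
class SceneConstraints:
    pose: str
    environment: str
    clothing: str
    mood: str

def evidence_grounded_recaption(prompt: str,
                                identity: IdentityConstraints,
                                scene: SceneConstraints) -> str:
    """Fold both constraint sets into one detailed caption for the generator."""
    return (
        f"{prompt}. {identity.subject}, {', '.join(identity.traits)}; "
        f"{scene.pose} in {scene.environment}, wearing {scene.clothing}, "
        f"{scene.mood} mood."
    )

# toy usage
print(evidence_grounded_recaption(
    "A portrait at dusk",
    IdentityConstraints(subject="the retrieved character", traits=["silver hair", "green cloak"]),
    SceneConstraints(pose="standing", environment="a mountain pass", clothing="travel gear", mood="calm"),
))
```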
 
83
 
84
+ ## 📦 Release Status
85
 
86
+ The repository is now available, and the **code, benchmark, and checkpoints** are being prepared for full release.
87
 
88
+ Please stay tuned for upcoming updates.
89
 
90
+ ## Citation
91
 
92
+ If you find this work helpful, please consider citing:
 
93
 
 
94
  ```bibtex
95
+ @article{chen2026unify,
96
+ title={Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis},
97
+ author={Chen, Shuang and Shou, Quanxin and Chen, Hangting and Zhou, Yucheng and Feng, Kaituo and Hu, Wenbo and Zhang, Yi-Fan and Lin, Yunlong and Huang, Wenxuan and Song, Mingyang and others},
98
+ journal={arXiv preprint arXiv:2603.29620},
99
+ year={2026}
100
  }
101
+ ```