# GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing

<div align="center">
<a href="https://rongyaofang.github.io/"><img src="https://img.shields.io/badge/Project-Homepage-green" alt="Home"></a>
<a href="https://arxiv.org/abs/xxxx"><img src="https://img.shields.io/badge/ArXiv-xxxx-red"></a>
<img src="https://visitor-badge.laobi.icu/badge?page_id=rongyaofang/GoT" alt="visitors">

[Rongyao Fang](https://scholar.google.com/citations?user=FtH3CW4AAAAJ&hl=en)<sup>1\*</sup>, [Chengqi Duan](https://scholar.google.com/citations?user=r9qb4ZwAAAAJ&hl=zh-CN)<sup>2\*</sup>, [Kun Wang]()<sup>3</sup>, [Linjiang Huang](https://leonhlj.github.io/)<sup>6</sup>, [Hao Li](https://scholar.google.com/citations?user=qHqQsY4AAAAJ&hl=zh-CN)<sup>1,4</sup>, [Shilin Yan](https://scholar.google.com/citations?user=2VhjOykAAAAJ&hl=zh-CN), [Hao Tian]()<sup>3</sup>, [Xingyu Zeng]()<sup>3</sup>, [Rui Zhao]()<sup>3</sup>, [Jifeng Dai](https://jifengdai.org/)<sup>4,5</sup>, [Xihui Liu](https://xh-liu.github.io/)<sup>2 :envelope:</sup>, [Hongsheng Li](https://www.ee.cuhk.edu.hk/~hsli/)<sup>1 :envelope:</sup>

<sup>1</sup>CUHK MMLab, <sup>2</sup>HKU MMLab, <sup>3</sup>SenseTime, <sup>4</sup>Shanghai AI Laboratory, <sup>5</sup>Tsinghua University, <sup>6</sup>Beihang University

\*Equal contribution, :envelope: Corresponding authors
</div>

<div align="center" style="line-height: 1.2;">
<a href="https://arxiv.org/abs/xxx" target="_blank"><b>Paper</b></a> •
<a href="#introduction">Introduction</a> •
<a href="#released-datasets">Datasets</a> •
<a href="#model-features">Model</a> •
<a href="#results">Results</a> •
<a href="https://huggingface.co/LucasFang/GoT-6B" target="_blank">🤗 Hugging Face</a> •
<a href="#license">License</a>
</div>

## Introduction

We present **Generation Chain-of-Thought (GoT)**, a novel paradigm that enables image generation and editing through an explicit language reasoning process before any image is produced. This approach transforms conventional text-to-image generation and editing into a reasoning-guided framework that analyzes semantic relationships and spatial arrangements.
GoT pioneers a new direction for reasoning-driven visual generation and editing, producing images that better align with human intent through:

- **Semantic-Spatial Reasoning**: Integrates both semantic understanding and explicit spatial coordinates
- **Unified Framework**: Handles both image generation and editing with the same architecture

## Released Datasets

| Dataset | Link | Amount |
|---------|------|--------|
| **Laion-Aesthetics-High-Resolution-GoT** | [🤗 HuggingFace](https://huggingface.co/datasets/LucasFang/Laion-Aesthetics-High-Resolution-GoT) | 3.77M |
| **JourneyDB-GoT** | [🤗 HuggingFace](https://huggingface.co/datasets/LucasFang/JourneyDB-GoT) | 4.09M |
| **OmniEdit-GoT** | [🤗 HuggingFace](https://huggingface.co/datasets/LucasFang/OmniEdit-GoT) | 736K |
## Dataset Features

### Laion-Aesthetics-High-Resolution-GoT
- 3.77 million high-quality images from Laion-Aesthetics, filtered for resolutions above 512 pixels
- Prompts and GoT descriptions generated by Qwen2-VL
- Prompts averaging 110.81 characters
- GoT descriptions averaging 811.56 characters
- 3.78 bounding boxes per image on average

### JourneyDB-GoT
- 4.09 million high-quality AI-generated images
- Prompts and GoT descriptions generated by Qwen2-VL
- Prompts averaging 149.78 characters
- GoT descriptions averaging 906.01 characters
- 4.09 bounding boxes per image on average
- Please download the images from the [JourneyDB dataset](https://opendatalab.com/OpenDataLab/JourneyDB/tree/main/raw/JourneyDB/train/imgs)

### OmniEdit-GoT
- 736K high-quality image editing samples from OmniEdit
- Diverse editing operations (addition, removal, swap, attribute changes, style transfer)
- Detailed reasoning chains with step-by-step editing processes
- Precise spatial coordinate annotations for editing regions
- Please download the images from the [OmniEdit dataset](https://huggingface.co/datasets/TIGER-Lab/OmniEdit-Filtered-1.2M)
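To preview the GoT annotations before committing to a full download, you can stream them with the `datasets` library. This is a minimal sketch: it assumes only the repo IDs listed above and that a `train` split exists; column names vary per dataset, so it prints the first record's fields instead of hard-coding any.

```python
# Minimal sketch: stream a few GoT annotations without a full download.
# Assumes only the dataset repo IDs listed above and a `train` split;
# check the dataset card for the exact schema.
from datasets import load_dataset

ds = load_dataset("LucasFang/JourneyDB-GoT", split="train", streaming=True)
first = next(iter(ds))
print(list(first.keys()))  # inspect available fields before relying on them
```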
## Model Features

Our GoT framework consists of two key components:

1. **Semantic-Spatial MLLM**: Generates detailed reasoning chains with spatial information, using Qwen2.5-VL as the backbone
2. **SSGM Diffusion Module**: Leverages the semantic guidance, spatial layouts, and reference images to create high-quality visual outputs

The Semantic-Spatial Guidance Module (SSGM) combines three guidance pathways, sketched in the pseudocode after this list:
- **Semantic Guidance**: Captures relationships and attributes
- **Spatial Guidance**: Controls precise object placement
- **Reference Guidance**: Provides context for editing tasks
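Conceptually, inference runs the two components in sequence. The sketch below is illustrative pseudocode only: `GoTChain`, `mllm.reason`, and the `ssgm(...)` call are hypothetical names chosen for exposition, not this repository's actual API.

```python
# Illustrative pseudocode for the two-stage GoT flow; every name here is
# hypothetical and does not correspond to classes shipped in this repo.
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class GoTChain:
    reasoning: str                           # semantic chain-of-thought text
    boxes: list[tuple[int, int, int, int]]   # one bounding box per grounded object

def generate(prompt: str, mllm, ssgm, reference_image=None):
    # Stage 1: the Semantic-Spatial MLLM (Qwen2.5-VL backbone) writes the
    # reasoning chain, including explicit bounding-box coordinates.
    chain: GoTChain = mllm.reason(prompt, image=reference_image)

    # Stage 2: the SSGM diffusion module renders the image, conditioned on
    # all three guidance pathways described above.
    return ssgm(
        semantic=chain.reasoning,    # relationships and attributes
        spatial=chain.boxes,         # precise object placement
        reference=reference_image,   # editing context (None for text-to-image)
    )
```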
## Results

### Text-to-Image Generation

GoT achieves state-of-the-art performance on the GenEval benchmark, particularly excelling at composition tasks:

<div align="center">

| Method | Architecture | Overall | Single Obj. | Two Obj. | Counting | Colors | Position | Attr. Binding |
|--------|--------------|---------|-------------|----------|----------|--------|----------|---------------|
| SD-XL | UNet+CLIP | 0.55 | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 |
| SD3 | MMDiT+CLIP+T5 | 0.62 | 0.98 | 0.74 | 0.63 | 0.67 | 0.34 | 0.36 |
| Emu3-Gen | Autoregressive | 0.54 | 0.98 | 0.71 | 0.34 | 0.81 | 0.17 | 0.21 |
| Janus | Autoregressive | 0.61 | 0.97 | 0.68 | 0.30 | 0.84 | 0.46 | 0.42 |
| JanusFlow | Autoregressive | 0.63 | 0.97 | 0.59 | 0.45 | 0.83 | 0.53 | 0.42 |
| **GoT Framework** | UNet+Qwen2.5-VL | **0.64** | **0.99** | 0.69 | **0.67** | **0.85** | 0.34 | 0.27 |

</div>
### Image Editing

Our approach also demonstrates superior performance on image editing benchmarks:

<div align="center">

| Method | Emu-Edit CLIP-I | Emu-Edit CLIP-T | ImagenHub GPT-4o Eval. | Reason-Edit GPT-4o Eval. |
|--------|-----------------|-----------------|------------------------|--------------------------|
| IP2P | 0.834 | 0.219 | 0.308 | 0.286 |
| MagicBrush | 0.838 | 0.222 | 0.513 | 0.334 |
| SEED-X | 0.825 | 0.272 | 0.166 | 0.239 |
| CosXL-Edit | 0.860 | 0.274 | 0.464 | 0.325 |
| **GoT Framework** | **0.864** | **0.276** | **0.533** | **0.561** |

</div>
### Interactive Generation

A unique capability of GoT is interactive generation: users can modify the reasoning chain to customize the generated image.

<div align="center">
<img src="figures/interactive.png" width="100%" alt="Interactive Generation" />
</div>

Users can interact with the reasoning chain to:
1. Replace objects
2. Adjust object positions
3. Modify object attributes
## Usage

### Dependencies
- Python >= 3.8 (we recommend [Anaconda](https://www.anaconda.com/download/#linux))
- [PyTorch >= 2.0.1](https://pytorch.org/)
- NVIDIA GPU + [CUDA](https://developer.nvidia.com/cuda-downloads)

### Installation
Clone the repo and install the required packages:

```bash
git clone git@github.com:rongyaofang/GoT.git
cd GoT
pip install -r requirements.txt
```
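After installation, a quick sanity check that PyTorch sees your GPU can save debugging time later:

```python
# Optional sanity check for the dependencies listed above.
import torch

print(torch.__version__)          # expect >= 2.0.1
print(torch.cuda.is_available())  # expect True on an NVIDIA GPU with CUDA
```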
### Model Weights
Place the required model weights in the `./pretrained` directory as follows:

1. GoT-6B model weights
2. [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)
3. [Stable Diffusion XL Base 1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0)

Your directory structure should match the following:

```
GoT
├── pretrained
│   ├── GoT-6B
│   ├── Qwen2.5-VL-3B-Instruct
│   └── stable-diffusion-xl-base-1.0
└── ...
```
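If you prefer to fetch the public checkpoints programmatically, `huggingface_hub.snapshot_download` can populate the layout above. A short sketch, assuming the GoT-6B weights are the ones at the Hugging Face link in the navigation bar:

```python
# Sketch: download the three checkpoints listed above into ./pretrained.
# Requires `pip install huggingface_hub`; gated repos may need `huggingface-cli login`.
from huggingface_hub import snapshot_download

checkpoints = {
    "LucasFang/GoT-6B": "pretrained/GoT-6B",
    "Qwen/Qwen2.5-VL-3B-Instruct": "pretrained/Qwen2.5-VL-3B-Instruct",
    "stabilityai/stable-diffusion-xl-base-1.0": "pretrained/stable-diffusion-xl-base-1.0",
}
for repo_id, local_dir in checkpoints.items():
    snapshot_download(repo_id=repo_id, local_dir=local_dir)
```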
## License

This code is released under the MIT License.