	# GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing
	<div align="center">
	Rongyao Fang<sup>1</sup>, Chengqi Duan<sup>2</sup>, Kun Wang<sup>3</sup>, Linjiang Huang<sup>6</sup>, Hao Li<sup>1,4</sup>, Shilin Yan, Hao Tian<sup>3</sup>, Xingyu Zeng<sup>3</sup>, Rui Zhao<sup>3</sup>, Jifeng Dai<sup>4,5</sup>, Xihui Liu<sup>2</sup>, Hongsheng Li<sup>1</sup>

	<sup>1</sup>CUHK MMLab, <sup>2</sup>HKU MMLab, <sup>3</sup>SenseTime, <sup>4</sup>Shanghai AI Laboratory, <sup>5</sup>Tsinghua University, <sup>6</sup>Beihang University

	*Equal contribution
	</div>

	<div align="center" style="line-height: 1.2;">
	<a href="https://arxiv.org/abs/xxx" target="_blank"><b>Paper</b></a> •
	<a href="#introduction">Introduction</a> •
	<a href="#released-datasets">Datasets</a> •
	<a href="#model-features">Model</a> •
	<a href="#results">Results</a> •
	<a href="https://huggingface.co/LucasFang/GoT-6B" target="_blank">🤗 Hugging Face</a> •
	<a href="#license">License</a>
	</div>

	## Introduction

	We present Generation Chain-of-Thought (GoT), a novel paradigm that enables generation and editing through an explicit language reasoning process before outputting images. This approach transforms conventional text-to-image generation and editing into a reasoning-guided framework that analyzes semantic relationships and spatial arrangements.

	GoT pioneers a new direction for reasoning-driven visual generation and editing, producing images that better align with human intent through:

	- Semantic-Spatial Reasoning: Integrates both semantic understanding and explicit spatial coordinates
	- Unified Framework: Handles both image generation and editing with the same architecture

	## Released Datasets

	| Dataset | Link | Amount |
	|---------|------|--------|
	| Laion-Aesthetics-High-Resolution-GoT | [🤗 HuggingFace](https://huggingface.co/datasets/LucasFang/Laion-Aesthetics-High-Resolution-GoT) | 3.77M |
	| JourneyDB-GoT | [🤗 HuggingFace](https://huggingface.co/datasets/LucasFang/JourneyDB-GoT) | 4.09M |
	| OmniEdit-GoT | [🤗 HuggingFace](https://huggingface.co/datasets/LucasFang/OmniEdit-GoT) | 736K |

	## Dataset Features

	### Laion-Aesthetics-High-Resolution-GoT
	- 3.77 million high-quality images from Laion-Aesthetics, filtered to sizes larger than 512 pixels
	- Prompts and GoT descriptions from Qwen2-VL
	- Prompts averaging 110.81 characters
	- GoT descriptions averaging 811.56 characters
	- 3.78 bounding boxes per image on average

	### JourneyDB-GoT
	- 4.09 million high-quality AI-generated images
	- Prompts and GoT descriptions from Qwen2-VL
	- Prompts averaging 149.78 characters
	- GoT descriptions averaging 906.01 characters
	- 4.09 bounding boxes per image on average
	- Please download the images from [JourneyDB dataset](https://opendatalab.com/OpenDataLab/JourneyDB/tree/main/raw/JourneyDB/train/imgs)

	### OmniEdit-GoT
	- 736K high-quality image editing samples from OmniEdit
	- Diverse editing operations (addition, removal, swap, attribute changes, style transfer)
	- Detailed reasoning chains with step-by-step editing processes
	- Precise spatial coordinate annotations for editing regions
	- Please download the images from [OmniEdit dataset](https://huggingface.co/datasets/TIGER-Lab/OmniEdit-Filtered-1.2M)
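The three GoT annotation sets listed under Released Datasets can be fetched programmatically. The sketch below uses `snapshot_download` from `huggingface_hub` with the repository IDs from the table. The `./data` target directory is an assumption made for illustration, and the JourneyDB and OmniEdit images still have to be downloaded separately from the links given in the lists.

```python
from pathlib import Path

# Repository IDs taken from the "Released Datasets" table.
GOT_DATASETS = [
    "LucasFang/Laion-Aesthetics-High-Resolution-GoT",
    "LucasFang/JourneyDB-GoT",
    "LucasFang/OmniEdit-GoT",
]


def download_annotations(target_root: str = "./data") -> dict:
    """Download each GoT dataset repo into its own folder under target_root.

    target_root is a hypothetical location, not one the GoT repo prescribes.
    """
    # Imported lazily so this file can be read without the dependency installed.
    from huggingface_hub import snapshot_download

    local_paths = {}
    for repo_id in GOT_DATASETS:
        local_dir = Path(target_root) / repo_id.split("/")[-1]
        local_paths[repo_id] = snapshot_download(
            repo_id=repo_id,
            repo_type="dataset",
            local_dir=str(local_dir),
        )
    return local_paths


if __name__ == "__main__":
    for repo_id, path in download_annotations().items():
        print(repo_id, "->", path)
```

These repositories are large, so it may be worth downloading one at a time and checking free disk space first.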

	## Model Features

	Our GoT framework consists of two key components:

	1. Semantic-Spatial MLLM: Generates detailed reasoning chains with spatial information using Qwen2.5-VL as the backbone
	2. SSGM Diffusion Module: Leverages the semantic guidance, spatial layouts, and reference images to create high-quality visual outputs

	The Semantic-Spatial Guidance Module (SSGM) combines three guidance pathways:
	- Semantic Guidance: Captures relationships and attributes
	- Spatial Guidance: Controls precise object placement
	- Reference Guidance: Provides context for editing tasks

	## Results

	### Text-to-Image Generation

	GoT achieves the highest overall score on the GenEval benchmark among the compared methods, leading on single-object generation and counting and matching the best color score:

	<div align="center">

	| Method | Architecture | Overall | Single Obj. | Two Obj. | Counting | Colors | Position | Attr. Binding |
	|--------|--------------|---------|-------------|----------|----------|--------|----------|---------------|
	| SD-XL | Unet+CLIP | 0.55 | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 |
	| SD3 | MMDIT+CLIP+T5 | 0.62 | 0.98 | 0.74 | 0.63 | 0.67 | 0.34 | 0.36 |
	| Emu3-Gen | Autoregressive | 0.54 | 0.98 | 0.71 | 0.34 | 0.81 | 0.17 | 0.21 |
	| Janus | Autoregressive | 0.61 | 0.97 | 0.68 | 0.30 | 0.84 | 0.46 | 0.42 |
	| JanusFlow | Autoregressive | 0.63 | 0.97 | 0.59 | 0.45 | 0.83 | 0.53 | 0.42 |
	| GoT Framework | Unet+Qwen2.5-VL | 0.64 | 0.99 | 0.69 | 0.67 | 0.85 | 0.34 | 0.27 |

	</div>
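The Overall column in the GenEval table appears to be the unweighted mean of the six task scores. That is an assumption inferred from the numbers, not something stated here. The short check below recomputes the mean for each row so the reported figures can be compared against it.

```python
# Scores copied from the GenEval table above, as
# (overall, [single obj., two obj., counting, colors, position, attr. binding]).
GENEVAL = {
    "SD-XL": (0.55, [0.98, 0.74, 0.39, 0.85, 0.15, 0.23]),
    "SD3": (0.62, [0.98, 0.74, 0.63, 0.67, 0.34, 0.36]),
    "Emu3-Gen": (0.54, [0.98, 0.71, 0.34, 0.81, 0.17, 0.21]),
    "Janus": (0.61, [0.97, 0.68, 0.30, 0.84, 0.46, 0.42]),
    "JanusFlow": (0.63, [0.97, 0.59, 0.45, 0.83, 0.53, 0.42]),
    "GoT Framework": (0.64, [0.99, 0.69, 0.67, 0.85, 0.34, 0.27]),
}


def mean_of_tasks(task_scores):
    """Unweighted mean of the per-task scores (assumed definition of Overall)."""
    return sum(task_scores) / len(task_scores)


for method, (overall, tasks) in GENEVAL.items():
    print(f"{method:14s} reported={overall:.2f} recomputed={mean_of_tasks(tasks):.3f}")
```

Small differences between the reported and recomputed values are expected, because the per-task scores are themselves rounded to two decimals.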

	### Image Editing

	GoT also achieves the best score among the compared methods on every reported image editing metric:

	<div align="center">

	| Method | Emu-Edit (CLIP-I) | Emu-Edit (CLIP-T) | ImagenHub (GPT-4o Eval.) | Reason-Edit (GPT-4o Eval.) |
	|--------|-------------------|-------------------|--------------------------|----------------------------|
	| IP2P | 0.834 | 0.219 | 0.308 | 0.286 |
	| MagicBrush | 0.838 | 0.222 | 0.513 | 0.334 |
	| SEED-X | 0.825 | 0.272 | 0.166 | 0.239 |
	| CosXL-Edit | 0.860 | 0.274 | 0.464 | 0.325 |
	| GoT Framework | 0.864 | 0.276 | 0.533 | 0.561 |

	</div>

	## Usage

	### Dependencies
	- Python >= 3.8 ([Anaconda](https://www.anaconda.com/download/#linux) is recommended)
	- [PyTorch >= 2.0.1](https://pytorch.org/)
	- NVIDIA GPU + [CUDA](https://developer.nvidia.com/cuda-downloads)
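The requirements above can be checked before installing anything else. The following is a minimal sketch, not part of the GoT repository. The helper names `version_tuple` and `check_environment` are made up for illustration, and it relies only on `sys.version_info`, `torch.__version__`, and `torch.cuda.is_available()`.

```python
import re
import sys


def version_tuple(version: str) -> tuple:
    """Turn a version string such as '2.0.1+cu118' into (2, 0, 1)."""
    core = version.split("+")[0]  # drop local build suffixes like '+cu118'
    numbers = []
    for piece in core.split(".")[:3]:
        match = re.match(r"\d+", piece)  # keep only the leading digits
        numbers.append(int(match.group()) if match else 0)
    return tuple(numbers)


def check_environment() -> list:
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    if sys.version_info < (3, 8):
        problems.append("Python >= 3.8 is required")
    try:
        import torch
    except ImportError:
        problems.append("PyTorch is not installed")
        return problems
    if version_tuple(torch.__version__) < (2, 0, 1):
        problems.append(f"PyTorch >= 2.0.1 is required, found {torch.__version__}")
    if not torch.cuda.is_available():
        problems.append("No CUDA-capable GPU is visible to PyTorch")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```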

	### Installation
	Clone the repository and install the dependencies:

	```bash
	git clone git@github.com:rongyaofang/GoT.git
	cd GoT
	pip install -r requirements.txt
	```

	### Model Weights
	Place the required model weights in the `./pretrained` directory as follows:

	1. [GoT-6B](https://huggingface.co/LucasFang/GoT-6B) model weights
	2. [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)
	3. [Stable Diffusion XL Base 1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0)

	Your directory structure should match the following:

	```
	GoT
	├── pretrained
	│   ├── GoT-6B
	│   ├── Qwen2.5-VL-3B-Instruct
	│   └── stable-diffusion-xl-base-1.0
	├── ...
	```
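A quick way to confirm the layout above is to look for the three expected folders under `pretrained`. This is a small sketch using only the standard library. The function name `missing_weight_dirs` is hypothetical, and it checks only that the folders exist, not that their contents are complete.

```python
from pathlib import Path

# Folder names taken from the directory structure shown above.
EXPECTED_WEIGHT_DIRS = [
    "GoT-6B",
    "Qwen2.5-VL-3B-Instruct",
    "stable-diffusion-xl-base-1.0",
]


def missing_weight_dirs(repo_root: str = ".") -> list:
    """Return the expected weight folders that are absent under <repo_root>/pretrained."""
    pretrained = Path(repo_root) / "pretrained"
    return [name for name in EXPECTED_WEIGHT_DIRS if not (pretrained / name).is_dir()]


if __name__ == "__main__":
    missing = missing_weight_dirs()
    if missing:
        print("Missing under ./pretrained:", ", ".join(missing))
    else:
        print("All expected weight folders are present.")
```

Run it from the root of the cloned `GoT` directory so that `./pretrained` resolves correctly.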

	## License

	This code is released under the MIT License.