---
library_name: diffusers
license: apache-2.0
pipeline_tag: any-to-any
---

# ThinkGen: Generalized Thinking for Visual Generation

ThinkGen is the first think-driven visual generation framework that explicitly leverages the Chain-of-Thought (CoT) reasoning of Multimodal Large Language Models (MLLMs) across diverse generation scenarios. ThinkGen employs a decoupled architecture comprising a pretrained MLLM and a Diffusion Transformer (DiT): the MLLM generates tailored instructions based on user intent, and the DiT produces high-quality images guided by these instructions.
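The decoupled think-then-generate flow can be sketched as follows. All names below are illustrative stand-ins, not the actual ThinkGen API (see Sample Usage for the real interface); the point is only the two-stage data flow, with the MLLM's CoT output feeding the DiT as a generation instruction.

```python
# Hypothetical sketch of ThinkGen's decoupled architecture.
# mllm_think / dit_generate are stand-ins, not real ThinkGen functions.

def mllm_think(user_prompt: str) -> str:
    """Stage 1 stand-in: the MLLM reasons over user intent (CoT)
    and emits a tailored generation instruction."""
    cot = f"User intent: {user_prompt}. Key elements: subject, setting, lighting."
    instruction = f"{user_prompt}, detailed, natural lighting"
    return f"{cot}\n=> {instruction}"

def dit_generate(instruction: str) -> str:
    """Stage 2 stand-in: the DiT renders an image guided by the
    instruction. Here we return a tag instead of pixels."""
    final_instruction = instruction.splitlines()[-1]
    return f"<image guided by: {final_instruction}>"

prompt = "A young woman in a golden wheat field"
instruction = mllm_think(prompt)   # MLLM produces CoT + rewritten instruction
image = dit_generate(instruction)  # DiT generates, guided by the instruction
print(image)
```

Because the two stages are decoupled, the MLLM's reasoning step can be switched on or off (compare `think=True` in Sample Usage) without changing the DiT.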

- **Paper:** [ThinkGen: Generalized Thinking for Visual Generation](https://huggingface.co/papers/2512.23568)
- **Code:** [GitHub Repository](https://github.com/jiaosiyuu/ThinkGen)

**Authors**: Siyu Jiao, Yiheng Lin, Yujie Zhong, Qi She, Wei Zhou, Xiaohan Lan, Zilong Huang, Fei Yu, Yingchen Yu, Yunqing Zhao, Yao Zhao, Yunchao Wei.

## 🚀 Quick Start

### 🛠️ Environment Setup

```bash
# 1. Clone the repo
git clone https://github.com/jiaosiyuu/ThinkGen.git
cd ThinkGen

# 2. (Optional) Create a clean Python environment
conda create -n thinkgen python=3.11
conda activate thinkgen

# 3. Install dependencies
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install -r req.txt

# ThinkGen runs without flash-attn, but we recommend installing it for best performance.
pip install --no-cache-dir flash-attn==2.7.4.post1 --no-build-isolation
```

### 💻 Sample Usage

```python
from ThinkGen.model import ThinkGen_Chat
import os

model_path = "JSYuuu/ThinkGen"

chat_model = ThinkGen_Chat(
    model_path=model_path,
    dtype='bf16',
    height=1024,
    width=1024
)

# 1. Image Generation
messages = [
    {"type": "text", "value": "A young woman wearing a straw hat, standing in a golden wheat field."}
]
results = chat_model.generate_image(messages)
results.images[0].save("result.png")

# 2. Image Generation with Thinking (CoT)
# This enables the MLLM's CoT reasoning for generation
results_think = chat_model.generate_image(messages, think=True)
print(f"cot & rewrite prompt:\n{results_think.prompt_cot}")
results_think.images[0].save("result_think.png")

# 3. Image Understanding
messages_und = [
    {"type": "image", "value": "images/teaser.png"},
    {"type": "text", "value": "Describe this image"}
]
response = chat_model.generate_text(messages_und)
print(response)
```

## Acknowledgments
This work builds upon the following great open-source projects:
* **OmniGen2:** https://github.com/VectorSpaceLab/OmniGen2
* **Qwen3VL:** https://github.com/QwenLM/Qwen3-VL
* **EasyR1:** https://github.com/hiyouga/EasyR1
* **Flow-GRPO:** https://github.com/yifan123/flow_grpo

## Citation
```bibtex
@article{jiao2025thinkgen,
  title={ThinkGen: Generalized Thinking for Visual Generation},
  author={Jiao, Siyu and Lin, Yiheng and Zhong, Yujie and She, Qi and Zhou, Wei and Lan, Xiaohan and Huang, Zilong and Yu, Fei and Yu, Yingchen and Zhao, Yunqing and Zhao, Yao and Wei, Yunchao},
  journal={arXiv preprint arXiv:2512.23568},
  year={2025}
}
```

## License
This work is licensed under the Apache 2.0 license.