---
pipeline_tag: text-to-image
library_name: diffusers
---
# Think-Then-Generate: Reasoning-Aware Text-to-Image Diffusion with LLM Encoders
This repository contains the weights for Think-Then-Generate (T2G), a novel reasoning-aware text-to-image diffusion model paradigm.
Paper | GitHub Repository | Project Page | Hugging Face Space
## Introduction
Most existing Text-to-Image (T2I) diffusion models, even those with LLM-based text encoders, act primarily as text-to-pixel mappers: they encode text without fully leveraging the LLM's inherent reasoning capabilities to infer what should be visually depicted. The Think-Then-Generate paradigm moves beyond this literal generation by encouraging the LLM-based text encoder to reason about and rewrite raw user prompts. The encoder's hidden states for these rewritten prompts then serve as the diffusion conditioning, leading to more factually consistent, semantically aligned, and visually realistic generations.
## How it Works
The T2G paradigm consists of two main phases:
### Phase I: Reasoning Activation

The reasoning potential of the LLM-based text encoder is activated through a lightweight supervised fine-tuning (SFT) process. Instead of directly passing the raw prompt to the generator, the LLM is encouraged to reason about the user's intent and rewrite the prompt into a detailed, structured description that serves as conditioning for the Diffusion Transformer (DiT) backbone.
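The rewrite-then-condition flow can be sketched schematically as follows. This is a minimal illustration with stand-in components, not the repository's models: `rewrite` plays the role of the SFT-activated LLM's reasoning step, and the random projection stands in for the encoder's per-token hidden states.

```python
import torch

def rewrite(raw_prompt: str) -> str:
    # Stand-in for the LLM's reason-and-rewrite step: the real encoder
    # infers intent and produces a detailed, structured description.
    return (f"{raw_prompt}, depicted as a detailed, structured scene with "
            "explicit objects, layout, and style")

def encode(prompt: str, dim: int = 64) -> torch.Tensor:
    # Stand-in for the LLM encoder's per-token hidden states
    # (shape: num_tokens x hidden_dim), which condition the DiT.
    tokens = prompt.split()
    return torch.randn(len(tokens), dim)

raw = "marking the boat to find a sword"
conditioning = encode(rewrite(raw))  # tensor fed to the DiT backbone
```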
### Phase II: Co-Evolution via Dual-GRPO

To ensure that reasoning effectively improves image quality, a co-optimization strategy called Dual-GRPO is employed for both the "Brain" (LLM encoder) and the "Painter" (DiT backbone):
- For the LLM Encoder: It is reinforced using image-grounded rewards, with a focus on semantic alignment. This encourages the model to activate latent world knowledge and infer precise visual details critical for accurate generation.
- For the DiT Backbone: It is simultaneously trained with visual realism and aesthetic quality rewards, conditioned on the refined prompts. This aligns the generator's capabilities with the complex and detailed instructions produced by the LLM.
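The group-relative reward normalization at the heart of GRPO-style updates can be sketched as follows. This is a minimal illustration of the general technique, not the repository's training code; the reward values are placeholders.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor,
                              eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages: normalize each reward within its group,
    A_i = (r_i - mean(group)) / std(group).

    rewards: shape (num_groups, group_size) -- one group per prompt,
    one reward per sampled generation for that prompt.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: two prompts, four sampled images each, placeholder rewards
# (e.g. semantic alignment for the LLM, realism/aesthetics for the DiT).
rewards = torch.tensor([[0.2, 0.8, 0.5, 0.5],
                        [0.9, 0.1, 0.7, 0.3]])
adv = group_relative_advantages(rewards)
# Samples above their group's mean receive positive advantages.
```

Because advantages are computed relative to other samples for the same prompt, no learned value function is needed, which keeps the co-optimization of both components lightweight.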
## Installation

To install the necessary dependencies, refer to the GitHub repository:

```bash
pip install torch transformers diffusers accelerate
```
## Inference

You can run the model on a single GPU using the inference script provided in the GitHub repository:

```bash
python inference.py \
    --model_path "SJTU-Deng-Lab/Think-Then-Generate-T2I" \
    --prompt "A multi-panel illustration showing the story of marking the boat to find a sword, with clear steps from dropping the sword to carving a mark on the boat." \
    --output "sword_result.jpg"
```
## Citation

If you find our work helpful, please consider citing:

```bibtex
@article{kou2026think,
  title={Think-Then-Generate: Reasoning-Aware Text-to-Image Diffusion with LLM Encoders},
  author={Siqi Kou and Jiachun Jin and Zetong Zhou and Ye Ma and Yugang Wang and Quan Chen and Peng Jiang and Xiao Yang and Jun Zhu and Kai Yu and Zhijie Deng},
  journal={arXiv preprint},
  year={2026}
}
```