license: mit
pipeline_tag: text-to-image
GoT-R1-1B
GoT-R1 is a framework that applies reinforcement learning to enhance semantic-spatial reasoning in visual generation, as presented in GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning.
Introduction
Visual generation models often struggle with complex prompts that specify multiple objects with precise spatial relationships. GoT-R1 builds on the Generation Chain-of-Thought (GoT) approach and uses reinforcement learning so that the model discovers effective reasoning strategies on its own, guided by a dual-stage, multi-dimensional reward framework. Its main features are:
- Enhanced Semantic-Spatial Reasoning: Uses reinforcement learning (RL) to improve planning of complex scenes.
- Autonomous Reasoning Chain Discovery: Moves beyond fixed templates to allow the model to explore more effective reasoning paths.
- Comprehensive MLLM-based Rewards: Evaluates both the intermediate reasoning process and the final visual output.
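To make the idea of a dual-stage, multi-dimensional reward concrete, the following is a minimal sketch and not the authors' implementation. It assumes an MLLM judge has already produced per-dimension scores in [0, 1] for the reasoning chain and for the final image. The dimension names (`semantic`, `spatial`, `alignment`), the averaging rule, and the `reasoning_weight` parameter are all hypothetical choices made for illustration.

```python
from typing import Mapping


def stage_score(scores: Mapping[str, float]) -> float:
    """Average the per-dimension scores of one stage (hypothetical aggregation)."""
    if not scores:
        raise ValueError("a stage needs at least one score")
    return sum(scores.values()) / len(scores)


def combined_reward(
    reasoning_scores: Mapping[str, float],
    image_scores: Mapping[str, float],
    reasoning_weight: float = 0.5,  # assumed weighting, not taken from the paper
) -> float:
    """Blend the reasoning-stage and image-stage scores into one scalar reward."""
    if not 0.0 <= reasoning_weight <= 1.0:
        raise ValueError("reasoning_weight must lie in [0, 1]")
    reasoning = stage_score(reasoning_scores)
    image = stage_score(image_scores)
    return reasoning_weight * reasoning + (1.0 - reasoning_weight) * image


# Example: scores a judge might assign to one sampled reasoning chain and image.
reward = combined_reward(
    {"semantic": 1.0, "spatial": 0.5},
    {"alignment": 0.25},
)
```

A scalar like `reward` is what an RL algorithm would consume per sample. How the actual framework defines, weights, and combines its reward terms is described in the paper and repository.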
Resources
- GitHub Repository: https://github.com/gogoduan/GoT-R1
- Paper: GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning
Usage
For installation and setup, refer to the official GitHub repository. To run inference, use the script provided in the repository, replacing the placeholder with the path to your GoT-R1 checkpoint:
python infer.py --ckpt_path <Your GoT-R1 checkpoint path>
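If you prefer to launch inference from another Python program, the command above can be assembled as an argument list. This is a small sketch that relies only on the `--ckpt_path` flag shown above. The helper name `build_infer_command` and the checkpoint path are hypothetical, and it assumes `infer.py` is in the current working directory.

```python
import shlex
import sys
from pathlib import Path
from typing import List


def build_infer_command(ckpt_path: str, script: str = "infer.py") -> List[str]:
    """Return the argument list for the repository's inference script.

    Only the --ckpt_path flag documented above is used; no other flags are assumed.
    """
    resolved = str(Path(ckpt_path).expanduser())  # expand a leading "~" if present
    return [sys.executable, script, "--ckpt_path", resolved]


# Placeholder checkpoint path; substitute your own.
cmd = build_infer_command("checkpoints/got-r1-1b")
print(shlex.join(cmd))  # show the command that would be executed
```

Passing `cmd` to `subprocess.run(cmd, check=True)` from the repository root would then run the script and raise an error if it exits with a non-zero status.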
Citation
If you find this work helpful, please consider citing the paper:
@article{duan2025got,
title={GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning},
author={Duan, Chengqi and Fang, Rongyao and Wang, Yuqing and Wang, Kun and Huang, Linjiang and Zeng, Xingyu and Li, Hongsheng and Liu, Xihui},
journal={arXiv preprint arXiv:2505.17022},
year={2025}
}