metadata
license: mit
pipeline_tag: text-to-image

GoT-R1-1B

GoT-R1 is a framework that applies reinforcement learning to enhance semantic-spatial reasoning in visual generation, as presented in the paper "GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning".

Introduction

Visual generation models often struggle with complex prompts that specify multiple objects with precise spatial relationships. GoT-R1 addresses this with reinforcement learning: building upon the Generation Chain-of-Thought (GoT) approach, it lets models autonomously discover effective reasoning strategies, guided by a dual-stage multi-dimensional reward framework.

  • Enhanced Semantic-Spatial Reasoning: Uses RL to improve planning of complex scenes.
  • Autonomous Reasoning Chain Discovery: Moves beyond fixed templates to allow the model to explore more effective reasoning paths.
  • Comprehensive MLLM-based Rewards: Evaluates both the intermediate reasoning process and the final visual output.
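
As a rough illustration of the last point, the sketch below blends scores from a reasoning stage and an image stage into a single scalar reward. The dimension names (`semantic`, `spatial`, `prompt_alignment`), the plain averaging, and the `stage_weight` parameter are all assumptions made for illustration. This README does not specify how GoT-R1 aggregates its rewards, so refer to the paper for the actual formulation.

```python
from statistics import mean


def combined_reward(reasoning_scores, image_scores, stage_weight=0.5):
    """Blend reasoning-stage and image-stage scores into one scalar.

    Hypothetical aggregation: each stage is averaged over its own
    dimensions, then the two stage means are mixed by `stage_weight`.
    This is NOT the formula used by GoT-R1.
    """
    if not reasoning_scores or not image_scores:
        raise ValueError("each stage needs at least one score")
    if not 0.0 <= stage_weight <= 1.0:
        raise ValueError("stage_weight must lie in [0, 1]")

    reasoning = mean(reasoning_scores.values())  # intermediate reasoning quality
    image = mean(image_scores.values())          # final visual output quality
    return stage_weight * reasoning + (1.0 - stage_weight) * image


# Example with made-up scores, as an MLLM judge might return them.
reward = combined_reward(
    {"semantic": 0.8, "spatial": 0.6},
    {"prompt_alignment": 0.9},
)
```

Averaging keeps the sketch simple. A multiplicative combination would instead penalize a sample that fails badly on any single dimension.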

Resources

Usage

For installation and setup, please refer to the official GitHub repository. To run inference using the provided script from the repository:

python infer.py --ckpt_path <Your GoT-R1 checkpoint path>
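
For scripted runs, the same command can be wrapped in Python. The following is a minimal sketch that assumes `infer.py` sits at the root of a local clone of the repository and accepts only the `--ckpt_path` flag shown above. The helper names `build_infer_command` and `run_inference` and the `repo_dir` parameter are hypothetical and not part of the repository.

```python
import subprocess
import sys
from pathlib import Path


def build_infer_command(ckpt_path, script="infer.py"):
    """Return the argument list equivalent to the command shown above."""
    return [sys.executable, script, "--ckpt_path", str(ckpt_path)]


def run_inference(ckpt_path, repo_dir="."):
    """Run infer.py from `repo_dir` (assumed: a local clone of the repo)."""
    ckpt = Path(ckpt_path)
    if not ckpt.exists():
        raise FileNotFoundError(f"checkpoint not found: {ckpt}")
    # Resolve to an absolute path because the subprocess runs in repo_dir.
    cmd = build_infer_command(ckpt.resolve())
    subprocess.run(cmd, cwd=repo_dir, check=True)
```

Calling `run_inference("<Your GoT-R1 checkpoint path>", repo_dir="<path to the cloned repository>")` fails early with a clear error if the checkpoint path does not exist, instead of failing inside the script.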

Citation

If you find this work helpful, please consider citing the paper:

@article{duan2025got,
  title={GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning},
  author={Duan, Chengqi and Fang, Rongyao and Wang, Yuqing and Wang, Kun and Huang, Linjiang and Zeng, Xingyu and Li, Hongsheng and Liu, Xihui},
  journal={arXiv preprint arXiv:2505.17022},
  year={2025}
}