license: mit
pipeline_tag: text-to-image

GoT-R1-7B

GoT-R1-7B is a multimodal large language model (MLLM) for text-to-image generation with semantic-spatial reasoning, introduced in the paper "GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning".

Overview

Visual generation models often struggle with complex prompts that specify multiple objects with precise spatial relationships and attributes. GoT-R1 addresses this by applying reinforcement learning to improve semantic-spatial reasoning. It builds on the Generation Chain-of-Thought (GoT) approach and lets the model discover effective reasoning strategies on its own. The model uses a unified MLLM architecture (based on Janus-Pro) that autoregressively generates a textual reasoning chain first and then the image tokens.
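The two-phase decoding described above (reasoning text first, image tokens second) can be illustrated with a toy loop. This is a minimal sketch and not the GoT-R1 implementation: the `BOI` boundary token, the `next_token` callable, and the stub model are all hypothetical names introduced here for illustration only.

```python
from typing import Callable, List, Tuple

# Hypothetical boundary token marking the switch from text to image tokens.
BOI = "<begin_of_image>"


def generate_reasoning_then_image(
    next_token: Callable[[List[str]], str],
    prompt: List[str],
    num_image_tokens: int,
    max_text_tokens: int = 64,
) -> Tuple[List[str], List[str]]:
    """Decode a reasoning chain, then a fixed number of image tokens."""
    seq = list(prompt)
    reasoning: List[str] = []

    # Phase 1: sample text tokens until the model emits the boundary token.
    for _ in range(max_text_tokens):
        tok = next_token(seq)
        seq.append(tok)
        if tok == BOI:
            break
        reasoning.append(tok)
    else:
        # Text budget exhausted without a boundary: force the switch.
        seq.append(BOI)

    # Phase 2: sample image tokens, conditioned on prompt + reasoning chain.
    image: List[str] = []
    for _ in range(num_image_tokens):
        tok = next_token(seq)
        seq.append(tok)
        image.append(tok)
    return reasoning, image


def make_toy_model(script: List[str]) -> Callable[[List[str]], str]:
    """Stub 'model': replays a scripted reasoning chain, then counts image tokens."""
    pending = iter(script + [BOI])

    def next_token(seq: List[str]) -> str:
        if BOI in seq:
            # Number of image tokens already emitted after the boundary.
            return f"img_{len(seq) - seq.index(BOI) - 1}"
        return next(pending)

    return next_token
```

Because both phases append to the same sequence, every image token is conditioned on the full reasoning chain, which is the property the unified architecture relies on.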

Usage

To use this model, follow the installation instructions in the official GitHub repository. Once the environment is set up, run inference with the provided script:

python infer.py --ckpt_path <path-to-GoT-R1-7B-weights>
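If you prefer to launch the script from Python, the command above can be wrapped as follows. This is a sketch under assumptions: it expects `infer.py` to be in the current working directory (the repository root) and passes only the `--ckpt_path` flag shown above; the helper names `build_infer_command` and `run_inference` are hypothetical and not part of the repository.

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def build_infer_command(ckpt_path: str, script: str = "infer.py") -> List[str]:
    """Build the argv list equivalent to `python infer.py --ckpt_path <path>`."""
    # sys.executable keeps the call inside the currently active environment.
    return [sys.executable, script, "--ckpt_path", str(Path(ckpt_path))]


def run_inference(ckpt_path: str) -> int:
    """Run the inference script and return its exit code."""
    if not Path(ckpt_path).exists():
        # Fail early with a clear message instead of inside the script.
        raise FileNotFoundError(f"Checkpoint path not found: {ckpt_path}")
    # check=True raises CalledProcessError if the script exits non-zero.
    return subprocess.run(build_infer_command(ckpt_path), check=True).returncode
```

Checking the checkpoint path before spawning the process surfaces a wrong path immediately, before any model loading starts.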

Citation

@article{duan2025got,
  title={GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning},
  author={Duan, Chengqi and Fang, Rongyao and Wang, Yuqing and Wang, Kun and Huang, Linjiang and Zeng, Xingyu and Li, Hongsheng and Liu, Xihui},
  journal={arXiv preprint arXiv:2505.17022},
  year={2025}
}