---
license: apache-2.0
library_name: transformers
pipeline_tag: text-to-image
---

## ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning

[arXiv](https://arxiv.org/abs/2506.03596) [Model](https://huggingface.co/maplebb/ControlThinker) [Paper page](https://huggingface.co/papers/2506.03596) [GitHub Repository](https://github.com/maplebb/controlthinker)

ControlThinker is a framework for controllable image generation that follows a "comprehend-then-generate" paradigm based on visual reasoning. It addresses the semantic gap between input text prompts and target images by using a Multimodal Large Language Model (MLLM) to extract latent semantics from control images. The extracted semantics enrich the text prompt, which improves the visual quality and semantic consistency of the generated images.

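The comprehend-then-generate flow described above can be sketched as a two-stage function. This is only an illustration under stated assumptions: `comprehend_then_generate`, `reason_fn`, and `generate_fn` are hypothetical names standing in for the MLLM and the image generator, and they are not part of the ControlThinker codebase. Passing the control image to the generator as well as the enriched prompt is also an assumption.

```python
from typing import Any, Callable


def comprehend_then_generate(
    control_image: Any,
    prompt: str,
    reason_fn: Callable[[Any, str], str],
    generate_fn: Callable[[Any, str], Any],
) -> Any:
    """Two-stage sketch: enrich the prompt from the control image, then generate."""
    # Stage 1 (comprehend): a stand-in for the MLLM, which reads the control
    # image and the original prompt and returns a semantically richer prompt.
    enriched_prompt = reason_fn(control_image, prompt)
    # Stage 2 (generate): a stand-in for the image generator, conditioned on
    # the control image and the enriched prompt (hypothetical interface).
    return generate_fn(control_image, enriched_prompt)


# Toy stand-ins so the sketch runs without any model weights.
def toy_reason(control_image: Any, prompt: str) -> str:
    return f"{prompt} (details inferred from {control_image})"


def toy_generate(control_image: Any, prompt: str) -> dict:
    return {"control": control_image, "prompt": prompt}


result = comprehend_then_generate(
    "edge_map.png", "a dog by a waterfall", toy_reason, toy_generate
)
```

Keeping the two stages behind plain callables makes it easy to swap the toy functions for real model calls once their actual interfaces are known.
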
The model was presented in the paper [ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning](https://huggingface.co/papers/2506.03596).

<p align="center"><img src="https://github.com/maplebb/controlthinker/raw/main/asset/image/teaser.png" width="95%"></p>

## Usage

The example below uses `FlexARInferenceSolver` to generate an image from a text prompt with ControlThinker.

```python
from inference_solver import FlexARInferenceSolver
from PIL import Image

# ******************** Image Generation ********************
inference_solver = FlexARInferenceSolver(
    model_path="maplebb/ControlThinker",
    precision="bf16",
    target_size=768,
)

# The instruction line and the image description are separated by a newline.
q1 = f"Generate an image of 768x768 according to the following prompt:\n" \
     f"Image of a dog playing water, and a waterfall is in the background."

# generated: tuple of (generated response, list of generated images)
generated = inference_solver.generate(
    images=[],
    qas=[[q1, None]],
    max_gen_len=8192,
    temperature=1.0,
    logits_processor=inference_solver.create_logits_processor(cfg=4.0, image_top_k=2000),
)

a1, new_image = generated[0], generated[1][0]

# You can save and display the generated image
new_image.save("generated_dog.png")
new_image.show()
```

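The prompt in the example above states the resolution in its first line and the image description in the second. A small helper can keep that string consistent with `target_size`. This is a minimal sketch: `build_generation_prompt` is a hypothetical name, not part of `inference_solver`, and the template is copied from the example. Whether resolutions other than 768 are supported is an assumption to check against the repository.

```python
def build_generation_prompt(description: str, size: int = 768) -> str:
    """Format a text-to-image instruction in the style of the usage example."""
    if size <= 0:
        raise ValueError("size must be a positive integer")
    # First line: instruction with the square resolution; second line: description.
    return (
        f"Generate an image of {size}x{size} according to the following prompt:\n"
        f"{description.strip()}"
    )


q1 = build_generation_prompt(
    "Image of a dog playing water, and a waterfall is in the background."
)
```

The resulting `q1` can be passed to `inference_solver.generate` in the same `qas=[[q1, None]]` form as in the example.
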
## License

ControlThinker is licensed under the Apache 2.0 license.

## ✍️ Citation

```bibtex
@article{han2025controlthinker,
  title={ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning},
  author={Han, Feng and Jiao, Yang and Chen, Shaoxiang and Xu, Junhao and Chen, Jingjing and Jiang, Yu-Gang},
  journal={arXiv preprint arXiv:2506.03596},
  year={2025}
}
```