---
license: apache-2.0
---

# Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis

We systematically study the understudied design space of fusing LLMs with DiT-based diffusion backbones for text-to-image synthesis.

## Resources
- [arXiv: Paper](https://arxiv.org/pdf/2505.10046)
- [GitHub: Code](https://github.com/tang-bd/fuse-dit)

## Quick Start
Download the pre-trained model, then use `FuseDiTPipeline` from our codebase to run inference:

```python
import torch
from diffusion.pipelines import FuseDiTPipeline

# Load the downloaded pipeline from a local directory and move it to the GPU.
pipeline = FuseDiTPipeline.from_pretrained("/path/to/pipeline/").to("cuda")

# Generate one 512x512 image; [0][0] selects the first image of the output.
image = pipeline(
    "your prompt",
    width=512,
    height=512,
    num_inference_steps=25,
    guidance_scale=6.0,
    use_cache=True,
)[0][0]
image.save("test.png")
```
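
To generate images for several prompts, the call above can be wrapped in a small loop. The following is only a sketch: `prompt_to_filename` and `generate_batch` are hypothetical helpers, not part of our codebase, and they assume the pipeline is called and indexed exactly as in the Quick Start (`pipeline(prompt, ...)[0][0]` returns an object with a `save` method).

```python
import re
from pathlib import Path


def prompt_to_filename(prompt, index, suffix=".png"):
    """Hypothetical helper: build a filesystem-safe file name from a prompt."""
    # Collapse every run of non-alphanumeric characters into a single dash.
    slug = re.sub(r"[^a-z0-9]+", "-", prompt.lower()).strip("-")
    # Truncate long prompts and fall back to a fixed name for empty ones.
    slug = slug[:40].rstrip("-") or "image"
    return f"{index:03d}-{slug}{suffix}"


def generate_batch(pipeline, prompts, out_dir, **sampling_kwargs):
    """Hypothetical helper: run the pipeline once per prompt and save each image.

    Assumes `pipeline(prompt, **kwargs)[0][0]` yields an image object with a
    `save(path)` method, as in the Quick Start snippet.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved_paths = []
    for index, prompt in enumerate(prompts):
        image = pipeline(prompt, **sampling_kwargs)[0][0]
        path = out_dir / prompt_to_filename(prompt, index)
        image.save(path)
        saved_paths.append(path)
    return saved_paths
```

With a loaded `pipeline`, a call such as `generate_batch(pipeline, ["a red fox", "a blue bird"], "outputs", width=512, height=512, num_inference_steps=25, guidance_scale=6.0, use_cache=True)` would reuse the Quick Start sampling settings for every prompt.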

## Citation

```bibtex
@article{tang2025exploringdeepfusion,
  title={Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis},
  author={Bingda Tang and Boyang Zheng and Xichen Pan and Sayak Paul and Saining Xie},
  year={2025},
  journal={arXiv preprint arXiv:2505.10046},
}
```