|
|
--- |
|
|
base_model: |
|
|
- black-forest-labs/FLUX.1-dev |
|
|
- stabilityai/stable-diffusion-3.5-medium |
|
|
library_name: diffusers |
|
|
license: mit |
|
|
pipeline_tag: text-to-image |
|
|
--- |
|
|
|
|
|
<div align="center"> |
|
|
<h1>TACA: Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers</h1> |
|
|
</div> |
|
|
|
|
|
<div align="center">
  <span class="author-block"><a href="https://scholar.google.com/citations?user=FkkaUgwAAAAJ&hl=en" target="_blank">Zhengyao Lv*</a><sup>1</sup>,</span>
  <span class="author-block"><a href="https://tianlinn.com/" target="_blank">Tianlin Pan*</a><sup>2,3</sup>,</span>
  <span class="author-block"><a href="https://chenyangsi.github.io/" target="_blank">Chenyang Si</a><sup>2‡†</sup>,</span>
  <span class="author-block"><a href="https://frozenburning.github.io/" target="_blank">Zhaoxi Chen</a><sup>4</sup>,</span>
  <span class="author-block"><a href="https://homepage.hit.edu.cn/wangmengmengzuo" target="_blank">Wangmeng Zuo</a><sup>5</sup>,</span>
  <span class="author-block"><a href="https://liuziwei7.github.io/" target="_blank">Ziwei Liu</a><sup>4†</sup>,</span>
  <span class="author-block"><a href="https://i.cs.hku.hk/~kykwong/" target="_blank">Kwan-Yee K. Wong</a><sup>1†</sup></span>
</div>
|
|
|
|
|
<div align="center"> |
|
|
<sup>1</sup>The University of Hong Kong |
|
|
<sup>2</sup>Nanjing University <br> |
|
|
<sup>3</sup>University of Chinese Academy of Sciences |
|
|
<sup>4</sup>Nanyang Technological University<br> |
|
|
<sup>5</sup>Harbin Institute of Technology |
|
|
</div> |
|
|
<div align="center">(*Equal Contribution. <sup>‡</sup>Project Leader. <sup>†</sup>Corresponding Author.)</div> |
|
|
|
|
|
<p align="center"> |
|
|
<a href="https://huggingface.co/papers/2506.07986">Paper</a> | |
|
|
<a href="https://vchitect.github.io/TACA/">Project Page</a> | |
|
|
<a href="https://huggingface.co/ldiex/TACA/tree/main">LoRA Weights</a> | |
|
|
<a href="https://github.com/Vchitect/TACA">Code</a> |
|
|
</p> |
|
|
|
|
|
# About |
|
|
Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-driven visual generation. However, even state-of-the-art MM-DiT models such as FLUX struggle to achieve precise alignment between text prompts and generated content. We identify two key issues in the attention mechanism of MM-DiT that hinder alignment: 1) the suppression of cross-modal attention caused by the token imbalance between the visual and textual modalities, and 2) the lack of timestep-aware attention weighting. To address these issues, we propose **Temperature-Adjusted Cross-modal Attention (TACA)**, a parameter-efficient method that dynamically rebalances multimodal interactions through temperature scaling and timestep-dependent adjustment. When combined with LoRA fine-tuning, TACA significantly enhances text-image alignment on the T2I-CompBench benchmark with minimal computational overhead. We tested TACA on state-of-the-art models such as FLUX and SD3.5, demonstrating its ability to improve image-text alignment in terms of object appearance, attribute binding, and spatial relationships. Our findings highlight the importance of balancing cross-modal attention for improving semantic fidelity in text-to-image diffusion models. Our code is publicly available at https://github.com/Vchitect/TACA.
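To make the mechanism concrete, below is a minimal sketch of the idea, written from the abstract rather than the paper's exact formulation: the attention logits whose keys are text tokens are multiplied by a temperature factor before the softmax, and that factor is gated on the diffusion timestep. The function names, the `gamma` value, and the step cutoff are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F


def temperature_adjusted_attention(q, k, v, num_text_tokens, temperature):
    """Scaled dot-product attention over concatenated [text; image] tokens
    in which the logits attending *to* text keys are boosted by `temperature`.

    q, k, v: (batch, heads, seq_len, head_dim), with text tokens first.
    """
    logits = (q @ k.transpose(-2, -1)) * q.shape[-1] ** -0.5
    # Cross-modal rebalancing: amplify every query's logits on text keys,
    # counteracting their suppression by the far more numerous image tokens.
    logits[..., :num_text_tokens] = logits[..., :num_text_tokens] * temperature
    return F.softmax(logits, dim=-1) @ v


def taca_temperature(step_frac, gamma=1.25, cutoff=0.5):
    """Illustrative timestep-dependent schedule (an assumption): boost only
    the early, high-noise steps (step_frac near 1.0), where layout and
    semantics are decided, and fall back to plain attention afterwards."""
    return gamma if step_frac > cutoff else 1.0
```

In the released weights, this adjustment is combined with LoRA fine-tuning, so the base model adapts to the rebalanced attention rather than receiving it as a pure test-time modification.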
|
|
|
|
|
https://github.com/user-attachments/assets/ae15a853-ee99-4eee-b0fd-8f5f53c308f9 |
|
|
|
|
|
# Usage |
|
|
|
|
|
You can use `TACA` with `Stable Diffusion 3.5` or `FLUX.1` models. |
|
|
|
|
|
## With Stable Diffusion 3.5 |
|
|
|
|
|
```python
from diffusers import StableDiffusion3Pipeline
import torch

# Load the base model and the TACA LoRA weights
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium", torch_dtype=torch.float16
)
pipe.load_lora_weights("ldiex/TACA", weight_name="taca_sd3_r64.safetensors")
pipe.to("cuda")

# Generate an image
prompt = "A majestic lion standing proudly on a rocky cliff overlooking a vast savanna at sunset."
image = pipe(prompt).images[0]

image.save("lion_sunset.png")
```
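If the pipeline does not fit in GPU memory, `diffusers`' standard CPU offloading can be used in place of `pipe.to("cuda")`; this is generic `diffusers` functionality rather than anything TACA-specific:

```python
# Optional: instead of pipe.to("cuda"), keep submodules on the CPU and move
# each one onto the GPU only while it runs (slower, but far less VRAM).
pipe.enable_model_cpu_offload()
```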
|
|
|
|
|
## With FLUX.1 |
|
|
|
|
|
```python
from diffusers import FluxPipeline
import torch

# Load the base model and the TACA LoRA weights
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.float16
)
pipe.load_lora_weights("ldiex/TACA", weight_name="taca_flux_r64.safetensors")
pipe.to("cuda")

# Generate an image
prompt = "A majestic lion standing proudly on a rocky cliff overlooking a vast savanna at sunset."
image = pipe(prompt).images[0]

image.save("lion_sunset.png")
```
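The strength of the TACA LoRA can also be varied at inference time. Recent `diffusers` releases let LoRA-enabled FLUX and SD3 pipelines read a LoRA scale from `joint_attention_kwargs`; the exact keyword has shifted across versions, so treat this snippet as an assumption to verify against your installed version:

```python
# Assumed interface: scale < 1.0 weakens the TACA LoRA; 1.0 is full strength.
image = pipe(prompt, joint_attention_kwargs={"scale": 0.8}).images[0]
```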
|
|
|
|
|
# Benchmark |
|
|
Comparison of alignment evaluation on T2I-CompBench for FLUX.1-Dev-based and SD3.5-Medium-based models. Color, Shape, and Texture measure attribute binding; Spatial and Non-Spatial measure object relationships; higher is better ($\uparrow$).

| Model | Color $\uparrow$ | Shape $\uparrow$ | Texture $\uparrow$ | Spatial $\uparrow$ | Non-Spatial $\uparrow$ | Complex $\uparrow$ |
|---|---|---|---|---|---|---|
|
|
| FLUX.1-Dev | 0.7678 | 0.5064 | 0.6756 | 0.2066 | 0.3035 | 0.4359 | |
|
|
| FLUX.1-Dev + TACA ($r = 64$) | **0.7843** | **0.5362** | **0.6872** | **0.2405** | 0.3041 | **0.4494** | |
|
|
| FLUX.1-Dev + TACA ($r = 16$) | 0.7842 | 0.5347 | 0.6814 | 0.2321 | **0.3046** | 0.4479 | |
|
|
| SD3.5-Medium | 0.7890 | 0.5770 | 0.7328 | 0.2087 | 0.3104 | 0.4441 | |
|
|
| SD3.5-Medium + TACA ($r = 64$) | **0.8074** | **0.5938** | **0.7522** | **0.2678** | 0.3106 | 0.4470 | |
|
|
| SD3.5-Medium + TACA ($r = 16$) | 0.7984 | 0.5834 | 0.7467 | 0.2374 | **0.3111** | **0.4505** | |
|
|
|
|
|
# Showcases |
|
|
 |
|
|
 |
|
|
 |
|
|
 |
|
|
|
|
|
# Citation |
|
|
```bibtex |
|
|
@article{lv2025taca, |
|
|
title={TACA: Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers}, |
|
|
  author={Lv, Zhengyao and Pan, Tianlin and Si, Chenyang and Chen, Zhaoxi and Zuo, Wangmeng and Liu, Ziwei and Wong, Kwan-Yee K.},
|
|
journal={arXiv preprint arXiv:2506.07986}, |
|
|
year={2025} |
|
|
} |
|
|
``` |