---
base_model:
- black-forest-labs/FLUX.1-dev
- stabilityai/stable-diffusion-3.5-medium
library_name: diffusers
license: mit
pipeline_tag: text-to-image
---
<div align="center">
<h1>TACA: Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers</h1>
</div>
<div align="center">
<span class="author-block">
<a href="https://scholar.google.com/citations?user=FkkaUgwAAAAJ&hl=en" target="_blank">Zhengyao Lv*</a><sup>1</sup>,
</span>
<span class="author-block">
<a href="https://tianlinn.com/" target="_blank">Tianlin Pan*</a><sup>2,3</sup>,
</span>
<span class="author-block">
<a href="https://chenyangsi.github.io/" target="_blank">Chenyang Si</a><sup>2‡†</sup>,
</span>
<span class="author-block">
<a href="https://frozenburning.github.io/" target="_blank">Zhaoxi Chen</a><sup>4</sup>,
</span>
<span class="author-block">
<a href="https://homepage.hit.edu.cn/wangmengmengzuo" target="_blank">Wangmeng Zuo</a><sup>5</sup>,
</span>
<span class="author-block">
<a href="https://liuziwei7.github.io/" target="_blank">Ziwei Liu</a><sup>4†</sup>,
</span>
<span class="author-block">
<a href="https://i.cs.hku.hk/~kykwong/" target="_blank">Kwan-Yee K. Wong</a><sup>1†</sup>
</span>
</div>
<div align="center">
<sup>1</sup>The University of Hong Kong
<sup>2</sup>Nanjing University <br>
<sup>3</sup>University of Chinese Academy of Sciences
<sup>4</sup>Nanyang Technological University<br>
<sup>5</sup>Harbin Institute of Technology
</div>
<div align="center">(*Equal Contribution. <sup>‡</sup>Project Leader. <sup>†</sup>Corresponding Author.)</div>
<p align="center">
<a href="https://huggingface.co/papers/2506.07986">Paper</a> |
<a href="https://vchitect.github.io/TACA/">Project Page</a> |
<a href="https://huggingface.co/ldiex/TACA/tree/main">LoRA Weights</a> |
<a href="https://github.com/Vchitect/TACA">Code</a>
</p>
# About
Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-driven visual generation. However, even state-of-the-art MM-DiT models like FLUX struggle to achieve precise alignment between text prompts and generated content. We identify two key issues in the attention mechanism of MM-DiT, namely 1) the suppression of cross-modal attention due to token imbalance between visual and textual modalities and 2) the lack of timestep-aware attention weighting, both of which hinder alignment. To address these issues, we propose **Temperature-Adjusted Cross-modal Attention (TACA)**, a parameter-efficient method that dynamically rebalances multimodal interactions through temperature scaling and timestep-dependent adjustment. When combined with LoRA fine-tuning, TACA significantly enhances text-image alignment on the T2I-CompBench benchmark with minimal computational overhead. We tested TACA on state-of-the-art models like FLUX and SD3.5, demonstrating its ability to improve image-text alignment in terms of object appearance, attribute binding, and spatial relationships. Our findings highlight the importance of balancing cross-modal attention in improving semantic fidelity in text-to-image diffusion models. Our code is publicly available on [GitHub](https://github.com/Vchitect/TACA).
https://github.com/user-attachments/assets/ae15a853-ee99-4eee-b0fd-8f5f53c308f9
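The core idea can be illustrated with a toy sketch: scale the attention logits assigned to text-token keys by a temperature factor, compensating for the imbalance between the many visual tokens and the few text tokens. The NumPy function below is a minimal single-head illustration, not the paper's exact formulation; the temperature `tau`, the text/image token split, and the symbol names are assumptions for demonstration only.

```python
import numpy as np

def temperature_scaled_attention(q, k, v, text_len, tau=1.2):
    """Toy single-head attention where logits attending to the first
    `text_len` (text) keys are scaled by a temperature tau.
    In TACA-like schemes tau could also vary with the diffusion
    timestep; a constant is used here for simplicity."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)        # (n_queries, n_keys)
    logits[:, :text_len] *= tau          # rebalance toward text tokens
    # Numerically stable softmax over keys
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v, w
```

With `tau > 1`, strongly attended text tokens receive even more attention mass, counteracting their dilution among visual tokens.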
# Usage
You can use `TACA` with `Stable Diffusion 3.5` or `FLUX.1` models.
## With Stable Diffusion 3.5
```python
from diffusers import StableDiffusion3Pipeline
import torch

# Load the base model and the TACA LoRA weights
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium", torch_dtype=torch.bfloat16
)
pipe.load_lora_weights("ldiex/TACA", weight_name="taca_sd3_r64.safetensors")
pipe.to("cuda")
# Generate an image
prompt = "A majestic lion standing proudly on a rocky cliff overlooking a vast savanna at sunset."
image = pipe(prompt).images[0]
image.save("lion_sunset.png")
```
## With FLUX.1
```python
from diffusers import FluxPipeline
import torch
# Load the base model and LoRA weights
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
pipe.load_lora_weights("ldiex/TACA", weight_name="taca_flux_r64.safetensors")
pipe.to("cuda")
# Generate an image
prompt = "A majestic lion standing proudly on a rocky cliff overlooking a vast savanna at sunset."
image = pipe(prompt).images[0]
image.save("lion_sunset.png")
```
# Benchmark
Comparison of alignment evaluation on T2I-CompBench for FLUX.1-Dev-based and SD3.5-Medium-based models. Color, Shape, and Texture measure attribute binding; Spatial and Non-Spatial measure object relationships; higher is better ($\uparrow$) for all columns.

| Model | Color $\uparrow$ | Shape $\uparrow$ | Texture $\uparrow$ | Spatial $\uparrow$ | Non-Spatial $\uparrow$ | Complex $\uparrow$ |
|---|---|---|---|---|---|---|
| FLUX.1-Dev | 0.7678 | 0.5064 | 0.6756 | 0.2066 | 0.3035 | 0.4359 |
| FLUX.1-Dev + TACA ($r = 64$) | **0.7843** | **0.5362** | **0.6872** | **0.2405** | 0.3041 | **0.4494** |
| FLUX.1-Dev + TACA ($r = 16$) | 0.7842 | 0.5347 | 0.6814 | 0.2321 | **0.3046** | 0.4479 |
| SD3.5-Medium | 0.7890 | 0.5770 | 0.7328 | 0.2087 | 0.3104 | 0.4441 |
| SD3.5-Medium + TACA ($r = 64$) | **0.8074** | **0.5938** | **0.7522** | **0.2678** | 0.3106 | 0.4470 |
| SD3.5-Medium + TACA ($r = 16$) | 0.7984 | 0.5834 | 0.7467 | 0.2374 | **0.3111** | **0.4505** |
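The $r$ in the table above is the LoRA rank of the fine-tuned adapter. As a reminder of what that controls, here is a generic sketch of the standard LoRA parameterization (illustrative only, not TACA-specific): a frozen weight $W$ receives a low-rank update $(\alpha/r)BA$, so a larger $r$ gives the adapter more capacity at the cost of more trainable parameters.

```python
import numpy as np

def lora_update(W, A, B, alpha):
    """Apply a rank-r LoRA delta to a frozen weight W.
    A has shape (r, d_in), B has shape (d_out, r); alpha/r is the
    conventional LoRA scaling. Names here are illustrative."""
    r = A.shape[0]
    return W + (alpha / r) * (B @ A)
```

Because the delta `B @ A` has rank at most `r`, an `r = 16` adapter stores far fewer parameters than an `r = 64` one, which matches the small accuracy gap between the two rows in the table.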
# Showcases




# Citation
```bibtex
@article{lv2025taca,
title={TACA: Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers},
author={Lv, Zhengyao and Pan, Tianlin and Si, Chenyang and Chen, Zhaoxi and Zuo, Wangmeng and Liu, Ziwei and Wong, Kwan-Yee K},
journal={arXiv preprint arXiv:2506.07986},
year={2025}
}
```