---
base_model:
- black-forest-labs/FLUX.1-dev
- stabilityai/stable-diffusion-3.5-medium
library_name: diffusers
license: mit
pipeline_tag: text-to-image
---

<div align="center">
<h1>TACA: Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers</h1>
</div>

<div align="center">
    <span class="author-block">
      <a href="https://scholar.google.com/citations?user=FkkaUgwAAAAJ&hl=en" target="_blank">Zhengyao Lv*</a><sup>1</sup>,
    </span>
    <span class="author-block">
      <a href="https://tianlinn.com/" target="_blank">Tianlin Pan*</a><sup>2,3</sup>,
    </span>
    <span class="author-block">
      <a href="https://chenyangsi.github.io/" target="_blank">Chenyang Si</a><sup>2‡†</sup>,
    </span>
    <span class="author-block">
      <a href="https://frozenburning.github.io/" target="_blank">Zhaoxi Chen</a><sup>4</sup>,
    </span>
    <span class="author-block">
      <a href="https://homepage.hit.edu.cn/wangmengmengzuo" target="_blank">Wangmeng Zuo</a><sup>5</sup>,
    </span>
    <span class="author-block">
      <a href="https://liuziwei7.github.io/" target="_blank">Ziwei Liu</a><sup>4†</sup>,
    </span>
    <span class="author-block">
      <a href="https://i.cs.hku.hk/~kykwong/" target="_blank">Kwan-Yee K. Wong</a><sup>1†</sup>
    </span>
</div>

<div align="center">
    <sup>1</sup>The University of Hong Kong &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 
    <sup>2</sup>Nanjing University <br> 
    <sup>3</sup>University of Chinese Academy of Sciences &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 
    <sup>4</sup>Nanyang Technological University<br> 
    <sup>5</sup>Harbin Institute of Technology
</div>
<div align="center">(*Equal Contribution.&nbsp;&nbsp;&nbsp;&nbsp;<sup>‡</sup>Project Leader.&nbsp;&nbsp;&nbsp;&nbsp;<sup>†</sup>Corresponding Author.)</div>

<p align="center">
    <a href="https://huggingface.co/papers/2506.07986">Paper</a> | 
    <a href="https://vchitect.github.io/TACA/">Project Page</a> |
    <a href="https://huggingface.co/ldiex/TACA/tree/main">LoRA Weights</a> |
    <a href="https://github.com/Vchitect/TACA">Code</a>
</p>

# About
Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-driven visual generation. However, even state-of-the-art MM-DiT models like FLUX struggle to achieve precise alignment between text prompts and generated content. We identify two key issues in the attention mechanism of MM-DiT, namely 1) the suppression of cross-modal attention due to token imbalance between visual and textual modalities and 2) the lack of timestep-aware attention weighting, which hinder the alignment. To address these issues, we propose **Temperature-Adjusted Cross-modal Attention (TACA)**, a parameter-efficient method that dynamically rebalances multimodal interactions through temperature scaling and timestep-dependent adjustment. When combined with LoRA fine-tuning, TACA significantly enhances text-image alignment on the T2I-CompBench benchmark with minimal computational overhead. We tested TACA on state-of-the-art models like FLUX and SD3.5, demonstrating its ability to improve image-text alignment in terms of object appearance, attribute binding, and spatial relationships. Our findings highlight the importance of balancing cross-modal attention in improving semantic fidelity in text-to-image diffusion models. Our code is publicly available at https://github.com/Vchitect/TACA.
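As a conceptual illustration (not the released implementation), the core idea can be sketched as rescaling the text-key attention logits before the softmax in the joint attention over concatenated text and image tokens. The factor `gamma` and the absence of a timestep schedule here are illustrative simplifications:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def taca_attention(q, k, v, n_text, gamma=1.5):
    """Joint attention over concatenated [text; image] tokens.

    The logits for the first `n_text` (text) keys are rescaled by a
    temperature factor `gamma` before the softmax, counteracting the
    dilution of cross-modal attention caused by the far larger number
    of image tokens. `gamma` and its (omitted) timestep-dependent
    schedule are illustrative assumptions, not the paper's tuned values.
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)          # scaled dot-product logits
    logits[:, :n_text] *= gamma            # temperature scaling on text keys
    return softmax(logits) @ v
```

With `gamma = 1.0` this reduces to standard scaled dot-product attention; values above 1 shift attention mass toward the text tokens.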

https://github.com/user-attachments/assets/ae15a853-ee99-4eee-b0fd-8f5f53c308f9

# Usage

You can use `TACA` with `Stable Diffusion 3.5` or `FLUX.1` models.

## With Stable Diffusion 3.5

```python
from diffusers import StableDiffusion3Pipeline
import torch

# Load the base model and LoRA weights
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium", torch_dtype=torch.float16
)
pipe.load_lora_weights("ldiex/TACA", weight_name="taca_sd3_r64.safetensors")
pipe.to("cuda")

# Generate an image
prompt = "A majestic lion standing proudly on a rocky cliff overlooking a vast savanna at sunset."
image = pipe(prompt).images[0]

image.save("lion_sunset.png")
```

## With FLUX.1

```python
from diffusers import FluxPipeline
import torch

# Load the base model and LoRA weights
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.float16
)
pipe.load_lora_weights("ldiex/TACA", weight_name="taca_flux_r64.safetensors")
pipe.to("cuda")

# Generate an image
prompt = "A majestic lion standing proudly on a rocky cliff overlooking a vast savanna at sunset."
image = pipe(prompt).images[0]

image.save("lion_sunset.png")
```

# Benchmark
Comparison of alignment evaluation on T2I-CompBench for FLUX.1-Dev-based and SD3.5-Medium-based models. Color, Shape, and Texture measure attribute binding; Spatial and Non-Spatial measure object relationships.

| Model | Color $\uparrow$ | Shape $\uparrow$ | Texture $\uparrow$ | Spatial $\uparrow$ | Non-Spatial $\uparrow$ | Complex $\uparrow$ |
|---|---|---|---|---|---|---|
| FLUX.1-Dev | 0.7678 | 0.5064 | 0.6756 | 0.2066 | 0.3035 | 0.4359 |
| FLUX.1-Dev + TACA ($r = 64$) | **0.7843** | **0.5362** | **0.6872** | **0.2405** | 0.3041 | **0.4494** |
| FLUX.1-Dev + TACA ($r = 16$) | 0.7842 | 0.5347 | 0.6814 | 0.2321 | **0.3046** | 0.4479 |
| SD3.5-Medium | 0.7890 | 0.5770 | 0.7328 | 0.2087 | 0.3104 | 0.4441 |
| SD3.5-Medium + TACA ($r = 64$) | **0.8074** | **0.5938** | **0.7522** | **0.2678** | 0.3106 | 0.4470 |
| SD3.5-Medium + TACA ($r = 16$) | 0.7984 | 0.5834 | 0.7467 | 0.2374 | **0.3111** | **0.4505** |

# Showcases
![](https://github.com/Vchitect/TACA/raw/main/static/images/short_1.png)
![](https://github.com/Vchitect/TACA/raw/main/static/images/short_2.png)
![](https://github.com/Vchitect/TACA/raw/main/static/images/long_1.png)
![](https://github.com/Vchitect/TACA/raw/main/static/images/long_2.png)

# Citation
```bibtex
@article{lv2025taca,
  title={TACA: Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers},
  author={Lv, Zhengyao and Pan, Tianlin and Si, Chenyang and Chen, Zhaoxi and Zuo, Wangmeng and Liu, Ziwei and Wong, Kwan-Yee K},
  journal={arXiv preprint arXiv:2506.07986},
  year={2025}
}
```