---
base_model:
- black-forest-labs/FLUX.1-dev
- stabilityai/stable-diffusion-3.5-medium
library_name: diffusers
license: mit
pipeline_tag: text-to-image
---
<div align="center">
<h1>TACA: Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers</h1>
</div>
<div align="center">
<span class="author-block">
<a href="https://scholar.google.com/citations?user=FkkaUgwAAAAJ&hl=en" target="_blank">Zhengyao Lv*</a><sup>1</sup>,
</span>
<span class="author-block">
<a href="https://tianlinn.com/" target="_blank">Tianlin Pan*</a><sup>2,3</sup>,
</span>
<span class="author-block">
<a href="https://chenyangsi.github.io/" target="_blank">Chenyang Si</a><sup>2‡†</sup>,
</span>
<span class="author-block">
<a href="https://frozenburning.github.io/" target="_blank">Zhaoxi Chen</a><sup>4</sup>,
</span>
<span class="author-block">
<a href="https://homepage.hit.edu.cn/wangmengzuo" target="_blank">Wangmeng Zuo</a><sup>5</sup>,
</span>
<span class="author-block">
<a href="https://liuziwei7.github.io/" target="_blank">Ziwei Liu</a><sup>4†</sup>,
</span>
<span class="author-block">
<a href="https://i.cs.hku.hk/~kykwong/" target="_blank">Kwan-Yee K. Wong</a><sup>1†</sup>
</span>
</div>
<div align="center">
<sup>1</sup>The University of Hong Kong
<sup>2</sup>Nanjing University <br>
<sup>3</sup>University of Chinese Academy of Sciences
<sup>4</sup>Nanyang Technological University<br>
<sup>5</sup>Harbin Institute of Technology
</div>
<div align="center">(*Equal Contribution. <sup>‡</sup>Project Leader. <sup>†</sup>Corresponding Author.)</div>
<p align="center">
<a href="https://huggingface.co/papers/2506.07986">Paper</a> |
<a href="https://vchitect.github.io/TACA/">Project Page</a> |
<a href="https://huggingface.co/ldiex/TACA/tree/main">LoRA Weights</a> |
<a href="https://github.com/Vchitect/TACA">Code</a>
</p>
# About
We propose **TACA**, a parameter-efficient method that dynamically rebalances cross-modal attention in multimodal diffusion transformers to improve text-image alignment.
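
To make "rebalancing cross-modal attention" concrete, here is a minimal NumPy sketch, not the released implementation. It assumes a single attention head over a joint sequence with text tokens first and image tokens after, and it amplifies the image-to-text attention logits by a factor `gamma` when the timestep `t` is at or above a threshold. The function name `rebalanced_joint_attention`, the parameters `gamma` and `t_thresh`, and the convention that `t` lies in [0, 1] with 1 as the noisiest step are all hypothetical choices made for illustration; refer to the linked code repository for the actual method.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def rebalanced_joint_attention(q, k, v, n_text, t, gamma=1.5, t_thresh=0.5):
    """Single-head joint attention with amplified image-to-text logits.

    q, k, v  : arrays of shape (n_tokens, d); the first n_text rows are text
               tokens, the remaining rows are image tokens (assumed layout).
    t        : timestep in [0, 1], with 1 the noisiest step (assumed convention).
    gamma    : hypothetical scaling factor for image-query / text-key logits.
    t_thresh : hypothetical threshold; scaling applies only when t >= t_thresh.

    Returns (output, attention_weights).
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)  # fresh array, safe to modify in place

    if t >= t_thresh:
        # Only rows of image queries and columns of text keys are rescaled;
        # text queries and image-to-image logits are left untouched.
        logits[n_text:, :n_text] *= gamma

    weights = softmax(logits, axis=-1)
    return weights @ v, weights


# Usage: 2 text tokens followed by 3 image tokens, feature dimension 4.
rng = np.random.default_rng(0)
q = rng.normal(size=(5, 4))
k = rng.normal(size=(5, 4))
v = rng.normal(size=(5, 4))
out, attn = rebalanced_joint_attention(q, k, v, n_text=2, t=0.9)
print(out.shape, attn.sum(axis=-1))
```

Because the scaling multiplies logits before the softmax, each row of attention weights still sums to one. Restricting the change to the image-query, text-key block leaves the rest of the joint attention unchanged.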