Update README.md
README.md
@@ -4,4 +4,50 @@ base_model:
 - black-forest-labs/FLUX.1-dev
 - stabilityai/stable-diffusion-3.5-medium
 pipeline_tag: text-to-image
----
+---
+
+<div align="center">
+<h1>TACA: Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers</h1>
+</div>
+
+<div align="center">
+<span class="author-block">
+<a href="https://scholar.google.com/citations?user=FkkaUgwAAAAJ&hl=en" target="_blank">Zhengyao Lv*</a><sup>1</sup>,
+</span>
+<span class="author-block">
+<a href="https://tianlinn.com/" target="_blank">Tianlin Pan*</a><sup>2,3</sup>,
+</span>
+<span class="author-block">
+<a href="https://chenyangsi.github.io/" target="_blank">Chenyang Si</a><sup>2‡†</sup>,
+</span>
+<span class="author-block">
+<a href="https://frozenburning.github.io/" target="_blank">Zhaoxi Chen</a><sup>4</sup>,
+</span>
+<span class="author-block">
+<a href="https://homepage.hit.edu.cn/wangmengzuo" target="_blank">Wangmeng Zuo</a><sup>5</sup>,
+</span>
+<span class="author-block">
+<a href="https://liuziwei7.github.io/" target="_blank">Ziwei Liu</a><sup>4†</sup>,
+</span>
+<span class="author-block">
+<a href="https://i.cs.hku.hk/~kykwong/" target="_blank">Kwan-Yee K. Wong</a><sup>1†</sup>
+</span>
+</div>
+
+<div align="center">
+<sup>1</sup>The University of Hong Kong
+<sup>2</sup>Nanjing University <br>
+<sup>3</sup>University of Chinese Academy of Sciences
+<sup>4</sup>Nanyang Technological University<br>
+<sup>5</sup>Harbin Institute of Technology
+</div>
+<div align="center">(*Equal Contribution. <sup>‡</sup>Project Leader. <sup>†</sup>Corresponding Author.)</div>
+
+<p align="center">
+<a href="https://arxiv.org/abs/">Paper</a> |
+<a href="https://vchitect.github.io/TACA/">Project Page</a> |
+<a href="https://huggingface.co/ldiex/TACA/tree/main">LoRA Weights</a>
+</p>
+
+# About
+We propose **TACA**, a parameter-efficient method that dynamically rebalances cross-modal attention in multimodal diffusion transformers to improve text-image alignment.
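The About line describes TACA as rebalancing cross-modal attention in multimodal diffusion transformers. The sketch below is an illustrative assumption, not the authors' implementation: it models a single image-token query in joint text-image attention, where a handful of text keys compete with many image keys, and multiplies the unnormalised attention on text keys by a hypothetical factor `gamma`. The function names `rebalanced_attention` and `text_share` and the parameter `gamma` are invented for this example; a "dynamic" scheme would presumably vary `gamma` over denoising steps, which is not modelled here.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rebalanced_attention(logits: List[float], n_text: int, gamma: float) -> List[float]:
    """Attention weights of one image-token query in joint text-image attention.

    Assumption for illustration only: keys are ordered [text..., image...] and
    the unnormalised weight of every text key is multiplied by `gamma`, which is
    the same as adding log(gamma) to its logit. gamma == 1.0 gives plain softmax.
    """
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    if not 0 <= n_text <= len(logits):
        raise ValueError("n_text must be in [0, len(logits)]")
    bias = math.log(gamma)
    adjusted = [x + bias if i < n_text else x for i, x in enumerate(logits)]
    return softmax(adjusted)


def text_share(weights: List[float], n_text: int) -> float:
    # Fraction of this query's attention that lands on text keys.
    return sum(weights[:n_text])


if __name__ == "__main__":
    # 1 text key vs 9 image keys with equal raw logits: the image tokens
    # dominate purely because there are more of them.
    logits = [0.0] * 10
    print(text_share(rebalanced_attention(logits, n_text=1, gamma=1.0), 1))  # ≈ 0.1
    print(text_share(rebalanced_attention(logits, n_text=1, gamma=9.0), 1))  # ≈ 0.5
```

Adding `log(gamma)` to a text logit scales that key's exponentiated weight by exactly `gamma`, so any `gamma > 1` shifts probability toward text regardless of the logits' sign; multiplying the raw logits by a temperature-style factor would not guarantee this when the logits are negative.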