---
base_model:
- black-forest-labs/FLUX.1-dev
- stabilityai/stable-diffusion-3.5-medium
license: mit
pipeline_tag: text-to-image
library_name: diffusers
---

<div align="center">
<h1>TACA: Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers</h1>
</div>

<div align="center">
<span class="author-block">
<a href="https://scholar.google.com/citations?user=FkkaUgwAAAAJ&hl=en" target="_blank">Zhengyao Lv*</a><sup>1</sup>,
</span>
<span class="author-block">
<a href="https://tianlinn.com/" target="_blank">Tianlin Pan*</a><sup>2,3</sup>,
</span>
<span class="author-block">
<a href="https://chenyangsi.github.io/" target="_blank">Chenyang Si</a><sup>2‡†</sup>,
</span>
<span class="author-block">
<a href="https://frozenburning.github.io/" target="_blank">Zhaoxi Chen</a><sup>4</sup>,
</span>
<span class="author-block">
<a href="https://homepage.hit.edu.cn/wangmengzuo" target="_blank">Wangmeng Zuo</a><sup>5</sup>,
</span>
<span class="author-block">
<a href="https://liuziwei7.github.io/" target="_blank">Ziwei Liu</a><sup>4†</sup>,
</span>
<span class="author-block">
<a href="https://i.cs.hku.hk/~kykwong/" target="_blank">Kwan-Yee K. Wong</a><sup>1†</sup>
</span>
</div>

<div align="center">
<sup>1</sup>The University of Hong Kong &nbsp;
<sup>2</sup>Nanjing University <br>
<sup>3</sup>University of Chinese Academy of Sciences &nbsp;
<sup>4</sup>Nanyang Technological University<br>
<sup>5</sup>Harbin Institute of Technology
</div>
<div align="center">(*Equal Contribution. <sup>‡</sup>Project Leader. <sup>†</sup>Corresponding Author.)</div>

<p align="center">
<a href="https://arxiv.org/abs/2506.07986">Paper</a> |
<a href="https://vchitect.github.io/TACA/">Project Page</a> |
<a href="https://huggingface.co/ldiex/TACA/tree/main">LoRA Weights</a> |
<a href="https://github.com/Vchitect/TACA">Code</a>
</p>

# About
We propose **TACA**, a parameter-efficient method that dynamically rebalances cross-modal attention in multimodal diffusion transformers to improve text-image alignment.
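
To make "rebalancing cross-modal attention" concrete, here is a toy sketch rather than the released implementation. It assumes that rebalancing can be pictured as multiplying the attention logits that point at text tokens by a factor before the softmax. The names `rebalanced_attention_row`, `is_text_key`, and `gamma` are hypothetical and do not come from the TACA codebase, and the fixed `gamma` is a simplification of the dynamic behavior described above.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def rebalanced_attention_row(logits, is_text_key, gamma):
    """Attention weights for one query over a joint [text; image] key sequence.

    Assumption (illustrative only): logits for text keys are multiplied by
    `gamma` before the softmax; image-key logits are left unchanged.
    """
    scaled = [l * gamma if t else l for l, t in zip(logits, is_text_key)]
    return softmax(scaled)


def text_mass(weights, is_text_key):
    """Total attention weight assigned to text keys."""
    return sum(w for w, t in zip(weights, is_text_key) if t)


# One image query attending to 2 text keys and 4 image keys (made-up logits).
logits = [2.0, 1.5, 2.2, 2.1, 1.9, 2.0]
is_text_key = [True, True, False, False, False, False]

baseline = rebalanced_attention_row(logits, is_text_key, gamma=1.0)
boosted = rebalanced_attention_row(logits, is_text_key, gamma=1.5)

print(f"text attention mass, gamma=1.0: {text_mass(baseline, is_text_key):.3f}")
print(f"text attention mass, gamma=1.5: {text_mass(boosted, is_text_key):.3f}")
```

With positive text logits, as in this example, a factor above 1 shifts attention mass toward the text tokens, and `gamma=1.0` recovers ordinary softmax attention. For the actual formulation, see the paper and code linked above.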