---
base_model:
- black-forest-labs/FLUX.1-dev
- stabilityai/stable-diffusion-3.5-medium
library_name: diffusers
license: mit
pipeline_tag: text-to-image
---
<div align="center">
<h1>TACA: Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers</h1>
</div>
<div align="center">
<span class="author-block">
<a href="https://scholar.google.com/citations?user=FkkaUgwAAAAJ&hl=en" target="_blank">Zhengyao Lv*</a><sup>1</sup>,</span>
</span>
<span class="author-block">
<a href="https://tianlinn.com/" target="_blank">Tianlin Pan*</a><sup>2,3</sup>,</span>
</span>
<span class="author-block">
<a href="https://chenyangsi.github.io/" target="_blank">Chenyang Si</a><sup>2‡†</sup>,</span>
</span>
<span class="author-block">
<a href="https://frozenburning.github.io/" target="_blank">Zhaoxi Chen</a><sup>4</sup>,</span>
</span>
<span class="author-block">
<a href="https://homepage.hit.edu.cn/wangmengmengzuo" target="_blank">Wangmeng Zuo</a><sup>5</sup>,</span>
</span>
<span class="author-block">
<a href="https://liuziwei7.github.io/" target="_blank">Ziwei Liu</a><sup>4†</sup>,</span>
</span>
<span class="author-block">
<a href="https://i.cs.hku.hk/~kykwong/" target="_blank">Kwan-Yee K. Wong</a><sup>1†</sup>
</span>
</div>
<div align="center">
<sup>1</sup>The University of Hong Kong &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
<sup>2</sup>Nanjing University <br>
<sup>3</sup>University of Chinese Academy of Sciences &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
<sup>4</sup>Nanyang Technological University<br>
<sup>5</sup>Harbin Institute of Technology
</div>
<div align="center">(*Equal Contribution.&nbsp;&nbsp;&nbsp;&nbsp;<sup>‡</sup>Project Leader.&nbsp;&nbsp;&nbsp;&nbsp;<sup>†</sup>Corresponding Author.)</div>
<p align="center">
<a href="https://huggingface.co/papers/2506.07986">Paper</a> |
<a href="https://vchitect.github.io/TACA/">Project Page</a> |
<a href="https://huggingface.co/ldiex/TACA/tree/main">LoRA Weights</a> |
<a href="https://github.com/Vchitect/TACA">Code</a>
</p>
# About
Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-driven visual generation. However, even state-of-the-art MM-DiT models like FLUX struggle to achieve precise alignment between text prompts and generated content. We identify two key issues in the attention mechanism of MM-DiT that hinder this alignment: 1) the suppression of cross-modal attention due to the token imbalance between the visual and textual modalities, and 2) the lack of timestep-aware attention weighting. To address these issues, we propose **Temperature-Adjusted Cross-modal Attention (TACA)**, a parameter-efficient method that dynamically rebalances multimodal interactions through temperature scaling and timestep-dependent adjustment. Combined with LoRA fine-tuning, TACA significantly enhances text-image alignment on the T2I-CompBench benchmark with minimal computational overhead. We evaluate TACA on state-of-the-art models such as FLUX and SD3.5, demonstrating improved image-text alignment in terms of object appearance, attribute binding, and spatial relationships. Our findings highlight the importance of balancing cross-modal attention for improving semantic fidelity in text-to-image diffusion models. Our code is publicly available at https://github.com/Vchitect/TACA.
https://github.com/user-attachments/assets/ae15a853-ee99-4eee-b0fd-8f5f53c308f9
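To make the temperature-scaling idea concrete, the toy sketch below applies a temperature to the attention logits that queries place on text keys, counteracting the suppression caused by the text/image token imbalance. This is a simplified NumPy illustration, not the released implementation: the function name, shapes, and the constant `tau` (standing in for TACA's timestep-dependent schedule) are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def taca_attention(q, k, v, n_txt, tau=1.5):
    """Joint attention over concatenated [text; image] tokens.

    Logits against the first `n_txt` (text) keys are multiplied by a
    temperature tau > 1, boosting cross-modal attention that would
    otherwise be drowned out by the far more numerous image tokens.
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)      # (num_queries, num_keys)
    logits[:, :n_txt] *= tau           # rescale logits on text keys
    weights = softmax(logits, axis=-1)
    return weights @ v
```

In TACA the temperature is additionally timestep-dependent (stronger at early, high-noise steps); a constant `tau` is used here only for brevity.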
# Usage
You can use `TACA` with `Stable Diffusion 3.5` or `FLUX.1` models.
## With Stable Diffusion 3.5
```python
import torch
from diffusers import StableDiffusion3Pipeline

# Load the base model and the TACA LoRA weights
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium", torch_dtype=torch.bfloat16
)
pipe.load_lora_weights("ldiex/TACA", weight_name="taca_sd3_r64.safetensors")
pipe.to("cuda")

# Generate an image
prompt = "A majestic lion standing proudly on a rocky cliff overlooking a vast savanna at sunset."
image = pipe(prompt).images[0]
image.save("lion_sunset.png")
```
## With FLUX.1
```python
import torch
from diffusers import FluxPipeline

# Load the base model and the TACA LoRA weights
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
pipe.load_lora_weights("ldiex/TACA", weight_name="taca_flux_r64.safetensors")
pipe.to("cuda")

# Generate an image
prompt = "A majestic lion standing proudly on a rocky cliff overlooking a vast savanna at sunset."
image = pipe(prompt).images[0]
image.save("lion_sunset.png")
```
# Benchmark
Comparison of alignment on T2I-CompBench for FLUX.1-Dev-based and SD3.5-Medium-based models. Color, Shape, and Texture measure attribute binding; Spatial and Non-Spatial measure object relationships; higher is better ($\uparrow$).

| Model | Color $\uparrow$ | Shape $\uparrow$ | Texture $\uparrow$ | Spatial $\uparrow$ | Non-Spatial $\uparrow$ | Complex $\uparrow$ |
|---|---|---|---|---|---|---|
| FLUX.1-Dev | 0.7678 | 0.5064 | 0.6756 | 0.2066 | 0.3035 | 0.4359 |
| FLUX.1-Dev + TACA ($r = 64$) | **0.7843** | **0.5362** | **0.6872** | **0.2405** | 0.3041 | **0.4494** |
| FLUX.1-Dev + TACA ($r = 16$) | 0.7842 | 0.5347 | 0.6814 | 0.2321 | **0.3046** | 0.4479 |
| SD3.5-Medium | 0.7890 | 0.5770 | 0.7328 | 0.2087 | 0.3104 | 0.4441 |
| SD3.5-Medium + TACA ($r = 64$) | **0.8074** | **0.5938** | **0.7522** | **0.2678** | 0.3106 | 0.4470 |
| SD3.5-Medium + TACA ($r = 16$) | 0.7984 | 0.5834 | 0.7467 | 0.2374 | **0.3111** | **0.4505** |
# Showcases
![](https://github.com/Vchitect/TACA/raw/main/static/images/short_1.png)
![](https://github.com/Vchitect/TACA/raw/main/static/images/short_2.png)
![](https://github.com/Vchitect/TACA/raw/main/static/images/long_1.png)
![](https://github.com/Vchitect/TACA/raw/main/static/images/long_2.png)
# Citation
```bibtex
@article{lv2025taca,
title={TACA: Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers},
author={Lv, Zhengyao and Pan, Tianlin and Si, Chenyang and Chen, Zhaoxi and Zuo, Wangmeng and Liu, Ziwei and Wong, Kwan-Yee K},
journal={arXiv preprint arXiv:2506.07986},
year={2025}
}
```