Improve model card with abstract, diffusers usage, benchmarks, and showcases
#3
by nielsr (HF Staff) · opened
README.md
CHANGED
|
@@ -25,7 +25,7 @@ pipeline_tag: text-to-image
|
| 25 |
<a href="https://frozenburning.github.io/" target="_blank">Zhaoxi Chen</a><sup>4</sup>,</span>
|
| 26 |
</span>
|
| 27 |
<span class="author-block">
|
| 28 |
-
<a href="https://homepage.hit.edu.cn/
|
| 28 |
+
<a href="https://homepage.hit.edu.cn/wangmengmengzuo" target="_blank">Wangmeng Zuo</a><sup>5</sup>,</span>
|
| 29 |
</span>
|
| 30 |
<span class="author-block">
|
| 31 |
<a href="https://liuziwei7.github.io/" target="_blank">Ziwei Liu</a><sup>4†</sup>,</span>
|
@@ -52,4 +52,79 @@ pipeline_tag: text-to-image
|
| 52 |
</p>
|
| 53 |
|
| 54 |
# About
|
| 55 |
-
We propose **TACA**, a parameter-efficient method that dynamically rebalances cross-modal attention in
|
| 55 |
+
Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-driven visual generation. However, even state-of-the-art MM-DiT models such as FLUX struggle to achieve precise alignment between text prompts and generated content. We identify two key issues in the attention mechanism of MM-DiT that hinder alignment: 1) the suppression of cross-modal attention caused by the token imbalance between the visual and textual modalities, and 2) the lack of timestep-aware attention weighting. To address these issues, we propose **Temperature-Adjusted Cross-modal Attention (TACA)**, a parameter-efficient method that dynamically rebalances multimodal interactions through temperature scaling and timestep-dependent adjustment. When combined with LoRA fine-tuning, TACA significantly enhances text-image alignment on the T2I-CompBench benchmark with minimal computational overhead. We tested TACA on state-of-the-art models such as FLUX and SD3.5, demonstrating its ability to improve image-text alignment in terms of object appearance, attribute binding, and spatial relationships. Our findings highlight the importance of balancing cross-modal attention for improving semantic fidelity in text-to-image diffusion models. Our code is publicly available.
|
| 56 |
+
|
| 57 |
+
https://github.com/user-attachments/assets/ae15a853-ee99-4eee-b0fd-8f5f53c308f9
|
| 58 |
+
|
| 59 |
+
# Usage
|
| 60 |
+
|
| 61 |
+
You can use `TACA` with `Stable Diffusion 3.5` or `FLUX.1` models.
|
| 62 |
+
|
| 63 |
+
## With Stable Diffusion 3.5
|
| 64 |
+
|
| 65 |
+
```python
|
| 66 |
+
from diffusers import StableDiffusion3Pipeline
|
| 67 |
+
import torch
|
| 68 |
+
|
| 69 |
+
# Load the base model and LoRA weights
|
| 70 |
+
pipe = StableDiffusion3Pipeline.from_pretrained(
|
| 71 |
+
"stabilityai/stable-diffusion-3.5-medium", torch_dtype=torch.bfloat16
|
| 72 |
+
)
|
| 73 |
+
pipe.load_lora_weights("ldiex/TACA", weight_name="taca_sd3_r64.safetensors")
|
| 74 |
+
pipe.to("cuda")
|
| 75 |
+
|
| 76 |
+
# Generate an image
|
| 77 |
+
prompt = "A majestic lion standing proudly on a rocky cliff overlooking a vast savanna at sunset."
|
| 78 |
+
image = pipe(prompt).images[0]
|
| 79 |
+
|
| 80 |
+
image.save("lion_sunset.png")
|
| 81 |
+
```
|
| 82 |
+
|
| 83 |
+
## With FLUX.1
|
| 84 |
+
|
| 85 |
+
```python
|
| 86 |
+
from diffusers import FluxPipeline
|
| 87 |
+
import torch
|
| 88 |
+
|
| 89 |
+
# Load the base model and LoRA weights
|
| 90 |
+
pipe = FluxPipeline.from_pretrained(
|
| 91 |
+
"black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
|
| 92 |
+
)
|
| 93 |
+
pipe.load_lora_weights("ldiex/TACA", weight_name="taca_flux_r64.safetensors")
|
| 94 |
+
pipe.to("cuda")
|
| 95 |
+
|
| 96 |
+
# Generate an image
|
| 97 |
+
prompt = "A majestic lion standing proudly on a rocky cliff overlooking a vast savanna at sunset."
|
| 98 |
+
image = pipe(prompt).images[0]
|
| 99 |
+
|
| 100 |
+
image.save("lion_sunset.png")
|
| 101 |
+
```
|
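The abstract describes TACA as temperature scaling of cross-modal attention combined with a timestep-dependent adjustment. The sketch below is only an illustration of that idea under stated assumptions: the function name, the linear temperature schedule, and the `gamma` parameter are hypothetical choices, not the paper's exact formulation.

```python
# Illustrative sketch of temperature-adjusted cross-modal attention.
# The linear schedule and `gamma` are hypothetical simplifications,
# not the exact formulation used by TACA.
import torch
import torch.nn.functional as F

def temperature_adjusted_attention(q, k, v, n_text, t, t_max=1000.0, gamma=1.5):
    """Joint attention over a [text | image] token sequence.

    Logits attending to the first `n_text` (text) key positions are
    amplified by a factor that starts at `gamma` early in denoising
    (t = t_max) and decays to 1 as t -> 0, counteracting the token
    imbalance that suppresses cross-modal attention.
    """
    d = q.shape[-1]
    logits = q @ k.transpose(-2, -1) / d**0.5   # (..., L, L) attention logits
    scale = 1.0 + (gamma - 1.0) * (t / t_max)   # timestep-dependent boost
    logits[..., :n_text] = logits[..., :n_text] * scale
    weights = F.softmax(logits, dim=-1)
    return weights @ v, weights
```

Note that the snippets above only load the LoRA weights; see the project's repository for how the temperature adjustment itself is applied at inference time.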
| 102 |
+
|
| 103 |
+
# Benchmark
|
| 104 |
+
Comparison of alignment scores on T2I-CompBench for FLUX.1-Dev-based and SD3.5-Medium-based models. Color, Shape, and Texture measure attribute binding; Spatial and Non-Spatial measure object relationships; higher ($\uparrow$) is better for all columns.
|
| 105 |
+
|
| 106 |
+
| Model | Color $\uparrow$ | Shape $\uparrow$ | Texture $\uparrow$ | Spatial $\uparrow$ | Non-Spatial $\uparrow$ | Complex $\uparrow$ |
|
| 107 |
+
|---|---|---|---|---|---|---|
|
| 109 |
+
| FLUX.1-Dev | 0.7678 | 0.5064 | 0.6756 | 0.2066 | 0.3035 | 0.4359 |
|
| 110 |
+
| FLUX.1-Dev + TACA ($r = 64$) | **0.7843** | **0.5362** | **0.6872** | **0.2405** | 0.3041 | **0.4494** |
|
| 111 |
+
| FLUX.1-Dev + TACA ($r = 16$) | 0.7842 | 0.5347 | 0.6814 | 0.2321 | **0.3046** | 0.4479 |
|
| 112 |
+
| SD3.5-Medium | 0.7890 | 0.5770 | 0.7328 | 0.2087 | 0.3104 | 0.4441 |
|
| 113 |
+
| SD3.5-Medium + TACA ($r = 64$) | **0.8074** | **0.5938** | **0.7522** | **0.2678** | 0.3106 | 0.4470 |
|
| 114 |
+
| SD3.5-Medium + TACA ($r = 16$) | 0.7984 | 0.5834 | 0.7467 | 0.2374 | **0.3111** | **0.4505** |
|
| 115 |
+
|
| 116 |
+
# Showcases
|
| 117 |
+

|
| 118 |
+

|
| 119 |
+

|
| 120 |
+

|
| 121 |
+
|
| 122 |
+
# Citation
|
| 123 |
+
```bibtex
|
| 124 |
+
@article{lv2025taca,
|
| 125 |
+
title={TACA: Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers},
|
| 126 |
+
author={Lv, Zhengyao and Pan, Tianlin and Si, Chenyang and Chen, Zhaoxi and Zuo, Wangmeng and Liu, Ziwei and Wong, Kwan-Yee K.},
|
| 127 |
+
journal={arXiv preprint arXiv:2506.07986},
|
| 128 |
+
year={2025}
|
| 129 |
+
}
|
| 130 |
+
```
|