Improve model card with abstract, diffusers usage, benchmarks, and showcases

#3
opened by nielsr (HF Staff)
Files changed (1)
  1. README.md +77 -2
README.md CHANGED
@@ -25,7 +25,7 @@ pipeline_tag: text-to-image
  <a href="https://frozenburning.github.io/" target="_blank">Zhaoxi Chen</a><sup>4</sup>,</span>
  </span>
  <span class="author-block">
- <a href="https://homepage.hit.edu.cn/wangmengzuo" target="_blank">Wangmeng Zuo</a><sup>5</sup>,</span>
+ <a href="https://homepage.hit.edu.cn/wangmengzuo" target="_blank">Wangmeng Zuo</a><sup>5</sup>,</span>
  </span>
  <span class="author-block">
  <a href="https://liuziwei7.github.io/" target="_blank">Ziwei Liu</a><sup>4†</sup>,</span>
@@ -52,4 +52,79 @@ pipeline_tag: text-to-image
  </p>

  # About
- We propose **TACA**, a parameter-efficient method that dynamically rebalances cross-modal attention in multimodal diffusion transformers to improve text-image alignment.
 
+ Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-driven visual generation. However, even state-of-the-art MM-DiT models such as FLUX struggle to achieve precise alignment between text prompts and generated content. We identify two key issues in the attention mechanism of MM-DiT that hinder alignment: 1) the suppression of cross-modal attention due to the token imbalance between the visual and textual modalities, and 2) the lack of timestep-aware attention weighting. To address these issues, we propose **Temperature-Adjusted Cross-modal Attention (TACA)**, a parameter-efficient method that dynamically rebalances multimodal interactions through temperature scaling and timestep-dependent adjustment. When combined with LoRA fine-tuning, TACA significantly enhances text-image alignment on the T2I-CompBench benchmark with minimal computational overhead. We tested TACA on state-of-the-art models such as FLUX and SD3.5, demonstrating its ability to improve image-text alignment in terms of object appearance, attribute binding, and spatial relationships. Our findings highlight the importance of balancing cross-modal attention for improving semantic fidelity in text-to-image diffusion models. Our code is publicly available at https://github.com/Vchitect/TACA.
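The core idea, multiplying the cross-modal attention logits by a temperature that depends on the denoising timestep, can be sketched in a few lines of NumPy. The block layout, the `gamma` value, and the timestep gate below are illustrative assumptions for intuition, not the paper's exact formulation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def taca_attention(q, k, v, n_txt, t=0.9, gamma=1.5):
    """Joint attention over concatenated [text; image] tokens.

    The logits of the cross-modal block (image queries attending to text
    keys) are scaled by a temperature gamma > 1, counteracting the
    suppression caused by the image/text token-count imbalance. Here the
    scaling is applied only at high-noise timesteps (t > 0.5); both the
    gate and the gamma value are illustrative assumptions.
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)           # (n_txt+n_img, n_txt+n_img)
    gamma_t = gamma if t > 0.5 else 1.0     # timestep-dependent adjustment
    logits[n_txt:, :n_txt] *= gamma_t       # boost image -> text attention
    return softmax(logits) @ v
```

With `gamma_t = 1` this reduces exactly to standard joint attention, which is why the adjustment pairs naturally with lightweight LoRA fine-tuning rather than full retraining.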
+
+ https://github.com/user-attachments/assets/ae15a853-ee99-4eee-b0fd-8f5f53c308f9
+
+ # Usage
+
+ You can use `TACA` with `Stable Diffusion 3.5` or `FLUX.1` models.
+
+ ## With Stable Diffusion 3.5
+
+ ```python
+ from diffusers import StableDiffusion3Pipeline
+ import torch
+
+ # Load the base model and LoRA weights
+ pipe = StableDiffusion3Pipeline.from_pretrained(
+     "stabilityai/stable-diffusion-3.5-medium", torch_dtype=torch.float16
+ )
+ pipe.load_lora_weights("ldiex/TACA", weight_name="taca_sd3_r64.safetensors")
+ pipe.to("cuda")
+
+ # Generate an image
+ prompt = "A majestic lion standing proudly on a rocky cliff overlooking a vast savanna at sunset."
+ image = pipe(prompt).images[0]
+
+ image.save("lion_sunset.png")
+ ```
+
+ ## With FLUX.1
+
+ ```python
+ from diffusers import FluxPipeline
+ import torch
+
+ # Load the base model and LoRA weights
+ pipe = FluxPipeline.from_pretrained(
+     "black-forest-labs/FLUX.1-dev", torch_dtype=torch.float16
+ )
+ pipe.load_lora_weights("ldiex/TACA", weight_name="taca_flux_r64.safetensors")
+ pipe.to("cuda")
+
+ # Generate an image
+ prompt = "A majestic lion standing proudly on a rocky cliff overlooking a vast savanna at sunset."
+ image = pipe(prompt).images[0]
+
+ image.save("lion_sunset.png")
+ ```
+
+ # Benchmark
+ Comparison of alignment evaluation on T2I-CompBench for FLUX.1-Dev-based and SD3.5-Medium-based models. Color, Shape, and Texture measure attribute binding; Spatial and Non-Spatial measure object relationships. Higher is better for all columns.
+
+ | Model | Color ↑ | Shape ↑ | Texture ↑ | Spatial ↑ | Non-Spatial ↑ | Complex ↑ |
+ |---|---|---|---|---|---|---|
+ | FLUX.1-Dev | 0.7678 | 0.5064 | 0.6756 | 0.2066 | 0.3035 | 0.4359 |
+ | FLUX.1-Dev + TACA ($r = 64$) | **0.7843** | **0.5362** | **0.6872** | **0.2405** | 0.3041 | **0.4494** |
+ | FLUX.1-Dev + TACA ($r = 16$) | 0.7842 | 0.5347 | 0.6814 | 0.2321 | **0.3046** | 0.4479 |
+ | SD3.5-Medium | 0.7890 | 0.5770 | 0.7328 | 0.2087 | 0.3104 | 0.4441 |
+ | SD3.5-Medium + TACA ($r = 64$) | **0.8074** | **0.5938** | **0.7522** | **0.2678** | 0.3106 | 0.4470 |
+ | SD3.5-Medium + TACA ($r = 16$) | 0.7984 | 0.5834 | 0.7467 | 0.2374 | **0.3111** | **0.4505** |
+
+ # Showcases
+ ![](https://github.com/Vchitect/TACA/raw/main/static/images/short_1.png)
+ ![](https://github.com/Vchitect/TACA/raw/main/static/images/short_2.png)
+ ![](https://github.com/Vchitect/TACA/raw/main/static/images/long_1.png)
+ ![](https://github.com/Vchitect/TACA/raw/main/static/images/long_2.png)
+
+ # Citation
+ ```bibtex
+ @article{lv2025taca,
+   title={TACA: Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers},
+   author={Lv, Zhengyao and Pan, Tianlin and Si, Chenyang and Chen, Zhaoxi and Zuo, Wangmeng and Liu, Ziwei and Wong, Kwan-Yee K.},
+   journal={arXiv preprint arXiv:2506.07986},
+   year={2025}
+ }
+ ```