Add model card and metadata for Rethinking Global Text Conditioning

#1
Opened by nielsr (HF Staff)
Files changed (1)
  1. README.md +103 -3
README.md CHANGED
---
license: apache-2.0
library_name: diffusers
pipeline_tag: text-to-image
base_model: stabilityai/stable-diffusion-3.5-large
tags:
- lora
- stable-diffusion-3
- text-to-image
---
# Rethinking Global Text Conditioning in Diffusion Transformers

This repository contains the DMD2 LoRA adapter for Stable Diffusion 3.5 Large, as presented in the paper [Rethinking Global Text Conditioning in Diffusion Transformers](https://huggingface.co/papers/2602.09268).

## Introduction

Diffusion transformers typically incorporate textual information via attention layers and a modulation mechanism driven by a pooled text embedding. Our analysis shows that, in its conventional usage, the pooled embedding contributes little to overall performance. However, it can provide significant gains when used as **guidance**, enabling controllable shifts toward more desirable properties such as higher complexity, realism, and better hand rendering.

This approach is training-free, simple to implement, and applies to a range of tasks, including text-to-image/video generation and image editing.
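
At its core, this kind of guidance is a CFG-style extrapolation: the modulation signal computed from a "negative" pooled embedding is pushed toward the one computed from a "positive" embedding. A minimal sketch of that idea (illustrative only; the function name and tensor shapes are assumptions, not the official implementation):

```python
import torch

def guided_modulation(mod_neg: torch.Tensor, mod_pos: torch.Tensor, w: float) -> torch.Tensor:
    # Extrapolate from the negative-prompt modulation toward the positive one;
    # w = 1 recovers the positive signal, w > 1 amplifies the shift.
    return mod_neg + w * (mod_pos - mod_neg)

# Stand-ins for modulation vectors derived from the two pooled embeddings.
neg = torch.zeros(1, 4)
pos = torch.ones(1, 4)
print(guided_modulation(neg, pos, 2.0))  # tensor([[2., 2., 2., 2.]])
```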

* **GitHub Repository:** [quickjkee/modulation-guidance](https://github.com/quickjkee/modulation-guidance)
* **Paper:** [arXiv:2602.09268](https://huggingface.co/papers/2602.09268)

## Usage

This model is a LoRA adapter designed to work with the `diffusers` library. Below is an example of how to use it with **SD3.5-Large DMD2** and the proposed modulation guidance.

*Note: this example requires the helper functions `encode_prompt` and `forward_modulation_guidance` from the [official repository](https://github.com/quickjkee/modulation-guidance).*

```python
import types
from functools import partial

import torch
from diffusers import StableDiffusion3Pipeline
from peft import PeftModel

# Helper functions from the official repository (quickjkee/modulation-guidance)
from models.sd35 import encode_prompt, forward_modulation_guidance

# Load the base model with the custom pipeline
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large",
    torch_dtype=torch.float16,
    custom_pipeline='quickjkee/swd_pipeline',
)
pipe = pipe.to("cuda")

# Load this LoRA adapter
lora_path = 'yresearch/stable-diffusion-3.5-large-dmd2'
pipe.transformer = PeftModel.from_pretrained(
    pipe.transformer,
    lora_path,
).to("cuda")

# Define hyperparameters
prompt = "a cardboard spaceship"
prompt_positive = "Extremely complex, the highest quality"
prompt_negative = "very simple, no details at all"
w = 2            # guidance strength
start_layer = 2  # first transformer layer at which guidance is applied

# Get pooled CLIP embeddings for the positive/negative prompts
clip_positive = encode_prompt(pipe=pipe, prompt=prompt_positive)
clip_negative = encode_prompt(pipe=pipe, prompt=prompt_negative)

# Patch the transformer's forward pass with modulation guidance
forward_modulation_guidance = partial(
    forward_modulation_guidance,
    pooled_projections_1=clip_positive,
    pooled_projections_0=clip_negative,
    w=w,
    start_layer=start_layer,
)
pipe.transformer.forward = types.MethodType(forward_modulation_guidance, pipe.transformer)

# Run generation
seed = 0
sigmas = [1.0000, 0.9454, 0.8959, 0.7904, 0.7371, 0.6022, 0.0000]
scales = [128, 128, 128, 128, 128, 128]
image = pipe(
    [prompt],
    sigmas=torch.tensor(sigmas).to('cuda'),
    timesteps=torch.tensor(sigmas[:-1]).to('cuda') * 1000,
    scales=scales,
    guidance_scale=0.0,
    height=int(scales[0] * 8),
    width=int(scales[0] * 8),
    generator=torch.Generator("cpu").manual_seed(seed),
    output_type='pil',
).images[0]

image.show()
```
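
The resolution arithmetic in the call above follows from SD3's VAE, which downsamples images by a factor of 8 (our reading of the example, not something stated in this card): a latent side length of 128 therefore corresponds to a 1024x1024 output image.

```python
# SD3's VAE has an 8x spatial downsampling factor, so a latent side
# length of 128 maps to 128 * 8 = 1024 pixels.
scales = [128, 128, 128, 128, 128, 128]
height = width = int(scales[0] * 8)
print(height, width)  # 1024 1024
```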

## Citation

```bibtex
@article{starodubcev2026rethinking,
  title={Rethinking Global Text Conditioning in Diffusion Transformers},
  author={Starodubcev, Nikita and Pakhomov, Daniil and Wu, Zongze and Drobyshevskiy, Ilya and Liu, Yuchen and Wang, Zhonghao and Zhou, Yuqian and Lin, Zhe and Baranchuk, Dmitry},
  journal={arXiv preprint arXiv:2602.09268},
  year={2026}
}
```