Add model card and metadata for Rethinking Global Text Conditioning
#1
by nielsr HF Staff - opened

README.md CHANGED
---
license: apache-2.0
library_name: diffusers
pipeline_tag: text-to-image
base_model: stabilityai/stable-diffusion-3.5-large
tags:
- lora
- stable-diffusion-3
- text-to-image
---

# Rethinking Global Text Conditioning in Diffusion Transformers

This repository contains the DMD2 LoRA adapter for Stable Diffusion 3.5 Large, presented in the paper [Rethinking Global Text Conditioning in Diffusion Transformers](https://huggingface.co/papers/2602.09268).

## Introduction

Diffusion transformers typically incorporate textual information via attention layers and a modulation mechanism driven by a pooled text embedding. Our analysis shows that, in its conventional usage, the pooled embedding contributes little to overall performance. However, we find that it can provide significant gains when used as **guidance**, enabling controllable shifts toward more desirable properties such as complexity, realism, and better hand rendering.

This approach is training-free, simple to implement, and applicable to a range of tasks, including text-to-image/video generation and image editing.

* **GitHub Repository:** [quickjkee/modulation-guidance](https://github.com/quickjkee/modulation-guidance)
* **Paper:** [arXiv:2602.09268](https://huggingface.co/papers/2602.09268)

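Conceptually, the guidance amounts to a classifier-free-guidance-style extrapolation between the pooled embeddings of a "negative" and a "positive" property prompt. The following is a minimal sketch of that extrapolation only; `guide_pooled` is an illustrative name, not the repository's API, and the exact update used in the paper may differ in detail:

```python
import torch

def guide_pooled(pooled_negative: torch.Tensor,
                 pooled_positive: torch.Tensor,
                 w: float) -> torch.Tensor:
    """CFG-style extrapolation from the 'negative' pooled embedding
    toward the 'positive' one; w > 1 pushes past the positive prompt."""
    return pooled_negative + w * (pooled_positive - pooled_negative)

# Toy tensors standing in for pooled CLIP embeddings.
neg = torch.zeros(1, 4)
pos = torch.ones(1, 4)
print(guide_pooled(neg, pos, w=2.0))  # tensor([[2., 2., 2., 2.]])
```

With `w=0` this returns the negative embedding, with `w=1` the positive one, and larger `w` extrapolates beyond it; the guided embedding then feeds the transformer's modulation layers in place of the usual pooled embedding.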
## Usage

This model is a LoRA adapter designed to work with the `diffusers` library. Below is an example of how to use it with **SD3.5-Large DMD2** and the proposed modulation guidance.

*Note: This example requires the helper functions `encode_prompt` and `forward_modulation_guidance` from the [official repository](https://github.com/quickjkee/modulation-guidance).*

```python
import torch
import types
from functools import partial

from diffusers import StableDiffusion3Pipeline
from peft import PeftModel

# Helper functions from the official repository
from models.sd35 import encode_prompt, forward_modulation_guidance

# Load the base model with the custom pipeline
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large",
    torch_dtype=torch.float16,
    custom_pipeline='quickjkee/swd_pipeline',
)
pipe = pipe.to("cuda")

# Load this LoRA adapter
lora_path = 'yresearch/stable-diffusion-3.5-large-dmd2'
pipe.transformer = PeftModel.from_pretrained(
    pipe.transformer,
    lora_path,
).to("cuda")

# Define prompts and guidance hyperparameters
prompt = "a cardboard spaceship"
prompt_positive = "Extremely complex, the highest quality"
prompt_negative = "very simple, no details at all"
w = 2            # modulation guidance scale
start_layer = 2  # first transformer layer guidance is applied to

# Get pooled CLIP embeddings for the positive and negative prompts
clip_positive = encode_prompt(pipe=pipe, prompt=prompt_positive)
clip_negative = encode_prompt(pipe=pipe, prompt=prompt_negative)

# Patch the transformer's forward pass with modulation guidance
forward_modulation_guidance = partial(
    forward_modulation_guidance,
    pooled_projections_1=clip_positive,
    pooled_projections_0=clip_negative,
    w=w,
    start_layer=start_layer,
)
pipe.transformer.forward = types.MethodType(forward_modulation_guidance, pipe.transformer)

# Run generation with a few-step sigma schedule
seed = 0
sigmas = [1.0000, 0.9454, 0.8959, 0.7904, 0.7371, 0.6022, 0.0000]
scales = [128, 128, 128, 128, 128, 128]
image = pipe(
    [prompt] * 1,
    sigmas=torch.tensor(sigmas).to('cuda'),
    timesteps=torch.tensor(sigmas[:-1]).to('cuda') * 1000,
    scales=scales,
    guidance_scale=0.0,
    height=int(scales[0] * 8),
    width=int(scales[0] * 8),
    generator=torch.Generator("cpu").manual_seed(seed),
    output_type='pil',
).images[0]

image.show()
```
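The patching idiom used above (partially applying extra keyword arguments with `functools.partial`, then binding the result as the instance's `forward` via `types.MethodType`) is plain Python and can be illustrated in isolation. `Transformer` and `forward_with_offset` here are toy stand-ins, not parts of the repository:

```python
import types
from functools import partial

class Transformer:
    """Toy stand-in for a model whose forward pass we want to replace."""
    def forward(self, x):
        return x + 1

def forward_with_offset(self, x, offset):
    # Replacement forward pass; the extra argument is baked in below.
    return x + offset

m = Transformer()
# Fix the extra keyword argument, then bind the result as m's forward method.
patched = partial(forward_with_offset, offset=10)
m.forward = types.MethodType(patched, m)
print(m.forward(5))  # 15
```

Only this one instance is patched; other instances of the class keep the original `forward`.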

## Citation

```bibtex
@article{starodubcev2026rethinking,
  title={Rethinking Global Text Conditioning in Diffusion Transformers},
  author={Starodubcev, Nikita and Pakhomov, Daniil and Wu, Zongze and Drobyshevskiy, Ilya and Liu, Yuchen and Wang, Zhonghao and Zhou, Yuqian and Lin, Zhe and Baranchuk, Dmitry},
  journal={arXiv preprint arXiv:2602.09268},
  year={2026}
}
```