exdysa cjy2003 committed on
Commit
734789a
·
0 Parent(s):

Duplicate from mit-han-lab/dc-ae-f32c32-sana-1.1-diffusers

Co-authored-by: JUNYU CHEN <cjy2003@users.noreply.huggingface.co>

.gitattributes ADDED
@@ -0,0 +1,38 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+ assets/dc_ae_demo.gif filter=lfs diff=lfs merge=lfs -text
+ assets/dc_ae_diffusion_demo.gif filter=lfs diff=lfs merge=lfs -text
+ assets/Sana-0.6B-laptop.gif filter=lfs diff=lfs merge=lfs -text
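
Note on the `.gitattributes` rules: they decide which files Git stores as LFS pointers instead of regular blobs. As a rough illustration, the sketch below checks a few paths from this commit against a subset of those patterns. It uses Python's `fnmatch` as a stand-in for Git's own pattern matching, which is only an approximation (Git treats `/` and `**` specially), and the helper name `is_lfs_tracked` is hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# A subset of the LFS patterns added to .gitattributes in this commit.
LFS_PATTERNS = [
    "*.bin",
    "*.ckpt",
    "*.pt",
    "*.pth",
    "*.safetensors",
    "*tfevents*",
    "assets/dc_ae_demo.gif",
    "assets/dc_ae_diffusion_demo.gif",
    "assets/Sana-0.6B-laptop.gif",
]


def is_lfs_tracked(path: str) -> bool:
    """Approximate check of whether `path` matches one of LFS_PATTERNS.

    Assumption: patterns without a slash are compared against the file
    name only, and patterns with a slash against the whole relative path.
    This mirrors the common gitattributes cases but is not Git's matcher.
    """
    name = PurePosixPath(path).name
    for pattern in LFS_PATTERNS:
        target = path if "/" in pattern else name
        if fnmatchcase(target, pattern):
            return True
    return False


for candidate in [
    "diffusion_pytorch_model.safetensors",
    "config.json",
    "assets/dc_ae_demo.gif",
    "assets/dc_ae_sana.jpg",
]:
    print(candidate, is_lfs_tracked(candidate))
```

Under these assumptions the weights file and the three GIFs are matched, while `config.json` and `assets/dc_ae_sana.jpg` are not, which is consistent with the file list in this commit showing "Git LFS Details" only for the GIFs and a pointer for the weights.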
README.md ADDED
@@ -0,0 +1,119 @@
+ ---
+ license: mit
+ library_name: diffusers
+ pipeline_tag: text-to-image
+ ---
+
+ # Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models
+
+ [[paper](https://arxiv.org/abs/2410.10733)] [[GitHub](https://github.com/mit-han-lab/efficientvit)]
+
+ ![demo](assets/dc_ae_demo.gif)
+ <p align="center">
+ <b> Figure 1: We address the reconstruction accuracy drop of high spatial-compression autoencoders. </b>
+ </p>
+
+ ![demo](assets/dc_ae_diffusion_demo.gif)
+ <p align="center">
+ <b> Figure 2: DC-AE delivers significant training and inference speedup without performance drop. </b>
+ </p>
+
+ ![demo](assets/Sana-0.6B-laptop.gif)
+
+ <p align="center">
+ <img src="assets/dc_ae_sana.jpg" width="1200">
+ </p>
+
+ <p align="center">
+ <b> Figure 3: DC-AE enables efficient text-to-image generation on a laptop. </b>
+ </p>
+
+ ## Abstract
+
+ We present Deep Compression Autoencoder (DC-AE), a new family of autoencoder models for accelerating high-resolution diffusion models. Existing autoencoder models have demonstrated impressive results at a moderate spatial compression ratio (e.g., 8x), but fail to maintain satisfactory reconstruction accuracy for high spatial compression ratios (e.g., 64x). We address this challenge by introducing two key techniques: (1) Residual Autoencoding, where we design our models to learn residuals based on the space-to-channel transformed features to alleviate the optimization difficulty of high spatial-compression autoencoders; (2) Decoupled High-Resolution Adaptation, an efficient decoupled three-phase training strategy for mitigating the generalization penalty of high spatial-compression autoencoders. With these designs, we improve the autoencoder's spatial compression ratio up to 128x while maintaining the reconstruction quality. Applying our DC-AE to latent diffusion models, we achieve significant speedup without accuracy drop. For example, on ImageNet 512x512, our DC-AE provides 19.1x inference speedup and 17.9x training speedup on H100 GPU for UViT-H while achieving a better FID, compared with the widely used SD-VAE-f8 autoencoder.
+
+ ## Usage
+
+ ### Deep Compression Autoencoder
+
+ ```python
+ # build DC-AE models
+ # full DC-AE model list: https://huggingface.co/collections/mit-han-lab/dc-ae-670085b9400ad7197bb1009b
+ from efficientvit.ae_model_zoo import DCAE_HF
+
+ dc_ae = DCAE_HF.from_pretrained("mit-han-lab/dc-ae-f64c128-in-1.0")
+
+ # encode
+ from PIL import Image
+ import torch
+ import torchvision.transforms as transforms
+ from torchvision.utils import save_image
+ from efficientvit.apps.utils.image import DMCrop
+
+ device = torch.device("cuda")
+ dc_ae = dc_ae.to(device).eval()
+
+ transform = transforms.Compose([
+     DMCrop(512), # resolution
+     transforms.ToTensor(),
+     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
+ ])
+ image = Image.open("assets/fig/girl.png")
+ x = transform(image)[None].to(device)
+ latent = dc_ae.encode(x)
+ print(latent.shape)
+
+ # decode
+ y = dc_ae.decode(latent)
+ save_image(y * 0.5 + 0.5, "demo_dc_ae.png")
+ ```
+
+ ### Efficient Diffusion Models with DC-AE
+
+ ```python
+ # build DC-AE-Diffusion models
+ # full DC-AE-Diffusion model list: https://huggingface.co/collections/mit-han-lab/dc-ae-diffusion-670dbb8d6b6914cf24c1a49d
+ from efficientvit.diffusion_model_zoo import DCAE_Diffusion_HF
+
+ dc_ae_diffusion = DCAE_Diffusion_HF.from_pretrained("mit-han-lab/dc-ae-f64c128-in-1.0-uvit-h-in-512px-train2000k")
+
+ # denoising on the latent space
+ import torch
+ import numpy as np
+ from torchvision.utils import save_image
+
+ torch.set_grad_enabled(False)
+ device = torch.device("cuda")
+ dc_ae_diffusion = dc_ae_diffusion.to(device).eval()
+
+ seed = 0
+ torch.manual_seed(seed)
+ torch.cuda.manual_seed_all(seed)
+ eval_generator = torch.Generator(device=device)
+ eval_generator.manual_seed(seed)
+
+ prompts = torch.tensor(
+     [279, 333, 979, 936, 933, 145, 497, 1, 248, 360, 793, 12, 387, 437, 938, 978], dtype=torch.int, device=device
+ )
+ num_samples = prompts.shape[0]
+ prompts_null = 1000 * torch.ones((num_samples,), dtype=torch.int, device=device)
+ latent_samples = dc_ae_diffusion.diffusion_model.generate(prompts, prompts_null, 6.0, eval_generator)
+ latent_samples = latent_samples / dc_ae_diffusion.scaling_factor
+
+ # decode
+ image_samples = dc_ae_diffusion.autoencoder.decode(latent_samples)
+ save_image(image_samples * 0.5 + 0.5, "demo_dc_ae_diffusion.png", nrow=int(np.sqrt(num_samples)))
+ ```
+
+ ## Reference
+
+ If DC-AE is useful or relevant to your research, please kindly recognize our contributions by citing our papers:
+
+ ```
+ @article{chen2024deep,
+   title={Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models},
+   author={Chen, Junyu and Cai, Han and Chen, Junsong and Xie, Enze and Yang, Shang and Tang, Haotian and Li, Muyang and Lu, Yao and Han, Song},
+   journal={arXiv preprint arXiv:2410.10733},
+   year={2024}
+ }
+ ```
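
Note on the README usage: its snippets load checkpoints through the `efficientvit` package, whereas the `config.json` added in this commit declares `"_class_name": "AutoencoderDC"` for `diffusers` 0.32.2. A minimal round-trip sketch for the `diffusers` route might look like the following. It assumes a `diffusers` version that ships `AutoencoderDC`, plus `torch`, `torchvision`, and `Pillow`; it takes the repository id from the "Duplicate from" line of this commit; and it swaps the README's `DMCrop` for a plain resize and center crop. The function name `roundtrip` and the output file name are hypothetical.

```python
def roundtrip(
    image_path: str,
    model_id: str = "mit-han-lab/dc-ae-f32c32-sana-1.1-diffusers",
    out_path: str = "demo_dc_ae_diffusers.png",
    device: str = "cuda",
) -> tuple:
    """Encode an image to DC-AE latents, decode it back, and save the result.

    Returns the latent shape. This is a sketch, not the authors' reference code.
    """
    # Heavy imports stay local so that defining the function needs no extras.
    import torch
    from diffusers import AutoencoderDC
    from PIL import Image
    from torchvision import transforms
    from torchvision.utils import save_image

    ae = AutoencoderDC.from_pretrained(model_id, torch_dtype=torch.float32)
    ae = ae.to(device).eval()

    # Assumption: a 512x512 center crop stands in for the README's DMCrop(512).
    transform = transforms.Compose([
        transforms.Resize(512),
        transforms.CenterCrop(512),
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
    ])
    x = transform(Image.open(image_path).convert("RGB"))[None].to(device)

    with torch.no_grad():
        latent = ae.encode(x).latent   # encoder output carries a .latent tensor
        y = ae.decode(latent).sample   # decoder output carries a .sample tensor

    # Undo the [-1, 1] normalization before writing the image.
    save_image(y * 0.5 + 0.5, out_path)
    return tuple(latent.shape)


if __name__ == "__main__":
    # Hypothetical input path; replace with a real image on disk.
    print(roundtrip("assets/fig/girl.png"))
```

Given the `f32c32` naming and `latent_channels: 32` in the config, a 512x512 input would be expected to yield a latent of roughly shape `(1, 32, 16, 16)`, though this has not been verified here.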
assets/Sana-0.6B-laptop.gif ADDED

Git LFS Details

  • SHA256: e1ae2defa971a773cc1028d4a9aaa7110046bd72bc407ae57cfdabd0c01a0c23
  • Pointer size: 133 Bytes
  • Size of remote file: 36 MB
assets/dc_ae_demo.gif ADDED

Git LFS Details

  • SHA256: 514a8b660d19d583ca031efcab51bb15f3e12822ca737b729da00a1cea257a9a
  • Pointer size: 132 Bytes
  • Size of remote file: 3.74 MB
assets/dc_ae_diffusion_demo.gif ADDED

Git LFS Details

  • SHA256: 5b3860b826dd126845fb2406e91bad3d122aee3b4e54550b75b9ea11fbf31e3a
  • Pointer size: 132 Bytes
  • Size of remote file: 2.63 MB
assets/dc_ae_sana.jpg ADDED
config.json ADDED
@@ -0,0 +1,88 @@
+ {
+   "_class_name": "AutoencoderDC",
+   "_diffusers_version": "0.32.2",
+   "attention_head_dim": 32,
+   "decoder_act_fns": "silu",
+   "decoder_block_out_channels": [
+     128,
+     256,
+     512,
+     512,
+     1024,
+     1024
+   ],
+   "decoder_block_types": [
+     "ResBlock",
+     "ResBlock",
+     "ResBlock",
+     "EfficientViTBlock",
+     "EfficientViTBlock",
+     "EfficientViTBlock"
+   ],
+   "decoder_layers_per_block": [
+     3,
+     3,
+     3,
+     3,
+     3,
+     3
+   ],
+   "decoder_norm_types": "rms_norm",
+   "decoder_qkv_multiscales": [
+     [],
+     [],
+     [],
+     [
+       5
+     ],
+     [
+       5
+     ],
+     [
+       5
+     ]
+   ],
+   "downsample_block_type": "Conv",
+   "encoder_block_out_channels": [
+     128,
+     256,
+     512,
+     512,
+     1024,
+     1024
+   ],
+   "encoder_block_types": [
+     "ResBlock",
+     "ResBlock",
+     "ResBlock",
+     "EfficientViTBlock",
+     "EfficientViTBlock",
+     "EfficientViTBlock"
+   ],
+   "encoder_layers_per_block": [
+     2,
+     2,
+     2,
+     3,
+     3,
+     3
+   ],
+   "encoder_qkv_multiscales": [
+     [],
+     [],
+     [],
+     [
+       5
+     ],
+     [
+       5
+     ],
+     [
+       5
+     ]
+   ],
+   "in_channels": 3,
+   "latent_channels": 32,
+   "scaling_factor": 0.41407,
+   "upsample_block_type": "interpolate"
+ }
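
Note on `config.json`: the `f32c32` in the repository name can be cross-checked against these fields. The sketch below derives the spatial compression ratio and the latent shape from a trimmed copy of the config. It assumes one 2x downsample between consecutive encoder stages, so six stages give 2^5 = 32; that assumption matches the `f32` naming but is not stated in the config itself. The helper names are hypothetical.

```python
import json

# Trimmed copy of the fields from config.json that the arithmetic needs.
CONFIG = json.loads("""
{
  "encoder_block_out_channels": [128, 256, 512, 512, 1024, 1024],
  "in_channels": 3,
  "latent_channels": 32
}
""")


def spatial_compression(config: dict) -> int:
    """Assumed ratio: a 2x downsample between each pair of encoder stages."""
    return 2 ** (len(config["encoder_block_out_channels"]) - 1)


def latent_shape(config: dict, height: int, width: int) -> tuple:
    """Latent (channels, height, width) for an input of the given size."""
    f = spatial_compression(config)
    if height % f or width % f:
        raise ValueError(f"height and width must be multiples of {f}")
    return (config["latent_channels"], height // f, width // f)


def element_reduction(config: dict, height: int, width: int) -> float:
    """Ratio of input elements to latent elements."""
    c, h, w = latent_shape(config, height, width)
    return (config["in_channels"] * height * width) / (c * h * w)


print(spatial_compression(CONFIG))           # 32
print(latent_shape(CONFIG, 512, 512))        # (32, 16, 16)
print(element_reduction(CONFIG, 512, 512))   # 96.0
```

So under this assumption a 512x512 RGB image (786,432 values) maps to a 32x16x16 latent (8,192 values), a 96x reduction in element count.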
diffusion_pytorch_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dfd991d1b54ffabf22745c5885589d8f2a7bc59930d95d92bd741c4fc64454bb
+ size 1249044836
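
Note on the pointer file: the three lines committed for `diffusion_pytorch_model.safetensors` are a Git LFS pointer, not the weights themselves. A small sketch of how such a pointer could be parsed and used to check a downloaded file follows; the function names are hypothetical and the code uses only the Python standard library.

```python
import hashlib

# Pointer contents as committed in this diff.
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:dfd991d1b54ffabf22745c5885589d8f2a7bc59930d95d92bd741c4fc64454bb
size 1249044836
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line and break the oid into algorithm and digest."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: str, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file at `path` has the pointer's size and digest."""
    hasher = hashlib.new(pointer["algo"])
    total = 0
    with open(path, "rb") as handle:
        # Read in chunks so a file of over a gigabyte is never held in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
            total += len(chunk)
    return total == pointer["size"] and hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algo"], pointer["size"])
```

After downloading the weights, `matches_pointer("diffusion_pytorch_model.safetensors", pointer)` would confirm that the local file is the 1,249,044,836-byte object this commit refers to.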