Jonathan Chang committed "Update model card" (README.md)
---
license: openrail++
tags:
- stable-diffusion
- text-to-image
pinned: true
---

# Model Card for flex-diffusion-2-1

<!-- Provide a quick summary of what the model is/does. [Optional] -->
stable-diffusion-2-1 (stabilityai/stable-diffusion-2-1) finetuned with different aspect ratios.

## TLDR:

### There are 2 models in this repo:
- One based on stable-diffusion-2-1 (stabilityai/stable-diffusion-2-1), finetuned for 6k steps.
- One based on stable-diffusion-2-base (stabilityai/stable-diffusion-2-base), finetuned for 6k steps on the same dataset.

For usage, see [How to Get Started with the Model](#how-to-get-started-with-the-model).

### It aims to solve the following issues:
1. Generated images look like they are cropped from a larger image.
Examples:

2. Generating non-square images produces weird results, due to the model being trained on square images.
Examples:


### Limitations:
1. It's trained on a small dataset, so its improvements may be limited.
2. For each aspect ratio, it's trained at only one fixed resolution, so it may not be able to generate images at other resolutions.
For the 1:1 aspect ratio it's finetuned at 512x512, although flex-diffusion-2-1 was last finetuned at 768x768.

### Potential improvements:
1. Train on a larger dataset.
2. Train on different resolutions even for the same aspect ratio.
3. Train on specific aspect ratios, instead of a range of aspect ratios.


# Table of Contents

- [Model Card for flex-diffusion-2-1](#model-card-for-flex-diffusion-2-1)
- [Table of Contents](#table-of-contents)
- [Model Details](#model-details)
  - [Model Description](#model-description)
- [Uses](#uses)
- [Training Details](#training-details)
  - [Training Data](#training-data)
  - [Training Procedure](#training-procedure)
    - [Preprocessing](#preprocessing)
    - [Speeds, Sizes, Times](#speeds-sizes-times)
  - [Results](#results)
- [Model Card Authors](#model-card-authors)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)


# Model Details

## Model Description

<!-- Provide a longer summary of what this model is/does. -->
stable-diffusion-2-1 (stabilityai/stable-diffusion-2-1) finetuned for dynamic aspect ratios.

finetuned resolutions:
|    |   width |   height | aspect ratio   |
|---:|--------:|---------:|:---------------|
|  0 |     512 |     1024 | 1:2            |
|  1 |     576 |     1024 | 9:16           |
|  2 |     576 |      960 | 3:5            |
|  3 |     640 |     1024 | 5:8            |
|  4 |     512 |      768 | 2:3            |
|  5 |     640 |      896 | 5:7            |
|  6 |     576 |      768 | 3:4            |
|  7 |     512 |      640 | 4:5            |
|  8 |     640 |      768 | 5:6            |
|  9 |     640 |      704 | 10:11          |
| 10 |     512 |      512 | 1:1            |
| 11 |     704 |      640 | 11:10          |
| 12 |     768 |      640 | 6:5            |
| 13 |     640 |      512 | 5:4            |
| 14 |     768 |      576 | 4:3            |
| 15 |     896 |      640 | 7:5            |
| 16 |     768 |      512 | 3:2            |
| 17 |    1024 |      640 | 8:5            |
| 18 |     960 |      576 | 5:3            |
| 19 |    1024 |      576 | 16:9           |
| 20 |    1024 |      512 | 2:1            |

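The aspect-ratio labels in the table are just each width:height pair reduced by its greatest common divisor; a quick sketch (plain Python, function name is mine):

```python
from math import gcd

def aspect_ratio(width: int, height: int) -> str:
    """Reduce width:height by their greatest common divisor."""
    d = gcd(width, height)
    return f"{width // d}:{height // d}"

# spot-check a few rows of the table above
print(aspect_ratio(640, 704))   # 10:11
print(aspect_ratio(1024, 576))  # 16:9
```
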
- **Developed by:** Jonathan Chang
- **Model type:** Diffusion-based text-to-image generation model
- **Language(s)**: English
- **License:** creativeml-openrail-m
- **Parent Model:** https://huggingface.co/stabilityai/stable-diffusion-2-1
- **Resources for more information:** More information needed

# Uses

- see https://huggingface.co/stabilityai/stable-diffusion-2-1


# Training Details

## Training Data

- LAION aesthetic dataset, the subset with a 6+ rating
  - https://laion.ai/blog/laion-aesthetics/
  - https://huggingface.co/datasets/ChristophSchuhmann/improved_aesthetics_6plus
  - I only used a small portion of that, see [Preprocessing](#preprocessing)

- most common aspect ratios in the dataset (before preprocessing)

|    | aspect_ratio   |   counts |
|---:|:---------------|---------:|
|  0 | 1:1            |   154727 |
|  1 | 3:2            |   119615 |
|  2 | 2:3            |    61197 |
|  3 | 4:3            |    52276 |
|  4 | 16:9           |    38862 |
|  5 | 400:267        |    21893 |
|  6 | 3:4            |    16893 |
|  7 | 8:5            |    16258 |
|  8 | 4:5            |    15684 |
|  9 | 6:5            |    12228 |
| 10 | 1000:667       |    12097 |
| 11 | 2:1            |    11006 |
| 12 | 800:533        |    10259 |
| 13 | 5:4            |     9753 |
| 14 | 500:333        |     9700 |
| 15 | 250:167        |     9114 |
| 16 | 5:3            |     8460 |
| 17 | 200:133        |     7832 |
| 18 | 1024:683       |     7176 |
| 19 | 11:10          |     6470 |

- predefined aspect ratios

|    |   width |   height | aspect ratio   |
|---:|--------:|---------:|:---------------|
|  0 |     512 |     1024 | 1:2            |
|  1 |     576 |     1024 | 9:16           |
|  2 |     576 |      960 | 3:5            |
|  3 |     640 |     1024 | 5:8            |
|  4 |     512 |      768 | 2:3            |
|  5 |     640 |      896 | 5:7            |
|  6 |     576 |      768 | 3:4            |
|  7 |     512 |      640 | 4:5            |
|  8 |     640 |      768 | 5:6            |
|  9 |     640 |      704 | 10:11          |
| 10 |     512 |      512 | 1:1            |
| 11 |     704 |      640 | 11:10          |
| 12 |     768 |      640 | 6:5            |
| 13 |     640 |      512 | 5:4            |
| 14 |     768 |      576 | 4:3            |
| 15 |     896 |      640 | 7:5            |
| 16 |     768 |      512 | 3:2            |
| 17 |    1024 |      640 | 8:5            |
| 18 |     960 |      576 | 5:3            |
| 19 |    1024 |      576 | 16:9           |
| 20 |    1024 |      512 | 2:1            |


## Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

### Preprocessing

1. Download files with url & caption from https://huggingface.co/datasets/ChristophSchuhmann/improved_aesthetics_6plus
   - I only used the first file `train-00000-of-00007-29aec9150af50f9f.parquet`
2. Use img2dataset to convert to webdataset
   - https://github.com/rom1504/img2dataset
   - I put `train-00000-of-00007-29aec9150af50f9f.parquet` in a folder called `first-file`
   - the output folder is `/mnt/aesthetics6plus`, change this to your own folder

```bash
export INPUT_FOLDER=first-file
export OUTPUT_FOLDER=/mnt/aesthetics6plus
img2dataset --url_list $INPUT_FOLDER --input_format "parquet"\
  --url_col "URL" --caption_col "TEXT" --output_format webdataset\
  --output_folder $OUTPUT_FOLDER --processes_count 3 --thread_count 6 --image_size 1024 --resize_only_if_bigger --resize_mode=keep_ratio_largest \
  --save_additional_columns '["WIDTH","HEIGHT","punsafe","similarity"]' --enable_wandb True
```

3. The data-loading code does the preprocessing on the fly, so nothing else is needed. It's not optimized for speed (GPU utilization fluctuates between 80% and 100%) and it's not written for multi-GPU training, so use it with caution. The code does the following:
   - use webdataset to load the data
   - calculate the aspect ratio of each image
   - find the closest aspect ratio and its associated resolution among the predefined resolutions: `argmin(abs(aspect_ratio - predefined_aspect_ratios))`. E.g. if the aspect ratio is 1:3, the closest predefined aspect ratio is 1:2, and its associated resolution is 512x1024.
   - keeping the aspect ratio, resize the image so it is larger than or equal to the associated resolution on each side. E.g. resize to 512x(512*3) = 512x1536
   - random-crop the image to the associated resolution. E.g. crop to 512x1024
   - if more than 10% of the image is lost in the cropping, discard this example.
   - batch examples by aspect ratio, so all examples in a batch have the same aspect ratio
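The bucketing steps above can be sketched as follows (a minimal reimplementation, not the actual training code; function and variable names are mine):

```python
# predefined (width, height) buckets from the table above
BUCKETS = [(512, 1024), (576, 1024), (576, 960), (640, 1024), (512, 768),
           (640, 896), (576, 768), (512, 640), (640, 768), (640, 704),
           (512, 512), (704, 640), (768, 640), (640, 512), (768, 576),
           (896, 640), (768, 512), (1024, 640), (960, 576), (1024, 576),
           (1024, 512)]

def closest_bucket(width: int, height: int) -> tuple:
    """argmin(abs(aspect_ratio - predefined_aspect_ratios))"""
    ar = width / height
    return min(BUCKETS, key=lambda wh: abs(ar - wh[0] / wh[1]))

def resize_dims(width: int, height: int, bucket: tuple) -> tuple:
    """Keep the aspect ratio; make both sides >= the bucket's sides."""
    bw, bh = bucket
    scale = max(bw / width, bh / height)
    return round(width * scale), round(height * scale)

# a 1:3 image maps to the 1:2 bucket and keeps its size (already large
# enough); random-cropping 512x1536 down to 512x1024 loses a third of
# the image, so this example would be discarded by the 10% rule
bucket = closest_bucket(512, 1536)
print(bucket, resize_dims(512, 1536, bucket))  # (512, 1024) (512, 1536)
```
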


### Speeds, Sizes, Times

- Dataset size: 100k image-caption pairs, before filtering.
  - I didn't wait for the whole dataset to be downloaded; I copied the first 10 tar files and their index files to a new folder called `aesthetics6plus-small`, with 100k image-caption pairs in total. The full dataset is a lot bigger.

- Hardware: 1 RTX 3090 GPU

- Optimizer: 8-bit Adam

- Batch size: 32
  - actual batch size: 2
  - gradient_accumulation_steps: 16
  - effective batch size: 32

- Learning rate: 2e-6, warmed up over 500 steps and then kept constant

- Training steps: 6k
- Epoch size (approximate): 32 * 6k / 100k = 1.92 (not accounting for the filtering)
  - Each example is seen 1.92 times on average.

- Training time: approximately 1 day
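The batch-size and epoch figures above follow from simple arithmetic:

```python
# effective batch size = per-step batch x gradient accumulation steps
actual_batch_size = 2
gradient_accumulation_steps = 16
effective_batch_size = actual_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 32

# approximate epochs = images seen / dataset size
training_steps = 6_000
dataset_size = 100_000
epochs = effective_batch_size * training_steps / dataset_size
print(epochs)  # 1.92
```
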

## Results

More information needed

# Model Card Authors

Jonathan Chang


# How to Get Started with the Model

Use the code below to get started with the model.


```python
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler, UNet2DConditionModel

def use_DPM_solver(pipe):
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
    return pipe

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    unet=UNet2DConditionModel.from_pretrained("ttj/flex-diffusion-2-1", subfolder="2-1/unet", torch_dtype=torch.float16),
    torch_dtype=torch.float16,
)
# for v2-base, use the following lines instead
#pipe = StableDiffusionPipeline.from_pretrained(
#    "stabilityai/stable-diffusion-2-base",
#    unet=UNet2DConditionModel.from_pretrained("ttj/flex-diffusion-2-1", subfolder="2-base/unet", torch_dtype=torch.float16),
#    torch_dtype=torch.float16)
pipe = use_DPM_solver(pipe).to("cuda")

prompt = "a professional photograph of an astronaut riding a horse"
image = pipe(prompt).images[0]
# for a non-square image, pass one of the finetuned resolutions above, e.g.:
# image = pipe(prompt, width=768, height=512).images[0]

image.save("astronaut_rides_horse.png")
```