morj committed · Commit ec562f8 · verified · 1 Parent(s): 1e886f5

Update README.md

Files changed (1): README.md (+250 −19)
README.md CHANGED
  - en
library_name: keras
tags:
  - '#stablediffusion'
  - '#renaissance'
  - '#finetune'
  - '#kerascv'

base_model: CompVis/stable-diffusion-v1-4
---

# Model Card for Renaissance Stable Diffusion

<!-- Provide a quick summary of what the model is/does. [Optional] -->
This is a Stable Diffusion model fine-tuned on a custom dataset of {image, caption} pairs to generate high-quality Renaissance-style portraits. It is built on top of the fine-tuning script provided by Hugging Face and uses the KerasCV implementation of stability.ai's text-to-image model. Unlike other open-source alternatives such as Hugging Face's Diffusers, KerasCV offers XLA compilation and mixed-precision support, resulting in state-of-the-art generation speed.

# Table of Contents

- [Model Card for Renaissance Stable Diffusion](#model-card-for-renaissance-stable-diffusion)
- [Table of Contents](#table-of-contents)
- [Model Details](#model-details)
- [Model Description](#model-description)
- [Uses](#uses)
- [Direct Use](#direct-use)
- [Downstream Use [Optional]](#downstream-use-optional)
- [Out-of-Scope Use](#out-of-scope-use)
- [Misuse and Malicious Use](#misuse-and-malicious-use)
- [Bias, Risks, and Limitations](#bias-risks-and-limitations)
- [Limitations](#limitations)
- [Bias](#bias)
- [Training Details](#training-details)
- [Training Data](#training-data)
- [Training Procedure](#training-procedure)
- [Evaluation](#evaluation)
- [Testing Data, Factors & Metrics](#testing-data-factors--metrics)
- [Testing Data](#testing-data)
- [Factors](#factors)
- [Metrics](#metrics)
- [Results](#results)
- [Model Examination](#model-examination)
- [Environmental Impact](#environmental-impact)
- [Hardware](#hardware)
- [Software](#software)
- [Model Card Authors [optional]](#model-card-authors-optional)
- [Model Card Contact](#model-card-contact)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)

# Model Details

## Model Description

<!-- Provide a longer summary of what this model is/does. -->
This is a Stable Diffusion model fine-tuned on a custom dataset of {image, caption} pairs, built on top of the fine-tuning script provided by Hugging Face. It uses the KerasCV implementation of stability.ai's text-to-image model, which, unlike other open-source alternatives such as Hugging Face's Diffusers, offers XLA compilation and mixed-precision support for state-of-the-art generation speed. The model was fine-tuned to generate high-quality Renaissance-style portraits.

- **Developed by:** Martin Gasparyan, Tatev Kyosababyan
- **Shared by:** Martin Gasparyan, Tatev Kyosababyan
- **Model type:** Diffusion-based text-to-image generative model
- **Language(s) (NLP):** English
- **License:** creativeml-openrail-m
- **Parent Model:** CompVis/stable-diffusion-v1-4
- **Resources for more information:**
  - [GitHub Repo](https://github.com/martingasparyan/Fine-Tune-Stable-Diffusion)
  - [Related Blog Post](https://medium.com/@ngesa254/unlock-creativity-with-stable-diffusion-in-kerascv-9d317199a7c9)

# Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

## Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
The model is intended for research purposes only. Possible research areas and tasks include:

- Safe deployment of models which have the potential to generate harmful content.
- Probing and understanding the limitations and biases of generative models.
- Generation of artworks and use in design and other artistic processes.
- Applications in educational or creative tools.
- Research on generative models.

Excluded uses are described below.

## Downstream Use [Optional]

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app. -->
More information needed

## Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
The model was not trained to produce factual or true representations of people or events, so using it to generate such content is out of scope for its abilities.

## Misuse and Malicious Use

Using the model to generate content that is cruel to individuals is a misuse of this model. This includes, but is not limited to:

- Generating demeaning, dehumanizing, or otherwise harmful representations of people or their environments, cultures, religions, etc.
- Intentionally promoting or propagating discriminatory content or harmful stereotypes.
- Impersonating individuals without their consent.
- Sexual content without consent of the people who might see it.
- Mis- and disinformation.
- Representations of egregious violence and gore.
- Sharing copyrighted or licensed material in violation of its terms of use.
- Sharing content that is an alteration of copyrighted or licensed material in violation of its terms of use.

# Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.

## Limitations

- The model does not achieve perfect photorealism.
- The model cannot render legible text.
- The model does not perform well on more difficult tasks that involve compositionality, such as rendering an image corresponding to "A red cube on top of a blue sphere".
- Faces and people in general may not be generated properly.
- The model was trained mainly on English captions and will not work as well in other languages.
- The autoencoding part of the model is lossy.
- The base model was trained on the large-scale dataset LAION-5B, which contains adult material and is not fit for product use without additional safety mechanisms and considerations.
- No additional measures were used to deduplicate the dataset. As a result, we observe some degree of memorization for images that are duplicated in the training data. The training data can be searched at https://rom1504.github.io/clip-retrieval/ to possibly assist in the detection of memorized images.

## Bias

While the capabilities of image generation models are impressive, they can also reinforce or exacerbate social biases. Stable Diffusion v1 was trained on subsets of LAION-2B(en), which consists of images that are primarily limited to English descriptions. Texts and images from communities and cultures that use other languages are likely to be insufficiently accounted for. This affects the overall output of the model, as white and western cultures are often set as the default. Further, the model's ability to generate content from non-English prompts is significantly worse than with English-language prompts.

# Training Details

Stable Diffusion v1-4 is a latent diffusion model which combines an autoencoder with a diffusion model that is trained in the latent space of the autoencoder. During training:

- Images are encoded through an encoder, which turns images into latent representations. The autoencoder uses a relative downsampling factor of 8 and maps images of shape H x W x 3 to latents of shape H/f x W/f x 4 (e.g., a 512 x 512 x 3 image becomes a 64 x 64 x 4 latent).
- Text prompts are encoded through a ViT-L/14 text encoder.
- The non-pooled output of the text encoder is fed into the UNet backbone of the latent diffusion model via cross-attention.
- The loss is a reconstruction objective between the noise that was added to the latent and the prediction made by the UNet, as sketched in the snippet after this list.
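
For illustration, here is a minimal TensorFlow sketch of that training objective. This is not the project's actual training script: the `diffusion_model` call signature is simplified relative to KerasCV's internals (which embed the timestep before passing it to the UNet), and `alphas_cumprod` is assumed to be a precomputed noise-schedule tensor.

```python
import tensorflow as tf

def diffusion_loss(latents, text_embeddings, diffusion_model, alphas_cumprod):
    """Simplified latent-diffusion objective: predict the added noise."""
    batch = tf.shape(latents)[0]
    # Sample a random timestep per example and Gaussian noise.
    t = tf.random.uniform((batch,), 0, 1000, dtype=tf.int32)
    noise = tf.random.normal(tf.shape(latents))
    # Forward process: mix signal and noise according to the schedule.
    a = tf.reshape(tf.gather(alphas_cumprod, t), (-1, 1, 1, 1))
    noisy_latents = tf.sqrt(a) * latents + tf.sqrt(1.0 - a) * noise
    # The UNet sees the noisy latent, the timestep, and the text embedding
    # (the text embedding is injected via cross-attention inside the model).
    pred_noise = diffusion_model([noisy_latents, t, text_embeddings])
    # Reconstruction objective between added noise and predicted noise.
    return tf.reduce_mean(tf.square(noise - pred_noise))
```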

## Training Data

<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

We used 11 Renaissance portraits to train the model and created a .csv file with two columns, one for the image path and the other for the textual description. The dataset can be found at https://huggingface.co/datasets/morj/renaissance_portraits, and its splits can be listed via the datasets server:

```
curl -X GET \
  "https://datasets-server.huggingface.co/splits?dataset=morj%2Frenaissance_portraits"
```
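
If you prefer to load the data in Python, the Hugging Face `datasets` library can fetch the same repository directly. This is a generic sketch; the split name `train` is an assumption about the dataset layout.

```python
from datasets import load_dataset

# Load the portrait dataset from the Hugging Face Hub.
ds = load_dataset("morj/renaissance_portraits", split="train")

# Inspect the column names and the first {image, caption} pair.
print(ds.column_names)
print(ds[0])
```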

## Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
Note: only the diffusion model is fine-tuned; the VAE and the text encoder are kept frozen.

The fine-tuning process adapts the Stable Diffusion model to the specific task of generating Renaissance-style portraits from textual descriptions, using the 11-portrait dataset linked above and its .csv file of image paths and textual descriptions.

During training, a diffusion-model checkpoint is saved after an epoch only if the current loss is lower than the previous best. To avoid OOM errors and to speed up training, we used an A100 GPU in Google Colab. We fine-tuned the model at two resolutions, 256x256 and 512x512, varying only the batch size and the number of epochs between the two. The best results were obtained with 512x512 pixels, 72 epochs, a batch size of 1, and mixed precision set to True.

- Hardware: A100 GPUs
- Optimizer: AdamW
- Gradient accumulation steps: 2
- Batch size: 1
- Learning rate: warmed up to 0.0001 over 10,000 steps, then kept constant

A configuration in this spirit is sketched below.
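
The following Keras/TensorFlow sketch shows how those hyperparameters could be wired together. It is an illustration, not the project's training script: the `WarmupThenConstant` schedule class is hypothetical, `AdamW` requires TensorFlow 2.11+, gradient accumulation (2 steps) is omitted, and best-only checkpointing uses the standard `ModelCheckpoint` callback.

```python
import tensorflow as tf

# Mixed precision, as used for the best 512x512 run.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

class WarmupThenConstant(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Linear warmup to `peak_lr` over `warmup_steps`, then constant."""
    def __init__(self, peak_lr=1e-4, warmup_steps=10_000):
        self.peak_lr = peak_lr
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        return tf.minimum(self.peak_lr * step / self.warmup_steps, self.peak_lr)

optimizer = tf.keras.optimizers.AdamW(learning_rate=WarmupThenConstant())

# Save a checkpoint only when the loss improves on the previous best.
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "renaissance_model.h5",
    monitor="loss",
    save_best_only=True,
    save_weights_only=True,
)
```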

# Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

## Testing Data, Factors & Metrics

### Testing Data

<!-- This should link to a Data Card if possible. -->

More information needed

### Factors

<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

More information needed

### Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

More information needed

## Results

Please check out the project wiki at https://github.com/martingasparyan/Fine-Tune-Stable-Diffusion/wiki

# Model Examination

More information needed

# Environmental Impact

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly. -->

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** A100 PCIe 40/80GB
- **Hours used:** 50
- **Cloud Provider:** Google Cloud Platform
- **Compute Region:** us-west1
- **Carbon Emitted:** 3.75 kg CO2eq
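
As a quick sanity check, the 3.75 kg figure follows directly from the calculator's inputs quoted below (250 W TDP, 50 hours, 0.3 kg CO2eq/kWh for us-west1):

```python
tdp_kw = 250 / 1000         # A100 PCIe TDP in kW
hours = 50                  # total compute time
grid_intensity = 0.3        # kg CO2eq per kWh in us-west1
print(tdp_kw * hours * grid_intensity)  # -> 3.75 kg CO2eq
```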

\usepackage{hyperref}

\subsection{CO2 Emission Related to Experiments}

Experiments were conducted using Google Cloud Platform in region us-west1, which has a carbon efficiency of 0.3 kgCO$_2$eq/kWh. A cumulative of 50 hours of computation was performed on hardware of type A100 PCIe 40/80GB (TDP of 250W).

Total emissions are estimated to be 3.75 kgCO$_2$eq, of which 100 percent was directly offset by the cloud provider.

%Uncomment if you bought additional offsets:
%XX kg CO2eq were manually offset through \href{link}{Offset Provider}.

Estimations were conducted using the \href{https://mlco2.github.io/impact#compute}{Machine Learning Impact calculator} presented in \cite{lacoste2019quantifying}.

@article{lacoste2019quantifying,
  title={Quantifying the Carbon Emissions of Machine Learning},
  author={Lacoste, Alexandre and Luccioni, Alexandra and Schmidt, Victor and Dandres, Thomas},
  journal={arXiv preprint arXiv:1910.09700},
  year={2019}
}

### Hardware

A100 PCIe 40/80GB

### Software

Google Colab, Jupyter Lab

# Model Card Authors [optional]

<!-- This section provides another layer of transparency and accountability. Whose views is this model card representing? How many voices were included in its construction? Etc. -->

Martin Gasparyan, Tatev Kyosababyan

# Model Card Contact

martingasparyan@yahoo.com, tatev.kyosababyan@gmail.com

# How to Get Started with the Model

Use the code below to get started with the model.
### 1. Install Dependencies
```python
!pip install keras-cv==0.6.0 -q
```
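
Steps 2 and 3 are truncated in this diff view; the hunk context shows that step 3 instantiates the base KerasCV model into which the fine-tuned weights are later loaded. A minimal reconstruction, with the imports (presumably step 2) assumed, is:

```python
import keras_cv
import matplotlib.pyplot as plt  # used by the plotting helper in step 6

# Step 3 (from the diff's hunk context): create the base Stable Diffusion model.
my_base_model = keras_cv.models.StableDiffusion(img_width=512, img_height=512)
```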
### 4. Load Weights from the h5 model, which is hosted on Hugging Face:
```python
my_base_model.diffusion_model.load_weights('/path/to/file/renaissance_model.h5')
```
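
If you don't have the .h5 file locally, it can be fetched from the Hub first. The `huggingface_hub` call below is standard, but the `repo_id` and `filename` are assumptions about this repository's layout; check the repository's "Files" tab and adjust.

```python
from huggingface_hub import hf_hub_download

# repo_id and filename are assumed, not confirmed by this model card.
weights_path = hf_hub_download(
    repo_id="morj/renaissance-stable-diffusion",
    filename="renaissance_model.h5",
)
my_base_model.diffusion_model.load_weights(weights_path)
```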
### 5. Create a variable to hold the generated image, setting the prompt, batch size, iterations, and seed:
```python
img = my_base_model.text_to_image(
    prompt="A woman with an enigmatic smile against a dark background",
    batch_size=1,  # How many images to generate at once
    num_steps=25,  # Number of iterations (controls image quality)
    seed=123,      # Set this to always get the same image from the same prompt
)
```
### 6. Display the image using the function:
```python
def plot_images(images):
    plt.figure(figsize=(5, 5))
    plt.imshow(images[0])  # text_to_image returns a batch; show the first image
    plt.axis("off")

plot_images(img)
```
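
Optionally, you can write a generated portrait to disk. This is a small sketch using Pillow, assuming (as in KerasCV 0.6) that `text_to_image` returns uint8 arrays of shape (batch, height, width, 3); the output filename is arbitrary.

```python
from PIL import Image

# Save the first image of the generated batch as a PNG.
Image.fromarray(img[0]).save("renaissance_portrait.png")
```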