---
license: creativeml-openrail-m
datasets:
- morj/renaissance_portraits
language:
- en
library_name: keras
tags:
- stablediffusion
- renaissance
- finetune
- kerascv
- keras
- tensorflow
- diffusers
- text2image
base_model: CompVis/stable-diffusion-v1-4
---


# Model Card for Renaissance Stable Diffusion

<!-- Provide a quick summary of what the model is/does. -->
This is a Stable Diffusion model fine-tuned on a custom dataset of {image, caption} pairs, built on top of the fine-tuning script provided by Hugging Face. It uses the KerasCV implementation of stability.ai's text-to-image model. Unlike other open-source alternatives such as Hugging Face's Diffusers, KerasCV offers XLA compilation and mixed-precision support, resulting in state-of-the-art generation speed.




#  Table of Contents

- [Model Card for Renaissance Stable Diffusion](#model-card-for-renaissance-stable-diffusion)
- [Table of Contents](#table-of-contents)
- [Model Details](#model-details)
  - [Model Description](#model-description)
- [Uses](#uses)
  - [Direct Use](#direct-use)
  - [Out-of-Scope Use](#out-of-scope-use)
- [Bias, Risks, and Limitations](#bias-risks-and-limitations)
- [Training Details](#training-details)
  - [Training Data](#training-data)
  - [Training Procedure](#training-procedure)
- [Results](#results)
- [Environmental Impact](#environmental-impact)
- [Hardware](#hardware)
- [Software](#software)
- [Citation](#citation)
- [Model Card Authors](#model-card-authors) 
- [Model Card Contact](#model-card-contact)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)


# Model Details

## Model Description

<!-- Provide a longer summary of what this model is/does. -->
This model is a fine-tuned version of stability.ai's v1-4 Stable Diffusion model for generating high-quality Renaissance-style portraits.
It is fine-tuned from KerasCV's implementation of Stable Diffusion.
KerasCV is a deep learning library built on top of TensorFlow and Keras that provides a number of pre-trained models for image classification, object detection, and segmentation.
Its Stable Diffusion implementation offers a simple, easy-to-use way to generate images from text: you provide a text prompt and the model generates an image that matches it.
In the specific case of this fine-tuned model, any prompt produces an image resembling a Renaissance-era portrait.

- **Developed by:** Martin Gasparyan, Tatev Kyosababyan
- **Shared by:** Martin Gasparyan, Tatev Kyosababyan
- **Model type:** Computer Vision Model
- **Language(s) (NLP):** English (en)
- **License:** creativeml-openrail-m
- **Parent Model:** CompVis/stable-diffusion-v1-4
- **Resources for more information:**
    - [GitHub Repo](https://github.com/martingasparyan/Fine-Tune-Stable-Diffusion)
    - [Associated Blog Post](https://medium.com/@ngesa254/unlock-creativity-with-stable-diffusion-in-kerascv-9d317199a7c9)

# Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->


## Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
<!-- If the user enters content, print that. If not, but they enter a task in the list, use that. If neither, say "more info needed." -->
The model is intended for research purposes only. Possible research areas and tasks include:

- Safe deployment of models which have the potential to generate harmful content.
- Probing and understanding the limitations and biases of generative models.
- Generation of artworks and use in design and other artistic processes.
- Applications in educational or creative tools.
- Research on generative models.

Excluded uses are described below.

## Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
<!-- If the user enters content, print that. If not, but they enter a task in the list, use that. If neither, say "more info needed." -->
The model was not trained to produce factual or true representations of people or events, so using it to generate such content is out of scope for its abilities.

## Misuse and Malicious Use
Using the model to generate content that is cruel to individuals is a misuse of this model. This includes, but is not limited to:

- Generating demeaning, dehumanizing, or otherwise harmful representations of people or their environments, cultures, religions, etc.
- Intentionally promoting or propagating discriminatory content or harmful stereotypes.
- Impersonating individuals without their consent.
- Sexual content without consent of the people who might see it.
- Mis- and disinformation.
- Representations of egregious violence and gore.
- Sharing of copyrighted or licensed material in violation of its terms of use.
- Sharing content that is an alteration of copyrighted or licensed material in violation of its terms of use.

# Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

## Limitations
- The model does not achieve perfect photorealism.
- The model cannot render legible text.
- The model does not perform well on more difficult tasks involving compositionality, such as rendering an image corresponding to “A red cube on top of a blue sphere”.
- Faces and people in general may not be generated properly.
- The model was trained mainly with English captions and will not work as well in other languages.
- The autoencoding part of the model is lossy.
- The model was trained on the large-scale dataset LAION-5B, which contains adult material and is not fit for product use without additional safety mechanisms and considerations.
- No additional measures were used to deduplicate the dataset. As a result, we observe some degree of memorization for images that are duplicated in the training data. The training data can be searched at https://rom1504.github.io/clip-retrieval/ to possibly assist in the detection of memorized images.

## Bias
While the capabilities of image generation models are impressive, they can also reinforce or exacerbate social biases. Stable Diffusion v1 was trained on subsets of LAION-2B(en), which consists of images that are primarily limited to English descriptions. Texts and images from communities and cultures that use other languages are likely to be insufficiently accounted for. This affects the overall output of the model, as white and western cultures are often set as the default. Further, the ability of the model to generate content with non-English prompts is significantly worse than with English-language prompts.


# Training Details
Stable Diffusion v1-4 is a latent diffusion model which combines an autoencoder with a diffusion model that is trained in the latent space of the autoencoder. During training,

- Images are encoded through an encoder, which turns images into latent representations. The autoencoder uses a relative downsampling factor of 8 and maps images of shape H x W x 3 to latents of shape H/f x W/f x 4
- Text prompts are encoded through a ViT-L/14 text-encoder.
- The non-pooled output of the text encoder is fed into the UNet backbone of the latent diffusion model via cross-attention.
- The loss is a reconstruction objective between the noise that was added to the latent and the prediction made by the UNet.
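
The training steps above can be sketched numerically. This is a minimal NumPy illustration using a dummy latent and a stand-in for the UNet's prediction; the beta schedule values are illustrative, not the exact ones used by KerasCV:

```python
import numpy as np

rng = np.random.default_rng(0)

# Dummy latent: a 512x512 RGB image maps to a 64x64x4 latent (downsampling factor f=8).
latents = rng.standard_normal((1, 64, 64, 4)).astype("float32")

# Linear beta schedule (DDPM-style); alpha_bar is the cumulative product of (1 - beta).
betas = np.linspace(1e-4, 0.02, 1000, dtype="float32")
alpha_bar = np.cumprod(1.0 - betas)

t = 500  # a sampled diffusion timestep
noise = rng.standard_normal(latents.shape).astype("float32")

# Forward diffusion: mix the clean latent with Gaussian noise according to alpha_bar[t].
noisy = np.sqrt(alpha_bar[t]) * latents + np.sqrt(1.0 - alpha_bar[t]) * noise

# The UNet is trained to predict `noise` from (noisy latent, timestep, text embedding);
# here a slightly perturbed copy of the noise stands in for the network's output.
pred = noise + 0.1 * rng.standard_normal(latents.shape).astype("float32")

# Reconstruction objective: mean squared error between added and predicted noise.
loss = np.mean((noise - pred) ** 2)
```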

## Training Data

<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

We used 11 Renaissance portraits to train the model and created a .csv file with two columns, one for the image path and the other for the textual description. The dataset can be found at https://huggingface.co/datasets/morj/renaissance_portraits, and its splits can be queried via the datasets-server API:
```bash
curl -X GET \
     "https://datasets-server.huggingface.co/splits?dataset=morj%2Frenaissance_portraits"
```
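
The same request URL can be built with Python's standard library, which handles the URL encoding of the dataset id (`/` becomes `%2F`):

```python
from urllib.parse import quote

# The dataset id must be URL-encoded when placed in the query string.
dataset = "morj/renaissance_portraits"
url = (
    "https://datasets-server.huggingface.co/splits?dataset="
    + quote(dataset, safe="")
)
print(url)
```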

## Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
Note: Only the diffusion model is fine-tuned. The VAE and the text encoder are kept frozen.

Training details: The fine-tuning process involves adapting the Stable Diffusion model to the specific task of generating Renaissance-style portraits from textual descriptions.

When launching training, a diffusion-model checkpoint is saved at the end of each epoch only if the current loss is lower than the previous best. To avoid out-of-memory errors and to speed up training, we used an A100 GPU in Google Colab.
We fine-tuned the model at two resolutions, 256x256 and 512x512, varying only the batch size and the number of epochs between them. The best results were obtained at 512x512 with 72 epochs, a batch size of 1, and mixed precision enabled.

- **Hardware:** A100 GPU
- **Optimizer:** AdamW
- **Batch size:** 1
- **Learning rate:** linear warmup to 0.0001 over 10,000 steps, then kept constant
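
The warmup-then-constant schedule can be sketched as a plain function (a hypothetical helper for illustration, not taken from the training script):

```python
def warmup_constant_lr(step, peak_lr=1e-4, warmup_steps=10_000):
    """Linear warmup to `peak_lr` over `warmup_steps`, then constant."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr
```

In Keras this would typically be wrapped in a `keras.optimizers.schedules.LearningRateSchedule` subclass and passed to the AdamW optimizer.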
 
## Results 

Please check out the GitHub repo at https://github.com/martingasparyan/Fine-Tune-Stable-Diffusion/wiki

# Environmental Impact

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** A100 PCIe 40/80GB
- **Hours used:** 50
- **Cloud Provider:** Google Cloud Platform
- **Compute Region:** us-west1
- **Carbon Emitted:** 3.75 kg CO2eq
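
The reported figure follows directly from the calculator's formula (power draw in kW × hours × grid carbon intensity):

```python
# Sanity check of the reported figure: TDP (kW) x hours x grid intensity (kgCO2eq/kWh).
tdp_kw = 0.250    # A100 PCIe TDP of 250 W
hours = 50        # cumulative compute time
intensity = 0.3   # us-west1 carbon efficiency, per the calculator

emissions = tdp_kw * hours * intensity
print(emissions)  # 3.75 kg CO2eq
```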


Experiments were conducted on Google Cloud Platform in region us-west1, which has a carbon efficiency of 0.3 kgCO2eq/kWh. A cumulative 50 hours of computation was performed on hardware of type A100 PCIe 40/80GB (TDP of 250W). Total emissions are estimated at 3.75 kgCO2eq, 100 percent of which was directly offset by the cloud provider.

Estimations were conducted using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700):

```
@article{lacoste2019quantifying,
  title={Quantifying the Carbon Emissions of Machine Learning},
  author={Lacoste, Alexandre and Luccioni, Alexandra and Schmidt, Victor and Dandres, Thomas},
  journal={arXiv preprint arXiv:1910.09700},
  year={2019}
}
```

### Hardware

A100 PCIe 40/80GB

### Software

Google Colab, Jupyter Lab

# Model Card Authors

<!-- This section provides another layer of transparency and accountability. Whose views is this model card representing? How many voices were included in its construction? Etc. -->

Martin Gasparyan, Tatev Kyosababyan

# Model Card Contact

martingasparyan@yahoo.com, tatev.kyosababyan@gmail.com

# How to Get Started with the Model
Use the code below to get started with the model.
### 1. Install Dependencies
```python
!pip install keras-cv==0.6.0 -q
!pip install -U tensorflow -q
!pip install keras-core -q
```
### 2. Imports
```python
# Only these two imports are needed for the inference walkthrough below.
import keras_cv
import matplotlib.pyplot as plt
```
### 3. Create a base Stable Diffusion model
```python
my_base_model = keras_cv.models.StableDiffusion(img_width=512, img_height=512)
```
### 4. Load weights from the .h5 model hosted on Hugging Face:
```python
my_base_model.diffusion_model.load_weights('path/to/file/renaissance_model.h5')
```
### 5. Generate an image, specifying the prompt, batch size, number of steps, and seed
```python
img = my_base_model.text_to_image(
    prompt='A woman with an enigmatic smile against a dark background',
    batch_size=1,  # How many images to generate at once
    num_steps=25,  # Number of iterations (controls image quality)
    seed=123,      # Set this to always get the same image from the same prompt
)
```
### 6. Display the image using the function:
```python
def plot_images(images):
    # text_to_image returns a batch of shape (batch_size, height, width, 3),
    # so show the first image in the batch.
    plt.figure(figsize=(5, 5))
    plt.imshow(images[0])
    plt.axis('off')

plot_images(img)
```