---
license: creativeml-openrail-m
datasets:
- morj/renaissance_portraits
language:
- en
library_name: keras
tags:
- stablediffusion
- renaissance
- finetune
- kerascv
- keras
- tensorflow
- diffusers
- text2image
base_model: CompVis/stable-diffusion-v1-4
---


# Model Card for Renaissance Stable Diffusion

<!-- Provide a quick summary of what the model is/does. -->
This is a Stable Diffusion model fine-tuned on a custom dataset of {image, caption} pairs, built on top of the fine-tuning script provided by Hugging Face. It uses the KerasCV implementation of stability.ai's text-to-image model. Unlike other open-source alternatives such as Hugging Face's Diffusers, KerasCV offers XLA compilation and mixed-precision support, resulting in state-of-the-art generation speed.




#  Table of Contents

- [Model Card for Renaissance Stable Diffusion](#model-card-for-renaissance-stable-diffusion)
- [Table of Contents](#table-of-contents)
- [Model Details](#model-details)
  - [Model Description](#model-description)
- [Uses](#uses)
  - [Direct Use](#direct-use)
  - [Out-of-Scope Use](#out-of-scope-use)
- [Bias, Risks, and Limitations](#bias-risks-and-limitations)
- [Training Details](#training-details)
  - [Training Data](#training-data)
  - [Training Procedure](#training-procedure)
- [Results](#results)
- [Environmental Impact](#environmental-impact)
- [Hardware](#hardware)
- [Software](#software)
- [Citation](#citation)
- [Model Card Authors](#model-card-authors) 
- [Model Card Contact](#model-card-contact)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)


# Model Details

## Model Description

<!-- Provide a longer summary of what this model is/does. -->
This model is a fine-tuned version of stability.ai's v1-4 Stable Diffusion model for generating high-quality Renaissance-style portraits.
It is fine-tuned from KerasCV's implementation of Stable Diffusion.
KerasCV is a deep learning library built on top of TensorFlow and Keras that provides a number of pre-trained models for image classification, object detection, and segmentation.
Its Stable Diffusion implementation offers a simple, easy-to-use way to generate images from text: you provide a text prompt and the model generates an image that matches it.
In the specific case of this fine-tuned model, any prompt produces an image resembling a Renaissance-era portrait.

- **Developed by:** Martin Gasparyan, Tatev Kyosababyan
- **Shared by:** Martin Gasparyan, Tatev Kyosababyan
- **Model type:** Computer Vision Model
- **Language(s) (NLP):** English (en)
- **License:** creativeml-openrail-m
- **Parent Model:** CompVis/stable-diffusion-v1-4
- **Resources for more information:**
    - [GitHub Repo](https://github.com/martingasparyan/Fine-Tune-Stable-Diffusion)
    - [Associated Blog Post](https://medium.com/@ngesa254/unlock-creativity-with-stable-diffusion-in-kerascv-9d317199a7c9)

# Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->


## Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
<!-- If the user enters content, print that. If not, but they enter a task in the list, use that. If neither, say "more info needed." -->
The model is intended for research purposes only. Possible research areas and tasks include:

- Safe deployment of models which have the potential to generate harmful content.
- Probing and understanding the limitations and biases of generative models.
- Generation of artworks and use in design and other artistic processes.
- Applications in educational or creative tools.
- Research on generative models.

Excluded uses are described below.

## Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
<!-- If the user enters content, print that. If not, but they enter a task in the list, use that. If neither, say "more info needed." -->
The model was not trained to produce factual or true representations of people or events, so using it to generate such content is out of scope for its abilities.

## Misuse and Malicious Use
Using the model to generate content that is cruel to individuals is a misuse of this model. This includes, but is not limited to:

- Generating demeaning, dehumanizing, or otherwise harmful representations of people or their environments, cultures, religions, etc.
- Intentionally promoting or propagating discriminatory content or harmful stereotypes.
- Impersonating individuals without their consent.
- Sexual content without consent of the people who might see it.
- Mis- and disinformation.
- Representations of egregious violence and gore.
- Sharing of copyrighted or licensed material in violation of its terms of use.
- Sharing content that is an alteration of copyrighted or licensed material in violation of its terms of use.

# Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

## Limitations
- The model does not achieve perfect photorealism.
- The model cannot render legible text.
- The model does not perform well on more difficult tasks involving compositionality, such as rendering an image corresponding to “A red cube on top of a blue sphere”.
- Faces and people in general may not be generated properly.
- The model was trained mainly with English captions and will not work as well in other languages.
- The autoencoding part of the model is lossy.
- The model was trained on the large-scale dataset LAION-5B, which contains adult material and is not fit for product use without additional safety mechanisms and considerations.
- No additional measures were used to deduplicate the dataset. As a result, we observe some degree of memorization for images that are duplicated in the training data. The training data can be searched at https://rom1504.github.io/clip-retrieval/ to possibly assist in the detection of memorized images.

## Bias
While the capabilities of image generation models are impressive, they can also reinforce or exacerbate social biases. Stable Diffusion v1 was trained on subsets of LAION-2B(en), which consists of images that are primarily limited to English descriptions. Texts and images from communities and cultures that use other languages are likely to be insufficiently accounted for. This affects the overall output of the model, as white and western cultures are often set as the default. Further, the ability of the model to generate content with non-English prompts is significantly worse than with English-language prompts.


# Training Details
Stable Diffusion v1-4 is a latent diffusion model which combines an autoencoder with a diffusion model that is trained in the latent space of the autoencoder. During training,

- Images are encoded through an encoder, which turns images into latent representations. The autoencoder uses a relative downsampling factor of 8 and maps images of shape H x W x 3 to latents of shape H/f x W/f x 4
- Text prompts are encoded through a ViT-L/14 text-encoder.
- The non-pooled output of the text encoder is fed into the UNet backbone of the latent diffusion model via cross-attention.
- The loss is a reconstruction objective between the noise that was added to the latent and the prediction made by the UNet.
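
The training steps above can be sketched numerically. This is a minimal NumPy illustration using a dummy latent and a stand-in for the UNet's prediction; the beta schedule values are illustrative, not the exact ones used by KerasCV:

```python
import numpy as np

rng = np.random.default_rng(0)

# Dummy latent: a 512x512 RGB image maps to a 64x64x4 latent (downsampling factor f=8).
latents = rng.standard_normal((1, 64, 64, 4)).astype("float32")

# Linear beta schedule (DDPM-style); alpha_bar is the cumulative product of (1 - beta).
betas = np.linspace(1e-4, 0.02, 1000, dtype="float32")
alpha_bar = np.cumprod(1.0 - betas)

t = 500  # a sampled diffusion timestep
noise = rng.standard_normal(latents.shape).astype("float32")

# Forward diffusion: mix the clean latent with Gaussian noise according to alpha_bar[t].
noisy = np.sqrt(alpha_bar[t]) * latents + np.sqrt(1.0 - alpha_bar[t]) * noise

# The UNet is trained to predict `noise` from (noisy latent, timestep, text embedding);
# here a slightly perturbed copy of the noise stands in for the network's output.
pred = noise + 0.1 * rng.standard_normal(latents.shape).astype("float32")

# Reconstruction objective: mean squared error between added and predicted noise.
loss = np.mean((noise - pred) ** 2)
```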

## Training Data

<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

We used 11 Renaissance portraits to train the model and created a .csv file with two columns, one for the image path and the other for the textual description. The dataset can be found at https://huggingface.co/datasets/morj/renaissance_portraits, and its splits can be queried via the datasets-server API:
```bash
curl -X GET \
     "https://datasets-server.huggingface.co/splits?dataset=morj%2Frenaissance_portraits"
```
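
The same request URL can be built with Python's standard library, which handles the URL encoding of the dataset id (`/` becomes `%2F`):

```python
from urllib.parse import quote

# The dataset id must be URL-encoded when placed in the query string.
dataset = "morj/renaissance_portraits"
url = (
    "https://datasets-server.huggingface.co/splits?dataset="
    + quote(dataset, safe="")
)
print(url)
```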

## Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
Note: Only the diffusion model is fine-tuned. The VAE and the text encoder are kept frozen.

Training details: The fine-tuning process involves adapting the Stable Diffusion model to the specific task of generating Renaissance-style portraits from textual descriptions.

When launching training, a diffusion-model checkpoint is saved at the end of each epoch only if the current loss is lower than the previous best. To avoid out-of-memory errors and to speed up training, we used an A100 GPU in Google Colab.
We fine-tuned the model at two resolutions, 256x256 and 512x512, varying only the batch size and the number of epochs between them. The best results were obtained at 512x512 with 72 epochs, a batch size of 1, and mixed precision enabled.

- **Hardware:** A100 GPU
- **Optimizer:** AdamW
- **Batch size:** 1
- **Learning rate:** linear warmup to 0.0001 over 10,000 steps, then kept constant
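
The warmup-then-constant schedule can be sketched as a plain function (a hypothetical helper for illustration, not taken from the training script):

```python
def warmup_constant_lr(step, peak_lr=1e-4, warmup_steps=10_000):
    """Linear warmup to `peak_lr` over `warmup_steps`, then constant."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr
```

In Keras this would typically be wrapped in a `keras.optimizers.schedules.LearningRateSchedule` subclass and passed to the AdamW optimizer.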
 
## Results 

Please check out the GitHub repo at https://github.com/martingasparyan/Fine-Tune-Stable-Diffusion/wiki

# Environmental Impact

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** A100 PCIe 40/80GB
- **Hours used:** 50
- **Cloud Provider:** Google Cloud Platform
- **Compute Region:** us-west1
- **Carbon Emitted:** 3.75 kg CO2eq
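
The reported figure follows directly from the calculator's formula (power draw in kW × hours × grid carbon intensity):

```python
# Sanity check of the reported figure: TDP (kW) x hours x grid intensity (kgCO2eq/kWh).
tdp_kw = 0.250    # A100 PCIe TDP of 250 W
hours = 50        # cumulative compute time
intensity = 0.3   # us-west1 carbon efficiency, per the calculator

emissions = tdp_kw * hours * intensity
print(emissions)  # 3.75 kg CO2eq
```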


Experiments were conducted on Google Cloud Platform in region us-west1, which has a carbon efficiency of 0.3 kgCO2eq/kWh. A cumulative 50 hours of computation was performed on hardware of type A100 PCIe 40/80GB (TDP of 250W). Total emissions are estimated at 3.75 kgCO2eq, 100 percent of which was directly offset by the cloud provider.

Estimations were conducted using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700):

```
@article{lacoste2019quantifying,
  title={Quantifying the Carbon Emissions of Machine Learning},
  author={Lacoste, Alexandre and Luccioni, Alexandra and Schmidt, Victor and Dandres, Thomas},
  journal={arXiv preprint arXiv:1910.09700},
  year={2019}
}
```

### Hardware

A100 PCIe 40/80GB

### Software

Google Colab, Jupyter Lab

# Model Card Authors

<!-- This section provides another layer of transparency and accountability. Whose views is this model card representing? How many voices were included in its construction? Etc. -->

Martin Gasparyan, Tatev Kyosababyan

# Model Card Contact

martingasparyan@yahoo.com, tatev.kyosababyan@gmail.com

# How to Get Started with the Model
Use the code below to get started with the model.
### 1. Install Dependencies
```python
!pip install keras-cv==0.6.0 -q
!pip install -U tensorflow -q
!pip install keras-core -q
```
### 2. Imports
```python
# Only these two imports are needed for the inference walkthrough below.
import keras_cv
import matplotlib.pyplot as plt
```
### 3. Create a base Stable Diffusion model
```python
my_base_model = keras_cv.models.StableDiffusion(img_width=512, img_height=512)
```
### 4. Load weights from the .h5 model hosted on Hugging Face:
```python
my_base_model.diffusion_model.load_weights('path/to/file/renaissance_model.h5')
```
### 5. Generate an image, specifying the prompt, batch size, number of steps, and seed
```python
img = my_base_model.text_to_image(
    prompt='A woman with an enigmatic smile against a dark background',
    batch_size=1,  # How many images to generate at once
    num_steps=25,  # Number of iterations (controls image quality)
    seed=123,      # Set this to always get the same image from the same prompt
)
```
### 6. Display the image using the function:
```python
def plot_images(images):
    # text_to_image returns a batch of shape (batch_size, height, width, 3),
    # so show the first image in the batch.
    plt.figure(figsize=(5, 5))
    plt.imshow(images[0])
    plt.axis('off')

plot_images(img)
```