Add files using upload-large-folder tool
- README.md +283 -0
- assets/OHRFT_scheme.png +0 -0
- assets/figure_nlp.png +0 -0
- generation/control/ControlNet/.gitignore +143 -0
- generation/control/ControlNet/LICENSE +201 -0
- generation/control/ControlNet/README.md +348 -0
- generation/control/ControlNet/cldm/cldm.py +435 -0
- generation/control/ControlNet/cldm/ddim_hacked.py +317 -0
- generation/control/ControlNet/cldm/hack.py +111 -0
- generation/control/ControlNet/cldm/logger.py +76 -0
- generation/control/ControlNet/cldm/model.py +28 -0
- generation/control/ControlNet/config.py +1 -0
- generation/control/ControlNet/docs/annotator.md +49 -0
- generation/control/ControlNet/docs/faq.md +21 -0
- generation/control/ControlNet/docs/low_vram.md +15 -0
- generation/control/ControlNet/docs/train.md +276 -0
- generation/control/ControlNet/environment.yaml +35 -0
- generation/control/ControlNet/gradio_annotator.py +160 -0
- generation/control/ControlNet/gradio_canny2image.py +97 -0
- generation/control/ControlNet/gradio_depth2image.py +98 -0
- generation/control/ControlNet/gradio_fake_scribble2image.py +102 -0
- generation/control/ControlNet/gradio_hed2image.py +98 -0
- generation/control/ControlNet/gradio_hough2image.py +100 -0
- generation/control/ControlNet/gradio_normal2image.py +99 -0
- generation/control/ControlNet/gradio_pose2image.py +98 -0
- generation/control/ControlNet/gradio_scribble2image.py +92 -0
- generation/control/ControlNet/gradio_scribble2image_interactive.py +102 -0
- generation/control/ControlNet/gradio_seg2image.py +97 -0
- generation/control/ControlNet/ldm/data/__init__.py +0 -0
- generation/control/ControlNet/ldm/models/autoencoder.py +219 -0
- generation/control/ControlNet/ldm/models/diffusion/__init__.py +0 -0
- generation/control/ControlNet/ldm/models/diffusion/ddim.py +336 -0
- generation/control/ControlNet/ldm/util.py +197 -0
- generation/control/ControlNet/share.py +8 -0
- generation/control/ControlNet/tool_add_control.py +50 -0
- generation/control/ControlNet/tool_add_control_sd21.py +50 -0
- generation/control/ControlNet/tool_transfer_control.py +59 -0
- generation/control/ControlNet/tutorial_dataset.py +39 -0
- generation/control/ControlNet/tutorial_dataset_test.py +12 -0
- generation/control/ControlNet/tutorial_train.py +35 -0
- generation/control/ControlNet/tutorial_train_sd21.py +35 -0
- generation/control/download_ade20k.sh +10 -0
- generation/control/download_celebhq.sh +10 -0
- generation/control/eval_canny.py +130 -0
- generation/control/eval_landmark.py +127 -0
- generation/control/generation.py +238 -0
- generation/control/hra.py +254 -0
- generation/control/tool_add_hra.py +81 -0
- generation/control/train.py +149 -0
- generation/env.yml +172 -0
README.md
ADDED
<div align="center">

# [NeurIPS 2024 Spotlight] Bridging The Gap between Low-rank and Orthogonal Adaptation via Householder Reflection Adaptation

[arXiv paper](https://arxiv.org/pdf/2405.17484)
[PEFT integration](https://huggingface.co/docs/peft/en/package_reference/hra)

</div>

<div align="center">
<img src="assets/OHRFT_scheme.png" width="1100"/>
</div>

## Introduction

This repository includes the official implementation of [HRA](https://arxiv.org/pdf/2405.17484).
We propose a simple yet effective adapter-based orthogonal fine-tuning method, HRA.
Given a pre-trained model, our method fine-tunes its layers by multiplying each frozen weight matrix with an orthogonal matrix constructed by a chain of learnable Householder reflections (HRs).

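The construction above can be sketched numerically. The following is a minimal numpy illustration (the names and shapes are our own, not the repository's API): each Householder reflection is orthogonal, so their product is orthogonal, and the adapted weight is the frozen weight multiplied by that product.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 8  # hidden size and number of reflections (the role of hra_r)

U = rng.normal(size=(d, r))  # r learnable Householder vectors

def householder_chain(U):
    """Product of Householder reflections H_i = I - 2 u_i u_i^T / ||u_i||^2."""
    Q = np.eye(U.shape[0])
    for i in range(U.shape[1]):
        u = U[:, i:i + 1]
        Q = Q @ (np.eye(U.shape[0]) - 2.0 * (u @ u.T) / (u.T @ u))
    return Q

Q = householder_chain(U)

W = rng.normal(size=(d, d))  # stands in for a frozen pre-trained weight
W_adapted = W @ Q            # W stays frozen; only the r vectors in U are trained

assert np.allclose(Q.T @ Q, np.eye(d), atol=1e-10)  # the chain is exactly orthogonal
print(U.size, W.size)  # trainable vs. frozen parameter counts: 128 vs. 256
```

Because only `U` is trained, the parameter cost scales as `d*r` rather than `d*d` (the low-rank side of the method), while `Q` is orthogonal by construction (the orthogonal-adaptation side).
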
## Usage

### Subject-driven Generation

<div align="center">
<img src="assets/subject.png" width="600"/>
</div>

Given several images of a specific subject and a textual prompt, subject-driven generation aims to generate images of the same subject in a context aligning with the prompt.

#### Environment Setup

```bash
cd generation
conda env create -f env.yml
```

#### Prepare Dataset

Download the [dreambooth](https://github.com/google/dreambooth) dataset by running this script:

```bash
cd subject
bash download_dreambooth.sh
```

After downloading the data, your directory structure should look like this:

```
dreambooth
├── dataset
│   ├── backpack
│   └── backpack_dog
│   ...
```

You can also put your custom images into `dreambooth/dataset`.

#### Finetune

```bash
prompt_idx=0
class_idx=0
./train_dreambooth.sh $prompt_idx $class_idx
```

where `$prompt_idx` selects one of the prompts (ranging from 0 to 24) and `$class_idx` selects one of the subjects (ranging from 0 to 29).

Launch the training script with `accelerate` and pass hyperparameters, as well as HRA-specific arguments, such as:

- `use_hra`: enables HRA in the training script.
- `hra_r`: the number of HRs (i.e., r) across different layers, expressed as an `int`.
  As r increases, the number of trainable parameters increases, which generally leads to improved performance.
  However, this also results in higher memory consumption and longer computation times.
  Therefore, r is usually set to 8.
  **Note**: please set r to an even number to avoid potential issues during initialization.
- `hra_apply_GS`: applies Gram-Schmidt orthogonalization. Default is `false`.
- `hra_bias`: specifies whether the `bias` parameters should be trained. Can be `none`, `all`, or `hra_only`.

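The `hra_apply_GS` option can be pictured with a small numpy sketch, under the assumption that it orthogonalizes the r Householder vectors before the reflections are built (this is an illustration, not the repository's implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 16, 8
U = rng.normal(size=(d, r))  # raw Householder vectors, one per reflection

def householder_chain(U):
    """Product of the reflections H_i = I - 2 u_i u_i^T / ||u_i||^2."""
    Q = np.eye(U.shape[0])
    for i in range(U.shape[1]):
        u = U[:, i:i + 1]
        Q = Q @ (np.eye(U.shape[0]) - 2.0 * (u @ u.T) / (u.T @ u))
    return Q

# Gram-Schmidt orthogonalization (here via a QR decomposition) makes the
# r vectors orthonormal.
U_gs, _ = np.linalg.qr(U)
assert np.allclose(U_gs.T @ U_gs, np.eye(r), atol=1e-10)

# With orthonormal vectors, the chain of r reflections collapses into a single
# block reflection I - 2 U U^T.
assert np.allclose(householder_chain(U_gs), np.eye(d) - 2.0 * U_gs @ U_gs.T, atol=1e-10)
```

With orthonormal vectors the whole chain reduces to one block reflection, which also makes the result independent of the order of the reflections.
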
#### Evaluation

```bash
python evaluate.py
python get_result.py
```

### Controllable Generation

<div align="center">
<img src="assets/control.png" width="650"/>
</div>

Controllable generation aims to generate images aligning with a textual prompt and additional control signals (such as facial landmark annotations, canny edges, and segmentation maps).

#### Prepare Dataset

Download the ADE20K and CelebA-HQ datasets by running these scripts:

```bash
cd control
bash download_ade20k.sh
bash download_celebhq.sh
```

For the COCO dataset, we follow [OFT](https://github.com/Zeju1997/oft) to download and preprocess it.

After downloading the data, your directory structure should look like this:

```
data
├── ADE20K
│   ├── train
│   │   ├── color
│   │   ├── segm
│   │   └── prompt_train_blip.json
│   └── val
│       ├── color
│       ├── segm
│       └── prompt_val_blip.json
└── COCO
    ├── train
    │   ├── color
    │   ├── depth
    ...
```

#### Prepare pre-trained model

Download the pre-trained model weights [v1-5-pruned.ckpt](https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/main) and save them in the `models` directory.

#### Fine-tuning

1. Create the model with additional **HRA** parameters:
   ```bash
   python tool_add_hra.py \
     --input_path=./models/v1-5-pruned.ckpt \
     --output_path=./models/hra_r_8.ckpt \
     --r=8
   ```
2. Specify the control signal and dataset, then train the model with the same hyperparameters as above:
   ```bash
   python train.py \
     --r=8 \
     --control=segm
   ```

#### Generation

1. After fine-tuning with **HRA**, run inference to generate images conditioned on the control signal. Because inference takes some time, for large-scale evaluation we split the dataset into sub-datasets and run inference on multiple GPUs:
   ```bash
   python generation.py \
     --r=8 \
     --control=segm
   ```
2. To evaluate **HRA** on the three tasks, namely canny edge to image (C2I) on the COCO dataset, landmark to face (L2F) on the CelebA-HQ dataset, and segmentation map to image (S2I) on the ADE20K dataset, run the following scripts on the generated images.
   ```bash
   python eval_landmark.py
   ```
   ```bash
   python eval_canny.py
   ```
   Note: to evaluate the segmentation map-to-image (S2I) task, please install the [Segformer](https://github.com/NVlabs/SegFormer) repository and run the following testing command on both the original and generated images.
   ```bash
   python tools/test.py local_configs/segformer/B4/segformer.b4.512x512.ade.160k.py ./weights/segformer.b4.512x512.ade.160k.pth
   ```

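The sub-dataset split used for multi-GPU inference above can be pictured with a simple round-robin shard function (an illustrative sketch; the repository's actual splitting logic may differ):

```python
# Illustrative sketch of splitting a dataset index list into per-GPU shards
# for parallel inference; not the repository's actual implementation.
def shard(indices, num_gpus, gpu_rank):
    """Return the slice of `indices` that GPU `gpu_rank` should process."""
    return indices[gpu_rank::num_gpus]  # round-robin split

indices = list(range(10))
shards = [shard(indices, 4, rank) for rank in range(4)]
print(shards)  # → [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]

# Every index is processed exactly once across the GPUs.
assert sorted(sum(shards, [])) == indices
```
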
### Natural Language Understanding

<div align="center">
<img src="assets/figure_nlp.png" width="300"/>
</div>

We adapt [DeBERTaV3-base](https://arxiv.org/abs/2111.09543) and test the performance of the adapted models on the [General Language Understanding Evaluation (GLUE) benchmark](https://gluebenchmark.com/).

#### Environment Setup

```bash
cd nlu
conda env create -f env.yml
```

Before fine-tuning, you need to install the dependencies.

```bash
python setup.py install
```

#### Prepare Dataset

Run this script to download the GLUE dataset:

```bash
cache_dir=/tmp/DeBERTa/
cd experiments/glue
./download_data.sh $cache_dir/glue_tasks
```

#### Finetune

Run the tasks:

```bash
./mnli.sh
./cola.sh
./mrpc.sh
./qnli.sh
./qqp.sh
./rte.sh
./sst2.sh
./stsb.sh
```

### Mathematical Reasoning

We have not yet completed the integration of the HRA code into PEFT. Until then, if you want to try fine-tuning large models with HRA, you can follow the steps below.

Go to the `llama` folder:

```bash
cd llama
```

#### Environment Setup

We recommend Python 3.10; you can create the environment with conda:

```bash
conda create -n pytorch python=3.10
```

Then install the required packages with the following command:

```bash
pip install -r requirements.txt
```

Please note that the `peft` and `transformers` packages must be installed with exactly the versions listed in the requirements file.

After completing the installation, replace the **oft** folder inside **peft/tuners** in your environment's **site-packages** with the **oft** folder from the current directory.

The path of the **oft** folder in the environment should be:

```bash
/your_path/anaconda3/envs/pytorch/lib/python3.10/site-packages/peft/tuners/
```

The **layer.py** in the current **oft** directory implements the case where λ is not infinity.

If you want to simulate the case where λ is infinity, replace **layer.py** with **layer_GS_HRA.py**, and set the hyperparameter λ to 0 during training.

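The role of λ can be illustrated with a numpy sketch. We assume here that the finite-λ case adds an orthogonality penalty on the Householder vectors to the loss (our reading; the actual loss lives in **layer.py**), so that λ → ∞ corresponds to enforcing exact orthogonality, e.g. via Gram-Schmidt, which is the regime **layer_GS_HRA.py** is meant to simulate:

```python
import numpy as np

rng = np.random.default_rng(2)
d, r = 16, 8
U = rng.normal(size=(d, r))  # hypothetical Householder vectors of one layer

def orth_penalty(U):
    """Squared deviation of the normalized vectors from mutual orthogonality."""
    V = U / np.linalg.norm(U, axis=0, keepdims=True)
    G = V.T @ V
    return float(np.sum((G - np.eye(G.shape[1])) ** 2))

# Finite lambda: add lam * orth_penalty(U) to the training loss as a soft constraint.
lam = 1e-4
penalty_term = lam * orth_penalty(U)

# lambda -> infinity: enforce orthogonality exactly (Gram-Schmidt / QR),
# which drives the penalty to zero.
U_gs, _ = np.linalg.qr(U)
assert orth_penalty(U) > 0.0
assert orth_penalty(U_gs) < 1e-12
```
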
#### Prepare Dataset

The dataset we use for fine-tuning is MetaMathQA-40K, which can be downloaded through this [link](https://huggingface.co/datasets/meta-math/MetaMathQA-40K).

#### Prepare Model

The model we use for fine-tuning is Llama 2. You can choose the model you want to fine-tune.

#### Finetune

Run the following command to complete the fine-tuning:

```bash
bash tune.sh
```

Please note that you need to set the dataset path and the pre-trained model path in `tune.sh`, and you can adjust the other parameters according to your needs. That is:

```bash
BASE_MODEL="YOUR_MODEL_PATH"
DATA_PATH="YOUR_DATA_PATH"
OUTPUT="YOUR_MODEL_SAVED_PATH"
```

#### Evaluation

After the training is complete, you can run the following command to test:

```bash
bash test.sh
```

Please remember to change the model paths in it:

```bash
BASE_MODEL="YOUR_MODEL_PATH"
OUTPUT="YOUR_MODEL_SAVED_PATH"
```

## 📌 Citing our work

If you find our work useful, please cite it:

```bibtex
@inproceedings{yuanbridging,
  title={Bridging The Gap between Low-rank and Orthogonal Adaptation via Householder Reflection Adaptation},
  author={Yuan, Shen and Liu, Haotian and Xu, Hongteng},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year={2024}
}
```

assets/OHRFT_scheme.png
ADDED

assets/figure_nlp.png
ADDED

generation/control/ControlNet/.gitignore
ADDED

.idea/

training/
lightning_logs/
image_log/

*.pth
*.pt
*.ckpt
*.safetensors

gradio_pose2image_private.py
gradio_canny2image_private.py

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
#   However, in case of collaboration, if having platform-specific dependencies or dependencies
#   having no cross-platform support, pipenv may install dependencies that don't work, or not
#   install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

generation/control/ControlNet/LICENSE
ADDED

Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
|
| 199 |
+
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
| 200 |
+
See the License for the specific language governing permissions and
|
| 201 |
+
limitations under the License.
|
generation/control/ControlNet/README.md
ADDED
@@ -0,0 +1,348 @@
# News: A nightly version of ControlNet 1.1 is released!

[ControlNet 1.1](https://github.com/lllyasviel/ControlNet-v1-1-nightly) is released. Those new models will be merged into this repo after we make sure that everything is good.

# Below is ControlNet 1.0

Official implementation of [Adding Conditional Control to Text-to-Image Diffusion Models](https://arxiv.org/abs/2302.05543).

ControlNet is a neural network structure that controls diffusion models by adding extra conditions.

It copies the weights of neural network blocks into a "locked" copy and a "trainable" copy.

The "trainable" one learns your condition. The "locked" one preserves your model.

Thanks to this, training with a small dataset of image pairs will not destroy the production-ready diffusion models.

The "zero convolution" is a 1×1 convolution with both weight and bias initialized to zeros.

Before training, all zero convolutions output zeros, and ControlNet will not cause any distortion.

No layer is trained from scratch. You are still fine-tuning. Your original model is safe.

This allows training on small-scale or even personal devices.

This is also friendly to the merging/replacement/offsetting of models/weights/blocks/layers.

### FAQ

**Q:** But wait, if the weight of a conv layer is zero, the gradient will also be zero, and the network will not learn anything. Why does the "zero convolution" work?

**A:** This is not true. [See an explanation here](docs/faq.md).
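To make the FAQ answer concrete, here is a minimal PyTorch sketch (illustrative code, not taken from this repo) of a zero-initialized 1×1 convolution. Its output is exactly zero before training, yet the gradient of the loss with respect to the *weight* depends on the input feature map, not on the weight itself, so the layer still learns:

```python
import torch
import torch.nn as nn

def zero_conv(channels):
    # 1x1 conv with weight and bias initialized to zero,
    # mirroring the role of zero_module(conv_nd(...)) described above.
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

conv = zero_conv(4)
x = torch.randn(1, 4, 8, 8)
y = conv(x)
assert torch.all(y == 0)  # before training the zero conv is a no-op

# dL/dW is a function of the input x, so it is nonzero even though W == 0.
y.sum().backward()
assert conv.weight.grad.abs().sum() > 0
```

Because the output starts at zero, adding such a layer onto the locked model initially changes nothing, which is exactly why ControlNet does not distort the pretrained diffusion model at the start of training.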

# Stable Diffusion + ControlNet

By repeating the above simple structure 14 times, we can control Stable Diffusion in this way:

In this way, the ControlNet can **reuse** the SD encoder as a **deep, strong, robust, and powerful backbone** to learn diverse controls. Much evidence (like [this](https://jerryxu.net/ODISE/) and [this](https://vpd.ivg-research.xyz/)) validates that the SD encoder is an excellent backbone.

Note that the way we connect layers is computationally efficient. The original SD encoder does not need to store gradients (the locked original SD Encoder Blocks 1-4 and Middle). The required GPU memory is not much larger than for the original SD, although many layers are added. Great!

# Features & News

2023/04/14 - We released [ControlNet 1.1](https://github.com/lllyasviel/ControlNet-v1-1-nightly). Those new models will be merged into this repo after we make sure that everything is good.

2023/03/03 - We released a discussion - [Precomputed ControlNet: Speed up ControlNet by 45%, but is it necessary?](https://github.com/lllyasviel/ControlNet/discussions/216)

2023/02/26 - We released a blog - [Ablation Study: Why ControlNets use deep encoder? What if it was lighter? Or even an MLP?](https://github.com/lllyasviel/ControlNet/discussions/188)

2023/02/20 - Implementation for non-prompt mode released. See also [Guess Mode / Non-Prompt Mode](#guess-anchor).

2023/02/12 - Now you can play with any community model by [Transferring the ControlNet](https://github.com/lllyasviel/ControlNet/discussions/12).

2023/02/11 - [Low VRAM mode](docs/low_vram.md) is added. Please use this mode if you are using an 8GB GPU or if you want a larger batch size.
# Production-Ready Pretrained Models

First create a new conda environment

    conda env create -f environment.yaml
    conda activate control

All models and detectors can be downloaded from [our Hugging Face page](https://huggingface.co/lllyasviel/ControlNet). Make sure that SD models are put in "ControlNet/models" and detectors are put in "ControlNet/annotator/ckpts". Make sure that you download all necessary pretrained weights and detector models from that Hugging Face page, including the HED edge detection model, the Midas depth estimation model, Openpose, and so on.

We provide 9 Gradio apps with these models.

All test images can be found in the folder "test_imgs".

## ControlNet with Canny Edge

Stable Diffusion 1.5 + ControlNet (using simple Canny edge detection)

    python gradio_canny2image.py

The Gradio app also allows you to change the Canny edge thresholds. Just try it for more details.

Prompt: "bird"

Prompt: "cute dog"
## ControlNet with M-LSD Lines

Stable Diffusion 1.5 + ControlNet (using simple M-LSD straight line detection)

    python gradio_hough2image.py

The Gradio app also allows you to change the M-LSD thresholds. Just try it for more details.

Prompt: "room"

Prompt: "building"

## ControlNet with HED Boundary

Stable Diffusion 1.5 + ControlNet (using soft HED Boundary)

    python gradio_hed2image.py

The soft HED Boundary will preserve many details in input images, making this app suitable for recoloring and stylizing. Just try it for more details.

Prompt: "oil painting of handsome old man, masterpiece"

Prompt: "Cyberpunk robot"

## ControlNet with User Scribbles

Stable Diffusion 1.5 + ControlNet (using Scribbles)

    python gradio_scribble2image.py

Note that the UI is based on Gradio, and Gradio is somewhat difficult to customize. Right now you need to draw scribbles outside the UI (using your favorite drawing software, for example, MS Paint) and then import the scribble image into Gradio.

Prompt: "turtle"

Prompt: "hot air balloon"

### Interactive Interface

We actually provide an interactive interface

    python gradio_scribble2image_interactive.py

~~However, because Gradio is very [buggy](https://github.com/gradio-app/gradio/issues/3166) and difficult to customize, right now users need to first set the canvas width and height and then click "Open drawing canvas" to get a drawing area. Please do not upload an image to that drawing canvas. Also, the drawing area is very small; it should be bigger. But I failed to find out how to make it larger. Again, Gradio is really buggy.~~ (Now fixed, will update asap)

The below dog sketch is drawn by me. Perhaps we should draw a better dog for the showcase.

Prompt: "dog in a room"

## ControlNet with Fake Scribbles

Stable Diffusion 1.5 + ControlNet (using fake scribbles)

    python gradio_fake_scribble2image.py

Sometimes we are lazy, and we do not want to draw scribbles. This script uses exactly the same scribble-based model, but uses a simple algorithm to synthesize scribbles from input images.

Prompt: "bag"

Prompt: "shose" (Note that "shose" is a typo; it should be "shoes". But it still seems to work.)

## ControlNet with Human Pose

Stable Diffusion 1.5 + ControlNet (using human pose)

    python gradio_pose2image.py

Apparently, this model deserves a better UI to directly manipulate the pose skeleton. However, again, Gradio is somewhat difficult to customize. Right now you need to input an image, and then Openpose will detect the pose for you.

Prompt: "Chief in the kitchen"

Prompt: "An astronaut on the moon"

## ControlNet with Semantic Segmentation

Stable Diffusion 1.5 + ControlNet (using semantic segmentation)

    python gradio_seg2image.py

This model uses ADE20K's segmentation protocol. Again, this model deserves a better UI to directly draw the segmentations. However, again, Gradio is somewhat difficult to customize. Right now you need to input an image, and then a model called Uniformer will detect the segmentations for you. Just try it for more details.

Prompt: "House"

Prompt: "River"

## ControlNet with Depth

Stable Diffusion 1.5 + ControlNet (using depth map)

    python gradio_depth2image.py

Great! Now SD 1.5 also has depth control. FINALLY. So many possibilities (considering SD1.5 has many more community models than SD2).

Note that, different from Stability's model, this ControlNet receives the full 512×512 depth map rather than a 64×64 depth map (Stability's SD2 depth model uses 64×64 depth maps). This means that the ControlNet will preserve more details from the depth map.

This is always a strength, because if users do not want to preserve the extra details, they can simply use another SD to post-process an i2i result. But if they do want to preserve them, ControlNet becomes their only choice. Again, SD2 uses 64×64 depth; we use 512×512.

Prompt: "Stormtrooper's lecture"

## ControlNet with Normal Map

Stable Diffusion 1.5 + ControlNet (using normal map)

    python gradio_normal2image.py

This model uses normal maps. Right now in the app, the normal is computed from the Midas depth map and a user threshold (which determines how much of the area is background with its normal facing the viewer; tune the "Normal background threshold" in the Gradio app to get a feeling for it).

Prompt: "Cute toy"

Prompt: "Plaster statue of Abraham Lincoln"

Compared to the depth model, this model seems to be a bit better at preserving geometry. This is intuitive: minor details are not salient in depth maps, but they are salient in normal maps. Below is the depth result with the same inputs. You can see that the hairstyle of the man in the input image is modified by the depth model but preserved by the normal model.

Prompt: "Plaster statue of Abraham Lincoln"

## ControlNet with Anime Line Drawing

We also trained a relatively simple ControlNet for anime line drawings. This tool may be useful for artistic creations. (Although the image details in the results are a bit modified, since it still diffuses latent images.)

This model is not available right now. We need to evaluate the potential risks before releasing it. Nevertheless, you may be interested in [transferring the ControlNet to any community model](https://github.com/lllyasviel/ControlNet/discussions/12).

<a id="guess-anchor"></a>
# Guess Mode / Non-Prompt Mode

The "guess mode" (also called non-prompt mode) will completely unleash all the power of the very powerful ControlNet encoder.

See also the blog - [Ablation Study: Why ControlNets use deep encoder? What if it was lighter? Or even an MLP?](https://github.com/lllyasviel/ControlNet/discussions/188)

You need to manually check the "Guess Mode" toggle to enable this mode.

In this mode, the ControlNet encoder will try its best to recognize the content of the input control map (a depth map, edge map, scribbles, etc.) even if you remove all prompts.

**Let's have fun with some very challenging experimental settings!**

**No prompts. No "positive" prompts. No "negative" prompts. No extra caption detector. One single diffusion loop.**

For this mode, we recommend using 50 steps and a guidance scale between 3 and 5.

No prompts:

Note that the below example is 768×768. No prompts. No "positive" prompts. No "negative" prompts.

By tuning the parameters, you can get some very interesting results like the below:

Because no prompt is available, the ControlNet encoder will "guess" what is in the control map. Sometimes the guess is really interesting. Because the diffusion algorithm can essentially give multiple results, the ControlNet seems able to give multiple guesses, like this:

Without a prompt, the HED model seems good at generating images that look like paintings when the control strength is relatively low:

Guess Mode is also supported in the [WebUI Plugin](https://github.com/Mikubill/sd-webui-controlnet):

No prompts. Default WebUI parameters. Pure random results with the seed being 12345. Standard SD1.5. The input scribble is in the "test_imgs" folder, to reproduce.

Below is another challenging example:

No prompts. Default WebUI parameters. Pure random results with the seed being 12345. Standard SD1.5. The input scribble is in the "test_imgs" folder, to reproduce.

Note that in guess mode, you will still be able to input prompts. The only difference is that the model will "try harder" to guess what is in the control map even if you do not provide a prompt. Just try it yourself!

Besides, if you write some scripts (like BLIP) to generate image captions from the "guess mode" images, and then use the generated captions as prompts to diffuse again, you will get a SOTA pipeline for fully automatic conditional image generation.

# Combining Multiple ControlNets

ControlNets are composable: more than one ControlNet can easily be composed for multi-condition control.

Right now this feature is in an experimental stage in [Mikubill's A1111 WebUI Plugin](https://github.com/Mikubill/sd-webui-controlnet):

As long as the models are controlling the same SD, the "boundary" between different research projects does not even exist. This plugin also allows different methods to work together!

# Use ControlNet in Any Community Model (SD1.X)

This is an experimental feature.

[See the steps here](https://github.com/lllyasviel/ControlNet/discussions/12).

Or you may want to use [Mikubill's A1111 WebUI Plugin](https://github.com/Mikubill/sd-webui-controlnet), which is plug-and-play and does not need manual merging.

# Annotate Your Own Data

We provide simple Python scripts to process images.

[See a Gradio example here](docs/annotator.md).

# Train with Your Own Data

Training a ControlNet is as easy as (or even easier than) training a simple pix2pix.

[See the steps here](docs/train.md).

# Related Resources

Special thanks to the great project - [Mikubill's A1111 WebUI Plugin](https://github.com/Mikubill/sd-webui-controlnet)!

We also thank Hysts for making the [Hugging Face Space](https://huggingface.co/spaces/hysts/ControlNet), as well as the more than 65 models in that amazing [Colab list](https://github.com/camenduru/controlnet-colab)!

We thank haofanwang for making [ControlNet-for-Diffusers](https://github.com/haofanwang/ControlNet-for-Diffusers)!

We also thank all authors for making ControlNet demos, including but not limited to [fffiloni](https://huggingface.co/spaces/fffiloni/ControlNet-Video), [other-model](https://huggingface.co/spaces/hysts/ControlNet-with-other-models), [ThereforeGames](https://github.com/AUTOMATIC1111/stable-diffusion-webui/discussions/7784), [RamAnanth1](https://huggingface.co/spaces/RamAnanth1/ControlNet), etc.!

Besides, you may also want to read these amazing related works:

[Composer: Creative and Controllable Image Synthesis with Composable Conditions](https://github.com/damo-vilab/composer): A much bigger model to control diffusion!

[T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models](https://github.com/TencentARC/T2I-Adapter): A much smaller model to control Stable Diffusion!

[ControlLoRA: A Light Neural Network To Control Stable Diffusion Spatial Information](https://github.com/HighCWu/ControlLoRA): Implement ControlNet using LoRA!

And these amazing recent projects: [InstructPix2Pix: Learning to Follow Image Editing Instructions](https://www.timothybrooks.com/instruct-pix2pix), [Pix2pix-zero: Zero-shot Image-to-Image Translation](https://github.com/pix2pixzero/pix2pix-zero), [Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation](https://github.com/MichalGeyer/plug-and-play), [MaskSketch: Unpaired Structure-guided Masked Image Generation](https://arxiv.org/abs/2302.05496), [SEGA: Instructing Diffusion using Semantic Dimensions](https://arxiv.org/abs/2301.12247), [Universal Guidance for Diffusion Models](https://github.com/arpitbansal297/Universal-Guided-Diffusion), [Region-Aware Diffusion for Zero-shot Text-driven Image Editing](https://github.com/haha-lisa/RDM-Region-Aware-Diffusion-Model), [Domain Expansion of Image Generators](https://arxiv.org/abs/2301.05225), [Image Mixer](https://twitter.com/LambdaAPI/status/1626327289288957956), [MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation](https://multidiffusion.github.io/)

# Citation

    @misc{zhang2023adding,
      title={Adding Conditional Control to Text-to-Image Diffusion Models},
      author={Lvmin Zhang and Anyi Rao and Maneesh Agrawala},
      booktitle={IEEE International Conference on Computer Vision (ICCV)},
      year={2023},
    }

[Arxiv Link](https://arxiv.org/abs/2302.05543)

[Supplementary Materials](https://lllyasviel.github.io/misc/202309/cnet_supp.pdf)
generation/control/ControlNet/cldm/cldm.py
ADDED
@@ -0,0 +1,435 @@
import einops
import torch
import torch as th
import torch.nn as nn

from ldm.modules.diffusionmodules.util import (
    conv_nd,
    linear,
    zero_module,
    timestep_embedding,
)

from einops import rearrange, repeat
from torchvision.utils import make_grid
from ldm.modules.attention import SpatialTransformer
from ldm.modules.diffusionmodules.openaimodel import UNetModel, TimestepEmbedSequential, ResBlock, Downsample, AttentionBlock
from ldm.models.diffusion.ddpm import LatentDiffusion
from ldm.util import log_txt_as_img, exists, instantiate_from_config
from ldm.models.diffusion.ddim import DDIMSampler


class ControlledUnetModel(UNetModel):
    def forward(self, x, timesteps=None, context=None, control=None, only_mid_control=False, **kwargs):
        hs = []
        # The locked SD encoder runs without gradients; only the ControlNet copy is trained.
        with torch.no_grad():
            t_emb = timestep_embedding(timesteps, self.model_channels, repeat_only=False)
            emb = self.time_embed(t_emb)
            h = x.type(self.dtype)
            for module in self.input_blocks:
                h = module(h, emb, context)
                hs.append(h)
            h = self.middle_block(h, emb, context)

        if control is not None:
            h += control.pop()

        # ControlNet outputs are added to the skip connections of the SD decoder.
        for i, module in enumerate(self.output_blocks):
            if only_mid_control or control is None:
                h = torch.cat([h, hs.pop()], dim=1)
            else:
                h = torch.cat([h, hs.pop() + control.pop()], dim=1)
            h = module(h, emb, context)

        h = h.type(x.dtype)
        return self.out(h)


class ControlNet(nn.Module):
    def __init__(
            self,
            image_size,
            in_channels,
            model_channels,
            hint_channels,
            num_res_blocks,
            attention_resolutions,
            dropout=0,
            channel_mult=(1, 2, 4, 8),
            conv_resample=True,
            dims=2,
            use_checkpoint=False,
            use_fp16=False,
            num_heads=-1,
            num_head_channels=-1,
            num_heads_upsample=-1,
            use_scale_shift_norm=False,
            resblock_updown=False,
            use_new_attention_order=False,
            use_spatial_transformer=False,  # custom transformer support
            transformer_depth=1,  # custom transformer support
            context_dim=None,  # custom transformer support
            n_embed=None,  # custom support for prediction of discrete ids into codebook of first stage vq model
            legacy=True,
            disable_self_attentions=None,
            num_attention_blocks=None,
            disable_middle_self_attn=False,
            use_linear_in_transformer=False,
    ):
        super().__init__()
        if use_spatial_transformer:
            assert context_dim is not None, 'Fool!! You forgot to include the dimension of your cross-attention conditioning...'

        if context_dim is not None:
            assert use_spatial_transformer, 'Fool!! You forgot to use the spatial transformer for your cross-attention conditioning...'
            from omegaconf.listconfig import ListConfig
            if type(context_dim) == ListConfig:
                context_dim = list(context_dim)

        if num_heads_upsample == -1:
            num_heads_upsample = num_heads

        if num_heads == -1:
            assert num_head_channels != -1, 'Either num_heads or num_head_channels has to be set'

        if num_head_channels == -1:
            assert num_heads != -1, 'Either num_heads or num_head_channels has to be set'

        self.dims = dims
        self.image_size = image_size
        self.in_channels = in_channels
        self.model_channels = model_channels
        if isinstance(num_res_blocks, int):
            self.num_res_blocks = len(channel_mult) * [num_res_blocks]
        else:
            if len(num_res_blocks) != len(channel_mult):
                raise ValueError("provide num_res_blocks either as an int (globally constant) or "
                                 "as a list/tuple (per-level) with the same length as channel_mult")
            self.num_res_blocks = num_res_blocks
        if disable_self_attentions is not None:
            # should be a list of booleans, indicating whether to disable self-attention in TransformerBlocks or not
            assert len(disable_self_attentions) == len(channel_mult)
        if num_attention_blocks is not None:
            assert len(num_attention_blocks) == len(self.num_res_blocks)
            assert all(map(lambda i: self.num_res_blocks[i] >= num_attention_blocks[i], range(len(num_attention_blocks))))
            print(f"Constructor of UNetModel received num_attention_blocks={num_attention_blocks}. "
                  f"This option has LESS priority than attention_resolutions {attention_resolutions}, "
                  f"i.e., in cases where num_attention_blocks[i] > 0 but 2**i not in attention_resolutions, "
                  f"attention will still not be set.")

        self.attention_resolutions = attention_resolutions
        self.dropout = dropout
        self.channel_mult = channel_mult
        self.conv_resample = conv_resample
        self.use_checkpoint = use_checkpoint
        self.dtype = th.float16 if use_fp16 else th.float32
        self.num_heads = num_heads
        self.num_head_channels = num_head_channels
        self.num_heads_upsample = num_heads_upsample
        self.predict_codebook_ids = n_embed is not None

        time_embed_dim = model_channels * 4
        self.time_embed = nn.Sequential(
            linear(model_channels, time_embed_dim),
            nn.SiLU(),
            linear(time_embed_dim, time_embed_dim),
        )

        self.input_blocks = nn.ModuleList(
            [
                TimestepEmbedSequential(
                    conv_nd(dims, in_channels, model_channels, 3, padding=1)
                )
            ]
        )
        self.zero_convs = nn.ModuleList([self.make_zero_conv(model_channels)])

        self.input_hint_block = TimestepEmbedSequential(
            conv_nd(dims, hint_channels, 16, 3, padding=1),
            nn.SiLU(),
            conv_nd(dims, 16, 16, 3, padding=1),
            nn.SiLU(),
            conv_nd(dims, 16, 32, 3, padding=1, stride=2),
            nn.SiLU(),
            conv_nd(dims, 32, 32, 3, padding=1),
            nn.SiLU(),
            conv_nd(dims, 32, 96, 3, padding=1, stride=2),
            nn.SiLU(),
            conv_nd(dims, 96, 96, 3, padding=1),
            nn.SiLU(),
            conv_nd(dims, 96, 256, 3, padding=1, stride=2),
            nn.SiLU(),
            zero_module(conv_nd(dims, 256, model_channels, 3, padding=1))
        )

        self._feature_size = model_channels
        input_block_chans = [model_channels]
        ch = model_channels
        ds = 1
        for level, mult in enumerate(channel_mult):
            for nr in range(self.num_res_blocks[level]):
                layers = [
                    ResBlock(
+
ch,
|
| 174 |
+
time_embed_dim,
|
| 175 |
+
dropout,
|
| 176 |
+
out_channels=mult * model_channels,
|
| 177 |
+
dims=dims,
|
| 178 |
+
use_checkpoint=use_checkpoint,
|
| 179 |
+
use_scale_shift_norm=use_scale_shift_norm,
|
| 180 |
+
)
|
| 181 |
+
]
|
| 182 |
+
ch = mult * model_channels
|
| 183 |
+
if ds in attention_resolutions:
|
| 184 |
+
if num_head_channels == -1:
|
| 185 |
+
dim_head = ch // num_heads
|
| 186 |
+
else:
|
| 187 |
+
num_heads = ch // num_head_channels
|
| 188 |
+
dim_head = num_head_channels
|
| 189 |
+
if legacy:
|
| 190 |
+
# num_heads = 1
|
| 191 |
+
dim_head = ch // num_heads if use_spatial_transformer else num_head_channels
|
| 192 |
+
if exists(disable_self_attentions):
|
| 193 |
+
disabled_sa = disable_self_attentions[level]
|
| 194 |
+
else:
|
| 195 |
+
disabled_sa = False
|
| 196 |
+
|
| 197 |
+
if not exists(num_attention_blocks) or nr < num_attention_blocks[level]:
|
| 198 |
+
layers.append(
|
| 199 |
+
AttentionBlock(
|
| 200 |
+
ch,
|
| 201 |
+
use_checkpoint=use_checkpoint,
|
| 202 |
+
num_heads=num_heads,
|
| 203 |
+
num_head_channels=dim_head,
|
| 204 |
+
use_new_attention_order=use_new_attention_order,
|
| 205 |
+
) if not use_spatial_transformer else SpatialTransformer(
|
| 206 |
+
ch, num_heads, dim_head, depth=transformer_depth, context_dim=context_dim,
|
| 207 |
+
disable_self_attn=disabled_sa, use_linear=use_linear_in_transformer,
|
| 208 |
+
use_checkpoint=use_checkpoint
|
| 209 |
+
)
|
| 210 |
+
)
|
| 211 |
+
self.input_blocks.append(TimestepEmbedSequential(*layers))
|
| 212 |
+
self.zero_convs.append(self.make_zero_conv(ch))
|
| 213 |
+
self._feature_size += ch
|
| 214 |
+
input_block_chans.append(ch)
|
| 215 |
+
if level != len(channel_mult) - 1:
|
| 216 |
+
out_ch = ch
|
| 217 |
+
self.input_blocks.append(
|
| 218 |
+
TimestepEmbedSequential(
|
| 219 |
+
ResBlock(
|
| 220 |
+
ch,
|
| 221 |
+
time_embed_dim,
|
| 222 |
+
dropout,
|
| 223 |
+
out_channels=out_ch,
|
| 224 |
+
dims=dims,
|
| 225 |
+
use_checkpoint=use_checkpoint,
|
| 226 |
+
use_scale_shift_norm=use_scale_shift_norm,
|
| 227 |
+
down=True,
|
| 228 |
+
)
|
| 229 |
+
if resblock_updown
|
| 230 |
+
else Downsample(
|
| 231 |
+
ch, conv_resample, dims=dims, out_channels=out_ch
|
| 232 |
+
)
|
| 233 |
+
)
|
| 234 |
+
)
|
| 235 |
+
ch = out_ch
|
| 236 |
+
input_block_chans.append(ch)
|
| 237 |
+
self.zero_convs.append(self.make_zero_conv(ch))
|
| 238 |
+
ds *= 2
|
| 239 |
+
self._feature_size += ch
|
| 240 |
+
|
| 241 |
+
if num_head_channels == -1:
|
| 242 |
+
dim_head = ch // num_heads
|
| 243 |
+
else:
|
| 244 |
+
num_heads = ch // num_head_channels
|
| 245 |
+
dim_head = num_head_channels
|
| 246 |
+
if legacy:
|
| 247 |
+
# num_heads = 1
|
| 248 |
+
dim_head = ch // num_heads if use_spatial_transformer else num_head_channels
|
| 249 |
+
self.middle_block = TimestepEmbedSequential(
|
| 250 |
+
ResBlock(
|
| 251 |
+
ch,
|
| 252 |
+
time_embed_dim,
|
| 253 |
+
dropout,
|
| 254 |
+
dims=dims,
|
| 255 |
+
use_checkpoint=use_checkpoint,
|
| 256 |
+
use_scale_shift_norm=use_scale_shift_norm,
|
| 257 |
+
),
|
| 258 |
+
AttentionBlock(
|
| 259 |
+
ch,
|
| 260 |
+
use_checkpoint=use_checkpoint,
|
| 261 |
+
num_heads=num_heads,
|
| 262 |
+
num_head_channels=dim_head,
|
| 263 |
+
use_new_attention_order=use_new_attention_order,
|
| 264 |
+
) if not use_spatial_transformer else SpatialTransformer( # always uses a self-attn
|
| 265 |
+
ch, num_heads, dim_head, depth=transformer_depth, context_dim=context_dim,
|
| 266 |
+
disable_self_attn=disable_middle_self_attn, use_linear=use_linear_in_transformer,
|
| 267 |
+
use_checkpoint=use_checkpoint
|
| 268 |
+
),
|
| 269 |
+
ResBlock(
|
| 270 |
+
ch,
|
| 271 |
+
time_embed_dim,
|
| 272 |
+
dropout,
|
| 273 |
+
dims=dims,
|
| 274 |
+
use_checkpoint=use_checkpoint,
|
| 275 |
+
use_scale_shift_norm=use_scale_shift_norm,
|
| 276 |
+
),
|
| 277 |
+
)
|
| 278 |
+
self.middle_block_out = self.make_zero_conv(ch)
|
| 279 |
+
self._feature_size += ch
|
| 280 |
+
|
| 281 |
+
def make_zero_conv(self, channels):
|
| 282 |
+
return TimestepEmbedSequential(zero_module(conv_nd(self.dims, channels, channels, 1, padding=0)))
|
| 283 |
+
|
| 284 |
+
def forward(self, x, hint, timesteps, context, **kwargs):
|
| 285 |
+
t_emb = timestep_embedding(timesteps, self.model_channels, repeat_only=False)
|
| 286 |
+
emb = self.time_embed(t_emb)
|
| 287 |
+
|
| 288 |
+
guided_hint = self.input_hint_block(hint, emb, context)
|
| 289 |
+
|
| 290 |
+
outs = []
|
| 291 |
+
|
| 292 |
+
h = x.type(self.dtype)
|
| 293 |
+
for module, zero_conv in zip(self.input_blocks, self.zero_convs):
|
| 294 |
+
if guided_hint is not None:
|
| 295 |
+
h = module(h, emb, context)
|
| 296 |
+
h += guided_hint
|
| 297 |
+
guided_hint = None
|
| 298 |
+
else:
|
| 299 |
+
h = module(h, emb, context)
|
| 300 |
+
outs.append(zero_conv(h, emb, context))
|
| 301 |
+
|
| 302 |
+
h = self.middle_block(h, emb, context)
|
| 303 |
+
outs.append(self.middle_block_out(h, emb, context))
|
| 304 |
+
|
| 305 |
+
return outs
|
| 306 |
+
|
| 307 |
+
|
| 308 |
+
class ControlLDM(LatentDiffusion):
|
| 309 |
+
|
| 310 |
+
def __init__(self, control_stage_config, control_key, only_mid_control, *args, **kwargs):
|
| 311 |
+
super().__init__(*args, **kwargs)
|
| 312 |
+
self.control_model = instantiate_from_config(control_stage_config)
|
| 313 |
+
self.control_key = control_key
|
| 314 |
+
self.only_mid_control = only_mid_control
|
| 315 |
+
self.control_scales = [1.0] * 13
|
| 316 |
+
|
| 317 |
+
@torch.no_grad()
|
| 318 |
+
def get_input(self, batch, k, bs=None, *args, **kwargs):
|
| 319 |
+
x, c = super().get_input(batch, self.first_stage_key, *args, **kwargs)
|
| 320 |
+
control = batch[self.control_key]
|
| 321 |
+
if bs is not None:
|
| 322 |
+
control = control[:bs]
|
| 323 |
+
control = control.to(self.device)
|
| 324 |
+
control = einops.rearrange(control, 'b h w c -> b c h w')
|
| 325 |
+
control = control.to(memory_format=torch.contiguous_format).float()
|
| 326 |
+
return x, dict(c_crossattn=[c], c_concat=[control])
|
| 327 |
+
|
| 328 |
+
def apply_model(self, x_noisy, t, cond, *args, **kwargs):
|
| 329 |
+
assert isinstance(cond, dict)
|
| 330 |
+
diffusion_model = self.model.diffusion_model
|
| 331 |
+
|
| 332 |
+
cond_txt = torch.cat(cond['c_crossattn'], 1)
|
| 333 |
+
|
| 334 |
+
if cond['c_concat'] is None:
|
| 335 |
+
eps = diffusion_model(x=x_noisy, timesteps=t, context=cond_txt, control=None, only_mid_control=self.only_mid_control)
|
| 336 |
+
else:
|
| 337 |
+
control = self.control_model(x=x_noisy, hint=torch.cat(cond['c_concat'], 1), timesteps=t, context=cond_txt)
|
| 338 |
+
control = [c * scale for c, scale in zip(control, self.control_scales)]
|
| 339 |
+
eps = diffusion_model(x=x_noisy, timesteps=t, context=cond_txt, control=control, only_mid_control=self.only_mid_control)
|
| 340 |
+
|
| 341 |
+
return eps
|
| 342 |
+
|
| 343 |
+
@torch.no_grad()
|
| 344 |
+
def get_unconditional_conditioning(self, N):
|
| 345 |
+
return self.get_learned_conditioning([""] * N)
|
| 346 |
+
|
| 347 |
+
@torch.no_grad()
|
| 348 |
+
def log_images(self, batch, N=4, n_row=2, sample=False, ddim_steps=50, ddim_eta=0.0, return_keys=None,
|
| 349 |
+
quantize_denoised=True, inpaint=True, plot_denoise_rows=False, plot_progressive_rows=True,
|
| 350 |
+
plot_diffusion_rows=False, unconditional_guidance_scale=9.0, unconditional_guidance_label=None,
|
| 351 |
+
use_ema_scope=True,
|
| 352 |
+
**kwargs):
|
| 353 |
+
use_ddim = ddim_steps is not None
|
| 354 |
+
|
| 355 |
+
log = dict()
|
| 356 |
+
z, c = self.get_input(batch, self.first_stage_key, bs=N)
|
| 357 |
+
c_cat, c = c["c_concat"][0][:N], c["c_crossattn"][0][:N]
|
| 358 |
+
N = min(z.shape[0], N)
|
| 359 |
+
n_row = min(z.shape[0], n_row)
|
| 360 |
+
log["reconstruction"] = self.decode_first_stage(z)
|
| 361 |
+
log["control"] = c_cat * 2.0 - 1.0
|
| 362 |
+
log["conditioning"] = log_txt_as_img((512, 512), batch[self.cond_stage_key], size=16)
|
| 363 |
+
|
| 364 |
+
if plot_diffusion_rows:
|
| 365 |
+
# get diffusion row
|
| 366 |
+
diffusion_row = list()
|
| 367 |
+
z_start = z[:n_row]
|
| 368 |
+
for t in range(self.num_timesteps):
|
| 369 |
+
if t % self.log_every_t == 0 or t == self.num_timesteps - 1:
|
| 370 |
+
t = repeat(torch.tensor([t]), '1 -> b', b=n_row)
|
| 371 |
+
t = t.to(self.device).long()
|
| 372 |
+
noise = torch.randn_like(z_start)
|
| 373 |
+
z_noisy = self.q_sample(x_start=z_start, t=t, noise=noise)
|
| 374 |
+
diffusion_row.append(self.decode_first_stage(z_noisy))
|
| 375 |
+
|
| 376 |
+
diffusion_row = torch.stack(diffusion_row) # n_log_step, n_row, C, H, W
|
| 377 |
+
diffusion_grid = rearrange(diffusion_row, 'n b c h w -> b n c h w')
|
| 378 |
+
diffusion_grid = rearrange(diffusion_grid, 'b n c h w -> (b n) c h w')
|
| 379 |
+
diffusion_grid = make_grid(diffusion_grid, nrow=diffusion_row.shape[0])
|
| 380 |
+
log["diffusion_row"] = diffusion_grid
|
| 381 |
+
|
| 382 |
+
if sample:
|
| 383 |
+
# get denoise row
|
| 384 |
+
samples, z_denoise_row = self.sample_log(cond={"c_concat": [c_cat], "c_crossattn": [c]},
|
| 385 |
+
batch_size=N, ddim=use_ddim,
|
| 386 |
+
ddim_steps=ddim_steps, eta=ddim_eta)
|
| 387 |
+
x_samples = self.decode_first_stage(samples)
|
| 388 |
+
log["samples"] = x_samples
|
| 389 |
+
if plot_denoise_rows:
|
| 390 |
+
denoise_grid = self._get_denoise_row_from_list(z_denoise_row)
|
| 391 |
+
log["denoise_row"] = denoise_grid
|
| 392 |
+
|
| 393 |
+
if unconditional_guidance_scale > 1.0:
|
| 394 |
+
uc_cross = self.get_unconditional_conditioning(N)
|
| 395 |
+
uc_cat = c_cat # torch.zeros_like(c_cat)
|
| 396 |
+
uc_full = {"c_concat": [uc_cat], "c_crossattn": [uc_cross]}
|
| 397 |
+
samples_cfg, _ = self.sample_log(cond={"c_concat": [c_cat], "c_crossattn": [c]},
|
| 398 |
+
batch_size=N, ddim=use_ddim,
|
| 399 |
+
ddim_steps=ddim_steps, eta=ddim_eta,
|
| 400 |
+
unconditional_guidance_scale=unconditional_guidance_scale,
|
| 401 |
+
unconditional_conditioning=uc_full,
|
| 402 |
+
)
|
| 403 |
+
x_samples_cfg = self.decode_first_stage(samples_cfg)
|
| 404 |
+
log[f"samples_cfg_scale_{unconditional_guidance_scale:.2f}"] = x_samples_cfg
|
| 405 |
+
|
| 406 |
+
return log
|
| 407 |
+
|
| 408 |
+
@torch.no_grad()
|
| 409 |
+
def sample_log(self, cond, batch_size, ddim, ddim_steps, **kwargs):
|
| 410 |
+
ddim_sampler = DDIMSampler(self)
|
| 411 |
+
b, c, h, w = cond["c_concat"][0].shape
|
| 412 |
+
shape = (self.channels, h // 8, w // 8)
|
| 413 |
+
samples, intermediates = ddim_sampler.sample(ddim_steps, batch_size, shape, cond, verbose=False, **kwargs)
|
| 414 |
+
return samples, intermediates
|
| 415 |
+
|
| 416 |
+
def configure_optimizers(self):
|
| 417 |
+
lr = self.learning_rate
|
| 418 |
+
params = list(self.control_model.parameters())
|
| 419 |
+
if not self.sd_locked:
|
| 420 |
+
params += list(self.model.diffusion_model.output_blocks.parameters())
|
| 421 |
+
params += list(self.model.diffusion_model.out.parameters())
|
| 422 |
+
opt = torch.optim.AdamW(params, lr=lr)
|
| 423 |
+
return opt
|
| 424 |
+
|
| 425 |
+
def low_vram_shift(self, is_diffusing):
|
| 426 |
+
if is_diffusing:
|
| 427 |
+
self.model = self.model.cuda()
|
| 428 |
+
self.control_model = self.control_model.cuda()
|
| 429 |
+
self.first_stage_model = self.first_stage_model.cpu()
|
| 430 |
+
self.cond_stage_model = self.cond_stage_model.cpu()
|
| 431 |
+
else:
|
| 432 |
+
self.model = self.model.cpu()
|
| 433 |
+
self.control_model = self.control_model.cpu()
|
| 434 |
+
self.first_stage_model = self.first_stage_model.cuda()
|
| 435 |
+
self.cond_stage_model = self.cond_stage_model.cuda()
|
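The `make_zero_conv` / `zero_module` pattern above initializes every control-branch output projection to zero, so at the first training step the ControlNet contributes exactly nothing to the frozen UNet and training starts from the unmodified Stable Diffusion behavior. A minimal numpy sketch of that property (the `zero_conv_1x1` helper below is hypothetical, for illustration only; it is not part of the repository):

```python
import numpy as np

def zero_conv_1x1(channels):
    # A 1x1 convolution whose weights and bias start at zero,
    # mirroring zero_module(conv_nd(dims, channels, channels, 1, padding=0)).
    w = np.zeros((channels, channels))  # (out_channels, in_channels)
    b = np.zeros(channels)

    def apply(x):  # x: (channels, height, width)
        c, h, wd = x.shape
        return (w @ x.reshape(c, -1)).reshape(c, h, wd) + b[:, None, None]

    return apply

conv = zero_conv_1x1(4)
x = np.random.randn(4, 8, 8)
y = conv(x)
# Before any gradient update, the residual added by the control branch is exactly zero.
assert np.allclose(y, 0.0)
```

Gradients still flow through the zero-initialized weights, so the branch can learn a nonzero contribution; only its initial output is pinned at zero.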
generation/control/ControlNet/cldm/ddim_hacked.py
ADDED
@@ -0,0 +1,317 @@
"""SAMPLING ONLY."""

import torch
import numpy as np
from tqdm import tqdm

from ldm.modules.diffusionmodules.util import make_ddim_sampling_parameters, make_ddim_timesteps, noise_like, extract_into_tensor


class DDIMSampler(object):
    def __init__(self, model, schedule="linear", **kwargs):
        super().__init__()
        self.model = model
        self.ddpm_num_timesteps = model.num_timesteps
        self.schedule = schedule

    def register_buffer(self, name, attr):
        if type(attr) == torch.Tensor:
            if attr.device != torch.device("cuda"):
                attr = attr.to(torch.device("cuda"))
        setattr(self, name, attr)

    def make_schedule(self, ddim_num_steps, ddim_discretize="uniform", ddim_eta=0., verbose=True):
        self.ddim_timesteps = make_ddim_timesteps(ddim_discr_method=ddim_discretize, num_ddim_timesteps=ddim_num_steps,
                                                  num_ddpm_timesteps=self.ddpm_num_timesteps, verbose=verbose)
        alphas_cumprod = self.model.alphas_cumprod
        assert alphas_cumprod.shape[0] == self.ddpm_num_timesteps, 'alphas have to be defined for each timestep'
        to_torch = lambda x: x.clone().detach().to(torch.float32).to(self.model.device)

        self.register_buffer('betas', to_torch(self.model.betas))
        self.register_buffer('alphas_cumprod', to_torch(alphas_cumprod))
        self.register_buffer('alphas_cumprod_prev', to_torch(self.model.alphas_cumprod_prev))

        # calculations for diffusion q(x_t | x_{t-1}) and others
        self.register_buffer('sqrt_alphas_cumprod', to_torch(np.sqrt(alphas_cumprod.cpu())))
        self.register_buffer('sqrt_one_minus_alphas_cumprod', to_torch(np.sqrt(1. - alphas_cumprod.cpu())))
        self.register_buffer('log_one_minus_alphas_cumprod', to_torch(np.log(1. - alphas_cumprod.cpu())))
        self.register_buffer('sqrt_recip_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod.cpu())))
        self.register_buffer('sqrt_recipm1_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod.cpu() - 1)))

        # ddim sampling parameters
        ddim_sigmas, ddim_alphas, ddim_alphas_prev = make_ddim_sampling_parameters(alphacums=alphas_cumprod.cpu(),
                                                                                   ddim_timesteps=self.ddim_timesteps,
                                                                                   eta=ddim_eta, verbose=verbose)
        self.register_buffer('ddim_sigmas', ddim_sigmas)
        self.register_buffer('ddim_alphas', ddim_alphas)
        self.register_buffer('ddim_alphas_prev', ddim_alphas_prev)
        self.register_buffer('ddim_sqrt_one_minus_alphas', np.sqrt(1. - ddim_alphas))
        sigmas_for_original_sampling_steps = ddim_eta * torch.sqrt(
            (1 - self.alphas_cumprod_prev) / (1 - self.alphas_cumprod) * (
                    1 - self.alphas_cumprod / self.alphas_cumprod_prev))
        self.register_buffer('ddim_sigmas_for_original_num_steps', sigmas_for_original_sampling_steps)

    @torch.no_grad()
    def sample(self,
               S,
               batch_size,
               shape,
               conditioning=None,
               callback=None,
               normals_sequence=None,
               img_callback=None,
               quantize_x0=False,
               eta=0.,
               mask=None,
               x0=None,
               temperature=1.,
               noise_dropout=0.,
               score_corrector=None,
               corrector_kwargs=None,
               verbose=True,
               x_T=None,
               log_every_t=100,
               unconditional_guidance_scale=1.,
               unconditional_conditioning=None,  # this has to come in the same format as the conditioning, e.g. as encoded tokens, ...
               dynamic_threshold=None,
               ucg_schedule=None,
               **kwargs
               ):
        if conditioning is not None:
            if isinstance(conditioning, dict):
                ctmp = conditioning[list(conditioning.keys())[0]]
                while isinstance(ctmp, list):
                    ctmp = ctmp[0]
                cbs = ctmp.shape[0]
                if cbs != batch_size:
                    print(f"Warning: Got {cbs} conditionings but batch-size is {batch_size}")

            elif isinstance(conditioning, list):
                for ctmp in conditioning:
                    if ctmp.shape[0] != batch_size:
                        print(f"Warning: Got {ctmp.shape[0]} conditionings but batch-size is {batch_size}")

            else:
                if conditioning.shape[0] != batch_size:
                    print(f"Warning: Got {conditioning.shape[0]} conditionings but batch-size is {batch_size}")

        self.make_schedule(ddim_num_steps=S, ddim_eta=eta, verbose=verbose)
        # sampling
        C, H, W = shape
        size = (batch_size, C, H, W)
        print(f'Data shape for DDIM sampling is {size}, eta {eta}')

        samples, intermediates = self.ddim_sampling(conditioning, size,
                                                    callback=callback,
                                                    img_callback=img_callback,
                                                    quantize_denoised=quantize_x0,
                                                    mask=mask, x0=x0,
                                                    ddim_use_original_steps=False,
                                                    noise_dropout=noise_dropout,
                                                    temperature=temperature,
                                                    score_corrector=score_corrector,
                                                    corrector_kwargs=corrector_kwargs,
                                                    x_T=x_T,
                                                    log_every_t=log_every_t,
                                                    unconditional_guidance_scale=unconditional_guidance_scale,
                                                    unconditional_conditioning=unconditional_conditioning,
                                                    dynamic_threshold=dynamic_threshold,
                                                    ucg_schedule=ucg_schedule
                                                    )
        return samples, intermediates

    @torch.no_grad()
    def ddim_sampling(self, cond, shape,
                      x_T=None, ddim_use_original_steps=False,
                      callback=None, timesteps=None, quantize_denoised=False,
                      mask=None, x0=None, img_callback=None, log_every_t=100,
                      temperature=1., noise_dropout=0., score_corrector=None, corrector_kwargs=None,
                      unconditional_guidance_scale=1., unconditional_conditioning=None, dynamic_threshold=None,
                      ucg_schedule=None):
        device = self.model.betas.device
        b = shape[0]
        if x_T is None:
            img = torch.randn(shape, device=device)
        else:
            img = x_T

        if timesteps is None:
            timesteps = self.ddpm_num_timesteps if ddim_use_original_steps else self.ddim_timesteps
        elif timesteps is not None and not ddim_use_original_steps:
            subset_end = int(min(timesteps / self.ddim_timesteps.shape[0], 1) * self.ddim_timesteps.shape[0]) - 1
            timesteps = self.ddim_timesteps[:subset_end]

        intermediates = {'x_inter': [img], 'pred_x0': [img]}
        time_range = reversed(range(0, timesteps)) if ddim_use_original_steps else np.flip(timesteps)
        total_steps = timesteps if ddim_use_original_steps else timesteps.shape[0]
        print(f"Running DDIM Sampling with {total_steps} timesteps")

        iterator = tqdm(time_range, desc='DDIM Sampler', total=total_steps)

        for i, step in enumerate(iterator):
            index = total_steps - i - 1
            ts = torch.full((b,), step, device=device, dtype=torch.long)

            if mask is not None:
                assert x0 is not None
                img_orig = self.model.q_sample(x0, ts)  # TODO: deterministic forward pass?
                img = img_orig * mask + (1. - mask) * img

            if ucg_schedule is not None:
                assert len(ucg_schedule) == len(time_range)
                unconditional_guidance_scale = ucg_schedule[i]

            outs = self.p_sample_ddim(img, cond, ts, index=index, use_original_steps=ddim_use_original_steps,
                                      quantize_denoised=quantize_denoised, temperature=temperature,
                                      noise_dropout=noise_dropout, score_corrector=score_corrector,
                                      corrector_kwargs=corrector_kwargs,
                                      unconditional_guidance_scale=unconditional_guidance_scale,
                                      unconditional_conditioning=unconditional_conditioning,
                                      dynamic_threshold=dynamic_threshold)
            img, pred_x0 = outs
            if callback: callback(i)
            if img_callback: img_callback(pred_x0, i)

            if index % log_every_t == 0 or index == total_steps - 1:
                intermediates['x_inter'].append(img)
                intermediates['pred_x0'].append(pred_x0)

        return img, intermediates

    @torch.no_grad()
    def p_sample_ddim(self, x, c, t, index, repeat_noise=False, use_original_steps=False, quantize_denoised=False,
                      temperature=1., noise_dropout=0., score_corrector=None, corrector_kwargs=None,
                      unconditional_guidance_scale=1., unconditional_conditioning=None,
                      dynamic_threshold=None):
        b, *_, device = *x.shape, x.device

        if unconditional_conditioning is None or unconditional_guidance_scale == 1.:
            model_output = self.model.apply_model(x, t, c)
        else:
            model_t = self.model.apply_model(x, t, c)
            model_uncond = self.model.apply_model(x, t, unconditional_conditioning)
            model_output = model_uncond + unconditional_guidance_scale * (model_t - model_uncond)

        if self.model.parameterization == "v":
            e_t = self.model.predict_eps_from_z_and_v(x, t, model_output)
        else:
            e_t = model_output

        if score_corrector is not None:
            assert self.model.parameterization == "eps", 'not implemented'
            e_t = score_corrector.modify_score(self.model, e_t, x, t, c, **corrector_kwargs)

        alphas = self.model.alphas_cumprod if use_original_steps else self.ddim_alphas
        alphas_prev = self.model.alphas_cumprod_prev if use_original_steps else self.ddim_alphas_prev
        sqrt_one_minus_alphas = self.model.sqrt_one_minus_alphas_cumprod if use_original_steps else self.ddim_sqrt_one_minus_alphas
        sigmas = self.model.ddim_sigmas_for_original_num_steps if use_original_steps else self.ddim_sigmas
        # select parameters corresponding to the currently considered timestep
        a_t = torch.full((b, 1, 1, 1), alphas[index], device=device)
        a_prev = torch.full((b, 1, 1, 1), alphas_prev[index], device=device)
        sigma_t = torch.full((b, 1, 1, 1), sigmas[index], device=device)
        sqrt_one_minus_at = torch.full((b, 1, 1, 1), sqrt_one_minus_alphas[index], device=device)

        # current prediction for x_0
        if self.model.parameterization != "v":
            pred_x0 = (x - sqrt_one_minus_at * e_t) / a_t.sqrt()
        else:
            pred_x0 = self.model.predict_start_from_z_and_v(x, t, model_output)

        if quantize_denoised:
            pred_x0, _, *_ = self.model.first_stage_model.quantize(pred_x0)

        if dynamic_threshold is not None:
            raise NotImplementedError()

        # direction pointing to x_t
        dir_xt = (1. - a_prev - sigma_t ** 2).sqrt() * e_t
        noise = sigma_t * noise_like(x.shape, device, repeat_noise) * temperature
        if noise_dropout > 0.:
            noise = torch.nn.functional.dropout(noise, p=noise_dropout)
        x_prev = a_prev.sqrt() * pred_x0 + dir_xt + noise
        return x_prev, pred_x0

    @torch.no_grad()
    def encode(self, x0, c, t_enc, use_original_steps=False, return_intermediates=None,
               unconditional_guidance_scale=1.0, unconditional_conditioning=None, callback=None):
        timesteps = np.arange(self.ddpm_num_timesteps) if use_original_steps else self.ddim_timesteps
        num_reference_steps = timesteps.shape[0]

        assert t_enc <= num_reference_steps
        num_steps = t_enc

        if use_original_steps:
            alphas_next = self.alphas_cumprod[:num_steps]
            alphas = self.alphas_cumprod_prev[:num_steps]
        else:
            alphas_next = self.ddim_alphas[:num_steps]
            alphas = torch.tensor(self.ddim_alphas_prev[:num_steps])

        x_next = x0
        intermediates = []
        inter_steps = []
        for i in tqdm(range(num_steps), desc='Encoding Image'):
            t = torch.full((x0.shape[0],), timesteps[i], device=self.model.device, dtype=torch.long)
            if unconditional_guidance_scale == 1.:
                noise_pred = self.model.apply_model(x_next, t, c)
            else:
                assert unconditional_conditioning is not None
                e_t_uncond, noise_pred = torch.chunk(
                    self.model.apply_model(torch.cat((x_next, x_next)), torch.cat((t, t)),
                                           torch.cat((unconditional_conditioning, c))), 2)
                noise_pred = e_t_uncond + unconditional_guidance_scale * (noise_pred - e_t_uncond)

            xt_weighted = (alphas_next[i] / alphas[i]).sqrt() * x_next
            weighted_noise_pred = alphas_next[i].sqrt() * (
                    (1 / alphas_next[i] - 1).sqrt() - (1 / alphas[i] - 1).sqrt()) * noise_pred
            x_next = xt_weighted + weighted_noise_pred
            if return_intermediates and i % (
                    num_steps // return_intermediates) == 0 and i < num_steps - 1:
                intermediates.append(x_next)
                inter_steps.append(i)
            elif return_intermediates and i >= num_steps - 2:
                intermediates.append(x_next)
                inter_steps.append(i)
            if callback: callback(i)

        out = {'x_encoded': x_next, 'intermediate_steps': inter_steps}
        if return_intermediates:
            out.update({'intermediates': intermediates})
        return x_next, out

    @torch.no_grad()
    def stochastic_encode(self, x0, t, use_original_steps=False, noise=None):
        # fast, but does not allow for exact reconstruction
        # t serves as an index to gather the correct alphas
        if use_original_steps:
            sqrt_alphas_cumprod = self.sqrt_alphas_cumprod
            sqrt_one_minus_alphas_cumprod = self.sqrt_one_minus_alphas_cumprod
        else:
            sqrt_alphas_cumprod = torch.sqrt(self.ddim_alphas)
            sqrt_one_minus_alphas_cumprod = self.ddim_sqrt_one_minus_alphas

        if noise is None:
            noise = torch.randn_like(x0)
        return (extract_into_tensor(sqrt_alphas_cumprod, t, x0.shape) * x0 +
                extract_into_tensor(sqrt_one_minus_alphas_cumprod, t, x0.shape) * noise)

    @torch.no_grad()
    def decode(self, x_latent, cond, t_start, unconditional_guidance_scale=1.0, unconditional_conditioning=None,
               use_original_steps=False, callback=None):

        timesteps = np.arange(self.ddpm_num_timesteps) if use_original_steps else self.ddim_timesteps
        timesteps = timesteps[:t_start]

        time_range = np.flip(timesteps)
        total_steps = timesteps.shape[0]
        print(f"Running DDIM Sampling with {total_steps} timesteps")

        iterator = tqdm(time_range, desc='Decoding image', total=total_steps)
        x_dec = x_latent
        for i, step in enumerate(iterator):
            index = total_steps - i - 1
            ts = torch.full((x_latent.shape[0],), step, device=x_latent.device, dtype=torch.long)
            x_dec, _ = self.p_sample_ddim(x_dec, cond, ts, index=index, use_original_steps=use_original_steps,
                                          unconditional_guidance_scale=unconditional_guidance_scale,
                                          unconditional_conditioning=unconditional_conditioning)
            if callback: callback(i)
        return x_dec
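The core update in `p_sample_ddim` above can be sketched in plain numpy for the deterministic case (eta = 0, eps-parameterization). Here `a_t` and `a_prev` stand for the cumulative alphas selected for the current and previous DDIM step, and `e_t` is the model's noise prediction; this sketch uses stand-in arrays instead of running the network:

```python
import numpy as np

def ddim_step(x, e_t, a_t, a_prev):
    # Deterministic DDIM update (sigma_t = 0), mirroring p_sample_ddim:
    pred_x0 = (x - np.sqrt(1.0 - a_t) * e_t) / np.sqrt(a_t)  # current prediction for x_0
    dir_xt = np.sqrt(1.0 - a_prev) * e_t                     # direction pointing to x_t
    return np.sqrt(a_prev) * pred_x0 + dir_xt

x = np.random.randn(4, 4)
# With a zero noise prediction, the step just rescales x by sqrt(a_prev / a_t).
out = ddim_step(x, np.zeros_like(x), a_t=0.5, a_prev=0.8)
assert np.allclose(out, np.sqrt(0.8 / 0.5) * x)
```

A useful sanity check on the algebra: when `a_t == a_prev`, the subtracted and re-added noise terms cancel and the step returns `x` unchanged, regardless of `e_t`.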
generation/control/ControlNet/cldm/hack.py
ADDED
@@ -0,0 +1,111 @@
import torch
import einops

import ldm.modules.encoders.modules
import ldm.modules.attention

from transformers import logging
from ldm.modules.attention import default


def disable_verbosity():
    logging.set_verbosity_error()
    print('logging improved.')
    return


def enable_sliced_attention():
    ldm.modules.attention.CrossAttention.forward = _hacked_sliced_attentin_forward
    print('Enabled sliced_attention.')
    return


def hack_everything(clip_skip=0):
    disable_verbosity()
    ldm.modules.encoders.modules.FrozenCLIPEmbedder.forward = _hacked_clip_forward
    ldm.modules.encoders.modules.FrozenCLIPEmbedder.clip_skip = clip_skip
    print('Enabled clip hacks.')
    return


# Written by Lvmin
def _hacked_clip_forward(self, text):
    PAD = self.tokenizer.pad_token_id
    EOS = self.tokenizer.eos_token_id
    BOS = self.tokenizer.bos_token_id

    def tokenize(t):
        return self.tokenizer(t, truncation=False, add_special_tokens=False)["input_ids"]

    def transformer_encode(t):
        if self.clip_skip > 1:
            rt = self.transformer(input_ids=t, output_hidden_states=True)
            return self.transformer.text_model.final_layer_norm(rt.hidden_states[-self.clip_skip])
        else:
            return self.transformer(input_ids=t, output_hidden_states=False).last_hidden_state

    def split(x):
        return x[75 * 0: 75 * 1], x[75 * 1: 75 * 2], x[75 * 2: 75 * 3]

    def pad(x, p, i):
        return x[:i] if len(x) >= i else x + [p] * (i - len(x))

    raw_tokens_list = tokenize(text)
    tokens_list = []

    for raw_tokens in raw_tokens_list:
        raw_tokens_123 = split(raw_tokens)
        raw_tokens_123 = [[BOS] + raw_tokens_i + [EOS] for raw_tokens_i in raw_tokens_123]
        raw_tokens_123 = [pad(raw_tokens_i, PAD, 77) for raw_tokens_i in raw_tokens_123]
        tokens_list.append(raw_tokens_123)

    tokens_list = torch.IntTensor(tokens_list).to(self.device)

    feed = einops.rearrange(tokens_list, 'b f i -> (b f) i')
    y = transformer_encode(feed)
    z = einops.rearrange(y, '(b f) i c -> b (f i) c', f=3)

    return z


# Stolen from https://github.com/basujindal/stable-diffusion/blob/main/optimizedSD/splitAttention.py
def _hacked_sliced_attentin_forward(self, x, context=None, mask=None):
    h = self.heads

    q = self.to_q(x)
    context = default(context, x)
    k = self.to_k(context)
    v = self.to_v(context)
    del context, x

    q, k, v = map(lambda t: einops.rearrange(t, 'b n (h d) -> (b h) n d', h=h), (q, k, v))

    limit = k.shape[0]
    att_step = 1
    q_chunks = list(torch.tensor_split(q, limit // att_step, dim=0))
    k_chunks = list(torch.tensor_split(k, limit // att_step, dim=0))
    v_chunks = list(torch.tensor_split(v, limit // att_step, dim=0))

    q_chunks.reverse()
    k_chunks.reverse()
    v_chunks.reverse()
    sim = torch.zeros(q.shape[0], q.shape[1], v.shape[2], device=q.device)
    del k, q, v
    for i in range(0, limit, att_step):
        q_buffer = q_chunks.pop()
        k_buffer = k_chunks.pop()
        v_buffer = v_chunks.pop()
        sim_buffer = torch.einsum('b i d, b j d -> b i j', q_buffer, k_buffer) * self.scale

        del k_buffer, q_buffer
        # attention, what we cannot get enough of, by chunks

        sim_buffer = sim_buffer.softmax(dim=-1)

        sim_buffer = torch.einsum('b i j, b j d -> b i d', sim_buffer, v_buffer)
        del v_buffer
        sim[i:i + att_step, :, :] = sim_buffer

        del sim_buffer
    sim = einops.rearrange(sim, '(b h) n d -> b n (h d)', h=h)
    return self.to_out(sim)
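The `split`/`pad` helpers inside `_hacked_clip_forward` are what allow prompts of up to 225 tokens: the token stream is cut into three 75-token windows, each wrapped with BOS/EOS and padded to CLIP's 77-token context. A standalone sketch of that logic (the special-token ids here are placeholders, not the real CLIP values):

```python
def split(x):
    # three consecutive 75-token windows, as in _hacked_clip_forward
    return x[75 * 0: 75 * 1], x[75 * 1: 75 * 2], x[75 * 2: 75 * 3]

def pad(x, p, i):
    # truncate to length i, or right-pad with p
    return x[:i] if len(x) >= i else x + [p] * (i - len(x))

BOS, EOS, PAD = -1, -2, 0            # placeholder ids
raw_tokens = list(range(100))        # a hypothetical 100-token prompt

chunks = [pad([BOS] + c + [EOS], PAD, 77) for c in split(raw_tokens)]

print([len(c) for c in chunks])  # [77, 77, 77] - three full CLIP contexts
print(chunks[1][1])              # 75 - the second window starts at token 75
```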
generation/control/ControlNet/cldm/logger.py
ADDED
@@ -0,0 +1,76 @@
import os

import numpy as np
import torch
import torchvision
from PIL import Image
from pytorch_lightning.callbacks import Callback
from pytorch_lightning.utilities.distributed import rank_zero_only


class ImageLogger(Callback):
    def __init__(self, batch_frequency=2000, max_images=4, clamp=True, increase_log_steps=True,
                 rescale=True, disabled=False, log_on_batch_idx=False, log_first_step=False,
                 log_images_kwargs=None):
        super().__init__()
        self.rescale = rescale
        self.batch_freq = batch_frequency
        self.max_images = max_images
        if not increase_log_steps:
            self.log_steps = [self.batch_freq]
        self.clamp = clamp
        self.disabled = disabled
        self.log_on_batch_idx = log_on_batch_idx
        self.log_images_kwargs = log_images_kwargs if log_images_kwargs else {}
        self.log_first_step = log_first_step

    @rank_zero_only
    def log_local(self, save_dir, split, images, global_step, current_epoch, batch_idx):
        root = os.path.join(save_dir, "image_log", split)
        for k in images:
            grid = torchvision.utils.make_grid(images[k], nrow=4)
            if self.rescale:
                grid = (grid + 1.0) / 2.0  # -1,1 -> 0,1; c,h,w
            grid = grid.transpose(0, 1).transpose(1, 2).squeeze(-1)
            grid = grid.numpy()
            grid = (grid * 255).astype(np.uint8)
            filename = "{}_gs-{:06}_e-{:06}_b-{:06}.png".format(k, global_step, current_epoch, batch_idx)
            path = os.path.join(root, filename)
            os.makedirs(os.path.split(path)[0], exist_ok=True)
            Image.fromarray(grid).save(path)

    def log_img(self, pl_module, batch, batch_idx, split="train"):
        check_idx = batch_idx  # if self.log_on_batch_idx else pl_module.global_step
        if (self.check_frequency(check_idx) and  # batch_idx % self.batch_freq == 0
                hasattr(pl_module, "log_images") and
                callable(pl_module.log_images) and
                self.max_images > 0):
            logger = type(pl_module.logger)

            is_train = pl_module.training
            if is_train:
                pl_module.eval()

            with torch.no_grad():
                images = pl_module.log_images(batch, split=split, **self.log_images_kwargs)

            for k in images:
                N = min(images[k].shape[0], self.max_images)
                images[k] = images[k][:N]
                if isinstance(images[k], torch.Tensor):
                    images[k] = images[k].detach().cpu()
                    if self.clamp:
                        images[k] = torch.clamp(images[k], -1., 1.)

            self.log_local(pl_module.logger.save_dir, split, images,
                           pl_module.global_step, pl_module.current_epoch, batch_idx)

            if is_train:
                pl_module.train()

    def check_frequency(self, check_idx):
        return check_idx % self.batch_freq == 0

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx, dataloader_idx):
        if not self.disabled:
            self.log_img(pl_module, batch, batch_idx, split="train")
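As a quick check of the zero-padded filename pattern used in `log_local` above, the format string produces fixed-width names that sort chronologically:

```python
# Same format string as log_local, with sample values.
filename = "{}_gs-{:06}_e-{:06}_b-{:06}.png".format("samples", 4000, 2, 999)
print(filename)  # samples_gs-004000_e-000002_b-000999.png
```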
generation/control/ControlNet/cldm/model.py
ADDED
@@ -0,0 +1,28 @@
import os
import torch

from omegaconf import OmegaConf
from ldm.util import instantiate_from_config


def get_state_dict(d):
    return d.get('state_dict', d)


def load_state_dict(ckpt_path, location='cpu'):
    _, extension = os.path.splitext(ckpt_path)
    if extension.lower() == ".safetensors":
        import safetensors.torch
        state_dict = safetensors.torch.load_file(ckpt_path, device=location)
    else:
        state_dict = get_state_dict(torch.load(ckpt_path, map_location=torch.device(location)))
    state_dict = get_state_dict(state_dict)
    print(f'Loaded state_dict from [{ckpt_path}]')
    return state_dict


def create_model(config_path):
    config = OmegaConf.load(config_path)
    model = instantiate_from_config(config.model).cpu()
    print(f'Loaded model config from [{config_path}]')
    return model
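Note that `get_state_dict` is applied twice in `load_state_dict`, so both plain state dicts and Lightning-style checkpoints (even doubly wrapped ones) unwrap to the same object; a quick illustration with dummy dicts:

```python
def get_state_dict(d):
    # same one-liner as in cldm/model.py
    return d.get('state_dict', d)

plain = {'conv.weight': [1.0]}
wrapped = {'state_dict': plain, 'epoch': 3}   # Lightning-style checkpoint

print(get_state_dict(plain) is plain)                    # True: already unwrapped
print(get_state_dict(wrapped) is plain)                  # True: one unwrap
print(get_state_dict(get_state_dict(wrapped)) is plain)  # True: idempotent
```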
generation/control/ControlNet/config.py
ADDED
@@ -0,0 +1 @@
save_memory = False
generation/control/ControlNet/docs/annotator.md
ADDED
@@ -0,0 +1,49 @@
# Automatic Annotations

We provide gradio examples to obtain annotations that are aligned to our pretrained production-ready models.

Just run

    python gradio_annotator.py

Since everyone has different habits of organizing their datasets, we do not hard-code any scripts for batch processing. But "gradio_annotator.py" is written in a very readable way, and modifying it to annotate your images should be easy.

In the gradio UI of "gradio_annotator.py" we have the following interfaces:

### Canny Edge

Be careful about "black edge and white background" or "white edge and black background".

### HED Edge

Be careful about "black edge and white background" or "white edge and black background".

### MLSD Edge

Be careful about "black edge and white background" or "white edge and black background".

### MIDAS Depth and Normal

Be careful about RGB or BGR in normal maps.

### Openpose

Be careful about RGB or BGR in pose maps.

For our production-ready model, the hand pose option is turned off.

### Uniformer Segmentation

Be careful about RGB or BGR in segmentation maps.
generation/control/ControlNet/docs/faq.md
ADDED
@@ -0,0 +1,21 @@
# FAQs

**Q:** If the weight of a conv layer is zero, the gradient will also be zero, and the network will not learn anything. Why does "zero convolution" work?

**A:** This is wrong. Let us consider a very simple

$$y=wx+b$$

and we have

$$\partial y/\partial w=x, \partial y/\partial x=w, \partial y/\partial b=1$$

and if $w=0$ and $x \neq 0$, then

$$\partial y/\partial w \neq 0, \partial y/\partial x=0, \partial y/\partial b\neq 0$$

which means that as long as $x \neq 0$, one gradient descent iteration will make $w$ non-zero. Then

$$\partial y/\partial x\neq 0$$

so the zero convolutions will progressively become a common conv layer with non-zero weights.
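The derivative argument above can be checked numerically. The sketch below uses plain floats rather than a real conv layer, with the same $y=wx+b$ and a single SGD step (the learning rate 0.1 is an arbitrary choice):

```python
w, b, x = 0.0, 0.0, 2.0   # zero-initialized weight, non-zero input

dy_dw = x     # = 2.0: non-zero even though w == 0
dy_dx = w     # = 0.0: zero only while w stays zero
dy_db = 1.0   # always non-zero

w = w - 0.1 * dy_dw       # one SGD step with learning rate 0.1
dy_dx_next = w            # now non-zero: gradients flow to x from here on
print(w)  # -0.2
```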
generation/control/ControlNet/docs/low_vram.md
ADDED
@@ -0,0 +1,15 @@
# Enable Low VRAM Mode

If you are using an 8GB GPU (or if you want a larger batch size), please open "config.py" and set

```python
save_memory = True
```

This feature is still being tested - not all graphics cards are guaranteed to succeed.

But it should be neat, as I can diffuse at a batch size of 12 now.

(prompt "man")
generation/control/ControlNet/docs/train.md
ADDED
@@ -0,0 +1,276 @@
# Train a ControlNet to Control SD

You are here because you want to control SD in your own way: maybe you have an idea for your perfect research project, and you will annotate some data or have already annotated your own dataset automatically or manually. Here, the control can be anything that can be converted to images, such as edges, keypoints, segments, etc.

Before moving on to your own dataset, we highly recommend first trying the toy dataset, Fill50K, as a sanity check. This will help you get a "feeling" for the training: you will learn how long it takes for the model to converge, whether your device can complete the training in an acceptable amount of time, and what it "feels" like when the model converges.

We hope that after you read this page, you will find that training a ControlNet is as easy as (or easier than) training a pix2pix.

## Step 0 - Design your control

Let us take a look at a very simple task: controlling SD to fill circles with color.

This is simple: we want to control SD to fill a circle with colors, and the prompt contains some description of our target.

Stable diffusion is trained on billions of images, and it already knows what "cyan", "circle", "pink", and "background" are.

But it does not know the meaning of that "Control Image (Source Image)". Our target is to let it know.

## Step 1 - Get a dataset

Just download the Fill50K dataset from [our huggingface page](https://huggingface.co/lllyasviel/ControlNet) (training/fill50k.zip, the file is only 200M!). Make sure that the data is decompressed as

    ControlNet/training/fill50k/prompt.json
    ControlNet/training/fill50k/source/X.png
    ControlNet/training/fill50k/target/X.png

In the folder "fill50k/source", you will have 50k images of circle outlines.

In the folder "fill50k/target", you will have 50k images of filled circles.

In "fill50k/prompt.json", you will have their filenames and prompts. Each prompt is like "a balabala color circle in some other color background."
## Step 2 - Load the dataset

Then you need to write a simple script to read this dataset for pytorch. (In fact we have written it for you in "tutorial_dataset.py".)

```python
import json
import cv2
import numpy as np

from torch.utils.data import Dataset


class MyDataset(Dataset):
    def __init__(self):
        self.data = []
        with open('./training/fill50k/prompt.json', 'rt') as f:
            for line in f:
                self.data.append(json.loads(line))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]

        source_filename = item['source']
        target_filename = item['target']
        prompt = item['prompt']

        source = cv2.imread('./training/fill50k/' + source_filename)
        target = cv2.imread('./training/fill50k/' + target_filename)

        # Do not forget that OpenCV reads images in BGR order.
        source = cv2.cvtColor(source, cv2.COLOR_BGR2RGB)
        target = cv2.cvtColor(target, cv2.COLOR_BGR2RGB)

        # Normalize source images to [0, 1].
        source = source.astype(np.float32) / 255.0

        # Normalize target images to [-1, 1].
        target = (target.astype(np.float32) / 127.5) - 1.0

        return dict(jpg=target, txt=prompt, hint=source)
```

This will make your dataset into an array-like object in python. You can test this dataset simply by accessing the array, like this

```python
from tutorial_dataset import MyDataset

dataset = MyDataset()
print(len(dataset))

item = dataset[1234]
jpg = item['jpg']
txt = item['txt']
hint = item['hint']
print(txt)
print(jpg.shape)
print(hint.shape)
```

The outputs of this simple test on my machine are

    50000
    burly wood circle with orange background
    (512, 512, 3)
    (512, 512, 3)

And this code is in "tutorial_dataset_test.py".

In this way, the dataset is an array-like object with 50000 items. Each item is a dict with three entries: "jpg", "txt", and "hint". "jpg" is the target image, "hint" is the control image, and "txt" is the prompt.

Do not ask us why we use these three names - this is related to the dark history of a library called LDM.
## Step 3 - What SD model do you want to control?

Then you need to decide which Stable Diffusion model you want to control. In this example, we will just use standard SD1.5. You can download it from the [official page of Stability](https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/main). You want the file ["v1-5-pruned.ckpt"](https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/main).

(Or ["v2-1_512-ema-pruned.ckpt"](https://huggingface.co/stabilityai/stable-diffusion-2-1-base/tree/main) if you are using SD2.)

Then you need to attach a ControlNet to the SD model.

Note that all weights inside the ControlNet are also copied from SD, so that no layer is trained from scratch and you are still finetuning the entire model.

We provide a simple script for you to achieve this easily. If your SD filename is "./models/v1-5-pruned.ckpt" and you want the script to save the processed model (SD+ControlNet) at location "./models/control_sd15_ini.ckpt", you can just run:

    python tool_add_control.py ./models/v1-5-pruned.ckpt ./models/control_sd15_ini.ckpt

Or if you are using SD2:

    python tool_add_control_sd21.py ./models/v2-1_512-ema-pruned.ckpt ./models/control_sd21_ini.ckpt

You may also use other filenames as long as the command is "python tool_add_control.py input_path output_path".

## Step 4 - Train!

Happy! We finally come to the most exciting part: training!

The training code in "tutorial_train.py" is actually surprisingly simple:

```python
import pytorch_lightning as pl
from torch.utils.data import DataLoader
from tutorial_dataset import MyDataset
from cldm.logger import ImageLogger
from cldm.model import create_model, load_state_dict


# Configs
resume_path = './models/control_sd15_ini.ckpt'
batch_size = 4
logger_freq = 300
learning_rate = 1e-5
sd_locked = True
only_mid_control = False


# First use cpu to load models. Pytorch Lightning will automatically move it to GPUs.
model = create_model('./models/cldm_v15.yaml').cpu()
model.load_state_dict(load_state_dict(resume_path, location='cpu'))
model.learning_rate = learning_rate
model.sd_locked = sd_locked
model.only_mid_control = only_mid_control


# Misc
dataset = MyDataset()
dataloader = DataLoader(dataset, num_workers=0, batch_size=batch_size, shuffle=True)
logger = ImageLogger(batch_frequency=logger_freq)
trainer = pl.Trainer(gpus=1, precision=32, callbacks=[logger])


# Train!
trainer.fit(model, dataloader)
```

(or "tutorial_train_sd21.py" if you are using SD2)

Thanks to our organized dataset pytorch object and the power of pytorch_lightning, the entire code is just super short.

Now, you may take a look at the [Pytorch Lightning official docs](https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.trainer.trainer.Trainer.html#trainer) to find out how to enable many useful features like gradient accumulation, multiple GPU training, accelerated dataset loading, flexible checkpoint saving, etc. All of these need only about one line of code. Great!

Note that if you run into OOM, you may need to enable [Low VRAM mode](low_vram.md), and perhaps also a smaller batch size with gradient accumulation. Or you may want to use some "advanced" tricks like sliced attention or xformers. For example:

```python
# Configs
batch_size = 1

# Misc
trainer = pl.Trainer(gpus=1, precision=32, callbacks=[logger], accumulate_grad_batches=4)  # But this will be 4x slower
```

Note that training with an 8 GB laptop GPU is challenging. We would need GPU memory optimization at least as good as automatic1111's UI. This may require expert modifications to the code.

### Screenshots

The training is fast. After 4000 steps (batch size 4, learning rate 1e-5, about 50 minutes on PCIE 40G), the results on my machine (in an output folder "image_log") are

Control:

Prompt:

Prediction:

Ground Truth:

Note that SD's capability is preserved. Even when training on this perfectly aligned dataset, it still draws some random textures and those snow decorations. (Besides, note that the ground truth looks a bit modified because it is converted from SD's latent image.)

A larger batch size and longer training will further improve this. Adequate training will make the filling perfect.

Of course, training SD to fill circles is meaningless, but this is a successful beginning of your story.

Let us work together to control large models more and more.
## Other options

Beyond the standard things, we also provide two important parameters, "sd_locked" and "only_mid_control", that you need to know about.

### only_mid_control

By default, only_mid_control is False. When it is True, only the middle-block control connections are trained.

This can be helpful when your computation power is limited and you want to speed up the training, or when you want to facilitate "global" context learning. Note that sometimes you may pause training, set it to True, resume training, pause again, set it back, and resume again.

If your computation device is good, perhaps you do not need this. But I also know some artists are willing to train a model on their laptop for a month - in that case, perhaps this option can be useful.

### sd_locked

By default, sd_locked is True. When it is False, some layers in SD are unlocked and you will train them as a whole.

This option is DANGEROUS! If your dataset is not good enough, it may downgrade the capability of your SD model.

However, this option is also very useful when you are training on images with some specific style, or when you are training with special datasets (like a medical dataset with X-ray images or geographic datasets with lots of Google Maps). You can understand this as simultaneously training the ControlNet and something like a DreamBooth.

Also, if your dataset is large, you may want to end the training with a few thousand steps with those layers unlocked. This usually improves the "problem-specific" results a little. You may try it yourself to feel the difference.

Also, if you unlock some original layers, you may want a lower learning rate, like 2e-6.

## More Consideration: Sudden Converge Phenomenon and Gradient Accumulation

Because we use zero convolutions, SD should always be able to predict meaningful images. (If it cannot, the training has already failed.)

You will always find that at some iteration, the model "suddenly" becomes able to fit some training conditions. This means that you will get a basically usable model at about 3k to 7k steps (further training will improve it, but the model after the first "sudden converge" should be basically functional).

Note that 3k to 7k steps is not very many, and you should consider a larger batch size rather than more training steps. If you can observe the "sudden converge" at 3k steps using batch size 4, then, rather than training for 300k further steps, a better idea is to use 100× gradient accumulation to re-train those 3k steps with 100× batch size. Note that perhaps we should not do this *too* extremely (perhaps 100× accumulation is too extreme), but you should consider that, since "sudden converge" will *always* happen at that certain point, getting a better converge is more important.

Because that "sudden converge" always happens, let's say it happens at 3k steps and our budget can afford 90k steps; then we have two options: (1) train 3k steps, sudden converge, then train 87k further steps; (2) use 30× gradient accumulation, train 3k steps (90k real computation steps), then sudden converge.

In my experiments, (2) is usually better than (1). However, in real cases, perhaps you need to balance the steps before and after the "sudden converge" on your own to find the right trade-off. The training after "sudden converge" is also important.

But usually, if your logical batch size is already bigger than 256, then further extending the batch size is not very meaningful. In that case, perhaps a better idea is to train more steps. I tried some "common" logical batch sizes of 64, 96, or 128 (via gradient accumulation); it seems that many complicated conditions can be solved very well already.
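The budget arithmetic behind the two options above, spelled out with the same hypothetical numbers:

```python
batch_size = 4
sudden_converge_step = 3000
budget_steps = 90000                      # total compute steps we can afford

# Option (1): converge, then keep training at the same batch size.
steps_after_converge = budget_steps - sudden_converge_step       # 87000

# Option (2): spend the whole budget on a bigger logical batch instead.
accumulate_grad_batches = budget_steps // sudden_converge_step   # 30
logical_batch = batch_size * accumulate_grad_batches             # 120

print(steps_after_converge, accumulate_grad_batches, logical_batch)
# 87000 30 120
```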
generation/control/ControlNet/environment.yaml
ADDED
@@ -0,0 +1,35 @@
name: control
channels:
  - pytorch
  - defaults
dependencies:
  - python=3.8.5
  - pip=20.3
  - cudatoolkit=11.3
  - pytorch=1.12.1
  - torchvision=0.13.1
  - numpy=1.23.1
  - pip:
      - gradio==3.16.2
      - albumentations==1.3.0
      - opencv-contrib-python==4.3.0.36
      - imageio==2.9.0
      - imageio-ffmpeg==0.4.2
      - pytorch-lightning==1.5.0
      - omegaconf==2.1.1
      - test-tube>=0.7.5
      - streamlit==1.12.1
      - einops==0.3.0
      - transformers==4.19.2
      - webdataset==0.2.5
      - kornia==0.6
      - open_clip_torch==2.0.2
      - invisible-watermark>=0.1.5
      - streamlit-drawable-canvas==0.8.0
      - torchmetrics==0.6.0
      - timm==0.6.12
      - addict==2.4.0
      - yapf==0.32.0
      - prettytable==3.6.0
      - safetensors==0.2.7
      - basicsr==1.4.2
generation/control/ControlNet/gradio_annotator.py
ADDED
@@ -0,0 +1,160 @@

```python
import gradio as gr

from annotator.util import resize_image, HWC3


model_canny = None


def canny(img, res, l, h):
    img = resize_image(HWC3(img), res)
    global model_canny
    if model_canny is None:
        from annotator.canny import CannyDetector
        model_canny = CannyDetector()
    result = model_canny(img, l, h)
    return [result]


model_hed = None


def hed(img, res):
    img = resize_image(HWC3(img), res)
    global model_hed
    if model_hed is None:
        from annotator.hed import HEDdetector
        model_hed = HEDdetector()
    result = model_hed(img)
    return [result]


model_mlsd = None


def mlsd(img, res, thr_v, thr_d):
    img = resize_image(HWC3(img), res)
    global model_mlsd
    if model_mlsd is None:
        from annotator.mlsd import MLSDdetector
        model_mlsd = MLSDdetector()
    result = model_mlsd(img, thr_v, thr_d)
    return [result]


model_midas = None


def midas(img, res, a):
    img = resize_image(HWC3(img), res)
    global model_midas
    if model_midas is None:
        from annotator.midas import MidasDetector
        model_midas = MidasDetector()
    results = model_midas(img, a)
    return results


model_openpose = None


def openpose(img, res, has_hand):
    img = resize_image(HWC3(img), res)
    global model_openpose
    if model_openpose is None:
        from annotator.openpose import OpenposeDetector
        model_openpose = OpenposeDetector()
    result, _ = model_openpose(img, has_hand)
    return [result]


model_uniformer = None


def uniformer(img, res):
    img = resize_image(HWC3(img), res)
    global model_uniformer
    if model_uniformer is None:
        from annotator.uniformer import UniformerDetector
        model_uniformer = UniformerDetector()
    result = model_uniformer(img)
    return [result]


block = gr.Blocks().queue()
with block:
    with gr.Row():
        gr.Markdown("## Canny Edge")
    with gr.Row():
        with gr.Column():
            input_image = gr.Image(source='upload', type="numpy")
            low_threshold = gr.Slider(label="low_threshold", minimum=1, maximum=255, value=100, step=1)
            high_threshold = gr.Slider(label="high_threshold", minimum=1, maximum=255, value=200, step=1)
            resolution = gr.Slider(label="resolution", minimum=256, maximum=1024, value=512, step=64)
            run_button = gr.Button(label="Run")
        with gr.Column():
            gallery = gr.Gallery(label="Generated images", show_label=False).style(height="auto")
    run_button.click(fn=canny, inputs=[input_image, resolution, low_threshold, high_threshold], outputs=[gallery])

    with gr.Row():
        gr.Markdown("## HED Edge")
    with gr.Row():
        with gr.Column():
            input_image = gr.Image(source='upload', type="numpy")
            resolution = gr.Slider(label="resolution", minimum=256, maximum=1024, value=512, step=64)
            run_button = gr.Button(label="Run")
        with gr.Column():
            gallery = gr.Gallery(label="Generated images", show_label=False).style(height="auto")
    run_button.click(fn=hed, inputs=[input_image, resolution], outputs=[gallery])

    with gr.Row():
        gr.Markdown("## MLSD Edge")
    with gr.Row():
        with gr.Column():
            input_image = gr.Image(source='upload', type="numpy")
            value_threshold = gr.Slider(label="value_threshold", minimum=0.01, maximum=2.0, value=0.1, step=0.01)
            distance_threshold = gr.Slider(label="distance_threshold", minimum=0.01, maximum=20.0, value=0.1, step=0.01)
            resolution = gr.Slider(label="resolution", minimum=256, maximum=1024, value=384, step=64)
            run_button = gr.Button(label="Run")
        with gr.Column():
            gallery = gr.Gallery(label="Generated images", show_label=False).style(height="auto")
    run_button.click(fn=mlsd, inputs=[input_image, resolution, value_threshold, distance_threshold], outputs=[gallery])

    with gr.Row():
        gr.Markdown("## MIDAS Depth and Normal")
    with gr.Row():
        with gr.Column():
            input_image = gr.Image(source='upload', type="numpy")
            alpha = gr.Slider(label="alpha", minimum=0.1, maximum=20.0, value=6.2, step=0.01)
            resolution = gr.Slider(label="resolution", minimum=256, maximum=1024, value=384, step=64)
            run_button = gr.Button(label="Run")
        with gr.Column():
            gallery = gr.Gallery(label="Generated images", show_label=False).style(height="auto")
    run_button.click(fn=midas, inputs=[input_image, resolution, alpha], outputs=[gallery])

    with gr.Row():
        gr.Markdown("## Openpose")
    with gr.Row():
        with gr.Column():
            input_image = gr.Image(source='upload', type="numpy")
            hand = gr.Checkbox(label='detect hand', value=False)
            resolution = gr.Slider(label="resolution", minimum=256, maximum=1024, value=512, step=64)
            run_button = gr.Button(label="Run")
        with gr.Column():
            gallery = gr.Gallery(label="Generated images", show_label=False).style(height="auto")
    run_button.click(fn=openpose, inputs=[input_image, resolution, hand], outputs=[gallery])

    with gr.Row():
        gr.Markdown("## Uniformer Segmentation")
    with gr.Row():
        with gr.Column():
            input_image = gr.Image(source='upload', type="numpy")
            resolution = gr.Slider(label="resolution", minimum=256, maximum=1024, value=512, step=64)
            run_button = gr.Button(label="Run")
        with gr.Column():
            gallery = gr.Gallery(label="Generated images", show_label=False).style(height="auto")
    run_button.click(fn=uniformer, inputs=[input_image, resolution], outputs=[gallery])


block.launch(server_name='0.0.0.0')
```
generation/control/ControlNet/gradio_canny2image.py
ADDED
@@ -0,0 +1,97 @@

```python
from share import *
import config

import cv2
import einops
import gradio as gr
import numpy as np
import torch
import random

from pytorch_lightning import seed_everything
from annotator.util import resize_image, HWC3
from annotator.canny import CannyDetector
from cldm.model import create_model, load_state_dict
from cldm.ddim_hacked import DDIMSampler


apply_canny = CannyDetector()

model = create_model('./models/cldm_v15.yaml').cpu()
model.load_state_dict(load_state_dict('./models/control_sd15_canny.pth', location='cuda'))
model = model.cuda()
ddim_sampler = DDIMSampler(model)


def process(input_image, prompt, a_prompt, n_prompt, num_samples, image_resolution, ddim_steps, guess_mode, strength, scale, seed, eta, low_threshold, high_threshold):
    with torch.no_grad():
        img = resize_image(HWC3(input_image), image_resolution)
        H, W, C = img.shape

        detected_map = apply_canny(img, low_threshold, high_threshold)
        detected_map = HWC3(detected_map)

        control = torch.from_numpy(detected_map.copy()).float().cuda() / 255.0
        control = torch.stack([control for _ in range(num_samples)], dim=0)
        control = einops.rearrange(control, 'b h w c -> b c h w').clone()

        if seed == -1:
            seed = random.randint(0, 65535)
        seed_everything(seed)

        if config.save_memory:
            model.low_vram_shift(is_diffusing=False)

        cond = {"c_concat": [control], "c_crossattn": [model.get_learned_conditioning([prompt + ', ' + a_prompt] * num_samples)]}
        un_cond = {"c_concat": None if guess_mode else [control], "c_crossattn": [model.get_learned_conditioning([n_prompt] * num_samples)]}
        shape = (4, H // 8, W // 8)

        if config.save_memory:
            model.low_vram_shift(is_diffusing=True)

        model.control_scales = [strength * (0.825 ** float(12 - i)) for i in range(13)] if guess_mode else ([strength] * 13)  # Magic number. IDK why. Perhaps because 0.825**12<0.01 but 0.826**12>0.01
        samples, intermediates = ddim_sampler.sample(ddim_steps, num_samples,
                                                     shape, cond, verbose=False, eta=eta,
                                                     unconditional_guidance_scale=scale,
                                                     unconditional_conditioning=un_cond)

        if config.save_memory:
            model.low_vram_shift(is_diffusing=False)

        x_samples = model.decode_first_stage(samples)
        x_samples = (einops.rearrange(x_samples, 'b c h w -> b h w c') * 127.5 + 127.5).cpu().numpy().clip(0, 255).astype(np.uint8)

        results = [x_samples[i] for i in range(num_samples)]
    return [255 - detected_map] + results


block = gr.Blocks().queue()
with block:
    with gr.Row():
        gr.Markdown("## Control Stable Diffusion with Canny Edge Maps")
    with gr.Row():
        with gr.Column():
            input_image = gr.Image(source='upload', type="numpy")
            prompt = gr.Textbox(label="Prompt")
            run_button = gr.Button(label="Run")
            with gr.Accordion("Advanced options", open=False):
                num_samples = gr.Slider(label="Images", minimum=1, maximum=12, value=1, step=1)
                image_resolution = gr.Slider(label="Image Resolution", minimum=256, maximum=768, value=512, step=64)
                strength = gr.Slider(label="Control Strength", minimum=0.0, maximum=2.0, value=1.0, step=0.01)
                guess_mode = gr.Checkbox(label='Guess Mode', value=False)
                low_threshold = gr.Slider(label="Canny low threshold", minimum=1, maximum=255, value=100, step=1)
                high_threshold = gr.Slider(label="Canny high threshold", minimum=1, maximum=255, value=200, step=1)
                ddim_steps = gr.Slider(label="Steps", minimum=1, maximum=100, value=20, step=1)
                scale = gr.Slider(label="Guidance Scale", minimum=0.1, maximum=30.0, value=9.0, step=0.1)
                seed = gr.Slider(label="Seed", minimum=-1, maximum=2147483647, step=1, randomize=True)
                eta = gr.Number(label="eta (DDIM)", value=0.0)
                a_prompt = gr.Textbox(label="Added Prompt", value='best quality, extremely detailed')
                n_prompt = gr.Textbox(label="Negative Prompt",
                                      value='longbody, lowres, bad anatomy, bad hands, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality')
        with gr.Column():
            result_gallery = gr.Gallery(label='Output', show_label=False, elem_id="gallery").style(grid=2, height='auto')
    ips = [input_image, prompt, a_prompt, n_prompt, num_samples, image_resolution, ddim_steps, guess_mode, strength, scale, seed, eta, low_threshold, high_threshold]
    run_button.click(fn=process, inputs=ips, outputs=[result_gallery])


block.launch(server_name='0.0.0.0')
```
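As an aside on the guess-mode branch of the `control_scales` assignment above: it builds 13 geometrically decaying weights, one per ControlNet output block, with the last entry kept at full strength. A quick numerical sketch (assuming the slider default `strength` of 1.0):

```python
strength = 1.0  # default value of the "Control Strength" slider
scales = [strength * (0.825 ** float(12 - i)) for i in range(13)]

# scales[12] == strength, while scales[0] = 0.825**12 (about 0.0994),
# so earlier entries in the list are damped the most.
```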
generation/control/ControlNet/gradio_depth2image.py
ADDED
@@ -0,0 +1,98 @@

```python
from share import *
import config

import cv2
import einops
import gradio as gr
import numpy as np
import torch
import random

from pytorch_lightning import seed_everything
from annotator.util import resize_image, HWC3
from annotator.midas import MidasDetector
from cldm.model import create_model, load_state_dict
from cldm.ddim_hacked import DDIMSampler


apply_midas = MidasDetector()

model = create_model('./models/cldm_v15.yaml').cpu()
model.load_state_dict(load_state_dict('./models/control_sd15_depth.pth', location='cuda'))
model = model.cuda()
ddim_sampler = DDIMSampler(model)


def process(input_image, prompt, a_prompt, n_prompt, num_samples, image_resolution, detect_resolution, ddim_steps, guess_mode, strength, scale, seed, eta):
    with torch.no_grad():
        input_image = HWC3(input_image)
        detected_map, _ = apply_midas(resize_image(input_image, detect_resolution))
        detected_map = HWC3(detected_map)
        img = resize_image(input_image, image_resolution)
        H, W, C = img.shape

        detected_map = cv2.resize(detected_map, (W, H), interpolation=cv2.INTER_LINEAR)

        control = torch.from_numpy(detected_map.copy()).float().cuda() / 255.0
        control = torch.stack([control for _ in range(num_samples)], dim=0)
        control = einops.rearrange(control, 'b h w c -> b c h w').clone()

        if seed == -1:
            seed = random.randint(0, 65535)
        seed_everything(seed)

        if config.save_memory:
            model.low_vram_shift(is_diffusing=False)

        cond = {"c_concat": [control], "c_crossattn": [model.get_learned_conditioning([prompt + ', ' + a_prompt] * num_samples)]}
        un_cond = {"c_concat": None if guess_mode else [control], "c_crossattn": [model.get_learned_conditioning([n_prompt] * num_samples)]}
        shape = (4, H // 8, W // 8)

        if config.save_memory:
            model.low_vram_shift(is_diffusing=True)

        model.control_scales = [strength * (0.825 ** float(12 - i)) for i in range(13)] if guess_mode else ([strength] * 13)  # Magic number. IDK why. Perhaps because 0.825**12<0.01 but 0.826**12>0.01
        samples, intermediates = ddim_sampler.sample(ddim_steps, num_samples,
                                                     shape, cond, verbose=False, eta=eta,
                                                     unconditional_guidance_scale=scale,
                                                     unconditional_conditioning=un_cond)

        if config.save_memory:
            model.low_vram_shift(is_diffusing=False)

        x_samples = model.decode_first_stage(samples)
        x_samples = (einops.rearrange(x_samples, 'b c h w -> b h w c') * 127.5 + 127.5).cpu().numpy().clip(0, 255).astype(np.uint8)

        results = [x_samples[i] for i in range(num_samples)]
    return [detected_map] + results


block = gr.Blocks().queue()
with block:
    with gr.Row():
        gr.Markdown("## Control Stable Diffusion with Depth Maps")
    with gr.Row():
        with gr.Column():
            input_image = gr.Image(source='upload', type="numpy")
            prompt = gr.Textbox(label="Prompt")
            run_button = gr.Button(label="Run")
            with gr.Accordion("Advanced options", open=False):
                num_samples = gr.Slider(label="Images", minimum=1, maximum=12, value=1, step=1)
                image_resolution = gr.Slider(label="Image Resolution", minimum=256, maximum=768, value=512, step=64)
                strength = gr.Slider(label="Control Strength", minimum=0.0, maximum=2.0, value=1.0, step=0.01)
                guess_mode = gr.Checkbox(label='Guess Mode', value=False)
                detect_resolution = gr.Slider(label="Depth Resolution", minimum=128, maximum=1024, value=384, step=1)
                ddim_steps = gr.Slider(label="Steps", minimum=1, maximum=100, value=20, step=1)
                scale = gr.Slider(label="Guidance Scale", minimum=0.1, maximum=30.0, value=9.0, step=0.1)
                seed = gr.Slider(label="Seed", minimum=-1, maximum=2147483647, step=1, randomize=True)
                eta = gr.Number(label="eta (DDIM)", value=0.0)
                a_prompt = gr.Textbox(label="Added Prompt", value='best quality, extremely detailed')
                n_prompt = gr.Textbox(label="Negative Prompt",
                                      value='longbody, lowres, bad anatomy, bad hands, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality')
        with gr.Column():
            result_gallery = gr.Gallery(label='Output', show_label=False, elem_id="gallery").style(grid=2, height='auto')
    ips = [input_image, prompt, a_prompt, n_prompt, num_samples, image_resolution, detect_resolution, ddim_steps, guess_mode, strength, scale, seed, eta]
    run_button.click(fn=process, inputs=ips, outputs=[result_gallery])


block.launch(server_name='0.0.0.0')
```
generation/control/ControlNet/gradio_fake_scribble2image.py
ADDED
@@ -0,0 +1,102 @@

```python
from share import *
import config

import cv2
import einops
import gradio as gr
import numpy as np
import torch
import random

from pytorch_lightning import seed_everything
from annotator.util import resize_image, HWC3
from annotator.hed import HEDdetector, nms
from cldm.model import create_model, load_state_dict
from cldm.ddim_hacked import DDIMSampler


apply_hed = HEDdetector()

model = create_model('./models/cldm_v15.yaml').cpu()
model.load_state_dict(load_state_dict('./models/control_sd15_scribble.pth', location='cuda'))
model = model.cuda()
ddim_sampler = DDIMSampler(model)


def process(input_image, prompt, a_prompt, n_prompt, num_samples, image_resolution, detect_resolution, ddim_steps, guess_mode, strength, scale, seed, eta):
    with torch.no_grad():
        input_image = HWC3(input_image)
        detected_map = apply_hed(resize_image(input_image, detect_resolution))
        detected_map = HWC3(detected_map)
        img = resize_image(input_image, image_resolution)
        H, W, C = img.shape

        detected_map = cv2.resize(detected_map, (W, H), interpolation=cv2.INTER_LINEAR)
        detected_map = nms(detected_map, 127, 3.0)
        detected_map = cv2.GaussianBlur(detected_map, (0, 0), 3.0)
        detected_map[detected_map > 4] = 255
        detected_map[detected_map < 255] = 0

        control = torch.from_numpy(detected_map.copy()).float().cuda() / 255.0
        control = torch.stack([control for _ in range(num_samples)], dim=0)
        control = einops.rearrange(control, 'b h w c -> b c h w').clone()

        if seed == -1:
            seed = random.randint(0, 65535)
        seed_everything(seed)

        if config.save_memory:
            model.low_vram_shift(is_diffusing=False)

        cond = {"c_concat": [control], "c_crossattn": [model.get_learned_conditioning([prompt + ', ' + a_prompt] * num_samples)]}
        un_cond = {"c_concat": None if guess_mode else [control], "c_crossattn": [model.get_learned_conditioning([n_prompt] * num_samples)]}
        shape = (4, H // 8, W // 8)

        if config.save_memory:
            model.low_vram_shift(is_diffusing=True)

        model.control_scales = [strength * (0.825 ** float(12 - i)) for i in range(13)] if guess_mode else ([strength] * 13)  # Magic number. IDK why. Perhaps because 0.825**12<0.01 but 0.826**12>0.01
        samples, intermediates = ddim_sampler.sample(ddim_steps, num_samples,
                                                     shape, cond, verbose=False, eta=eta,
                                                     unconditional_guidance_scale=scale,
                                                     unconditional_conditioning=un_cond)

        if config.save_memory:
            model.low_vram_shift(is_diffusing=False)

        x_samples = model.decode_first_stage(samples)
        x_samples = (einops.rearrange(x_samples, 'b c h w -> b h w c') * 127.5 + 127.5).cpu().numpy().clip(0, 255).astype(np.uint8)

        results = [x_samples[i] for i in range(num_samples)]
    return [255 - detected_map] + results


block = gr.Blocks().queue()
with block:
    with gr.Row():
        gr.Markdown("## Control Stable Diffusion with Fake Scribble Maps")
    with gr.Row():
        with gr.Column():
            input_image = gr.Image(source='upload', type="numpy")
            prompt = gr.Textbox(label="Prompt")
            run_button = gr.Button(label="Run")
            with gr.Accordion("Advanced options", open=False):
                num_samples = gr.Slider(label="Images", minimum=1, maximum=12, value=1, step=1)
                image_resolution = gr.Slider(label="Image Resolution", minimum=256, maximum=768, value=512, step=64)
                strength = gr.Slider(label="Control Strength", minimum=0.0, maximum=2.0, value=1.0, step=0.01)
                guess_mode = gr.Checkbox(label='Guess Mode', value=False)
                detect_resolution = gr.Slider(label="HED Resolution", minimum=128, maximum=1024, value=512, step=1)
                ddim_steps = gr.Slider(label="Steps", minimum=1, maximum=100, value=20, step=1)
                scale = gr.Slider(label="Guidance Scale", minimum=0.1, maximum=30.0, value=9.0, step=0.1)
                seed = gr.Slider(label="Seed", minimum=-1, maximum=2147483647, step=1, randomize=True)
                eta = gr.Number(label="eta (DDIM)", value=0.0)
                a_prompt = gr.Textbox(label="Added Prompt", value='best quality, extremely detailed')
                n_prompt = gr.Textbox(label="Negative Prompt",
                                      value='longbody, lowres, bad anatomy, bad hands, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality')
        with gr.Column():
            result_gallery = gr.Gallery(label='Output', show_label=False, elem_id="gallery").style(grid=2, height='auto')
    ips = [input_image, prompt, a_prompt, n_prompt, num_samples, image_resolution, detect_resolution, ddim_steps, guess_mode, strength, scale, seed, eta]
    run_button.click(fn=process, inputs=ips, outputs=[result_gallery])


block.launch(server_name='0.0.0.0')
```
generation/control/ControlNet/gradio_hed2image.py
ADDED
@@ -0,0 +1,98 @@

```python
from share import *
import config

import cv2
import einops
import gradio as gr
import numpy as np
import torch
import random

from pytorch_lightning import seed_everything
from annotator.util import resize_image, HWC3
from annotator.hed import HEDdetector
from cldm.model import create_model, load_state_dict
from cldm.ddim_hacked import DDIMSampler


apply_hed = HEDdetector()

model = create_model('./models/cldm_v15.yaml').cpu()
model.load_state_dict(load_state_dict('./models/control_sd15_hed.pth', location='cuda'))
model = model.cuda()
ddim_sampler = DDIMSampler(model)


def process(input_image, prompt, a_prompt, n_prompt, num_samples, image_resolution, detect_resolution, ddim_steps, guess_mode, strength, scale, seed, eta):
    with torch.no_grad():
        input_image = HWC3(input_image)
        detected_map = apply_hed(resize_image(input_image, detect_resolution))
        detected_map = HWC3(detected_map)
        img = resize_image(input_image, image_resolution)
        H, W, C = img.shape

        detected_map = cv2.resize(detected_map, (W, H), interpolation=cv2.INTER_LINEAR)

        control = torch.from_numpy(detected_map.copy()).float().cuda() / 255.0
        control = torch.stack([control for _ in range(num_samples)], dim=0)
        control = einops.rearrange(control, 'b h w c -> b c h w').clone()

        if seed == -1:
            seed = random.randint(0, 65535)
        seed_everything(seed)

        if config.save_memory:
            model.low_vram_shift(is_diffusing=False)

        cond = {"c_concat": [control], "c_crossattn": [model.get_learned_conditioning([prompt + ', ' + a_prompt] * num_samples)]}
        un_cond = {"c_concat": None if guess_mode else [control], "c_crossattn": [model.get_learned_conditioning([n_prompt] * num_samples)]}
        shape = (4, H // 8, W // 8)

        if config.save_memory:
            model.low_vram_shift(is_diffusing=True)

        model.control_scales = [strength * (0.825 ** float(12 - i)) for i in range(13)] if guess_mode else ([strength] * 13)  # Magic number. IDK why. Perhaps because 0.825**12<0.01 but 0.826**12>0.01
        samples, intermediates = ddim_sampler.sample(ddim_steps, num_samples,
                                                     shape, cond, verbose=False, eta=eta,
                                                     unconditional_guidance_scale=scale,
                                                     unconditional_conditioning=un_cond)

        if config.save_memory:
            model.low_vram_shift(is_diffusing=False)

        x_samples = model.decode_first_stage(samples)
        x_samples = (einops.rearrange(x_samples, 'b c h w -> b h w c') * 127.5 + 127.5).cpu().numpy().clip(0, 255).astype(np.uint8)

        results = [x_samples[i] for i in range(num_samples)]
    return [detected_map] + results


block = gr.Blocks().queue()
with block:
    with gr.Row():
        gr.Markdown("## Control Stable Diffusion with HED Maps")
    with gr.Row():
        with gr.Column():
            input_image = gr.Image(source='upload', type="numpy")
            prompt = gr.Textbox(label="Prompt")
            run_button = gr.Button(label="Run")
            with gr.Accordion("Advanced options", open=False):
                num_samples = gr.Slider(label="Images", minimum=1, maximum=12, value=1, step=1)
                image_resolution = gr.Slider(label="Image Resolution", minimum=256, maximum=768, value=512, step=64)
                strength = gr.Slider(label="Control Strength", minimum=0.0, maximum=2.0, value=1.0, step=0.01)
                guess_mode = gr.Checkbox(label='Guess Mode', value=False)
                detect_resolution = gr.Slider(label="HED Resolution", minimum=128, maximum=1024, value=512, step=1)
                ddim_steps = gr.Slider(label="Steps", minimum=1, maximum=100, value=20, step=1)
                scale = gr.Slider(label="Guidance Scale", minimum=0.1, maximum=30.0, value=9.0, step=0.1)
                seed = gr.Slider(label="Seed", minimum=-1, maximum=2147483647, step=1, randomize=True)
                eta = gr.Number(label="eta (DDIM)", value=0.0)
                a_prompt = gr.Textbox(label="Added Prompt", value='best quality, extremely detailed')
                n_prompt = gr.Textbox(label="Negative Prompt",
                                      value='longbody, lowres, bad anatomy, bad hands, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality')
        with gr.Column():
            result_gallery = gr.Gallery(label='Output', show_label=False, elem_id="gallery").style(grid=2, height='auto')
    ips = [input_image, prompt, a_prompt, n_prompt, num_samples, image_resolution, detect_resolution, ddim_steps, guess_mode, strength, scale, seed, eta]
    run_button.click(fn=process, inputs=ips, outputs=[result_gallery])


block.launch(server_name='0.0.0.0')
```
generation/control/ControlNet/gradio_hough2image.py
ADDED
|
@@ -0,0 +1,100 @@
from share import *
import config

import cv2
import einops
import gradio as gr
import numpy as np
import torch
import random

from pytorch_lightning import seed_everything
from annotator.util import resize_image, HWC3
from annotator.mlsd import MLSDdetector
from cldm.model import create_model, load_state_dict
from cldm.ddim_hacked import DDIMSampler


apply_mlsd = MLSDdetector()

model = create_model('./models/cldm_v15.yaml').cpu()
model.load_state_dict(load_state_dict('./models/control_sd15_mlsd.pth', location='cuda'))
model = model.cuda()
ddim_sampler = DDIMSampler(model)


def process(input_image, prompt, a_prompt, n_prompt, num_samples, image_resolution, detect_resolution, ddim_steps, guess_mode, strength, scale, seed, eta, value_threshold, distance_threshold):
    with torch.no_grad():
        input_image = HWC3(input_image)
        detected_map = apply_mlsd(resize_image(input_image, detect_resolution), value_threshold, distance_threshold)
        detected_map = HWC3(detected_map)
        img = resize_image(input_image, image_resolution)
        H, W, C = img.shape

        detected_map = cv2.resize(detected_map, (W, H), interpolation=cv2.INTER_NEAREST)

        control = torch.from_numpy(detected_map.copy()).float().cuda() / 255.0
        control = torch.stack([control for _ in range(num_samples)], dim=0)
        control = einops.rearrange(control, 'b h w c -> b c h w').clone()

        if seed == -1:
            seed = random.randint(0, 65535)
        seed_everything(seed)

        if config.save_memory:
            model.low_vram_shift(is_diffusing=False)

        cond = {"c_concat": [control], "c_crossattn": [model.get_learned_conditioning([prompt + ', ' + a_prompt] * num_samples)]}
        un_cond = {"c_concat": None if guess_mode else [control], "c_crossattn": [model.get_learned_conditioning([n_prompt] * num_samples)]}
        shape = (4, H // 8, W // 8)

        if config.save_memory:
            model.low_vram_shift(is_diffusing=True)

        model.control_scales = [strength * (0.825 ** float(12 - i)) for i in range(13)] if guess_mode else ([strength] * 13)  # Magic number. IDK why. Perhaps because 0.825**12<0.01 but 0.826**12>0.01
        samples, intermediates = ddim_sampler.sample(ddim_steps, num_samples,
                                                     shape, cond, verbose=False, eta=eta,
                                                     unconditional_guidance_scale=scale,
                                                     unconditional_conditioning=un_cond)

        if config.save_memory:
            model.low_vram_shift(is_diffusing=False)

        x_samples = model.decode_first_stage(samples)
        x_samples = (einops.rearrange(x_samples, 'b c h w -> b h w c') * 127.5 + 127.5).cpu().numpy().clip(0, 255).astype(np.uint8)

        results = [x_samples[i] for i in range(num_samples)]
    return [255 - cv2.dilate(detected_map, np.ones(shape=(3, 3), dtype=np.uint8), iterations=1)] + results


block = gr.Blocks().queue()
with block:
    with gr.Row():
        gr.Markdown("## Control Stable Diffusion with Hough Line Maps")
    with gr.Row():
        with gr.Column():
            input_image = gr.Image(source='upload', type="numpy")
            prompt = gr.Textbox(label="Prompt")
            run_button = gr.Button(label="Run")
            with gr.Accordion("Advanced options", open=False):
                num_samples = gr.Slider(label="Images", minimum=1, maximum=12, value=1, step=1)
                image_resolution = gr.Slider(label="Image Resolution", minimum=256, maximum=768, value=512, step=64)
                strength = gr.Slider(label="Control Strength", minimum=0.0, maximum=2.0, value=1.0, step=0.01)
                guess_mode = gr.Checkbox(label='Guess Mode', value=False)
                detect_resolution = gr.Slider(label="Hough Resolution", minimum=128, maximum=1024, value=512, step=1)
                value_threshold = gr.Slider(label="Hough value threshold (MLSD)", minimum=0.01, maximum=2.0, value=0.1, step=0.01)
                distance_threshold = gr.Slider(label="Hough distance threshold (MLSD)", minimum=0.01, maximum=20.0, value=0.1, step=0.01)
                ddim_steps = gr.Slider(label="Steps", minimum=1, maximum=100, value=20, step=1)
                scale = gr.Slider(label="Guidance Scale", minimum=0.1, maximum=30.0, value=9.0, step=0.1)
                seed = gr.Slider(label="Seed", minimum=-1, maximum=2147483647, step=1, randomize=True)
                eta = gr.Number(label="eta (DDIM)", value=0.0)
                a_prompt = gr.Textbox(label="Added Prompt", value='best quality, extremely detailed')
                n_prompt = gr.Textbox(label="Negative Prompt",
                                      value='longbody, lowres, bad anatomy, bad hands, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality')
            with gr.Column():
                result_gallery = gr.Gallery(label='Output', show_label=False, elem_id="gallery").style(grid=2, height='auto')
    ips = [input_image, prompt, a_prompt, n_prompt, num_samples, image_resolution, detect_resolution, ddim_steps, guess_mode, strength, scale, seed, eta, value_threshold, distance_threshold]
    run_button.click(fn=process, inputs=ips, outputs=[result_gallery])


block.launch(server_name='0.0.0.0')
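The guess-mode line `model.control_scales = [strength * (0.825 ** float(12 - i)) for i in range(13)]` builds an exponentially decaying weight schedule over the 13 ControlNet injection points. A minimal standalone sketch of that schedule (the function name is illustrative, not from the repo):

```python
def control_scales(strength, guess_mode):
    # Guess mode: the shallowest injection point (i=0) gets the smallest
    # weight and the deepest (i=12) gets the full strength, decaying by
    # a factor of 0.825 per level. Otherwise all 13 scales are identical.
    if guess_mode:
        return [strength * (0.825 ** float(12 - i)) for i in range(13)]
    return [strength] * 13

scales = control_scales(1.0, True)
```

This makes the control signal dominate the coarse layers least, which is what lets guess mode hallucinate content while still loosely following the line map.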
generation/control/ControlNet/gradio_normal2image.py
ADDED
|
@@ -0,0 +1,99 @@
from share import *
import config

import cv2
import einops
import gradio as gr
import numpy as np
import torch
import random

from pytorch_lightning import seed_everything
from annotator.util import resize_image, HWC3
from annotator.midas import MidasDetector
from cldm.model import create_model, load_state_dict
from cldm.ddim_hacked import DDIMSampler


apply_midas = MidasDetector()

model = create_model('./models/cldm_v15.yaml').cpu()
model.load_state_dict(load_state_dict('./models/control_sd15_normal.pth', location='cuda'))
model = model.cuda()
ddim_sampler = DDIMSampler(model)


def process(input_image, prompt, a_prompt, n_prompt, num_samples, image_resolution, detect_resolution, ddim_steps, guess_mode, strength, scale, seed, eta, bg_threshold):
    with torch.no_grad():
        input_image = HWC3(input_image)
        _, detected_map = apply_midas(resize_image(input_image, detect_resolution), bg_th=bg_threshold)
        detected_map = HWC3(detected_map)
        img = resize_image(input_image, image_resolution)
        H, W, C = img.shape

        detected_map = cv2.resize(detected_map, (W, H), interpolation=cv2.INTER_LINEAR)

        control = torch.from_numpy(detected_map[:, :, ::-1].copy()).float().cuda() / 255.0
        control = torch.stack([control for _ in range(num_samples)], dim=0)
        control = einops.rearrange(control, 'b h w c -> b c h w').clone()

        if seed == -1:
            seed = random.randint(0, 65535)
        seed_everything(seed)

        if config.save_memory:
            model.low_vram_shift(is_diffusing=False)

        cond = {"c_concat": [control], "c_crossattn": [model.get_learned_conditioning([prompt + ', ' + a_prompt] * num_samples)]}
        un_cond = {"c_concat": None if guess_mode else [control], "c_crossattn": [model.get_learned_conditioning([n_prompt] * num_samples)]}
        shape = (4, H // 8, W // 8)

        if config.save_memory:
            model.low_vram_shift(is_diffusing=True)

        model.control_scales = [strength * (0.825 ** float(12 - i)) for i in range(13)] if guess_mode else ([strength] * 13)  # Magic number. IDK why. Perhaps because 0.825**12<0.01 but 0.826**12>0.01
        samples, intermediates = ddim_sampler.sample(ddim_steps, num_samples,
                                                     shape, cond, verbose=False, eta=eta,
                                                     unconditional_guidance_scale=scale,
                                                     unconditional_conditioning=un_cond)

        if config.save_memory:
            model.low_vram_shift(is_diffusing=False)

        x_samples = model.decode_first_stage(samples)
        x_samples = (einops.rearrange(x_samples, 'b c h w -> b h w c') * 127.5 + 127.5).cpu().numpy().clip(0, 255).astype(np.uint8)

        results = [x_samples[i] for i in range(num_samples)]
    return [detected_map] + results


block = gr.Blocks().queue()
with block:
    with gr.Row():
        gr.Markdown("## Control Stable Diffusion with Normal Maps")
    with gr.Row():
        with gr.Column():
            input_image = gr.Image(source='upload', type="numpy")
            prompt = gr.Textbox(label="Prompt")
            run_button = gr.Button(label="Run")
            with gr.Accordion("Advanced options", open=False):
                num_samples = gr.Slider(label="Images", minimum=1, maximum=12, value=1, step=1)
                image_resolution = gr.Slider(label="Image Resolution", minimum=256, maximum=768, value=512, step=64)
                strength = gr.Slider(label="Control Strength", minimum=0.0, maximum=2.0, value=1.0, step=0.01)
                guess_mode = gr.Checkbox(label='Guess Mode', value=False)
                detect_resolution = gr.Slider(label="Normal Resolution", minimum=128, maximum=1024, value=384, step=1)
                bg_threshold = gr.Slider(label="Normal background threshold", minimum=0.0, maximum=1.0, value=0.4, step=0.01)
                ddim_steps = gr.Slider(label="Steps", minimum=1, maximum=100, value=20, step=1)
                scale = gr.Slider(label="Guidance Scale", minimum=0.1, maximum=30.0, value=9.0, step=0.1)
                seed = gr.Slider(label="Seed", minimum=-1, maximum=2147483647, step=1, randomize=True)
                eta = gr.Number(label="eta (DDIM)", value=0.0)
                a_prompt = gr.Textbox(label="Added Prompt", value='best quality, extremely detailed')
                n_prompt = gr.Textbox(label="Negative Prompt",
                                      value='longbody, lowres, bad anatomy, bad hands, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality')
            with gr.Column():
                result_gallery = gr.Gallery(label='Output', show_label=False, elem_id="gallery").style(grid=2, height='auto')
    ips = [input_image, prompt, a_prompt, n_prompt, num_samples, image_resolution, detect_resolution, ddim_steps, guess_mode, strength, scale, seed, eta, bg_threshold]
    run_button.click(fn=process, inputs=ips, outputs=[result_gallery])


block.launch(server_name='0.0.0.0')
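All of these scripts convert the decoder output with `x * 127.5 + 127.5` followed by clip and cast. That works because the latent decoder emits values roughly in [-1, 1]. A small numpy-only sketch of that mapping (the helper name is illustrative):

```python
import numpy as np

def to_uint8(x):
    # Map decoder outputs from roughly [-1, 1] to displayable uint8 pixels
    # in [0, 255]; clip first so any overshoot outside [-1, 1] saturates
    # instead of wrapping around when cast to uint8.
    return (x * 127.5 + 127.5).clip(0, 255).astype(np.uint8)

pixels = to_uint8(np.array([-1.2, -1.0, 0.0, 1.0]))
```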
generation/control/ControlNet/gradio_pose2image.py
ADDED
|
@@ -0,0 +1,98 @@
from share import *
import config

import cv2
import einops
import gradio as gr
import numpy as np
import torch
import random

from pytorch_lightning import seed_everything
from annotator.util import resize_image, HWC3
from annotator.openpose import OpenposeDetector
from cldm.model import create_model, load_state_dict
from cldm.ddim_hacked import DDIMSampler


apply_openpose = OpenposeDetector()

model = create_model('./models/cldm_v15.yaml').cpu()
model.load_state_dict(load_state_dict('./models/control_sd15_openpose.pth', location='cuda'))
model = model.cuda()
ddim_sampler = DDIMSampler(model)


def process(input_image, prompt, a_prompt, n_prompt, num_samples, image_resolution, detect_resolution, ddim_steps, guess_mode, strength, scale, seed, eta):
    with torch.no_grad():
        input_image = HWC3(input_image)
        detected_map, _ = apply_openpose(resize_image(input_image, detect_resolution))
        detected_map = HWC3(detected_map)
        img = resize_image(input_image, image_resolution)
        H, W, C = img.shape

        detected_map = cv2.resize(detected_map, (W, H), interpolation=cv2.INTER_NEAREST)

        control = torch.from_numpy(detected_map.copy()).float().cuda() / 255.0
        control = torch.stack([control for _ in range(num_samples)], dim=0)
        control = einops.rearrange(control, 'b h w c -> b c h w').clone()

        if seed == -1:
            seed = random.randint(0, 65535)
        seed_everything(seed)

        if config.save_memory:
            model.low_vram_shift(is_diffusing=False)

        cond = {"c_concat": [control], "c_crossattn": [model.get_learned_conditioning([prompt + ', ' + a_prompt] * num_samples)]}
        un_cond = {"c_concat": None if guess_mode else [control], "c_crossattn": [model.get_learned_conditioning([n_prompt] * num_samples)]}
        shape = (4, H // 8, W // 8)

        if config.save_memory:
            model.low_vram_shift(is_diffusing=True)

        model.control_scales = [strength * (0.825 ** float(12 - i)) for i in range(13)] if guess_mode else ([strength] * 13)  # Magic number. IDK why. Perhaps because 0.825**12<0.01 but 0.826**12>0.01
        samples, intermediates = ddim_sampler.sample(ddim_steps, num_samples,
                                                     shape, cond, verbose=False, eta=eta,
                                                     unconditional_guidance_scale=scale,
                                                     unconditional_conditioning=un_cond)

        if config.save_memory:
            model.low_vram_shift(is_diffusing=False)

        x_samples = model.decode_first_stage(samples)
        x_samples = (einops.rearrange(x_samples, 'b c h w -> b h w c') * 127.5 + 127.5).cpu().numpy().clip(0, 255).astype(np.uint8)

        results = [x_samples[i] for i in range(num_samples)]
    return [detected_map] + results


block = gr.Blocks().queue()
with block:
    with gr.Row():
        gr.Markdown("## Control Stable Diffusion with Human Pose")
    with gr.Row():
        with gr.Column():
            input_image = gr.Image(source='upload', type="numpy")
            prompt = gr.Textbox(label="Prompt")
            run_button = gr.Button(label="Run")
            with gr.Accordion("Advanced options", open=False):
                num_samples = gr.Slider(label="Images", minimum=1, maximum=12, value=1, step=1)
                image_resolution = gr.Slider(label="Image Resolution", minimum=256, maximum=768, value=512, step=64)
                strength = gr.Slider(label="Control Strength", minimum=0.0, maximum=2.0, value=1.0, step=0.01)
                guess_mode = gr.Checkbox(label='Guess Mode', value=False)
                detect_resolution = gr.Slider(label="OpenPose Resolution", minimum=128, maximum=1024, value=512, step=1)
                ddim_steps = gr.Slider(label="Steps", minimum=1, maximum=100, value=20, step=1)
                scale = gr.Slider(label="Guidance Scale", minimum=0.1, maximum=30.0, value=9.0, step=0.1)
                seed = gr.Slider(label="Seed", minimum=-1, maximum=2147483647, step=1, randomize=True)
                eta = gr.Number(label="eta (DDIM)", value=0.0)
                a_prompt = gr.Textbox(label="Added Prompt", value='best quality, extremely detailed')
                n_prompt = gr.Textbox(label="Negative Prompt",
                                      value='longbody, lowres, bad anatomy, bad hands, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality')
            with gr.Column():
                result_gallery = gr.Gallery(label='Output', show_label=False, elem_id="gallery").style(grid=2, height='auto')
    ips = [input_image, prompt, a_prompt, n_prompt, num_samples, image_resolution, detect_resolution, ddim_steps, guess_mode, strength, scale, seed, eta]
    run_button.click(fn=process, inputs=ips, outputs=[result_gallery])


block.launch(server_name='0.0.0.0')
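Every script samples in a latent space of shape `(4, H // 8, W // 8)`. That follows from Stable Diffusion's VAE, which downsamples each spatial dimension by 8x and uses 4 latent channels; a tiny sketch of the computation (helper name is illustrative):

```python
def latent_shape(H, W, channels=4, factor=8):
    # Stable Diffusion's autoencoder compresses the image 8x spatially,
    # so a 512x768 image is denoised as a 4-channel 64x96 latent.
    return (channels, H // factor, W // factor)

shape = latent_shape(512, 768)
```

This is also why the image-resolution slider steps by 64: it keeps H and W divisible by 8 at every UNet downsampling level.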
generation/control/ControlNet/gradio_scribble2image.py
ADDED
|
@@ -0,0 +1,92 @@
from share import *
import config

import cv2
import einops
import gradio as gr
import numpy as np
import torch
import random

from pytorch_lightning import seed_everything
from annotator.util import resize_image, HWC3
from cldm.model import create_model, load_state_dict
from cldm.ddim_hacked import DDIMSampler


model = create_model('./models/cldm_v15.yaml').cpu()
model.load_state_dict(load_state_dict('./models/control_sd15_scribble.pth', location='cuda'))
model = model.cuda()
ddim_sampler = DDIMSampler(model)


def process(input_image, prompt, a_prompt, n_prompt, num_samples, image_resolution, ddim_steps, guess_mode, strength, scale, seed, eta):
    with torch.no_grad():
        img = resize_image(HWC3(input_image), image_resolution)
        H, W, C = img.shape

        detected_map = np.zeros_like(img, dtype=np.uint8)
        detected_map[np.min(img, axis=2) < 127] = 255

        control = torch.from_numpy(detected_map.copy()).float().cuda() / 255.0
        control = torch.stack([control for _ in range(num_samples)], dim=0)
        control = einops.rearrange(control, 'b h w c -> b c h w').clone()

        if seed == -1:
            seed = random.randint(0, 65535)
        seed_everything(seed)

        if config.save_memory:
            model.low_vram_shift(is_diffusing=False)

        cond = {"c_concat": [control], "c_crossattn": [model.get_learned_conditioning([prompt + ', ' + a_prompt] * num_samples)]}
        un_cond = {"c_concat": None if guess_mode else [control], "c_crossattn": [model.get_learned_conditioning([n_prompt] * num_samples)]}
        shape = (4, H // 8, W // 8)

        if config.save_memory:
            model.low_vram_shift(is_diffusing=True)

        model.control_scales = [strength * (0.825 ** float(12 - i)) for i in range(13)] if guess_mode else ([strength] * 13)  # Magic number. IDK why. Perhaps because 0.825**12<0.01 but 0.826**12>0.01
        samples, intermediates = ddim_sampler.sample(ddim_steps, num_samples,
                                                     shape, cond, verbose=False, eta=eta,
                                                     unconditional_guidance_scale=scale,
                                                     unconditional_conditioning=un_cond)

        if config.save_memory:
            model.low_vram_shift(is_diffusing=False)

        x_samples = model.decode_first_stage(samples)
        x_samples = (einops.rearrange(x_samples, 'b c h w -> b h w c') * 127.5 + 127.5).cpu().numpy().clip(0, 255).astype(np.uint8)

        results = [x_samples[i] for i in range(num_samples)]
    return [255 - detected_map] + results


block = gr.Blocks().queue()
with block:
    with gr.Row():
        gr.Markdown("## Control Stable Diffusion with Scribble Maps")
    with gr.Row():
        with gr.Column():
            input_image = gr.Image(source='upload', type="numpy")
            prompt = gr.Textbox(label="Prompt")
            run_button = gr.Button(label="Run")
            with gr.Accordion("Advanced options", open=False):
                num_samples = gr.Slider(label="Images", minimum=1, maximum=12, value=1, step=1)
                image_resolution = gr.Slider(label="Image Resolution", minimum=256, maximum=768, value=512, step=64)
                strength = gr.Slider(label="Control Strength", minimum=0.0, maximum=2.0, value=1.0, step=0.01)
                guess_mode = gr.Checkbox(label='Guess Mode', value=False)
                ddim_steps = gr.Slider(label="Steps", minimum=1, maximum=100, value=20, step=1)
                scale = gr.Slider(label="Guidance Scale", minimum=0.1, maximum=30.0, value=9.0, step=0.1)
                seed = gr.Slider(label="Seed", minimum=-1, maximum=2147483647, step=1, randomize=True)
                eta = gr.Number(label="eta (DDIM)", value=0.0)
                a_prompt = gr.Textbox(label="Added Prompt", value='best quality, extremely detailed')
                n_prompt = gr.Textbox(label="Negative Prompt",
                                      value='longbody, lowres, bad anatomy, bad hands, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality')
            with gr.Column():
                result_gallery = gr.Gallery(label='Output', show_label=False, elem_id="gallery").style(grid=2, height='auto')
    ips = [input_image, prompt, a_prompt, n_prompt, num_samples, image_resolution, ddim_steps, guess_mode, strength, scale, seed, eta]
    run_button.click(fn=process, inputs=ips, outputs=[result_gallery])


block.launch(server_name='0.0.0.0')
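Unlike the other demos, the scribble script needs no learned annotator: it binarizes the upload directly with `detected_map[np.min(img, axis=2) < 127] = 255`, treating dark marks on a light background as strokes. A minimal numpy-only sketch (the helper name is illustrative):

```python
import numpy as np

def scribble_map(img):
    # A pixel counts as a stroke when its darkest channel falls below
    # mid-gray; strokes become white (255) on a black map, which is the
    # polarity the scribble ControlNet expects as input.
    detected = np.zeros_like(img, dtype=np.uint8)
    detected[np.min(img, axis=2) < 127] = 255
    return detected

canvas = np.full((2, 2, 3), 255, dtype=np.uint8)  # white canvas
canvas[0, 0] = 0                                  # one dark scribble pixel
m = scribble_map(canvas)
```

The `255 - detected_map` in the return value only re-inverts the map for display, so the gallery shows black strokes on white as drawn.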
generation/control/ControlNet/gradio_scribble2image_interactive.py
ADDED
|
@@ -0,0 +1,102 @@
from share import *
import config

import cv2
import einops
import gradio as gr
import numpy as np
import torch
import random

from pytorch_lightning import seed_everything
from annotator.util import resize_image, HWC3
from cldm.model import create_model, load_state_dict
from cldm.ddim_hacked import DDIMSampler


model = create_model('./models/cldm_v15.yaml').cpu()
model.load_state_dict(load_state_dict('./models/control_sd15_scribble.pth', location='cuda'))
model = model.cuda()
ddim_sampler = DDIMSampler(model)


def process(input_image, prompt, a_prompt, n_prompt, num_samples, image_resolution, ddim_steps, guess_mode, strength, scale, seed, eta):
    with torch.no_grad():
        img = resize_image(HWC3(input_image['mask'][:, :, 0]), image_resolution)
        H, W, C = img.shape

        detected_map = np.zeros_like(img, dtype=np.uint8)
        detected_map[np.min(img, axis=2) > 127] = 255

        control = torch.from_numpy(detected_map.copy()).float().cuda() / 255.0
        control = torch.stack([control for _ in range(num_samples)], dim=0)
        control = einops.rearrange(control, 'b h w c -> b c h w').clone()

        if seed == -1:
            seed = random.randint(0, 65535)
        seed_everything(seed)

        if config.save_memory:
            model.low_vram_shift(is_diffusing=False)

        cond = {"c_concat": [control], "c_crossattn": [model.get_learned_conditioning([prompt + ', ' + a_prompt] * num_samples)]}
        un_cond = {"c_concat": None if guess_mode else [control], "c_crossattn": [model.get_learned_conditioning([n_prompt] * num_samples)]}
        shape = (4, H // 8, W // 8)

        if config.save_memory:
            model.low_vram_shift(is_diffusing=True)

        model.control_scales = [strength * (0.825 ** float(12 - i)) for i in range(13)] if guess_mode else ([strength] * 13)  # Magic number. IDK why. Perhaps because 0.825**12<0.01 but 0.826**12>0.01
        samples, intermediates = ddim_sampler.sample(ddim_steps, num_samples,
                                                     shape, cond, verbose=False, eta=eta,
                                                     unconditional_guidance_scale=scale,
                                                     unconditional_conditioning=un_cond)

        if config.save_memory:
            model.low_vram_shift(is_diffusing=False)

        x_samples = model.decode_first_stage(samples)
        x_samples = (einops.rearrange(x_samples, 'b c h w -> b h w c') * 127.5 + 127.5).cpu().numpy().clip(0, 255).astype(np.uint8)

        results = [x_samples[i] for i in range(num_samples)]
    return [255 - detected_map] + results


def create_canvas(w, h):
    return np.zeros(shape=(h, w, 3), dtype=np.uint8) + 255


block = gr.Blocks().queue()
with block:
    with gr.Row():
        gr.Markdown("## Control Stable Diffusion with Interactive Scribbles")
    with gr.Row():
        with gr.Column():
            canvas_width = gr.Slider(label="Canvas Width", minimum=256, maximum=1024, value=512, step=1)
            canvas_height = gr.Slider(label="Canvas Height", minimum=256, maximum=1024, value=512, step=1)
            create_button = gr.Button(label="Start", value='Open drawing canvas!')
            input_image = gr.Image(source='upload', type='numpy', tool='sketch')
            gr.Markdown(value='Do not forget to change your brush width to make it thinner. (Gradio do not allow developers to set brush width so you need to do it manually.) '
                              'Just click on the small pencil icon in the upper right corner of the above block.')
            create_button.click(fn=create_canvas, inputs=[canvas_width, canvas_height], outputs=[input_image])
            prompt = gr.Textbox(label="Prompt")
            run_button = gr.Button(label="Run")
            with gr.Accordion("Advanced options", open=False):
                num_samples = gr.Slider(label="Images", minimum=1, maximum=12, value=1, step=1)
                image_resolution = gr.Slider(label="Image Resolution", minimum=256, maximum=768, value=512, step=64)
                strength = gr.Slider(label="Control Strength", minimum=0.0, maximum=2.0, value=1.0, step=0.01)
                guess_mode = gr.Checkbox(label='Guess Mode', value=False)
                ddim_steps = gr.Slider(label="Steps", minimum=1, maximum=100, value=20, step=1)
                scale = gr.Slider(label="Guidance Scale", minimum=0.1, maximum=30.0, value=9.0, step=0.1)
                seed = gr.Slider(label="Seed", minimum=-1, maximum=2147483647, step=1, randomize=True)
                eta = gr.Number(label="eta (DDIM)", value=0.0)
                a_prompt = gr.Textbox(label="Added Prompt", value='best quality, extremely detailed')
                n_prompt = gr.Textbox(label="Negative Prompt",
                                      value='longbody, lowres, bad anatomy, bad hands, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality')
            with gr.Column():
                result_gallery = gr.Gallery(label='Output', show_label=False, elem_id="gallery").style(grid=2, height='auto')
    ips = [input_image, prompt, a_prompt, n_prompt, num_samples, image_resolution, ddim_steps, guess_mode, strength, scale, seed, eta]
    run_button.click(fn=process, inputs=ips, outputs=[result_gallery])


block.launch(server_name='0.0.0.0')
generation/control/ControlNet/gradio_seg2image.py
ADDED
|
@@ -0,0 +1,97 @@
```python
from share import *
import config

import cv2
import einops
import gradio as gr
import numpy as np
import torch
import random

from pytorch_lightning import seed_everything
from annotator.util import resize_image, HWC3
from annotator.uniformer import UniformerDetector
from cldm.model import create_model, load_state_dict
from cldm.ddim_hacked import DDIMSampler


apply_uniformer = UniformerDetector()

model = create_model('./models/cldm_v15.yaml').cpu()
model.load_state_dict(load_state_dict('./models/control_sd15_seg.pth', location='cuda'))
model = model.cuda()
ddim_sampler = DDIMSampler(model)


def process(input_image, prompt, a_prompt, n_prompt, num_samples, image_resolution, detect_resolution, ddim_steps, guess_mode, strength, scale, seed, eta):
    with torch.no_grad():
        input_image = HWC3(input_image)
        detected_map = apply_uniformer(resize_image(input_image, detect_resolution))
        img = resize_image(input_image, image_resolution)
        H, W, C = img.shape

        detected_map = cv2.resize(detected_map, (W, H), interpolation=cv2.INTER_NEAREST)

        control = torch.from_numpy(detected_map.copy()).float().cuda() / 255.0
        control = torch.stack([control for _ in range(num_samples)], dim=0)
        control = einops.rearrange(control, 'b h w c -> b c h w').clone()

        if seed == -1:
            seed = random.randint(0, 65535)
        seed_everything(seed)

        if config.save_memory:
            model.low_vram_shift(is_diffusing=False)

        cond = {"c_concat": [control], "c_crossattn": [model.get_learned_conditioning([prompt + ', ' + a_prompt] * num_samples)]}
        un_cond = {"c_concat": None if guess_mode else [control], "c_crossattn": [model.get_learned_conditioning([n_prompt] * num_samples)]}
        shape = (4, H // 8, W // 8)

        if config.save_memory:
            model.low_vram_shift(is_diffusing=True)

        model.control_scales = [strength * (0.825 ** float(12 - i)) for i in range(13)] if guess_mode else ([strength] * 13)  # Magic number. IDK why. Perhaps because 0.825**12<0.01 but 0.826**12>0.01
        samples, intermediates = ddim_sampler.sample(ddim_steps, num_samples,
                                                     shape, cond, verbose=False, eta=eta,
                                                     unconditional_guidance_scale=scale,
                                                     unconditional_conditioning=un_cond)

        if config.save_memory:
            model.low_vram_shift(is_diffusing=False)

        x_samples = model.decode_first_stage(samples)
        x_samples = (einops.rearrange(x_samples, 'b c h w -> b h w c') * 127.5 + 127.5).cpu().numpy().clip(0, 255).astype(np.uint8)

        results = [x_samples[i] for i in range(num_samples)]
    return [detected_map] + results


block = gr.Blocks().queue()
with block:
    with gr.Row():
        gr.Markdown("## Control Stable Diffusion with Segmentation Maps")
    with gr.Row():
        with gr.Column():
            input_image = gr.Image(source='upload', type="numpy")
            prompt = gr.Textbox(label="Prompt")
            run_button = gr.Button(label="Run")
            with gr.Accordion("Advanced options", open=False):
                num_samples = gr.Slider(label="Images", minimum=1, maximum=12, value=1, step=1)
                image_resolution = gr.Slider(label="Image Resolution", minimum=256, maximum=768, value=512, step=64)
                strength = gr.Slider(label="Control Strength", minimum=0.0, maximum=2.0, value=1.0, step=0.01)
                guess_mode = gr.Checkbox(label='Guess Mode', value=False)
                detect_resolution = gr.Slider(label="Segmentation Resolution", minimum=128, maximum=1024, value=512, step=1)
                ddim_steps = gr.Slider(label="Steps", minimum=1, maximum=100, value=20, step=1)
                scale = gr.Slider(label="Guidance Scale", minimum=0.1, maximum=30.0, value=9.0, step=0.1)
                seed = gr.Slider(label="Seed", minimum=-1, maximum=2147483647, step=1, randomize=True)
                eta = gr.Number(label="eta (DDIM)", value=0.0)
                a_prompt = gr.Textbox(label="Added Prompt", value='best quality, extremely detailed')
                n_prompt = gr.Textbox(label="Negative Prompt",
                                      value='longbody, lowres, bad anatomy, bad hands, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality')
        with gr.Column():
            result_gallery = gr.Gallery(label='Output', show_label=False, elem_id="gallery").style(grid=2, height='auto')
    ips = [input_image, prompt, a_prompt, n_prompt, num_samples, image_resolution, detect_resolution, ddim_steps, guess_mode, strength, scale, seed, eta]
    run_button.click(fn=process, inputs=ips, outputs=[result_gallery])


block.launch(server_name='0.0.0.0')
```
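The per-block conditioning strengths set on `model.control_scales` above can be isolated as a small standalone sketch (the helper name `make_control_scales` is illustrative, not part of the repo): in normal mode all 13 ControlNet output blocks get the same `strength`, while guess mode damps shallower blocks geometrically by `0.825 ** (12 - i)` so only the deepest block keeps full strength.

```python
def make_control_scales(strength: float, guess_mode: bool) -> list:
    """Sketch of the 13-entry scale list used by the script above."""
    if guess_mode:
        # i = 0 (shallowest) gets strength * 0.825**12; i = 12 gets strength.
        return [strength * (0.825 ** float(12 - i)) for i in range(13)]
    return [strength] * 13

scales = make_control_scales(1.0, guess_mode=True)
assert len(scales) == 13
assert scales[-1] == 1.0                      # deepest block at full strength
assert abs(scales[0] - 0.825 ** 12) < 1e-12   # shallowest block strongly damped
```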
generation/control/ControlNet/ldm/data/__init__.py
ADDED
File without changes
generation/control/ControlNet/ldm/models/autoencoder.py
ADDED
@@ -0,0 +1,219 @@
```python
import torch
import pytorch_lightning as pl
import torch.nn.functional as F
from contextlib import contextmanager

from ldm.modules.diffusionmodules.model import Encoder, Decoder
from ldm.modules.distributions.distributions import DiagonalGaussianDistribution

from ldm.util import instantiate_from_config
from ldm.modules.ema import LitEma


class AutoencoderKL(pl.LightningModule):
    def __init__(self,
                 ddconfig,
                 lossconfig,
                 embed_dim,
                 ckpt_path=None,
                 ignore_keys=[],
                 image_key="image",
                 colorize_nlabels=None,
                 monitor=None,
                 ema_decay=None,
                 learn_logvar=False
                 ):
        super().__init__()
        self.learn_logvar = learn_logvar
        self.image_key = image_key
        self.encoder = Encoder(**ddconfig)
        self.decoder = Decoder(**ddconfig)
        self.loss = instantiate_from_config(lossconfig)
        assert ddconfig["double_z"]
        self.quant_conv = torch.nn.Conv2d(2*ddconfig["z_channels"], 2*embed_dim, 1)
        self.post_quant_conv = torch.nn.Conv2d(embed_dim, ddconfig["z_channels"], 1)
        self.embed_dim = embed_dim
        if colorize_nlabels is not None:
            assert type(colorize_nlabels)==int
            self.register_buffer("colorize", torch.randn(3, colorize_nlabels, 1, 1))
        if monitor is not None:
            self.monitor = monitor

        self.use_ema = ema_decay is not None
        if self.use_ema:
            self.ema_decay = ema_decay
            assert 0. < ema_decay < 1.
            self.model_ema = LitEma(self, decay=ema_decay)
            print(f"Keeping EMAs of {len(list(self.model_ema.buffers()))}.")

        if ckpt_path is not None:
            self.init_from_ckpt(ckpt_path, ignore_keys=ignore_keys)

    def init_from_ckpt(self, path, ignore_keys=list()):
        sd = torch.load(path, map_location="cpu")["state_dict"]
        keys = list(sd.keys())
        for k in keys:
            for ik in ignore_keys:
                if k.startswith(ik):
                    print("Deleting key {} from state_dict.".format(k))
                    del sd[k]
        self.load_state_dict(sd, strict=False)
        print(f"Restored from {path}")

    @contextmanager
    def ema_scope(self, context=None):
        if self.use_ema:
            self.model_ema.store(self.parameters())
            self.model_ema.copy_to(self)
            if context is not None:
                print(f"{context}: Switched to EMA weights")
        try:
            yield None
        finally:
            if self.use_ema:
                self.model_ema.restore(self.parameters())
                if context is not None:
                    print(f"{context}: Restored training weights")

    def on_train_batch_end(self, *args, **kwargs):
        if self.use_ema:
            self.model_ema(self)

    def encode(self, x):
        h = self.encoder(x)
        moments = self.quant_conv(h)
        posterior = DiagonalGaussianDistribution(moments)
        return posterior

    def decode(self, z):
        z = self.post_quant_conv(z)
        dec = self.decoder(z)
        return dec

    def forward(self, input, sample_posterior=True):
        posterior = self.encode(input)
        if sample_posterior:
            z = posterior.sample()
        else:
            z = posterior.mode()
        dec = self.decode(z)
        return dec, posterior

    def get_input(self, batch, k):
        x = batch[k]
        if len(x.shape) == 3:
            x = x[..., None]
        x = x.permute(0, 3, 1, 2).to(memory_format=torch.contiguous_format).float()
        return x

    def training_step(self, batch, batch_idx, optimizer_idx):
        inputs = self.get_input(batch, self.image_key)
        reconstructions, posterior = self(inputs)

        if optimizer_idx == 0:
            # train encoder+decoder+logvar
            aeloss, log_dict_ae = self.loss(inputs, reconstructions, posterior, optimizer_idx, self.global_step,
                                            last_layer=self.get_last_layer(), split="train")
            self.log("aeloss", aeloss, prog_bar=True, logger=True, on_step=True, on_epoch=True)
            self.log_dict(log_dict_ae, prog_bar=False, logger=True, on_step=True, on_epoch=False)
            return aeloss

        if optimizer_idx == 1:
            # train the discriminator
            discloss, log_dict_disc = self.loss(inputs, reconstructions, posterior, optimizer_idx, self.global_step,
                                                last_layer=self.get_last_layer(), split="train")

            self.log("discloss", discloss, prog_bar=True, logger=True, on_step=True, on_epoch=True)
            self.log_dict(log_dict_disc, prog_bar=False, logger=True, on_step=True, on_epoch=False)
            return discloss

    def validation_step(self, batch, batch_idx):
        log_dict = self._validation_step(batch, batch_idx)
        with self.ema_scope():
            log_dict_ema = self._validation_step(batch, batch_idx, postfix="_ema")
        return log_dict

    def _validation_step(self, batch, batch_idx, postfix=""):
        inputs = self.get_input(batch, self.image_key)
        reconstructions, posterior = self(inputs)
        aeloss, log_dict_ae = self.loss(inputs, reconstructions, posterior, 0, self.global_step,
                                        last_layer=self.get_last_layer(), split="val"+postfix)

        discloss, log_dict_disc = self.loss(inputs, reconstructions, posterior, 1, self.global_step,
                                            last_layer=self.get_last_layer(), split="val"+postfix)

        self.log(f"val{postfix}/rec_loss", log_dict_ae[f"val{postfix}/rec_loss"])
        self.log_dict(log_dict_ae)
        self.log_dict(log_dict_disc)
        return self.log_dict

    def configure_optimizers(self):
        lr = self.learning_rate
        ae_params_list = list(self.encoder.parameters()) + list(self.decoder.parameters()) + list(
            self.quant_conv.parameters()) + list(self.post_quant_conv.parameters())
        if self.learn_logvar:
            print(f"{self.__class__.__name__}: Learning logvar")
            ae_params_list.append(self.loss.logvar)
        opt_ae = torch.optim.Adam(ae_params_list,
                                  lr=lr, betas=(0.5, 0.9))
        opt_disc = torch.optim.Adam(self.loss.discriminator.parameters(),
                                    lr=lr, betas=(0.5, 0.9))
        return [opt_ae, opt_disc], []

    def get_last_layer(self):
        return self.decoder.conv_out.weight

    @torch.no_grad()
    def log_images(self, batch, only_inputs=False, log_ema=False, **kwargs):
        log = dict()
        x = self.get_input(batch, self.image_key)
        x = x.to(self.device)
        if not only_inputs:
            xrec, posterior = self(x)
            if x.shape[1] > 3:
                # colorize with random projection
                assert xrec.shape[1] > 3
                x = self.to_rgb(x)
                xrec = self.to_rgb(xrec)
            log["samples"] = self.decode(torch.randn_like(posterior.sample()))
            log["reconstructions"] = xrec
            if log_ema or self.use_ema:
                with self.ema_scope():
                    xrec_ema, posterior_ema = self(x)
                    if x.shape[1] > 3:
                        # colorize with random projection
                        assert xrec_ema.shape[1] > 3
                        xrec_ema = self.to_rgb(xrec_ema)
                    log["samples_ema"] = self.decode(torch.randn_like(posterior_ema.sample()))
                    log["reconstructions_ema"] = xrec_ema
        log["inputs"] = x
        return log

    def to_rgb(self, x):
        assert self.image_key == "segmentation"
        if not hasattr(self, "colorize"):
            self.register_buffer("colorize", torch.randn(3, x.shape[1], 1, 1).to(x))
        x = F.conv2d(x, weight=self.colorize)
        x = 2.*(x-x.min())/(x.max()-x.min()) - 1.
        return x


class IdentityFirstStage(torch.nn.Module):
    def __init__(self, *args, vq_interface=False, **kwargs):
        self.vq_interface = vq_interface
        super().__init__()

    def encode(self, x, *args, **kwargs):
        return x

    def decode(self, x, *args, **kwargs):
        return x

    def quantize(self, x, *args, **kwargs):
        if self.vq_interface:
            return x, None, [None, None, None]
        return x

    def forward(self, x, *args, **kwargs):
        return x
```
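In `AutoencoderKL`, `quant_conv` outputs `2 * embed_dim` channels that `DiagonalGaussianDistribution` interprets as (mean, logvar), and `posterior.sample()` reparameterizes as `z = mean + exp(0.5 * logvar) * eps`. A minimal scalar sketch of that reparameterization (the helper name is illustrative, not from the repo):

```python
import math
import random

def sample_diagonal_gaussian(mean, logvar, rng=random.Random(0)):
    # z = mean + std * eps, with std = exp(0.5 * logvar) and eps ~ N(0, 1),
    # applied independently per dimension (diagonal covariance).
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, logvar)]

# As logvar -> -inf the std vanishes and the sample collapses to the mean,
# which is what `posterior.mode()` returns in the code above.
z = sample_diagonal_gaussian([1.0, -2.0], [-60.0, -60.0])
assert all(abs(a - b) < 1e-6 for a, b in zip(z, [1.0, -2.0]))
```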
generation/control/ControlNet/ldm/models/diffusion/__init__.py
ADDED
File without changes
generation/control/ControlNet/ldm/models/diffusion/ddim.py
ADDED
@@ -0,0 +1,336 @@
| 1 |
+
"""SAMPLING ONLY."""
|
| 2 |
+
|
| 3 |
+
import torch
|
| 4 |
+
import numpy as np
|
| 5 |
+
from tqdm import tqdm
|
| 6 |
+
|
| 7 |
+
from ldm.modules.diffusionmodules.util import make_ddim_sampling_parameters, make_ddim_timesteps, noise_like, extract_into_tensor
|
| 8 |
+
|
| 9 |
+
|
| 10 |
+
class DDIMSampler(object):
|
| 11 |
+
def __init__(self, model, schedule="linear", **kwargs):
|
| 12 |
+
super().__init__()
|
| 13 |
+
self.model = model
|
| 14 |
+
self.ddpm_num_timesteps = model.num_timesteps
|
| 15 |
+
self.schedule = schedule
|
| 16 |
+
|
| 17 |
+
def register_buffer(self, name, attr):
|
| 18 |
+
if type(attr) == torch.Tensor:
|
| 19 |
+
if attr.device != torch.device("cuda"):
|
| 20 |
+
attr = attr.to(torch.device("cuda"))
|
| 21 |
+
setattr(self, name, attr)
|
| 22 |
+
|
| 23 |
+
def make_schedule(self, ddim_num_steps, ddim_discretize="uniform", ddim_eta=0., verbose=True):
|
| 24 |
+
self.ddim_timesteps = make_ddim_timesteps(ddim_discr_method=ddim_discretize, num_ddim_timesteps=ddim_num_steps,
|
| 25 |
+
num_ddpm_timesteps=self.ddpm_num_timesteps,verbose=verbose)
|
| 26 |
+
alphas_cumprod = self.model.alphas_cumprod
|
| 27 |
+
assert alphas_cumprod.shape[0] == self.ddpm_num_timesteps, 'alphas have to be defined for each timestep'
|
| 28 |
+
to_torch = lambda x: x.clone().detach().to(torch.float32).to(self.model.device)
|
| 29 |
+
|
| 30 |
+
self.register_buffer('betas', to_torch(self.model.betas))
|
| 31 |
+
self.register_buffer('alphas_cumprod', to_torch(alphas_cumprod))
|
| 32 |
+
self.register_buffer('alphas_cumprod_prev', to_torch(self.model.alphas_cumprod_prev))
|
| 33 |
+
|
| 34 |
+
# calculations for diffusion q(x_t | x_{t-1}) and others
|
| 35 |
+
self.register_buffer('sqrt_alphas_cumprod', to_torch(np.sqrt(alphas_cumprod.cpu())))
|
| 36 |
+
self.register_buffer('sqrt_one_minus_alphas_cumprod', to_torch(np.sqrt(1. - alphas_cumprod.cpu())))
|
| 37 |
+
self.register_buffer('log_one_minus_alphas_cumprod', to_torch(np.log(1. - alphas_cumprod.cpu())))
|
| 38 |
+
self.register_buffer('sqrt_recip_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod.cpu())))
|
| 39 |
+
self.register_buffer('sqrt_recipm1_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod.cpu() - 1)))
|
| 40 |
+
|
| 41 |
+
# ddim sampling parameters
|
| 42 |
+
ddim_sigmas, ddim_alphas, ddim_alphas_prev = make_ddim_sampling_parameters(alphacums=alphas_cumprod.cpu(),
|
| 43 |
+
ddim_timesteps=self.ddim_timesteps,
|
| 44 |
+
eta=ddim_eta,verbose=verbose)
|
| 45 |
+
self.register_buffer('ddim_sigmas', ddim_sigmas)
|
| 46 |
+
self.register_buffer('ddim_alphas', ddim_alphas)
|
| 47 |
+
self.register_buffer('ddim_alphas_prev', ddim_alphas_prev)
|
| 48 |
+
self.register_buffer('ddim_sqrt_one_minus_alphas', np.sqrt(1. - ddim_alphas))
|
| 49 |
+
sigmas_for_original_sampling_steps = ddim_eta * torch.sqrt(
|
| 50 |
+
(1 - self.alphas_cumprod_prev) / (1 - self.alphas_cumprod) * (
|
| 51 |
+
1 - self.alphas_cumprod / self.alphas_cumprod_prev))
|
| 52 |
+
self.register_buffer('ddim_sigmas_for_original_num_steps', sigmas_for_original_sampling_steps)
|
| 53 |
+
|
| 54 |
+
@torch.no_grad()
|
| 55 |
+
def sample(self,
|
| 56 |
+
S,
|
| 57 |
+
batch_size,
|
| 58 |
+
shape,
|
| 59 |
+
conditioning=None,
|
| 60 |
+
callback=None,
|
| 61 |
+
normals_sequence=None,
|
| 62 |
+
img_callback=None,
|
| 63 |
+
quantize_x0=False,
|
| 64 |
+
eta=0.,
|
| 65 |
+
mask=None,
|
| 66 |
+
x0=None,
|
| 67 |
+
temperature=1.,
|
| 68 |
+
noise_dropout=0.,
|
| 69 |
+
score_corrector=None,
|
| 70 |
+
corrector_kwargs=None,
|
| 71 |
+
verbose=True,
|
| 72 |
+
x_T=None,
|
| 73 |
+
log_every_t=100,
|
| 74 |
+
unconditional_guidance_scale=1.,
|
| 75 |
+
unconditional_conditioning=None, # this has to come in the same format as the conditioning, # e.g. as encoded tokens, ...
|
| 76 |
+
dynamic_threshold=None,
|
| 77 |
+
ucg_schedule=None,
|
| 78 |
+
**kwargs
|
| 79 |
+
):
|
| 80 |
+
if conditioning is not None:
|
| 81 |
+
if isinstance(conditioning, dict):
|
| 82 |
+
ctmp = conditioning[list(conditioning.keys())[0]]
|
| 83 |
+
while isinstance(ctmp, list): ctmp = ctmp[0]
|
| 84 |
+
cbs = ctmp.shape[0]
|
| 85 |
+
if cbs != batch_size:
|
| 86 |
+
print(f"Warning: Got {cbs} conditionings but batch-size is {batch_size}")
|
| 87 |
+
|
| 88 |
+
elif isinstance(conditioning, list):
|
| 89 |
+
for ctmp in conditioning:
|
| 90 |
+
if ctmp.shape[0] != batch_size:
|
| 91 |
+
print(f"Warning: Got {cbs} conditionings but batch-size is {batch_size}")
|
| 92 |
+
|
| 93 |
+
else:
|
| 94 |
+
if conditioning.shape[0] != batch_size:
|
| 95 |
+
print(f"Warning: Got {conditioning.shape[0]} conditionings but batch-size is {batch_size}")
|
| 96 |
+
|
| 97 |
+
self.make_schedule(ddim_num_steps=S, ddim_eta=eta, verbose=verbose)
|
| 98 |
+
# sampling
|
| 99 |
+
C, H, W = shape
|
| 100 |
+
size = (batch_size, C, H, W)
|
| 101 |
+
print(f'Data shape for DDIM sampling is {size}, eta {eta}')
|
| 102 |
+
|
| 103 |
+
samples, intermediates = self.ddim_sampling(conditioning, size,
|
| 104 |
+
callback=callback,
|
| 105 |
+
img_callback=img_callback,
|
| 106 |
+
quantize_denoised=quantize_x0,
|
| 107 |
+
mask=mask, x0=x0,
|
| 108 |
+
ddim_use_original_steps=False,
|
| 109 |
+
noise_dropout=noise_dropout,
|
| 110 |
+
temperature=temperature,
|
| 111 |
+
score_corrector=score_corrector,
|
| 112 |
+
corrector_kwargs=corrector_kwargs,
|
| 113 |
+
x_T=x_T,
|
| 114 |
+
log_every_t=log_every_t,
|
| 115 |
+
unconditional_guidance_scale=unconditional_guidance_scale,
|
| 116 |
+
unconditional_conditioning=unconditional_conditioning,
|
| 117 |
+
dynamic_threshold=dynamic_threshold,
|
| 118 |
+
ucg_schedule=ucg_schedule
|
| 119 |
+
)
|
| 120 |
+
return samples, intermediates
|
| 121 |
+
|
| 122 |
+
@torch.no_grad()
|
| 123 |
+
def ddim_sampling(self, cond, shape,
|
| 124 |
+
x_T=None, ddim_use_original_steps=False,
|
| 125 |
+
callback=None, timesteps=None, quantize_denoised=False,
|
| 126 |
+
mask=None, x0=None, img_callback=None, log_every_t=100,
|
| 127 |
+
temperature=1., noise_dropout=0., score_corrector=None, corrector_kwargs=None,
|
| 128 |
+
unconditional_guidance_scale=1., unconditional_conditioning=None, dynamic_threshold=None,
|
| 129 |
+
ucg_schedule=None):
|
| 130 |
+
device = self.model.betas.device
|
| 131 |
+
b = shape[0]
|
| 132 |
+
if x_T is None:
|
| 133 |
+
img = torch.randn(shape, device=device)
|
| 134 |
+
else:
|
| 135 |
+
img = x_T
|
| 136 |
+
|
| 137 |
+
if timesteps is None:
|
| 138 |
+
timesteps = self.ddpm_num_timesteps if ddim_use_original_steps else self.ddim_timesteps
|
| 139 |
+
elif timesteps is not None and not ddim_use_original_steps:
|
| 140 |
+
subset_end = int(min(timesteps / self.ddim_timesteps.shape[0], 1) * self.ddim_timesteps.shape[0]) - 1
|
| 141 |
+
timesteps = self.ddim_timesteps[:subset_end]
|
| 142 |
+
|
| 143 |
+
intermediates = {'x_inter': [img], 'pred_x0': [img]}
|
| 144 |
+
time_range = reversed(range(0,timesteps)) if ddim_use_original_steps else np.flip(timesteps)
|
| 145 |
+
total_steps = timesteps if ddim_use_original_steps else timesteps.shape[0]
|
| 146 |
+
print(f"Running DDIM Sampling with {total_steps} timesteps")
|
| 147 |
+
|
| 148 |
+
iterator = tqdm(time_range, desc='DDIM Sampler', total=total_steps)
|
| 149 |
+
|
| 150 |
+
for i, step in enumerate(iterator):
|
| 151 |
+
index = total_steps - i - 1
|
| 152 |
+
ts = torch.full((b,), step, device=device, dtype=torch.long)
|
| 153 |
+
|
| 154 |
+
if mask is not None:
|
| 155 |
+
assert x0 is not None
|
| 156 |
+
img_orig = self.model.q_sample(x0, ts) # TODO: deterministic forward pass?
|
| 157 |
+
img = img_orig * mask + (1. - mask) * img
|
| 158 |
+
|
| 159 |
+
if ucg_schedule is not None:
|
| 160 |
+
assert len(ucg_schedule) == len(time_range)
|
| 161 |
+
unconditional_guidance_scale = ucg_schedule[i]
|
| 162 |
+
|
| 163 |
+
outs = self.p_sample_ddim(img, cond, ts, index=index, use_original_steps=ddim_use_original_steps,
|
| 164 |
+
quantize_denoised=quantize_denoised, temperature=temperature,
|
| 165 |
+
noise_dropout=noise_dropout, score_corrector=score_corrector,
|
| 166 |
+
corrector_kwargs=corrector_kwargs,
|
| 167 |
+
unconditional_guidance_scale=unconditional_guidance_scale,
|
| 168 |
+
unconditional_conditioning=unconditional_conditioning,
|
| 169 |
+
dynamic_threshold=dynamic_threshold)
|
| 170 |
+
img, pred_x0 = outs
|
| 171 |
+
if callback: callback(i)
|
| 172 |
+
if img_callback: img_callback(pred_x0, i)
|
| 173 |
+
|
| 174 |
+
if index % log_every_t == 0 or index == total_steps - 1:
|
| 175 |
+
intermediates['x_inter'].append(img)
|
| 176 |
+
intermediates['pred_x0'].append(pred_x0)
|
| 177 |
+
|
| 178 |
+
return img, intermediates
|
| 179 |
+
|
| 180 |
+
@torch.no_grad()
|
| 181 |
+
def p_sample_ddim(self, x, c, t, index, repeat_noise=False, use_original_steps=False, quantize_denoised=False,
|
| 182 |
+
temperature=1., noise_dropout=0., score_corrector=None, corrector_kwargs=None,
|
| 183 |
+
unconditional_guidance_scale=1., unconditional_conditioning=None,
|
| 184 |
+
dynamic_threshold=None):
|
| 185 |
+
b, *_, device = *x.shape, x.device
|
| 186 |
+
|
| 187 |
+
if unconditional_conditioning is None or unconditional_guidance_scale == 1.:
|
| 188 |
+
model_output = self.model.apply_model(x, t, c)
|
| 189 |
+
else:
|
| 190 |
+
x_in = torch.cat([x] * 2)
|
| 191 |
+
t_in = torch.cat([t] * 2)
|
| 192 |
+
if isinstance(c, dict):
|
| 193 |
+
assert isinstance(unconditional_conditioning, dict)
|
| 194 |
+
c_in = dict()
|
| 195 |
+
for k in c:
|
| 196 |
+
if isinstance(c[k], list):
|
| 197 |
+
c_in[k] = [torch.cat([
|
| 198 |
+
unconditional_conditioning[k][i],
|
| 199 |
+
c[k][i]]) for i in range(len(c[k]))]
|
| 200 |
+
else:
|
| 201 |
+
c_in[k] = torch.cat([
|
| 202 |
+
unconditional_conditioning[k],
|
| 203 |
+
c[k]])
|
| 204 |
+
elif isinstance(c, list):
|
| 205 |
+
c_in = list()
|
| 206 |
+
assert isinstance(unconditional_conditioning, list)
|
| 207 |
+
for i in range(len(c)):
|
| 208 |
+
c_in.append(torch.cat([unconditional_conditioning[i], c[i]]))
|
| 209 |
+
else:
|
| 210 |
+
c_in = torch.cat([unconditional_conditioning, c])
|
| 211 |
+
model_uncond, model_t = self.model.apply_model(x_in, t_in, c_in).chunk(2)
|
| 212 |
+
model_output = model_uncond + unconditional_guidance_scale * (model_t - model_uncond)
|
| 213 |
+
|
| 214 |
+
if self.model.parameterization == "v":
|
| 215 |
+
e_t = self.model.predict_eps_from_z_and_v(x, t, model_output)
|
| 216 |
+
else:
|
| 217 |
+
e_t = model_output
|
| 218 |
+
|
| 219 |
+
if score_corrector is not None:
|
| 220 |
+
assert self.model.parameterization == "eps", 'not implemented'
|
| 221 |
+
e_t = score_corrector.modify_score(self.model, e_t, x, t, c, **corrector_kwargs)
|
| 222 |
+
|
| 223 |
+
alphas = self.model.alphas_cumprod if use_original_steps else self.ddim_alphas
|
| 224 |
+
alphas_prev = self.model.alphas_cumprod_prev if use_original_steps else self.ddim_alphas_prev
|
| 225 |
+
sqrt_one_minus_alphas = self.model.sqrt_one_minus_alphas_cumprod if use_original_steps else self.ddim_sqrt_one_minus_alphas
|
| 226 |
+
sigmas = self.model.ddim_sigmas_for_original_num_steps if use_original_steps else self.ddim_sigmas
|
| 227 |
+
# select parameters corresponding to the currently considered timestep
|
| 228 |
+
a_t = torch.full((b, 1, 1, 1), alphas[index], device=device)
|
| 229 |
+
a_prev = torch.full((b, 1, 1, 1), alphas_prev[index], device=device)
|
| 230 |
+
sigma_t = torch.full((b, 1, 1, 1), sigmas[index], device=device)
|
| 231 |
+
sqrt_one_minus_at = torch.full((b, 1, 1, 1), sqrt_one_minus_alphas[index],device=device)
|
| 232 |
+
|
| 233 |
+
# current prediction for x_0
|
| 234 |
+
if self.model.parameterization != "v":
|
| 235 |
+
pred_x0 = (x - sqrt_one_minus_at * e_t) / a_t.sqrt()
|
| 236 |
+
else:
|
| 237 |
+
pred_x0 = self.model.predict_start_from_z_and_v(x, t, model_output)
|
| 238 |
+
|
| 239 |
+
if quantize_denoised:
|
| 240 |
+
pred_x0, _, *_ = self.model.first_stage_model.quantize(pred_x0)
|
| 241 |
+
|
| 242 |
+
if dynamic_threshold is not None:
|
| 243 |
+
raise NotImplementedError()
|
| 244 |
+
|
| 245 |
+
# direction pointing to x_t
|
| 246 |
+
dir_xt = (1. - a_prev - sigma_t**2).sqrt() * e_t
|
| 247 |
+
noise = sigma_t * noise_like(x.shape, device, repeat_noise) * temperature
|
| 248 |
+
if noise_dropout > 0.:
|
| 249 |
+
noise = torch.nn.functional.dropout(noise, p=noise_dropout)
|
| 250 |
+
x_prev = a_prev.sqrt() * pred_x0 + dir_xt + noise
|
| 251 |
+
return x_prev, pred_x0
|
| 252 |
+
|
| 253 |
+
@torch.no_grad()
|
| 254 |
+
def encode(self, x0, c, t_enc, use_original_steps=False, return_intermediates=None,
|
| 255 |
+
unconditional_guidance_scale=1.0, unconditional_conditioning=None, callback=None):
|
| 256 |
+
num_reference_steps = self.ddpm_num_timesteps if use_original_steps else self.ddim_timesteps.shape[0]
|
| 257 |
+
|
| 258 |
+
assert t_enc <= num_reference_steps
|
| 259 |
+
num_steps = t_enc
|
| 260 |
+
|
| 261 |
+
if use_original_steps:
|
| 262 |
+
alphas_next = self.alphas_cumprod[:num_steps]
|
| 263 |
+
alphas = self.alphas_cumprod_prev[:num_steps]
|
| 264 |
+
else:
|
| 265 |
+
alphas_next = self.ddim_alphas[:num_steps]
|
| 266 |
+
alphas = torch.tensor(self.ddim_alphas_prev[:num_steps])
|
| 267 |
+
|
| 268 |
+
x_next = x0
|
| 269 |
+
intermediates = []
|
| 270 |
+
inter_steps = []
|
| 271 |
+
for i in tqdm(range(num_steps), desc='Encoding Image'):
|
| 272 |
+
t = torch.full((x0.shape[0],), i, device=self.model.device, dtype=torch.long)
|
| 273 |
+
if unconditional_guidance_scale == 1.:
|
| 274 |
+
noise_pred = self.model.apply_model(x_next, t, c)
|
| 275 |
+
else:
|
| 276 |
+
assert unconditional_conditioning is not None
|
| 277 |
+
e_t_uncond, noise_pred = torch.chunk(
|
| 278 |
+
self.model.apply_model(torch.cat((x_next, x_next)), torch.cat((t, t)),
|
| 279 |
+
torch.cat((unconditional_conditioning, c))), 2)
|
| 280 |
+
noise_pred = e_t_uncond + unconditional_guidance_scale * (noise_pred - e_t_uncond)
|
| 281 |
+
|
| 282 |
+
            xt_weighted = (alphas_next[i] / alphas[i]).sqrt() * x_next
            weighted_noise_pred = alphas_next[i].sqrt() * (
                    (1 / alphas_next[i] - 1).sqrt() - (1 / alphas[i] - 1).sqrt()) * noise_pred
            x_next = xt_weighted + weighted_noise_pred
            if return_intermediates and i % (
                    num_steps // return_intermediates) == 0 and i < num_steps - 1:
                intermediates.append(x_next)
                inter_steps.append(i)
            elif return_intermediates and i >= num_steps - 2:
                intermediates.append(x_next)
                inter_steps.append(i)
            if callback: callback(i)

        out = {'x_encoded': x_next, 'intermediate_steps': inter_steps}
        if return_intermediates:
            out.update({'intermediates': intermediates})
        return x_next, out

    @torch.no_grad()
    def stochastic_encode(self, x0, t, use_original_steps=False, noise=None):
        # fast, but does not allow for exact reconstruction
        # t serves as an index to gather the correct alphas
        if use_original_steps:
            sqrt_alphas_cumprod = self.sqrt_alphas_cumprod
            sqrt_one_minus_alphas_cumprod = self.sqrt_one_minus_alphas_cumprod
        else:
            sqrt_alphas_cumprod = torch.sqrt(self.ddim_alphas)
            sqrt_one_minus_alphas_cumprod = self.ddim_sqrt_one_minus_alphas

        if noise is None:
            noise = torch.randn_like(x0)
        return (extract_into_tensor(sqrt_alphas_cumprod, t, x0.shape) * x0 +
                extract_into_tensor(sqrt_one_minus_alphas_cumprod, t, x0.shape) * noise)

    @torch.no_grad()
    def decode(self, x_latent, cond, t_start, unconditional_guidance_scale=1.0, unconditional_conditioning=None,
               use_original_steps=False, callback=None):

        timesteps = np.arange(self.ddpm_num_timesteps) if use_original_steps else self.ddim_timesteps
        timesteps = timesteps[:t_start]

        time_range = np.flip(timesteps)
        total_steps = timesteps.shape[0]
        print(f"Running DDIM Sampling with {total_steps} timesteps")

        iterator = tqdm(time_range, desc='Decoding image', total=total_steps)
        x_dec = x_latent
        for i, step in enumerate(iterator):
            index = total_steps - i - 1
            ts = torch.full((x_latent.shape[0],), step, device=x_latent.device, dtype=torch.long)
            x_dec, _ = self.p_sample_ddim(x_dec, cond, ts, index=index, use_original_steps=use_original_steps,
                                          unconditional_guidance_scale=unconditional_guidance_scale,
                                          unconditional_conditioning=unconditional_conditioning)
            if callback: callback(i)
        return x_dec
|
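The `stochastic_encode` method above noises a clean latent directly to timestep `t` with the closed-form forward process x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps. A minimal standalone NumPy sketch of that formula (the linear beta schedule here is a toy assumption, not the repo's actual schedule):

```python
import numpy as np

# Toy linear beta schedule; the real sampler derives its alphas from the model config.
T = 1000
betas = np.linspace(1e-4, 2e-2, T)
alphas_cumprod = np.cumprod(1.0 - betas)

def q_sample(x0, t, noise=None):
    """One-shot forward noising: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    if noise is None:
        noise = np.random.randn(*x0.shape)
    a_bar = alphas_cumprod[t]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise

x0 = np.ones((1, 4, 8, 8), dtype=np.float32)
# With zero noise, x_t is just x0 scaled by sqrt(a_bar_t) < 1.
x_t = q_sample(x0, t=500, noise=np.zeros_like(x0))
print(x_t.shape, float(x_t.max()) < 1.0)
```

Because the noising is done in a single step rather than by iterating the sampler, this path is fast but (as the comment in the source notes) does not allow exact reconstruction of `x0`.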
generation/control/ControlNet/ldm/util.py
ADDED
|
@@ -0,0 +1,197 @@
import importlib

import torch
from torch import optim
import numpy as np

from inspect import isfunction
from PIL import Image, ImageDraw, ImageFont


def log_txt_as_img(wh, xc, size=10):
    # wh a tuple of (width, height)
    # xc a list of captions to plot
    b = len(xc)
    txts = list()
    for bi in range(b):
        txt = Image.new("RGB", wh, color="white")
        draw = ImageDraw.Draw(txt)
        font = ImageFont.truetype('font/DejaVuSans.ttf', size=size)
        nc = int(40 * (wh[0] / 256))
        lines = "\n".join(xc[bi][start:start + nc] for start in range(0, len(xc[bi]), nc))

        try:
            draw.text((0, 0), lines, fill="black", font=font)
        except UnicodeEncodeError:
            print("Can't encode string for logging. Skipping.")

        txt = np.array(txt).transpose(2, 0, 1) / 127.5 - 1.0
        txts.append(txt)
    txts = np.stack(txts)
    txts = torch.tensor(txts)
    return txts


def ismap(x):
    if not isinstance(x, torch.Tensor):
        return False
    return (len(x.shape) == 4) and (x.shape[1] > 3)


def isimage(x):
    if not isinstance(x, torch.Tensor):
        return False
    return (len(x.shape) == 4) and (x.shape[1] == 3 or x.shape[1] == 1)


def exists(x):
    return x is not None


def default(val, d):
    if exists(val):
        return val
    return d() if isfunction(d) else d


def mean_flat(tensor):
    """
    https://github.com/openai/guided-diffusion/blob/27c20a8fab9cb472df5d6bdd6c8d11c8f430b924/guided_diffusion/nn.py#L86
    Take the mean over all non-batch dimensions.
    """
    return tensor.mean(dim=list(range(1, len(tensor.shape))))


def count_params(model, verbose=False):
    total_params = sum(p.numel() for p in model.parameters())
    if verbose:
        print(f"{model.__class__.__name__} has {total_params * 1.e-6:.2f} M params.")
    return total_params


def instantiate_from_config(config):
    if "target" not in config:
        if config == '__is_first_stage__':
            return None
        elif config == "__is_unconditional__":
            return None
        raise KeyError("Expected key `target` to instantiate.")
    return get_obj_from_str(config["target"])(**config.get("params", dict()))


def get_obj_from_str(string, reload=False):
    module, cls = string.rsplit(".", 1)
    if reload:
        module_imp = importlib.import_module(module)
        importlib.reload(module_imp)
    return getattr(importlib.import_module(module, package=None), cls)


class AdamWwithEMAandWings(optim.Optimizer):
    # credit to https://gist.github.com/crowsonkb/65f7265353f403714fce3b2595e0b298
    def __init__(self, params, lr=1.e-3, betas=(0.9, 0.999), eps=1.e-8,  # TODO: check hyperparameters before using
                 weight_decay=1.e-2, amsgrad=False, ema_decay=0.9999,  # ema decay to match previous code
                 ema_power=1., param_names=()):
        """AdamW that saves EMA versions of the parameters."""
        if not 0.0 <= lr:
            raise ValueError("Invalid learning rate: {}".format(lr))
        if not 0.0 <= eps:
            raise ValueError("Invalid epsilon value: {}".format(eps))
        if not 0.0 <= betas[0] < 1.0:
            raise ValueError("Invalid beta parameter at index 0: {}".format(betas[0]))
        if not 0.0 <= betas[1] < 1.0:
            raise ValueError("Invalid beta parameter at index 1: {}".format(betas[1]))
        if not 0.0 <= weight_decay:
            raise ValueError("Invalid weight_decay value: {}".format(weight_decay))
        if not 0.0 <= ema_decay <= 1.0:
            raise ValueError("Invalid ema_decay value: {}".format(ema_decay))
        defaults = dict(lr=lr, betas=betas, eps=eps,
                        weight_decay=weight_decay, amsgrad=amsgrad, ema_decay=ema_decay,
                        ema_power=ema_power, param_names=param_names)
        super().__init__(params, defaults)

    def __setstate__(self, state):
        super().__setstate__(state)
        for group in self.param_groups:
            group.setdefault('amsgrad', False)

    @torch.no_grad()
    def step(self, closure=None):
        """Performs a single optimization step.
        Args:
            closure (callable, optional): A closure that reevaluates the model
                and returns the loss.
        """
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        for group in self.param_groups:
            params_with_grad = []
            grads = []
            exp_avgs = []
            exp_avg_sqs = []
            ema_params_with_grad = []
            state_sums = []
            max_exp_avg_sqs = []
            state_steps = []
            amsgrad = group['amsgrad']
            beta1, beta2 = group['betas']
            ema_decay = group['ema_decay']
            ema_power = group['ema_power']

            for p in group['params']:
                if p.grad is None:
                    continue
                params_with_grad.append(p)
                if p.grad.is_sparse:
                    raise RuntimeError('AdamW does not support sparse gradients')
                grads.append(p.grad)

                state = self.state[p]

                # State initialization
                if len(state) == 0:
                    state['step'] = 0
                    # Exponential moving average of gradient values
                    state['exp_avg'] = torch.zeros_like(p, memory_format=torch.preserve_format)
                    # Exponential moving average of squared gradient values
                    state['exp_avg_sq'] = torch.zeros_like(p, memory_format=torch.preserve_format)
                    if amsgrad:
                        # Maintains max of all exp. moving avg. of sq. grad. values
                        state['max_exp_avg_sq'] = torch.zeros_like(p, memory_format=torch.preserve_format)
                    # Exponential moving average of parameter values
                    state['param_exp_avg'] = p.detach().float().clone()

                exp_avgs.append(state['exp_avg'])
                exp_avg_sqs.append(state['exp_avg_sq'])
                ema_params_with_grad.append(state['param_exp_avg'])

                if amsgrad:
                    max_exp_avg_sqs.append(state['max_exp_avg_sq'])

                # update the steps for each param group update
                state['step'] += 1
                # record the step after step update
                state_steps.append(state['step'])

            optim._functional.adamw(params_with_grad,
                                    grads,
                                    exp_avgs,
                                    exp_avg_sqs,
                                    max_exp_avg_sqs,
                                    state_steps,
                                    amsgrad=amsgrad,
                                    beta1=beta1,
                                    beta2=beta2,
                                    lr=group['lr'],
                                    weight_decay=group['weight_decay'],
                                    eps=group['eps'],
                                    maximize=False)

            cur_ema_decay = min(ema_decay, 1 - state['step'] ** -ema_power)
            for param, ema_param in zip(params_with_grad, ema_params_with_grad):
                ema_param.mul_(cur_ema_decay).add_(param.float(), alpha=1 - cur_ema_decay)

        return loss
|
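`AdamWwithEMAandWings` keeps a shadow copy of every parameter and, after each AdamW step, applies `ema <- d * ema + (1 - d) * param`, where the decay `d = min(ema_decay, 1 - step ** -ema_power)` ramps up from 0 so early steps are not dominated by the random initialization. A standalone NumPy sketch of just that update rule (constant parameter used for illustration):

```python
import numpy as np

def ema_decay_schedule(step, ema_decay=0.9999, ema_power=1.0):
    """Warmup: decay starts at 0 on step 1 and ramps toward ema_decay."""
    return min(ema_decay, 1.0 - step ** -ema_power)

def ema_update(ema_param, param, decay):
    """The shadow-weight update: ema <- decay * ema + (1 - decay) * param."""
    return decay * ema_param + (1.0 - decay) * param

ema = np.zeros(3)
p = np.ones(3)
for step in range(1, 6):
    d = ema_decay_schedule(step)
    ema = ema_update(ema, p, d)
# Step 1 has decay 0, so the EMA snaps to the current weights immediately;
# with a constant parameter it then stays there.
print(ema)
```

At sampling time, code that uses this optimizer can swap in `param_exp_avg` in place of the raw weights to get the smoothed model.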
generation/control/ControlNet/share.py
ADDED
|
@@ -0,0 +1,8 @@
import config
from cldm.hack import disable_verbosity, enable_sliced_attention


disable_verbosity()

if config.save_memory:
    enable_sliced_attention()
|
generation/control/ControlNet/tool_add_control.py
ADDED
|
@@ -0,0 +1,50 @@
import sys
import os

assert len(sys.argv) == 3, 'Args are wrong.'

input_path = sys.argv[1]
output_path = sys.argv[2]

assert os.path.exists(input_path), 'Input model does not exist.'
assert not os.path.exists(output_path), 'Output filename already exists.'
assert os.path.exists(os.path.dirname(output_path)), 'Output path is not valid.'

import torch
from share import *
from cldm.model import create_model


def get_node_name(name, parent_name):
    if len(name) <= len(parent_name):
        return False, ''
    p = name[:len(parent_name)]
    if p != parent_name:
        return False, ''
    return True, name[len(parent_name):]


model = create_model(config_path='./models/cldm_v15.yaml')

pretrained_weights = torch.load(input_path)
if 'state_dict' in pretrained_weights:
    pretrained_weights = pretrained_weights['state_dict']

scratch_dict = model.state_dict()

target_dict = {}
for k in scratch_dict.keys():
    is_control, name = get_node_name(k, 'control_')
    if is_control:
        copy_k = 'model.diffusion_' + name
    else:
        copy_k = k
    if copy_k in pretrained_weights:
        target_dict[k] = pretrained_weights[copy_k].clone()
    else:
        target_dict[k] = scratch_dict[k].clone()
        print(f'These weights are newly added: {k}')

model.load_state_dict(target_dict, strict=True)
torch.save(model.state_dict(), output_path)
print('Done.')
|
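The key-mapping step in the script above can be checked in isolation: a ControlNet copy key beginning with `control_` is seeded from the corresponding Stable Diffusion UNet key, because stripping `control_` and prepending `model.diffusion_` turns `control_model.*` into `model.diffusion_model.*`. The snippet below reuses `get_node_name` with one illustrative key name:

```python
def get_node_name(name, parent_name):
    """Return (True, suffix) if name starts with parent_name, else (False, '')."""
    if len(name) <= len(parent_name):
        return False, ''
    if name[:len(parent_name)] != parent_name:
        return False, ''
    return True, name[len(parent_name):]

# 'control_' + 'model....' maps onto 'model.diffusion_' + 'model....',
# i.e. the trainable copy is initialized from the locked UNet weights.
k = 'control_model.input_blocks.0.0.weight'
is_control, name = get_node_name(k, 'control_')
copy_k = 'model.diffusion_' + name if is_control else k
print(copy_k)  # model.diffusion_model.input_blocks.0.0.weight
```

Keys that have no counterpart in the pretrained checkpoint (the zero-conv layers, for instance) fall through to the freshly initialized `scratch_dict` values, which is what the `These weights are newly added` message reports.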
generation/control/ControlNet/tool_add_control_sd21.py
ADDED
|
@@ -0,0 +1,50 @@
import sys
import os

assert len(sys.argv) == 3, 'Args are wrong.'

input_path = sys.argv[1]
output_path = sys.argv[2]

assert os.path.exists(input_path), 'Input model does not exist.'
assert not os.path.exists(output_path), 'Output filename already exists.'
assert os.path.exists(os.path.dirname(output_path)), 'Output path is not valid.'

import torch
from share import *
from cldm.model import create_model


def get_node_name(name, parent_name):
    if len(name) <= len(parent_name):
        return False, ''
    p = name[:len(parent_name)]
    if p != parent_name:
        return False, ''
    return True, name[len(parent_name):]


model = create_model(config_path='./models/cldm_v21.yaml')

pretrained_weights = torch.load(input_path)
if 'state_dict' in pretrained_weights:
    pretrained_weights = pretrained_weights['state_dict']

scratch_dict = model.state_dict()

target_dict = {}
for k in scratch_dict.keys():
    is_control, name = get_node_name(k, 'control_')
    if is_control:
        copy_k = 'model.diffusion_' + name
    else:
        copy_k = k
    if copy_k in pretrained_weights:
        target_dict[k] = pretrained_weights[copy_k].clone()
    else:
        target_dict[k] = scratch_dict[k].clone()
        print(f'These weights are newly added: {k}')

model.load_state_dict(target_dict, strict=True)
torch.save(model.state_dict(), output_path)
print('Done.')
|
generation/control/ControlNet/tool_transfer_control.py
ADDED
|
@@ -0,0 +1,59 @@
path_sd15 = './models/v1-5-pruned.ckpt'
path_sd15_with_control = './models/control_sd15_openpose.pth'
path_input = './models/anything-v3-full.safetensors'
path_output = './models/control_any3_openpose.pth'


import os


assert os.path.exists(path_sd15), 'Input path_sd15 does not exist!'
assert os.path.exists(path_sd15_with_control), 'Input path_sd15_with_control does not exist!'
assert os.path.exists(path_input), 'Input path_input does not exist!'
assert os.path.exists(os.path.dirname(path_output)), 'Output folder does not exist!'


import torch
from share import *
from cldm.model import load_state_dict


sd15_state_dict = load_state_dict(path_sd15)
sd15_with_control_state_dict = load_state_dict(path_sd15_with_control)
input_state_dict = load_state_dict(path_input)


def get_node_name(name, parent_name):
    if len(name) <= len(parent_name):
        return False, ''
    p = name[:len(parent_name)]
    if p != parent_name:
        return False, ''
    return True, name[len(parent_name):]


keys = sd15_with_control_state_dict.keys()

final_state_dict = {}
for key in keys:
    is_first_stage, _ = get_node_name(key, 'first_stage_model')
    is_cond_stage, _ = get_node_name(key, 'cond_stage_model')
    if is_first_stage or is_cond_stage:
        final_state_dict[key] = input_state_dict[key]
        continue
    p = sd15_with_control_state_dict[key]
    is_control, node_name = get_node_name(key, 'control_')
    if is_control:
        sd15_key_name = 'model.diffusion_' + node_name
    else:
        sd15_key_name = key
    if sd15_key_name in input_state_dict:
        p_new = p + input_state_dict[sd15_key_name] - sd15_state_dict[sd15_key_name]
        # print(f'Offset clone from [{sd15_key_name}] to [{key}]')
    else:
        p_new = p
        # print(f'Direct clone to [{key}]')
    final_state_dict[key] = p_new

torch.save(final_state_dict, path_output)
print('Transferred model saved at ' + path_output)
|
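The core of the transfer script above is the line `p_new = p + input_state_dict[k] - sd15_state_dict[k]`: a ControlNet trained on SD 1.5 is moved to a different base model (here Anything v3) by adding the weight offset between the new base and vanilla SD 1.5 onto each shared weight. A toy NumPy illustration of that rule (the weight values below are made up):

```python
import numpy as np

# The transfer rule: transferred = control_sd15 + (new_base - sd15),
# i.e. carry the ControlNet weights over by the base-model weight offset.
sd15_w = np.array([0.20, -0.10])        # weight in vanilla SD 1.5
new_base_w = np.array([0.25, -0.05])    # same weight in the finetuned base model
control_w = np.array([0.21, -0.12])     # weight in the ControlNet trained on SD 1.5

transferred = control_w + (new_base_w - sd15_w)
print(transferred)
```

If the two base models were identical, the offset would be zero and the ControlNet weights would pass through unchanged; the first-stage VAE and text-encoder weights are instead copied directly from the new base model, as the loop above shows.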
generation/control/ControlNet/tutorial_dataset.py
ADDED
|
@@ -0,0 +1,39 @@
import json
import cv2
import numpy as np

from torch.utils.data import Dataset


class MyDataset(Dataset):
    def __init__(self):
        self.data = []
        with open('./training/fill50k/prompt.json', 'rt') as f:
            for line in f:
                self.data.append(json.loads(line))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]

        source_filename = item['source']
        target_filename = item['target']
        prompt = item['prompt']

        source = cv2.imread('./training/fill50k/' + source_filename)
        target = cv2.imread('./training/fill50k/' + target_filename)

        # Do not forget that OpenCV reads images in BGR order.
        source = cv2.cvtColor(source, cv2.COLOR_BGR2RGB)
        target = cv2.cvtColor(target, cv2.COLOR_BGR2RGB)

        # Normalize source images to [0, 1].
        source = source.astype(np.float32) / 255.0

        # Normalize target images to [-1, 1].
        target = (target.astype(np.float32) / 127.5) - 1.0

        return dict(jpg=target, txt=prompt, hint=source)
|
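The two normalizations in `__getitem__` put the conditioning image and the training target into different ranges: the hint stays in [0, 1], while the target is mapped to [-1, 1] to match the range the diffusion model is trained on. A quick check of both mappings on the extreme pixel values:

```python
import numpy as np

# uint8 pixel values at the extremes and the midpoint
img = np.array([0, 127, 255], dtype=np.uint8)

# Conditioning ("hint") images go to [0, 1]:
source = img.astype(np.float32) / 255.0

# Target images go to [-1, 1]:
target = (img.astype(np.float32) / 127.5) - 1.0

print(source)
print(target)
```

Note that 127 maps to roughly -0.004 rather than exactly 0, because the uint8 midpoint 127.5 is not representable; this off-by-half is harmless in practice.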
generation/control/ControlNet/tutorial_dataset_test.py
ADDED
|
@@ -0,0 +1,12 @@
from tutorial_dataset import MyDataset

dataset = MyDataset()
print(len(dataset))

item = dataset[1234]
jpg = item['jpg']
txt = item['txt']
hint = item['hint']
print(txt)
print(jpg.shape)
print(hint.shape)
|
generation/control/ControlNet/tutorial_train.py
ADDED
|
@@ -0,0 +1,35 @@
from share import *

import pytorch_lightning as pl
from torch.utils.data import DataLoader
from tutorial_dataset import MyDataset
from cldm.logger import ImageLogger
from cldm.model import create_model, load_state_dict


# Configs
resume_path = './models/control_sd15_ini.ckpt'
batch_size = 4
logger_freq = 300
learning_rate = 1e-5
sd_locked = True
only_mid_control = False


# First use cpu to load models. Pytorch Lightning will automatically move it to GPUs.
model = create_model('./models/cldm_v15.yaml').cpu()
model.load_state_dict(load_state_dict(resume_path, location='cpu'))
model.learning_rate = learning_rate
model.sd_locked = sd_locked
model.only_mid_control = only_mid_control


# Misc
dataset = MyDataset()
dataloader = DataLoader(dataset, num_workers=0, batch_size=batch_size, shuffle=True)
logger = ImageLogger(batch_frequency=logger_freq)
trainer = pl.Trainer(gpus=1, precision=32, callbacks=[logger])


# Train!
trainer.fit(model, dataloader)
|
generation/control/ControlNet/tutorial_train_sd21.py
ADDED
|
@@ -0,0 +1,35 @@
from share import *

import pytorch_lightning as pl
from torch.utils.data import DataLoader
from tutorial_dataset import MyDataset
from cldm.logger import ImageLogger
from cldm.model import create_model, load_state_dict


# Configs
resume_path = './models/control_sd21_ini.ckpt'
batch_size = 4
logger_freq = 300
learning_rate = 1e-5
sd_locked = True
only_mid_control = False


# First use cpu to load models. Pytorch Lightning will automatically move it to GPUs.
model = create_model('./models/cldm_v21.yaml').cpu()
model.load_state_dict(load_state_dict(resume_path, location='cpu'))
model.learning_rate = learning_rate
model.sd_locked = sd_locked
model.only_mid_control = only_mid_control


# Misc
dataset = MyDataset()
dataloader = DataLoader(dataset, num_workers=0, batch_size=batch_size, shuffle=True)
logger = ImageLogger(batch_frequency=logger_freq)
trainer = pl.Trainer(gpus=1, precision=32, callbacks=[logger])


# Train!
trainer.fit(model, dataloader)
|
generation/control/download_ade20k.sh
ADDED
|
@@ -0,0 +1,10 @@
#!/bin/bash

if [ ! -d "data" ]; then
  mkdir data
fi
cd data

wget -O ade20k.zip https://keeper.mpdl.mpg.de/f/80b2fc97ffc3430c98de/?dl=1
unzip ade20k.zip
rm ade20k.zip
|
generation/control/download_celebhq.sh
ADDED
|
@@ -0,0 +1,10 @@
#!/bin/bash

if [ ! -d "data" ]; then
  mkdir data
fi
cd data

wget -O celechq-text.zip https://keeper.mpdl.mpg.de/f/72c34a6017cb40b896e9/?dl=1
unzip celechq-text.zip
rm celechq-text.zip
|
generation/control/eval_canny.py
ADDED
|
@@ -0,0 +1,130 @@
from PIL import Image
from tqdm import tqdm
import sys
import os
import json
import cv2
import torch
from torch.autograd import Variable
import torchvision
from torch.utils.data import DataLoader, Dataset
from torchvision.datasets import CocoCaptions
from torchvision.transforms import ToTensor
import torchvision.transforms as transforms
from torchvision.utils import save_image
import numpy as np

# from net_canny import Net
from ControlNet.annotator.canny import CannyDetector
from ControlNet.annotator.util import resize_image, HWC3

# from pytorch_msssim import ssim, ms_ssim, SSIM


class ResultFolderDataset(Dataset):
    def __init__(self, data_dir, results_dir, n, transform=None):
        self.data_dir = data_dir
        self.results_dir = results_dir
        self.n = n
        self.image_paths = sorted([f for f in os.listdir(data_dir) if f.lower().endswith('.png')])
        # self.image_paths = sorted([os.path.join(folder, f) for f in os.listdir(folder) if f.lower().endswith('_{}.png'.format(n))])
        self.transform = transform

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        image_name = self.image_paths[idx]
        source_path = os.path.join(self.data_dir, image_name)

        base_name = image_name.split('_')[1].split('.')[0]  # Extract 'x' from 'image_x.png'
        image_name2 = f'result_{base_name}_{self.n}.png'
        result_path = os.path.join(self.results_dir, image_name2)

        source_image = Image.open(source_path)  # .convert('RGB')
        result_image = Image.open(result_path)  # .convert('RGB')
        if self.transform:
            source_image = self.transform(source_image)
            result_image = self.transform(result_image)
        return source_image, result_image


def calculate_metrics(pred, target):
    intersection = np.logical_and(pred, target)
    union = np.logical_or(pred, target)
    iou_score = np.sum(intersection) / np.sum(union)

    accuracy = np.sum(pred == target) / target.size

    tp = np.sum(intersection)  # True positive
    fp = np.sum(pred) - tp  # False positive
    fn = np.sum(target) - tp  # False negative

    f1_score = (2 * tp) / (2 * tp + fp + fn)

    return iou_score, accuracy, f1_score


if __name__ == '__main__':
    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

    low_threshold = 100
    high_threshold = 200

    n = 0
    # epoch = 10
    # experiment = './log/image_log_oft_COCO_canny_eps_1-3_r_4_cayley_4gpu'
    experiment = 'log/image_log_householder_gramschmidt_COCO_canny_eps_7e-06_pe_diff_mlp_l_8_8gpu_2024-05-19-21-22-24-466032'

    if 'train_with_norm' in experiment:
        epoch = 4
    else:
        if 'COCO' in experiment:
            epoch = 10
        else:
            epoch = 19

    data_dir = os.path.join(experiment, 'source', str(epoch))
    result_dir = os.path.join(experiment, 'results', str(epoch))
    json_file = os.path.join(experiment, 'results.json')

    # Define the transforms to apply to the images
    transform = transforms.Compose([
        # transforms.Resize((512, 512)),
        # transforms.CenterCrop(512),
        transforms.ToTensor()
    ])

    dataset = ResultFolderDataset(data_dir, result_dir, n=n, transform=transform)
    data_loader = DataLoader(dataset, batch_size=1, shuffle=False)

    apply_canny = CannyDetector()

    loss = 0
    iou_score_mean = 0
    accuracy_mean = 0
    f1_score_mean = 0
    ssim_mean = 0
    for i, data in tqdm(enumerate(data_loader), total=len(data_loader)):
        source_image, result_image = data
        # Convert the tensor to a numpy array and transpose it to have the channels last (H, W, C)
        source_image_np = source_image.squeeze(0).permute(1, 2, 0).numpy()
        result_image_np = result_image.squeeze(0).permute(1, 2, 0).numpy()

        # Convert the image to 8-bit unsigned integers (0-255)
        source_image_np = (source_image_np * 255).astype(np.uint8)
        result_image_np = (result_image_np * 255).astype(np.uint8)

        source_detected_map = apply_canny(source_image_np, low_threshold, high_threshold) / 255
        result_detected_map = apply_canny(result_image_np, low_threshold, high_threshold) / 255

        iou_score, accuracy, f1_score = calculate_metrics(result_detected_map, source_detected_map)

        iou_score_mean = iou_score_mean + iou_score
        accuracy_mean = accuracy_mean + accuracy
        f1_score_mean = f1_score_mean + f1_score

    iou_score_mean = iou_score_mean / len(dataset)
    accuracy_mean = accuracy_mean / len(dataset)
    f1_score_mean = f1_score_mean / len(dataset)

    print(experiment)
    print('[Canny]', '[IOU]:', iou_score_mean, '[F1 Score]:', f1_score_mean)
|
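The `calculate_metrics` helper above treats the two Canny maps as binary masks. Reproducing it on a tiny 2x2 example makes all three scores easy to verify by hand:

```python
import numpy as np

def calculate_metrics(pred, target):
    """IoU, pixel accuracy and F1 for binary edge maps (values in {0, 1})."""
    intersection = np.logical_and(pred, target)
    union = np.logical_or(pred, target)
    iou_score = np.sum(intersection) / np.sum(union)
    accuracy = np.sum(pred == target) / target.size
    tp = np.sum(intersection)   # true positives
    fp = np.sum(pred) - tp      # false positives
    fn = np.sum(target) - tp    # false negatives
    f1_score = (2 * tp) / (2 * tp + fp + fn)
    return iou_score, accuracy, f1_score

pred = np.array([[1, 1], [0, 0]])
target = np.array([[1, 0], [0, 0]])
iou, acc, f1 = calculate_metrics(pred, target)
print(iou, acc, f1)  # 0.5 0.75 0.6666666666666666
```

For binary masks F1 and IoU are monotonically related (F1 = 2·IoU / (1 + IoU)), so the two scores printed by the evaluation always rank experiments the same way; accuracy is the less informative of the three when edges are sparse.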
generation/control/eval_landmark.py
ADDED
|
@@ -0,0 +1,127 @@
import os
from glob import glob
from tqdm import tqdm
import numpy as np
from pathlib import Path
from skimage.io import imread, imsave
import cv2
import json

end_list = np.array([17, 22, 27, 42, 48, 31, 36, 68], dtype=np.int32) - 1

def plot_kpts(image, kpts, color='g'):
    ''' Draw 68 key points
    Args:
        image: the input image
        kpts: (68, 3).
    '''
    if color == 'r':
        c = (255, 0, 0)
    elif color == 'g':
        c = (0, 255, 0)
    elif color == 'b':
        c = (0, 0, 255)
    image = image.copy()
    kpts = kpts.copy()
    radius = max(int(min(image.shape[0], image.shape[1]) / 200), 1)
    for i in range(kpts.shape[0]):
        st = kpts[i, :2]
        if kpts.shape[1] == 4:
            # color each point by its visibility score
            if kpts[i, 3] > 0.5:
                c = (0, 255, 0)
            else:
                c = (0, 0, 255)
        image = cv2.circle(image, (int(st[0]), int(st[1])), radius, c, radius * 2)
        if i in end_list:
            continue
        ed = kpts[i + 1, :2]
        image = cv2.line(image, (int(st[0]), int(st[1])), (int(ed[0]), int(ed[1])), (255, 255, 255), radius)
    return image

def plot_points(image, kpts, color='w'):
    ''' Draw key points
    Args:
        image: the input image
        kpts: (n, 3).
    '''
    if color == 'r':
        c = (255, 0, 0)
    elif color == 'g':
        c = (0, 255, 0)
    elif color == 'b':
        c = (0, 0, 255)
    elif color == 'y':
        c = (0, 255, 255)
    elif color == 'w':
        c = (255, 255, 255)
    image = image.copy()
    kpts = kpts.copy()
    kpts = kpts.astype(np.int32)
    radius = max(int(min(image.shape[0], image.shape[1]) / 200), 1)
    for i in range(kpts.shape[0]):
        st = kpts[i, :2]
        image = cv2.circle(image, (int(st[0]), int(st[1])), radius, c, radius * 2)
    return image

def generate_landmark2d(inputpath, savepath, n=0, device='cuda:0', vis=False):
    print('generate 2d landmarks')
    os.makedirs(savepath, exist_ok=True)
    import face_alignment
    detect_model = face_alignment.FaceAlignment(face_alignment.LandmarksType._2D, device=device, flip_input=False)

    imagepath_list = glob(os.path.join(inputpath, '*_{}.png'.format(n)))
    imagepath_list = sorted(imagepath_list)
    for imagepath in tqdm(imagepath_list):
        name = Path(imagepath).stem

        image = imread(imagepath)[:, :, :3]
        out = detect_model.get_landmarks(image)
        if out is None:
            continue
        kpt = out[0].squeeze()
        np.savetxt(os.path.join(savepath, f'{name}.txt'), kpt)
        if vis:
            image = cv2.imread(imagepath)
            image_point = plot_kpts(image, kpt)
            cv2.imwrite(os.path.join(savepath, f'{name}_overlay.jpg'), image_point)

def landmark_comparison(lmk_folder, gt_lmk_folder, n=0):
    print('calculate reprojection error')
    lmk_err = []
    gt_lmk_folder = './data/celebhq-text/celeba-hq-landmark2d'
    with open('./data/celebhq-text/prompt_val_blip_full.json', 'rt') as f:
        for line in f:
            val_data = json.loads(line)
    for i in tqdm(range(len(val_data))):
        line = val_data[i]

        img_name = line["image"][:-4]
        lmk1_path = os.path.join(gt_lmk_folder, f'{img_name}.txt')

        lmk1 = np.loadtxt(lmk1_path) / 2
        lmk2_path = os.path.join(lmk_folder, f'result_{i}_{n}.txt')
        if not os.path.exists(lmk2_path):
            print(f'{lmk2_path} not exist')
            continue
        lmk2 = np.loadtxt(lmk2_path)
        lmk_err.append(np.mean(np.linalg.norm(lmk1 - lmk2, axis=1)))
    print(np.mean(lmk_err))
    np.save(os.path.join(lmk_folder, 'lmk_err.npy'), lmk_err)

n = 0
epoch = 19
gt_lmk_folder = './data/celebhq-text/celeba-hq-landmark2d'
# input_folder = os.path.join('./data/image_log_opt_lora_CelebA_landmark_lr_5-6_pe_diff_mlp_r_4_cayley_4gpu/results', str(epoch))
input_folder = os.path.join('log/image_log_householder_none_ADE20K_segm_eps_7e-06_pe_diff_mlp_l_8_8gpu_2024-05-15-19-33-41-650524/results', str(epoch))
# input_folder = os.path.join('log/image_log_oft_CelebA_landmark_eps_0.001_pe_diff_mlp_r_4_8gpu_2024-03-21-19-07-34-175825/train_with_norm/results', str(epoch))
save_folder = os.path.join(input_folder, 'landmark')

generate_landmark2d(input_folder, save_folder, n, device='cuda:0', vis=False)
landmark_comparison(save_folder, gt_lmk_folder, n)
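The reprojection error in `landmark_comparison` is the mean L2 distance between matched landmark sets. A minimal sketch with hypothetical landmarks (every predicted point shifted by (3, 4) pixels, so each per-point distance is exactly 5):

```python
import numpy as np

# Hypothetical ground-truth landmarks and a prediction offset by (3, 4)
lmk_gt = np.zeros((68, 2))
lmk_pred = lmk_gt + np.array([3.0, 4.0])

# Per-point L2 distance, then mean over the 68 points (same metric as landmark_comparison)
err = np.mean(np.linalg.norm(lmk_gt - lmk_pred, axis=1))
print(err)  # 5.0
```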
generation/control/generation.py ADDED
@@ -0,0 +1,238 @@
from oldm.hack import disable_verbosity
disable_verbosity()

import sys
import os
import cv2
import einops
import gradio as gr
import numpy as np
import torch
import random
import json
import argparse

file_path = os.path.abspath(__file__)
parent_dir = os.path.abspath(os.path.dirname(file_path) + '/..')
if parent_dir not in sys.path:
    sys.path.append(parent_dir)

from PIL import Image
from pytorch_lightning import seed_everything
from oldm.model import create_model, load_state_dict
from oldm.ddim_hacked import DDIMSampler
from oft import inject_trainable_oft, inject_trainable_oft_conv, inject_trainable_oft_extended, inject_trainable_oft_with_norm
from hra import inject_trainable_hra
from lora import inject_trainable_lora

from dataset.utils import return_dataset


def process(input_image, prompt, hint_image, a_prompt, n_prompt, num_samples, image_resolution, ddim_steps, guess_mode, strength, scale, seed, eta, low_threshold, high_threshold):
    with torch.no_grad():
        H, W, C = input_image.shape

        control = torch.from_numpy(hint_image.copy()).float().cuda()
        control = torch.stack([control for _ in range(num_samples)], dim=0)
        control = einops.rearrange(control, 'b h w c -> b c h w').clone()

        if seed == -1:
            seed = random.randint(0, 65535)
        seed_everything(seed)

        cond = {"c_concat": [control], "c_crossattn": [model.get_learned_conditioning([prompt + ', ' + a_prompt] * num_samples)]}
        un_cond = {"c_concat": None if guess_mode else [control], "c_crossattn": [model.get_learned_conditioning([n_prompt] * num_samples)]}
        shape = (4, H // 8, W // 8)

        model.control_scales = [strength * (0.825 ** float(12 - i)) for i in range(13)] if guess_mode else ([strength] * 13)  # Magic number. IDK why. Perhaps because 0.825**12<0.01 but 0.826**12>0.01
        samples, intermediates = ddim_sampler.sample(ddim_steps, num_samples,
                                                     shape, cond, verbose=False, eta=eta,
                                                     unconditional_guidance_scale=scale,
                                                     unconditional_conditioning=un_cond)

        x_samples = model.decode_first_stage(samples)
        x_samples = (einops.rearrange(x_samples, 'b c h w -> b h w c') * 127.5 + 127.5).cpu().numpy().clip(0, 255).astype(np.uint8)

        results = [x_samples[i] for i in range(num_samples)]
    return [input_image] + [hint_image] + results


parser = argparse.ArgumentParser()

parser.add_argument('--d', type=int, help='the index of the GPU', default=0)

# HRA
parser.add_argument('--hra_r', type=int, default=8)
parser.add_argument('--hra_apply_GS', action="store_true", default=False)

# OFT
parser.add_argument('--oft_r', type=int, default=4)
parser.add_argument('--oft_eps', type=float, choices=[1e-3, 2e-5, 7e-6], default=7e-6)
parser.add_argument('--oft_coft', action="store_true", default=True)
parser.add_argument('--oft_block_share', action="store_true", default=False)
parser.add_argument('--img_ID', type=int, default=1)
parser.add_argument('--num_samples', type=int, default=1)
parser.add_argument('--batch', type=int, default=20)
parser.add_argument('--sd_locked', action="store_true", default=True)
parser.add_argument('--only_mid_control', action="store_true", default=False)
parser.add_argument('--num_gpus', type=int, default=8)
parser.add_argument('--time_str', type=str, default='2024-03-18-10-55-21-089985')
parser.add_argument('--epoch', type=int, default=19)
parser.add_argument('--control', type=str,
                    help='control signal. Options are [segm, sketch, densepose, depth, canny, landmark]',
                    default="segm")

args = parser.parse_args()

if __name__ == '__main__':
    # Configs
    epoch = args.epoch
    control = args.control
    _, dataset, data_name, logger_freq, max_epochs = return_dataset(control, full=True)

    num_gpus = args.num_gpus
    time_str = args.time_str
    # specify the experiment directory to evaluate
    experiment = './log/image_log_hra_0.0_ADE20K_segm_pe_diff_mlp_r_8_8gpu_2024-06-27-19-57-34-979197'
    # experiment = './log/image_log_oft_ADE20K_segm_eps_0.001_pe_diff_mlp_r_4_8gpu_2024-03-25-21-04-17-549433/train_with_norm'

    assert args.control in experiment

    if 'train_with_norm' in experiment:
        epoch = 4
    else:
        if 'COCO' in experiment:
            epoch = 9
        else:
            epoch = 19

    resume_path = os.path.join(experiment, f'model-epoch={epoch:02d}.ckpt')
    sd_locked = args.sd_locked
    only_mid_control = args.only_mid_control
    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

    # Result directories
    result_dir = os.path.join(experiment, 'results', str(epoch))
    os.makedirs(result_dir, exist_ok=True)
    source_dir = os.path.join(experiment, 'source', str(epoch))
    os.makedirs(source_dir, exist_ok=True)
    hint_dir = os.path.join(experiment, 'hints', str(epoch))
    os.makedirs(hint_dir, exist_ok=True)

    model = create_model('./configs/oft_ldm_v15.yaml').cpu()
    model.model.requires_grad_(False)

    if 'hra' in experiment:
        unet_lora_params, train_names = inject_trainable_hra(model.model, r=args.hra_r, apply_GS=args.hra_apply_GS)
    elif 'lora' in experiment:
        # NOTE: the parser above does not define --r; add a rank argument before using the LoRA branch
        unet_lora_params, train_names = inject_trainable_lora(model.model, rank=args.r, network_alpha=None)
    else:
        if 'train_with_norm' in experiment:
            unet_opt_params, train_names = inject_trainable_oft_with_norm(model.model, r=args.oft_r, eps=args.oft_eps, is_coft=args.oft_coft, block_share=args.oft_block_share)
        else:
            unet_lora_params, train_names = inject_trainable_oft(model.model, r=args.oft_r, eps=args.oft_eps, is_coft=args.oft_coft, block_share=args.oft_block_share)

    model.load_state_dict(load_state_dict(resume_path, location='cuda'))
    model = model.cuda()
    ddim_sampler = DDIMSampler(model)

    # split the dataset across GPUs; GPU args.d handles [start_idx, end_idx)
    num_pack = len(dataset) // args.num_gpus
    start_idx = args.d * num_pack
    end_idx = (args.d + 1) * num_pack if args.d < args.num_gpus - 1 else len(dataset)

    for idx in range(start_idx, end_idx):
        data = dataset[idx]
        input_image, prompt, hint = data['jpg'], data['txt'], data['hint']

        if not os.path.exists(os.path.join(result_dir, f'result_{idx}_0.png')):
            result_images = process(
                input_image=input_image,
                prompt=prompt,
                hint_image=hint,
                a_prompt="",
                n_prompt="",
                num_samples=args.num_samples,
                image_resolution=512,
                ddim_steps=50,
                guess_mode=False,
                strength=1,
                scale=9.0,
                seed=-1,
                eta=0.0,
                low_threshold=100,
                high_threshold=200,
            )
            for i, image in enumerate(result_images):
                if i == 0:
                    # source image: [-1, 1] -> [0, 255]
                    image = ((image + 1) * 127.5).clip(0, 255).astype(np.uint8)
                    pil_image = Image.fromarray(image)
                    output_path = os.path.join(source_dir, f'image_{idx}.png')
                    pil_image.save(output_path)
                elif i == 1:
                    # hint image: [0, 1] -> [0, 255]
                    image = (image * 255).clip(0, 255).astype(np.uint8)
                    pil_image = Image.fromarray(image)
                    output_path = os.path.join(hint_dir, f'hint_{idx}.png')
                    pil_image.save(output_path)
                else:
                    # generated samples
                    n = i - 2
                    pil_image = Image.fromarray(image)
                    output_path = os.path.join(result_dir, f'result_{idx}_{n}.png')
                    pil_image.save(output_path)
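The "magic number" line in `process()` builds a geometric decay over the 13 ControlNet output scales in guess mode. A quick standalone check of that schedule (note the inline comment's `0.825**12 < 0.01` figure appears off; the actual value is about 0.099):

```python
strength = 1.0
scales = [strength * (0.825 ** float(12 - i)) for i in range(13)]
print(len(scales))          # 13, one scale per control output
print(round(scales[0], 4))  # 0.0994: the deepest control output is damped the most
print(scales[12])           # 1.0: the shallowest output keeps full strength
```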
generation/control/hra.py ADDED
@@ -0,0 +1,254 @@
"""
This script utilizes code from lora available at:
https://github.com/cloneofsimo/lora/blob/master/lora_diffusion/lora.py

Original Author: Simo Ryu
License: Apache License 2.0
"""


import json
import math
from itertools import groupby
from typing import Callable, Dict, List, Optional, Set, Tuple, Type, Union

import pickle

import numpy as np
import PIL
import torch
import torch.nn as nn
import torch.nn.functional as F

import matplotlib.pyplot as plt

try:
    from safetensors.torch import safe_open
    from safetensors.torch import save_file as safe_save

    safetensors_available = True
except ImportError:
    from .safe_open import safe_open

    def safe_save(
        tensors: Dict[str, torch.Tensor],
        filename: str,
        metadata: Optional[Dict[str, str]] = None,
    ) -> None:
        raise EnvironmentError(
            "Saving safetensors requires the safetensors library. Please install with pip or similar."
        )

    safetensors_available = False


def project(R, eps):
    # project R onto an eps-ball around the zero matrix
    I = torch.zeros((R.size(0), R.size(0)), dtype=R.dtype, device=R.device)
    diff = R - I
    norm_diff = torch.norm(diff)
    if norm_diff <= eps:
        return R
    else:
        return I + eps * (diff / norm_diff)


def project_batch(R, eps=1e-5):
    # scaling factor for each of the smaller block matrices
    eps = eps * 1 / torch.sqrt(torch.tensor(R.shape[0]))
    I = torch.zeros((R.size(1), R.size(1)), device=R.device, dtype=R.dtype).unsqueeze(0).expand_as(R)
    diff = R - I
    norm_diff = torch.norm(R - I, dim=(1, 2), keepdim=True)
    mask = (norm_diff <= eps).bool()
    out = torch.where(mask, R, I + eps * (diff / norm_diff))
    return out


class HRAInjectedLinear(nn.Module):
    def __init__(self, in_features, out_features, bias=False, r=8, apply_GS=False):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features

        self.r = r
        self.apply_GS = apply_GS

        # initialize r/2 directions and duplicate each, so paired reflections start as identity
        half_u = torch.zeros(in_features, r // 2)
        nn.init.kaiming_uniform_(half_u, a=math.sqrt(5))
        self.hra_u = nn.Parameter(torch.repeat_interleave(half_u, 2, dim=1), requires_grad=True)

        self.fixed_linear = torch.nn.Linear(in_features=in_features, out_features=out_features, bias=bias)

    def forward(self, x):
        orig_weight = self.fixed_linear.weight.data
        if self.apply_GS:
            # Gram-Schmidt: orthonormalize the r Householder directions first
            weight = [(self.hra_u[:, 0] / self.hra_u[:, 0].norm()).view(-1, 1)]
            for i in range(1, self.r):
                ui = self.hra_u[:, i].view(-1, 1)
                for j in range(i):
                    ui = ui - (weight[j].t() @ ui) * weight[j]
                weight.append((ui / ui.norm()).view(-1, 1))
            weight = torch.cat(weight, dim=1)
            new_weight = orig_weight @ (torch.eye(self.in_features, device=x.device) - 2 * weight @ weight.t())
        else:
            # apply a chain of r Householder reflections to the frozen weight
            new_weight = orig_weight
            hra_u_norm = self.hra_u / self.hra_u.norm(dim=0)
            for i in range(self.r):
                ui = hra_u_norm[:, i].view(-1, 1)
                new_weight = torch.mm(new_weight, torch.eye(self.in_features, device=x.device) - 2 * ui @ ui.t())

        out = nn.functional.linear(input=x, weight=new_weight, bias=self.fixed_linear.bias)
        return out


UNET_DEFAULT_TARGET_REPLACE = {"CrossAttention", "Attention", "GEGLU"}

UNET_CONV_TARGET_REPLACE = {"ResBlock"}

UNET_EXTENDED_TARGET_REPLACE = {"ResBlock", "CrossAttention", "Attention", "GEGLU"}

TEXT_ENCODER_DEFAULT_TARGET_REPLACE = {"CLIPAttention"}

TEXT_ENCODER_EXTENDED_TARGET_REPLACE = {"CLIPAttention"}

DEFAULT_TARGET_REPLACE = UNET_DEFAULT_TARGET_REPLACE

EMBED_FLAG = "<embed>"


def _find_children(
    model,
    search_class: List[Type[nn.Module]] = [nn.Linear],
):
    """
    Find all modules of a certain class (or union of classes).
    Returns all matching modules, along with the parent of those modules and the
    names they are referenced by.
    """
    result = []
    for parent in model.modules():
        for name, module in parent.named_children():
            if any([isinstance(module, _class) for _class in search_class]):
                result.append((parent, name, module))
    return result


def _find_modules_v2(
    model,
    ancestor_class: Optional[Set[str]] = None,
    search_class: List[Type[nn.Module]] = [nn.Linear],
    exclude_children_of: Optional[List[Type[nn.Module]]] = [
        HRAInjectedLinear,
    ],
):
    """
    Find all modules of a certain class (or union of classes) that are direct or
    indirect descendants of other modules of a certain class (or union of classes).
    Returns all matching modules, along with the parent of those modules and the
    names they are referenced by.
    """

    # Get the targets we should replace all linears under
    if ancestor_class is not None:
        ancestors = (
            module
            for module in model.modules()
            if module.__class__.__name__ in ancestor_class
        )
    else:
        # the first module is the most senior parent class;
        # use this in case you want to naively iterate over all modules.
        for module in model.modules():
            ancestor_class = module.__class__.__name__
            break
        ancestors = (
            module
            for module in model.modules()
            if module.__class__.__name__ in ancestor_class
        )

    results = []
    # For each target, find every linear_class module that isn't a child of a HRAInjectedLinear
    for ancestor in ancestors:
        for fullname, module in ancestor.named_modules():
            if any([isinstance(module, _class) for _class in search_class]):
                # Find the direct parent if this is a descendant, not a child, of target
                *path, name = fullname.split(".")
                parent = ancestor
                while path:
                    parent = parent.get_submodule(path.pop(0))
                # Skip this linear if it's a child of a HRAInjectedLinear
                if exclude_children_of and any(
                    [isinstance(parent, _class) for _class in exclude_children_of]
                ):
                    continue
                results.append((parent, name, module))

    return results


def _find_modules_old(
    model,
    ancestor_class: Set[str] = DEFAULT_TARGET_REPLACE,
    search_class: List[Type[nn.Module]] = [nn.Linear],
    exclude_children_of: Optional[List[Type[nn.Module]]] = [HRAInjectedLinear],
):
    ret = []
    for _module in model.modules():
        if _module.__class__.__name__ in ancestor_class:
            for name, _child_module in _module.named_modules():
                if _child_module.__class__ in search_class:
                    ret.append((_module, name, _child_module))
    return ret


_find_modules = _find_modules_v2
# _find_modules = _find_modules_old


def inject_trainable_hra(
    model: nn.Module,
    target_replace_module: Set[str] = DEFAULT_TARGET_REPLACE,
    verbose: bool = False,
    r: int = 8,
    apply_GS: bool = False,
):
    """
    Inject HRA into model, and return the HRA parameter groups.
    """

    require_grad_params = []
    names = []

    for _module, name, _child_module in _find_modules(
        model, target_replace_module, search_class=[nn.Linear]
    ):
        weight = _child_module.weight
        bias = _child_module.bias
        if verbose:
            print("HRA Injection : injecting hra into ", name)
            print("HRA Injection : weight shape", weight.shape)
        _tmp = HRAInjectedLinear(
            _child_module.in_features,
            _child_module.out_features,
            _child_module.bias is not None,
            r=r,
            apply_GS=apply_GS,
        )
        _tmp.fixed_linear.weight = weight
        if bias is not None:
            _tmp.fixed_linear.bias = bias

        # switch the module
        _tmp.to(_child_module.weight.device).to(_child_module.weight.dtype)
        _module._modules[name] = _tmp

        require_grad_params.append(_module._modules[name].hra_u)
        _module._modules[name].hra_u.requires_grad = True

        names.append(name)

    return require_grad_params, names
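The core of `HRAInjectedLinear.forward` multiplies the frozen weight by a chain of Householder reflections `I - 2 u u^T`. Each reflection is orthogonal, and so is their product, which is why the update `W' = W Q` preserves the row norms of `W`. A minimal standalone check of that property (toy dimensions, not the injected model):

```python
import torch

torch.manual_seed(0)
d, r = 16, 4
u = torch.randn(d, r)
u = u / u.norm(dim=0)  # normalize each Householder direction, as in forward()

# Product of r Householder reflections H_i = I - 2 u_i u_i^T
Q = torch.eye(d)
for i in range(r):
    ui = u[:, i].view(-1, 1)
    Q = Q @ (torch.eye(d) - 2 * ui @ ui.t())

# Q^T Q should be the identity up to float32 round-off
err = (Q.t() @ Q - torch.eye(d)).abs().max().item()
print(err < 1e-5)  # True
```

This also explains the duplicated initialization in `__init__`: two identical reflections compose to the identity, so training starts from `W' = W`.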
generation/control/tool_add_hra.py ADDED
@@ -0,0 +1,81 @@
| 1 |
+
"""
|
| 2 |
+
This script utilizes code from ControlNet available at:
|
| 3 |
+
https://github.com/lllyasviel/ControlNet/blob/main/tool_add_control.py
|
| 4 |
+
|
| 5 |
+
Original Author: Lvmin Zhang
|
| 6 |
+
License: Apache License 2.0
|
| 7 |
+
"""
|
| 8 |
+
|
| 9 |
+
import sys
|
| 10 |
+
import os
|
| 11 |
+
os.environ['HF_HOME'] = '/tmp'
|
| 12 |
+
|
| 13 |
+
# assert len(sys.argv) == 3, 'Args are wrong.'
|
| 14 |
+
|
| 15 |
+
# input_path = sys.argv[1]
|
| 16 |
+
# output_path = sys.argv[2]
|
| 17 |
+
|
| 18 |
+
import torch
|
| 19 |
+
from oldm.hack import disable_verbosity
|
| 20 |
+
disable_verbosity()
|
| 21 |
+
from oldm.model import create_model
|
| 22 |
+
from hra import inject_trainable_hra
|
```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--input_path', type=str, default='./models/v1-5-pruned.ckpt')
parser.add_argument('--output_path', type=str, default='./models/hra_half_init_l_8.ckpt')
parser.add_argument('--r', type=int, default=8)
parser.add_argument('--apply_GS', action='store_true', default=False)
args = parser.parse_args()

# args.output_path = f'./models/hra_none_l_8.ckpt'

assert os.path.exists(args.input_path), 'Input model does not exist.'
# assert not os.path.exists(output_path), 'Output filename already exists.'
assert os.path.exists(os.path.dirname(args.output_path)), 'Output path is not valid.'

def get_node_name(name, parent_name):
    if len(name) <= len(parent_name):
        return False, ''
    p = name[:len(parent_name)]
    if p != parent_name:
        return False, ''
    return True, name[len(parent_name):]


model = create_model(config_path='./configs/oft_ldm_v15.yaml')
model.model.requires_grad_(False)

unet_lora_params, train_names = inject_trainable_hra(model.model, r=args.r, apply_GS=args.apply_GS)

pretrained_weights = torch.load(args.input_path)
if 'state_dict' in pretrained_weights:
    pretrained_weights = pretrained_weights['state_dict']

scratch_dict = model.state_dict()

target_dict = {}
names = []
for k in scratch_dict.keys():
    names.append(k)

    if k in pretrained_weights:
        target_dict[k] = pretrained_weights[k].clone()
    else:
        if 'fixed_linear.' in k:
            copy_k = k.replace('fixed_linear.', '')
            target_dict[k] = pretrained_weights[copy_k].clone()
        else:
            target_dict[k] = scratch_dict[k].clone()
            print(f'These weights are newly added: {k}')

with open('HRA_model_names.txt', 'w') as file:
    for element in names:
        file.write(element + '\n')

model.load_state_dict(target_dict, strict=True)
torch.save(model.state_dict(), args.output_path)
# print('Model was not saved')
print('Done.')
```
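The key-remapping loop above can be illustrated with plain dictionaries: pretrained keys are copied through, keys belonging to an injected `fixed_linear.` wrapper reuse the original weight under the stripped name, and everything else keeps its fresh initialization. This is a minimal sketch of that logic only; the key names below are made up for illustration and tensors are replaced by strings.

```python
# Minimal sketch of the checkpoint remapping loop, with plain dicts in
# place of tensor state dicts. Key names here are illustrative only.

def remap_state_dict(scratch_dict, pretrained_weights):
    """Copy pretrained weights where keys match exactly; for injected
    'fixed_linear.' wrappers, reuse the pretrained weight under the
    stripped key; otherwise keep the freshly initialized value."""
    target = {}
    for k, v in scratch_dict.items():
        if k in pretrained_weights:
            target[k] = pretrained_weights[k]
        elif 'fixed_linear.' in k:
            target[k] = pretrained_weights[k.replace('fixed_linear.', '')]
        else:
            target[k] = v  # newly added parameter, e.g. an HRA block
    return target

pretrained = {'unet.proj.weight': 'W_pre'}
scratch = {
    'unet.proj.fixed_linear.weight': 'W_init',  # injected wrapper
    'unet.proj.hra_u': 'U_init',                # new trainable block
}
print(remap_state_dict(scratch, pretrained))
# → {'unet.proj.fixed_linear.weight': 'W_pre', 'unet.proj.hra_u': 'U_init'}
```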
generation/control/train.py
ADDED
@@ -0,0 +1,149 @@

```python
from oldm.hack import disable_verbosity
disable_verbosity()

import os
import sys
import torch
from datetime import datetime

file_path = os.path.abspath(__file__)
parent_dir = os.path.abspath(os.path.dirname(file_path) + '/..')
if parent_dir not in sys.path:
    sys.path.append(parent_dir)

import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from torch.utils.data import DataLoader
from oldm.logger import ImageLogger
from oldm.model import create_model, load_state_dict
from dataset.utils import return_dataset

from oft import inject_trainable_oft, inject_trainable_oft_conv, inject_trainable_oft_extended, inject_trainable_oft_with_norm
from hra import inject_trainable_hra
from lora import inject_trainable_lora

import argparse


parser = argparse.ArgumentParser()

# HRA
parser.add_argument('--hra_r', type=int, default=8)
parser.add_argument('--hra_apply_GS', action='store_true', default=False)
parser.add_argument('--hra_lamb', type=float, default=0.0)

# OFT
parser.add_argument('--oft_r', type=int, default=4)
parser.add_argument('--oft_eps', type=float, default=7e-6)
parser.add_argument('--oft_coft', action="store_true", default=True)
parser.add_argument('--oft_block_share', action="store_true", default=False)

# LoRA (rank argument added here; the original referenced an undefined args.r)
parser.add_argument('--lora_r', type=int, default=8)

parser.add_argument('--batch_size', type=int, default=8)
parser.add_argument('--num_samples', type=int, default=1)
parser.add_argument('--plot_frequency', type=int, default=100)
parser.add_argument('--learning_rate', type=float, default=9e-4)
parser.add_argument('--sd_locked', action="store_true", default=True)
parser.add_argument('--only_mid_control', action="store_true", default=False)
parser.add_argument('--num_gpus', type=int, default=torch.cuda.device_count())
parser.add_argument('--resume_path',
                    type=str,
                    default='./models/hra_half_init_l_8.ckpt',
                    )
parser.add_argument('--time_str', type=str, default=datetime.now().strftime("%Y-%m-%d-%H-%M-%S-%f"))
parser.add_argument('--num_workers', type=int, default=8)
parser.add_argument('--control',
                    type=str,
                    help='control signal. Options are [segm, sketch, densepose, depth, canny, landmark]',
                    default="segm")

args = parser.parse_args()


if __name__ == "__main__":
    # specify the control signal and dataset
    control = args.control

    # create dataset
    train_dataset, val_dataset, data_name, logger_freq, max_epochs = return_dataset(control)  # , n_samples=n_samples)

    # Configs
    resume_path = args.resume_path

    batch_size = args.batch_size
    num_samples = args.num_samples
    plot_frequency = args.plot_frequency
    learning_rate = args.learning_rate
    sd_locked = args.sd_locked
    only_mid_control = args.only_mid_control
    num_gpus = args.num_gpus
    time_str = args.time_str
    num_workers = args.num_workers

    for arg in vars(args):
        print(f'{arg}: {getattr(args, arg)}')
    print(f'data_name: {data_name}\nlogger_freq: {logger_freq}\nmax_epochs: {max_epochs}')

    if 'oft' in args.resume_path:
        experiment = 'oft_{}_{}_eps_{}_pe_diff_mlp_r_{}_{}gpu_{}'.format(data_name, control, args.oft_eps, args.oft_r, num_gpus, time_str)
    elif 'hra' in args.resume_path:
        if args.hra_apply_GS:
            experiment = 'hra_apply_GS_{}_{}_pe_diff_mlp_r_{}_{}gpu_{}'.format(data_name, control, args.hra_r, num_gpus, time_str)
        else:
            experiment = 'hra_{}_{}_pe_diff_mlp_r_{}_lambda_{}_lr_{}_{}gpu_{}'.format(data_name, control, args.hra_r, args.hra_lamb, args.learning_rate, num_gpus, time_str)
    elif 'lora' in args.resume_path:
        experiment = 'lora_{}_{}_pe_diff_mlp_r_{}_{}gpu_{}'.format(data_name, control, args.lora_r, num_gpus, time_str)

    # First load the model on CPU. PyTorch Lightning will automatically move it to the GPUs.
    model = create_model('./configs/oft_ldm_v15.yaml').cpu()
    model.model.requires_grad_(False)
    print(f'Total parameters not requiring grad: {sum([p.numel() for p in model.model.parameters() if p.requires_grad == False])}')

    # inject trainable adapter parameters (OFT, HRA, or LoRA, selected by resume_path)
    if 'oft' in args.resume_path:
        unet_lora_params, train_names = inject_trainable_oft(model.model, r=args.oft_r, eps=args.oft_eps, is_coft=args.oft_coft, block_share=args.oft_block_share)
    elif 'hra' in args.resume_path:
        unet_lora_params, train_names = inject_trainable_hra(model.model, r=args.hra_r, apply_GS=args.hra_apply_GS)
    elif 'lora' in args.resume_path:
        unet_lora_params, train_names = inject_trainable_lora(model.model, rank=args.lora_r, network_alpha=None)

    print(f'Total parameters requiring grad: {sum([p.numel() for p in model.model.parameters() if p.requires_grad == True])}')

    model.load_state_dict(load_state_dict(resume_path, location='cpu'))
    model.learning_rate = learning_rate
    model.sd_locked = sd_locked
    model.only_mid_control = only_mid_control

    checkpoint_callback = ModelCheckpoint(
        dirpath='log/image_log_' + experiment,
        filename='model-{epoch:02d}',
        save_top_k=-1,
        save_last=True,
        every_n_epochs=1,
        monitor=None,  # no specific metric to monitor for saving
    )

    # Misc
    train_dataloader = DataLoader(train_dataset, num_workers=num_workers, batch_size=batch_size, shuffle=False)
    val_dataloader = DataLoader(val_dataset, num_workers=num_workers, batch_size=1, shuffle=False)

    logger = ImageLogger(
        val_dataloader=val_dataloader,
        batch_frequency=logger_freq,
        experiment=experiment,
        plot_frequency=plot_frequency,
        num_samples=num_samples,
    )

    trainer = pl.Trainer(
        max_epochs=max_epochs,
        gpus=num_gpus,
        precision=32,
        callbacks=[logger, checkpoint_callback],
    )

    # Train!
    last_model_path = 'log/image_log_' + experiment + '/last.ckpt'
    if os.path.exists(last_model_path):
        trainer.fit(model, train_dataloader, ckpt_path=last_model_path)
    else:
        trainer.fit(model, train_dataloader)
```
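The resume branch at the end of train.py reduces to a simple rule: restart from `last.ckpt` when a previous run left one behind in the experiment's log directory, otherwise start fresh. A standalone sketch of just that decision (with the kwargs that would be passed to `trainer.fit`; the helper name is invented for illustration):

```python
import os
import tempfile

def pick_fit_kwargs(log_dir):
    """Return the extra kwargs the training script would pass to
    trainer.fit(): resume from last.ckpt if a previous run saved one,
    otherwise start a fresh run with no checkpoint path."""
    last_model_path = os.path.join(log_dir, 'last.ckpt')
    if os.path.exists(last_model_path):
        return {'ckpt_path': last_model_path}
    return {}

with tempfile.TemporaryDirectory() as d:
    print(pick_fit_kwargs(d))   # fresh run: {}
    # simulate a checkpoint left by an interrupted run
    open(os.path.join(d, 'last.ckpt'), 'w').close()
    print(pick_fit_kwargs(d))   # resume: {'ckpt_path': '<log_dir>/last.ckpt'}
```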
generation/env.yml
ADDED
@@ -0,0 +1,172 @@

```yaml
name: generation
dependencies:
  - _libgcc_mutex=0.1=main
  - _openmp_mutex=5.1=1_gnu
  - ca-certificates=2023.12.12=h06a4308_0
  - ld_impl_linux-64=2.38=h1181459_1
  - libffi=3.4.4=h6a678d5_0
  - libgcc-ng=11.2.0=h1234567_1
  - libgomp=11.2.0=h1234567_1
  - libstdcxx-ng=11.2.0=h1234567_1
  - ncurses=6.4=h6a678d5_0
  - openssl=3.0.13=h7f8727e_0
  - pip=23.3.1=py39h06a4308_0
  - python=3.9.18=h955ad1f_0
  - readline=8.2=h5eee18b_0
  - setuptools=68.2.2=py39h06a4308_0
  - sqlite=3.41.2=h5eee18b_0
  - tk=8.6.12=h1ccaba5_0
  - wheel=0.41.2=py39h06a4308_0
  - xz=5.4.6=h5eee18b_0
  - zlib=1.2.13=h5eee18b_0
  - pip:
    - absl-py==2.1.0
    - accelerate==0.21.0
    - addict==2.4.0
    - aiofiles==23.2.1
    - aiohttp==3.9.3
    - aiosignal==1.3.1
    - altair==5.2.0
    - annotated-types==0.6.0
    - antlr4-python3-runtime==4.8
    - anyio==4.3.0
    - appdirs==1.4.4
    - async-timeout==4.0.3
    - attrs==23.2.0
    - basicsr==1.4.2
    - certifi==2024.2.2
    - charset-normalizer==3.3.2
    - click==8.1.7
    - conda-pack==0.7.1
    - contourpy==1.2.0
    - cycler==0.12.1
    - datasets==2.14.2
    - diffusers==0.17.1
    - dill==0.3.7
    - docker-pycreds==0.4.0
    - einops==0.7.0
    - exceptiongroup==1.2.0
    - face-alignment==1.3.4
    - fastapi==0.110.0
    - ffmpy==0.3.2
    - filelock==3.13.1
    - fire==0.6.0
    - fonttools==4.49.0
    - frozenlist==1.4.1
    - fsspec==2024.2.0
    - ftfy==6.1.3
    - future==1.0.0
    - gitdb==4.0.11
    - gitpython==3.1.42
    - gradio==3.16.2
    - grpcio==1.62.1
    - h11==0.14.0
    - httpcore==1.0.4
    - httpx==0.27.0
    - huggingface-hub==0.21.4
    - idna==3.6
    - imageio==2.34.0
    - importlib-metadata==7.0.2
    - importlib-resources==6.3.0
    - jinja2==3.1.3
    - jsonschema==4.21.1
    - jsonschema-specifications==2023.12.1
    - kiwisolver==1.4.5
    - lazy-loader==0.3
    - lightning-utilities==0.10.1
    - linkify-it-py==2.0.3
    - llvmlite==0.42.0
    - lmdb==1.4.1
    - lpips==0.1.4
    - markdown==3.5.2
    - markdown-it-py==3.0.0
    - markupsafe==2.1.5
    - matplotlib==3.8.3
    - mdit-py-plugins==0.4.0
    - mdurl==0.1.2
    - mpmath==1.3.0
    - multidict==6.0.5
    - multiprocess==0.70.15
    - networkx==3.2.1
    - numba==0.59.0
    - numpy==1.26.4
    - nvidia-cublas-cu12==12.1.3.1
    - nvidia-cuda-cupti-cu12==12.1.105
    - nvidia-cuda-nvrtc-cu12==12.1.105
    - nvidia-cuda-runtime-cu12==12.1.105
    - nvidia-cudnn-cu12==8.9.2.26
    - nvidia-cufft-cu12==11.0.2.54
    - nvidia-curand-cu12==10.3.2.106
    - nvidia-cusolver-cu12==11.4.5.107
    - nvidia-cusparse-cu12==12.1.0.106
    - nvidia-nccl-cu12==2.19.3
    - nvidia-nvjitlink-cu12==12.4.99
    - nvidia-nvtx-cu12==12.1.105
    - omegaconf==2.1.1
    - open-clip-torch==2.0.2
    - opencv-python==4.9.0.80
    - orjson==3.9.15
    - packaging==24.0
    - pandas==2.2.1
    - pillow==10.2.0
    - platformdirs==4.2.0
    - protobuf==4.25.3
    - psutil==5.9.8
    - pyarrow==15.0.1
    - pycocotools==2.0.7
    - pycryptodome==3.20.0
    - pydantic==2.6.4
    - pydantic-core==2.16.3
    - pydeprecate==0.3.1
    - pydub==0.25.1
    - pyparsing==3.1.2
    - python-dateutil==2.9.0.post0
    - python-multipart==0.0.9
    - pytorch-fid==0.3.0
    - pytorch-lightning==1.5.0
    - pytz==2024.1
    - pyyaml==6.0.1
    - referencing==0.33.0
    - regex==2023.12.25
    - requests==2.31.0
    - rpds-py==0.18.0
    - safetensors==0.4.2
    - scikit-image==0.22.0
    - scipy==1.12.0
    - sentry-sdk==1.42.0
    - setproctitle==1.3.3
    - six==1.16.0
    - smmap==5.0.1
    - sniffio==1.3.1
    - starlette==0.36.3
    - sympy==1.12
    - tb-nightly==2.17.0a20240313
    - tensorboard==2.16.2
    - tensorboard-data-server==0.7.2
    - termcolor==2.4.0
    - tifffile==2024.2.12
    - timm==0.9.16
    - tokenizers==0.13.3
    - tomli==2.0.1
    - toolz==0.12.1
    - torch==2.2.1
    - torch-fidelity==0.3.0
    - torchaudio==2.2.1
    - torchmetrics==1.3.1
    - torchvision==0.17.1
    - tqdm==4.66.2
    - transformers==4.25.1
    - triton==2.2.0
    - typing-extensions==4.10.0
    - tzdata==2024.1
    - uc-micro-py==1.0.3
    - urllib3==2.2.1
    - uvicorn==0.28.0
    - wandb==0.16.4
    - wcwidth==0.2.13
    - websockets==12.0
    - werkzeug==3.0.1
    - xxhash==3.4.1
    - yapf==0.40.2
    - yarl==1.9.4
    - zipp==3.18.0
```