---
license: mit
language:
- en
metrics:
- mse
base_model:
- lllyasviel/sd-controlnet-seg
pipeline_tag: image-to-image
tags:
- controlnet
- stable-diffusion
- conditional-generation
- segmentation
model-index:
- name: Facades-ControlNet-SD15
  results:
  - task:
      type: image-to-image
      name: Conditional Image Generation
    dataset:
      name: CMP Facades Dataset
      type: facades
      url: https://www.kaggle.com/datasets/balraj98/facades-dataset
    metrics:
    - name: Mean Squared Error
      type: mse
      value: 0.0178
    source:
      name: Custom Evaluation
      url: https://www.kaggle.com/datasets/balraj98/facades-dataset
---

# Model Card for Facades ControlNet with Stable Diffusion v1.5

![Facade Model Output](facade_model.png)

This model is a fine-tuned version of ControlNet built on top of **Stable Diffusion v1.5**, specifically conditioned on **semantic segmentation maps** from the **Facades dataset**. It enables structure-aware image generation by combining natural language prompts with pixel-level guidance in the form of building façade segmentation masks. The result is highly controllable generation of realistic architectural scenes that reflect both structural layout and textual context.

## Model Description

- **Base Model**: [stable-diffusion-v1-5/stable-diffusion-v1-5](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5)
- **Control Type**: Semantic segmentation maps (Facades-style RGB masks)
- **Architecture**: U-Net + ControlNet adapter + Variational Autoencoder (VAE) + CLIP Text Encoder (ViT-L/14)
- **Training Epochs**: 30 full passes over the training data
- **Training Dataset**: [Facades dataset](https://www.kaggle.com/datasets/balraj98/facades-dataset)
- **Resolution**: Trained at 512×512
- **Hardware**: NVIDIA A100 40GB GPU; total training time was approximately 1 hour
- **Loss Function**: Mean Squared Error (MSE) between predicted and true noise vectors (used in DDPM training)

Only the ControlNet branch was trained; the base Stable Diffusion weights were kept frozen. This retains the generative capabilities of the original model while specializing it to generate façade-aligned structures.
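
The loss named in the list above can be illustrated with a toy sketch. This is not the training script used for this model, which the card does not include. The function names `add_noise` and `noise_mse` and the symbol `alpha_bar_t` (the cumulative noise-schedule product in DDPM notation) are assumptions chosen for illustration, and plain Python lists stand in for the latent tensors a real training loop would use.

```python
import math


def add_noise(x0, noise, alpha_bar_t):
    """Forward diffusion step: x_t = sqrt(a) * x0 + sqrt(1 - a) * noise."""
    signal_scale = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    return [signal_scale * x + noise_scale * n for x, n in zip(x0, noise)]


def noise_mse(predicted_noise, true_noise):
    """Mean squared error between predicted and true noise."""
    squared_errors = [(p - t) ** 2 for p, t in zip(predicted_noise, true_noise)]
    return sum(squared_errors) / len(squared_errors)


# Toy values standing in for a flattened latent and its sampled noise.
x0 = [1.0, 0.0, -1.0, 0.5]
noise = [0.0, 1.0, 0.0, -1.0]

# Noisy input the denoising network would receive at this timestep.
x_t = add_noise(x0, noise, alpha_bar_t=0.25)

perfect_loss = noise_mse(noise, noise)            # exact prediction
zero_guess_loss = noise_mse([0.0] * 4, noise)     # always predicting zero noise
print(perfect_loss, zero_guess_loss)  # 0.0 0.5
```

In a setup like the one described here, the network's prediction would be conditioned on `x_t`, the timestep, the text prompt, and the segmentation map, and only the ControlNet parameters would receive gradient updates from this loss.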

## Usage

This model is available via the `diffusers` library. Here's how to load and use it:

```python
import torch
from diffusers import StableDiffusionControlNetPipeline
from PIL import Image

# Load the pipeline and move it to the GPU
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "doguilmak/facade-controlnet-sd15",
    torch_dtype=torch.float32,
    safety_checker=None,
)
pipe.to("cuda")

# Load your segmentation map (RGB format expected)
control = Image.open("facades_segmentation_map.png").convert("RGB")

# Run generation; in this pipeline `image` is the ControlNet conditioning input
result = pipe(
    prompt="a modern building with large glass windows",
    negative_prompt="blurry, distorted",
    image=control,
    num_inference_steps=50,
    guidance_scale=9,
    output_type="pil",
).images[0]

result.save("facade_result.png")
```

## Example Outputs

These examples illustrate the model’s ability to generate photorealistic urban scenes guided by semantic segmentation maps. The outputs demonstrate strong spatial alignment between the input masks and the synthesized content.

![Facade Model Output](facade_model_output.png)

## Limitations

- The model was trained at **512×512** resolution; feeding higher-resolution inputs without resizing may cause artifacts.
- It performs best on scenes resembling architectural façades.
- The control image should resemble **Facades-style segmentation formats** for optimal results.
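
Given the resolution limitation above, one option is to resize the segmentation map to the training resolution before passing it to the pipeline. The helper below is a minimal sketch, not part of this model's code: `prepare_control_image` and `TRAIN_SIZE` are hypothetical names, and nearest-neighbor resampling is an assumption made so that label colors are not blended at region boundaries.

```python
from PIL import Image

TRAIN_SIZE = (512, 512)  # training resolution stated in this card

# Pillow >= 9.1 exposes filters under Image.Resampling; older versions on Image.
NEAREST = getattr(Image, "Resampling", Image).NEAREST


def prepare_control_image(mask, size=TRAIN_SIZE):
    """Return an RGB copy of `mask` resized to `size` without blending label colors."""
    return mask.convert("RGB").resize(size, resample=NEAREST)


# Stand-in for a larger segmentation map; the fill color is arbitrary.
large_mask = Image.new("RGB", (1024, 768), (0, 0, 170))
control = prepare_control_image(large_mask)
print(control.size)  # (512, 512)
```

Note that this stretches non-square inputs to a square; if the aspect ratio matters, crop the map to a square region first.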

## License

The Stable Diffusion base model is distributed under the [CreativeML Open RAIL-M license](https://huggingface.co/spaces/CompVis/stable-diffusion-license).

Our model is distributed under the [MIT license](https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/mit.md).

## References

- **ControlNet Segmentation Model**: [lllyasviel/sd-controlnet-seg @ Hugging Face](https://huggingface.co/lllyasviel/sd-controlnet-seg)
- **ControlNet Paper**: L. Zhang, A. Rao, and M. Agrawala, “Adding Conditional Control to Text-to-Image Diffusion Models,” _arXiv preprint_ arXiv:2302.05543, 2023.
- **Facades Dataset**: [Kaggle: Facades Dataset](https://www.kaggle.com/datasets/balraj98/facades-dataset)